The Language of the Web

Summary

This document explores the unique characteristics of language on the World Wide Web, examining both linear and non-linear text formats, as well as the various forms of graphic organization used in webpages.

Full Transcript

The language of the Web

‘The vision I have for the Web is about anything being potentially connected with anything.’ This observation by the Web’s inventor, Tim Berners-Lee, on the first page of his biographical account, Weaving the Web (1999), provides a characterization of this element of the Internet which truly strains the notion of ‘situation’ and the accompanying concept of a ‘variety’ of Internet language. After all, language, and any language, in its entirety, is part of this ‘anything’. The Web in effect holds a mirror up to the graphic dimension of our linguistic nature. A significant amount of human visual linguistic life is already there, as well as a proportion of our vocal life. So can it be given a coherent linguistic identity? ‘Graphic’ here refers to all aspects of written (as opposed to spoken) language, including typewritten, handwritten (including calligraphic), and printed text. It includes much more than the direct visual impression of a piece of text, as presented in a particular typography and graphic design on the screen; it also includes all those features which enter into a language’s orthographic system (chiefly its spelling, punctuation, and use of capital letters) as well as the distinctive features of grammar and vocabulary which identify a typically ‘written’ as opposed to a ‘spoken’ medium of communication. Most Web text will inevitably be printed, given the technology generally in use. Typewritten text (in the sense of text produced by a typewriter) is hardly relevant, belonging as it does to a pre-electronic age, though of course it can be simulated, and many of the features of typing style have had an influence on the word-processing age. Handwritten text has only a limited presence, being available only through the use of specially designed packages, and is of little practical value to most Internet users. But printing exists in a proliferation of forms – currently more limited than traditional paper printing in its use of typefaces, but immensely more varied in its communicative options through the availability of such dimensions as colour, movement, and animation. And it is here that even a tiny exposure to the Web demonstrates its remarkable linguistic range. Anything that has been written can, in principle, appear on the Web; and a significant proportion of it has already done so, in the form of digital libraries, electronic text archives, and data services. So, a few minutes’ Web browsing will bring to light every conceivable facet of our graphic linguistic existence. There will be large quantities of interrupted linear text – that is, text which follows the unidimensional flow of speech, but interrupted by conventions which aid intelligibility – chiefly the use of spaces between words and the division of a text into lines and screens. This is the normal way of using written language, and it dominates the Web as it does any other graphic medium. But there will also be large quantities of non-linear text – that is, text which can be read in a multidimensional way. In non-linear viewing, the lines of a text are not read in a fixed sequence; the eye moves about the page in a manner dictated only by the user’s interest and the designer’s skill, with some parts of the page being the focus of attention and other parts not being read at all. A typical example is a page advertising a wide range of products at different prices. 
On the Web, many pages have areas allocated to particular kinds of information and designed (through the use of colour, flashing, movement, and other devices) to attract the attention and disturb any process of predictable reading through the screen in a conventional way. On a typical sales page, a dozen locations compete for our attention (search, help, shopping basket, home page, etc.). The whole concept of hypertext linking (see below) is perhaps the most fundamental challenge to linear viewing. But there are yet other kinds of graphic organization. The Web displays many kinds of lists, for example – sequences of pieces of information, ordered according to some principle, which have a clear starting point and a finishing point – such as items in a catalogue, restaurant menus, filmographies, and discographies. As the whole basis of the linguistic organization of a search-engine response to an inquiry is to provide a series of hits in the form of a list, it would seem that list organization is intrinsic to the structure of the Web. Matrices are also very much in evidence – arrangements of linguistic, numerical, or other information in rows and columns, designed to be scanned vertically and horizontally. They will be found in all kinds of technical publications as well as in more everyday contexts such as sites dealing with sports records or personal sporting achievements. And there are branching structures, such as are well-known in family tree diagrams, widely used whenever two or more alternatives need to be clearly identified or when the history of a set of related alternatives needs to be displayed. In an electronic context, of course, the whole of the branching structure may not be visible on a single screen, the different paths through a tree emerging only when users click on relevant ‘hot’ spots on the screen. The Web is graphically more eclectic than any domain of written language in the real world. And the same eclecticism can be seen if we look at the purely linguistic dimensions of written expression – the use of spelling, grammar, vocabulary, and other properties of the discourse (the ways that information is organized globally within texts, so that it displays coherence, logical progression, relevance, and so on). Whatever the variety of written language we have encountered in the paper-based world, its linguistic features have their electronic equivalent on the Web. Among the main varieties of written expression are legal, religious, journalistic, literary, and scientific texts. These are all widely present in their many sub-varieties, or genres. Under the heading of religion, for instance, we can find a wide range of liturgical forms, rituals, prayers, sacred texts, preaching, doctrinal statements, and private affirmations of belief. Each of these genres has its distinctive linguistic character, and all of this stylistic variation will be found on the Web. If we visit a Web site, such as the British Library or the Library of Congress, and call up their catalogues, what we find is exactly the same kind of language as we would if we were to visit these locations in London or Washington, even down to the use of different conventions of spelling and punctuation. The range of the Web extends from the huge database to the individual self-published ‘home page’, and presents contributions from every kind of designer and stylist, from the most professional to the linguistically and graphically least gifted. It thus defies stylistic generalization. 
All of this is obvious, and yet in its very obviousness there is an important point to be made: in its linguistic character, seen through its linked pages, the Web is an analogue of the written language that is already ‘out there’ in the paper-based world. For the most part, what we see on Web pages is a familiar linguistic world. If we are looking for Internet distinctiveness, novelty, and idiosyncrasy – or wishing to find fuel for a theory of impending linguistic doom – we are not likely to find it here. But distinctiveness there is. If the Web holds a mirror up to our linguistic nature, it is a mirror that both distorts and enhances, providing new constraints and opportunities. It constrains, first of all, in that we see language displayed within the physical limitations of a monitor screen, and subjected to a user-controlled movement (scrolling) – chiefly vertical, sometimes horizontal – that has no real precedent (though the rolled documents of ancient and mediaeval times must have presented similar difficulties). Scrolling down is bound to interfere with our ability to perceive, attend to, assimilate, and remember text. Scrolling sideways is even worse: a browser that does not offer a word-wrap facility may present line lengths of 150 characters or more, with reading continuity very difficult to maintain between successive lines. Similarly, it is common to experience difficulty when we encounter screens filled with unbroken text in a single typeface, or screens where the information is typographically complex or fragmented, forbidding easy assimilation of the content. And any author who has tried to put text from a previously published book on the Web knows that it does not translate onto the screen without fresh thought being given to layout and design. Research is needed to establish what the chief factors are, as we transfer our psycholinguistic ability from a paper to an electronic medium. For not everything is easily transferable, and alternative means need to be devised to convey the contrasts that were expressed through the traditional medium of print. For example, the range of typefaces we are likely to find on the Web is only a tiny proportion of the tens of thousands available in the real world. Although there is no limit in principle, and many typographically innovative sites exist, the general practice is at times boringly uniform, with unknown numbers of Web newcomers believing that electronic life is visible only through Times New Roman spectacles. As Roger Pring puts it, arguing for keeping typographic options open: Can you imagine a world with only one typeface to serve as the vehicle for all communication? How content would you be to see the same face on your supermarket loyalty card as on a wedding invitation?... The way computers work makes it easy to use the same group of faces over and over. Many users do take the easy option, with the result that innumerable sites present their wares to the reader with the same bland, monochrome look. The size of the screen has also exercised a major influence on the kind of language used, regardless of the subject-matter. The point is made explicit in manuals which deal with the style of computer-mediated communication. As we have already seen in chapter 3, the Wired handbook, for example, has this to say about Web style: Look to the Web not for embroidered prose, but for the sudden narrative, the dramatic story told in 150 words. Text must be complemented by clever interface design and clear graphics. 
Think brilliant ad copy, not long-form literature. Think pert, breezy pieces almost too ephemeral for print. Think turned-up volume – cut lines that are looser, grabbier, more tabloidy. Think distinctive voice or attitude. This, as an empirical statement about Web pages, represents only a limited amount of what is actually ‘out there’; but as a prescription for good practice it is widely followed. With many screens immediately displaying up to 30 functional areas, any initial on-screen textual description of each area is inevitably going to be short – generally a 3–4 word heading or a brief description of 10–20 words. Main pages reflect this trend. For example, a sample of 100 news reports taken from Web-designed BBC, CBN, and ABC sites showed that paragraphs were extremely short, averaging 25 words, and usually consisting of a single sentence; only in one case did a paragraph reach 50 words. Even when specially designed sites had nothing to do with news (such as introductions to educational courses or chambers of commerce), the way their material was displayed took on some of the characteristics of a news-type presentation. On the other hand, sites which simply reproduce material originally written for a paper outlet (such as government reports, academic papers, electronic versions of newspaper articles) move well away from any notion of succinctness. By all accounts, they are more difficult to read, but daily experience suggests that they nonetheless constitute a large proportion of pages on the Web. Certain defining properties of traditional written language are also fundamentally altered by the Web. In particular, its staticness is no longer obligatory, in that the software controlling a page may make the text move about the screen, appear and disappear, change colour and shape, or morph into animated characters. As the user moves the mouse-controlled arrow around a screen, the switch from arrow to hand will be accompanied by the arrival of new text. A mouse-click will produce yet more new text. Some sites bring text on-screen as the user watches – for example, BBC News Online had (October 2000) a top-of-the-screen headline appear in the manner of a teleprinter, letter-by-letter. It is all a dynamic graphology, in which the range of visual contrastivity available for linguistic purposes is much increased, compared with traditional print. One of the immediate consequences of this is that new conventions have emerged as signals for certain types of functionality – for example, the use of colours and underlining to identify hypertext links (see below) and e-mail addresses, or to establish the distinct identity of different areas of the screen (main body, links, help, advertising banner, etc.). Web pages need to achieve coherence while making immediate impact; they need structure as well as detail; interactive areas need to be clear and practicable; words, pictures, and icons need to be harmonized. These are substantial communicative demands, and the increased use of colour is the main means of enabling them to be met. As Roger Pring puts it, in a discussion of Web legibility: Control of the colour of text and background is the single most important issue, followed by an attempt to direct the browser’s choice of size and style of typeface. Whatever else the Web is, it is noticeably a colourful medium, and in this respect alone it is distinct from other Netspeak situations. 
Hypertext and interactivity

Probably the most important use of colour in a well-designed Website is to identify the hypertext links – the jumps that users can make if they want to move from one page or site to another. The hypertext link is the most fundamental structural property of the Web, without which the medium would not exist. It has parallels in some of the conventions of traditional written text – especially in the use of the footnote number or the bibliographical citation, which enables a reader to move from one place in a text to another – but nothing in traditional written language remotely resembles the dynamic flexibility of the Web. At the same time, it has been pointed out that the Web, as it currently exists, is a long way from exploiting the full intertextuality which the term hypertext implies. As Michele Jackson points out, true hypertext ‘entails the complete and automatic interlocking of text, so that all documents are coexistent, with none existing in a prior or primary relation to any other’. This is certainly not the case in today’s Web, where there is no central databank of all documents, and where a link between one site and another is often not reciprocated. There is no reason why it should be: the sites are under different ownership, autonomous, and displaying structures that are totally independent of each other. One site’s designer may incorporate links to other sites, but there is no way in which the owners of those sites know that a link has been made to them (though the obligation to seek permission seems to be growing) and no obligation on them to return the compliment. Nor does the existence of a link mean that it is achievable – as everyone knows who has encountered the mortuarial black type informing them that a connection could not be made. Some servers refuse access; some sites refuse access. Owners may remove pages from their site, or close a site down, without telling anyone else – what is sometimes called ‘link rot’. They may change its location or its name. Whatever the cause, the result is a ‘dead link’ – a navigation link to nowhere. As Tim Berners-Lee points out, a link does not imply any endorsement: ‘Free speech in hypertext implies the “right to link”, which is the very basic building unit for the whole Web.’ The link is simply a mechanism to enable hypertext to come into being. And, as with all tools, it has to be used wisely if it is to be used well – which in the first instance means in moderation. As William of Occam might have said, ‘Links should not be multiplied beyond necessity.’ Because virtually any piece of text can be a link, the risk is to overuse the device – both internally (within a page, or between pages at the same site) and externally (between sites). But just as one can over-footnote a traditional text, so one can over-link a Web page. There is no algorithm for guiding Web authors or designers as to the relevance or informativeness of a link. The designer is in the unhappy position of those unsung heroes, the book indexers, who try to anticipate all the possible information-retrieval questions future readers of a book will make. However, page designers are much worse off, as the ‘book’ of which their particular document is a tiny part is the whole Web. One does one’s best. From the Web user’s point of view, the links are provided by the system. When someone else’s e-mail arrives on our screen, we can, if we wish, edit it – add to it, subtract from it, or change it in some way. 
This is not possible with the copy of the page which arrives on our computer from our server. We, as readers, cannot alter a Web site: only the site owner can do that. The owner has total control over what we may see and what may be accessed, and also what links we may follow. As Web users, only three courses of action are totally under our control: the initial choice of a particular site address; scrolling through a document once we have accessed it; and cutting and pasting from it. Although we may choose to follow a hypertext link that a designer has provided for us, the decision over what those links should be is not ours. As Jackson says: the presence of a link reflects a communicative choice made by the designer. A link, therefore, is strategic. The possible variations for structure are shaped by communicative ends, rather than technological means. We, as users, cannot add our own links. The best we can do is send a message to the owner suggesting an extra link. It is then entirely up to the owner whether to accept the suggestion. But for any of this to happen, interactivity needs to be built into the system. This is the only way in which the Berners-Lee dream can be fully realized: ‘The Web is more a social creation than a technical one... to help people work together’. Genuine working together presupposes a mutuality of communicative access, between site designers and site users. At present, in many cases, the situation is asymmetrical: we, as Web users, can reach their knowledge, but they cannot reach ours (or, at least, our questions and reactions). The authors of Wired Style issue page-designers with a blunt warning: ‘On the Web, you forget your audience at your peril.’ Fortunately, the warning seems to be being heeded. A distinctive feature of an increasing number of Web pages is their interactive character, as shown by the Contact Us, E-mail Us, Join Our List, Help Questions, FAQs, Chat, and other screen boxes. The Web is no longer only a purveyor of information. It has become a communicative tool, which will vastly grow as it becomes a part of interactive television. Doubtless, the trend is being much reinforced by the e-commerce driver, with its ‘subscribe now’, ‘book here’, ‘e-auction’, ‘stop me and buy one’ character. Web owners have come to realize that, as soon as someone enters a site, there is a greater chance of them staying there if the site incorporates an e-mail option, or offers a discussion forum.

Evolution and management

Because the linguistic character of the Web is in the hands of its site owners, the interesting question arises of what is going to happen as its constituency develops. Anyone may now publish pages on the Web, and professional designers have been scathing about the untutored typographical hotchpotches which have been the result, and have issued warnings about the need for care. Roger Pring, for example: Web screens may blossom with movies and be garnished with sound tracks but, for the moment, type is the primary vehicle for information and persuasion. Its appearance on screen is more crucial than ever. Intense competition for the user’s attention means that words must attract, inform (and maybe seduce) as quickly as possible. Flawless delivery of the message to the screen is the goal. The road to success is very broad, but the surface rather uneven. The uneven surface is apparent on many current Web pages. Page compilers often fail to respect the need for lines to be relatively short, or fail to appreciate the value of columns. 
They may overuse colour and type size, or underuse the variations which are available. And they can transfer the habits of typing on paper, forgetting that the HTML conventions (‘Hypertext Markup Language’, which instructs the computer about how to lay out text) may be different. To take just one example, a simple carriage return is enough to mark a paragraph ending on the paper page, but on screen this would not result in a new paragraph: to guarantee that, the HTML paragraph tag <p> needs to be inserted into the text at the appropriate point. Erratic lineation, obscured paragraph divisions, misplaced headings, and other such errors are the outcome. For the linguist, this complicates the task enormously, making it difficult to draw conclusions about the linguistic nature of the medium. The situation resembles that found in language learning, where learners pass through a stage of ‘interlanguage’, which is neither one language nor the other. Many Web pages are, typographically speaking, in an ‘in between’ state. There are other linguistic consequences of Web innocence, when we consider that people are producing content for a potentially worldwide readership. How does one learn to write for potential millions, with clarity and (bearing in mind the international audience) cultural sensitivity? The point is routinely recognized in chatgroups (chapter 5). One Usenet guide to manners, for example, has this to say: Keep Usenet’s worldwide nature in mind when you post articles. Even those who can read your language may have a culture wildly different from yours. When your words are read, they might not mean what you think they mean. The point is even more powerful when we consider the vastly greater range of subject-matter communicated via the Web. But the Web presents us with a rather different problem. Its language is under no central control. On the Web there are no powerful moderators (p. 138). Individual servers may attempt to ban certain types of site, but huge amounts of uncensored language slip through. There are several sites where the aim is, indeed, contrary to conventional standards of politeness and decency, or where the intention is to give people the opportunity to rant about anything which has upset them. Conventional language may be subverted in order to evade the stratagems servers use to exclude pornographic material: a Web address may use a juxtaposition of interesting and innocuous words, and only upon arrival at the site does one realize that the content is not what was conveyed by their dictionary meaning. The debate continues over the many social and legal issues raised by these situations – laws of obscenity and libel, matters of security and policing, questions of freedom of speech – all made more difficult by the many variations in practice which exist between countries. The Internet, as has often been pointed out, is no respecter of national boundaries. Issues associated with textual copyright have particular linguistic consequences. Although we are unable to alter someone else’s Web pages directly, it is perfectly possible to download a document to our own computer, change the text, then upload the new document to a Website we have created for the purpose. In this way, it is relatively easy for people to steal the work of others, or to adapt that work in unsuspected ways. There is a widespread opinion that ‘content is free’, fuelled by the many Web pages where this is indeed the case. But freedom needs to be supplemented with responsibility, and this is often lacking. 
Examples of forgery abound. Texts are sent to a site purporting to be by a particular person, when they are not. I know from personal experience that not all the ‘I am the author’ remarks in some book sites are actually by the author. And there have been several reported instances when a literary author’s work has been interfered with. This does not seem to be stopping the number of authors ready to put their work directly onto the Web, however, as the blogging phenomenon (chapter 8) further illustrates. Most traditional printed texts have a single author – or, if more than one author is involved, they have been authorized by a single person, such as a script editor or a committee secretary. Several pairs of eyes may scrutinize a document, before it is released, to ensure that consistency and quality are maintained. Even individually authored material does not escape, as publishers provide copy-editors and proof-readers to eradicate unintended idiosyncrasy and implement house style. It is in fact extremely unusual to find written language which has not been edited in some way – which is one reason why chatgroup and virtual worlds material is so interesting (p. 176). But on the Web, these checks and balances are often not present. There are multi-authored pages, where the style shifts unexpectedly from one part of a page to another. The more interactive a site becomes, the more likely it will contain language from different dialect backgrounds and operating at different stylistic levels – variations in formality are particularly common. Because reactions to an interactive site are easy to make, they are often made. The linguistic character of a site thus becomes increasingly eclectic. People have more power to influence the language of the Web than in any other medium, because they operate on both sides of the communication divide, reception and production. They not only read a text, they can add to it. The distinction between creator and receiver thus becomes blurred. The nearest we could get to this, in traditional writing, was to add our opinions to the margin of a book or to underline or highlight passages. We can do this now routinely with interactive pages, with our efforts given an identical typography to that used in the original text. It is a stylistician’s nightmare. A nightmare, moreover, made worse by the time-sink effect. A little while ago I was searching the Web for some data on the Bermudas. I received many hits, but the first few dozen were all advertisements for Bermuda shorts, which was not exactly what I had in mind. This is a familiar search-engine problem, but what was noticeable about this particular result was the time-range displayed by the hits. The ads were monthly accounts of the range and prices dating back several years – April 1994, May 1994, and so on. Quite evidently, many owners do not delete their old Web pages; they leave them there. I do not know of any source which will tell me just how much of the Web is an information rubbish-dump of this kind. Unless data-management procedures alter to cope with it, the proportion must increase. And in due course, there will be an implication for anyone who wants to use the Web as a synchronic corpus, in order to make statements about its stylistic character. Let us jump forward fifty years. We call up an interactive site to which people have now been contributing for two generations. 
The contributions will reflect the language changes of the whole period, displaying words and idioms yet unknown, and perhaps even changes in spelling, grammar, and discourse patterns. Though some sites already date-stamp all contributions (e.g. Amazon’s reader reactions), by no means all do so. In the worst-case scenario, we could encounter a single text created by an indefinite number of people at indefinite times over several years. Several competitors for the ‘world’s longest sentence ever’ are already of this form. While these are instances of language play, the implications for serious stylistic investigation are far-reaching. But handling the increasingly diachronic character of the Web, and coping with its chronological clutter, raises issues which go well beyond the linguistic. The trouble with the notion of ‘knowledge’ is that it is all-inclusive. The price of Bermuda shorts in April 1994 counts as knowledge. So does A.N. Other’s account of his break-up with his girlfriend, which may be found on his Web page. At the heart of knowledge management is therefore the task of evaluation. Judgements have to be made in terms of significance vs. triviality, with reference to a particular point of view, and criteria have to be introduced to enable a notion of relevance to be implemented. The common complaint nowadays is that we are being swamped by knowledge; such phrases as ‘information overload’ are everywhere. What use is it to me to be told that, if I search for ‘linguistics’ on my search engine, I have 86,764 hits? Part of Berners-Lee’s vision was shared knowledge: ‘the dream of people-to-people communication through shared knowledge must be possible for groups of all sizes’. But unless the notion of sharing is subjected to some sort of assessment, the dream begins to take on nightmarish qualities. For Berners-Lee, another part of the dream is a ‘semantic web... capable of analysing all the data on the Web – the content, links and transactions between people and computers’. This is a stirring vision, which will keep generations of semanticists yet unborn in jobs. But no semantic or pragmatic theory yet devised is capable of carrying out the kind of sophisticated relevance analysis which would be required. The Semantic Web (SW) has been variously described as ‘a dream’ and ‘a vision for the future of the Web’. The immediate contrast is with what currently exists: the SW is conceived as an evolution of the Web. The essential difference is that the Web is human-readable, whereas the SW will be machine-readable. Faced with the Web, in its current form, it is up to the human user to specify, find, and implement relevant links between one page or site and another. In an ideal SW, the links would be processed by computers without human intervention. For example, if I want to find out how to get from my home in Holyhead to a location in London, I might need to know the times of buses to the railway station, the times of trains, the underground route at the other end, and so on. I need to sew these elements of my journey into a seamless sequence. At the moment, this is a major human-led enquiry, as everyone knows. It does not take much imagination to see how the various timetables could be interrelated so that, with a single enquiry, everything comes together. That interrelating would be done in the SW. And of course it goes beyond timetables. 
If I decide to break my journey in Birmingham and want a hotel near the station, the various hotel and map data would likewise be integrated into the system of enquiry. To achieve this requires that all the content, links, and transactions on the Web have been analysed in a way which will allow this kind of integration to take place. Human beings will not be able to do this unaided; rather, computers themselves must help in the task. The aim of the SW is to ‘develop languages for expressing information in a machine processable form’. Berners-Lee gives a glossary definition like this: ‘the Web of data with meaning in the sense that a computer program can learn enough about what the data means to process it’. The emphasis is on computer learning. Like a child, ‘the SW “learns” a concept from frequent contributions from different sources’. The computers will first describe, then infer, and ultimately reason. An inference engine would link different concepts. For example, having identified the synonymy between car and automobile, then properties of the one would be equated with properties of the other; from a knowledge of the parent-child relationship, it would deduce that siblings would have the same mother and father; and so on. But the first steps are descriptive ones: the evolving collection of knowledge has to be expressed as descriptive statements, and the properties of documents (who wrote the pages, what they are about) have to be specified as metadata. A contrast is drawn between machines which understand the world and the machines of the SW, which do no more than ‘solve well-defined problems by performing well-defined operations on existing well-defined data’. To do this, there needs to be (as the World Wide Web Consortium puts it) a ‘common framework that allows data to be shared and reused across application, enterprise, and community boundaries’. Technically, this will be achieved through the use of a Resource Description Framework (RDF), which will integrate systems using XML for syntax and Uniform Resource Identifiers (URIs) for naming. Every concept in the SW will be tagged using a URI. The aim is to develop ‘an overall language which is sufficiently flexible and powerful to express real life’. And it will do this ‘without relying on English or any natural language for understanding’. To achieve this it will be necessary to develop an ontology, i.e. a formal description of the relationships between terms. In addition to the inference rules mentioned above, a major part of this ontology will be a taxonomy. This framework needs to have both a linguistic and an encyclopedic dimension. For example, to achieve a presence for myself on the SW, the linguistically definable features of the entity David Crystal would include ‘adult’, ‘male’, ‘linguist’, etc.; the encyclopedic features would include ‘living in Holyhead’, ‘works the following office hours’, ‘has this telephone number’, etc. It would be difficult to quarrel with the aim of the SW, as stated variously, to ‘improve human ability to find, sort, and classify information’ and to ‘provide a means of collaboration between people, using computers’. But coming to the SW as a linguist, I was immediately struck by its linguistic ‘innocence’. From what I have read so far, the examples that are given as illustrations of how the SW will operate are all of the ‘easy’ kind, and moreover all in English. 
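To make the contrast concrete, here is a minimal sketch of the kind of ‘easy’ inference just described: a synonym treated as equivalent, and siblinghood deduced from shared parentage. It is a toy illustration only, not any actual SW technology; the triples, property names, and individuals are invented for the example.

```python
# A toy store of subject-predicate-object triples, a synonym rule, and a
# sibling rule. Data and property names are invented for illustration.

triples = {
    ("car", "sameAs", "automobile"),
    ("automobile", "hasWheels", "4"),
    ("Anna", "hasParent", "Maria"),
    ("Ben", "hasParent", "Maria"),
}

def properties_of(term, facts):
    """Collect a term's direct properties, treating sameAs terms as equivalent."""
    equivalents = ({term}
                   | {o for s, p, o in facts if s == term and p == "sameAs"}
                   | {s for s, p, o in facts if o == term and p == "sameAs"})
    return {(p, o) for s, p, o in facts if s in equivalents and p != "sameAs"}

def siblings(facts):
    """Infer sibling pairs: two different individuals sharing a parent."""
    children = {}
    for s, p, o in facts:
        if p == "hasParent":
            children.setdefault(o, set()).add(s)
    return {(a, b) for kids in children.values() for a in kids for b in kids if a < b}

print(properties_of("car", triples))   # inherits ('hasWheels', '4') via the synonymy
print(siblings(triples))               # {('Anna', 'Ben')}
```

The point of the sketch is how little the machine needs to ‘understand’: everything depends on the data having been defined cleanly in advance, which is precisely the assumption questioned below.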
No reference is made to the kinds of problem well recognized in linguistic semantics, such as: How is the system to cope with language change? A simple method of time-dating URIs will not be enough, for a lexical change alters the balance of meanings within a semantic field. All the values shift, and this has to be taken into account. How is the system to cope with the varying values of words and expressions across cultures? I appreciate that, unlike the sixteenth-century universality models of classification, it is not the intention of the SW to be centralized, but to allow different resources to retain their individuality. But how exactly do we capture the different cultural values between, say, French X and English Y? At one level there is a translation; at another level there is not. How is the system to cope with fuzzy meanings? SW proponents say ‘information is given well-defined meaning’. But we know that not all areas of meaning are well defined. There is nothing wrong with starting with the ‘easy’ areas of enquiry, such as timetables and tax forms, where the information is ordered, well structured, and predictable. To have a simplified linguistics will make it easier to test interoperable metadata standards. But alongside this the SW needs to be testing ways of handling ‘messier’ data. How is the system to cope with multiple attributes of varying generality? It would be inadequate to define the term ‘Winston Churchill’ simply as ‘politician’, for he was also journalist, soldier, statesman, artist, essayist, Nobel prizewinner, and more, and these are only some of his public roles (alongside husband, father, etc.). We need to work with real languages, as well as formal languages. While work in formal semantics has made great progress in the last couple of decades, it is still a generation away from handling the properties of the Web. Apart from anything else, there is dispute over which formal language is best suited for the job, and it is difficult to see how this can be resolved, in the present state of the art. There needs to be a parallel track, with work on natural languages continuing alongside formal approaches. Putting this another way, it is important to spend time, early on in the SW project, working ‘bottom-up’ from the analysis of real data, with all its metaphorical and idiomatic complexity. The current direction of thinking by SW proponents seems to be largely ‘top-down’, beginning with general schemata and testing these against some very general properties, but with little evident awareness of the complications, even at this general level. For example, when it comes to defining the classes in an ontology, how many first-order classes define the geographical areas of the world (an example used in some SW expositions)? A typical model would recognize Europe, Asia, Africa, America and Australia. But there are of course many other options. North America, Central America, South America, Latin America? Eurasia, Australasia, Oceania? North Africa and Sub-Saharan Africa? The Middle East? Behind any taxonomy is a mindset about the world, and underneath that may lie a political or other ideology. We only have to compare taxonomies – such as the Dewey and Congress systems of library classification – to see how ideologies can shape taxonomies. Any approach to knowledge management, if it is to be successful, needs to be founded on solid linguistic principles. The SW, like the Web, as we have seen, goes well beyond linguistic semantics. 
When we are talking about knowledge, not just the senses of words in dictionaries, we are talking about content which includes both encyclopedic and linguistic data. And under the encyclopedic heading we are talking about all forms of content and communicative representation, not just language: photographs, diagrams, paintings, and music, for example. Language, however, has primacy, for the obvious reason that when we want to talk or write about this content we have to use language. So even though the focus might be on knowledge or concepts or some other non-linguistic notion, the dimension of linguistic encoding is never far away. The Semantic Web is a linguistic web, and it would be a good thing if its proponents recognized their terminological parentage and added a layer of linguistic sophistication to their already sophisticated thinking. The same point applies to search engines. Even the most basic semantic criteria are missing from the heavily frequency-dominated information-retrieval techniques currently used by search engines. All such engines incorporate an element of encyclopedic classification into their procedure, but this is only a small part of the answer to the question of how to implement relevance. Any search-engine assistant needs to supplement its encyclopedic perspective by a semantic one. The typical problem can be illustrated by the word depression, which if typed into the search box of a search engine will produce a mixed bag of hits in which its senses within psychiatry, geography, and economics are not distinguished (nor, of course, less widespread uses, such as in glassware and literature). The experience of trawling through a load of irrelevant hits before finding one which relates to the context of our enquiry is universal. The solution is obvious: to give the user the choice of which context to select. The user is asked on-screen: ‘Do you mean depression (economics) or depression (psychiatry) or depression (geography)...?’ Once the choice is made, the software then searches for only those hits relevant to the selection. The procedure sounds simple, but it is not, for the notion of context has to be formalized and the results incorporated into the software. But what is the semantic basis of a domain such as economics or psychiatry, or of any of their relevant subdomains? Which lexical items are the ‘key’ ones to be searched for, and how are they organized? The task goes well beyond scrutinizing the items listed in a dictionary or thesaurus. These can provide a starting point, but the alphabetical organization of a dictionary and the uncontrolled conceptual clustering of a thesaurus lack the kind of sharp semantic focus required. In linguistics, several notions have been developed to provide such a focus – such as the recognition of lexemes (as opposed to words), semantic fields, sense relations, and the componential analysis of lexical meanings. They are not unproblematic, but they do have considerable potential for application in such computer-mediated situations as Web searching and automatic document classification, once software is adapted to cope. The lack of even an elementary semantics also bedevils those software systems which attempt to evaluate the content of Web sites (censorware), replacing parts of words by X’s, filtering out pages, or blocking access to sites which contain ‘dangerous’ words. Thus, in one report, a student was barred from his high school’s Web site, accessed from the school library, because the software objected to the word high. 
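The kind of string-matching that produces such results can be sketched in a few lines. This is an illustrative toy only; the blocklist and the test strings are chosen to echo the examples reported here and in what follows, and are not taken from any real filtering product.

```python
# A sketch of naive substring blocking of the kind described above.
# The blocklist and examples are illustrative only.

BLOCKLIST = {"sex", "cum", "high", "eat her"}

def is_blocked(text: str) -> bool:
    """Block any text containing a listed string anywhere inside it.
    Spaces are ignored, so the phrase 'eat her' also matches 'Heather'."""
    squeezed = text.lower().replace(" ", "")
    return any(bad.replace(" ", "") in squeezed for bad in BLOCKLIST)

for name in ["accessexcellence.org", "cucumbers", "Heather", "high school library"]:
    print(name, "->", "blocked" if is_blocked(name) else "allowed")
# All four are blocked, although none is objectionable: the filter matches
# character strings, not words or senses.
```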
A woman named Hilary Anne was not allowed to register the username hilaryanne with a particular e-mail company because it contained the word aryan. Sites which fight against pornography can be banned because they contain words like pornography. In 2000, Digital Freedom Network held a ‘Foil the Filters’ contest to demonstrate the unreliability of censorware. Their Silicon Eye Award (‘for finding objectionable content where only a computer would look’) was given to someone who wanted to register an account with a site which would not accept the name Heather because it contained the phrase eat her! Honourable mentions included another enquirer who could not access a biotechnology site because its name (accessexcellence.org) contained the word sex. Doubtless residents of Essex and Sussex, people called Cockburn and Babcock, or anyone who uses Dick as their first name, encounter such problems routinely. Other examples of words which have been banned include cucumbers (because it contains cum), Matsushita (shit), analysis (anal), class (ass), and speech (pee). More puzzlingly, among the words which some censorware systems have blocked are golden, mate, and scoop. The linguistic naivety which lies behind such decision-making beggars belief. The linguistic limitations of word-processing and search-engine software affect our ability to find what is on the Web in several ways, and eventually must surely influence our intuitions about the nature of our language. So do the attempts to control usage in areas other than the politically correct. Which writers have not felt angry at the way pedants in the software companies have attempted to interfere with their style, sending a warning when their sentences go beyond a certain length, or when they use which instead of that (or vice versa), or -ise instead of -ize (or vice versa), or dare to split an infinitive? The advice can be switched off, of course; but many people do not bother to switch it off, or do not know how to. Sometimes they do not want to switch it off, as something of value is lost thereby. The software controlling the page I am currently typing, for example, inserts a red wavy line underneath anything which is misspelled, according to the dictionary it uses. I find this helpful, because I am no perfect typist. On the other hand it has just underlined scrutinizing and formalized, in the previous paragraph (though, curiously, not organized). The red lines are a constant irritant, and it takes a real effort of will not to yield to them and go for the software-recommended form. Whether others resist this insidious threat to linguistic variety I do not know. My feeling is that a large number of valuable stylistic distinctions are being endangered by this repeated encounter with the programmer’s prescriptive usage preferences. Online dictionaries and grammars are likely to influence usage much more than their traditional Fowlerian counterparts ever did. It would be good to see a greater descriptive realism emerge, paying attention to the sociolinguistic and stylistic complexity which exists in a language, but at present the recommendations are arbitrary, oversimplified, and depressingly purist in spirit. I am therefore pleased to see the arrival of satire, as a means of drawing attention to the problem. Bob Hirschfeld’s newspaper article, ‘Taking liberties: the pluperfect’, is one such contribution. 
In it he describes the deadly Strunkenwhite virus which returns e-mail messages to their senders if they contain grammatical or spelling errors. He explains: The virus is causing something akin to panic throughout corporate America, which has become used to the typos, misspellings, missing words and mangled syntax so acceptable in cyberspace. The CEO of LoseItAll.com, an Internet startup, said the virus has rendered him helpless. ‘Each time I tried to send one particular e-mail this morning, I got back this error message: “Your dependent clause preceding your independent clause must be set off by commas, but one must not precede the conjunction.” I threw my laptop across the room.’ His article concludes: ‘We just can’t imagine what kind of devious mind would want to tamper with e-mails to create this burden on communications’, said an FBI agent who insisted on speaking via the telephone out of concern that trying to e-mail his comments could leave him tied up for hours. It is good to see some artists coming on board. Turner prize nominee Tomoko Takahashi has a Web project she devised to object to the way software is imposing a ‘standardised corporate language on to our writing’ while ‘subtly altering its meaning’. She calls it Word Perhect. Some degree of normalization is unavoidable in automatic information retrieval (IR), as US librarian and information scientist Terrence Brookes comments: Although IR searchers are said to be ‘searching a database’ or ‘searching for documents’, these metaphors obscure the reality of the more mundane task of matching query term to index term. In an IR system hosting unrestricted text, the task of matching one string of characters to another string of characters would be very difficult unless there was a normalizing algorithm that processed both the document text and the query text. But for every normalization decision that has negligible consequences for linguistic meaning (such as standardizing the amount of blank space between paragraphs), there are several which result in the loss of important linguistic detail. If careful attention is not paid to punctuation, hyphenation, capitalization, and special symbols (such as &, /, ∗, $), valuable discriminating information can be lost. When contrasts from these areas are ignored in searching, as is often the case, all kinds of anomalies appear, and it is extremely difficult to obtain consistency. Software designers underestimate the amount of variation there is in the orthographic system, the pervasive nature of language change, and the influence context has in deciding whether an orthographic feature is obligatory or optional. For example, there are contexts where the ignoring of an apostrophe in a search is inconsequential (e.g. in St Paul’s Cathedral, where the apostrophe is often omitted in general usage anyway), but in other contexts it can be highly confusing. Proper names can be disrupted – John O’Reilly is not John Oreilly or John O Reilly (a major problem for such languages as French and Italian, where forms such as d’ and l’ are common). Hyphens can be critical unifiers, as in CD-ROM and X-ray. Similar problems arise when slashes and dashes are used to separate words or parts of words within an expression, as in many chemical names. Disallowing the ampersand makes it hard to find such firms as AT&T or P&O, whether solid or spaced; no hits may be returned, or the P... O string is swamped by other P O hits, where the ampersand has nothing to do with their identity. 
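A minimal sketch can show the kind of normalizing algorithm Brookes describes, and the discriminating detail it throws away. The rules and the stop-list below are invented for illustration; real IR systems differ in their details.

```python
import re

# A sketch of an aggressive normalizer: it lowercases, strips punctuation and
# special symbols, and drops stop words. The rules and stop-list are
# illustrative only, not taken from any real IR system.

STOP_WORDS = {"a", "and", "in", "it", "or", "the", "who"}

def normalize(text: str) -> list[str]:
    """Lowercase, replace non-alphanumeric characters with spaces, drop stop words."""
    words = re.sub(r"[^a-z0-9 ]", " ", text.lower()).split()
    return [w for w in words if w not in STOP_WORDS]

print(normalize("John O'Reilly"))        # ['john', 'o', 'reilly'] – the name is broken up
print(normalize("CD-ROM and X-ray"))     # ['cd', 'rom', 'x', 'ray'] – unifying hyphens lost
print(normalize("AT&T"))                 # ['at', 't'] – the ampersand disappears
print(normalize("Doctor Who"))           # ['doctor'] – the stop-list swallows 'Who'
print(normalize("AND (the Dutch firm)")) # ['dutch', 'firm'] – the firm's name vanishes
```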
When more than one of these conventions is involved in the same search, the extent to which search engines simplify the true complexity of a language’s orthography is quickly appreciated. Brookes points out that a string such as Brother-in-Law O’Toole would be normalized in different ways by different IR systems. And it gets worse, if O’Toole turns out to be the author of a particular version of a software program, as in Brother-in-Law O’Toole’s ‘Q & A’ System/Version 1.0. Few of us would know what to expect of any software system processing this search request. The stop words recognized by different systems pose a special problem. These usually comprise a list of the grammatical words which are so frequent and contain so little semantic content that the search mechanism ignores them. The trouble is that these words often form an obligatory part of something which does have semantic content (such as the title of a novel or film) or are homographic with content words – in which case they become irretrievable. For example, the Dutch firm for which the ALFIE project was undertaken was called AND (the initials of its founders); as and would be on any stop-list, a search engine which is not case-sensitive would make this string virtually impossible to find among the welter of hits in which the word and is prominent. The AND case is not unique, as anyone knows who has tried searching for the discipline of IT – let alone for the Stephen King novel, It. Several forms which are grammatical in one context become content items in another, such as a in Vitamin A, A-team, and the Andy Warhol novel a, or who in Doctor Who, as well as the polysemy involved in such words as will and may (cf. May). Finding US states by abbreviation, under these circumstances, can be tricky: there is no problem with such states as KY (Kentucky) and TX (Texas), but it would be unwise to try searching for Indiana (IN), Maine (ME), or Oregon (OR), or even for Ohio (OH) and Oklahoma (OK). Cross-linguistic differences add further complications: those computers which block the English words an and or also exclude the French words for ‘year’ and ‘gold’ (as well as a significant part of English heraldry, where the term or is crucial). C. L. Borgman comments: As the non-English-speaking world comes online and preserves their full character sets in their online catalogs and other retrieval systems, matching filing order, keyboard input, and display will become ever more complex. And it is precisely this world which is now coming online, in ever-increasing numbers.

Languages on the Web

The Web is an eclectic medium, and this is seen also in its multi-linguistic inclusiveness. Not only does it offer a home to all linguistic styles within a language; it offers a home to all languages – once their communities have a functioning computer technology. This has been the most notable change since the Web began. It was originally a totally English medium – as was the Internet as a whole, given its US origins. But with the Internet’s globalization, the presence of other languages has steadily risen. In the mid-1990s, a widely quoted figure was that just over 80% of the Net was in English. This was supported by a 1997 survey by Babel, a joint initiative of the Internet Society and Alis Technologies, the first major study of language distribution on the Internet. 
This study used a random number generator to find 8,000 computers hosting an HTTP server; a program then subjected a selection of pages to an automatic language identification, using software which could recognize 17 languages. Of 3,239 home pages found, the language distribution (after correction for various types of possible error) was as shown in Table 7.1.

Table 7.1 Language distribution on the Web (see fn. 37)
Ranking  Language         Number of pages  Corrected percentage
1        English          2,722            82.3
2        German           147              4.0
3        Japanese         101              1.6
4        French           59               1.5
5        Spanish          38               1.1
6        Swedish          35               0.6
7        Italian          31               0.8
8        Portuguese       21               0.7
9        Dutch            20               0.4
10       Norwegian        19               0.3
11       Finnish          14               0.3
12       Czech            11               0.3
13       Danish           9                0.3
14       Russian          8                0.1
15       Malay            4                0.1
         None or unknown                   5.6 (correction)
         Total            3,239            100

The gap between English and the other languages is notable, and supports the widespread impression, frequently reported in newspaper headlines, that the language of the Internet ‘is’ English. ‘World, Wide, Web: 3 English Words’ was the headline of one piece in The New York Times, and the article went on to comment: ‘if you want to take full advantage of the Internet there is only one real way to do it: learn English’. The writer did acknowledge the arrival of other languages: As the Web grows the number of people on it who speak French, say, or Russian will become more varied and that variety will be expressed on the Web. That is why it is a fundamentally democratic technology. However, he concluded: But it won’t necessarily happen soon. The evidence is growing that this conclusion was wrong. The estimates for languages other than English have steadily risen since then, with some commentators predicting that before long the Web (and the Internet as a whole) will be predominantly non-English, as communications infrastructure develops in Europe, Asia, Africa, and South America. A Global Reach survey estimated that people with Internet access in non-English-speaking countries increased from 7 million to 136 million between 1995 and 2000. In 1998, the total number of newly created non-English Web sites passed that for newly created English Web sites, with Spanish, Japanese, German, and French the chief players. Alta Vista had six European sites in early 2000, and was predicting that by 2002 less than 50% of the Web would be in English. Graddol predicted an even lower figure in due course, 40%. In certain parts of the world, the local language is already dominant. According to Japanese Internet author Yoshi Mikami, 90% of Web pages in Japan are now in Japanese. A report published in October 2000 by Jupiter Media Matrix suggested that the greatest growth in online households over the first half of the ’00s was going to be outside the USA. A Nua Internet Survey the previous month estimated that about 378 million people were online worldwide: of these, 161 million were in North America and 106 million in Europe. What is interesting is that 90 million were in Asia and the Pacific, a total that is likely to pass Europe’s soon, given the population growth differential between those two parts of the world. The 15 million in Latin America and the tiny 3 million in Africa show the potential for growth in those areas one day. The Web is increasingly reflecting the distribution of language presence in the real world, and there is a steadily growing set of sites which provide the evidence.
Table 7.2 Language distribution on the Web (see fn. 39)
Ranking  Language    Percentage online
1        English     35.2
2        Chinese     13.7
3        Spanish     9.0
4        Japanese    8.4
5        German      6.9
6        French      4.2
7        Korean      3.9
8        Italian     3.8
9        Portuguese  3.1
10       Malay       1.8
11       Dutch       1.7
12       Arabic      1.7
13       Polish      1.2
14       Russian     0.8
15       Others      4.6

They range from individual businesses doing their best to present a multilingual identity to major sites collecting data on many languages. Under the former heading we encounter several newspapers, such as the Belgian daily, Le Soir, which is represented by six languages: French, Dutch, English, German, Italian, and Spanish. Under the latter heading we find such sites as the University of Oregon Font Archive, providing 112 fonts in their archives for over 40 languages – including, in a nicely light-hearted addendum, Morse, Klingon, Romulan, and Tolkien (Cirth, Elvish, etc.). The same centre’s Interactive Language Resources Guide provides data on 115 languages. A World Language Resources site lists products for 728 languages. Some sites focus on certain parts of the world: an African resource list covers several local languages; Yoruba, for example, is illustrated by some 5,000 words, along with proverbs, naming patterns, and greetings. Another site deals with no less than 87 European minority languages. Some sites are very small in content, but extensive in range: one gives the Lord’s Prayer in nearly 500 languages. Nobody has yet worked out just how many languages have obtained a modicum of presence on the Web. I started to work my way down the Ethnologue listing of the world’s languages, and stopped when I reached 1,000. It was not difficult to find evidence of a Net presence for the vast majority of the more frequently used languages, and for a large number of minority languages too, especially in those technologically developed parts of the world which happen to contain large numbers of minority or endangered languages, such as the USA, Canada, and Australia. I would guess that about a quarter of the world’s languages have some sort of Internet presence now. How much use is made of these sites is, of course, a different matter. Until a critical mass of Internet penetration in a country builds up, and a corresponding mass of content exists in the local language, the motivation to switch from English-language sites will be limited to those for whom issues of identity outweigh issues of information. The notion of ‘critical mass’ is recognized in Metcalfe’s Law (named after Ethernet inventor, Robert M. Metcalfe): networks increase in functionality by the square of the number of nodes they contain. In other words, a single language site is useless, because the owner has nobody to link to; two provides a minimal communicativity; and so on. The future is also very much dependent on the levels of English-speaking ability in individual countries, and the likelihood of further growth in those levels. Code-mixing is also found in many interactive Internet situations, though not so much as yet on the Web. Technological progress (see chapter 8) will also radically alter the situation. There is no doubt that low-cost Internet use is going to grow, all over the world, as wireless networking puts the Internet within reach of people in developing nations who will use access devices powered by solar cells or clockwork generators. Global mobile phones will have dish-on-a-chip transceivers built into them, with communication up and down via LEO [‘low earth orbit’] satellite. All of this must have an impact on language presence. 
The future is also very much dependent on the levels of English-speaking ability in individual countries, and the likelihood of further growth in those levels. Code-mixing is also found in many interactive Internet situations, though not so much as yet on the Web. Technological progress (see chapter 8) will also radically alter the situation. There is no doubt that low-cost Internet use is going to grow, all over the world, as wireless networking puts the Internet within reach of people in developing nations who will use access devices powered by solar cells or clockwork generators. Global mobile phones will have dish-on-a-chip transceivers built into them, with communication up and down via LEO ['low earth orbit'] satellite. All of this must have an impact on language presence.

In the examples described above, we are encountering language presence in a real sense. These are not sites which only analyse or talk about languages, from the point of view of linguistics or some other academic subject; they are sites which allow us to see languages as they are. In many cases, the total Web presence, in terms of number of pages, is quite small. The crucial point is that the languages are out there, even if represented by only a sprinkling of sites. The Web is the ideal medium for minority languages, given the relative cheapness and ease of creating a Web page, compared with the cost and difficulty of obtaining a newspaper page, or a programme or advertisement on radio or television. On the other hand, developing a significant cyber-presence is not easy. As Ned Thomas comments, in an editorial for Contact, reflecting on the reduced dominance of English on the Net (p. 216):

It is not the case... that all languages will be marginalized on the Net by English. On the contrary, there will be a great demand for multilingual Web sites, for multilingual data retrieval, for machine translation, for voice recognition systems to be multilingual.... The danger for minority languages – and indeed for all small languages – is that they will be left outside the inner circle of languages for which it is commercially viable to develop voice recognition and machine translation systems.

Typically, such systems depend on the analysis of large bodies of language, which can be expensive and time-consuming to develop. The interviews conducted by Marie Lebert for her study indicate that those in the business are fairly unanimous about the future multilinguality of the Internet in general, and the Web in particular. Take this comment from Marcel Grangier, head of the Section française des Services linguistiques centraux (SLC-f) ['French Section of the Central Linguistic Services'] of the Swiss Federal Administration:

Multilingualism on the Internet can be seen as a happy and above all irreversible inevitability. In this perspective we have to make fun of the wet blankets who only speak to complain about the supremacy of English. This supremacy is not wrong in itself, inasmuch as it is the result of mainly statistical facts (more PCs per inhabitant, more English-speaking people, etc.). The counter-attack is not to 'fight against English' and even less to whine about it, but to increase sites in other languages. As a translation service, we also recommend the multilingualism of websites.

Tyler Chambers, creator of various Web language projects, agrees: 'the future of the Internet is even more multilingualism and cross-cultural exploration and understanding than we've already seen'. The point seems to be uncontentious among those who shaped the Web. Tim Berners-Lee, for example:

The Web must allow equal access to those in different economic and political situations; those who have physical or cognitive disabilities; those of different cultures; and those who use different languages with different characters that read in different directions across a page.

The problem is a practical one, but a great deal has been done since the mid-1990s. First, the ASCII character set was extended, so that non-English accents and diacritics could be included; but its 8-bit restriction meant that only a maximum of 256 characters could be handled – a tiny number compared with the array of letter-shapes in Arabic, Hindi, Chinese, Korean, and the many other languages in the world which do not use the Latin alphabet. The UNICODE system represents each character with 16 bits, allowing over 65,000 characters; but the implementation of this system is still in its infancy.
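The practical difference between an 8-bit character set and Unicode is easy to demonstrate with a present-day tool. The short Python sketch below is purely illustrative (the sample words are my own choices, not drawn from the text): an 8-bit encoding such as Latin-1 has room for only 256 characters, so scripts outside that range simply cannot be represented, whereas a Unicode encoding such as UTF-8 handles them all.

    # Each sample word is encoded in UTF-8 (a Unicode encoding) and then, for
    # comparison, in Latin-1, an 8-bit set limited to 256 characters. Only the
    # Latin-script word survives the 8-bit encoding; the others raise an error.
    samples = {
        "French":  "déjà vu",
        "Russian": "язык",      # 'language'
        "Arabic":  "لغة",       # 'language'
        "Chinese": "语言",       # 'language'
    }

    for language, word in samples.items():
        utf8_bytes = word.encode("utf-8")        # always succeeds
        try:
            word.encode("latin-1")               # 8-bit: only 256 characters
            fits_8bit = "yes"
        except UnicodeEncodeError:
            fits_8bit = "no"
        print(f"{language}: fits in an 8-bit set? {fits_8bit}; "
              f"UTF-8 uses {len(utf8_bytes)} bytes for {len(word)} characters")

Nothing here depends on Python in particular; the point is simply that a 16-bit (or larger) character repertoire removes the 256-character ceiling.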
The Web consortium now has an internationalization activity looking specifically at different alphabets, so that operating systems can support a page in any alphabet. And Berners-Lee looks forward to the day when the linking of meanings, within and between languages, is possible through the use of 'inference languages' which 'will make all the data in the world look like one huge database'.

A great deal has to be done before this day dawns. There needs to be immense progress in Internet linguistics, especially in semantics and pragmatics, and also in graphology and typography. There is an enormous gap to be filled in comparative lexicography: most of the English technical terms used on the Web have still not been translated into other languages, and a great deal of varying usage exists, with English loanwords and local variants uncertainly co-existing. On the positive side, there has been a great growth of interest in translation issues and procedures during the past decade. And localization (the adaptation of a product to suit a target language and culture) is the buzz-word in many circles.

There seems little doubt that the character of the Web is going to be increasingly multilingual, and that the issues discussed in the first half of this chapter are going to require revision in the light of what has been said in the second. But I have as yet found no comparative research into the way different languages approach the same problems on their respective Web sites. Nor is it clear what happens linguistically when Internet technology is used in new areas of application, or when new technological developments influence the language to move in different directions. What is clear is that the linguistic future of the Web, and of the Internet as a whole, is closely bound up with these applications and future developments. They therefore provide the topic of chapters 8 and 9.
