Corpus Linguistics - Future Predictions PDF

Summary

This document, likely chapter 7 of a book on corpus linguistics, discusses future directions and trends within the field. It analyses specialized corpora, non-professional writing, and professional writing, along with linguistic theory and language engineering. The analysis highlights the increasing sophistication and scale of corpus linguistics, emphasizing practical considerations and its role in various language analyses.

Full Transcript

## 7.2. THE FUTURE TO COME

One thing is clear from the review of the claims that the first edition made - corpus linguistics is continuing to develop at a rapid rate. Also, the predictions made in the first edition were largely correct: corpus linguistics has grown, and our estimates of what would happen have mostly proven to be true. Corpora are getting bigger, they are covering more languages and they are becoming multimodal. These trends will continue - of that we are quite sure. But what else can we see happening in the near future? In this section, we would like to set fresh predictions for the future of corpus linguistics under the general headings of specialized corpora, non-professional writing, professional writing, linguistic theory and language engineering.

### 7.2.1. SPECIALIZED CORPORA

In the previous paragraph, we said that our predictions from the first edition were largely correct. If there was one development we did not predict, it was that, as well as getting larger, corpora would also get smaller! While the individual corpus linguist's handcrafting of their own small corpus in order to address a particular research question was not unusual in the 1980s and early 1990s, one might have assumed that, as corpora have grown, the need for this activity would have faded. This has not proven to be the case.

Many linguists are interested in contrasting the language used in large, general-purpose corpora, such as the BNC, with small corpora representing text types, or the writings of a single author, that are not available in the BNC. Such work has been massively enabled by the development of programs such as Wordsmith - indeed, contrasting the language of a small corpus with that of the BNC is now a relatively easy matter, at least at the level of lexis, as one may even download the wordlists of the BNC to be read directly by Wordsmith. Researchers are also beginning to explore an unexpected avenue - building and exploiting micro-corpora.
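To make the lexical comparison described above concrete, the following is a minimal sketch of how a small corpus might be contrasted with a reference wordlist. It is not a description of how Wordsmith itself works. It assumes the log-likelihood statistic, one commonly used keyness measure, and every function name and all toy data here are hypothetical, introduced only for illustration.

```python
import math
import re
from collections import Counter


def word_frequencies(text):
    """Lower-case the text and count simple alphabetic word tokens."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def log_likelihood(freq_small, size_small, freq_ref, size_ref):
    """Log-likelihood score for one word observed in two corpora."""
    total = size_small + size_ref
    combined = freq_small + freq_ref
    # Expected frequencies if the word were spread evenly across both corpora.
    expected_small = size_small * combined / total
    expected_ref = size_ref * combined / total
    score = 0.0
    for observed, expected in ((freq_small, expected_small),
                               (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            score += observed * math.log(observed / expected)
    return 2.0 * score


def keywords(small_counts, ref_counts, top_n=10):
    """Rank words that are relatively more frequent in the small corpus."""
    size_small = sum(small_counts.values())
    size_ref = sum(ref_counts.values())
    scored = []
    for word, freq_small in small_counts.items():
        freq_ref = ref_counts.get(word, 0)
        if freq_small / size_small > freq_ref / size_ref:
            score = log_likelihood(freq_small, size_small, freq_ref, size_ref)
            scored.append((word, score))
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_n]


# Hypothetical toy data: a real study would load a downloaded reference
# wordlist (such as one for the BNC) rather than hard-coded counts.
small = word_frequencies("the whale the whale the sea the whale")
reference = Counter({"the": 1000, "of": 500, "sea": 3, "whale": 1})
for word, score in keywords(small, reference, top_n=3):
    print(f"{word}\t{score:.2f}")
```

Words with high scores are those whose frequency in the small corpus departs most from what the reference wordlist would lead one to expect. The sketch works only at the level of lexis, which is also the level at which the text describes the comparison as relatively easy.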
This trend towards small, specialized corpora is likely to continue, as linguists (or other researchers) with such interests continue to exploit corpus-linguistic methods - but not necessarily general language corpora - to pursue their research goals. So, while corpora may increase in size, the number of small, specialized corpora built by individual researchers or small teams is likely to continue to grow also.

### 7.2.2. NON-PROFESSIONAL WRITING

One trend in corpus linguistics - present from its very beginnings in the 1940s - has been its tendency to represent professional writing. There are exceptions to this general rule - the Survey of English Usage did collect non-professional writing and the BNC contains so-called ephemera, which are in essence non-professional writing. But the vast bulk of corpus data available today represents the writings of professional authors - journalists, novelists, technical writers etc. Such data is undoubtedly useful, but as a reflection of what everyday written English is truly like - letters to friends, notes to the neighbour, office memos, diary entries, emails, etc. - it is undoubtedly unrepresentative. If we are claiming that written language takes form x and spoken language takes form y, we should be very cautious about what we say, a point returned to in the next section. For the moment, it seems that an inordinate amount of attention has been paid to the writings of professional authors in corpus linguistics and very little has been paid to the writing of the vast majority of writers on the planet, who are non-professional authors. We would hope that this imbalance will change in the near future.

### 7.2.3. PROFESSIONAL WRITING

When a professional written text is studied in a corpus, what can we say about the production of that text? A rather simple-minded view is to say that we can comment on the writing of the author of the text or the authors of a particular genre.
Yet, professional writing is rarely the work of one author, even if a by-line in a newspaper article or the name on the dust-jacket of a book says so. Professionally composed texts go through a whole range of processes that may successively adapt and change what an author wrote. Considering this book as an example, it is the work of two authors who have read and commented on one another's work. Beyond that, the authors have taken the opinions of other readers and previous reviewers into account. The book has also been checked by a copyeditor and proof-reader. In short, the processes that produce the final written form of a professionally composed piece seldom solely represent the author or authors of the piece. This can impact upon a corpus analysis.

We came across an interesting example of this some years ago on a visit to a BBC newsroom. We were observing the process of a news article being composed for a World Service news broadcast. The text was sent around the newsroom on an internal network, drafted by one author, amended by another, and passed to the newsroom editor for comment. It finally returned to the author, who made a few more changes. The article was given a by-line which indicated it had been written by the original drafter of the piece, but the language had been heavily edited in the meantime by others. Interestingly, the newsroom editor had systematically edited out certain constructions from the original, as they violated what he believed should be the 'simple' syntax of the news report. If one were unaware of this process, one might have formed the erroneous conclusion either that the author did not use construct x or that construct x was systematically avoided by the author in news writing. Neither conclusion is true. Yet, without an understanding of the context within which the text was produced, one could have easily reached either conclusion.
If the context shifts then the language may change also - as was shown, for example, in the Schmied and Hudson-Ettle (1996) study of the language of East African English newspapers. This study clearly showed that norms of copy-editing were different between the British and East African contexts.

This anecdote leads us to our next prediction. We have argued in this book that corpus linguistics is a methodology. As a methodology, it can be linked to other methodologies in pursuit of a research question. The anecdote above is an example of the type of combination of methodologies we think needs to occur, certainly in the study of professional writing in corpora. Without this type of ethnomethodological study, we could easily have been misled in our attempt to use the corpus data we gathered for explanatory purposes. The description of the data itself would not change, but what we might like to conclude on the basis of that description would.

We remain sceptical at times when we read claims by corpus linguists that x and y occurring in corpus text z means that the author holds opinion a. It may mean nothing of the sort. x and y may be there through no act of the supposed author at all. Approaching the analysis of corpus texts by having an understanding of the processes that produced those texts seems to us a necessary step that corpus linguistics has to take, especially when it comes to professionally composed texts, where the process of production often means that the work of any supposed author is actually the work of many readers/editors/copyeditors/proof-readers. In suggesting this avenue of future development, we believe that corpus linguists will take on board work such as that of Fairclough (1993:78-86) which argues that discourse is linked to the processes of production, distribution and consumption. The linking of such observations to corpus linguistics seems to us a sensible step to take.

### 7.2.4. LINGUISTIC THEORY

The linkage of corpora and linguistic theory has been slow to emerge to date. The reasons for this are in part historical - a review of the first chapter of this book will largely explain why most generative grammarians will not use corpus data. Yet corpora have also been slow to have an impact in areas where no such ideological objections exist. Areas such as sociolinguistics and pragmatics, for example, are areas of linguistics in which corpus data could have a great role to play. So why has it not happened?

To contradict ourselves, we could say that it already has. Sociolinguists have often worked with 'real' language data. Indeed, it is difficult to conceive of most sociolinguistics being undertaken on the basis of introspection alone. Similarly in pragmatics, most of the work one finds published in journals such as _The Journal of Pragmatics_ has a strong corpus flavour. The main problems to date appear to have been those of scale, the nature of the data and of relevance.

On the question of scale, researchers in pragmatics tend to work with much smaller datasets than are common in corpus linguistics. Indeed, they typically tend to work with the sort of microcorpora which represent a growing trend in corpus linguistics. As such, we may hope that as the 'corpora have to be big' culture changes to accept the usefulness of smaller corpora, this cultural difference will disappear.

Another difference which is surely fading is related to the nature of the data. Researchers in sociolinguistics and pragmatics are most often interested in the study of spoken language - spoken corpus resources have always been fewer and smaller than written corpus resources, as they are more expensive - in terms of time and money - to construct than written corpora. However, with the advent of corpora such as the spoken corpus of the BNC, spoken corpus resources of some relevance to such researchers are now clearly available.
With the publication of the _Survey of English Dialects_, further spoken language data will become available, this time including both the acoustic and transcribed data, which should prove helpful to researchers in sociolinguistics and pragmatics. Also, corpora which include more of the context of their production are vital if work in areas such as discourse analysis and pragmatics is going to exploit corpus data.

Let us finally turn to the question of relevance. In short, why should researchers in, say, sociolinguistics use corpus data? Sociolinguistics has been using corpora of sorts for decades - what has corpus linguistics got to offer them that they do not have already? The answer, we believe, is shown in the way corpus linguists are using micro- and large-scale corpora at the moment - corpora such as the BNC can provide a useful yardstick to compare micro-corpora against. If a sociolinguist believes that a particular word/structure is typical of a dialect, then general-purpose corpora may be able to provide evidence to support that hypothesis.

A further way corpus linguistics differs from other areas of linguistics in which naturally occurring language data has been examined is with respect to the articulation and manipulation of data. Corpus linguistics has had a focus upon the development of schemes to allow corpora to be encoded reliably. On the basis of that encoding, corpus linguists have developed programs to exploit that data. This emphasis on the systematic encoding of data and tools for its manipulation is an area where corpus linguistics excels. So, as well as the corpora produced, other areas of linguistics should take an interest in how corpora have been composed and manipulated by corpus linguists. We hope that in the near future a full marriage of corpus linguistics with a wide range of linguistic theories will occur, in part encouraged by observations such as those we have made here.

### 7.2.5. LANGUAGE ENGINEERING

While corpora have had an increasing impact on linguistics, their impact upon language engineering - which we will loosely describe here as application-oriented computational linguistics - has been no less noticeable. It is now difficult to find papers in the proceedings of language-engineering conferences which do not mention corpora. Corpora are becoming the raw material from which language-engineering applications are built. Architectures for language engineering, such as the General Architecture for Text Engineering (GATE, Gaizauskas et al. 1996), now regularly take the representation of corpus data and corpus annotations into account.

The advantage we can see for corpus linguists in this is that there is evidence that, not too far in the distance, access to tools for corpus annotation will become much more widespread, and the application of those tools to microcorpora will become ever easier. To take GATE as an example, this is slowly migrating into an internet-based architecture. There are already examples of limited tagging systems available on the internet. Imagine how much more work we will be able to do when, with the minimum of effort, corpus texts can be tagged, parsed and lemmatised on machines much more powerful than our desktop PCs, while appearing never to move from them. An architecture such as GATE, being presented in a web browser but in fact running on a powerful machine hundreds of miles away, will allow all - not just the hierophants of computer corpus linguistics - to use up-to-date corpus annotation software to build their corpora.

## 7.3. CONCLUSION

This is not a conclusion - it cannot be. Corpus linguistics is an area which seems to be developing at such an amazing rate that any conclusions can only be halting and temporary. Indeed, by the time this manuscript appears in published form, further developments may already have begun.
There is little doubt in our minds that, within a few years of the appearance of this book, the field will have moved on so far that we will need to substantially rewrite it. We welcome that. Corpus linguistics needs to develop further and to continue doing so into the foreseeable future - because in doing so, it is changing our view of what we should do when using language data to explore linguistic hypotheses. What was wild fancy in one year suddenly becomes technically possible in the next. What seemed like an off-the-wall idea for a corpus at one point suddenly seems like the best way to approach a subject when the corpus is built and exploited. Corpora are challenging our view of what it is possible to study in linguistics and how we should study it. Such challenges should always be welcomed.
