A few words on finding and choosing corpora

The concordancer is the 'machine'; for the machine to function, the operator needs some 'fuel': a body of text, or corpus. Acquiring large amounts of text that can be 'read' by a computer is not a problem. The internet and email ensure that there is now an abundance of machine-readable text. It is now possible to subscribe to email versions of newspapers, thus ensuring a weekly download of text. There are also a number of web sites that allow individuals to download entire books. Links to some of these sites can be found at the author's web site. The following should be considered the main sources of text:

1. The world wide web
2. Email
3. Commercially sold corpora
4. Students' homework (for error analysis)
5. Scanning (beware of copyright infringement)

Choosing corpora

While the acquisition of 'some kind of electronic text' is straightforward, the acquisition of an 'appropriate' body of text is the most problematic step in the production of concordances. The first factor that dictates the choice of text is the operator's objectives. A study of a certain usage in spoken English requires a corpus of spoken English, while the creation of language material to help students write scientific papers would require an entirely different kind of corpus. The next point to be aware of is a technical one concerning the form of the text. For a concordancer to function, the text has to be machine readable. In other words, it has to be saved on a computer disk. Furthermore, it should not be formatted in any way, and thus it should be saved as a 'text file'. Some commercially sold corpora have tagged texts, which means that various information about the texts and individual words has been saved with the texts. This information is not the same as formatting, and is invaluable if the operator wishes to search for certain grammatical forms as opposed to specific words (see figure one above).
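To illustrate why tagging matters, the following is a minimal sketch in Python of a search by grammatical category rather than by specific word. It assumes a hypothetical 'word_TAG' layout and invented tag names; commercially sold corpora use their own tag sets and file formats, so the sample text and the function name words_with_tag are illustrative only.

```python
# Hypothetical tagged sample in a "word_TAG" layout.
# Real tagged corpora define their own tag sets and separators.
TAGGED = "The_DET students_NOUN have_VERB written_VERB two_NUM essays_NOUN ._PUNCT"


def words_with_tag(tagged_text, tag):
    """Return every word in tagged_text whose tag equals `tag`."""
    hits = []
    for token in tagged_text.split():
        # Split on the last underscore so the word part is kept whole.
        word, _, token_tag = token.rpartition("_")
        if token_tag == tag:
            hits.append(word)
    return hits


print(words_with_tag(TAGGED, "VERB"))  # ['have', 'written']
```

A search of this kind would not be possible on an untagged text file, where only the word forms themselves are available to the concordancer.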

A third variable is that of text level. For the language teacher this is a particular concern in two areas. There is a distinct shortage both of spoken text and of authentic text of any kind that can be easily understood by lower-intermediate learners. Of these two shortages, the more troublesome is that of authentic text that will allow teachers to produce concordances in which the contexts are not too far beyond the comprehension level of their students. Above, it was noted that the attraction of concordances is that one is able to expose students to a large number of authentic examples. However, the very authenticity of this material renders much of it unusable for the instruction of beginners. Thus, if there is a weakness in the concept of concordancing and teaching, it is that the very concordances that are intended to enlighten learners may only serve to mystify and frustrate them. Despite this weakness, concordancers still have a role in the teaching of lower-level learners. At the very least, the information gleaned from an analysis of concordances should help teachers in the construction of language material. There is also the possibility of using student-created text as a corpus (Mark and Minagawa). This would be invaluable for the teacher in analysing students' errors, and for the students themselves in learning from their mistakes.

The final factor to consider is the size of the text. The smaller the text, the fewer the instances of any given word, and the less able the user is to draw conclusions about usage in that particular type of text. For a more in-depth account of the creation of a corpus, the reader should consult chapter one of Sinclair. Suffice it to say that the producer of concordances should consider the following four points when selecting a corpus:

1. Your objectives dictate type/genre of text
2. The form of text: is it already machine readable, and is it plain text or formatted?
3. Level of text and students
4. Size of text; the bigger the better
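
The effect of corpus size on the number of instances can be seen with a very small keyword-in-context routine. This is a rough sketch in Python, not the method used by any particular concordancer; the function name concordance, the width parameter and the two miniature 'corpora' are assumptions made for illustration.

```python
import re


def concordance(text, keyword, width=30):
    """Return one keyword-in-context line per whole-word match of `keyword`."""
    lines = []
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        # Take up to `width` characters of context on either side of the hit.
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        lines.append(f"{left:>{width}} [{match.group()}] {right}")
    return lines


# Hypothetical miniature corpora: the second is the first repeated 50 times.
small = "The corpus is small. A corpus needs text."
large = " ".join([small] * 50)

print(len(concordance(small, "corpus")))  # 2
print(len(concordance(large, "corpus")))  # 100
```

Repeating a text is of course no substitute for collecting more of it, but the count shows the point made above: the number of instances available for study grows with the amount of text supplied.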

This page: http://www.nsknet.or.jp/~peterr-s/concordancing/corpora.html