Glossary of terms used in concordancing literature and on this site

Collocate

A collocate is a word that appears with another word more often than chance alone would suggest. Thus, 'coffee' is a collocate of 'hot'. Usually, for a word to be considered a collocate, it has to occur within four or five words to the left or right of the node word.
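As a rough illustration, the first step in finding collocates is simply to count which words fall within the chosen distance of the node. The sketch below does only that. The function names, the simple tokenizer and the sample sentences are invented for this example, and raw counts alone do not show that a pairing is more frequent than chance; that judgement needs a statistical measure, which is not shown here.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def collocate_counts(tokens, node, span=4):
    """Count the words found within `span` tokens to the left or right of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        left = tokens[max(0, i - span):i]
        right = tokens[i + 1:i + 1 + span]
        counts.update(left + right)
    return counts

# Invented sample text, used only to exercise the function.
sample = "She drank hot coffee. He likes hot coffee with milk. The hot sun set."
counts = collocate_counts(tokenize(sample), "hot", span=4)
print(counts.most_common(3))
```

Note that this sketch ignores sentence boundaries, so a word in a neighbouring sentence can still be counted if it falls inside the window.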

Collocation

The word 'collocation' is used in two ways. First, it denotes groups of words that often occur within a short space of each other. These groups of words have been referred to as 'clusters', 'lexical phrases', 'lexical combinations', 'sequences of lexical items', and various other terms. Second, 'collocation' refers to the restrictions on how words can be used together. For example, in English the word 'dream' cannot correctly be used with the word 'see'; the usual verb is 'have'. Instinctive knowledge of collocations, in both senses, is a characteristic of the fluent user of a language. Thus, a knowledge of collocations is vital for successful mastery of English.

Concordance

The concordance is at the centre of corpus linguistics, because it gives access to many important language patterns in texts. A concordance is a list of all the words, or a certain word, used in a text or a body of texts, together with the context in which the words appear. This context is usually no more than 7 or 8 words to the left and right of the node word.

Concordances are lines of text. Each line contains a node word or phrase, which is usually in the centre of the page. The node is the word or phrase for which the researcher elected to search. The words to the left and right of the node are the context in which each instance of the word or phrase appears in a certain text. The presentation of words and their context in this way is often referred to as 'key word in context' or KWIC. The number of words to the left and right of the key word or phrase can often be adjusted. This means that a concordance line need not be just one line of text with about six or seven words to the left and right of the node. If the operator wishes, she can produce a concordance with many more context words. Furthermore, depending on the software, it is possible to produce concordances in which the context is the sentence in which the key word appears, rather than a fixed number of words to the left and right. This avoids having the concordance lines begin and end in mid-sentence as in the examples below. It should be remembered that each concordance line is independent of the lines before and after it. Their only connection is that they may be from the same text.
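The KWIC layout described above can be sketched in a few lines of Python. This is a minimal illustration, not the method used by any particular concordancer: the function name `kwic`, its parameters and the sample sentence are all invented here, and the text is split on spaces only.

```python
def kwic(tokens, node, span=4, width=30):
    """Return KWIC lines with the node word aligned in the same column."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() != node.lower():
            continue
        left = " ".join(tokens[max(0, i - span):i])
        right = " ".join(tokens[i + 1:i + 1 + span])
        # Trim the contexts to a fixed width, then right-align the left
        # context so that every node starts in the same column.
        left = left[-width:]
        right = right[:width]
        lines.append(f"{left:>{width}}  {tok}  {right}")
    return lines

# Invented sample text, used only to exercise the function.
text = "the nature of the problem is that human nature rarely changes"
lines = kwic(text.split(), "nature", span=3)
for line in lines:
    print(line)
```

Changing `span` adjusts the number of context words on each side, which corresponds to the adjustable context mentioned above. The alignment is only visible when the output is shown in a fixed width font.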

The following ten concordances were extracted from the Bank of English using Cobuild Direct. The search term was 'TO+RB+VB', which means 'to' as an infinitive marker, plus any adverb, plus any verb in its base form.

The next ten concordances were extracted from a corpus of the Guardian Weekly newspaper using WordSmith. The search term was the word 'nature'. The concordances were then sorted so that the words immediately to the right of the node appeared in alphabetical order.

Concordancers

Originally, a concordancer was the person who made concordances by reading through a text and writing down all instances of a word and its contexts. Now the word refers to computer programs that do this task.

Concordancing has to be one of the most efficient uses of a computer available to a language teacher. Many tasks that we undertake with a computer could in fact be done quite easily (if not as prettily) with pencil and paper (and perhaps a calculator). However, unless you have a lot of time on your hands, producing concordances can only be carried out by a computer. Since concordances are also the only way of providing students with large amounts of authentic contextual data, including large numbers of collocations in authentic contexts, it seems obvious that concordancers should be standard issue for all language teachers.

Furthermore, because concordances are computer-generated, they can be manipulated in various ways. For example, the lines can be ordered according to particular context words, or the node word (the key word) can be blanked out. The ability to manipulate concordances is invaluable to both teacher and researcher.
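Two such manipulations, sorting by the first word to the right of the node and blanking out the node, might look like the following sketch. It assumes each concordance line is held as a (left context, node, right context) tuple; that representation, the function names and the sample lines are invented for this illustration.

```python
def sort_by_right(hits):
    """Sort (left, node, right) hits by the first word to the right of the node."""
    def first_right_word(hit):
        right_words = hit[2].split()
        return right_words[0].lower() if right_words else ""
    return sorted(hits, key=first_right_word)

def blank_node(hit, filler="_"):
    """Replace the node with a gap of the same length, e.g. for a gap-fill task."""
    left, node, right = hit
    return (left, filler * len(node), right)

# Invented sample lines, used only to exercise the functions.
hits = [
    ("it is in the", "nature", "of things"),
    ("a force of", "nature", "and nothing else"),
    ("human", "nature", "being what it is"),
]
for left, node, right in map(blank_node, sort_by_right(hits)):
    print(f"{left:>15}  {node}  {right}")
```

Keeping the three parts of each line separate, rather than storing one formatted string, is what makes this kind of re-sorting and gapping straightforward.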

Corpus and corpora

A corpus is a body of text. Usually, the term refers to a body of text that has been assembled with the specific intention of carrying out some kind of analysis. 'Corpora' is the plural form of the noun. (See the notes on 'Finding and Choosing Corpora'.)

There are two kinds of corpora. The first is the 'sample corpus', which is a finite collection of texts, chosen as representative of English in its entirety, or because they are representative of a certain genre. Once a sample corpus is established, it is not added to or changed in any way. The other kind of corpus is the 'monitor corpus', which re-uses language text that has been prepared in machine-readable form for other purposes.

DDL - Data Driven Learning

The term Data Driven Learning (DDL) was coined by Tim Johns.

Fixed width font

This is perhaps not an obvious candidate to appear in a glossary of this kind. However, for those of you wishing to transfer a concordance to a word processor, it is worth remembering that most fonts are not of 'fixed width'. That is to say, different letters have different widths. This means that the node word that was perfectly centred in the concordancing software will no longer be in the centre of the page when pasted into a word processor unless a fixed width font such as Courier is used. The font in the above two examples is Courier.

KWIC - Key word in context

When a concordance is made of a certain word, that word is the key word, and in a concordance it appears in context, i.e. with a predefined number of words to the left and right.

Node

The 'node' or 'node word' is the word that appears in the middle of the screen in a list of concordances. It is the 'key word'.

Running words

This term is used in measuring the length of a text. Each successive word-form is counted once, whether or not that particular form has occurred before. For example, 'The wind in the willows' contains 5 running words. 'Running words' are the same thing as 'tokens'.

Span

This is the measurement, in words, of the co-text of the node. A span of -4, +4 means that four words on either side of the node word will be taken to be its relevant verbal environment.

Text file

Almost all concordancers require corpora to be saved as text files. This is the simplest form of file in which words are stored. There is no formatting. A text file can be read by any computer regardless of operating system. In the Windows environment, the name given to a text file should normally end in '.txt'.

Types and Tokens

The number of 'tokens' in a corpus is the simple word count, the number of running words in the corpus. The number of 'types' in a corpus is the number of different words in the corpus. These are the words that appear in a word index. In some concordancers (e.g. WordSmith) there is a command to calculate the type:token ratio. It can be said that the higher the ratio, the harder the text is to comprehend.
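The arithmetic can be sketched as follows, using the phrase from the 'Running words' entry. The function name is invented, and the sketch assumes that words are separated by spaces and that 'The' and 'the' count as the same type; concordancing software may tokenize differently and may calculate the ratio in a different way.

```python
def type_token_ratio(text):
    """Return (types, tokens, ratio) for lowercase, space-separated words."""
    tokens = text.lower().split()
    types = set(tokens)
    # Guard against an empty text, which would otherwise divide by zero.
    ratio = len(types) / len(tokens) if tokens else 0.0
    return len(types), len(tokens), ratio

print(type_token_ratio("The wind in the willows"))  # (4, 5, 0.8)
```

Here there are 5 tokens but only 4 types, because 'the' occurs twice, giving a ratio of 0.8.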

This page: http://www.nsknet.or.jp/~peterr-s/concordancing/glossary.html