Collocate
A collocate is a word that appears with
another word more often than chance alone would predict. Thus,
'coffee' is a collocate of 'hot'. Usually, for a word to count
as a collocate, it has to occur within four or five words
to the left or right of the node word.
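The idea of counting words within a fixed distance of a node can be sketched in Python. This is only a minimal illustration: the function name `collocates`, the sample sentences and the simple regular-expression tokenizer are all assumptions made for the example. It counts raw co-occurrences and does not test whether a word appears more often than chance would suggest.

```python
import re
from collections import Counter

def collocates(text, node, span=4):
    """Count words found within `span` words to the left or right of `node`."""
    # Assumption: a word is any run of letters or apostrophes, case ignored.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - span):i]
            right = tokens[i + 1:i + 1 + span]
            counts.update(left + right)
    return counts

# Invented sample text, not taken from any corpus.
sample = "I like hot coffee. Nothing beats hot coffee on a cold morning."
counts = collocates(sample, "hot", span=2)
print(counts["coffee"])  # prints 2
```

With a span of 2, 'coffee' is counted twice as a neighbour of 'hot' in the sample, while 'morning' falls outside the window and is not counted.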
Collocation
The word 'collocation' is used in
two ways. First, it denotes groups of words that often
occur within a short space of each other. These groups of words
have been referred to as 'clusters', 'lexical phrases', 'lexical
combinations', 'sequences of lexical items', and various other
terms. Second, 'collocation' refers to the restrictions
on how words can be used together. For example, in English the word 'dream'
cannot correctly be used with the verb 'see': one has a dream rather
than sees one. Instinctive knowledge
of collocations, in both senses, is a characteristic of the fluent
user of a language. Thus, a knowledge of collocations is vital
for successful mastery of English.
Concordance
The concordance is at the centre of
corpus linguistics, because it gives access to many important
language patterns in texts. A concordance is a list of all the
words, or a certain word, used in a text or a body of texts,
together with the context in which the words appear. This context
is usually no more than 7 or 8 words to the left and right of
the node word.
Concordances are lines of text. Each
line contains a node word or phrase, which is usually in the
centre of the page. The words to the left and right of the node
are the context in which each instance of the word or phrase
appears in a certain text. The node is the word or phrase for
which the researcher elected to search. The presentation of
words and their context in this way is often referred to as 'key
word in context' or KWIC. The number of words to the left and
right of the key word or phrase can often be adjusted, so
a concordance line need not be a single line
of text with about six or seven words to the left and right of
the node. If the operator wishes, she can produce a concordance
with many more context words. Furthermore, depending on the software,
it is possible to produce concordances in which the context is
the sentence in which the key word appears, rather than
a fixed number of words to the left and right. This avoids having
the concordance lines begin and end in mid-sentence, as
in the examples below. It should be remembered that each concordance
line is independent of the lines before and after it. Their only
connection is that they may be from the same text.
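How a concordancer builds such lines can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not a description of any particular program: the names `kwic` and `format_kwic`, the `width` and `left_width` parameters and the sample text are all invented for the example.

```python
import re

def kwic(text, node, width=4):
    """Return (left, node, right) tuples with `width` words of context per side."""
    # Assumption: words are runs of letters, digits or underscores.
    tokens = re.findall(r"\w+", text)
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

def format_kwic(lines, left_width=30):
    """Right-align the left context so that the node words share one column."""
    return [f"{left:>{left_width}}  {node}  {right}" for left, node, right in lines]

# Invented sample text, not taken from any corpus.
sample = "It is in the nature of things to change. Human nature does not change quickly."
lines = kwic(sample, "nature", width=3)
for row in format_kwic(lines):
    print(row)
```

Changing `width` adjusts how many context words appear on each side. Because the context is a fixed number of words, the second line starts in the middle of the previous sentence, as described above. The node words only line up on screen when the output is shown in a fixed width font.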
The following ten concordances were
extracted from the Bank of English using Cobuild Direct. The
search term was 'TO+RB+VB', which means 'to' as an infinitive
marker, plus any adverb, plus any verb in its base form.
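A search of this kind can be imitated on a small scale in Python. The sketch below is not how Cobuild Direct works internally; it simply assumes that each word already carries a part-of-speech tag and looks for the tag sequence TO, RB, VB. The function name `find_to_adverb_verb` and the hand-tagged sentence are invented for the example.

```python
def find_to_adverb_verb(tagged):
    """Return (to, adverb, verb) word triples whose tags are TO, RB, VB in order."""
    pattern = ("TO", "RB", "VB")
    hits = []
    for i in range(len(tagged) - 2):
        window = tagged[i:i + 3]
        if tuple(tag for _, tag in window) == pattern:
            hits.append(tuple(word for word, _ in window))
    return hits

# Invented sentence, tagged by hand for illustration only.
tagged = [
    ("She", "PRP"), ("decided", "VBD"), ("to", "TO"), ("quietly", "RB"),
    ("leave", "VB"), ("and", "CC"), ("to", "TO"), ("return", "VB"),
    ("later", "RB"),
]
print(find_to_adverb_verb(tagged))
```

Only 'to quietly leave' matches, because in 'to return later' the adverb follows the verb rather than coming between 'to' and the verb.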
The next ten concordances were extracted
from a corpus of the Guardian Weekly newspaper using WordSmith.
The search term was the word 'nature'. The concordances
were then sorted so that the words immediately to the right of
the node appeared in alphabetical order.
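Sorting of this kind is easy to sketch in Python. The example below is an illustration only and says nothing about how WordSmith does it: the function name `sort_by_right` and the three concordance lines are invented, with each line assumed to be stored as a (left context, node, right context) tuple.

```python
def sort_by_right(lines):
    """Sort (left, node, right) lines by the first word to the right of the node."""
    def first_right_word(line):
        right_words = line[2].split()
        # Lines with no right context sort first.
        return right_words[0].lower() if right_words else ""
    return sorted(lines, key=first_right_word)

# Invented concordance lines, not taken from the Guardian Weekly corpus.
lines = [
    ("the laws of", "nature", "seem to apply"),
    ("it is human", "nature", "after all"),
    ("a love of", "nature", "is common"),
]
for left, node, right in sort_by_right(lines):
    print(f"{left:>15}  {node}  {right}")
```

After sorting, the lines appear in the order 'after', 'is', 'seem', which makes repeated right-hand collocates easy to spot in a longer list.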
Concordancer
Originally a concordancer was the person
who made concordances by reading through a text and writing
down all instances of a word and its contexts. Now the word refers
to computer programs that do this task.
Concordancers have to be one of the
most efficient uses of a computer available to a language
teacher. Many tasks that we undertake with a computer could
in fact be done quite easily (if not as prettily) with pencil
and paper (and perhaps a calculator). However, unless you have
a lot of time on your hands, producing concordances can only
be carried out by a computer. Since concordances are also
the only way of providing students with a lot
of authentic contextual data, and of exposing them to large
numbers of collocations in authentic contexts, it seems obvious
that concordancers should be standard issue for all language teachers.
Furthermore, because concordances are
computer-generated, they can be manipulated in various
ways. For example, concordance lines can be ordered
according to particular context words, or the node word (the key
word) can be blanked out. The ability to manipulate concordances
is invaluable to both teacher and researcher.
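Blanking out the node word, for instance to make a gap-fill exercise, can be sketched as follows. This is a minimal illustration: the function name `blank_node`, the underscore blank and the sample lines are assumptions made for the example.

```python
import re

def blank_node(line, node, blank="_____"):
    """Replace each whole-word occurrence of `node` in `line` with a blank."""
    # \b keeps the match to whole words, so 'denatured' is left alone.
    pattern = re.compile(r"\b" + re.escape(node) + r"\b", flags=re.IGNORECASE)
    return pattern.sub(blank, line)

# Invented concordance lines.
print(blank_node("it is human nature after all", "nature"))
print(blank_node("denatured alcohol", "nature"))
```

The first line comes out with the node replaced by a blank; the second is unchanged, because 'nature' there is only part of a longer word.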
Corpus and corpora
A corpus is a body of text. Usually,
the term refers to a body of text that has been created with the specific
intention of carrying out some kind of analysis. 'Corpora' is the
plural form of the noun. (See the
notes on 'Finding and Choosing Corpora'.)
There are two kinds of corpora. The
first is the 'sample corpus', which is a finite collection of
texts, chosen as representative of English in its entirety, or
because they are representative of a certain genre. Once a sample
corpus is established, it is not added to or changed in any way.
The other kind of corpus is the 'monitor corpus', which re-uses
language text that has been prepared in machine-readable form
for other purposes and, unlike a sample corpus, continues to grow
as new text is added.
DDL - Data Driven Learning
The term Data Driven Learning (DDL)
was coined by Tim Johns.
Fixed width font
This is perhaps not an obvious candidate to
appear in a glossary of this kind. However, for those of you
wishing to transfer a concordance to a word processor, it is
worth remembering that most fonts are not of 'fixed width'. That
is to say that their letters differ in width. This means
that the node word that was perfectly centred in the concordancing
software will no longer be in the centre of the page when pasted
into a word processor unless a fixed width font such as Courier
is used. The font in the above two examples is Courier.
KWIC - Key word in context
When a concordance is made of a certain
word, that word is the key word, and in a concordance it appears
in context, i.e. with a predefined number of words to the left
and right.
Node
The 'node' or 'node word' is the word
that appears in the middle of the screen in a list of concordances.
It is the 'key word'.
Running words
This term is used in measuring the length
of a text. Each successive word-form is counted once, whether
or not that particular form has occurred before. For example,
the title 'The Wind in the Willows' contains 5 running words. 'Running
words' are the same thing as 'tokens'.
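Counting running words can be sketched in one short Python function. The name `running_words` is invented for the example, and the sketch assumes that words are separated by spaces and that punctuation does not need special handling.

```python
def running_words(text):
    """Count every word-form in order, including repeated forms."""
    return len(text.split())

print(running_words("The wind in the willows"))  # prints 5
```

Both occurrences of 'the' are counted, which is why the total is 5 rather than 4.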
Span
This is the measurement, in words, of
the co-text of the node. A span of -4, +4 means that four words
on either side of the node word will be taken to be its relevant
co-text.
Text file
Almost all concordancers require corpora
to be saved as a text file. This is the simplest form of file
in which words are stored. There is no formatting. A text file
can be read by any computer regardless of operating system. In
the Windows environment, the name given to a text file conventionally
ends in '.txt'.
Types and Tokens
The 'tokens' of a corpus are the
simple word count, the number of running words in the corpus.
The 'types' in a corpus are the different
words in the corpus. These are the words that appear in a word
index. In some concordancers (e.g. WordSmith) there is a command
to calculate the type:token ratio. It can be said that the higher
the ratio, the harder the text is to comprehend.
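The type:token ratio is simply the number of types divided by the number of tokens. The sketch below is one simple way to compute it and is not how WordSmith calculates the figure: the function name `type_token_ratio` is invented, and it assumes that upper and lower case forms of a word count as the same type.

```python
import re

def type_token_ratio(text):
    """Return the number of different words divided by the number of running words."""
    # Assumption: case is ignored, so 'The' and 'the' are one type.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

print(type_token_ratio("The wind in the willows"))  # prints 0.8
```

The phrase has 5 tokens but only 4 types, because 'the' occurs twice, so the ratio is 4 divided by 5.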