* Concordancers


This post is adapted from a paper I wrote during my MA, and is thus not up to date in terms of references or many other matters! It’s in two parts: this one, and a second part which discusses how to use concordancers with students.

A concordancer is a computer program that will search a text stored in electronic form for a target item (an item of punctuation, a morpheme, a word, a phrase or a combination of words) and display all the examples it finds with the contexts in which they occur.

The program is used to examine these questions:

– What words occur in the text?

– How often does each word occur? (Frequency counts.)

– In how many different types of text (different subject areas, different modes, different mediums) does the word appear?

– Are there any significant subsets? (For example, in English, the 700 most frequent words account for 70% of all text.)

– What are the collocations of the target item?

– What are the contexts in which the word appears?

Thus, taking a word as the search item, a concordancer will list all the different words in a text alphabetically or in the order in which they appear; it will count how often each word occurs and rank the words by frequency; it will indicate what type of text the word appears in; and it will display the instances of the word in context in a variety of formats, the most usual being the KWIC (Key Word In Context) format.
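To make the KWIC format concrete, here is a minimal sketch in Python. It is not how any particular concordancer is implemented: the function name kwic, the width parameter and the sample sentences are all invented for illustration, and matching is assumed to be case-insensitive and on whole words only.

```python
import re


def kwic(text, target, width=30):
    """Return Key Word In Context lines for every occurrence of `target`.

    Hypothetical helper: whole-word, case-insensitive matching;
    `width` is the number of context characters kept on each side.
    """
    # Collapse line breaks so that each context reads as a single line.
    flat = " ".join(text.split())
    pattern = re.compile(r"\b" + re.escape(target) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(flat):
        left = flat[max(0, match.start() - width):match.start()]
        right = flat[match.end():match.end() + width]
        # Right-align the left context so the key words form one column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines


# Invented sample text, not taken from any corpus.
sample = (
    "You can take any bus from here. Do you have any money? "
    "Any student can join, and any questions are welcome."
)

for line in kwic(sample, "any", width=20):
    print(line)
```

Because the key word is lined up in a single column, sorting the lines by the word to its left or right is what makes patterns visible at a glance.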

Apart from providing access to large amounts of natural language, perhaps the most impressive feature of a concordancer, the one that largely explains why it is such a powerful tool, is its ability to group text in such a way that patterns in the language are clearly visible.

To study general features of the language, a large general corpus is needed, consisting of texts from as many different sources, and of as many different types, as possible. The following corpora are examples of those currently being used by linguists and lexicographers:

COCA: Corpus of Contemporary American English
The British National Corpus
The Oxford English Corpus
Bank of English (Cobuild Corpus)

See http://www.corpora4learning.net/ for a more complete list.

The combination of user-friendly concordance software and readily-available very large corpora to work with produces a very powerful tool for exploring the language and for language learning. Let’s look at a few areas of interest.


The use of concordance programs to analyze large corpora has led to the discovery of new patterns in the English language, and to subsequent criticism of pedagogical grammars and the type of grammar teaching most ESL/EFL teachers are assumed to be engaged in. As Tim Johns says, “The evidence thrown up by the data has left no escape from the conclusion that the description of English underlying our teaching, whether home-made or inherited from other teachers and linguists, needs major reassessment.”

One of the most dramatic examples is the use of the word “any”. Many teachers, it is claimed, still explain that “any” is the interrogative and negative form of “some”. But a search for “any” in any (!) large corpus reveals that the majority of the occurrences of “any” are in fact examples of its use in the affirmative, in the sense of “it does not matter which”.

There is a more generalised failure to give topics their proper “weighting”. Most ESL/EFL pedagogical grammars agree as to the core topics of English grammar; they have similar organisations, and give the topics the same priorities. But this consensus is not based on an empirical analysis of actual patterns in use. Biber (1994) gives the example of postnominal modifiers. Grammar books consistently give relative clauses extensive treatment, while giving little to participial clauses, and even less to prepositional phrases. Text analysis reveals, however, that prepositional phrases are far more common than relative clauses or participial phrases across a range of popular English registers.

Willis (1990) points out that corpus-based research has shown that the passive voice is inadequately treated in most course books, and he is similarly critical of the way in which they insist on the “myth” of the three conditionals. Any search for “if” in a corpus will quickly illustrate that there are in fact far more than three conditionals, and that the three which course books usually focus on are not the commonest.

The way that course books and teachers present reported speech has also been shown to be faulty. Corpus-based research, especially of spoken text, indicates that the set of reported speech procedures described in course books is in fact rarely used in natural spoken discourse.

As a final example, Fox (1993) suggests that while most teachers see verbs as being transitive or intransitive, this is not the best way of looking at verbs for language learners, and that it is better to concentrate on clauses, looking at the relationships between the verb group and the subject and object groups where these occur. By so doing during her work on the COBUILD project, Fox found that most verbs cannot be labelled as transitive or intransitive, but can be used in both transitive and intransitive clauses.


Corpus-based lexicographic research shows that our intuitions about a word often do not match the actual patterns of use. For example, Sinclair (1991) analyses the word “back”. While most dictionaries list the human body part as the first meaning, the COBUILD Corpus shows this meaning to be relatively rare, and the adverbial sense of “in, to or towards the original starting point” (not usually given prominence) to be the most common. Biber concurs, making the rider that although the adverbial “back” is more common across all registers, the human body part sense is more frequent in fiction.

Biber uses a concordancer to analyze the word “certain”. He observes that the actual patterns of use depart markedly from our intuitions, in that the word “certain” rarely marks “certainty”. More commonly it is used to mark a referent – a certain kind, in certain places. Furthermore, the two major senses of “certain” are not at all uniformly distributed across registers. For example, “certain” marking certainty is more common in fiction than in Social Science, while “certain” as a referent marker is more common in Social Science.
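Comparisons across registers of this kind depend on normalising counts, since the sub-corpora differ in size. The sketch below, with invented two-sentence "registers" and a hypothetical per_thousand helper, shows only the arithmetic. Note that it counts the form “certain”, not its senses: separating the “certainty” sense from the referent-marking sense still requires someone to read the concordance lines or to work from an annotated corpus.

```python
import re


def per_thousand(text, target):
    """Occurrences of `target` per 1,000 running words (hypothetical helper)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return 1000 * tokens.count(target.lower()) / len(tokens)


# Invented miniature "registers"; a real study would use large sub-corpora.
registers = {
    "fiction": "She was certain he would come back. He was not so certain.",
    "social_science": (
        "In certain cases, certain groups respond differently "
        "to certain policies."
    ),
}

for name, text in registers.items():
    print(name, round(per_thousand(text, "certain"), 1))
```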

I trust these two examples will suffice to illustrate the general claim made by Sinclair, Biber and others that there is a deep and widespread mismatch between our intuitions about individual words, and the data about them that emerges from concordance analysis.

A related issue is that of frequency. By examining the corpus, the COBUILD team found that the 700 most frequent words of English account for around 70% of all English text. “That is to say around 70% of the English we speak and hear, read and write is made up of the 700 commonest words in the language” (Willis, 1990). The most frequent 1,500 words account for around 76% of text, and the most frequent 2,500 for 80%. There is nothing very new about this – West had made a “general service list” of English vocabulary more than fifty years earlier based on frequency counts – but the COBUILD team was the first to claim that working with a large corpus gives more credence to their results, and confirms this very counter-intuitive fact about the English language.
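The coverage figures quoted above are simple to compute once a text has been split into words. The following is a sketch under stated assumptions, not the COBUILD procedure: the tokenize and coverage functions are hypothetical names, the tokenizer is deliberately crude (lower-cased runs of letters and apostrophes), and the toy sentence stands in for a real corpus.

```python
from collections import Counter
import re


def tokenize(text):
    """Very rough tokenizer: lower-cased runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def coverage(tokens, top_n):
    """Share of running words accounted for by the `top_n` most frequent word forms."""
    if not tokens:
        raise ValueError("cannot compute coverage of an empty text")
    counts = Counter(tokens)
    covered = sum(freq for _, freq in counts.most_common(top_n))
    return covered / len(tokens)


# Toy text for illustration only; real figures need a large corpus.
toy = "the cat sat on the mat and the dog sat on the rug"
tokens = tokenize(toy)

print(Counter(tokens).most_common(3))
print(round(coverage(tokens, 3), 2))
```

With a large corpus loaded in place of the toy sentence, coverage(tokens, 700) would give the kind of figure Willis reports, though the exact value depends on how words are tokenized and whether inflected forms are grouped together.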

Grammar or lexis? The lexical phrase

Pawley and Syder (1983) drew attention to what they called “lexicalized sentence stems”: “chunks” of formulaic language, of clause length or longer, of which a normal competent native speaker has many thousands at his disposal. “A lexicalized sentence stem is a unit of clause length or longer whose grammatical form and lexical content is wholly or largely fixed; its fixed elements form a standard label for a culturally recognised concept, a term in the language. Many such stems have a grammar that is unique in that they are subject to an idiosyncratic range of phrase structure and transformational restrictions; that is to say, by applying generally productive rules to these units one may produce an utterance that is grammatical but unnatural or highly marked.” (Pawley and Syder, 1983)

Pawley and Syder suggested that the existence of these lexicalized sentence stems questions the traditional compartmentalization of grammar into syntax (productive rules) and dictionary (fixed, arbitrary usages), and also presents learners with two problems: first, how to know which of the possible well-formed sentences are nativelike (the puzzle of nativelike selection), and second, how to produce lexicalized sentence stems (and often multi-clause units) without hesitating in mid-clause (the puzzle of nativelike fluency).

Nattinger and DeCarrico, drawing on Pawley and Syder and also on more recent research, argue that what they call the “lexical phrase” is at the heart of the English language. Work done by computational linguists has uncovered recurring patterns of lexical co-occurrence, and patterns among function words as well. As a result of such research, there is a growing belief among computational linguists that linguistic knowledge cannot be strictly divided into grammatical rules and lexical items, that rather, there is an entire range of items from the very specific (a lexical item) to the very general (a grammar rule), and since elements exist at every level of generality, it is impossible to draw a sharp border between them. There is, in other words, a continuum between these different levels of language.
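One simple way in which recurring word sequences can be surfaced from a text is to count repeated n-grams. The sketch below is only a crude proxy: Nattinger and DeCarrico define lexical phrases by their function, not by raw repetition, and the recurring_ngrams helper, its parameters and the sample text are all invented for illustration.

```python
from collections import Counter
import re


def recurring_ngrams(text, n=3, min_count=2):
    """Return (sequence, count) pairs for word n-grams seen at least `min_count` times."""
    tokens = re.findall(r"[a-z']+", text.lower())
    grams = Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )
    return [
        (" ".join(gram), count)
        for gram, count in grams.most_common()
        if count >= min_count
    ]


# Invented sample; a real search would run over a large corpus.
sample = (
    "As a matter of fact, I bet he's late. "
    "As a matter of fact, he is always late. "
    "I bet he's late again tomorrow."
)

for phrase, count in recurring_ngrams(sample, n=3):
    print(count, phrase)
```

Raising n or min_count narrows the list; deciding which of the surviving sequences are genuine lexical phrases remains a matter of linguistic judgement.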


Pedagogical Applications

1. The “Teach the facts” view

Given the new facts about English which corpus-based research has revealed, Biber, Sinclair and others argue that teaching practice must fit the new, more accurate, description. They go further, and suggest that now that teachers have the data available to them, it should form the basis for instruction.

2. Discourse complexity

As far as discourse complexity is concerned, Biber argues that the multivariate analyses and frequency counts done on different types of text indicate that simplistic readability formulas do not provide adequate guidelines for the construction of introductory reading materials since long sentences containing frequent adverbial and verb complement clauses should not necessarily be considered complex, and on the other hand relatively short sentences with extensive informational integration (nouns, attributive adjectives, prepositional phrases) are markedly complex. Biber recommends that the four types of complexity features which he has detected be applied to the different types of text he describes in order to produce a more sophisticated and reliable indicator of text complexity which can then be used to produce more appropriate reading texts for beginners and lower intermediate learners.

3. ESP and register analysis

Biber says that those working in ESP and EAP hold the view that there are important and systematic differences among text varieties at all linguistic levels. Biber argues that teachers of advanced students should therefore focus on the English of particular varieties in naturally-occurring discourse, rather than the “general” patterns that are culled from linguists’ intuitions. Computer searches of large corpora can investigate patterns of variation across a large number of registers, with respect to many relevant linguistic characteristics. Such analyses, says Biber, provide an important foundation for work in ESP in that they characterise particular registers relative to other registers and help to demonstrate the extent of linguistic differences across registers, and thereby the need for proficiency in particular registers.

4. The lexical syllabus

Willis (1990) outlined a lexical syllabus which he claims provides a “new approach to language teaching”. Willis starts from the “contradiction” between a grammatical syllabus and a communicative methodology, and cites the typical way in which the present simple tense (which is neither simple nor present) is presented. Even if the issues were dealt with less simplistically, presentation of language forms does not provide enough input for learning a language. A successful methodology must be based on use not usage, yet must also offer a focus on form, rather than be based on form and give some incidental focus on use.

Willis claimed that the COBUILD English course embodies this view. Word frequency determines the contents of the courses. For Level 1 they created a corpus which contextualised the 700 most frequent words and their meanings and uses, and provided a range of activities aimed at using and exploring these words. Willis argues that the lexical syllabus does not simply identify the commonest words, it focuses on commonest patterns too, and indicates how grammatical structures should be exemplified by emphasising the importance of natural language.

5. The lexical phrase as a key unit for learning

Above, Nattinger and DeCarrico’s argument that the lexical phrase is an important element in the description of the English language was outlined. Basically, they suggest that the distinction between grammar and lexis in descriptions of the English language is too rigid. The suggested application of this to language teaching is that lexis – and in particular the lexical phrase – should be the focus of instruction. This approach is quite different to Willis’ (which takes frequency as the main criterion) and rests on two main arguments. First, some cognitive research (particularly in the area of PDP and related connectionist models of knowledge) suggests that we store different elements of language many times over in different chunks. This multiple lexical storage is characteristic of recent connectionist models of knowledge, which assume that all knowledge is embedded in a network of processing units joined by complex connections, and accord no privilege to parsimonious, non-redundant systems. “Rather, they assume that redundancy is rampant in a model of language, and that units of description, whether they be specific categories such as “word” or “sentence”, or more general concepts such as “lexicon” or “syntax” are fluid, indistinctly bounded units, separated only as points on a continuum”. (Nattinger and DeCarrico, 1992)

If this is so, then the role of analysis (grammar) in language learning becomes more limited, and the role of memory (the storage of, among other things, lexical phrases) more important.

The second argument is that language acquisition research suggests that formulaic language is highly significant. Peters (1983) and Atkinson (1989) show that a common pattern in language acquisition is that learners pass through a stage in which they use a large number of unanalyzed chunks of language – prefabricated language. This formulaic speech is seen as being basic to the creative rule-forming processes which follow. Starting with a few basic unvarying phrases, first language speakers subsequently, through analogy with similar phrases, learn to analyze them as smaller patterns, and finally into individual words, thus finding their own way to the regular rules of syntax. It has also been suggested by Skehan (1991) that a further step in the language acquisition process is the “re-lexicalization” of various patterns of words.

The computational analysis of language confirms the significance of patterned phrases as basic, intermediary units between the levels of lexis and grammar. Cognitive research and language acquisition research support the argument that such phrases play an important role in the learning process. In other words, current corpus-based research and research in language acquisition converge in a way that reveals the lexical phrase as an ideal unit which can be exploited for language teaching.



First, note that Biber’s claim that computational text analysis has provided better criteria for defining discourse complexity, thus demonstrating that the former “intuitive” criteria are inadequate, is challenged by Widdowson, who points out that the criteria Biber gives all relate to linguistic features and co-textual co-occurrences. “What is analyzed is text, not discourse” (Widdowson, 1993). Biber takes readability to be a matter of the formal complexity in the text itself, without dealing with how, as Widdowson puts it, “an appropriate discourse is realized from the text by reference to schematic knowledge, that is to say to established contextual constructs” (ibid.). Adequate guidelines for the construction of reading materials need to take discourse into account, and it is not self-evident that the criteria for textual complexity suggested by Biber are relevant to reading. Moreover, since concordancing is limited to the analysis of text, and since the language is abstracted from the conditions of use, it cannot reveal the discourse functions of textual forms.

Although corpus analysis provides a detailed profile of what people do with the language, it does not tell us everything about what people know. Chomsky, Quirk et al. (1972, 1985), and Greenbaum (1988) argue that we need to describe language not just in terms of the performed (as Sinclair, Biber, Willis, and Lewis suggest) but in terms of the possible. The implication of Sinclair and Biber’s argument is that what is not part of the corpus is not part of competence, and this is surely far too narrow a view, which seems to hark back to the behaviourist approach. Surely Chomsky was right to insist that language is a cognitive process, and surely Hymes, in arguing for the need to broaden our view of competence, was not arguing that we look only at attested behaviour.

Widdowson cites Rosch (1975) who devised a questionnaire to elicit from subjects the word which first sprang to mind as an example of a particular category. The results of this conceptual elicitation showed that subjects consistently chose the same hyponym for a particular category: given the superordinate “bird”, “robin” was elicited, the word “vegetable” consistently elicited “pea”, and so on. The results did not coincide with frequency profiles, and are evidence of a “mental lexicon” that concordancers cannot reach.

Quite apart from the question of the way in which we choose to describe language, and of the limitations of choosing a narrow view of attested behaviour which can tell us nothing directly about knowledge, there is the wider issue of what kinds of conclusions can be drawn from empirically attested data. The claim made by Biber, Sinclair and others is that, faced with all the new evidence, we must abandon our traditionally-held beliefs about language, accept the evidence, and consequently change our description of the language, language materials, and language instruction too. Now, the argument goes, that we have the facts, we should describe and teach the facts (and only the facts) about English. Widdowson (1990) points out that the relationship between the description of language and the prescription of language for pedagogical purposes “cannot be one of determinacy.” This strikes me as so obvious that I am surprised that Sinclair, Biber and others seem not to have fully grasped it. No description has any necessary prescriptive implications: one cannot jump from statements about the world to judgements and recommendations for action as if the facts made the recommendations obvious and undeniable. Thus, as Widdowson points out, descriptions of language cannot determine what a teacher does. Descriptions of language tell us about the destinations that language learners are travelling towards, but they do not provide any directions about how to get there. Only prescriptions can do that.

While Sinclair is justified in expecting corpus-based research to influence syllabus design, there is no justification for the assumption that it must necessarily do so, and much less that such research should determine syllabus design. A case must be made for the approach which he seems to regard as somehow self-evident. When Sinclair says that the categories and methods we use to describe English are not appropriate to the new material, we need to know by what criteria appropriateness is being judged. Similarly, when Biber says “Consensus does not mean validity”, and when he claims that corpus-based research offers the possibility of “more effective and appropriate pedagogical applications”, we need to ask by what criteria (pedagogical presumably) validity, effectiveness and appropriateness are to be judged. When he talks of data from frequency counts “establishing” the “inadequacy” of discourse complexity he is presumably once again referring to assumptions, criteria which are not made explicit. When he suggests that the evidence of corpus-based research indicates that there is something special about the written mode, in that it enables a kind of linguistic expression not possible in speech, he is once again making an inadmissible conclusion.

Facts do not “support” prescriptions, but our view of language will influence our prescriptions about how to teach and learn it. If we view language as attested behaviour, we are more likely, as Willis does, to recommend that frequently attested items of lexis form the core vocabulary of a general English course. Willis appreciates that his approach to syllabus design is not in any way “proved” by facts, but he still takes a very narrow view. To return to the discussion above about Rosch’s “prototype words” (the mental lexicon), I do not think that such words should be ignored simply because they are not frequently attested, and it could well be argued that they should be one of the criteria for identifying a core vocabulary.

Widdowson takes the case further. He suggests that Chomsky’s idea of “kernel sentences” indicates the possibility that there are also prototype sentences which have an intuitive role. They do not figure as high frequency units in text, but they do figure in descriptive grammars, and their presence there can be said to be justified by their intuitive significance, their psychological reality, as prototypes. Furthermore, they are the stock in trade of language teaching. Teachers may all be wrong about the significance of such kernel sentences, but we cannot simply dismiss the possibility of their prescriptive value on the grounds that they do not occur frequently in electronically-readable corpora.

More evidence of the limitations of sticking to frequently attested language forms comes from research which led to the specification of core language to be included in Le Français Fondamental (Gougenheim et al. 1956). The research team began with frequency counts of actual language, but they felt that some words were still missing: French people had a knowledge of words which the researchers felt intuitively should be included despite their poor showing in performance. So the researchers carried out an exercise in conceptual elicitation. They identified categories like furniture, clothing, occupations, and asked thousands of school children which nouns they thought it would be most useful to know in these categories. Once again, the lists did not correspond to frequency counts, and gave rise to the idea of “disponibilité” or availability. As Widdowson says, the difference between the French research and Rosch’s is that availability is a prescriptive criterion: the words are prescribed as useful not because they are frequently used but because they appear to be readily available in the minds of the users.

Widdowson (1990) suggests that there are more direct pedagogical criteria to consider than those of frequency and range of language use. In terms of the purpose of learning, he cites coverage as a criterion described by Mackay: “The coverage .. of an item is the number of things one can say with it. It can be measured by the number of things it can displace” (Mackay, 1965). Most obviously, this criterion will prevail where the purpose of learning is to acquire a minimal productive competence across a limited range of predictable situations. The process version of coverage is what Widdowson calls valency – the potential of an item to generate further learning. He gives the example of the lexical item “bet” as described in the COBUILD dictionary (1987). Analysis reveals that the canonical meaning of the word, “to lay a wager”, is not as frequently attested as its informal occurrence as a modal marker as in “I bet he’s late”. It does not follow, however, that the more frequent usage should be given pedagogical preference. First, the informal meaning tends to occur mainly in the environment of first person singular and present tense, and is idiomatic, and it is thus limited in its productive generality. Second, the modal meaning is derivable from the canonical lexical meaning but not the other way round. In this sense the canonical meaning has a greater valency and so constitutes a better learning investment. Widdowson proposes a general principle: high valency items are to be taught so that high frequency items can be more effectively learned.

Widdowson says “Communicative competence is a matter of knowing a stock of partially pre-assembled patterns, formulaic frameworks, and a kit of rules, so to speak, and being able to apply the rules to make whatever adjustments are necessary according to contextual demands.” (Widdowson, 1989) Communicative competence is a matter of adaptation, and rules are not generative but regulative and subservient. Competence consists of knowing how the scale of variability in the legitimate application of generative rules is applied – when analysis is called for and when it is not. Ignorance of the variable application of grammatical rules constitutes incompetence. (Widdowson, 1990)

Our criteria for pedagogical prescription do not have to change as a result of this new formulation of competence, but I think we are nearer to identifying pedagogically key units of language – parts of the language that activate the learning process. The suggestion is that grammar’s role is subservient to lexis, and this implies a radical shift in pedagogical focus. If, as Widdowson thinks, we should provide patterns of lexical co-occurrence for rules to operate on so that they are suitably adjusted to the communicative purpose required of the context, then Nattinger and DeCarrico’s work, which identifies lexical phrases and then prescribes exposure to and practice of sequences of such phrases, can surely play a key role. They present a language teaching program based on the lexical phrase which leads students to use prefabricated language in much the same way as first language speakers do, and which they claim avoids the shortcomings of relying too heavily on either theories of linguistic competence on the one hand or theories of communicative competence on the other.

Bibliography (which includes references to above)

Bialystok, E. (1982) On the relationship between knowing and using linguistic form. Applied Linguistics 3/3

Biber, D. (1993) Corpus-based approaches to issues in applied linguistics. AAAL colloquium on Discourse Analysis

Chomsky, N. (1986) Knowledge of language: Its nature, origin and use. New York, Praeger

Chomsky, N. (1988) Language and problems of knowledge. Cambridge, Mass. MIT press

Church, K., and Hanks, P. (1989) Word Association Norms, Mutual Information, and Lexicography. ACL Proceedings

Church, K., Gale, W., Hanks, P., and Hindle, D. (1991) Using Statistics in Lexical Analysis. ACL Proceedings

Fox, G. (1991) Context Dependency and Transitivity in English. In Johns, T. & King, P. (Eds.) 1991 Classroom Concordancing OUP.

Garside, R., Leech, G., and Sampson G., (1987) The Computational Analysis of English: a corpus-based approach. London, Longman

Gougenheim, G. et al (1956) L’Elaboration du français élémentaire Paris, Didier

Greenbaum, S. (1988) Good English and the grammarian London, Longman

Hardisty, W and Windeatt, S. (1989) Call Oxford, OUP.

Higgins, J. and Johns, T. (1984) Computers in Language Learning. London, Collins.

Higgins, J. (1991) Looking For Patterns. In Johns, T. & King, P. (Eds.) 1991 Classroom Concordancing OUP.

Johns, T. (1986) Micro-concord, a Language Learner’s Research Tool. System 14 (2): p. 151 – 162

Johns, T. (1988) Whence and whither classroom concordancing? In Bongarts T et al, Computer Applications in Language Learning, Floris

Johns, T. (1991) Should you be Persuaded: Two Examples of Data-driven Learning. In Johns, T. & King, P. (Eds.) 1991 Classroom Concordancing

Jordan, G. (1988) Using CALL programs in the classroom. System Vol. 14 No. 2

Mackay, R. (1965) Language Teaching Analysis London: Longmans

Nattinger, J. and DeCarrico, J. (1992) Lexical Phrases and Language Teaching. Oxford, OUP

Pawley, A. and Syder, F. (1983) Two puzzles for linguistic theory: nativelike selection and nativelike fluency. In J. Richards and R. Schmidt (eds) Language and Communication London, Longman

Quirk, R., Greenbaum, S., Leech, G., and Svartvik, J. (1972) A grammar of contemporary English London: Longman

Rosch, E. (1975) Cognitive representations of semantic categories. In Journal of Experimental Psychology 104

Sinclair, J. (1987) (ed) Looking Up London, Collins

Sinclair, J. et al (1987) Collins COBUILD Essential English Dictionary London, Collins

Sinclair, J., Fox, G. et al. (1991) Collins COBUILD English Grammar London, Collins

Sinclair, J. (1991) Corpus, Concordance, Collocation Oxford, OUP

Tribble, C. and Jones, G. (1990) Concordances in the classroom London, Longmans

Widdowson, H.G. (1989) Knowledge of language and ability for use. Applied Linguistics Vol 10 No.2

Widdowson, H.G. (1990) Aspects of Language Teaching. Oxford, OUP

Widdowson, H.G. (1993) Response to Biber. AAAL colloquium on Discourse Analysis

Willis, D. (1990) The Lexical Syllabus London, Collins
