Lessons from the analysis of chunks

Michael McCarthy

Corpus analysis is relatively straightforward when the computer searches for single words to generate vocabulary lists. However, when we look for recurrences of more than one word (i.e., pairs of words or larger groupings), the results suggest there are lessons to be learned about how we describe the vocabulary of a language, and implications for what teachers teach and how learners develop fluency.

Units consisting of more than one word, such as phrasal verbs, compounds, and idioms, are often taught only at higher levels. Exceptions include greetings and everyday expressions (e.g., How are things? See you tomorrow), functional phrases (e.g., Happy New Year, Good luck), prepositional phrases (e.g., on the weekend, in May), and compounds (e.g., cell phone, bookstore). Meanwhile, collocation has become an accepted part of vocabulary pedagogy at all levels.

Words in Corpora

Studies of large corpora by linguists such as Sinclair (1991) have shown lexis to have a central role in the organization of language and the creation of meaning. Corpora reveal that much of our linguistic output consists of multiword units rather than just single words. Language is available for use in ready-made "chunks" which do not need re-assembling every time they are used. Such chunks seem to be at least as significant as single-word vocabulary in the semantics and pragmatics of language. Chunks also partly account for the notion of fluency, a term frequently used to describe expert performance in a language, but one that is often only loosely defined. One could reasonably suggest that teaching single words alone may leave learners ill-prepared both for processing heavily chunked input such as casual conversation and for developing their own productive fluency. As with most high-frequency phenomena, the contribution of chunks to language use is subliminal and not immediately accessible to our intuition. This is where a corpus comes in.

Looking at Corpus Data

Corpus-analytical software was applied to a 4.7-million-word sample of North American English conversation from the Cambridge International Corpus (CIC) to obtain frequency counts for recurrent chunks. The following totals emerge for chunks occurring more than twenty times:

  • two-word chunks: 19,509
  • three-word chunks: 12,681
  • four-word chunks: 2,953
  • five-word chunks: 385
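
The kind of count reported above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the article does not say which software was used, the tokenizer and the function names (tokenize, count_chunks, recurrent_chunks) are hypothetical, and the three sample utterances are invented stand-ins for a real corpus.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase the text and keep apostrophes so "don't" stays one token.
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

def count_chunks(utterances, n):
    # Count every contiguous run of n tokens within each utterance.
    counts = Counter()
    for utterance in utterances:
        tokens = tokenize(utterance)
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

def recurrent_chunks(utterances, n, min_count):
    # Keep only chunks occurring MORE than min_count times.
    return {chunk: k for chunk, k in count_chunks(utterances, n).items()
            if k > min_count}

# Invented sample data; a real study would read millions of words.
sample = [
    "I don't know if you've seen it",
    "you know I don't know what to say",
    "I mean I don't know if that works",
]

two_word = recurrent_chunks(sample, 2, 1)
four_word = recurrent_chunks(sample, 4, 1)
print(two_word)
print(four_word)
```

With a corpus of the size described here, the threshold would be 20 rather than 1, and chunks would normally not be counted across speaker turns, which is why the sketch counts within each utterance separately.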

Tables 1 and 2 show the top ten items in the list of chunks for two- and four-word items.

Table 1: Top 10 two-word chunks

Rank  Chunk     Total in corpus
1     you know  45,873
2     I don't   17,708
3     I think   17,046
4     in the    13,979
5     and I     13,757
6     of the    12,040
7     I mean    11,735
8     it was    11,271
9     a lot     10,174
10    kind of   9,962

Table 2: Top 10 four-word chunks

Rank  Chunk                   Total in corpus
1     I don't know if         999
2     a lot of people         759
3     I don't know what       709
4     or something like that  570
5     a lot of the            560
6     and things like that    499
7     I don't want to         479
8     I don't know how        466
9     there's a lot of        448
10    what do you think       442

Chunks and Single Words

Only 14 items in the single-word frequency list occur more often than the most frequent chunk (i.e., you know, which occurs 45,873 times). Of the first 100 items in the overall frequency list, 11 are two-word chunks, including I think and I mean. By the time we reach 500 items, there are 177 two-word chunks and 7 three-word chunks. In other words, over 35 percent of the 500 most frequent items are chunks, not single words. A selection of chunks with greater frequency than some common single words is given in Table 3.

Table 3: High frequency chunks and single words

Item                 Total in corpus
you know             45,873
really               20,838
I think              17,046
people               11,984
kind of              9,962
and then             8,971
I don't know         8,074
where                7,851
their                6,487
something like that  1,027
friend               1,014
I don't know if      999
a lot of people      759
under                743
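
The comparison in Table 3 amounts to merging single words and chunks into one frequency list and ranking them together. The sketch below shows that calculation using the figures from Table 3. Because the table is only a selection of items, the shares it produces illustrate the method and say nothing about the corpus as a whole. The function names are hypothetical.

```python
# Counts copied from Table 3 (a selection, not the full frequency list).
counts = {
    "you know": 45873,
    "really": 20838,
    "I think": 17046,
    "people": 11984,
    "kind of": 9962,
    "and then": 8971,
    "I don't know": 8074,
    "where": 7851,
    "their": 6487,
    "something like that": 1027,
    "friend": 1014,
    "I don't know if": 999,
    "a lot of people": 759,
    "under": 743,
}

def is_chunk(item):
    # An item counts as a chunk if it contains more than one word.
    return len(item.split()) > 1

def ranked(freqs):
    # Merge words and chunks into one list, most frequent first.
    return sorted(freqs.items(), key=lambda pair: pair[1], reverse=True)

def chunk_share(freqs, top_n):
    # Proportion of chunks among the top_n most frequent items (top_n >= 1).
    top = ranked(freqs)[:top_n]
    return sum(is_chunk(item) for item, _ in top) / len(top)

def words_outranked_by(freqs, chunk):
    # Single words that are less frequent than the given chunk.
    return [item for item, k in freqs.items()
            if not is_chunk(item) and k < freqs[chunk]]

print(ranked(counts)[:3])
print(chunk_share(counts, 5))
print(words_outranked_by(counts, "kind of"))
```

Applied to a full merged frequency list rather than this selection, chunk_share would give the kind of proportion reported for the top 500 items.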

Table 3 suggests that many high-frequency chunks are more frequent and more central to communication than even very frequent single words. However, the question remains whether something like and then arises merely from the high frequency and weak collocability of its component words and their inevitable repeated collision in the corpus, or whether such co-occurrences reveal something about conversation.
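
One rough way to picture the "repeated collision" possibility is to compare how often a pair such as and then actually occurs with how often it would be expected to occur if its two words were distributed independently. The sketch below is an illustration only: it is not a method used in this article, the token sequence is invented, and the function name is hypothetical. The expected count uses the common approximation f(w1) x f(w2) / N, where N is the number of tokens.

```python
import math
from collections import Counter

def observed_and_expected(tokens, w1, w2):
    # Observed: how many times w1 is immediately followed by w2.
    # Expected: count predicted if w1 and w2 occurred independently,
    # approximated as f(w1) * f(w2) / N.
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    observed = bigrams[(w1, w2)]
    expected = unigrams[w1] * unigrams[w2] / n
    return observed, expected

# Invented toy sequence; real frequencies would come from the corpus.
tokens = ("and then we went home and then we ate "
          "and we talked and then we slept").split()

observed, expected = observed_and_expected(tokens, "and", "then")
# log2 of the observed/expected ratio: 0 means "as often as chance predicts".
association = math.log2(observed / expected)
print(observed, expected, association)
```

A ratio well above chance would suggest that the pair is more than an accident of two frequent words meeting, but a figure of this kind cannot show what the pair does in conversation. That is the question taken up in the sections that follow.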

Chunks as Units of Interaction

Pragmatic integrity

Many of the chunks above are syntactic fragments; i.e., they are not complete phrases or clauses (for example, in the, and I, and I think it's). However, they do have an interactive identity. I think it's is indicative of the ubiquity of I think as a hedge prefacing evaluations of situations referred to by it. Other chunks seem less pragmatically integrated (e.g., it was). The frequency of their occurrence is probably due to the regularity of objects in the real world (content categories), or they may simply be fragments from the propositional world that, despite their frequency, have little independent significance. It is in pragmatic categories rather than syntactic or semantic ones that we are likely to find the reasons why many chunks occur so frequently. Pragmatic categories are distinct from content categories and from propositional categories because they result from the need to create meaning in the context of speaker-listener interactions. They include such functions as discourse marking, the preservation of face, expressions of politeness, hedging, and purposive vagueness.

Discourse marking

Some of the most frequent chunks are discourse markers, e.g., you know, I mean, I guess, (do) you know what I mean. You know, the most frequent chunk, is an important token of projected shared knowledge between speaker and listener. I mean is also of high frequency, used when speakers need to paraphrase or elaborate. Extract (1) shows both chunks at work.

(1) Like I remember when I went to public school in Jersey and not that it wasn't that bad. I mean I'm from a middle middle class town. You know we had people that you know... We had kids that whose family made you know a hundred and hundred fifty thousand dollars a year and people that generally didn't make anything at all. You know.

The extended chunk (do) you know what I mean has a similar function of checking shared knowledge.

(2) He's totally like, you know what I mean, it's like he's very liberal. I mean, he's open minded. He doesn't care. You know and so...

Saving/preserving face and politeness

Speakers use indirect forms to perform speech acts such as directives (e.g., commands, requests, suggestions) to protect the face of their addressees, and the chunks reveal common frames for such acts. Indirectness is also important in the polite and non-face-threatening negotiation of attitude and stance. Chunks in this category include do you think, I don't know if, what do you think, and I was going to say.

(3) [Someone describing how to keep food fresh] I have the plastic bags that are supposed to keep everything... I don't know if you've seen it where they put the vacuum cleaner on top and it sucks out all the air.

Hedging

Some of the most frequent chunks have a hedging function; i.e., they modify propositions to make them less assertive and less open to refutation. These include I think, kind of, I don't know, I don't think, and a little bit.

(4) [Someone reminiscing about a family vacation] Just like any other summer I went to Spain with my family and for the months of June and July and August and I was sixteen. I was starting to discover kind of girls and stuff and um we really didn't do much me and friends there.

Vagueness and approximation

Equally apparent are chunks expressing purposive vagueness and approximation. Vagueness is central to informal conversation, and its absence can make utterances blunt and pedantic, especially in references to number and quantity. Vagueness also enables speakers to refer to semantic categories in an open-ended way that calls on shared knowledge to fill in category members referred to obliquely. Such tokens include: a couple of, and things like that, and stuff like that.

(5) [Someone talking about hobbies] I do enjoy baking and I guess I always liked making, uh, cookies and bars and things like that, that was more my specialty.

Conclusions and Implications

These chunks show the all-pervasiveness of interactive meaning-making in conversation. The addition of chunks to the vocabulary syllabus is not an optional extra, since their meanings are extremely frequent, necessary, and fundamental to successful interaction. They make fluency a reality. But what descriptive and pedagogical lessons should we draw from all this? We offer the following:

  • High-frequency chunks are often more frequent than core single words.
  • The most frequent chunks, like the most frequent single words, perform core communicative functions in everyday interaction.
  • Fluency must involve the ability to call on a vocabulary of ready-assembled chunks.
  • We should not assume, however, that high-frequency chunks should be obligatory components of the learner's productive repertoire. It may be that receptive mastery is more important than productive repertoire.
  • Chunks are chunks: analyzing them and taking them apart may not be useful, and they should be processed and retrieved holistically (see Wray, 2002).
  • Conversation materials should, where possible, incorporate useful, high-frequency chunks as attested in everyday use (see McCarthy et al., in press).

I am extremely grateful to my Touchstone co-author, Jeanne McCarten, for her assistance in the collection of data and preparation of this article.


References

McCarthy, M., McCarten, J., & Sandiford, H. (in press). Touchstone series. New York: Cambridge University Press.
Sinclair, J. (1991). Corpus, concordance, collocation. Oxford: Oxford University Press.
Wray, A. (2002). Formulaic language and the lexicon. Cambridge: Cambridge University Press.

Michael McCarthy is an internationally recognized authority on ESL/EFL and Applied Linguistics. He is Emeritus Professor of Applied Linguistics at the University of Nottingham, where he is co-director of the CANCODE spoken English corpus project, sponsored by Cambridge University Press. He is also Adjunct Professor of Applied Linguistics at Pennsylvania State University. He has published professional books in the areas of discourse and vocabulary with Cambridge, Oxford, and Longman. For Cambridge, he is co-author of English Vocabulary in Use, Essential English Vocabulary in Use, and their American English adaptations (Basic Vocabulary in Use and Vocabulary in Use Upper Intermediate) and is author of professional titles including Discourse Analysis for Language Teachers, Spoken Language and Applied Linguistics and Issues in Applied Linguistics. Finally, he is also editor of the Cambridge Word Routes series.

