N-grams data

Corpus of Contemporary American English


Compare to Google

As far as we are aware, the only other large downloadable n-grams datasets for contemporary English are the Google n-grams. The following is a brief comparison of the two datasets.

(You might also be interested in our adaptation of the historical Google Books n-grams datasets, which lets you do many things that are not possible with the simple Google Books n-grams interface.)
 

                              COCA-based n-grams                        Google n-grams

Corpus
  Source                      Corpus of Contemporary American English   The Web (as of 2006)
                              [COCA]. 450 million words, 1990-2012.
  Balanced by genre           Yes                                       No

N-grams
  Includes part of speech     Yes                                       No
  Includes lemma              Yes (Level 3)                             No
  Platform                    Can be installed and used on a            Probably only installable and usable
                              personal computer                         on a server or high-end workstation
  Minimum tokens per n-gram   3 tokens (Level 2), 1 token (Level 3)     40
 

This difference in the minimum number of tokens per n-gram is significant. While the number of tokens (the total number of words) in the Google n-grams "corpus" (the Web) is much larger than in COCA, the number of types (unique strings of words) in the Google n-grams datasets is proportionately much smaller. This is because the Google n-grams include only strings that occur at least 40 times, so only an extremely small percentage of all types appear in their dataset. Our n-grams datasets, on the other hand, include even strings that occur just 3 times (Level 2) or once (Level 3).
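The effect of a minimum-frequency cutoff on the number of types can be illustrated with a small sketch. This is not how either dataset was built; the toy text and the function names `ngram_counts` and `types_retained` are invented for illustration, and only the thresholds (1, 3, and 40) come from the comparison above.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every n-gram (a tuple of n consecutive tokens) in the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def types_retained(counts, min_freq):
    """Return how many distinct n-grams (types) occur at least min_freq times."""
    return sum(1 for freq in counts.values() if freq >= min_freq)


# Toy text, invented for this example: a few repeated sentences plus one rare one.
tokens = (
    "the cat sat on the mat . " * 5
    + "the dog sat on the rug . " * 3
    + "a bird flew over the old barn ."
).split()

counts = ngram_counts(tokens, 2)
total_types = len(counts)

# Compare the thresholds mentioned in the table: 1, 3, and 40 occurrences.
for min_freq in (1, 3, 40):
    kept = types_retained(counts, min_freq)
    print(f"min frequency {min_freq:>2}: {kept} of {total_types} bigram types kept")
```

In this tiny sample no bigram occurs anywhere near 40 times, so the highest threshold keeps nothing, while the lower thresholds keep most or all of the types. A real corpus is far larger, but the same pattern holds: most types are rare, so a high cutoff discards the great majority of them.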