N-grams data


As far as we are aware, the only other large downloadable n-gram datasets for contemporary English are the Google n-grams. The following is a brief comparison of the two datasets.

(You might also be interested in our adaptation of the Google Books n-grams datasets (historical), which allow you to do many things that are not possible with the simple Google Books n-grams interface.)
 
                           COCA-based n-grams                    Google n-grams

Corpus                     Corpus of Contemporary American       The Web (as of 2006)
                           English (COCA); 520 million
                           words, 1990-2015

Balanced by genre          Yes                                   No

Includes part of speech    Yes                                   No

Includes lemma             Yes (Level 3)                         No

Platform                   Can be installed and used on a        Probably only installable and
                           personal computer                     usable on a server or
                                                                 high-end workstation

Minimum frequency          3 (Level 2), 1 (Level 3)              40
per n-gram

This difference is significant. While the number of tokens (total number of words) in the Google n-grams "corpus" (the Web) is much larger than in COCA, the number of types (unique strings of words) in their n-grams datasets is proportionately much smaller. This is because the Google n-grams only include strings that occur at least 40 times, which means that only an extremely small percentage of all types are in their dataset. In our n-grams, on the other hand, even strings that occur only once (Level 3) or three times (Level 2) are included.
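The effect of a minimum-frequency cutoff on the number of types can be seen in a small sketch. This is a toy illustration with made-up text, not actual COCA or Google data; the cutoff is lowered from 40 to 2 so the effect is visible on a tiny sample.

```python
from collections import Counter

# Toy "corpus": count 3-gram types, then apply a minimum-frequency
# cutoff analogous to the one in the Google n-grams (>= 40 occurrences).
text = "the cat sat on the mat and the cat sat on the hat".split()
trigrams = Counter(zip(text, text[1:], text[2:]))

min_count = 2  # Google's real cutoff is 40; 2 keeps the toy example visible
kept = {g: c for g, c in trigrams.items() if c >= min_count}

print(len(trigrams))  # total distinct trigram types: 8
print(len(kept))      # types surviving the cutoff: 3
```

Even in this tiny sample, the cutoff discards most types (the long tail of strings that occur only once or twice), which is why a frequency-filtered dataset contains proportionately far fewer types than its token count would suggest.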