As far as we are aware, the only other large downloadable n-gram sets for contemporary English are the Google n-grams. The following is a brief comparison of the two datasets.

|  | COCA-based n-grams | Google n-grams |
|---|---|---|
|  | Corpus of Contemporary American English [COCA]. 520 million words, 1990-2015. | The Web (as of 2006) |
| Balanced by genre |  |  |
| Includes part of speech |  |  |
| Includes lemma | Yes (Level 3) |  |
|  | Can be installed and used on a personal computer | Probably only installable and usable on a server or high-end workstation |
| Minimum tokens per n-gram | 3 tokens (Level 2), 1 token (Level 3) | 40 |

This difference in minimum frequency is significant. Although the number of tokens (the total number of words) in the Google n-grams "corpus" (the Web) is much larger than in COCA, the number of types (unique strings of words) in the Google n-grams datasets is proportionately much smaller. This is because the Google n-grams include only strings that occur at least 40 times, so only an extremely small percentage of all types appears in their dataset. Our n-grams, by contrast, include even strings that occur just 3 times (Level 2) or once (Level 3).
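
The effect of a minimum-frequency cutoff on the number of types can be illustrated with a small sketch. This is not the procedure used to build either dataset; the function names (`count_ngrams`, `types_at_threshold`) and the toy text are hypothetical and chosen only for illustration. The thresholds 1, 3 and 40 mirror the cutoffs mentioned above.

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count every contiguous n-token sequence (n-gram) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def types_at_threshold(counts, min_freq):
    """Return how many distinct n-grams (types) occur at least min_freq times."""
    return sum(1 for c in counts.values() if c >= min_freq)

# Toy data (hypothetical): one repeated phrase followed by several one-off words.
tokens = ("of the same " * 5 + "a rare green idea slept here once").split()
bigrams = count_ngrams(tokens, 2)

# Compare how many bigram types survive each minimum-frequency cutoff.
for min_freq in (1, 3, 40):
    kept = types_at_threshold(bigrams, min_freq)
    print(f"min_freq={min_freq}: {kept} of {len(bigrams)} bigram types kept")
```

On this toy input, a cutoff of 1 keeps all 10 bigram types, a cutoff of 3 keeps only 3, and a cutoff of 40 keeps none. Real corpora are far larger, but the same pattern holds: most types are rare, so a high cutoff discards most of them.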