N-grams data


As far as we are aware, the only other large downloadable n-gram datasets for contemporary English are the Google n-grams (along with our own n-grams from iWeb). The following is a brief comparison of the COCA n-grams and the Google n-grams.

(You might also be interested in our adaptation of the Google Books n-grams datasets (historical), which allows you to do many things that are not possible with the simple Google Books n-grams interface.)
Corpus

  - COCA n-grams: Corpus of Contemporary American English (COCA). One billion words, 1990-2019.
  - Google n-grams: the Web (as of 2006).

Balanced by genre

  - COCA n-grams: yes
  - Google n-grams: no

N-grams: includes part of speech

  - COCA n-grams: yes
  - Google n-grams: no

N-grams: includes lemma

  - COCA n-grams: yes (WordID format)
  - Google n-grams: no

Platform

  - COCA n-grams: can be installed and used on a personal computer.
  - Google n-grams: probably only installable and usable on a server or high-end workstation.

Minimum tokens (occurrences) per n-gram

  - COCA n-grams: 4 tokens
  - Google n-grams: 40 tokens

This difference is significant. While the number of tokens (total number of words) in the Google n-grams "corpus" (the Web) is much larger than in COCA, the number of types (unique strings of words) in their n-grams datasets is proportionately much smaller. This is because the Google n-grams only include strings that occur at least 40 times, which means that only an extremely small percentage of all types make it into their dataset. In our n-grams, on the other hand, even strings that occur only 4 or 5 times are included.
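The effect of a frequency cutoff on type counts can be sketched in a few lines of Python. This is only an illustration of the general principle (the tokenization, the toy text, and the thresholds here are hypothetical, not the actual pipeline used to build either dataset): raising the minimum number of occurrences leaves the token count of the corpus untouched but sharply shrinks the number of n-gram types that survive.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every n-gram (as a tuple of tokens) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def filter_by_min_freq(counts, min_freq):
    """Keep only n-gram types that occur at least min_freq times."""
    return {gram: c for gram, c in counts.items() if c >= min_freq}

# Toy "corpus" with some repeated phrases.
tokens = "the cat sat on the mat and the cat sat on the rug".split()
counts = ngram_counts(tokens, 3)

# All 3-gram types vs. only those occurring at least twice:
print(len(filter_by_min_freq(counts, 1)))  # prints 8
print(len(filter_by_min_freq(counts, 2)))  # prints 3
```

Even in this tiny example, a threshold of 2 discards most of the 3-gram types; with a real corpus, where the frequency distribution is heavily skewed toward rare strings, a cutoff of 40 removes a far larger share of types than a cutoff of 4.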