N-grams data


NEW: Most of the information on this website deals with data from the COCA corpus, which was about 400 million words in size when this word frequency data was compiled. In May 2018 we released n-grams data from the 14 billion word iWeb corpus, which is about 35 times as large as COCA.


These n-grams are based on the largest publicly available, genre-balanced corpus of English -- the Corpus of Contemporary American English (COCA). Note that the data is from when the corpus was about 430 million words in size; it continues to grow each year. With this n-grams data (2-, 3-, 4-, and 5-word sequences, with their frequency), you can carry out powerful queries offline, without needing to access the corpus via the web interface.

Short sample:

frequency   word1   word2    word3
1419        much    the      same
461         much    more     likely
432         much    better   than
266         much    more     difficult
235         much    of       the
226         much    more     than
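As a rough illustration of an offline query, the sketch below reads rows like those in the short sample and lists the 3-grams that start with "much more". It assumes a tab-separated layout (frequency first, then one column per word); the actual layout of the downloadable files may differ, and the names `load_ngrams` and `starting_with` are hypothetical helpers, not part of the data set.

```python
import csv
import io

# Rows copied from the short sample above. The tab-separated layout
# (frequency, then one column per word) is an assumption for illustration.
SAMPLE_TSV = """\
1419\tmuch\tthe\tsame
461\tmuch\tmore\tlikely
432\tmuch\tbetter\tthan
266\tmuch\tmore\tdifficult
235\tmuch\tof\tthe
226\tmuch\tmore\tthan
"""


def load_ngrams(handle):
    """Read (frequency, words) pairs from a tab-separated file-like object."""
    ngrams = []
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # skip blank lines
        ngrams.append((int(row[0]), tuple(row[1:])))
    return ngrams


def starting_with(ngrams, *prefix):
    """Return the n-grams whose first words equal `prefix`, most frequent first."""
    hits = [(freq, words) for freq, words in ngrams
            if words[:len(prefix)] == prefix]
    return sorted(hits, reverse=True)


ngrams = load_ngrams(io.StringIO(SAMPLE_TSV))
much_more = starting_with(ngrams, "much", "more")
for freq, words in much_more:
    print(freq, " ".join(words))
```

With a real file, `io.StringIO(SAMPLE_TSV)` would be replaced by an opened file handle.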

A few more examples of possible queries (from among an essentially unlimited number) might be:

 NOUN + NOUN sequences
 VERB + the + NOUN sequences
 like + word + word
 three-word strings with a preposition in the middle position
 two-word strings, where the words begin or end with certain letters
 (potential) phrasal verbs: VERB + ADV particle
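Queries of this kind can be thought of as patterns with one slot per word. The sketch below is one possible way to express them, not the method used by the corpus interface: each slot is a literal word, `None` for any word, or a small test function (for example, "begins with m"). It uses only the word-form rows from the short sample; part-of-speech patterns such as NOUN + NOUN would need a tag column, which the sample does not show. The names `matches` and `search` are hypothetical.

```python
# Rows copied from the short sample on this page: (frequency, word1, word2, word3).
SAMPLE = [
    (1419, "much", "the", "same"),
    (461, "much", "more", "likely"),
    (432, "much", "better", "than"),
    (266, "much", "more", "difficult"),
    (235, "much", "of", "the"),
    (226, "much", "more", "than"),
]


def matches(words, pattern):
    """Return True if every slot in `pattern` accepts the word in that position.

    A slot may be a literal string, None (any word), or a function that
    takes the word and returns True or False.
    """
    if len(words) != len(pattern):
        return False
    for word, slot in zip(words, pattern):
        if slot is None:
            continue
        if callable(slot):
            if not slot(word):
                return False
        elif word != slot:
            return False
    return True


def search(rows, pattern):
    """Return (frequency, words) for each row matching `pattern`, in input order."""
    return [(row[0], row[1:]) for row in rows if matches(row[1:], pattern)]


# "much" + any word + "than"
much_x_than = search(SAMPLE, ("much", None, "than"))

# any word + a word beginning with "m" + a word ending in "t"
by_letters = search(
    SAMPLE,
    (None, lambda w: w.startswith("m"), lambda w: w.endswith("t")),
)

print(much_x_than)
print(by_letters)
```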

The data is available in several different formats:

1. Free lists

1 million most frequent 2, 3, 4, and 5-grams

2. Inexpensive data sets

All n-grams that occur three times or more: 6.2 million 2-grams, 11.9 million 3-grams, and 8.3 million 4-grams

3. All 2, 3, and 4-grams

Up to 155 million distinct strings -- searchable by word form and part of speech (as above), and also lemma
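With data sets of this size, scanning a text file for every query may be slow. One option, offered here as an assumption rather than a recommended workflow, is to load the rows into a local SQLite database and index the word columns. The sketch uses the rows from the short sample earlier on this page; the table name `trigrams`, the column names, and the helper `top_continuations` are hypothetical.

```python
import sqlite3

# Rows copied from the short sample earlier on this page.
ROWS = [
    (1419, "much", "the", "same"),
    (461, "much", "more", "likely"),
    (432, "much", "better", "than"),
    (266, "much", "more", "difficult"),
    (235, "much", "of", "the"),
    (226, "much", "more", "than"),
]

# An in-memory database keeps the sketch self-contained;
# a file path would be used for real data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trigrams (freq INTEGER, word1 TEXT, word2 TEXT, word3 TEXT)"
)
conn.executemany("INSERT INTO trigrams VALUES (?, ?, ?, ?)", ROWS)
# Index the leading words so prefix lookups do not scan the whole table.
conn.execute("CREATE INDEX idx_word1_word2 ON trigrams (word1, word2)")
conn.commit()


def top_continuations(conn, word1, word2, limit=10):
    """Return (word3, freq) pairs following word1 + word2, most frequent first."""
    cursor = conn.execute(
        "SELECT word3, freq FROM trigrams "
        "WHERE word1 = ? AND word2 = ? "
        "ORDER BY freq DESC LIMIT ?",
        (word1, word2, limit),
    )
    return cursor.fetchall()


result = top_continuations(conn, "much", "more")
print(result)
```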

If you're interested in the frequency of single words (including frequency by genre and sub-genre), or collocates (all words "nearby" a given word), you might look at http://www.wordfrequency.info.