N-grams data


Most of the information on this website deals with data from the COCA corpus. You might also be interested in the n-grams data from the 14-billion-word iWeb corpus.

These n-grams are based on the largest publicly available, genre-balanced corpus of English -- the one-billion-word Corpus of Contemporary American English (COCA). With this n-grams data (2-, 3-, 4-, and 5-word sequences, with their frequencies), you can carry out powerful queries offline -- without needing to access the corpus via the web interface.

 
frequency  word1  word2   word3
31891      much   of      the
13261      much   of      a
8000       much   more    than
7396       much   as      i
5650       much   the     same
5633       much   of      it
4229       much   better  than
4191       much   as      the
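
As a rough illustration of an offline query, the sketch below loads the rows from the table above and lists the most frequent sequences that start with a given word. The tab-separated layout (frequency first, then one word per column) is an assumption about the downloaded file, and the function names load_ngrams and top_continuations are hypothetical.

```python
import csv
import io

# Rows copied from the table above. The tab-separated layout is an
# assumption about the file format, not something the page documents.
SAMPLE = """\
31891\tmuch\tof\tthe
13261\tmuch\tof\ta
8000\tmuch\tmore\tthan
7396\tmuch\tas\ti
5650\tmuch\tthe\tsame
5633\tmuch\tof\tit
4229\tmuch\tbetter\tthan
4191\tmuch\tas\tthe
"""


def load_ngrams(handle):
    """Yield (frequency, words) pairs from a tab-separated n-grams file."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or not row[0].isdigit():
            continue  # skip blank lines and a possible header row
        yield int(row[0]), tuple(row[1:])


def top_continuations(ngrams, first_word, limit=5):
    """Return the most frequent n-grams whose first word is first_word."""
    hits = [(freq, words) for freq, words in ngrams if words[0] == first_word]
    hits.sort(key=lambda item: item[0], reverse=True)
    return hits[:limit]


# For a real file, replace io.StringIO(SAMPLE) with open(path, encoding="utf-8").
ngrams = list(load_ngrams(io.StringIO(SAMPLE)))
top = top_continuations(ngrams, "much", limit=3)
for freq, words in top:
    print(freq, " ".join(words))
```

For the full files, which hold millions of entries, streaming the rows through the generator rather than building a list would keep memory use low.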

A few more examples (from among an unlimited number of queries) might be:

 NOUN + NOUN sequences
 VERB + the + NOUN sequences
 like + word + word
 three word strings with a preposition in the middle position
 two word strings, where the words begin or end with certain letters
 (potential) phrasal verb: VERB + ADV particle

The data is available in three different formats, and when you purchase the data you have access to all three formats. (The numbers refer to how many millions of entries there are for that format / n-grams set).

Type                                       Data
                                           2-grams  3-grams  4-grams  5-grams
1    Words                                 8.2 m    16.3 m   13.1 m   6.2 m
2    Words + part of speech                11.6 m   28.5 m   28.2 m   17.4 m
db   Database: integer values + lexicon    13.5 m   18.9 m   27.1 m   16.2 m
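
The database format stores integer values alongside a lexicon, which suggests that each n-gram row holds word IDs that are resolved to word forms through a lookup table. The sketch below shows that idea with an in-memory SQLite database. The table names (lexicon, ngrams3), the column names, and the ID values are all hypothetical; only the three frequencies and word sequences come from the table earlier on this page.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: the real table and column names may differ.
conn.executescript("""
    CREATE TABLE lexicon (word_id INTEGER PRIMARY KEY, word TEXT);
    CREATE TABLE ngrams3 (freq INTEGER, w1 INTEGER, w2 INTEGER, w3 INTEGER);
""")

# The word IDs are invented for illustration.
conn.executemany(
    "INSERT INTO lexicon VALUES (?, ?)",
    [(1, "much"), (2, "of"), (3, "the"), (4, "a"), (5, "more"), (6, "than")],
)
conn.executemany(
    "INSERT INTO ngrams3 VALUES (?, ?, ?, ?)",
    [(31891, 1, 2, 3), (13261, 1, 2, 4), (8000, 1, 5, 6)],
)

# Join the lexicon once per word slot to turn IDs back into word forms.
QUERY = """
    SELECT n.freq, a.word, b.word, c.word
    FROM ngrams3 AS n
    JOIN lexicon AS a ON a.word_id = n.w1
    JOIN lexicon AS b ON b.word_id = n.w2
    JOIN lexicon AS c ON c.word_id = n.w3
    WHERE a.word = ?
    ORDER BY n.freq DESC
"""
rows = conn.execute(QUERY, ("much",)).fetchall()
for freq, w1, w2, w3 in rows:
    print(freq, w1, w2, w3)
```

Storing integers keeps the n-gram tables compact and makes indexed lookups fast, at the cost of one join per word slot when the word forms are needed.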

You might also be interested in the frequency of single words (including frequency by genre and sub-genre), or collocates (all words "nearby" a given word).