NEW: Most of the information on this website deals with data from the COCA corpus, which was about 400 million words in size when this word frequency data was compiled. In May 2018 we released n-grams data from the 14 billion word iWeb corpus, which is about 35 times as large as COCA.
These n-grams are based on the largest publicly available, genre-balanced corpus of English -- the Corpus of Contemporary American English (COCA). (Note that the data is from when it was about 430 million words in size; it continues to grow each year.) With this n-grams data (2-, 3-, 4-, and 5-word sequences, with their frequencies), you can carry out powerful queries offline -- without needing to access the corpus via the web interface.

Short sample:
frequency | word1 | word2 | word3
1419 | much | the | same
461 | much | more | likely
432 | much | better | than
266 | much | more | difficult
235 | much | of | the
226 | much | more | than
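As a rough illustration of the kind of offline query this data allows, the sketch below loads the six sample rows above and finds every 3-gram that begins with "much more". It assumes a tab-separated layout with the frequency in the first column; that layout, and the helper names `load_ngrams` and `match`, are assumptions made for this example and may not match the downloaded files.

```python
import csv
import io

# Assumed layout (hypothetical): one n-gram per line, tab-separated,
# with the frequency first. The rows are the sample shown above.
SAMPLE = """\
1419\tmuch\tthe\tsame
461\tmuch\tmore\tlikely
432\tmuch\tbetter\tthan
266\tmuch\tmore\tdifficult
235\tmuch\tof\tthe
226\tmuch\tmore\tthan
"""


def load_ngrams(handle):
    """Return a list of (frequency, words) tuples from a tab-separated stream."""
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        if not fields:
            continue  # skip blank lines
        rows.append((int(fields[0]), tuple(fields[1:])))
    return rows


def match(ngrams, pattern):
    """Keep n-grams whose words equal the pattern; None matches any word."""
    return [
        (freq, words)
        for freq, words in ngrams
        if len(words) == len(pattern)
        and all(p is None or p == w for p, w in zip(pattern, words))
    ]


ngrams = load_ngrams(io.StringIO(SAMPLE))

# All 3-grams of the form "much more <any word>", most frequent first.
hits = match(ngrams, ("much", "more", None))
for freq, words in sorted(hits, reverse=True):
    print(freq, " ".join(words))
```

To run the same query against a real file, replace `io.StringIO(SAMPLE)` with an open file handle, adjusting the delimiter and column order if the actual format differs.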
A few more examples (from among an
unlimited number of n-grams) might be:
The data is available in several different formats:
1 | Free lists | 1 million most frequent 2, 3, 4, and 5-grams
2 | Inexpensive data sets | All n-grams that occur three times or more: 6.2 million 2-grams, 11.9 million 3-grams, and 8.3 million 4-grams
3 | All 2, 3, and 4-grams | Up to 155 million distinct strings -- searchable by word form and part of speech (as above), and also lemma
If you're interested in the frequency of single words (including frequency by genre and sub-genre), or collocates (all words "nearby" a given word), you might look at http://www.wordfrequency.info.