These n-grams are based on the largest publicly available, genre-balanced corpus of English: the 450-million-word Corpus of Contemporary American English (COCA). With these n-gram data (2-, 3-, 4-, and 5-word sequences, each with its frequency), you can carry out powerful queries offline, without needing to access the corpus via the web interface.
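As a rough illustration of what an offline query might look like, the sketch below reads n-grams from tab-separated text and lists the most frequent sequences that begin with a given word. The line layout (frequency, then one word per column), the function names `read_ngrams` and `top_followers`, and the sample rows are all assumptions made for this example; the actual download format may differ.

```python
import csv
import io
from typing import Iterable, Iterator, List, NamedTuple, Tuple


class NGram(NamedTuple):
    freq: int
    words: Tuple[str, ...]


def read_ngrams(lines: Iterable[str]) -> Iterator[NGram]:
    """Parse lines of the ASSUMED form: frequency<TAB>word1<TAB>word2..."""
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 2 or not row[0].strip().isdigit():
            continue  # skip blank or malformed lines
        yield NGram(int(row[0]), tuple(w.lower() for w in row[1:]))


def top_followers(ngrams: Iterable[NGram], first_word: str, limit: int = 5) -> List[NGram]:
    """Return the most frequent n-grams whose first word is first_word."""
    hits = [g for g in ngrams if g.words[0] == first_word.lower()]
    return sorted(hits, key=lambda g: g.freq, reverse=True)[:limit]


# Invented sample rows standing in for a real 2-gram file (not real COCA counts).
SAMPLE = (
    "120\tstrong\ttea\n"
    "45\tstrong\twind\n"
    "300\tpowerful\tengine\n"
    "7\tstrong\targument\n"
)

if __name__ == "__main__":
    for gram in top_followers(read_ngrams(io.StringIO(SAMPLE)), "strong"):
        print(gram.freq, " ".join(gram.words))
```

To run this against a downloaded file, `io.StringIO(SAMPLE)` could be replaced with an open file handle, provided the file follows the assumed column layout.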
A few examples (from among an
unlimited number of searches) might be:
The data is available in several different formats:
| 1 | Free lists | 1 million most frequent 2, 3, 4, and 5-grams |
| 2 | Inexpensive data sets | All n-grams that occur three times or more: 6.2 million 2-grams, 11.9 million 3-grams, and 8.3 million 4-grams |
| 3 | All 2, 3, and 4-grams | Up to 155 million distinct strings -- searchable by word form and part of speech (as above), and also lemma |
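A search by word form, lemma, and part of speech over such data could be sketched roughly as follows. The record layout (`Token`, `TaggedNGram`), the `Slot` query type, the part-of-speech tags, and the sample counts are hypothetical and chosen only for illustration; they are not taken from the actual data sets.

```python
from typing import Iterable, List, NamedTuple, Optional, Sequence, Tuple


class Token(NamedTuple):
    word: str   # word form as it appears in the text
    lemma: str  # dictionary form
    pos: str    # part-of-speech tag (illustrative tags only)


class TaggedNGram(NamedTuple):
    freq: int
    tokens: Tuple[Token, ...]


class Slot(NamedTuple):
    """One position in a query; a field left as None matches anything."""
    word: Optional[str] = None
    lemma: Optional[str] = None
    pos_prefix: Optional[str] = None


def slot_matches(slot: Slot, token: Token) -> bool:
    """Check every constraint that the slot actually specifies."""
    if slot.word is not None and token.word.lower() != slot.word.lower():
        return False
    if slot.lemma is not None and token.lemma.lower() != slot.lemma.lower():
        return False
    if slot.pos_prefix is not None and not token.pos.lower().startswith(slot.pos_prefix.lower()):
        return False
    return True


def search(ngrams: Iterable[TaggedNGram], query: Sequence[Slot]) -> List[TaggedNGram]:
    """Return n-grams of the query's length that match slot by slot, most frequent first."""
    hits = [
        g for g in ngrams
        if len(g.tokens) == len(query)
        and all(slot_matches(s, t) for s, t in zip(query, g.tokens))
    ]
    return sorted(hits, key=lambda g: g.freq, reverse=True)


# Invented sample records (not real counts).
DATA = [
    TaggedNGram(50, (Token("broke", "break", "vvd"), Token("the", "the", "at"), Token("law", "law", "nn1"))),
    TaggedNGram(80, (Token("breaks", "break", "vvz"), Token("the", "the", "at"), Token("silence", "silence", "nn1"))),
    TaggedNGram(200, (Token("made", "make", "vvd"), Token("the", "the", "at"), Token("decision", "decision", "nn1"))),
]

if __name__ == "__main__":
    # Any form of the lemma "break", then the word "the", then any noun tag.
    query = [Slot(lemma="break"), Slot(word="the"), Slot(pos_prefix="nn")]
    for gram in search(DATA, query):
        print(gram.freq, " ".join(t.word for t in gram.tokens))
```

Matching on a tag prefix rather than an exact tag is one possible design choice: it lets a single slot cover related tags (for example, singular and plural nouns) without listing each one.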
If you're interested in the frequency of single words (including frequency by genre and sub-genre), or in collocates (all words near a given word), you might look at http://www.wordfrequency.info.