As far as we are aware, the only other large downloadable n-gram sets for
contemporary English are the Google n-grams. The following is a brief
comparison of the two datasets.
(You might also be interested in our adaptation of the historical Google Books
n-grams datasets, which lets you do many things that are not possible with the
simple Google Books n-grams interface.)
| | | COCA-based n-grams | Google n-grams |
| Corpus | | Corpus of Contemporary American English [COCA]. 450 million words, 1990-2012. | The Web (as of 2006) |
| | Balanced by genre | Yes | No |
| N-grams | Includes part of speech | Yes | No |
| | Includes lemma | Yes (Level 3) | No |
| | Platform | Can be installed and used on a personal computer | Probably only installable and usable on a server or high-end workstation |
| Minimum tokens per n-gram | | 3 tokens (Level 2), 1 token (Level 3) | 40 |
This difference in the minimum number of tokens per n-gram is significant.
While the number of tokens (total number of words) in the Google n-grams
"corpus" (the Web) is much larger than in COCA, the number of types (unique
strings of words) in their n-grams datasets is proportionately much smaller.
This is because the Google n-grams only include strings that occur at least 40
times, which means that only an extremely small percentage of all types are in
their dataset. In our n-grams, on the other hand, even strings that occur only
1 or 3 times (depending on the version of the n-grams) are included.
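To make the effect of a minimum-frequency cutoff concrete, here is a minimal Python sketch. It is an illustration only, not the procedure used to build either dataset: the toy sentence, the function names `count_ngrams` and `types_kept`, and the small thresholds are all assumptions chosen for readability. It counts the bigram types in a short text and shows how many of them survive as the cutoff rises.

```python
from collections import Counter


def count_ngrams(tokens, n):
    """Count every contiguous n-word string (type) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def types_kept(counts, min_freq):
    """Return only the n-gram types that occur at least min_freq times."""
    return {gram: freq for gram, freq in counts.items() if freq >= min_freq}


# Hypothetical toy text; a real corpus is many orders of magnitude larger.
text = ("the cat sat on the mat and the cat sat on the rug "
        "while the dog sat on the mat")
tokens = text.split()
counts = count_ngrams(tokens, 2)

# Raising the cutoff removes the low-frequency types first,
# and most types in any corpus are low-frequency.
for threshold in (1, 2, 3):
    kept = types_kept(counts, threshold)
    share = len(kept) / len(counts)
    print(f"min_freq={threshold}: {len(kept)} of {len(counts)} types ({share:.0%})")
```

Even in this tiny example, most bigram types occur only once, so any cutoff above 1 discards the majority of them. A cutoff of 40 applied to a much larger corpus works the same way in principle, although the actual proportions depend on the data.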