As far as we are aware, the only other large downloadable n-gram datasets for contemporary English are the Google n-grams (and our own n-grams from iWeb). The following is a brief comparison of the COCA n-grams and the Google n-grams. (You might also be interested in our adaptation of the Google Books n-grams datasets (historical), which allows you to do many things that are not possible with the simple Google Books n-grams interface.)
| | COCA-based n-grams | Google n-grams |
|---|---|---|
| Corpus | Corpus of Contemporary American English (COCA): one billion words, 1990-2019 | The Web (as of 2006) |
| Balanced by genre | Yes | No |
| Includes part of speech | Yes | No |
| Includes lemma | Yes (WordID format) | No |
| Platform | Can be installed and used on a personal computer | Probably only installable and usable on a server or high-end workstation |
| Minimum tokens per n-gram | 4 tokens | 40 tokens |
This difference is significant. Although the Google n-grams "corpus" (the Web) contains far more tokens (total words) than COCA, its n-gram datasets contain proportionately far fewer types (unique strings of words). This is because the Google n-grams include only strings that occur at least 40 times, which means that only an extremely small percentage of all types appear in their dataset. Our n-grams, on the other hand, include even strings that occur just four or five times.
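The effect of a minimum-frequency cutoff on type counts can be illustrated with a small sketch. The code below is not from either dataset's build pipeline; it simply counts trigram types in a toy token stream (an invented stand-in for a corpus) and shows how many types survive a cutoff of 4 versus 40.

```python
from collections import Counter

def ngrams(tokens, n):
    """Yield successive n-token strings from a token list."""
    return (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Hypothetical token stream: one frequent phrase, one rare phrase.
tokens = ("the quick brown fox " * 50 + "a slow green turtle " * 5).split()

counts = Counter(ngrams(tokens, 3))

# Types surviving each minimum-frequency cutoff.
kept_at_4 = {g for g, c in counts.items() if c >= 4}
kept_at_40 = {g for g, c in counts.items() if c >= 40}

print(len(counts), len(kept_at_4), len(kept_at_40))  # → 10 8 4
```

Here the cutoff of 40 keeps only the trigrams from the frequent phrase, while the cutoff of 4 also keeps the rarer ones; on real web-scale data, where most n-gram types occur only a handful of times, the gap between the two cutoffs is far more dramatic.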