N-GRAMS

The Corpus of Historical American English (COHA) contains 400 million words of text from 1810-2009, and all of the n-grams from the corpus can be freely downloaded. The files contain all n-grams that occur at least three times in total in the corpus, and you can see the frequency of each of these n-grams in each decade from the 1810s to the 2000s. This data can be used offline to carry out powerful searches on a wide range of phenomena in the history of American English. For the 2-grams, 3-grams, and 4-grams, the number listed below the column heading is the approximate number of unique n-grams (in millions), followed by the total number of rows in the n-grams file (a given n-gram usually appears several times in the file, once for each decade in which it appears in the corpus).
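As a rough illustration of the "one row per n-gram per decade" structure, the sketch below sums the frequency of one n-gram by decade and compares the row count with the number of unique n-grams. The column layout (frequency, decade, then the words, separated by tabs) and the sample counts are assumptions made up for this example; check the downloaded files for the actual layout before adapting it.

```python
import csv
import io
from collections import defaultdict

# Hypothetical layout (an assumption, not the documented COHA format):
# one tab-separated row per n-gram per decade: frequency, decade, words...
# The counts below are invented purely for illustration.
SAMPLE = (
    "12\t1850\tlight\tof\n"
    "30\t1900\tlight\tof\n"
    "7\t1900\tlight\tblue\n"
)


def frequency_by_decade(rows, ngram):
    """Sum the frequency of `ngram` (a tuple of words) in each decade."""
    totals = defaultdict(int)
    for row in rows:
        freq, decade, words = int(row[0]), int(row[1]), tuple(row[2:])
        if words == ngram:
            totals[decade] += freq
    return dict(sorted(totals.items()))


def unique_ngrams(rows):
    """Count distinct n-grams, ignoring the decade each row belongs to."""
    return len({tuple(row[2:]) for row in rows})


rows = list(csv.reader(io.StringIO(SAMPLE), delimiter="\t"))
by_decade = frequency_by_decade(rows, ("light", "of"))
print(by_decade)                       # frequency of "light of" per decade
print(len(rows), unique_ngrams(rows))  # total rows vs. unique n-grams
```

With a real file, the in-memory sample would be replaced by an open file handle, and the rows could be streamed rather than loaded into a list.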
Click on [*] below to see a small sample of each n-gram set (entries for the word light). Downloading the full n-gram sets is free, but we ask you to first enter your name and email address.
Please note that there are textual errors in COHA -- it would be impossible to have a completely "clean" 400 million word historical corpus (especially when the corpus was created essentially by one person). You will find words with typos (SO (=50), tlus (=thus), somctimes, etc.), words that are incorrectly fused together (whois, cansay), and other problems. Counted as types (unique forms) in the n-grams files, the errors may seem numerous, but counted as tokens (total number of words affected), they are very few. Based on informal tests that we have done, the texts are on average about 99.85% accurate, resulting in one error about every 500-1000 words. The question is whether you would rather have an almost perfectly clean 1,000,000 word corpus that can only be used for a very small range of studies (mainly high frequency syntactic phenomena), or a corpus with a small number of errors but which is large enough to be used for a wide range of studies. We have chosen the latter.
Finally, if this data results in a publication, please cite the data as follows, and please enter information in our publications database. Thank you.
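As a quick check on the accuracy figure quoted above, the sketch below converts a token-level accuracy percentage into an average number of words between errors and an expected number of affected tokens. It assumes errors are spread evenly over the corpus, which is a simplification; the function names are made up for this example.

```python
def words_per_error(accuracy_percent):
    """Average number of words between errors at a given token-level accuracy."""
    error_rate = 1 - accuracy_percent / 100
    return 1 / error_rate


def expected_errors(total_words, accuracy_percent):
    """Expected number of affected tokens, assuming a uniform error rate."""
    return total_words * (1 - accuracy_percent / 100)


print(round(words_per_error(99.85)))               # 667
print(round(expected_errors(400_000_000, 99.85)))  # 600000
```

At 99.85% accuracy this gives about one error every 667 words, which falls inside the 500-1000 range stated above, and roughly 600,000 affected tokens across the 400 million words.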