N-GRAMS
from the COCA and COHA corpora of American English


The Corpus of Historical American English (COHA) contains 400 million words of text from 1810-2009, and all of the n-grams from the corpus can be freely downloaded. The files contain all n-grams that occur at least three times in total in the corpus, and you can see the frequency of each of these n-grams in each decade from the 1810s to the 2000s. This data can be used offline to carry out powerful searches on a wide range of phenomena in the history of American English.

For the 2-grams, 3-grams, and 4-grams, the two figures listed below each column heading are the approximate number of unique n-grams (in millions), followed by the total number of rows in the n-grams file (also in millions). The row count is higher because a given n-gram usually appears several times in the file, once for each decade in which it occurs in the corpus.
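The difference between unique n-grams and rows can be illustrated with a minimal sketch. The column layout used here (frequency, the words of the n-gram, then the decade, separated by tabs) is an assumption made for illustration only, as are the sample values and the function name `load_by_decade`; check the sample files for the actual layout before adapting it.

```python
from collections import defaultdict

# Hypothetical rows: frequency, word1, word2, decade (tab-separated).
# The real COHA files may use a different column order or decade coding.
SAMPLE = "\n".join([
    "12\tlight\tof\t1810",
    "30\tlight\tof\t1900",
    "5\tlight\train\t1900",
    "7\tlight\train\t2000",
    "3\tlight\tyears\t2000",
])

def load_by_decade(text, n=2):
    """Map each n-gram (a tuple of n words) to a {decade: frequency} dict."""
    table = defaultdict(dict)
    for line in text.splitlines():
        fields = line.split("\t")
        freq = int(fields[0])
        ngram = tuple(fields[1:1 + n])
        decade = int(fields[1 + n])
        # Sum frequencies in case the same n-gram/decade pair is repeated.
        table[ngram][decade] = table[ngram].get(decade, 0) + freq
    return table

table = load_by_decade(SAMPLE)
row_count = len(SAMPLE.splitlines())   # one row per n-gram per decade
unique_count = len(table)              # distinct n-grams across all decades
totals = {ngram: sum(by_decade.values()) for ngram, by_decade in table.items()}

print(f"{unique_count} unique 2-grams in {row_count} rows")
```

Grouping rows by n-gram in this way gives both the per-decade frequencies and the overall total for each n-gram.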

Click on [*] below to see small samples of each n-gram set (entries for the word light). Download of the full n-gram sets is free, but we ask you to first enter your name and email address.
 
                   1-gram (i.e. unique words)              2-gram        3-gram        4-gram
Part of speech?    not case sensitive    case sensitive    7m / 32m      11m / 54m     8m / 36m
NO                 download *            download *        download *    download *    download *
YES                download *            download *        download *    download *    download *

Please note that there are textual errors in COHA. It would be impossible to have a completely "clean" 400 million word historical corpus (especially when the corpus was created essentially by one person). You will find words with typos (SO (=50), tlus (=thus), somctimes, etc.), words that are incorrectly fused together (whois, cansay), and other problems. In terms of types (unique forms) in the n-grams files, the number of errors may seem large, but in terms of tokens (total number of words affected), it is very small. Based on informal tests that we have done, the texts are on average about 99.85% accurate, resulting in one error about every 500-1000 words. The question is whether you would rather have an almost perfectly clean 1,000,000 word corpus that can only be used for a very small range of studies (mainly high frequency syntactic phenomena), or a corpus with a small number of errors but which is large enough to be used for a wide range of studies. We have chosen the latter.
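The relationship between the accuracy figure and the error spacing can be checked with a short calculation. This is only a back-of-the-envelope sketch: it takes the 99.85% figure as exact and assumes errors are spread evenly across the corpus, neither of which is claimed above. The function name `words_per_error` is ours.

```python
def words_per_error(accuracy_percent):
    """Average number of words between errors at a given accuracy level."""
    error_percent = 100 - accuracy_percent
    return 100 / error_percent

CORPUS_WORDS = 400_000_000   # corpus size stated above
ACCURACY = 99.85             # percent, from the informal tests

spacing = words_per_error(ACCURACY)
# Rough token count affected, assuming a uniform error rate (an assumption).
estimated_errors = round(CORPUS_WORDS * (100 - ACCURACY) / 100)

print(f"about one error every {spacing:.0f} words")
print(f"about {estimated_errors:,} affected tokens in the whole corpus")
```

At 99.85% accuracy the spacing falls inside the 500-1000 word range quoted above.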

Finally, if this data results in a publication, please cite the data as follows, and please enter information in our publications database. Thank you.

Davies, Mark. (2011) N-grams and word frequency data from the Corpus of Historical American English (COHA). Downloaded from http://www.ngrams.info on February 22, 2012.