N-grams data

Corpus of Historical American English



The Corpus of Historical American English (COHA) contains 400 million words of text from 1810-2009, and all of the n-grams from the corpus can be freely downloaded. The files contain every n-gram that occurs at least three times in the corpus as a whole, and they show the frequency of each n-gram in each decade from the 1810s to the 2000s. This data can be used offline to carry out powerful searches on a wide range of phenomena in the history of American English.

For the 2-grams, 3-grams, and 4-grams, the first number listed below the column heading is the approximate number of unique n-grams (in millions), and the second is the total number of rows in the n-grams file (also in millions). The row count is higher because a given n-gram usually appears several times in the file: once for each decade in which it occurs in the corpus.
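The one-row-per-decade layout described above can be collapsed into a per-n-gram view with a few lines of Python. This is only a minimal sketch: the tab-separated column order (n-gram, decade, frequency), the function name `frequencies_by_decade`, and the sample rows are all assumptions made for illustration, and the counts are invented rather than taken from COHA. Check the downloaded files for their actual layout before adapting it.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows in an ASSUMED layout: n-gram <TAB> decade <TAB> frequency.
# The counts are invented for illustration and are not real COHA figures.
SAMPLE = (
    "light of\t1810\t12\n"
    "light of\t1820\t30\n"
    "light rain\t1820\t4\n"
    "light of\t1830\t41\n"
)

def frequencies_by_decade(handle):
    """Collapse one-row-per-decade records into {ngram: {decade: frequency}}."""
    table = defaultdict(dict)
    # QUOTE_NONE keeps any quotation marks in the text as literal characters.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for ngram, decade, freq in reader:
        table[ngram][int(decade)] = int(freq)
    return dict(table)

table = frequencies_by_decade(io.StringIO(SAMPLE))
unique_ngrams = len(table)                               # unique n-grams
total_rows = sum(len(by_decade) for by_decade in table.values())  # file rows
print(unique_ngrams, total_rows)
```

With a real file, the same function could be given an open file handle instead of the in-memory sample. The two totals correspond to the pair of numbers shown under each column heading (unique n-grams and total rows).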

Click on [*] below to see a small sample of each n-gram set (entries for the word light). Downloading the full n-gram sets is free, but we ask you to first enter your name and email address.
 

Part of speech?   1-gram (i.e. unique words)             2-gram       3-gram       4-gram
                  not case sensitive   case sensitive    7m / 32m     11m / 54m    8m / 36m
NO                download *           download *        download *   download *   download *
YES               download *           download *        download *   download *   download *

We have had a problem with companies downloading the COHA data for free and then using it in their programs. As a result, the data for the 1990s-2000s is deleted by default from the 1-gram files (single words). If you are a researcher at a university, send us an email from your academic email address and we will update your access so that you can download 1-gram lists that include the 1990s-2000s as well.

Please note that there are textual errors in COHA. It would be impossible to have a completely "clean" 400 million word historical corpus, especially when the corpus was created essentially by one person. You will find words with typos (SO (=50), tlus (=thus), somctimes, etc.), words that are incorrectly fused together (whois, cansay), and other problems. In terms of types (unique forms) in the n-grams files, the number of errors may seem large, but in terms of tokens (total number of words affected), it is very small. Based on informal tests that we have done, the texts are on average about 99.85% accurate, which works out to one error about every 500-1000 words. The question is whether you would rather have an almost perfectly clean 1,000,000 word corpus that can only be used for a very small range of studies (mainly high-frequency syntactic phenomena), or a corpus with a small number of errors that is large enough to be used for a wide range of studies. We have chosen the latter.
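The accuracy figure above can be checked with simple arithmetic. The sketch below uses only the two numbers stated in the text (99.85% accuracy and a 400 million word corpus); the function names are hypothetical, and the result is a rough expectation derived from an informal estimate, not a measured error count.

```python
def words_per_error(accuracy):
    """Average number of words between errors at a given token accuracy."""
    return 1.0 / (1.0 - accuracy)

def expected_errors(accuracy, corpus_size):
    """Rough number of affected tokens in a corpus of the given size."""
    return (1.0 - accuracy) * corpus_size

# Figures taken from the text: about 99.85% accurate, 400 million words.
spacing = round(words_per_error(0.9985))
affected = round(expected_errors(0.9985, 400_000_000))
print(spacing)   # falls inside the stated range of 500-1000 words
print(affected)  # approximate number of affected tokens in the whole corpus
```

At 99.85% accuracy this gives roughly one error every 667 words, which is consistent with the 500-1000 range quoted above.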

Finally, if this data results in a publication, please cite the data as follows, and please enter information in our publications database. Thank you.

Davies, Mark. (2011) N-grams and word frequency data from the Corpus of Historical American English (COHA). Downloaded from http://www.ngrams.info on May 19, 2013.