N-grams data

Corpus of Historical American English



Note: see also the downloadable, full-text version of COHA (385 million words in 115,000 texts). If you download this data, the texts will be on your own computer, and you can do whatever you like with them: generate n-grams, collocates, word frequency lists, and much more.

The Corpus of Historical American English (COHA) contains 400 million words of text from 1810-2009, and all of the n-grams from the corpus (millions of rows of data) can be freely downloaded. The n-gram files contain every n-gram (including individual words) that occurs at least three times in total in the corpus, and they show the frequency of each n-gram in each decade from the 1810s to the 2000s. This data can be used offline to carry out powerful searches on a wide range of phenomena in the history of American English.

For the 2-grams, 3-grams, and 4-grams, the first number listed below the column heading is the approximate number of unique n-grams (in millions), and the second is the total number of rows in the n-grams file (in millions). The row count is higher because a given n-gram usually appears several times in the file, once for each decade in which it occurs in the corpus.
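
The relationship between rows and unique n-grams can be illustrated with a short Python sketch. It is only an illustration: the tab-separated layout (n-gram, decade, frequency), the sample values, and the function names `load_rows` and `by_decade` are assumptions made for this example, and the actual downloaded files may be organized differently.

```python
from collections import defaultdict

# Hypothetical sample rows: n-gram, decade, frequency (tab-separated).
# The real n-grams files may use a different column layout.
SAMPLE = (
    "light of\t1810\t12\n"
    "light of\t1820\t30\n"
    "light of\t1830\t41\n"
    "the light\t1810\t25\n"
    "the light\t1830\t60\n"
)

def load_rows(text):
    """Parse tab-separated text into (ngram, decade, frequency) tuples."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        ngram, decade, freq = line.split("\t")
        rows.append((ngram, int(decade), int(freq)))
    return rows

def by_decade(rows):
    """Group rows into {ngram: {decade: frequency}}."""
    table = defaultdict(dict)
    for ngram, decade, freq in rows:
        table[ngram][decade] = table[ngram].get(decade, 0) + freq
    return dict(table)

rows = load_rows(SAMPLE)
table = by_decade(rows)

# One row per n-gram per decade, so rows outnumber unique n-grams.
print(len(table), "unique n-grams in", len(rows), "rows")
print(table["light of"])
```

With this sample, the three rows for "light of" collapse into a single entry holding one frequency per decade, which is why each file has far more rows than unique n-grams.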

Click on [*] below to see a small sample of each n-gram set (entries for the word light). Downloading the full n-gram sets is free, but we ask you to first enter your name and email address.
 
                  1-gram (i.e. unique words)              2-gram       3-gram       4-gram
Part of speech?   not case sensitive   case sensitive     7m / 32m     11m / 54m    8m / 36m
NO                download *           download *         download *   download *   download *
YES               download *           download *         download *   download *   download *