N-grams data

Corpus of Contemporary American English




Note: this data is based on corpora that were created solely by Mark Davies, Professor of Linguistics at Brigham Young University. Under an agreement between BYU and Mark Davies, all payment and license transactions for this data are made solely with Mark Davies, rather than with BYU.


The n-grams are available in a number of different formats:

License: GRAD = graduate student, ACAD = other academic, COM = commercial

Level 1
   Data: Most frequent 2, 3, and 4-grams
   Size: 1 million entries each
   Samples: See
   Price: Free

Level 2
   Data: All 2, 3, 4-grams that occur at least 3 times. Available case sensitive, part of speech (more info)
   Size: 6.2 million, 11.9 million, and 8.3 million n-grams, respectively
   Samples: See
   Price: $55 (GRAD), $95 (ACAD), $195 (COM)

Level 3
   Data: All 2, 3, and 4-grams, including those that occur just 1-2 times
   Size: More than 155 million rows (for the 3-grams). The format allows users to specify word, PoS, and lemma.
   Samples: See
   Price: $95 (GRAD), $195 (ACAD), $395 (COM)
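As a rough illustration of the difference between Level 2 (n-grams that occur at least 3 times) and Level 3 (all n-grams, including those that occur just 1-2 times), here is a minimal Python sketch. It runs on a made-up sentence, not on the corpus, and the function names count_ngrams and filter_by_frequency are hypothetical helpers chosen for this example; it does not reflect the actual file format of the purchased data.

```python
from collections import Counter


def count_ngrams(tokens, n):
    """Count every contiguous n-token sequence in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def filter_by_frequency(counts, min_count):
    """Keep only the n-grams that occur at least min_count times."""
    return {gram: freq for gram, freq in counts.items() if freq >= min_count}


# Toy text invented for this example (not corpus data).
tokens = (
    "the cat sat on the mat and the cat sat on the rug and the cat sat down"
).split()

# Level 3 style: every 2-gram, including those seen only once or twice.
all_bigrams = count_ngrams(tokens, 2)

# Level 2 style: only the 2-grams that occur at least 3 times.
frequent_bigrams = filter_by_frequency(all_bigrams, 3)

for gram, freq in sorted(frequent_bigrams.items()):
    print(" ".join(gram), freq)
```

The same two functions work for 3-grams and 4-grams by changing n. The threshold of 3 is the one stated for Level 2 above.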

To purchase the files (Levels 2 and 3):

1. Download and fill out the appropriate non-disclosure agreement (NDA) by clicking on one of the links in the blue sections above, and then send it back to us as an email attachment. For both GRAD and ACAD licenses, the NDA must be sent back from a university email account. For GRAD, you must also provide proof of status via a university web page (on the NDA).

2. Once we receive the NDA, we'll send you a request for payment from PayPal.

3. As soon as we receive confirmation of the payment, we'll send you the link to download the data.

Thanks for your interest.