N-grams data

Corpus of Contemporary American English



SAMPLES: LEVEL 2

You can purchase n-gram sets that contain all 1-, 2-, 3-, and 4-grams that occur at least three times in the 520-million-word Corpus of Contemporary American English. Although you probably need only a subset of these files, all 21 files are included in the purchase price. The sample files available from this page include all entries for words beginning with the letter [u]. (See note 1)

For the 2-grams and 3-grams, you can choose n-grams with or without part of speech (2 options), case sensitive or case insensitive (2 options), and either n-grams that are all words or n-grams containing at least one punctuation mark or number (2 options). In other words, there are eight options each for the 2-grams and the 3-grams. For the 4-grams, there is just one file available (case sensitive, with part of speech, and no punctuation), and for the 1-grams there are no entries for punctuation.
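The combinations described above can be enumerated with a short Python sketch, which also checks that they add up to the 21 files mentioned earlier. The function name and the tuple layout are illustrative assumptions, not part of the data set; the 1-gram variants (four "words" files) are taken from the table on this page.

```python
from itertools import product

POS = (False, True)             # without / with part of speech
CASE_SENSITIVE = (False, True)  # case insensitive / case sensitive
CONTENT = ("words", "punctNum")


def available_variants(n):
    """Return the (pos, case_sensitive, content) combinations offered for n-grams of size n.

    Hypothetical helper: it only mirrors the options described on this page.
    """
    if n == 1:
        # 1-grams have no punctuation entries, so only the "words" files exist.
        return [(p, c, "words") for p, c in product(POS, CASE_SENSITIVE)]
    if n in (2, 3):
        # 2 x 2 x 2 = 8 options each.
        return list(product(POS, CASE_SENSITIVE, CONTENT))
    if n == 4:
        # A single file: with part of speech, case sensitive, words only.
        return [(True, True, "words")]
    raise ValueError(f"no n-gram files for n={n}")


counts = {n: len(available_variants(n)) for n in (1, 2, 3, 4)}
total = sum(counts.values())
print(counts, total)
```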

You can also download all of these files as one ZIP file.

PoS?  CS?  words/punctNum  1-gram    2-gram    3-gram    4-gram
no    no   words           download  download  download
no    no   punctNum                  download  download
no    yes  words           download  download  download
no    yes  punctNum                  download  download
yes   no   words           download  download  download
yes   no   punctNum                  download  download
yes   yes  words           download  download  download  download
yes   yes  punctNum                  download  download

In the table above, PoS refers to part of speech and CS refers to case sensitivity.

Note 1. In the case of the [punctNum] files (n-grams including punctuation or a number), just those entries with a [u] in the second position are included in the sample files.

Note 2. The page http://ucrel.lancs.ac.uk/claws7tags.html provides a listing of the part of speech tags (see also the notes at the bottom of that page, regarding tags like ii31).

Note 3. In both the sample files and the full n-grams, to save space there is only one part of speech listed for each word, even if the tagger originally suggested two or three options. The PoS listed is the one that was ranked most likely by the tagger.

Note 4. In the sample files and the full n-grams, the columns refer to:

frequencyOfNgram word1 (word2) (word3) (word4) (pos1) (pos2) (pos3) (pos4)
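As a rough illustration of this column layout, the following Python sketch splits one line into its frequency, words, and (optional) part of speech tags. It assumes the columns are tab-separated and that the parenthesized columns are present only when they apply to the file in question; neither point is stated on this page, so check both against the actual files. The names `NgramEntry` and `parse_ngram_line` are hypothetical, and the example line is made up rather than taken from the data.

```python
from typing import NamedTuple, Optional, Tuple


class NgramEntry(NamedTuple):
    frequency: int
    words: Tuple[str, ...]
    pos: Optional[Tuple[str, ...]]  # None for files without part of speech


def parse_ngram_line(line: str, n: int, has_pos: bool) -> NgramEntry:
    """Parse one line laid out as: frequency word1..wordN [pos1..posN].

    Assumption: fields are separated by tabs.
    """
    fields = line.rstrip("\r\n").split("\t")
    expected = 1 + n * (2 if has_pos else 1)
    if len(fields) != expected:
        raise ValueError(f"expected {expected} fields, got {len(fields)}")
    frequency = int(fields[0])
    words = tuple(fields[1:1 + n])
    pos = tuple(fields[1 + n:1 + 2 * n]) if has_pos else None
    return NgramEntry(frequency, words, pos)


# Made-up 2-gram line with CLAWS7-style tags, for illustration only.
entry = parse_ngram_line("25\tunder\tthe\tii\tat\n", n=2, has_pos=True)
print(entry.frequency, entry.words, entry.pos)
```

Passing `n` and `has_pos` explicitly, rather than guessing them from the number of fields, keeps a malformed line from being silently misread.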