N-grams data


You can purchase n-gram sets that contain all 1-, 2-, 3-, and 4-grams that occur at least three times in the Corpus of Contemporary American English (when it was about 430 million words in size; it continues to grow each year). Although you probably need only a subset of these files, all 21 files are included in the purchase price. The sample files that are available from this page include all entries for words beginning with the letter [u]. (See Note 1.)

For the 2-grams and 3-grams, you have a choice of n-grams with or without part of speech (i.e. 2 options), either case sensitive or case insensitive (i.e. 2 options), and either n-grams that are all words or n-grams containing at least one punctuation mark or number (i.e. 2 options). In other words, there are eight options each for the 2-grams and the 3-grams. For the 4-grams, there is just one file available (case sensitive, with part of speech, and no punctuation), and for the 1-grams there are no entries for punctuation.
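The arithmetic behind these options can be checked with a short sketch. The function name and the tuple layout below are illustrative assumptions, not part of the data set; the counts themselves follow from the description above.

```python
from itertools import product

def available_variants(n):
    """Return the (pos, case_sensitive, kind) combinations offered for n-grams of size n.

    Hypothetical helper: it only encodes the options described in the text.
    """
    if n == 1:
        # 1-grams have no punctuation entries, so only the "words" kind exists.
        return [(pos, cs, "words") for pos, cs in product([False, True], repeat=2)]
    if n in (2, 3):
        # 2 (PoS) x 2 (case sensitivity) x 2 (words / punctNum) = 8 options.
        return list(product([False, True], [False, True], ["words", "punctNum"]))
    if n == 4:
        # Single file: with PoS, case sensitive, words only.
        return [(True, True, "words")]
    raise ValueError("n must be 1, 2, 3, or 4")

total_files = sum(len(available_variants(n)) for n in (1, 2, 3, 4))
print(total_files)  # 4 + 8 + 8 + 1 = 21
```

The total matches the 21 files mentioned above.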

You can also download all of these files as one ZIP file.

PoS?   CS?   words/punctNum   1-gram     2-gram     3-gram     4-gram
no     no    words            download   download   download   -
no     no    punctNum         -          download   download   -
no     yes   words            download   download   download   -
no     yes   punctNum         -          download   download   -
yes    no    words            download   download   download   -
yes    no    punctNum         -          download   download   -
yes    yes   words            download   download   download   download
yes    yes   punctNum         -          download   download   -

*: In the table above, PoS refers to part of speech and CS refers to case sensitivity.

Note 1. In the case of the [punctNum] files (n-grams including punctuation or a number), only those entries with a [u] in the second position are included in the sample files.

Note 2. The page http://ucrel.lancs.ac.uk/claws7tags.html provides a listing of the part of speech tags (see also the notes at the bottom of that page, regarding tags like ii31).

Note 3. In both the sample files and the full n-grams, to save space there is only one part of speech listed for each word, even if the tagger originally suggested two or three options. The PoS listed is the one that was ranked most likely by the tagger.

Note 4. In the sample files and the full n-grams, the columns refer to:

frequencyOfNgram word1 (word2) (word3) (word4) (pos1) (pos2) (pos3) (pos4)
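
As a minimal sketch of how one line in this layout might be read, the function below splits a line into its frequency, words, and (where present) part of speech tags. It assumes the columns are tab-separated, which is not stated above, so adjust `sep` if the files use a different delimiter. The names `NgramEntry` and `parse_ngram_line` are hypothetical, and the example line is invented for illustration rather than taken from the data.

```python
from typing import NamedTuple, Tuple

class NgramEntry(NamedTuple):
    frequency: int
    words: Tuple[str, ...]
    pos: Tuple[str, ...]  # empty for files without part of speech

def parse_ngram_line(line, n, has_pos, sep="\t"):
    """Parse one line: frequency, then n words, then n PoS tags if has_pos.

    Assumption: fields are separated by `sep` (tab by default).
    """
    fields = line.rstrip("\r\n").split(sep)
    expected = 1 + n + (n if has_pos else 0)
    if len(fields) != expected:
        raise ValueError(f"expected {expected} fields, got {len(fields)}")
    frequency = int(fields[0])
    words = tuple(fields[1:1 + n])
    pos = tuple(fields[1 + n:1 + 2 * n]) if has_pos else ()
    return NgramEntry(frequency, words, pos)

# Invented example line for a 2-gram file with part of speech.
entry = parse_ngram_line("25\tunder\tthe\tii\tat", n=2, has_pos=True)
print(entry.frequency, entry.words, entry.pos)
```

Passing `n` and `has_pos` explicitly keeps the parser from guessing which of the file variants it is reading; a mismatch in field count raises an error instead of silently misaligning words and tags.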