N-grams data


SAMPLES: words (+ PoS)

You can purchase n-gram sets that contain all 1-, 2-, 3-, 4-, and 5-grams that occur at least four times in the one-billion-word Corpus of Contemporary American English. The sample files available on this page include the first 50,000 entries for words beginning with the letter [m]. An explanation of the columns in these sample files is also available.

When you purchase the data, you can use either the "word" format (the samples on this page) or the "wordID + lexicon" format. The "word" format comes in two options, and you have access to both when you purchase the data.

type  description (see above)  2-grams     3-grams     4-grams     5-grams
1     words                    see sample  see sample  see sample  see sample
      Just the words (e.g. my life). No part of speech is included, and "words" do not include punctuation, numbers, etc.

2     words + PoS              see sample  see sample  see sample  see sample
      Both the words and the PoS (first letter of the part of speech code) of each word. "Words" include punctuation, numbers, etc.
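
As a rough illustration of how the two "word" options might be read, here is a minimal Python sketch. It assumes a tab-separated layout that this page does not confirm: a frequency column, then the n words, then (for the words + PoS option) n one-letter PoS codes. The function name parse_ngrams, the NGram type, and the sample rows are all hypothetical; check the explanation of columns before relying on this layout.

```python
import io
from typing import Iterable, Iterator, NamedTuple, Tuple


class NGram(NamedTuple):
    freq: int
    words: Tuple[str, ...]
    pos: Tuple[str, ...]  # empty tuple for the words-only option


def parse_ngrams(lines: Iterable[str], n: int, with_pos: bool = False) -> Iterator[NGram]:
    """Parse tab-separated n-gram rows.

    ASSUMED layout (not confirmed by the page): frequency, then n words,
    then, for the words + PoS option, n part-of-speech codes.
    """
    expected = 1 + n * (2 if with_pos else 1)
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != expected:
            continue  # skip blank or malformed rows
        words = tuple(fields[1:1 + n])
        pos = tuple(fields[1 + n:]) if with_pos else ()
        yield NGram(int(fields[0]), words, pos)


# Made-up rows for illustration only; not taken from the real sample files.
SAMPLE_WORDS = "120\tmy\tlife\n45\tmy\tfriend\n"
SAMPLE_POS = "120\tmy\tlife\ta\tn\n7\tmy\t,\ta\ty\n"

words_rows = list(parse_ngrams(io.StringIO(SAMPLE_WORDS), n=2))
pos_rows = list(parse_ngrams(io.StringIO(SAMPLE_POS), n=2, with_pos=True))
```

With real data, the same function could be given an open file handle instead of the in-memory strings. Note that in the words + PoS option, punctuation and numbers can appear as "words", so downstream code should not assume every token is alphabetic.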

You can also download all of these sample files as one ZIP file.