SAMPLES: LEVEL 2
You can purchase n-gram sets that contain all 1-, 2-, 3-, and 4-grams that
occur at least three times in the Corpus of Contemporary
American English (based on the corpus when it was about 430 million words in
size; it continues to grow each year). Although you probably need only a subset
of these files, all 21 files are included in the purchase price. The sample
files available from this page include all entries for words beginning with the
letter [u]. (See note 1)
For the 2-grams and 3-grams, you have a choice of n-grams with or without
part of speech (i.e. 2 options), either case sensitive or case insensitive
(i.e. 2 options), and either n-grams that are all words or n-grams containing
at least one punctuation mark or number (i.e. 2 options). In other words, there
are eight options in total for each of the 2-grams and 3-grams. For the 4-grams,
there is just one file available (case sensitive, with part of speech, and no
punctuation), and for the 1-grams there are no entries for punctuation.
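The eight options for the 2-grams and 3-grams come from combining the three
two-way choices described above. As a minimal sketch, the combinations can be
listed as follows. The option labels are illustrative wording only and are not
the actual file names.

```python
from itertools import product

# Illustrative labels for the three two-way choices (not real file names).
PART_OF_SPEECH = ("with PoS", "without PoS")
CASE = ("case sensitive", "case insensitive")
CONTENT = ("all words", "with punctuation or number")

# Every combination of the three choices: 2 * 2 * 2 = 8 options.
options = list(product(PART_OF_SPEECH, CASE, CONTENT))

for number, combination in enumerate(options, start=1):
    print(number, ", ".join(combination))
```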
You can also download all of these files as one ZIP file.
*: In the table above, PoS refers to [Part of Speech] and [CS] refers to case
sensitive.
Note 1. In the case of the [punctNum] files (n-grams including punctuation or
a number), only those entries with a [u] in the second position are included in
the sample files.
Note 2. The page http://ucrel.lancs.ac.uk/claws7tags.html provides a listing
of the part of speech tags (see also the notes at the bottom of that page,
regarding tags like ii31).
Note 3. In both the sample files and the full n-grams, to save space there is
only one part of speech listed for each word, even if the tagger originally
suggested two or three options. The PoS listed is the one that was ranked most
likely by the tagger.
Note 4. In the sample files and the full n-grams, the columns refer to:
frequencyOfNgram word1 (word2) (word3) (word4) (pos1) (pos2) (pos3) (pos4)
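The column layout above can be read with a short parser. The following is a
minimal sketch, not an official tool. It assumes that the columns are separated
by tabs (the delimiter is not stated on this page) and that files without part
of speech simply omit the pos columns. The names NgramEntry and
parse_ngram_line, and the sample line, are made up for illustration.

```python
from typing import NamedTuple, Optional, Tuple


class NgramEntry(NamedTuple):
    frequency: int
    words: Tuple[str, ...]
    pos_tags: Optional[Tuple[str, ...]]  # None for files without PoS


def parse_ngram_line(line: str, n: int, has_pos: bool, sep: str = "\t") -> NgramEntry:
    """Split one line into frequency, n words, and (optionally) n PoS tags.

    Assumption: columns are separated by `sep` (tab by default).
    """
    fields = line.rstrip("\r\n").split(sep)
    expected = 1 + n + (n if has_pos else 0)
    if len(fields) != expected:
        raise ValueError(f"expected {expected} columns, found {len(fields)}")
    frequency = int(fields[0])
    words = tuple(fields[1:1 + n])
    pos_tags = tuple(fields[1 + n:]) if has_pos else None
    return NgramEntry(frequency, words, pos_tags)


# Made-up 2-gram line with part of speech: frequency, word1, word2, pos1, pos2.
entry = parse_ngram_line("25\tunder\tthe\tii\tat", n=2, has_pos=True)
print(entry.frequency, entry.words, entry.pos_tags)
```

Passing n and has_pos explicitly keeps the parser independent of file names;
a line with an unexpected number of columns raises a ValueError rather than
being silently misread.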