N-grams data


Download files:

Format 2-grams 3-grams 4-grams 5-grams
words download download download download
words+ + PoS download download download download
database (lexicon) download download download download

Format:

There are three formats for the n-grams. The examples below are for 3-grams. The 2-gram files have fewer columns, and the 4-gram and 5-gram files have more.

 1   Words. Does not include punctuation or numbers, and does not include the part of speech of the words

freq word1 word2 word3
20715 make sure that
20432 may have been

 2   Words+ + PoS. "Word" is anything: word, number, punctuation. Includes the part of speech of the words (the first letter of the CLAWS tags).

freq word1 word2 word3 PoS1 PoS2 PoS3
32433 member of the n i a
20432 me . i p y p
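As a rough illustration, the two text formats above could be read with a short Python helper like the one below. This is only a sketch: the names `NgramRow` and `parse_row` are hypothetical, and it assumes that the fields are separated by whitespace, as in the examples shown here (the actual files may use a different delimiter).

```python
from typing import NamedTuple, Tuple


class NgramRow(NamedTuple):
    freq: int
    words: Tuple[str, ...]
    pos: Tuple[str, ...]  # empty for the words-only format


def parse_row(line: str, n: int) -> NgramRow:
    """Parse one line of a words or words + PoS n-gram file."""
    # Assumption: fields are separated by whitespace (tabs or spaces).
    fields = line.split()
    if len(fields) == 1 + n:
        # Format 1: freq followed by n words.
        return NgramRow(int(fields[0]), tuple(fields[1:]), ())
    if len(fields) == 1 + 2 * n:
        # Format 2: freq, n words, then n one-letter PoS codes.
        return NgramRow(int(fields[0]), tuple(fields[1:1 + n]), tuple(fields[1 + n:]))
    raise ValueError(f"expected {1 + n} or {1 + 2 * n} fields, got {len(fields)}")


print(parse_row("20715 make sure that", 3))
print(parse_row("32433 member of the n i a", 3))
```

The n-gram size has to be passed in, since a line alone does not say whether its trailing fields are words or PoS codes.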

 3   Database. This is the most "complicated" version, but perhaps also the most powerful. Each word is represented as an integer ID, and the meaning of each ID is found in the "lexicon" file (which gives the case-sensitive word form, the lemma, and the part of speech).

The leftmost column in all of the n-grams tables is the frequency of the n-gram. The other columns are the integer IDs of the words (two columns for 2-grams, three for 3-grams, etc.).

freq wordID1 wordID2 wordID3
59319 276 5 3
47138689222131 11
370694465 3
355311335 3
32657682 7

Each number corresponds to an entry in the [lexicon] table. For example, the three entries [276], [5], and [3] in the lexicon table are:

wordID word (case sensitive) lemma part of speech (info)
3 the the at
5 of of io
276 most most dat

This means that the first entry in the 3-grams table above is for the string [ most of the ], which occurs 59,319 times in the corpus.
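The lookup described here can be sketched in a few lines of Python. The dictionary layout and the `decode` function are assumptions made for illustration; only the three lexicon entries and the [ most of the ] row come from the examples above.

```python
# Lexicon entries from the example: wordID -> (word, lemma, PoS).
lexicon = {
    3: ("the", "the", "at"),
    5: ("of", "of", "io"),
    276: ("most", "most", "dat"),
}


def decode(row, lexicon):
    """Turn (freq, wordID1, wordID2, ...) into (freq, 'word1 word2 ...')."""
    freq, *word_ids = row
    words = [lexicon[word_id][0] for word_id in word_ids]  # [0] = word form
    return freq, " ".join(words)


print(decode((59319, 276, 5, 3), lexicon))  # (59319, 'most of the')
```

Using index 1 or 2 instead of 0 would give the lemmas or the parts of speech for the same n-gram.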

Note that you would be responsible for creating the SQL statements to group by lemma, word, PoS, etc., and to limit and sort the data. We assume a good knowledge of SQL, as well as the ability to create the databases and tables from the CSV files.
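To give a sense of what such statements might look like, here is a minimal sketch using Python's built-in `sqlite3` module. The table names (`lexicon`, `ngrams3`) and column names are assumptions, not a required schema, and the rows are typed in by hand from the example above rather than loaded from the CSV files.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: one lexicon table and one table per n-gram size.
conn.executescript("""
    CREATE TABLE lexicon (wordID INTEGER PRIMARY KEY, word TEXT, lemma TEXT, pos TEXT);
    CREATE TABLE ngrams3 (freq INTEGER, wordID1 INTEGER, wordID2 INTEGER, wordID3 INTEGER);
""")

# Sample rows from the example; real data would be loaded from the CSV files.
conn.executemany(
    "INSERT INTO lexicon VALUES (?, ?, ?, ?)",
    [(3, "the", "the", "at"), (5, "of", "of", "io"), (276, "most", "most", "dat")],
)
conn.execute("INSERT INTO ngrams3 VALUES (?, ?, ?, ?)", (59319, 276, 5, 3))

# Join each wordID column to the lexicon, group by lemma, then sort and limit.
query = """
    SELECT l1.lemma, l2.lemma, l3.lemma, SUM(n.freq) AS total
    FROM ngrams3 AS n
    JOIN lexicon AS l1 ON l1.wordID = n.wordID1
    JOIN lexicon AS l2 ON l2.wordID = n.wordID2
    JOIN lexicon AS l3 ON l3.wordID = n.wordID3
    GROUP BY l1.lemma, l2.lemma, l3.lemma
    ORDER BY total DESC
    LIMIT 10
"""
rows = conn.execute(query).fetchall()
print(rows)
```

Grouping by `word` or `pos` instead of `lemma` follows the same pattern, with a different column in the SELECT and GROUP BY lists.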