Download files:
Format:
There are three formats for the
n-grams. The examples below are for 3-grams. The 2-gram files have
fewer columns, and the 4-gram and 5-gram files have more.
1
Words. Does not include punctuation or numbers, and
does not include the part of speech of the words.
freq | word1 | word2 | word3
20715 | make | sure | that
20432 | may | have | been
2
Words + PoS. Here a "word" is any token: a word, a number, or a
punctuation mark. Includes the part of speech of each word (the first letter of
its CLAWS tag).
freq | word1 | word2 | word3 | PoS1 | PoS2 | PoS3
32433 | member | of | the | n | i | a
20432 | me | . | i | p | y | p
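Because this format carries one PoS letter per word, rows can be filtered by a PoS pattern. The following is only a minimal sketch: it assumes each row has already been split into seven string fields (the delimiter used in the download files is not stated here), and the names `Trigram`, `parse_row`, and `matches` are hypothetical helpers, not part of the data.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class Trigram(NamedTuple):
    freq: int
    words: Tuple[str, str, str]
    pos: Tuple[str, str, str]


def parse_row(fields: Sequence[str]) -> Trigram:
    """Convert seven fields (freq, three words, three PoS letters) to a Trigram."""
    freq, w1, w2, w3, p1, p2, p3 = fields
    return Trigram(int(freq), (w1, w2, w3), (p1, p2, p3))


def matches(trigram: Trigram, pattern: Sequence[Optional[str]]) -> bool:
    """Return True if every non-None slot in pattern equals the PoS letter."""
    return all(want is None or want == got
               for want, got in zip(pattern, trigram.pos))


# The two sample rows from the table above, already split into fields.
rows = [
    ["32433", "member", "of", "the", "n", "i", "a"],
    ["20432", "me", ".", "i", "p", "y", "p"],
]
trigrams = [parse_row(r) for r in rows]

# Keep only the 3-grams whose first PoS letter is "n"; None is a wildcard.
n_first = [t for t in trigrams if matches(t, ("n", None, None))]
print([t.words for t in n_first])  # [('member', 'of', 'the')]
```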
3
Database. This is the most "complicated" version,
but perhaps also the most powerful. Each word is represented as an integer
value, and the meaning of each integer is found in the "lexicon" file
(which gives the case-sensitive word form, the lemma, and the part of
speech).
The leftmost column in each of the n-grams tables is
the frequency of the n-gram. The other columns are the integer values for the
words (two columns for 2-grams, three for 3-grams, etc.).
freq | wordID1 | wordID2 | wordID3
59319 | 276 | 5 | 3
47138 | 68 | 9222131 | 11
37069 | 446 | 5 | 3
35531 | 133 | 5 | 3
32657 | 68 | 2 | 7
Each number corresponds to an entry in the [lexicon] table. For example, the
three entries [276], [5], and [3] in the lexicon table are:
wordID | word (case sensitive) | lemma | part of speech (info)
3 | the | the | at
5 | of | of | io
276 | most | most | dat
This means that the first entry in the 3-grams table above is for the string
[ most of the ], which occurs 59,319 times in the corpus.
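The lookup described above can be sketched in a few lines. This is an illustration only: the `lexicon` dictionary holds just the three entries shown in the table, and `decode` is a hypothetical helper name, not something supplied with the files.

```python
# Lexicon entries from the table above: wordID -> (word, lemma, part of speech).
lexicon = {
    3: ("the", "the", "at"),
    5: ("of", "of", "io"),
    276: ("most", "most", "dat"),
}


def decode(row, lexicon):
    """Replace the wordIDs in an n-gram row with their word forms.

    row is (freq, wordID1, wordID2, ...); works for any n.
    """
    freq, *word_ids = row
    words = [lexicon[word_id][0] for word_id in word_ids]
    return freq, " ".join(words)


print(decode((59319, 276, 5, 3), lexicon))  # (59319, 'most of the')
```

Taking index 1 or 2 of each lexicon entry instead of index 0 would give the lemma or the part of speech.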
Note that you would be responsible for writing the SQL statements to group by lemma, word, PoS,
etc., and to limit and sort the data. We assume a good knowledge of SQL, as well
as the ability to create the databases and tables from the CSV files.
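As a rough illustration of the kind of SQL involved, the sketch below loads the sample rows shown above into an in-memory SQLite database and sums frequencies by the lemma of the third word. The table names (`ngrams3`, `lexicon`) and column names (`w1`, `w2`, `w3`, `pos`) are assumptions made for this example, not names used by the files. In practice the tables would be loaded from the CSV files rather than typed in.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lexicon (wordID INTEGER PRIMARY KEY, word TEXT, lemma TEXT, pos TEXT);
CREATE TABLE ngrams3 (freq INTEGER, w1 INTEGER, w2 INTEGER, w3 INTEGER);
""")

# Only the three lexicon entries shown above are known here.
conn.executemany("INSERT INTO lexicon VALUES (?, ?, ?, ?)", [
    (3, "the", "the", "at"),
    (5, "of", "of", "io"),
    (276, "most", "most", "dat"),
])

# The five sample rows from the 3-grams table above.
conn.executemany("INSERT INTO ngrams3 VALUES (?, ?, ?, ?)", [
    (59319, 276, 5, 3),
    (47138, 68, 9222131, 11),
    (37069, 446, 5, 3),
    (35531, 133, 5, 3),
    (32657, 68, 2, 7),
])

# Group by the lemma and PoS of the third word, then sort and limit.
# The inner join drops rows whose third wordID is missing from this
# partial lexicon.
query = """
SELECT lex.lemma, lex.pos, SUM(n.freq) AS total
FROM ngrams3 AS n
JOIN lexicon AS lex ON lex.wordID = n.w3
GROUP BY lex.lemma, lex.pos
ORDER BY total DESC
LIMIT 10
"""
result = conn.execute(query).fetchall()
print(result)  # [('the', 'at', 131919)]
```

With the full lexicon loaded, the same pattern extends to joining on every word position and grouping by word form or part of speech instead of lemma.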