The n-grams data shows the frequency of the most frequent 2-, 3-, 4-, and
5-word strings in the 14-billion-word iWeb corpus. If you choose the "wordID"
format (the second sample below), you get the top 100 million 2-grams
(two-word sequences), the top 100 million 3-grams, 100 million 4-grams, and
100 million 5-grams from the corpus, for a total of 400 million rows of data.
The other format is the "word" format (the first sample below), which gives
you 50 million rows of data for each of the 2-grams, 3-grams, 4-grams, and
5-grams. If you purchase the data, you have access to both formats, so you can
use whichever best meets your needs.
An explanation of the data and columns shown below
is found in the
downloadable sample.
Words (this sample is from the 3-grams file)

freq | word1 | word2 | word3 | pos1 | pos2 | pos3
2,077,302 | some | of | the | DD | IO | AT
2,075,479 | part | of | the | NN1 | IO | AT
1,934,603 | the | end | of | AT | NN1 | IO
1,837,603 | you | want | to | PPY | VV0 | TO
1,656,612 | out | of | the | II21 | II22 | AT
1,651,036 | to | be | a | TO | VBI | AT1
1,580,036 | in | order | to | BCL21 | BCL22 | TO

Word ID (combine with the information from the lexicon for the full-text data)

freq | word1_ID | word2_ID | word3_ID
2,678,127 | 24 | 270 | 9
2,644,589 | 2 | 30 | 56
2,445,731 | 22 | 76 | 50
2,410,305 | 3 | 162 | 56
2,275,585 | 2 | 19 | 12
2,265,037 | 2 | 52 | 12
2,238,175 | 3 | 6 | 57
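As a rough illustration of how the word-format rows might be read, here is a minimal Python sketch. It assumes the file is tab-separated with one n-gram per line and with thousands separators in the frequency column, which may not match the actual download. The function name `parse_ngrams` and the `SAMPLE` string are hypothetical; the three rows are copied from the 3-grams word-format sample.

```python
import csv
import io

# Three rows copied from the word-format 3-grams sample.
# The tab-separated layout is an assumption about the file format.
SAMPLE = (
    "2,077,302\tsome\tof\tthe\tDD\tIO\tAT\n"
    "2,075,479\tpart\tof\tthe\tNN1\tIO\tAT\n"
    "1,934,603\tthe\tend\tof\tAT\tNN1\tIO\n"
)


def parse_ngrams(handle, n=3):
    """Yield (freq, words, pos_tags) for each n-gram row in handle."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Expect one frequency column, n word columns, and n POS columns.
        if len(row) != 1 + 2 * n:
            continue  # skip blank or malformed lines
        freq = int(row[0].replace(",", ""))  # "2,077,302" -> 2077302
        words = tuple(row[1:1 + n])
        pos_tags = tuple(row[1 + n:])
        yield freq, words, pos_tags


rows = list(parse_ngrams(io.StringIO(SAMPLE)))
for freq, words, pos_tags in rows:
    print(freq, " ".join(words), "/".join(pos_tags))
```

For the wordID format, the same loop would apply with four columns per 3-gram row; turning the IDs back into words requires the lexicon that accompanies the full-text data.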