N-grams data

from the 14 billion word iWeb corpus


The n-grams data lists the most frequent 2-, 3-, 4-, and 5-word strings in the 14 billion word iWeb corpus, along with their frequency. If you choose the "wordID" format (second sample below), you get the top 100 million 2-grams (two-word sequences), 100 million 3-grams, 100 million 4-grams, and 100 million 5-grams from the corpus -- a total of 400 million rows of data. The other format is the "word" format (first sample below), which gives you 50 million rows of data for each of the 2-grams, 3-grams, 4-grams, and 5-grams. If you purchase the data, you have access to both formats -- whichever best meets your needs.
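As a rough illustration of how the word-format files can be used (the file name, tab delimiter, and lack of a header row are assumptions here, not part of the official documentation; the column names follow the 3-gram sample shown further down):

```python
import pandas as pd

# Hypothetical file name and layout; columns follow the 3-gram sample below.
cols = ["freq", "word1", "word2", "word3", "pos1", "pos2", "pos3"]
trigrams = pd.read_csv("iweb_3grams_words.txt", sep="\t",
                       names=cols, thousands=",")

# Example query: the most frequent 3-grams ending in "of the"
of_the = trigrams[(trigrams.word2 == "of") & (trigrams.word3 == "the")]
print(of_the.sort_values("freq", ascending=False).head())
```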

An explanation of the data and columns shown below is found in the downloadable sample.

Word format (this sample is from the 3-grams file):

freq        word1   word2   word3   pos1    pos2    pos3
2,077,302   some    of      the     DD      IO      AT
2,075,479   part    of      the     NN1     IO      AT
1,934,603   the     end     of      AT      NN1     IO
1,837,603   you     want    to      PPY     VV0     TO
1,656,612   out     of      the     II21    II22    AT
1,651,036   to      be      a       TO      VBI     AT1
1,580,036   in      order   to      BCL21   BCL22   TO

Word ID format (combine with the information from the lexicon for the full-text data):

freq        word1_ID   word2_ID   word3_ID
2,678,127   24         270        9
2,644,589   2          30         56
2,445,731   22         76         50
2,410,305   3          162        56
2,275,585   2          19         12
2,265,037   2          52         12
2,238,175   3          6          57
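As a minimal sketch of how the wordID format combines with the lexicon to recover the full text (the file names, column names, and tab delimiter below are assumptions; the actual schema is described in the downloadable sample and the purchased data):

```python
import pandas as pd

# Hypothetical file names and layouts; the real schema ships with the data.
lexicon = pd.read_csv("lexicon.txt", sep="\t",
                      names=["wordID", "word", "pos"])
trigrams = pd.read_csv("iweb_3grams_wordIDs.txt", sep="\t",
                       names=["freq", "word1_ID", "word2_ID", "word3_ID"],
                       thousands=",")

# Join each ID column against the lexicon to recover the word forms.
for i in (1, 2, 3):
    trigrams = trigrams.merge(
        lexicon[["wordID", "word"]].rename(columns={"word": f"word{i}"}),
        left_on=f"word{i}_ID", right_on="wordID", how="left",
    ).drop(columns="wordID")

print(trigrams.head())
```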