N-grams data

from the 14 billion word iWeb corpus

The n-grams data shows the frequency of the most frequent 2-, 3-, 4-, and 5-word strings in the 14 billion word iWeb corpus. If you choose the "wordID" format (right, below), you get the top 100 million 2-grams (two-word sequences), 100 million 3-grams, 100 million 4-grams, and 100 million 5-grams from the corpus, for a total of 400 million rows of data. The other format is the "word" format (left, below), which gives you 50 million rows each for the 2-grams, 3-grams, 4-grams, and 5-grams. If you purchase the data, you have access to both formats and can use whichever best meets your needs.

An explanation of the data and columns shown below is found in the downloadable sample.

Words (this sample is from the 3-grams file)

Word ID (combine with the information from the lexicon for the full-text data)

freq	word1	word2	word3	pos1	pos2	pos3
2,077,302	some	of	the	DD	IO	AT
2,075,479	part	of	the	NN1	IO	AT
1,934,603	the	end	of	AT	NN1	IO
1,837,603	you	want	to	PPY	VV0	TO
1,656,612	out	of	the	II21	II22	AT
1,651,036	to	be	a	TO	VBI	AT1
1,580,036	in	order	to	BCL21	BCL22	TO
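
As a rough illustration of how rows in the word format might be read, here is a minimal Python sketch. It assumes the file is tab-separated, starts with a header row like the one in the sample, and writes frequencies with thousands separators; the actual download may differ, so check the explanation in the downloadable sample. The function name `read_word_ngrams` is hypothetical.

```python
import csv
import io

# A few rows copied from the sample above. That the real file is
# tab-separated with a header row and comma-grouped frequencies is
# an assumption made for this sketch.
SAMPLE = (
    "freq\tword1\tword2\tword3\tpos1\tpos2\tpos3\n"
    "2,077,302\tsome\tof\tthe\tDD\tIO\tAT\n"
    "2,075,479\tpart\tof\tthe\tNN1\tIO\tAT\n"
    "1,934,603\tthe\tend\tof\tAT\tNN1\tIO\n"
)


def read_word_ngrams(handle):
    """Yield (freq, words, pos_tags) tuples from a word-format n-gram file."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    # One freq column, followed by n word columns and n part-of-speech columns.
    n = (len(header) - 1) // 2
    for row in reader:
        freq = int(row[0].replace(",", ""))  # drop thousands separators
        yield freq, tuple(row[1:1 + n]), tuple(row[1 + n:1 + 2 * n])


rows = list(read_word_ngrams(io.StringIO(SAMPLE)))
for freq, words, tags in rows:
    print(freq, " ".join(words), " ".join(tags))
```

Because the n-gram length is derived from the header, the same function should work for the 2-, 4-, and 5-gram files, provided they follow the same column pattern.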

freq	word1_ID	word2_ID	word3_ID
2,678,127	24	270	9
2,644,589	2	30	56
2,445,731	22	76	50
2,410,305	3	162	56
2,275,585	2	19	12
2,265,037	2	52	12
2,238,175	3	6	57
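
To turn wordID rows back into text, each ID has to be looked up in the lexicon. The sketch below shows one possible way to do that in Python. The lexicon entries are placeholders invented for illustration (they are not the real iWeb words for these IDs), and the names `decode_ngram` and `TOY_LEXICON` are hypothetical; the real lexicon's file layout is described in the downloadable sample.

```python
def decode_ngram(line, lexicon):
    """Split one tab-separated wordID row and map each ID to its word.

    `lexicon` is assumed to be a dict from integer wordID to word string.
    IDs missing from the lexicon are rendered as "<unknown>".
    """
    fields = line.rstrip("\n").split("\t")
    freq = int(fields[0].replace(",", ""))  # drop thousands separators
    words = [lexicon.get(int(f), "<unknown>") for f in fields[1:]]
    return freq, words


# Placeholder lexicon: these strings are made up and only stand in for
# whatever the real lexicon lists for IDs 24, 270, and 9.
TOY_LEXICON = {24: "WORD_24", 270: "WORD_270", 9: "WORD_9"}

decoded = decode_ngram("2,678,127\t24\t270\t9", TOY_LEXICON)
print(decoded)
```

With 100 million rows per file, loading the lexicon into a dictionary once and decoding rows as they stream past keeps memory use modest compared with reading a whole n-gram file at once.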