N-grams data

SAMPLES: wordID + lexicon

There are three sample files for the full Level 3 n-grams. These are small samples of the 3-grams data, but you would also receive similar files for 2-grams and 4-grams.

level3_lexicon.txt

Lexicon entries for "words" #600-1000 in the lexicon for the Corpus of Contemporary American English (word form, case-sensitive word form, part of speech, lemma, etc.). This is a sample of one of the files that you would receive if you purchase the n-grams.

level3_wordID.txt

3-grams (three-word sequences) for "words" #600-1000 in the lexicon. Each word is represented as a numeric value, which corresponds to the entries in <ngrams_lexicon.txt>. This is a sample of one of the files that you would receive if you purchase the n-grams.

level3_alpha.txt

For the purposes of these sample entries, these are the n-grams that are obtained after doing a SQL JOIN between the two preceding files. If you purchase the n-grams, you will not receive a file that looks like this, but rather you will receive files like the two files above. You would then use SQL JOIN statements (like those below) to create a file that looks like this one. But with the lexicon and wordID (n-gram) files, you could also do much more powerful queries, involving word form (+/- case sensitive), lemmas, and part of speech.

Note that you would be responsible for creating the SQL statements to group by lemma, word, PoS, etc., and to limit and sort the data. We assume a good knowledge of SQL, as well as the ability to create the databases and tables from the CSV files. Without an additional consulting fee, we cannot help you to create the SQL statements, and there would be no refund for those who are unable to set up the databases and create the SQL statements to extract the data.
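As a rough illustration of the setup step described above, the sketch below loads a lexicon file and a wordID file into two tables and joins them to produce lines in the style of level3_alpha.txt. It uses Python's built-in sqlite3 module only so that it runs without a database server. The column order in each file, the tab delimiter, and the table and column names (lex, ngrams, wID, wordCS, w1, w2, w3) are assumptions modeled on the sample queries on this page, and the rows are invented placeholders rather than corpus data, so check the real files before adapting it.

```python
import csv
import io
import sqlite3

# Stand-ins for the two purchased files. The column order and the tab
# delimiter are ASSUMPTIONS; the rows are invented, not real corpus data.
LEXICON_TXT = (
    "601\tlikes\tlikes\tvvz\tlike\n"
    "602\tto\tto\tto\tto\n"
    "603\tsay\tsay\tvvi\tsay\n"
)
WORDID_TXT = "267\t601\t602\t603\n"


def load_tables(conn, lexicon_file, wordid_file):
    """Create the lex and ngrams tables and fill them from tab-separated files."""
    conn.execute(
        "create table lex (wID integer, word text, wordCS text, pos text, lemma text)"
    )
    conn.execute(
        "create table ngrams (freq integer, w1 integer, w2 integer, w3 integer)"
    )
    # QUOTE_NONE keeps a literal quotation mark in the data from being
    # treated as CSV quoting.
    lex_rows = csv.reader(lexicon_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    ngram_rows = csv.reader(wordid_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    conn.executemany("insert into lex values (?, ?, ?, ?, ?)", lex_rows)
    conn.executemany("insert into ngrams values (?, ?, ?, ?)", ngram_rows)


def alpha_rows(conn):
    """Replace each numeric word ID with its word form, most frequent first."""
    return conn.execute(
        """
        select ngrams.freq, lex1.word, lex2.word, lex3.word
        from ngrams
        join lex as lex1 on lex1.wID = ngrams.w1
        join lex as lex2 on lex2.wID = ngrams.w2
        join lex as lex3 on lex3.wID = ngrams.w3
        order by ngrams.freq desc
        """
    ).fetchall()


conn = sqlite3.connect(":memory:")
load_tables(conn, io.StringIO(LEXICON_TXT), io.StringIO(WORDID_TXT))
rows = alpha_rows(conn)
for freq, w1, w2, w3 in rows:
    print(f"{freq}\t{w1}\t{w2}\t{w3}")
```

With real files, the two io.StringIO objects would be replaced by open file handles. For the full data set a server database with indexes on the ID columns would probably be more practical than an in-memory table.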

Although you would be responsible for creating your own SQL statements, here we give you a few sample SQL JOIN statements, so that you can get some sense of what your queries might look like:

1. 3-grams for the word <likes> as a verb, ordered by frequency:

select sum(ngrams.freq), lex1.word, lex2.word, lex3.word
from ngrams, lex as lex1, lex as lex2, lex as lex3
where lex1.word = 'likes'
  and lex1.pos like 'v%'
  and lex1.wID = ngrams.w1
  and lex2.wID = ngrams.w2
  and lex3.wID = ngrams.w3
group by lex1.word, lex2.word, lex3.word
order by sum(ngrams.freq) desc

example (top 10 entries here):

267	likes	to	say
242	likes	to	be
185	likes	it	.
133	likes	to	play
129	likes	to	do
121	likes	to	think
117	likes	to	tell
113	likes	to	call
110	likes	to	talk
109	likes	to	have
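To make the role of the part-of-speech filter and of sum() with group by more concrete, here is a small runnable sketch of the same kind of query using Python's sqlite3 module and explicit JOIN ... ON syntax. All rows are invented for illustration. The idea that one word form can have several IDs (for example, one per case-sensitive variant) is an assumption based on the lexicon description above, not something confirmed here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table lex (wID integer, word text, pos text, lemma text);
    create table ngrams (freq integer, w1 integer, w2 integer, w3 integer);
    -- Invented rows for illustration only (not corpus data).
    insert into lex values
        (1, 'likes', 'vvz', 'like'),      -- verb
        (2, 'likes', 'vvz', 'like'),      -- verb, ASSUMED second ID (e.g. "Likes")
        (3, 'likes', 'nn2', 'like'),      -- plural noun ("likes and dislikes")
        (4, 'to', 'to', 'to'),
        (5, 'say', 'vvi', 'say'),
        (6, 'and', 'cc', 'and'),
        (7, 'dislikes', 'nn2', 'dislike');
    insert into ngrams values
        (10, 1, 4, 5),
        (3, 2, 4, 5),
        (8, 3, 6, 7);
""")

# The word and the PoS pattern are passed as parameters so the same
# statement can be reused for other words.
QUERY = """
    select sum(ngrams.freq) as total, lex1.word, lex2.word, lex3.word
    from ngrams
    join lex as lex1 on lex1.wID = ngrams.w1
    join lex as lex2 on lex2.wID = ngrams.w2
    join lex as lex3 on lex3.wID = ngrams.w3
    where lex1.word = ? and lex1.pos like ?
    group by lex1.word, lex2.word, lex3.word
    order by total desc
"""

verb_rows = conn.execute(QUERY, ("likes", "v%")).fetchall()
noun_rows = conn.execute(QUERY, ("likes", "n%")).fetchall()
print(verb_rows)
print(noun_rows)
```

In this toy data the two verb rows for "likes to say" are merged into a single total, while the noun reading is returned only when the pattern 'n%' is requested.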

2. 2-grams for the lemma <like> as a verb, where the following word is a plural noun (NN2), grouped by word:

select sum(ngrams.freq), lex1.word, lex2.word
from ngrams, lex as lex1, lex as lex2
where lex1.lemma = 'like'
  and lex1.pos like 'v%'
  and lex1.wID = ngrams.w1
  and lex2.pos like 'nn2%'
  and lex2.wID = ngrams.w2
group by lex1.word, lex2.word
order by sum(ngrams.freq) desc

example (top 10 entries here):

98	like	things
79	like	women
69	like	men
51	like	dogs
42	like	kids
42	liked	women
35	like	girls
34	like	children
31	liked	things
28	like	surprises
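The listing above keeps "like women" and "liked women" apart because the results are grouped by word form. The following sketch, again on invented rows and with assumed table and column names, contrasts grouping by word with grouping by lemma. The helper like_plus_plural is hypothetical and exists only for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table lex (wID integer, word text, pos text, lemma text);
    create table ngrams (freq integer, w1 integer, w2 integer);
    -- Invented rows for illustration only (not corpus data).
    insert into lex values
        (1, 'like', 'vv0', 'like'),
        (2, 'liked', 'vvd', 'like'),
        (3, 'like', 'ii', 'like'),    -- preposition: excluded by the PoS filter
        (4, 'dogs', 'nn2', 'dog'),
        (5, 'dog', 'nn1', 'dog');     -- singular noun: excluded by 'nn2%'
    insert into ngrams values (7, 1, 4), (2, 2, 4), (9, 3, 4), (5, 1, 5);
""")


def like_plus_plural(group_col):
    """Verb lemma 'like' + plural noun, grouped by 'word' or by 'lemma'."""
    # The column name is interpolated into the SQL text, so only the two
    # known column names are accepted.
    if group_col not in ("word", "lemma"):
        raise ValueError(f"unsupported grouping column: {group_col}")
    sql = f"""
        select sum(ngrams.freq) as total, lex1.{group_col}, lex2.{group_col}
        from ngrams
        join lex as lex1 on lex1.wID = ngrams.w1
        join lex as lex2 on lex2.wID = ngrams.w2
        where lex1.lemma = 'like'
          and lex1.pos like 'v%'
          and lex2.pos like 'nn2%'
        group by lex1.{group_col}, lex2.{group_col}
        order by total desc
    """
    return conn.execute(sql).fetchall()


by_word = like_plus_plural("word")
by_lemma = like_plus_plural("lemma")
print(by_word)
print(by_lemma)
```

Grouping by word returns one row per inflected form, whereas grouping by lemma folds "like dogs" and "liked dogs" into one row for the lemma pair.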

3. 3-grams where the first word is a lexical verb (VV%), the second word is an article (a%) or determiner (d%), and the third word belongs to one of the lemmas ('house','home'), grouped by lemma, and which occur at least five times:

select sum(ngrams.freq), lex1.lemma, lex2.lemma, lex3.lemma
from ngrams, lex as lex1, lex as lex2, lex as lex3
where lex1.pos like 'vv%'
  and lex1.wID = ngrams.w1
  and (lex2.pos like 'a%' or lex2.pos like 'd%')
  and lex2.wID = ngrams.w2
  and lex3.lemma in ('house','home')
  and lex3.wID = ngrams.w3
group by lex1.lemma, lex2.lemma, lex3.lemma
having sum(ngrams.freq) >= 5
order by sum(ngrams.freq) desc

example (top 10 entries here):

1232	leave	the	house
874	buy	a	house
441	build	a	house
428	buy	a	home
356	find	a	home
345	enter	the	house
331	leave	their	home
322	hit	a	home
292	buy	the	house
291	sell	the	house
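One detail worth testing on a small table is where the "at least five times" condition is applied. The sketch below, with invented rows and assumed table and column names, compares a threshold on the grouped total (a HAVING clause) with a threshold on individual n-gram rows (a WHERE condition). It is only an illustration of the difference, not necessarily how the listing above was produced.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table lex (wID integer, word text, pos text, lemma text);
    create table ngrams (freq integer, w1 integer, w2 integer, w3 integer);
    -- Invented rows for illustration only (not corpus data).
    insert into lex values
        (1, 'leave', 'vv0', 'leave'),
        (2, 'left', 'vvd', 'leave'),
        (3, 'paint', 'vv0', 'paint'),
        (4, 'the', 'at', 'the'),
        (5, 'house', 'nn1', 'house');
    insert into ngrams values
        (3, 1, 4, 5),   -- leave the house
        (4, 2, 4, 5),   -- left the house
        (2, 3, 4, 5);   -- paint the house
""")

# Template with two optional filters: one on single rows, one on groups.
BASE = """
    select sum(ngrams.freq) as total, lex1.lemma, lex2.lemma, lex3.lemma
    from ngrams
    join lex as lex1 on lex1.wID = ngrams.w1
    join lex as lex2 on lex2.wID = ngrams.w2
    join lex as lex3 on lex3.wID = ngrams.w3
    where lex1.pos like 'vv%'
      and (lex2.pos like 'a%' or lex2.pos like 'd%')
      and lex3.lemma in ('house', 'home')
      {row_filter}
    group by lex1.lemma, lex2.lemma, lex3.lemma
    {group_filter}
    order by total desc
"""

# Threshold on the summed frequency of each lemma triple.
grouped = conn.execute(
    BASE.format(row_filter="", group_filter="having sum(ngrams.freq) >= 5")
).fetchall()

# Threshold on each individual n-gram row before grouping.
per_row = conn.execute(
    BASE.format(row_filter="and ngrams.freq >= 5", group_filter="")
).fetchall()

print(grouped)
print(per_row)
```

Here "leave the house" and "left the house" each fall below five on their own but reach seven once grouped by lemma, so only the HAVING version keeps them.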