N-grams data


SAMPLES: wordID + lexicon

There are three sample files for the full Level 3 n-grams. These are small samples of the 3-grams data, but you would also receive similar files for the 2-grams and 4-grams.
 

level3_lexicon.txt

Lexicon entries for "words" #600-1000 in the lexicon for the Corpus of Contemporary American English (word form, case-sensitive word form, part of speech, lemma, etc.). This is a sample of one of the files that you would receive if you purchase the n-grams.

level3_wordID.txt

3-grams (three-word sequences) for "words" #600-1000 in the lexicon. Each word is represented as a numeric value, which corresponds to an entry in <level3_lexicon.txt>. This is a sample of one of the files that you would receive if you purchase the n-grams.

level3_alpha.txt

For the purposes of these sample entries, these are the n-grams obtained by doing a SQL JOIN between the two preceding files. If you purchase the n-grams, you will not receive a file that looks like this; rather, you will receive files like the two above, and you would then use SQL JOIN statements (like those below) to create a file that looks like this one. With the lexicon and wordID (n-gram) files, however, you can also run much more powerful queries, involving word form (case sensitive or not), lemma, and part of speech (see the sketch after the numbered examples below).
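For example, a minimal JOIN that reproduces the layout of <level3_alpha.txt> would look roughly like the following. This is a sketch only, using the table and column names that appear in the sample queries below:

select ngrams.freq, lex1.word, lex2.word, lex3.word
from ngrams, lex as lex1, lex as lex2, lex as lex3
where lex1.wID = ngrams.w1 and
      lex2.wID = ngrams.w2 and
      lex3.wID = ngrams.w3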

Note that you would be responsible for creating the SQL statements to group by lemma, word, PoS, etc., and to limit and sort the data. We assume a good knowledge of SQL, as well as the ability to create the databases and tables from the CSV files. Without an additional consulting fee, we cannot help you create the SQL statements, and no refund will be given to those who are unable to set up the databases or write the queries to extract the data.
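To get started, you would create two tables from the CSV files, one for the lexicon and one for the n-grams. The following is only a hedged sketch (MySQL-style types): the CSV files themselves define the authoritative layout, and the name of the case-sensitive word form column, here wordCS, is an assumption. The other column names (wID, word, pos, lemma, freq, w1-w3) are the ones used in the sample queries below.

create table lex (
    wID    int primary key,   -- numeric word ID used in the n-grams files
    word   varchar(100),      -- word form
    wordCS varchar(100),      -- case-sensitive word form (hypothetical column name)
    pos    varchar(20),       -- part-of-speech tag
    lemma  varchar(100)       -- lemma
);

create table ngrams (
    freq int,                 -- frequency of the n-gram
    w1   int,                 -- word ID of the first word
    w2   int,                 -- word ID of the second word
    w3   int                  -- word ID of the third word (3-grams only)
);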

Although you would be responsible for creating your own SQL statements, here we give you a few sample SQL JOIN statements, so that you can get some sense of what your queries might look like:

1. 3-grams for the word <likes> as a verb, ordered by frequency:

select sum(ngrams.freq), lex1.word, lex2.word, lex3.word
from ngrams, lex as lex1, lex as lex2, lex as lex3
where lex1.word = 'likes' and
      lex1.pos like 'v%' and
      lex1.wID = ngrams.w1 and
      lex2.wID = ngrams.w2 and
      lex3.wID = ngrams.w3
group by lex1.word, lex2.word, lex3.word
order by sum(ngrams.freq) desc

Example (top 10 entries):

267   likes   to   say
242   likes   to   be
185   likes   it   .
133   likes   to   play
129   likes   to   do
121   likes   to   think
117   likes   to   tell
113   likes   to   call
110   likes   to   talk
109   likes   to   have

2. 2-grams for the lemma <like> as a verb, where the following word is a plural noun (NN2), grouped by word:

select sum(ngrams.freq), lex1.word, lex2.word
from ngrams, lex as lex1, lex as lex2
where lex1.lemma = 'like' and lex1.pos like 'v%' and
      lex1.wID = ngrams.w1 and
      lex2.pos like 'nn2%' and
      lex2.wID = ngrams.w2
group by lex1.word, lex2.word
order by sum(ngrams.freq) desc

Example (top 10 entries):

98   like    things
79   like    women
69   like    men
51   like    dogs
42   like    kids
42   liked   women
35   like    girls
34   like    children
31   liked   things
28   like    surprises

3. 3-grams where the first word is a lexical verb (VV%), the second word is an article (a%) or determiner (d%), and the third word belongs to one of the lemmas ('house','home'), grouped by lemma, and which occur at least five times:

select sum(ngrams.freq), lex1.lemma, lex2.lemma, lex3.lemma
from ngrams, lex as lex1, lex as lex2, lex as lex3
where lex1.pos like 'vv%' and
      lex1.wID = ngrams.w1 and
      (lex2.pos like 'a%' or lex2.pos like 'd%') and
      lex2.wID = ngrams.w2 and
      lex3.lemma in ('house','home') and
      lex3.wID = ngrams.w3
group by lex1.lemma, lex2.lemma, lex3.lemma
having sum(ngrams.freq) >= 5   -- keep only combinations occurring at least five times
order by sum(ngrams.freq) desc

Example (top 10 entries):

1232   leave   the     house
 874   buy     a       house
 441   build   a       house
 428   buy     a       home
 356   find    a       home
 345   enter   the     house
 331   leave   their   home
 322   hit     a       home
 292   buy     the     house
 291   sell    the     house
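
Finally, to illustrate the case-sensitive queries mentioned above: the following sketch finds the most frequent 2-grams beginning with the capitalized form "May", as opposed to lowercase "may". The column name wordCS for the case-sensitive word form is an assumption, as is the requirement that it use a case-sensitive (e.g. binary) collation; check the actual lexicon file for the real column name.

select sum(ngrams.freq), lex1.wordCS, lex2.word
from ngrams, lex as lex1, lex as lex2
where lex1.wordCS = 'May' and   -- exact form; assumes a case-sensitive collation
      lex1.wID = ngrams.w1 and
      lex2.wID = ngrams.w2
group by lex1.wordCS, lex2.word
order by sum(ngrams.freq) desc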