SAMPLES: wordID + lexicon
There are three sample files for the full Level 3 n-grams. These are small
samples of the 3-grams data, but you would also receive similar files for 2-grams
and 4-grams.
level3_lexicon.txt |
Lexicon entries for "words" #600-1000 in the lexicon for the
Corpus of Contemporary American English (word form, case-sensitive word form, part of speech, lemma, etc.). This is
a sample of one of the files that you would receive if you purchase the n-grams. |
level3_wordID.txt |
3-grams (three-word sequences) for "words" #600-1000 in the lexicon. Each word is represented as a numeric ID, which corresponds to an entry in <ngrams_lexicon.txt>. This is
a sample of one of the files that you would receive if you purchase the n-grams. |
level3_alpha.txt |
For the purposes of these sample entries, these are the n-grams that are obtained after doing a SQL JOIN between the two preceding files. If you purchase the n-grams, you will not receive a file that looks like this, but rather you will receive files like the two files above. You would then use SQL JOIN statements (like those below) to create a file that looks like this one.
But with the lexicon and wordID (n-gram) files, you could also do much more
powerful queries, involving word form (+/- case sensitive), lemmas, and part of
speech. |
Note that you would be responsible for creating the SQL statements to group by lemma, word, PoS, etc and to limit and sort the data. We assume a good knowledge of SQL, as well as the ability to create the databases and tables from the CSV files. Without an additional consulting fee, we cannot help you to create the SQL statements, and there would be no refund for those who are unable to set up the databases and create the SQL statements to extract the data. |
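As a rough illustration of the setup step, the sketch below loads a few rows into an in-memory SQLite database with Python's sqlite3 module and joins the wordID table to the lexicon to recover the word forms, which is one way a file like level3_alpha.txt could be produced. The table and column names (lex, ngrams, wID, word, lemma, pos, freq, w1, w2, w3) follow the sample queries on this page. The column order, the choice of SQLite, the PoS tags, and the wID values are assumptions made for illustration, not the layout of the files you would receive.

```python
import sqlite3

# Toy rows invented for illustration. The real lexicon has more columns,
# and its actual column order is not documented here.
LEX_ROWS = [  # (wID, word, lemma, pos)
    (1, "likes", "like", "vvz"),
    (2, "to", "to", "to"),
    (3, "say", "say", "vvi"),
    (4, "be", "be", "vbi"),
]
NGRAM_ROWS = [  # (freq, w1, w2, w3), where w1..w3 are wID values
    (267, 1, 2, 3),
    (242, 1, 2, 4),
]


def build_db():
    """Create the two tables in memory and load the toy rows."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "create table lex (wID integer primary key, word text, lemma text, pos text)"
    )
    con.execute(
        "create table ngrams (freq integer, w1 integer, w2 integer, w3 integer)"
    )
    con.executemany("insert into lex values (?, ?, ?, ?)", LEX_ROWS)
    con.executemany("insert into ngrams values (?, ?, ?, ?)", NGRAM_ROWS)
    return con


def alpha_rows(con):
    """Replace each numeric wID with its word form, most frequent first."""
    sql = """
        select ngrams.freq, lex1.word, lex2.word, lex3.word
        from ngrams
        join lex as lex1 on lex1.wID = ngrams.w1
        join lex as lex2 on lex2.wID = ngrams.w2
        join lex as lex3 on lex3.wID = ngrams.w3
        order by ngrams.freq desc
    """
    return con.execute(sql).fetchall()


if __name__ == "__main__":
    for row in alpha_rows(build_db()):
        print(row)
```

With the real files, the executemany calls would be fed from the downloaded lexicon and wordID files instead of the toy lists.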
Although you would be responsible for creating your own SQL statements, here we give you a few sample SQL JOIN statements, so that you can get some sense of what your queries might look like:
1. 3-grams for the word <likes> as a verb, ordered by frequency:
select sum(ngrams.freq), lex1.word, lex2.word, lex3.word
from ngrams, lex as lex1, lex as lex2, lex as lex3 where
lex1.word = 'likes' and
lex1.pos like 'v%' and
lex1.wID = ngrams.w1 and
lex2.wID = ngrams.w2 and
lex3.wID = ngrams.w3
group by lex1.word, lex2.word, lex3.word
order by sum(ngrams.freq) desc
example (top 10 entries here):
267 | likes | to | say
242 | likes | to | be
185 | likes | it | .
133 | likes | to | play
129 | likes | to | do
121 | likes | to | think
117 | likes | to | tell
113 | likes | to | call
110 | likes | to | talk
109 | likes | to | have
2. 2-grams for the lemma <like> as a verb, where the following word is a plural noun (NN2), grouped by word:
select sum(ngrams.freq), lex1.word, lex2.word
from ngrams, lex as lex1, lex as lex2 where
lex1.lemma = 'like' and lex1.pos like 'v%' and
lex1.wID = ngrams.w1 and
lex2.pos like 'nn2%' and
lex2.wID = ngrams.w2
group by lex1.word, lex2.word
order by sum(ngrams.freq) desc
example (top 10 entries here):
98 | like | things
79 | like | women
69 | like | men
51 | like | dogs
42 | like | kids
42 | liked | women
35 | like | girls
34 | like | children
31 | liked | things
28 | like | surprises
3. 3-grams where the first word is a lexical verb (VV%), the second word is an article (a%) or determiner (d%), and the third word belongs to one of the lemmas
('house','home'), grouped by lemma, and which occur at least five times:
select sum(ngrams.freq), lex1.lemma, lex2.lemma, lex3.lemma
from ngrams, lex as lex1, lex as lex2, lex as lex3 where
lex1.pos like ('vv%') and
lex1.wID = ngrams.w1 and
(lex2.pos like 'a%' or lex2.pos like 'd%') and
lex2.wID = ngrams.w2 and
lex3.lemma in ('house','home') and
lex3.wID = ngrams.w3
group by lex1.lemma, lex2.lemma, lex3.lemma
having sum(ngrams.freq) >= 5
order by sum(ngrams.freq) desc
example (top 10 entries here):
1232 | leave | the | house
874 | buy | a | house
441 | build | a | house
428 | buy | a | home
356 | find | a | home
345 | enter | the | house
331 | leave | their | home
322 | hit | a | home
292 | buy | the | house
291 | sell | the | house
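The "at least five times" condition in query 3 applies to the frequency summed over each lemma group, which in SQL is naturally written as a HAVING clause instead of a WHERE condition on individual rows. The sketch below runs that pattern on a handful of invented rows in an in-memory SQLite database. The wID values, PoS tags, and frequencies are made up for illustration, and the helper name lemma_trigrams is hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table lex (wID integer primary key, word text, lemma text, pos text);
    create table ngrams (freq integer, w1 integer, w2 integer, w3 integer);
""")
# Invented toy data: two word forms of the lemma "buy", plus one rarer verb.
con.executemany("insert into lex values (?, ?, ?, ?)", [
    (1, "buy", "buy", "vv0"),
    (2, "bought", "buy", "vvd"),
    (3, "a", "a", "at1"),
    (4, "house", "house", "nn1"),
    (5, "paint", "paint", "vv0"),
])
con.executemany("insert into ngrams values (?, ?, ?, ?)", [
    (4, 1, 3, 4),  # buy a house
    (3, 2, 3, 4),  # bought a house
    (2, 5, 3, 4),  # paint a house
])


def lemma_trigrams(con, min_freq):
    """Lexical verb + article/determiner + house/home, grouped by lemma."""
    sql = """
        select sum(ngrams.freq) as total, lex1.lemma, lex2.lemma, lex3.lemma
        from ngrams
        join lex as lex1 on lex1.wID = ngrams.w1
        join lex as lex2 on lex2.wID = ngrams.w2
        join lex as lex3 on lex3.wID = ngrams.w3
        where lex1.pos like 'vv%'
          and (lex2.pos like 'a%' or lex2.pos like 'd%')
          and lex3.lemma in ('house', 'home')
        group by lex1.lemma, lex2.lemma, lex3.lemma
        having sum(ngrams.freq) >= ?
        order by total desc
    """
    return con.execute(sql, (min_freq,)).fetchall()


if __name__ == "__main__":
    print(lemma_trigrams(con, 5))
```

In this toy data neither "buy a house" (4) nor "bought a house" (3) reaches the threshold alone, but the lemma group does (7), while "paint a house" (2) is filtered out. A WHERE condition on ngrams.freq would have discarded all three rows before grouping.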