This page briefly describes the format
of the n-grams, and how you might search them on your computer.
Most of the n-grams are in a ZIP file
containing 26 different files, one for each letter A-Z. After unzipping the
files, you'll probably want to concatenate (join) them
into one big file. To do this on a Windows machine, open a command
line window (Start / Run / cmd), change the directory (cd)
to the folder with the files (e.g. cd c:\myNgrams\), and type:
copy *.txt nameOfFile.txt
(where "nameOfFile.txt" is the name you want to
give it). Each file has a format like the following, with a tab between
the frequency, each word, and each part of speech:
freq | word1 | word2 | word3 | PoS1 | PoS2 | PoS3
22 | filled | with | fluid | vvn | iw | nn1
6 | filled | with | flying | vvn | iw | jj
4 | filled | with | foam | vvn | iw | nn1
In each case, the leftmost column is the frequency
of the n-gram. This is followed by one column for each word (i.e. four columns
in the 4-gram file). If the n-gram set that you're using includes part of speech tags,
there is one tag column on the right side for each of the "word" columns on the left
side, in the same order. Finally, if you are using the
COHA (historical English) files, the leftmost
column will be the frequency in a particular decade, and the rightmost column (a
number) will be the decade: 1-20 (1810s-2000s).
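As an illustration of the column layout just described, here is a minimal Python sketch that splits one line into its parts. The function name parse_ngram_line and its two flags are hypothetical, and the sketch assumes that when a file has both part of speech columns and a COHA decade column, the decade comes last.

```python
def parse_ngram_line(line, has_pos=True, has_decade=False):
    """Split one tab-delimited n-gram line into (freq, words, tags, decade)."""
    fields = line.rstrip("\r\n").split("\t")
    freq = int(fields[0])          # leftmost column: frequency
    rest = fields[1:]
    decade = None
    if has_decade:
        # COHA files: the rightmost column is the decade number (1-20).
        decade = int(rest[-1])
        rest = rest[:-1]
    if has_pos:
        # One tag column for each word column, in the same order.
        n = len(rest) // 2
        words, tags = rest[:n], rest[n:]
    else:
        words, tags = rest, []
    return freq, words, tags, decade


def decade_start_year(decade):
    """Map a COHA decade number to its first year: 1 -> 1810, 20 -> 2000."""
    return 1800 + 10 * decade


print(parse_ngram_line("22\tfilled\twith\tfluid\tvvn\tiw\tnn1"))
# (22, ['filled', 'with', 'fluid'], ['vvn', 'iw', 'nn1'], None)
```

Because the number of word columns is inferred from the line itself, the same function should work for 2-gram through 5-gram files without changes.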
Many of you will want to immediately take the files
and import them into a robust relational database, like MySQL or SQL Server
(that's what I'd do too). But you can also just search the files using a robust
text editor. One that I like for Windows is the shareware program
TextPad, which can easily
handle even files 100 MB or more in size. One of the advantages of TextPad is
that it does regular expressions, for advanced pattern matching. Note, though, that
regular expressions are a little different (read: weird) in TextPad, and lots of
other programs do regular expressions in a more "regular" way.
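For the database route mentioned above, here is a minimal sketch of an import using Python's built-in sqlite3 module. SQLite is a substitution for illustration only (it needs no server), not MySQL or SQL Server; the table name ngrams, the column names, and the function load_3grams are all hypothetical, and the sketch assumes a 3-gram file with part of speech columns (seven tab-separated fields per line).

```python
import sqlite3


def load_3grams(lines, db_path=":memory:"):
    """Load tab-delimited 3-gram lines (with PoS tags) into a SQLite table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ngrams ("
        "freq INTEGER, word1 TEXT, word2 TEXT, word3 TEXT, "
        "pos1 TEXT, pos2 TEXT, pos3 TEXT)"
    )
    rows = []
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != 7:
            continue  # skip blank or malformed lines
        rows.append((int(fields[0]), *fields[1:]))
    conn.executemany("INSERT INTO ngrams VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn


# The three sample lines from the format table above.
sample = [
    "22\tfilled\twith\tfluid\tvvn\tiw\tnn1",
    "6\tfilled\twith\tflying\tvvn\tiw\tjj",
    "4\tfilled\twith\tfoam\tvvn\tiw\tnn1",
]
conn = load_3grams(sample)

# 3-grams whose third word is tagged as a noun, most frequent first.
query = "SELECT word3, freq FROM ngrams WHERE pos3 LIKE 'nn%' ORDER BY freq DESC"
for word3, freq in conn.execute(query):
    print(word3, freq)
```

Since load_3grams accepts any iterable of lines, an open file object can be passed in place of the sample list, along with a file path for db_path to keep the database on disk.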
If you're using TextPad, you'll want to read their
help files on regular expressions, but here are a few simple examples.
What you want to find | Search | What it means
NOUN + NOUN | nn.*\tnn.*$ | Two part of speech tags starting with nn (the code for nouns), each followed by anything else (since it could be nn1 = singular noun, nn2 = plural noun, etc.), separated by a tab, and then end of line
*ed word + d* word | ^[[:digit:]]+\t[^\t]+ed\td | Beginning of line, followed by one or more digits (the frequency count), tab, a word ending in ed, tab, and a word starting with d
VERB the NOUN | ^[[:digit:]]+\t[^\t]+\tthe\t[^\t]+\tvv0\t[^\t]+\tnn.*$ | Beginning of line, followed by one or more digits (the frequency count), tab, the first word (i.e. any number of characters that aren't tabs), tab, the, tab, third word, tab, verb (base form), tab, any part of speech, tab, noun, end of line
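If you would rather run these searches from a script than from TextPad, the same three patterns can be sketched in Python. Python's re module does not support the [[:digit:]] class, so [0-9] is used instead, and every word and tag column in the third pattern is written as [^\t]+. The sample lines and their frequencies are invented purely for illustration, and the names PATTERNS and matching_labels are hypothetical.

```python
import re

# Python equivalents of the three searches in the table above.
PATTERNS = {
    "NOUN + NOUN": re.compile(r"nn.*\tnn.*$"),
    "*ed word + d* word": re.compile(r"^[0-9]+\t[^\t]+ed\td"),
    "VERB the NOUN": re.compile(
        r"^[0-9]+\t[^\t]+\tthe\t[^\t]+\tvv0\t[^\t]+\tnn.*$"
    ),
}


def matching_labels(line):
    """Return the labels of all patterns that match one n-gram line."""
    line = line.rstrip("\r\n")  # so that $ lines up with the last column
    return [label for label, pattern in PATTERNS.items() if pattern.search(line)]


# Invented sample lines (not taken from the real n-gram files).
samples = [
    "10\tcoffee\tcup\tnn1\tnn1",
    "7\twalked\tdown\tvvd\trp",
    "5\topen\tthe\tdoor\tvv0\tat\tnn1",
]
for line in samples:
    print(matching_labels(line), line)
```

To search a whole file, the same function could be applied to each line of an open file, printing only the lines for which the returned list is not empty.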