N-grams data

This page briefly describes the format of the n-grams, and how you might search them on your computer.

Most of the n-gram sets come as a ZIP file containing 26 different files, one for each letter A-Z. After unzipping the files, you'll probably want to concatenate (join) them into one big file. To do this on a Windows machine, open a command line window (Start / Run / cmd), change the directory ( cd ) to the folder with the files (e.g. cd c:\myNgrams\ ), and type:

copy *.txt nameOfFile.txt

(where "nameOfFile.txt" is the name you want to give it). Each file has a format like the following, with a tab between frequency, each word, and each part of speech:

freq	word1	word2	word3	PoS1	PoS2	PoS3
22	filled	with	fluid	vvn	iw	nn1
6	filled	with	flying	vvn	iw	jj
4	filled	with	foam	vvn	iw	nn1

In each case, the leftmost column is the frequency of the n-gram. This is followed by one column for each word (e.g. four columns in the 4-gram file). If the n-gram set that you're using has part of speech tags, there is one column on the right side for each of the "word" columns on the left side, in the same order. Finally, if you are using the COHA (historical English) files, the leftmost column will be the frequency in a particular decade, and the rightmost column (a number) will be the decade: 1-20 (1810s-2000s).
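To make the column layout concrete, here is a minimal Python sketch that splits one line into its parts. The names parse_ngram_line, NGram, has_pos, has_decade and decade_start_year are hypothetical (they are not part of the n-grams data), and the sketch assumes the columns are laid out exactly as described above.

```python
from typing import NamedTuple, Optional, Tuple


class NGram(NamedTuple):
    """One parsed line (hypothetical container, not part of the data set)."""
    freq: int
    words: Tuple[str, ...]
    tags: Tuple[str, ...]
    decade: Optional[int]


def parse_ngram_line(line, n, has_pos=True, has_decade=False):
    """Split a tab-delimited n-gram line into frequency, words, tags, decade.

    Assumed layout: freq, n words, then n part of speech tags (if has_pos),
    then a decade number 1-20 (if has_decade, i.e. the COHA files).
    """
    fields = line.rstrip("\r\n").split("\t")
    expected = 1 + n + (n if has_pos else 0) + (1 if has_decade else 0)
    if len(fields) != expected:
        raise ValueError(f"expected {expected} columns, found {len(fields)}")
    freq = int(fields[0])
    words = tuple(fields[1:1 + n])
    tags = tuple(fields[1 + n:1 + 2 * n]) if has_pos else ()
    decade = int(fields[-1]) if has_decade else None
    return NGram(freq, words, tags, decade)


def decade_start_year(decade):
    """Map the COHA decade number to its first year (1 is 1810, 20 is 2000)."""
    return 1800 + 10 * decade
```

For example, parse_ngram_line("22\tfilled\twith\tfluid\tvvn\tiw\tnn1", n=3) gives a frequency of 22, the three words, and the three tags, with no decade.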

Many of you will want to immediately take the files and import them into a robust relational database, like MySQL or SQL Server (that's what I'd do too). But you can also just search the files using a robust text editor. One that I like for Windows is the shareware program TextPad, which can easily handle files of 100 MB or more. One of the advantages of TextPad is that it supports regular expressions for advanced pattern matching. Note, though, that regular expressions in TextPad are a little different (read: weird), and lots of other programs do regular expressions in a more "regular" way.
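For the database route, here is a rough sketch of what an import might look like. It uses SQLite through Python's built-in sqlite3 module rather than MySQL or SQL Server, simply because it needs no server. The table name trigrams, the column names, and the function names are all hypothetical, and the sketch assumes a 3-gram file with part of speech columns (seven columns per line).

```python
import sqlite3


def rows_from_lines(lines):
    """Yield (freq, word1, word2, word3, pos1, pos2, pos3) tuples."""
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != 7:
            continue  # skip blank or malformed lines
        yield (int(fields[0]), *fields[1:])


def load_trigrams(lines, db_path=":memory:"):
    """Create a (hypothetical) trigrams table and fill it from the lines."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS trigrams ("
        "freq INTEGER, word1 TEXT, word2 TEXT, word3 TEXT, "
        "pos1 TEXT, pos2 TEXT, pos3 TEXT)"
    )
    conn.executemany(
        "INSERT INTO trigrams VALUES (?, ?, ?, ?, ?, ?, ?)",
        rows_from_lines(lines),
    )
    conn.commit()
    return conn
```

You could pass an open file to load_trigrams (a file object yields one line at a time) and then run ordinary SQL against it, for example selecting every row where pos3 starts with nn and ordering by freq.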

If you're using TextPad, you'll want to read their help files on regular expressions, but here are a few simple examples.

What you want to find	search	what it means
NOUN + NOUN	nn.\tnn.$	Two part of speech tags starting with nn (the code for nouns), each followed by one more character (since it could be nn1 = singular noun, nn2 = plural noun, etc.), separated by a tab, and then end of line
ed word + d word	^[[:digit:]]+\t[^\t]+ed\td	Beginning of line, followed by one or more digits (the frequency count), tab, a word ending in ed, tab, and a word starting with d
VERB the NOUN	^[[:digit:]]+\t[^\t]+\tthe\t[^\t]+\tvv0\t[^\t]+\tnn.*$	Beginning of line, followed by one or more digits (the frequency count), tab, the first word (i.e. any number of characters that aren't tabs), tab, the, tab, third word, tab, verb (base form), tab, any part of speech, tab, noun, end of line
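If you'd rather script the same searches than use a text editor, the patterns above carry over to Python's re module with one change: re does not support the POSIX class [[:digit:]], so the sketch below writes [0-9] instead. The names PATTERNS, search_lines and search_file are hypothetical, and the file encoding is an assumption you may need to adjust.

```python
import re

# The three searches from the table, adapted for Python's re module.
PATTERNS = {
    "noun_noun": re.compile(r"nn.\tnn.$"),
    "ed_then_d": re.compile(r"^[0-9]+\t[^\t]+ed\td"),
    "verb_the_noun": re.compile(
        r"^[0-9]+\t[^\t]+\tthe\t[^\t]+\tvv0\t[^\t]+\tnn.*$"
    ),
}


def search_lines(lines, pattern):
    """Yield every line (without its line ending) that the pattern matches."""
    for line in lines:
        line = line.rstrip("\r\n")
        if pattern.search(line):
            yield line


def search_file(path, pattern, encoding="utf-8"):
    """Stream a file line by line so large files don't have to fit in memory."""
    with open(path, encoding=encoding) as handle:
        yield from search_lines(handle, pattern)
```

For example, looping over search_file("nameOfFile.txt", PATTERNS["verb_the_noun"]) and printing each result would list the matching 3-grams from the concatenated file.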