N-grams data

Most of the information at this website deals with data from the COCA corpus. You might also be interested in the n-grams data from the 14 billion word iWeb corpus.

These n-grams are based on the largest publicly available, genre-balanced corpus of English -- the one-billion-word Corpus of Contemporary American English (COCA). With this n-grams data (2-, 3-, 4-, and 5-word sequences, each with its frequency), you can carry out powerful queries offline -- without needing to access the corpus via the web interface.

frequency	word1	word2	word3
31891	much	of	the
13261	much	of	a
8000	much	more	than
7396	much	as	i
5650	much	the	same
5633	much	of	it
4229	much	better	than
4191	much	as	the
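
As a minimal sketch of working with data in this shape offline, the Python below reads rows like the ones in the sample above and keeps the 3-grams that begin with "much of". It assumes a tab-separated text file with a header row and the frequency in the first column, as in the sample; the actual files may be laid out differently, and the `read_ngrams` helper is hypothetical, not part of the data package.

```python
import csv
import io

# Rows copied from the sample table above. The tab-separated layout with a
# header row is an assumption based on that sample, not a documented format.
SAMPLE = (
    "frequency\tword1\tword2\tword3\n"
    "31891\tmuch\tof\tthe\n"
    "13261\tmuch\tof\ta\n"
    "8000\tmuch\tmore\tthan\n"
    "7396\tmuch\tas\ti\n"
    "5650\tmuch\tthe\tsame\n"
    "5633\tmuch\tof\tit\n"
    "4229\tmuch\tbetter\tthan\n"
    "4191\tmuch\tas\tthe\n"
)


def read_ngrams(handle):
    """Yield (frequency, words) pairs from a tab-separated n-grams stream.

    Hypothetical helper: skips one header row, then treats the first
    column as the frequency and the remaining columns as the words.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader, None)  # skip the assumed header row
    for row in reader:
        if not row:
            continue  # ignore blank lines
        yield int(row[0]), tuple(row[1:])


rows = list(read_ngrams(io.StringIO(SAMPLE)))

# Keep only the 3-grams whose first two words are "much of".
much_of = [(freq, words) for freq, words in rows if words[:2] == ("much", "of")]

for freq, words in much_of:
    print(freq, " ".join(words))
```

For a real file, the `io.StringIO(SAMPLE)` argument would be replaced by an open file handle, for example `open(path, encoding="utf-8", newline="")`, where `path` is wherever the purchased data was saved.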

A few more example queries, from among the many that are possible:

NOUN + NOUN sequences
three word strings with a preposition in the middle position
VERB + the + NOUN sequences
two word strings, where the words begin or end with certain letters
like + word + word
(potential) phrasal verb: VERB + ADV particle
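
Queries of this kind can be sketched as slot-by-slot pattern matching over the n-gram rows. The Python below is an illustrative sketch only: it works on word forms, because the column layout of the part-of-speech format is not shown here, so patterns such as NOUN + NOUN are not covered. The `query` and `slot_matches` helpers are hypothetical names, and the rows are the sample 3-grams shown earlier.

```python
from typing import Callable, List, Sequence, Tuple, Union

# A slot is None (any word), a literal word, or a predicate on the word.
Slot = Union[None, str, Callable[[str], bool]]
Ngram = Tuple[int, Tuple[str, ...]]

# Sample 3-grams with their frequencies, taken from the table shown earlier.
NGRAMS: List[Ngram] = [
    (31891, ("much", "of", "the")),
    (13261, ("much", "of", "a")),
    (8000, ("much", "more", "than")),
    (7396, ("much", "as", "i")),
    (5650, ("much", "the", "same")),
    (5633, ("much", "of", "it")),
    (4229, ("much", "better", "than")),
    (4191, ("much", "as", "the")),
]


def slot_matches(slot: Slot, word: str) -> bool:
    """Return True if a single word satisfies a single pattern slot."""
    if slot is None:
        return True  # wildcard
    if callable(slot):
        return bool(slot(word))  # letter-based or other custom test
    return slot == word  # literal word


def query(ngrams: Sequence[Ngram], pattern: Sequence[Slot]) -> List[Ngram]:
    """Return n-grams matching the pattern, most frequent first."""
    hits = [
        (freq, words)
        for freq, words in ngrams
        if len(words) == len(pattern)
        and all(slot_matches(s, w) for s, w in zip(pattern, words))
    ]
    return sorted(hits, key=lambda hit: hit[0], reverse=True)


# "much + word + than": a fixed first and last word with a free middle slot.
much_x_than = query(NGRAMS, ("much", None, "than"))

# Strings whose last word begins with certain letters (here "th").
ends_th = query(NGRAMS, ("much", None, lambda w: w.startswith("th")))

for freq, words in much_x_than:
    print(freq, " ".join(words))
```

The same pattern shape extends to "like + word + word" by passing `("like", None, None)`. With the part-of-speech format, a predicate could test a tag instead of the word form, assuming the tags are available alongside each word.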

The data is available in three different formats, and when you purchase the data you have access to all three. The numbers in the table show how many million entries each format contains for each n-grams set.

Type	Data	2-grams	3-grams	4-grams	5-grams
1	Words	8.2 m	16.3 m	13.1 m	6.2 m
2	Words + part of speech	11.6 m	28.5 m	28.2 m	17.4 m
db	Database: integer values + lexicon	13.5 m	18.9 m	27.1 m	16.2 m

You might also be interested in the frequency of single words (including frequency by genre and sub-genre) or in collocates (all words "nearby" a given word).