N-grams data

In addition to the COCA- and COHA-based n-grams of English, we also have n-grams for Portuguese, based on the 20 million words of text from the 1900s in the 45-million-word Corpus do Português. Although the Portuguese n-grams (like the Spanish n-grams) are based on much smaller corpora than COCA and COHA, they are still the only n-grams that we are aware of that are based on large, genre-balanced corpora.

The following are small samples of the n-grams data, each of which includes the 50,000 most frequent n-grams (along with part of speech):

2-grams	3-grams	4-grams	5-grams
download (zip)	download (zip)	download (zip)	download (zip)

The following are the approximate numbers of n-grams of each length:

2-grams	3-grams	4-grams	5-grams
2,600,000	6,200,000	6,900,000	5,700,000

The n-grams data for Portuguese is available in two different formats:

Format	Explanation
Words	All 2, 3, and 4-grams in the corpus, along with part of speech (as in the examples above).
Databases	2, 3, 4, 5-grams in which each word form is stored as a unique integer value, as well as one "lexicon" file with information about each integer value: word form (+/- case sensitive), lemma, and part of speech. You then create SQL joins between the n-grams and the lexicon, as with the English files. This format is more complicated (you need to know SQL), but it allows much more powerful queries.
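
To give a rough idea of how the Databases format described above can be queried, here is a minimal sketch in Python using the built-in sqlite3 module. The table names (lexicon, ngrams2), the column names (word_id, word, lemma, pos, freq, w1, w2), the part-of-speech labels, and the sample rows are all assumptions made up for illustration; the actual files may use different names and layouts, and the frequencies shown are not real corpus data.

```python
import sqlite3

# Hypothetical schema: names and columns are assumptions, not the real file layout.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lexicon (
    word_id INTEGER PRIMARY KEY,  -- the integer used in the n-gram tables
    word    TEXT,                 -- word form
    lemma   TEXT,
    pos     TEXT                  -- part of speech
);
CREATE TABLE ngrams2 (
    freq INTEGER,                 -- frequency of the 2-gram
    w1   INTEGER,                 -- integer ID of the first word
    w2   INTEGER                  -- integer ID of the second word
);
""")

# Invented toy rows, only so that the query below has something to return.
conn.executemany("INSERT INTO lexicon VALUES (?, ?, ?, ?)", [
    (1, "casa", "casa", "n"),
    (2, "casas", "casa", "n"),
    (3, "grande", "grande", "adj"),
    (4, "grandes", "grande", "adj"),
    (5, "a", "o", "art"),
])
conn.executemany("INSERT INTO ngrams2 VALUES (?, ?, ?)", [
    (120, 1, 3),  # casa grande
    (45, 2, 4),   # casas grandes
    (300, 5, 1),  # a casa
])


def bigrams_by_first_lemma(connection, lemma):
    """Return (word1, word2, freq) for 2-grams whose first word has the given lemma."""
    sql = """
        SELECT l1.word, l2.word, n.freq
        FROM ngrams2 AS n
        JOIN lexicon AS l1 ON l1.word_id = n.w1
        JOIN lexicon AS l2 ON l2.word_id = n.w2
        WHERE l1.lemma = ?
        ORDER BY n.freq DESC
    """
    return connection.execute(sql, (lemma,)).fetchall()


rows = bigrams_by_first_lemma(conn, "casa")
for word1, word2, freq in rows:
    print(word1, word2, freq)
```

The lexicon is joined once per position in the n-gram, so a 3-gram query would add a third join on a w3 column. Searching by lemma or part of speech in this way is what the plain Words format cannot do as easily.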

Pricing for either format shown above is:

$95: academic
$195: commercial

To order, please email us at mark.davies@corpusdata.org. We will send you a short one-page NDA (non-disclosure agreement) for the desired product, and will then send a payment request through PayPal. For an academic license, the NDA must be returned from a university email account.

Thanks.