N-grams data


In addition to the COCA and COHA-based n-grams of English, we also have n-grams for Spanish (based on the Corpus del Español) and Portuguese (based on the Corpus do Português). Although the Spanish and Portuguese n-grams are based on much smaller corpora than COCA and COHA, they are still the only n-grams that we are aware of that are based on large, genre-balanced corpora.

The following are small samples of the n-grams data, each of which includes the 50,000 most frequent n-grams (along with part of speech):

            2-grams         3-grams         4-grams         5-grams
Spanish     download (zip)  download (zip)  download (zip)  download (zip)
Portuguese  download (zip)  download (zip)  download (zip)  download (zip)

The following are the approximate numbers of n-grams for each language:

            2-grams    3-grams    4-grams    5-grams
Spanish     2,400,000  6,000,000  7,500,000  6,900,000
Portuguese  2,600,000  6,200,000  6,900,000  5,700,000

The n-grams data for Spanish and Portuguese is available in two different formats:

Format Explanation
Words

All 2, 3, and 4-grams in the corpus, along with part of speech (as in the examples above).

Databases

2, 3, 4, and 5-grams in which each word form is represented by a unique integer value, together with one "lexicon" file that gives information about each integer value: word form (+/- case sensitive), lemma, and part of speech. You then create SQL joins between the n-grams and the lexicon, as with the English files. This format is more complicated (you need to know SQL), but it allows much more powerful queries.
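
As a rough illustration of the kind of join described for the database format, here is a minimal sketch in Python using the built-in sqlite3 module. The table names (lexicon, bigrams), the column names (word_id, word, lemma, pos, freq, w1, w2), the part-of-speech labels, and the sample rows are all assumptions made up for this example; the actual files may be laid out differently, and any SQL database could be used instead of SQLite.

```python
import sqlite3

# Hypothetical schema: the real file layout and column names may differ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE lexicon (word_id INTEGER PRIMARY KEY, word TEXT, lemma TEXT, pos TEXT);
    CREATE TABLE bigrams (freq INTEGER, w1 INTEGER, w2 INTEGER);
""")

# Toy rows invented for illustration only (not real corpus frequencies or tags).
conn.executemany("INSERT INTO lexicon VALUES (?, ?, ?, ?)", [
    (1, "la", "el", "DET"),
    (2, "casa", "casa", "NOUN"),
    (3, "casas", "casa", "NOUN"),
    (4, "las", "el", "DET"),
])
conn.executemany("INSERT INTO bigrams VALUES (?, ?, ?)", [
    (120, 1, 2),
    (45, 4, 3),
])

def bigrams_with_lemma(conn, lemma):
    """Return (freq, word1, word2) for 2-grams whose second word has the given lemma."""
    query = """
        SELECT b.freq, l1.word, l2.word
        FROM bigrams AS b
        JOIN lexicon AS l1 ON l1.word_id = b.w1
        JOIN lexicon AS l2 ON l2.word_id = b.w2
        WHERE l2.lemma = ?
        ORDER BY b.freq DESC
    """
    return conn.execute(query, (lemma,)).fetchall()

rows = bigrams_with_lemma(conn, "casa")
for freq, w1, w2 in rows:
    print(freq, w1, w2)
```

Because the n-grams store only integers, a query by lemma or part of speech touches the lexicon once per word position; this is what makes searches such as "any form of a given lemma" possible, which the plain word lists do not support directly.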

Pricing for either format shown above, for either Spanish or Portuguese, is:

  • $95: academic

  • $195: commercial

To order, please email us at mark.davies@corpusdata.org. We will send you a short one-page NDA (non-disclosure agreement) for the desired product, and will then send a request for payment through PayPal. For an academic license, the NDA you send back must be sent from a university email account.

Thanks.