N-grams data


In addition to the COCA and COHA-based n-grams of English, we also have n-grams for Portuguese, based on the 20 million words of text from the 1900s in the 45-million-word Corpus do Português. Although the Spanish and Portuguese n-grams are based on much smaller corpora than COCA and COHA, they are still the only n-grams that we are aware of that are based on large, genre-balanced corpora.

The following are small samples of the n-grams data, each of which includes the 50,000 most frequent n-grams (along with part of speech):
 
2-grams           3-grams           4-grams           5-grams
download (zip)    download (zip)    download (zip)    download (zip)

The following are the approximate numbers of n-grams:

2-grams      3-grams      4-grams      5-grams
2,600,000    6,200,000    6,900,000    5,700,000

The n-grams data for Portuguese is available in two different formats:

Format Explanation
Words

All 2, 3, and 4-grams in the corpus, along with part of speech (as in the examples above).

Databases

2, 3, 4, and 5-grams in which each word form is stored as a unique integer value, as well as one "lexicon" file with information about each integer value: word form (+/- case sensitive), lemma, and part of speech. You then create SQL joins between the n-grams and the lexicon, as with the English files. This format is more complicated (you need to know SQL), but it lets you create much more powerful queries.
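As a rough illustration of the kind of join described above, here is a minimal sketch in Python using the standard sqlite3 module. The table names (lexicon, ngrams2), the column names (word_id, word, lemma, pos, freq, w1, w2), the part-of-speech labels, and the frequencies are all hypothetical and chosen only for this example; the actual files may be laid out differently, so adjust the names to match what you receive.

```python
import sqlite3

# Hypothetical schema: every table and column name here is an assumption
# made for illustration, not the layout of the distributed files.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lexicon (
    word_id INTEGER PRIMARY KEY,  -- the integer used in the n-grams tables
    word    TEXT,                 -- word form
    lemma   TEXT,
    pos     TEXT                  -- part of speech
);
CREATE TABLE ngrams2 (
    freq INTEGER,                 -- frequency of the 2-gram
    w1   INTEGER REFERENCES lexicon(word_id),
    w2   INTEGER REFERENCES lexicon(word_id)
);
""")

# Toy rows with made-up frequencies, just so the query has something to return.
conn.executemany("INSERT INTO lexicon VALUES (?, ?, ?, ?)", [
    (1, "a", "o", "art"),
    (2, "casa", "casa", "n"),
    (3, "casas", "casa", "n"),
    (4, "as", "o", "art"),
])
conn.executemany("INSERT INTO ngrams2 VALUES (?, ?, ?)", [
    (120, 1, 2),
    (45, 4, 3),
])


def bigrams_with_second_lemma(conn, lemma):
    """Return (freq, word1, word2) for 2-grams whose second word has this lemma."""
    query = """
        SELECT n.freq, l1.word, l2.word
        FROM ngrams2 AS n
        JOIN lexicon AS l1 ON l1.word_id = n.w1
        JOIN lexicon AS l2 ON l2.word_id = n.w2
        WHERE l2.lemma = ?
        ORDER BY n.freq DESC
    """
    return conn.execute(query, (lemma,)).fetchall()


rows = bigrams_with_second_lemma(conn, "casa")
for freq, w1, w2 in rows:
    print(freq, w1, w2)
```

The lexicon is joined once per word position, so a 3, 4, or 5-gram query would add one more join for each extra word column. Because the lookup is by lemma rather than word form, one query matches both "casa" and "casas", which is the kind of search the plain word lists do not support directly.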

The price for either format shown above is:

  • $95: academic

  • $195: commercial

To order, please email us at corpus@byu.edu. We will send you a short one-page NDA (non-disclosure agreement) for the desired product, and will then send a request for payment from PayPal. For an academic license, the NDA you send back must come from a university email account.

Thanks.