In addition to the COCA- and COHA-based n-grams of English, we also have
n-grams for Portuguese, based on the 20 million words of text from the 1900s in
the 45-million-word Corpus do Português. Although the Spanish and Portuguese
n-grams are based on much smaller corpora than COCA and COHA, they are still
the only n-grams that we are aware of that are based on large, genre-balanced
corpora.
The following are small samples of the n-gram data, each of which includes the
50,000 most frequent n-grams (along with part of speech):
The approximate number of n-grams of each size is as follows:
2-grams | 3-grams | 4-grams | 5-grams |
2,600,000 | 6,200,000 | 6,900,000 | 5,700,000 |
The n-grams data for Portuguese is
available in two different formats:
Format | Explanation |
Words | All 2, 3, and 4-grams in the corpus, along with part of speech (as in the examples above). |
Databases | 2, 3, 4, 5-grams with words as unique integer values for each word form, as well as one "lexicon" file with information about each integer value: word form (+/- case sensitive), lemma, and part of speech. You then create SQL joins between the n-grams and the lexicon, as with the English files. This format is more complicated (you need to know SQL), but it allows much more powerful queries. |
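As a rough illustration of the kind of SQL join described for the Databases format, here is a minimal sketch using Python's built-in `sqlite3` module. The table names (`lexicon`, `ngrams2`), the column names (`word_id`, `word`, `lemma`, `pos`, `freq`, `w1`, `w2`), and the sample rows are all hypothetical, invented for this example; the actual files may be laid out differently.

```python
import sqlite3

# Hypothetical schema: the real table and column names may differ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE lexicon (
        word_id INTEGER PRIMARY KEY,  -- unique integer for each word form
        word    TEXT,
        lemma   TEXT,
        pos     TEXT
    );
    CREATE TABLE ngrams2 (
        freq INTEGER,  -- frequency of the 2-gram
        w1   INTEGER,  -- word_id of the first word
        w2   INTEGER   -- word_id of the second word
    );
""")

# Made-up illustrative rows, not real corpus data.
conn.executemany("INSERT INTO lexicon VALUES (?, ?, ?, ?)", [
    (1, "casa", "casa", "n"),
    (2, "casas", "casa", "n"),
    (3, "grande", "grande", "j"),
    (4, "grandes", "grande", "j"),
])
conn.executemany("INSERT INTO ngrams2 VALUES (?, ?, ?)", [
    (120, 1, 3),
    (45, 2, 4),
])


def bigrams_by_lemma(conn, lemma1):
    """Return (word1, word2, pos2, freq) for 2-grams whose first word has the given lemma."""
    query = """
        SELECT l1.word, l2.word, l2.pos, n.freq
        FROM ngrams2 AS n
        JOIN lexicon AS l1 ON l1.word_id = n.w1
        JOIN lexicon AS l2 ON l2.word_id = n.w2
        WHERE l1.lemma = ?
        ORDER BY n.freq DESC
    """
    return conn.execute(query, (lemma1,)).fetchall()


rows = bigrams_by_lemma(conn, "casa")
for row in rows:
    print(row)
```

Because each n-gram row stores integers rather than strings, the lexicon is joined once per word position. This is what makes lemma-based or part-of-speech-based queries possible without scanning the word forms themselves.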
Pricing for either format shown above is:
- $95: academic
- $195: commercial
To order, please email us at
mark.davies@corpusdata.org. We will send you a short
one-page NDA (non-disclosure agreement) for the desired product, and will then
send a request for payment via PayPal. For an academic license, the NDA you send
back must come from a university email account.
Thanks.