N-grams data
Corpus of Historical American English

Overview
We have had a problem with companies downloading the COHA data for free and then using it in their programs. As a result, the data for the 1990s-2000s is omitted by default from the 1-grams files (single words). If you are a researcher at a university, send us an email from your academic email address and we will update things so that you can download 1-grams lists that include the 1990s-2000s as well.

Please note that there are textual errors in COHA. It would be impossible to have a completely "clean" 400 million word historical corpus (especially when the corpus was created essentially by one person). You will find words with typos (SO (=50), tlus (=thus), somctimes, etc.), words that are incorrectly fused together (whois, cansay), and other problems. Counted as types (unique forms) in the n-grams files, these errors may seem numerous, but counted as tokens (total number of words affected), they are very few. Based on informal tests that we have done, the texts are on average about 99.85% accurate, resulting in one error about every 500-1000 words.

The question is whether you would rather have an almost perfectly clean 1,000,000 word corpus that can only be used for a very small range of studies (mainly high frequency syntactic phenomena), or a corpus with a small number of errors that is large enough to be used for a wide range of studies. We have chosen the latter.

Finally, if this data results in a publication, please cite the data as follows, and please enter information in our publications database. Thank you.