N-grams data



DOWNLOAD LIST OF ALL 485,179 TEXTS AND SUMMARY BY YEAR, GENRE, AND SUB-GENRE


The Corpus of Contemporary American English (COCA) is the only large, recent, genre-balanced corpus of English. It contains more than one billion words in 485,202 texts, including roughly 24 to 26 million words for each year from 1990 to 2019. For each year (and therefore overall as well), the corpus is divided almost evenly among six genres: TV and movie subtitles, spoken, fiction, popular magazines, newspapers, and academic journals.

YEAR          BLOG          WEB  TV / MOVIES       SPOKEN      FICTION     MAGAZINE    NEWSPAPER     ACADEMIC          TOTAL
ALL    125,496,215  129,899,426  128,013,334  127,396,916  119,505,292  127,352,014  122,959,393  120,988,348  1,001,610,938
1990             -            -    3,207,900    4,374,469    4,162,242    4,101,447    4,082,931    3,983,143     23,914,122
1991             -            -    3,379,151    4,316,898    4,192,646    4,209,838    4,104,806    4,051,046     24,256,376
1992             -            -    3,183,858    4,523,054    3,893,956    4,288,694    4,092,031    4,028,147     24,011,732
1993             -            -    3,785,924    4,487,978    3,921,244    4,254,351    4,153,070    4,150,671     24,755,231
1994             -            -    4,375,338    4,457,726    3,870,757    4,310,375    4,147,947    4,047,115     25,211,252
1995             -            -    5,006,966    4,548,602    3,846,412    4,314,737    4,122,703    4,016,371     25,857,786
1996             -            -    4,384,976    4,095,266    3,758,787    4,338,766    4,099,305    4,110,209     24,789,305
1997             -            -    4,380,670    3,904,996    3,617,741    4,368,917    4,153,906    4,420,786     24,849,013
1998             -            -    4,390,197    4,446,217    3,779,801    4,393,835    4,122,295    4,111,453     25,245,796
1999             -            -    4,381,144    4,445,564    4,154,537    4,391,146    4,107,423    4,023,282     25,505,095
2000             -            -    4,385,593    4,455,815    3,942,474    4,387,935    4,037,086    4,093,991     25,304,894
2001             -            -    4,389,164    4,026,240    3,894,789    4,298,636    4,072,447    3,965,654     24,648,931
2002             -            -    4,384,475    4,372,290    3,766,673    4,310,634    4,114,280    4,054,359     25,004,713
2003             -            -    4,386,799    4,445,270    4,125,039    4,332,708    4,056,245    4,047,802     25,395,866
2004             -            -    4,378,535    4,359,084    4,099,691    4,337,309    4,121,117    4,009,359     25,307,099
2005             -            -    4,382,594    4,438,877    4,101,737    4,364,776    4,124,225    3,925,927     25,340,141
2006             -            -    4,369,684    4,345,995    4,113,173    4,302,713    4,120,732    4,019,200     25,273,503
2007             -            -    4,384,406    3,914,424    4,063,116    4,225,511    4,002,299    4,303,993     24,895,756
2008             -            -    4,376,702    3,467,315    4,147,216    4,289,641    4,021,006    3,977,790     24,281,678
2009             -            -    4,360,676    3,942,512    4,072,580    3,972,290    3,956,523    3,975,128     24,281,718
2010             -            -    4,386,795    4,097,760    3,897,459    3,832,576    4,226,666    3,838,637     24,281,903
2011             -            -    4,366,464    4,706,635    4,165,068    4,194,966    3,941,853    4,474,072     25,851,069
2012             -            -    4,379,595    4,411,281    3,862,889    4,306,912    4,126,669    4,384,263     25,473,621
2013             -            -    4,379,396    3,986,106    4,256,880    4,190,854    4,106,654    3,559,748     24,481,651
2014             -            -    4,380,134    3,850,683    4,172,260    4,264,503    4,140,151    3,476,429     24,286,174
2015             -            -    4,377,018    3,980,660    4,218,823    4,205,807    4,108,436    3,638,406     24,531,165
2016             -            -    4,380,381    4,168,303    3,258,473    4,053,156    4,059,857    3,968,779     23,890,965
2017             -            -    4,384,822    4,225,248    3,940,337    4,212,809    4,154,518    4,052,435     24,972,186
2018             -            -    4,353,912    4,300,990    4,109,362    4,143,311    4,158,845    4,200,047     25,268,485
2019             -            -    4,350,065    4,300,658    4,099,130    4,152,861    4,123,367    4,080,106     25,108,206
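
As a quick check on how balanced the genres are overall, the summary row of the table can be turned into percentage shares. The following is a minimal sketch; the dictionary keys and the function name `genre_shares` are illustrative choices, not identifiers from any COCA tool, and the word counts are copied from the summary row above.

```python
# Overall word counts per genre, copied from the summary row of the table.
# The key names are illustrative labels, not official COCA identifiers.
GENRE_WORDS = {
    "blog": 125_496_215,
    "web": 129_899_426,
    "tv_movies": 128_013_334,
    "spoken": 127_396_916,
    "fiction": 119_505_292,
    "magazine": 127_352_014,
    "newspaper": 122_959_393,
    "academic": 120_988_348,
}


def genre_shares(words_by_genre):
    """Return each genre's fraction of the combined word count."""
    total = sum(words_by_genre.values())
    return {genre: count / total for genre, count in words_by_genre.items()}


if __name__ == "__main__":
    # Print one line per genre with its share of the whole corpus.
    for genre, share in genre_shares(GENRE_WORDS).items():
        print(f"{genre:<10} {share:6.1%}")
```

With eight genres, a perfectly even split would give each one 12.5% of the corpus; the shares computed here all fall within about a percentage point of that figure.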

The texts come from a variety of sources:

  • TV/Movies subtitles: (128 million words [128,013,334]) These come from the American part of the TV and Movies corpora. These subtitles are as informal as (or more informal than) actual spoken data. The texts were taken from the OpenSubtitles collection. Where there were multiple subtitle files for a given TV episode (which was the norm), we used the file ranked highest for accuracy in the OpenSubtitles ratings.

  • Spoken: (127 million words [127,396,916]) Transcripts of unscripted conversation from more than 150 different TV and radio programs (examples: All Things Considered (NPR), Newshour (PBS), Good Morning America (ABC), Today Show (NBC), 60 Minutes (CBS), Hannity and Colmes (Fox), Jerry Springer, etc.).

  • Fiction: (120 million words [119,505,292]) Short stories and plays from literary magazines, children’s magazines, and popular magazines; first chapters of first-edition books (1990-present); and movie scripts.

  • Popular Magazines: (127 million words [127,352,014]) Nearly 100 different magazines, with a good mix (overall, and by year) among specific domains (news, health, home and gardening, women, financial, religion, sports, etc.). Examples include Time, Men’s Health, Good Housekeeping, Cosmopolitan, Fortune, Christian Century, and Sports Illustrated.

  • Newspapers: (123 million words [122,959,393]) Ten newspapers from across the US, including USA Today, New York Times, Atlanta Journal Constitution, and San Francisco Chronicle. In most cases, there is a good mix of different sections of the newspaper, such as local news, opinion, sports, and financial.

  • Academic Journals: (121 million words [120,988,348]) Nearly 100 different peer-reviewed journals. These were selected to cover the entire range of the Library of Congress classification system (for example, a certain percentage from B (philosophy, psychology, religion), D (world history), L (education), T (technology), and so on), both overall and by number of words per year.

  • Blogs: (125 million words [125,496,215]) These texts are a subset of the texts from the United States in the GloWbE corpus. When they were collected (October 2012), Google allowed searches to be restricted to blogs, so nearly all of these texts are actually blogs.

  • Web pages: (130 million words [129,899,426]) These texts are a subset of the "General" texts from the United States in the GloWbE corpus. Some of them are actually blogs, because Google offered no way to search "NOT blogs" when the texts were collected.
    -- Both the blogs and the general web pages were subsequently categorized by Serge Sharoff, so that in COCA you can limit searches to a particular web genre.
    -- Note that these web and blog texts were all collected in Oct 2012, so they are a "snapshot" of these genres rather than year-by-year data (as above). As a result, they are not included in the "historical" data used to compare frequency across decades or years. All historical data (for each year 1990-2019) comes from the other six genres listed above.
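
Because historical comparisons draw only on the six year-by-year genres, raw counts from different years are best compared after normalizing by the size of those six genres in each year. The sketch below shows one way to do this. The year sizes are summed from the six genre counts in the table for 1990 and 2019; the function names and the raw hit counts in the usage lines are hypothetical, chosen only for illustration.

```python
# The six genres that carry year-by-year data (blogs and web pages excluded).
HISTORICAL_GENRES = (
    "tv_movies", "spoken", "fiction", "magazine", "newspaper", "academic",
)

# Word counts copied from the 1990 and 2019 rows of the table.
WORDS_BY_YEAR = {
    1990: {"tv_movies": 3_207_900, "spoken": 4_374_469, "fiction": 4_162_242,
           "magazine": 4_101_447, "newspaper": 4_082_931, "academic": 3_983_143},
    2019: {"tv_movies": 4_350_065, "spoken": 4_300_658, "fiction": 4_099_130,
           "magazine": 4_152_861, "newspaper": 4_123_367, "academic": 4_080_106},
}


def year_size(year):
    """Total words in the six historical genres for one year."""
    return sum(WORDS_BY_YEAR[year][genre] for genre in HISTORICAL_GENRES)


def per_million(raw_count, year):
    """Convert a raw hit count for a year into hits per million words."""
    return raw_count / year_size(year) * 1_000_000


if __name__ == "__main__":
    # Hypothetical raw counts for some word, for illustration only.
    hypothetical_hits = {1990: 1_200, 2019: 1_500}
    for year, hits in hypothetical_hits.items():
        print(year, f"{per_million(hits, year):.2f} per million")
```

Summing the genre columns directly, rather than hard-coding a yearly total, keeps the denominator tied to exactly the genres being compared.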