Quantitative Analysis of Culture Using Millions of Digitized Books Science PDF
Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Ma
Summary
A research article on culturomics that analyzes cultural trends reflected in the English language between 1800 and 2000, using a vast corpus of digitized books. It examines linguistic change, cultural phenomena, lexicography, and history.
RESEARCH ARTICLE

Quantitative Analysis of Culture Using Millions of Digitized Books

Jean-Baptiste Michel,1,2,3,4,5*† Yuan Kui Shen,2,6,7 Aviva Presser Aiden,2,6,8 Adrian Veres,2,6,9 Matthew K. Gray,10 The Google Books Team,10 Joseph P. Pickett,11 Dale Hoiberg,12 Dan Clancy,10 Peter Norvig,10 Jon Orwant,10 Steven Pinker,5 Martin A. Nowak,1,13,14 Erez Lieberman Aiden1,2,6,14,15,16,17*†

We constructed a corpus of digitized texts containing about 4% of all books ever printed. Analysis of this corpus enables us to investigate cultural trends quantitatively. We survey the vast terrain of ‘culturomics,’ focusing on linguistic and cultural phenomena that were reflected in the English language between 1800 and 2000. We show how this approach can provide insights about fields as diverse as lexicography, the evolution of grammar, collective memory, the adoption of technology, the pursuit of fame, censorship, and historical epidemiology. Culturomics extends the boundaries of rigorous quantitative inquiry to a wide array of new phenomena spanning the social sciences and the humanities.

1Program for Evolutionary Dynamics, Harvard University, Cambridge, MA 02138, USA. 2Cultural Observatory, Harvard University, Cambridge, MA 02138, USA. 3Institute for Quantitative Social Sciences, Harvard University, Cambridge, MA 02138, USA. 4Department of Psychology, Harvard University, Cambridge, MA 02138, USA. 5Department of Systems Biology, Harvard Medical School, Boston, MA 02115, USA. 6Laboratory-at-Large, Harvard University, Cambridge, MA 02138, USA. 7Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology (MIT), Cambridge, MA 02139, USA. 8Harvard Medical School, Boston, MA, 02115, USA. 9Harvard College, Cambridge, MA 02138, USA. 10Google, Mountain View, CA 94043, USA. 11Houghton Mifflin Harcourt, Boston, MA 02116, USA. 12Encyclopaedia Britannica, Chicago, IL 60654, USA. 13Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA 02138, USA. 14Department of Mathematics, Harvard University, Cambridge, MA 02138, USA. 15Broad Institute of Harvard and MIT, Harvard University, Cambridge, MA 02138, USA. 16School of Engineering and Applied Sciences, Harvard University, Cambridge, MA 02138, USA. 17Harvard Society of Fellows, Harvard University, Cambridge, MA 02138, USA.

*These authors contributed equally to this work.
†To whom correspondence should be addressed. E-mail: [email protected] (J.-B.M.); [email protected] (E.L.A.)

Science, Vol. 331, 14 January 2011, pp. 176–182 (www.sciencemag.org)
Downloaded from https://www.science.org on February 03, 2025

Reading small collections of carefully chosen works enables scholars to make powerful inferences about trends in human thought. However, this approach rarely enables precise measurement of the underlying phenomena. Attempts to introduce quantitative methods into the study of culture (1–6) have been hampered by the lack of suitable data.

We report the creation of a corpus of 5,195,769 digitized books containing ~4% of all books ever published. Computational analysis of this corpus enables us to observe cultural trends and subject them to quantitative investigation. ‘Culturomics’ extends the boundaries of scientific inquiry to a wide array of new phenomena.

The corpus has emerged from Google’s effort to digitize books. Most books were drawn from over 40 university libraries around the world. Each page was scanned with custom equipment (7), and the text was digitized by means of optical character recognition (OCR). Additional volumes, both physical and digital, were contributed by publishers. Metadata describing the date and place of publication were provided by the libraries and publishers and supplemented with bibliographic databases. Over 15 million books have been digitized [~12% of all books ever published (7)]. We selected a subset of over 5 million books for analysis on the basis of the quality of their OCR and metadata (Fig. 1A and fig. S1) (7). Periodicals were excluded.

The resulting corpus contains over 500 billion words, in English (361 billion), French (45 billion), Spanish (45 billion), German (37 billion), Chinese (13 billion), Russian (35 billion), and Hebrew (2 billion). The oldest works were published in the 1500s. The early decades are represented by only a few books per year, comprising several hundred thousand words. By 1800, the corpus grows to 98 million words per year; by 1900, 1.8 billion; and by 2000, 11 billion (fig. S2).

The corpus cannot be read by a human. If you tried to read only English-language entries from the year 2000 alone, at the reasonable pace of 200 words/min, without interruptions for food or sleep, it would take 80 years. The sequence of letters is 1000 times longer than the human genome: If you wrote it out in a straight line, it would reach to the Moon and back 10 times over (8).

To make release of the data possible in light of copyright constraints, we restricted this initial study to the question of how often a given 1-gram or n-gram was used over time. A 1-gram is a string of characters uninterrupted by a space; this includes words (“banana”, “SCUBA”) but also numbers (“3.14159”) and typos (“excesss”). An n-gram is a sequence of 1-grams, such as the phrases “stock market” (a 2-gram) and “the United States of America” (a 5-gram). We restricted n to 5 and limited our study to n-grams occurring at least 40 times in the corpus.

Usage frequency is computed by dividing the number of instances of the n-gram in a given year by the total number of words in the corpus in that year. For instance, in 1861, the 1-gram “slavery” appeared in the corpus 21,460 times, on 11,687 pages of 1208 books. The corpus contains 386,434,758 words from 1861; thus, the frequency is 5.5 × 10^-5. The use of “slavery” peaked during the Civil War (early 1860s) and then again during the civil rights movement (1955–1968) (Fig. 1B).

In contrast, we compare the frequency of “the Great War” to the frequencies of “World War I” and “World War II”. References to “the Great War” peak between 1915 and 1941. But although its frequency drops thereafter, interest in the underlying events had not disappeared; instead, they are referred to as “World War I” (Fig. 1C).

Fig. 1. Culturomic analyses study millions of books at once. (A) Top row: Authors have been writing for millennia; ~129 million book editions have been published since the advent of the printing press (upper left). Second row: Libraries and publishing houses provide books to Google for scanning (middle left). Over 15 million books have been digitized. Third row: Each book is associated with metadata. Five million books are chosen for computational analysis (bottom left). Bottom row: A culturomic time line shows the frequency of “apple” in English books over time (1800–2000). (B) Usage frequency of “slavery”. The Civil War (1861–1865) and the civil rights movement (1955–1968) are highlighted in red. The number in the upper left (1e-4 = 10^-4) is the unit of frequency. (C) Usage frequency over time for “the Great War” (blue), “World War I” (green), and “World War II” (red).

These examples highlight two central factors that contribute to culturomic trends. Cultural change guides the concepts we discuss (such as “slavery”). Linguistic change, which, of course, has cultural roots, affects the words we use for those concepts (“the Great War” versus “World War I”). In this paper, we examine both linguistic changes, such as changes in the lexicon and grammar, and cultural phenomena, such as how we remember people and events.

The full data set, which comprises over two billion culturomic trajectories, is available for download or exploration at www.culturomics.org and ngrams.googlelabs.com.

The size of the English lexicon. How many words are in the English language (9)?

We call a 1-gram “common” if its frequency is greater than one per billion. [This corresponds to the frequency of the words listed in leading dictionaries (7) (fig. S3).] We compiled a list of all common 1-grams in 1900, 1950, and 2000, based on the frequency of each 1-gram in the preceding decade. These lists contained 1,117,997 common 1-grams in 1900, 1,102,920 in 1950, and 1,489,337 in 2000.

Not all common 1-grams are English words. Many fell into three nonword categories: (i) 1-grams with nonalphabetic characters (“l8r”, “3.14159”), (ii) misspellings (“becuase”, “abberation”), and (iii) foreign words (“sensitivo”).

To estimate the number of English words, we manually annotated random samples from the lists of common 1-grams (7) and determined what fraction were members of the above nonword categories. The result ranged from 51% of all common 1-grams in 1900 to 31% in 2000.

Using this technique, we estimated the number of words in the English lexicon as 544,000 in 1900, 597,000 in 1950, and 1,022,000 in 2000. The lexicon is enjoying a period of enormous growth: The addition of ~8500 words/year has increased the size of the language by over 70% during the past 50 years (Fig. 2A).

Notably, we found more words than appear in any dictionary. For instance, the 2002 Webster’s Third New International Dictionary (W3), which keeps track of the contemporary American lexicon, lists approximately 348,000 single-word wordforms (10); the American Heritage Dictionary of the English Language, Fourth Edition (AHD4) lists 116,161 (11). (Both contain additional multiword entries.) Part of this gap is because dictionaries often exclude proper nouns (fig. S4) and compound words (“whalewatching”). Even accounting for these factors, we found many undocumented words, such as “aridification” (the process by which a geographic region becomes dry), “slenthem” (a musical instrument), and, appropriately, the word “deletable.”

This gap between dictionaries and the lexicon results from a balance that every dictionary must strike: It must be comprehensive enough to be a useful reference but concise enough to be printed, shipped, and used. As such, many infrequent words are omitted. To gauge how well dictionaries reflect the lexicon, we ordered our year-2000 lexicon by frequency, divided it into eight deciles (ranging from 10^-9 to 10^-8, to 10^-2 to 10^-1) and sampled each decile (7). We manually checked how many sample words were listed in the Oxford English Dictionary (OED) (12) and in the Merriam-Webster Unabridged Dictionary (MWD). (We excluded proper nouns, because neither the OED nor MWD lists them.) Both dictionaries had excellent coverage of high-frequency words but less coverage for frequencies below 10^-6: 67% of words in the 10^-9 to 10^-8 range were listed in neither dictionary (Fig. 2B). Consistent with Zipf’s famous law, a large fraction of the words in our lexicon (63%) were in this lowest-frequency bin. As a result, we estimated that 52% of the English lexicon, the majority of the words used in English books, consists of lexical “dark matter” undocumented in standard references (12).

To keep up with the lexicon, dictionaries are updated regularly (13). We examined how well these changes corresponded with changes in actual usage by studying the 2077 1-gram headwords added to AHD4 in 2000. The overall frequency of these words, such as “buckyball” and “netiquette”, has soared since 1950: Two-thirds exhibited recent sharp increases in frequency (>2× from 1950 to 2000) (Fig. 2C). Nevertheless, there was a lag between lexicographers and the lexicon. Over half the words added to AHD4 were part of the English lexicon a century ago (frequency >10^-9 from 1890 to 1900). In fact, some newly added words, such as “gypseous” and “amplidyne”, have already undergone a steep decline in frequency (Fig. 2D).

Not only must lexicographers avoid adding words that have fallen out of fashion, they must also weed obsolete words from earlier editions. This is an imperfect process. We found 2220 obsolete 1-gram headwords (“diestock”, “alkalescent”) in AHD4. Their mean frequency declined throughout the 20th century and dipped below 10^-9 decades ago (Fig. 2D, inset).

Our results suggest that culturomic tools will aid lexicographers in at least two ways: (i) finding low-frequency words that they do not list, and (ii) providing accurate estimates of current frequency trends to reduce the lag between changes in the lexicon and changes in the dictionary.

The evolution of grammar. Next, we examined grammatical trends. We studied the English irregular verbs, a classic model of grammatical change (14–17). Unlike regular verbs, whose past tense is generated by adding -ed (jump/jumped), irregular verbs are conjugated idiosyncratically (stick/stuck, come/came, get/got) (15).

All irregular verbs coexist with regular competitors (e.g., “strived” and “strove”) that threaten to supplant them (Fig. 2E and fig. S5). High-frequency irregulars, which are more readily remembered, hold their ground better. For instance, we found “found” (frequency: 5 × 10^-4) 200,000 times more often than we finded “finded.” In contrast, “dwelt” (frequency: 1 × 10^-5) dwelt in our data only 60 times as often as “dwelled” dwelled. We defined a verb’s “regularity” as the percentage of instances in the past tense (i.e., the sum of “drived”, “drove”, and “driven”) in which the regular form is used. Most irregulars have been stable for the past 200 years, but 16% underwent a change in regularity of 10% or more (Fig. 2F).

These changes occurred slowly: It took 200 years for our fastest-moving verb (“chide”) to go from 10% to 90%. Otherwise, each trajectory was sui generis; we observed no characteristic shape. For instance, a few verbs, such as “spill”, regularized at a constant speed, but others, such as “thrive” and “dig”, transitioned in fits and starts (7). In some cases, the trajectory suggested a reason for the trend. For example, with “sped/speeded” the shift in meaning from “to move rapidly” and toward “to exceed the legal limit” appears to have been the driving cause (Fig. 2G).

Six verbs (burn, chide, smell, spell, spill, and thrive) regularized between 1800 and 2000 (Fig. 2F). Four are remnants of a now-defunct phonological process that used -t instead of -ed; they are members of a pack of irregulars that survived by virtue of similarity (bend/bent, build/built, burn/burnt, learn/learnt, lend/lent, rend/rent, send/sent, smell/smelt, spell/spelt, spill/spilt, and spoil/spoilt). Verbs have been defecting from this coalition for centuries (wend/went, pen/pent, gird/girt, geld/gelt, and gild/gilt all blend/blent into the dominant -ed rule). Culturomic analysis reveals that the collapse of this alliance has been the most significant driver of regularization in the past 200 years. The regularization of burnt, smelt, spelt, and spilt originated in the United States; the forms still cling to life in British English (Fig. 2, E and F). But the -t irregulars may be doomed in England too. Each year, a population the size of Cambridge adopts “burned” in lieu of “burnt”.

Although irregulars generally yield to regulars, two verbs did the opposite: light/lit and wake/woke. Both were irregular in Middle English, were mostly regular by 1800, and subsequently backtracked and are irregular again today. The fact that these verbs have been going back and forth for nearly 500 years highlights the gradual nature of the underlying process.

Still, there was at least one instance of rapid progress by an irregular form. Presently, 1% of the English-speaking population switches from “sneaked” to “snuck” every year. Someone will have snuck off while you read this sentence. As before, this trend is more prominent in the United States but recently sneaked across the Atlantic: America is the world’s leading exporter of both regular and irregular verbs.

Fig. 2. Culturomics has profound consequences for the study of language, lexicography, and grammar. (A) The size of the English lexicon over time. Tick marks show the number of single words in three dictionaries (see text). (B) Fraction of words in the lexicon that appear in two different dictionaries as a function of usage frequency. (C) Five words added by the AHD in its 2000 update. Inset: Median frequency of new words added to AHD4 in 2000. The frequency of half of these words exceeded 10^-9 as far back as 1890 (white dot). (D) Obsolete words added to AHD4 in 2000. Inset: Mean frequency of the 2220 AHD headwords whose current usage frequency is less than 10^-9. (E) Usage frequency of irregular verbs (red) and their regular counterparts (blue). Some verbs (chide/chided) have regularized during the past two centuries. The trajectories for “speeded” and “speed up” (green) are similar, reflecting the role of semantic factors in this instance of regularization. The verb “burn” first regularized in the United States (U.S. flag) and later in the United Kingdom (UK flag). The irregular “snuck” is rapidly gaining on “sneaked”. (F) Scatterplot of the irregular verbs; each verb’s position depends on its regularity (see text) in the early 19th century (x coordinate) and in the late 20th century (y coordinate). For 16% of the verbs, the change in regularity was greater than 10% (large font). Dashed lines separate irregular verbs (regularity < 50%) from regular verbs (regularity > 50%). Six verbs became regular (upper left quadrant, blue), whereas two became irregular (lower right quadrant, red). Inset: The regularity of “chide” over time. (G) Median regularity of verbs whose past tense is often signified with a -t suffix instead of -ed (burn, smell, spell, spill, dwell, learn, and spoil) in U.S. (black) and UK (gray) books.

Out with the old. Just as individuals forget the past (18, 19), so do societies (20) (fig. S6). To quantify this effect, we reasoned that the frequency of 1-grams such as “1951” could be used to measure interest in the events of the corresponding year, and we created plots for each year between 1875 and 1975.

The plots had a characteristic shape. For example, “1951” was rarely discussed until the years immediately preceding 1951. Its frequency soared in 1951, remained high for 3 years, and then underwent a rapid decay, dropping by half over the next 15 years. Finally, the plots enter a regime marked by slower forgetting: Collective memory has both a short-term and a long-term component.

But there have been changes. The amplitude of the plots is rising every year: Precise dates are increasingly common. There is also a greater focus on the present. For instance, “1880” declined to half its peak value in 1912, a lag of 32 years. In contrast, “1973” declined to half its peak by 1983, a lag of only 10 years. We are forgetting our past faster with each passing year (Fig. 3A).

We were curious whether our increasing tendency to forget the old was accompanied by more rapid assimilation of the new (21). We divided a list of 147 inventions into time-resolved cohorts based on the 40-year interval in which they were first invented (1800–1840, 1840–1880, and 1880–1920) (7). We tracked the frequency of each invention in the nth year after it was invented as compared to its maximum value and plotted the median of these rescaled trajectories for each cohort.

The inventions from the earliest cohort (1800–1840) took over 66 years from invention to widespread impact (frequency >25% of peak). Since then, the cultural adoption of technology has become more rapid. The 1840–1880 invention cohort was widely adopted within 50 years; the 1880–1920 cohort within 27 (Fig. 3B and fig. S7).

Fig. 3. Cultural turnover is accelerating. (A) We forget: frequency of “1883” (blue), “1910” (green), and “1950” (red). Inset: We forget faster. The half-life of the curves (gray dots) is getting shorter (gray line: moving average). (B) Cultural adoption is quicker. Median trajectory for three cohorts of inventions from three different time periods (1800–1840, blue; 1840–1880, green; 1880–1920, red). Inset: The telephone (green; date of invention, green arrow) and radio (blue; date of invention, blue arrow). (C) Fame of various personalities born between 1920 and 1930. (D) Frequency of the 50 most famous people born in 1871 (gray lines; median, thick dark gray line). Five examples are highlighted. (E) The median trajectory of the 1865 cohort is characterized by four parameters: (i) initial age of celebrity (34 years old, tick mark); (ii) doubling time of the subsequent rise to fame (4 years, blue line); (iii) age of peak celebrity (70 years after birth, tick mark), and (iv) half-life of the post-peak forgetting phase (73 years, red line). Inset: The doubling time and half-life over time. (F) The median trajectory of the 25 most famous personalities born between 1800 and 1920 in various careers.

“In the future, everyone will be famous for 7.5 minutes” – Whatshisname. People, too, rise to prominence, only to be forgotten (22). Fame can be [...]

[...] famous people born in that year. For example, the 1882 cohort includes “Virginia Woolf” and “Felix Frankfurter”; the 1946 cohort includes “Bill Clinton” and “Steven Spielberg”. We plotted the median frequency for the names in each cohort over time (Fig. 3, D and E). The resulting trajectories were all similar. Each cohort had a pre-celebrity period (median frequency [...]

Fame comes sooner and rises faster. Between the early 19th century and the mid-20th century, the age of initial celebrity declined from 43 to 29 years, and the doubling time fell from 8.1 to 3.3 years. As a result, the most famous people alive today are more famous, in books, than their predecessors. Yet this fame is increasingly short- [...]

[...] 5). In German, the distribution is much wider, and skewed to the left: Suppression in Nazi Germany was not the exception, but the rule (Fig. 4F). At the far left, 9.8% of individuals showed strong suppression (s < 1/5). This population is highly enriched in documented victims of repression, such as Pablo Picasso (s = 0.12), the Bauhaus architect Walter Gropius (s = 0.16), and Hermann Maas (s < 0.01), an influential Protestant minister who helped many Jews flee (7). (Maas was later recognized by Israel’s Yad Vashem as one of the “Righteous Among the Nations.”) At the other extreme, 1.5% of the population exhibited a dramatic rise (s > 5). This subpopulation is highly enriched in Nazis and Nazi-supporters, who benefited immensely from government propaganda (7).

These results provide a strategy for rapidly identifying likely victims of censorship from a large pool of possibilities, and highlight how culturomic methods might complement existing historical approaches.

Culturomics. Culturomics is the application of high-throughput data collection and analysis to the study of human culture. Books are a beginning, but we must also incorporate newspapers (29), manuscripts (30), maps (31), artwork (32), and a myriad of other human creations (33, 34). Of course, many voices, already lost to time, lie forever beyond our reach.

Culturomic results are a new type of evidence in the humanities. As with fossils of ancient creatures, the challenge of culturomics lies in the interpretation of this evidence. Considerations of space restrict us to the briefest of surveys: a handful of trajectories and our initial interpretations. Many more fossils (Fig. 5 and fig. S13), with shapes no less intriguing, beckon:

(i) Peaks in “influenza” correspond with dates of known pandemics, suggesting the value of culturomic methods for historical epidemiology (35) (Fig. 5A and fig. S14).
(ii) Trajectories for “the North”, “the South”, and finally “the enemy” reflect how polarization of the states preceded the descent into the Civil War (Fig. 5B).
(iii) In the battle of the sexes, the “women” are gaining ground on the “men” (Fig. 5C).
(iv) “féminisme” made early inroads in France, but the United States proved to be a more fertile environment in the long run (Fig. 5D).
(v) “Galileo”, “Darwin”, and “Einstein” may be well-known scientists, but “Freud” is more deeply ingrained in our collective subconscious (Fig. 5E).
(vi) Interest in “evolution” was waning when “DNA” came along (Fig. 5F).
(vii) The history of the American diet offers many appetizing opportunities for future research; the menu includes “steak”, “sausage”, “ice cream”, “hamburger”, “pizza”, “pasta”, and “sushi” (Fig. 5G).
(viii) “God” is not dead but needs a new publicist (Fig. 5H).

These, together with the billions of other trajectories that accompany them, will furnish a great cache of bones from which to reconstruct the skeleton of a new science.

Fig. 5. Culturomics provides quantitative evidence for scholars in many fields. (A) Historical epidemiology: “influenza” is shown in blue; the Russian, Spanish, and Asian flu epidemics are highlighted. (B) History of the Civil War. (C) Comparative history. (D) Gender studies. (E and F) History of science. (G) Historical gastronomy. (H) History of religion: “God”.

References and Notes
1. E. O. Wilson, Consilience (Knopf, New York, 1998).
2. D. Sperber, Man (London) 20, 73 (1985).
3. S. Lieberson, J. Horwich, Sociol. Methodol. 38, 1 (2008).
4. L. L. Cavalli-Sforza, W. Marcus, X. Feldman, Cultural Transmission and Evolution (Princeton Univ. Press, Princeton, NJ, 1981).
5. P. Niyogi, The Computational Nature of Language Learning and Evolution (MIT, Cambridge, MA, 2006).
6. G. K. Zipf, The Psycho-biology of Language (Houghton Mifflin, Boston, 1935).
7. Materials and methods are available as supporting material on Science Online.
8. E. S. Lander et al.; International Human Genome Sequencing Consortium, Nature 409, 860 (2001).
9. A. W. Read, Am. Speech 8, 10 (1933).
10. Webster's Third New International Dictionary of the English Language, Unabridged, P. B. Gove, Ed. (Merriam-Webster, Springfield, MA, 1993).
11. The American Heritage Dictionary of the English Language, Fourth Edition, J. P. Pickett, Ed. (Houghton Mifflin, Boston, 2000).
12. Oxford English Dictionary, J. A. Simpson, E. S. C. Weiner, M. Proffitt, Eds. (Clarendon, Oxford, 1993).
13. J. Algeo, A. S. Algeo, Fifty Years among the New Words: A Dictionary of Neologisms, 1941–1991 (Cambridge Univ. Press, Cambridge, 1991).
14. S. Pinker, Words and Rules (Basic Books, New York, 1999).
15. Anthony S. Kroch, Language Variation Change 1, 199 (1989).
16. J. L. Bybee, Language 82, 711 (2006).
17. E. Lieberman, J. B. Michel, J. Jackson, T. Tang, M. A. Nowak, Nature 449, 713 (2007).
18. B. Milner, L. R. Squire, E. R. Kandel, Neuron 20, 445 (1998).
19. H. Ebbinghaus, Memory: A Contribution to Experimental Psychology (Dover Publications, New York, 1987).
20. M. Halbwachs, On Collective Memory, Lewis A. Coser, transl. (Univ. of Chicago Press, Chicago, 1992).
21. S. Ulam, Bull. Am. Math. Soc. 64, 1 (1958).
22. L. Braudy, The Frenzy of Renown: Fame & Its History (Vintage Books, New York, 1997).
23. Wikipedia, 23 August 2010, www.wikipedia.org/.
24. Encyclopaedia Britannica, D. Hoiberg, Ed. (Encyclopaedia Britannica, Chicago, 2002).
25. Censorship: 500 Years of Conflict, V. Gregorian, Ed. (New York Public Library, New York, 1984).
26. W. Treß, Wider Den Undeutschen Geist: Bücherverbrennung 1933 (Parthas, Berlin, 2003).
27. G. Sauder, Die Bücherverbrennung: 10. Mai 1933 (Ullstein, Frankfurt am Main, Germany, 1985).
28. S. Barron, P. W. Guenther, Degenerate Art: The Fate of the Avant-garde in Nazi Germany (Los Angeles County Museum of Art, Los Angeles, 1991).
29. Google News Archive Search, http://news.google.com/archivesearch.
30. Digital Scriptorium, www.scriptorium.columbia.edu.
31. Visual Eyes, www.viseyes.org.
32. ARTstor, www.artstor.org.
33. Europeana, www.europeana.eu.
34. Hathi Trust Digital Library, www.hathitrust.org.
35. J. M. Barry, The Great Influenza: The Epic Story of the Deadliest Plague in History (Viking Press, New York, 2004).
36. J.-B.M. was supported by the Foundational Questions in Evolutionary Biology Prize Fellowship and the Systems Biology Program (Harvard Medical School). Y.K.S. was supported by internships at Google. S.P. acknowledges support from NIH grant HD 18381. E.L.A. was supported by the Harvard Society of Fellows, the Fannie and John Hertz Foundation Graduate Fellowship, a National Defense Science and Engineering Graduate Fellowship, an NSF Graduate Fellowship, the National Space Biomedical Research Institute, and National Human Genome Research Institute grant T32 HG002295. This work was supported by a Google Research Award. The Program for Evolutionary Dynamics acknowledges support from the Templeton Foundation, NIH grant R01GM078986, and the Bill and Melinda Gates Foundation. Some of the methods described in this paper are covered by U.S. patents 7463772 and 7508978. We are grateful to D. Bloomberg, A. Popat, M. McCormick, T. Mitchison, U. Alon, S. Shieber, E. Lander, R. Nagpal, J. Fruchter, J. Guldi, J. Cauz, C. Cole, P. Bordalo, N. Christakis, C. Rosenberg, M. Liberman, J. Scheidlower, B. Zimmer, R. Darnton, and A. Spector for discussions; to C.-M. Hetrea and K. Sen for assistance with Encyclopaedia Britannica's database; to S. Eismann, W. Treß, and the City of Berlin Web site (berlin.de) for assistance in documenting victims of Nazi censorship; to C. Lazell and G. T. Fournier for assistance with annotation; to M. Lopez for assistance with Fig. 1; to G. Elbaz and W. Gilbert for reviewing an early draft; and to Google’s library partners and every author who has ever picked up a pen, for books.

Supporting Online Material
www.sciencemag.org/cgi/content/full/science.1199644/DC1
Materials and Methods
Figs. S1 to S19
References

27 October 2010; accepted 6 December 2010
Published online 16 December 2010; 10.1126/science.1199644
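
Worked example. The two calculations the article describes in words, the usage frequency of “slavery” in 1861 and the estimate of lexicon size from the count of common 1-grams, can be sketched as below. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the nonword fractions are the rounded percentages quoted in the text, so the lexicon estimates only approximate the reported figures.

```python
def usage_frequency(ngram_count: int, total_words: int) -> float:
    """Instances of an n-gram in a year divided by all corpus words in that year."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return ngram_count / total_words


def estimate_lexicon_size(common_1grams: int, nonword_fraction: float) -> float:
    """Scale the number of common 1-grams by the share that are not nonwords.

    nonword_fraction is the share of sampled 1-grams annotated as nonwords
    (nonalphabetic strings, misspellings, foreign words). Assumption: the
    rounded percentages from the text are used here, not the exact values.
    """
    if not 0.0 <= nonword_fraction <= 1.0:
        raise ValueError("nonword_fraction must lie between 0 and 1")
    return common_1grams * (1.0 - nonword_fraction)


if __name__ == "__main__":
    # "slavery" in 1861: 21,460 instances among 386,434,758 words.
    print(f"{usage_frequency(21_460, 386_434_758):.2e}")
    # Common 1-grams in 1900 (51% nonwords) and in 2000 (31% nonwords).
    print(round(estimate_lexicon_size(1_117_997, 0.51)))
    print(round(estimate_lexicon_size(1_489_337, 0.31)))
```

With the rounded percentages, the estimates come out near 548,000 for 1900 and 1,028,000 for 2000, close to the 544,000 and 1,022,000 reported in the text; the difference presumably reflects rounding of the published nonword fractions.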