Culturomics

From the Wall Street Journal:

Can physicists produce insights about language that have eluded linguists and English professors? That possibility was put to the test this week when a team of physicists published a paper drawing on Google’s massive collection of scanned books. They claim to have identified universal laws governing the birth, life course and death of words.

The paper marks an advance in a new field dubbed “Culturomics”: the application of data-crunching to subjects typically considered part of the humanities. Last year a group of social scientists and evolutionary theorists, plus the Google Books team, showed off the kinds of things that could be done with Google’s data, which include the contents of five-million-plus books, dating back to 1800.

Published in Science, that paper gave the best-yet estimate of the true number of words in English—a million, far more than any dictionary has recorded (the 2002 Webster’s Third New International Dictionary has 348,000). More than half of the language, the authors wrote, is “dark matter” that has evaded standard dictionaries.

The paper also tracked word usage through time (each year, for instance, 1% of the world’s English-speaking population switches from “sneaked” to “snuck”). It also showed that we seem to be putting history behind us more quickly, judging by the speed with which terms fall out of use. References to the year “1880” dropped by half in the 32 years after that date, while the half-life of “1973” was a mere decade.

In the new paper, Alexander Petersen, Joel Tenenbaum and their co-authors looked at the ebb and flow of word usage across various fields. “All these different words are battling it out against synonyms, variant spellings and related words,” says Mr. Tenenbaum. “It’s an inherently competitive, evolutionary environment.”

When the scientists analyzed the data, they found striking patterns not just in English but also in Spanish and Hebrew. There has been, the authors say, a “dramatic shift in the birth rate and death rates of words”: Deaths have increased and births have slowed.

English continues to grow: the 2011 Culturomics paper suggested a rate of 8,500 new words a year. The new paper, however, says that the growth rate is slowing. Partly because the language is already so rich, the “marginal utility” of new words is declining: Existing things are already well described. This led the authors to a related finding: The words that do manage to be born now become more popular than new words once did, possibly because they describe something genuinely new (think “iPod,” “Internet,” “Twitter”).

Higher death rates for words, the authors say, are largely a matter of homogenization. The explorer William Clark (of Lewis & Clark) spelled “Sioux” 27 different ways in his journals (“Sieoux,” “Seaux,” “Souixx,” etc.), and several of those variants would have made it into 19th-century books. Today spell-checking programs and vigilant copy editors choke off such chaotic variety much more quickly, in effect speeding up the natural selection of words. (The database does not include the world of text- and Twitter-speak, so some of the verbal chaos may just have shifted online.)
