In an attempt to preserve endangered languages, UC Berkeley scientists create a computer program that reconstructs ancient tongues using their modern descendants.
One of my favorite things about watching old movies is hearing how people might have spoken in eras past — the expressions they used, their old-school smack talk. But what did the languages from thousands of years back sound like? Hollywood, as far as I know, has yet to make a movie in which characters talk in authentic proto-Austronesian.
The program, a linguistic time machine of sorts, can quickly crunch data on some of the earliest-known “proto-languages” that gave rise to modern languages such as Hawaiian, Javanese, Malay, Tagalog, and others spoken in Southeast Asia, parts of continental Asia, Australasia, and the Pacific.
The speed and scale of the work are key here, as proto-languages are typically reconstructed through a slow and painstaking manual process that involves comparing two or more languages that descend from a shared ancestor.
“What excites me about this system is that it takes so many of the great ideas that linguists have had about historical reconstruction, and it automates them at a new scale: more data, more words, more languages, but less time,” says Dan Klein, an associate professor of computer science at UC Berkeley and co-author of a paper on the system published this week in the journal Proceedings of the National Academy of Sciences.
The program relies on an algorithm known as the Markov chain Monte Carlo sampler to examine hundreds of modern Austronesian languages for words of common ancestry from thousands of years back. The computational model is based on the established linguistic theory that words evolve as if along the branches of a genealogical tree.
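The Berkeley model is far richer than anything that fits in a blog post, but the core idea of Markov chain Monte Carlo (propose candidate ancestral forms, keep them in proportion to how well they explain the observed descendants) can be sketched in a few lines of Python. This is a toy illustration with a made-up penalty, a tiny alphabet, and equal-length cognates assumed; the names and scoring here are invented for the example, not taken from the paper:

```python
import math
import random

ALPHABET = "aeiouptkmns"  # tiny illustrative sound inventory

def log_score(ancestor, descendants, change_penalty=2.0):
    # Each character the ancestor shares with a descendant is "free";
    # each mismatch (an implied sound change) costs a fixed penalty.
    # Assumes all words have equal length.
    return -change_penalty * sum(
        a != d for word in descendants for a, d in zip(ancestor, word)
    )

def propose(word):
    # Resample one random position of the candidate ancestor.
    i = random.randrange(len(word))
    return word[:i] + random.choice(ALPHABET) + word[i + 1:]

def sample_ancestor(descendants, steps=5000, seed=0):
    random.seed(seed)
    current = descendants[0]
    cur_ll = log_score(current, descendants)
    best, best_ll = current, cur_ll
    for _ in range(steps):
        cand = propose(current)
        cand_ll = log_score(cand, descendants)
        # Metropolis rule: always accept improvements, sometimes accept
        # worse moves so the chain can escape local optima.
        if cand_ll >= cur_ll or random.random() < math.exp(cand_ll - cur_ll):
            current, cur_ll = cand, cand_ll
            if cur_ll > best_ll:
                best, best_ll = current, cur_ll
    return best  # return the best-scoring ancestor seen

# With cognates "mata", "mata", "maka", the best-explaining ancestor is "mata".
print(sample_ancestor(["mata", "mata", "maka"]))  # mata
```

The real system does much more, modeling phoneme-level sound changes in context along every branch of the Austronesian family tree, but the sampling principle is the same.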
In their paper (if you want more specifics on how the researchers did things like associate a context vector to each phoneme, read this PDF), the team details the process of reconstructing more than 600 proto-Austronesian languages from an existing database of more than 140,000 words.
They say more than 85 percent of the automated reconstructions come within a single character of those produced manually by a linguist.
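"Within a character" can be made precise with Levenshtein edit distance, the standard count of single-character insertions, deletions, and substitutions separating two strings. The helper below is a generic illustration of that check, not the paper's actual evaluation code:

```python
def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def within_one_character(automatic, manual):
    # True if the automated reconstruction differs from the manual
    # one by at most one insertion, deletion, or substitution.
    return edit_distance(automatic, manual) <= 1

print(within_one_character("lima", "lime"))  # True
print(within_one_character("lima", "rua"))   # False
```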
Linguists needn’t worry about competition from the computer, however. “This really is not an attempt to put historical linguists out of a job,” Klein tells CNET. “It’s not like chess where we’re seeing if we can one-up the humans.
“It would take a very long time for people to be able to cross-reference every language and every word that we feed into the program,” Klein adds. “Of course in principle a human could do that, but in practice they just won’t. Humans are not well-suited to crunching all of this data. Humans are well-suited to other things.”
They can, for example, examine linguistic influences a computer can't yet grok: the sociological and geographical factors behind morphological changes that turn, say, "cat" into "kitty cat," or the way spelling errors reveal sound confusions in languages that are also written.
Implications for future-speak?
“It’s the same way that astronomers haven’t been replaced by computerized telescopes,” Klein says. “We hope that linguists can use tools like ours to see further into the past — and more of it, too.”
Clues from the past, yes. But the work may also be able to extrapolate how language might change in the future, according to Tom Griffiths, associate professor of psychology, director of UC Berkeley’s Computational Cognitive Science Lab, and another co-author of the paper.
Which of course raises the question: what will Twitter-speak look like 6,000 years from now?
About Leslie Katz
Leslie Katz, senior editor of CNET's Crave, covers gadgets, games, and myriad other digital distractions. As a co-host of the now-retired CNET News Daily Podcast, she was sometimes known to channel Terry Gross and still uses her trained "podcast voice" to bully the speech recognition software on automated customer service lines.