
Mouse Lamb Mouse Uncle Goat

(Preliminary Note: Watson Wins Against Humans on Jeopardy! Read more in the April column, including "random" selection of Jeopardy categories and controversy about whether Watson had an unfair advantage in being able to buzz in quicker than the human players.)

Last month's column was about artificial intelligence techniques in cryptography. Many of the same principles and techniques apply to cracking the "code" of lost languages and to translating from one living language to another.

A foreign language can be impenetrable. One of the few military codes never broken was the Navajo code used by the U.S. in World War II. The Navajo language has no link to any European or Asian languages and uses complex verbs that incorporate adverbs, reflect whether the speaker personally experienced what he or she is talking about, and take endings that depend on the physical category their object belongs to (such as long, granular, viscous or bundled). The Marine Corps embedded native Navajo speakers in combat battalions. These "code talkers" translated communiques into Navajo and then transmitted the encoded message vocally by radio to a recipient code talker who translated it into English. The code used specially-designated Navajo words for military jargon; for instance, "whale" for "battleship" and "hummingbird" for "fighter plane". A few words, like place names, had to be spelled out. The title of this month's column is the English translation of the Navajo code for "MLMUG". In Navajo, it would be "Na-as-tso-si Dibeh-yazzi Na-as-tso-si Shi-da Klizzie".
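The letter-by-letter spelling described above can be sketched as a simple lookup table. This is only an illustration: the table holds just the four letters this column spells out, the names CODE_WORDS and spell_out are invented for the example, and the full wartime code is not reproduced here.

```python
# Letter-to-code-word table, limited to the four letters spelled out in this column.
# Each entry is (Navajo code word, English meaning). The names are hypothetical.
CODE_WORDS = {
    "M": ("Na-as-tso-si", "Mouse"),
    "L": ("Dibeh-yazzi", "Lamb"),
    "U": ("Shi-da", "Uncle"),
    "G": ("Klizzie", "Goat"),
}


def spell_out(word):
    """Return (navajo, english) spellings of word, one code word per letter."""
    pairs = [CODE_WORDS[letter] for letter in word.upper()]
    navajo = " ".join(pair[0] for pair in pairs)
    english = " ".join(pair[1] for pair in pairs)
    return navajo, english


navajo, english = spell_out("MLMUG")
print(navajo)   # Na-as-tso-si Dibeh-yazzi Na-as-tso-si Shi-da Klizzie
print(english)  # Mouse Lamb Mouse Uncle Goat
```

A letter outside the table raises a KeyError, which is the honest behavior for a table this small.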

The Rosetta Stone is an example of a cryptographic "crib," a key word or phrase that has a known decryption or translation. The Rosetta Stone contains the same text in Egyptian hieroglyphics, Egyptian demotic (a simplified script for everyday use) and Greek. Its discovery in 1799 led to the deciphering of Egyptian hieroglyphics, the knowledge of which had been lost for over fourteen centuries.

Deciphering a lost language is harder than decrypting a secret message in a known language because we usually have no idea of the content or context of the ancient words. Letter and word frequency analysis is of limited use because so few examples survive. We may not even know whether the writing system is alphabetic (like English), syllabic (like Cherokee) or logosyllabic (like Mayan and Chinese). The assumption that Egyptian hieroglyphs were pictures because the Egyptians were too "primitive" for phonetic writing unduly delayed their translation even after the discovery of the Rosetta Stone. Another strongly held false assumption delayed the decipherment of Linear B, a script found primarily on clay tablets from archaeological excavations in Crete. Many scholars insisted that Linear B could not be a different writing system (in this case, a syllabic script) for spoken Greek. Completing the understanding of Linear B required the efforts of numerous scholar-fanatics, including one who was able to extrapolate backwards from the Greek of Homer using linguistic principles of how pronunciation, grammar and word usage tend to change in a language over time.

Several lost or only partially deciphered ancient languages and scripts remain to be translated; among them, Iberian, the script of the Phaistos Disc (found in Crete), Indus and Etruscan. Last year, archaeologists announced the discovery of a 400-year-old document that may include the only known written words from a lost Peruvian language, possibly Quingnam or Pescadora, two languages known to have been spoken but now lost.

We humans now have the assistance of AI to help decipher lost languages. A U.S.-Indian team recently measured the randomness in the small number of Indus fragments available. The Indus people lived in what today is eastern Pakistan and northwestern India from 2600 to 1900 B.C.E. The computer analysis compared the statistical patterns in sequences of Indus symbols to the regularities in spoken languages and to the patterns of nonlinguistic systems such as DNA and a computer programming language. The team found that the randomness in Indus falls squarely within the range of spoken languages, contradicting the widely held view that Indus symbols are just ornaments or basic symbols (like street sign arrows). Analyzing Indus for its grammatical rules is the next step.
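One way to put a number on the "randomness" of a symbol sequence is conditional entropy: how uncertain the next symbol is once you know the previous one. The sketch below is an assumption about how such a measurement could look, not the team's actual program; the function name and the sample strings are made up for illustration.

```python
import math
from collections import Counter


def conditional_entropy(sequence):
    """Estimate H(next symbol | previous symbol) in bits from adjacent pairs."""
    pairs = list(zip(sequence, sequence[1:]))
    pair_counts = Counter(pairs)
    first_counts = Counter(first for first, _ in pairs)
    total = len(pairs)
    entropy = 0.0
    for (first, _second), count in pair_counts.items():
        p_pair = count / total                    # P(previous, next)
        p_next_given_prev = count / first_counts[first]  # P(next | previous)
        entropy -= p_pair * math.log2(p_next_given_prev)
    return entropy


# A rigid sequence: each symbol fully determines the one that follows it.
rigid = "ABABABABAB"
# A looser toy sequence: after either symbol, A and B are equally likely.
loose = "AABBA"

print(conditional_entropy(rigid))
print(conditional_entropy(loose))
```

A rigidly ordered system scores near zero and a completely unordered one scores high; the column's point is that the Indus sequences land in between, where spoken languages do.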

Last year, scientists from MIT and the University of Southern California tested a computer program that uses AI probabilistic modeling to decipher ancient languages that are similar to known languages. The program deciphered Ugaritic, which is similar to Hebrew, in a few hours. The original human decipherment of Ugaritic took years. The AI program starts with guesses about alphabet correspondences based on symbol frequency and refines its analysis by crunching through the data thousands of times.
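The starting point described above, guessing alphabet correspondences from symbol frequency, can be sketched in a few lines. This is a toy illustration under stated assumptions, not the MIT program: the function names and sample texts are invented, and the repeated refinement over the data that does the real work is not shown.

```python
from collections import Counter


def rank_by_frequency(text):
    """List the symbols in text from most to least frequent, ignoring whitespace."""
    counts = Counter(ch for ch in text if not ch.isspace())
    # Sort by descending count; break ties on the symbol itself so the
    # result is deterministic.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [symbol for symbol, _ in ordered]


def initial_guess(unknown_text, known_text):
    """Pair symbols of the unknown script with known letters by frequency rank."""
    return dict(zip(rank_by_frequency(unknown_text), rank_by_frequency(known_text)))


# Hypothetical sample data: "x" is the most common unknown symbol and "a" is
# the most common known letter, so the first guess pairs them.
guess = initial_guess("xxx yy z", "aaa bb c")
print(guess)
```

With real texts a first guess like this is mostly wrong, which is why a program would then revise it many times against the data.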

At a recent MLMUG meeting, MLMUGer Helge Gunther gave me some insights into basic problems and developments in translating living languages. Both the encryption of messages to prevent detection and the decipherment of lost languages almost always focus on written language. With living languages, by contrast, translating speech is at least as important as translating text.

Basic machine translation (MT) computer programs use databases of word correspondences. These operate best in specific contexts, like tourism or corporate communications. For example, every day Ford updates over 5,000 auto assembly manual instructions and uses MT software to translate them into the languages of its assembly plants. MT programs can also be AI expert systems, using complex rules developed by human programmers.

Most recent developments in MT programs, however, are in statistical systems. Free online translators, like Google Translate (available online and also as "Language Translator," a free dashboard widget for your Mac), analyze existing texts in two languages, looking for consistent translations, and store the results. When you submit a text for translation, they choose the most likely meaning based on the previously learned information and on context. Many languages lack substantial volumes of translated texts, which limits the number of languages these programs can cover. Adding the probabilistic modeling approach of the MIT system tested on Ugaritic to existing statistical programs could improve machine translation and substantially increase the number of languages that can be translated. Helge reports that online crowdsourcing to improve translation programs is also a big trend.
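The statistical idea of looking for consistent translations in existing paired texts can be sketched with simple counting. This is a toy under stated assumptions, far simpler than any real translator and not a description of how Google Translate is built; the function names and the four-sentence English-Spanish sample are invented for the example.

```python
from collections import Counter, defaultdict


def learn_translations(sentence_pairs):
    """Count how often each source word appears alongside each target word."""
    cooccur = defaultdict(Counter)
    for source, target in sentence_pairs:
        target_words = target.split()
        for source_word in source.split():
            cooccur[source_word].update(target_words)
    return cooccur


def most_likely(cooccur, word):
    """Return the target word seen most often with word, or None if unseen."""
    candidates = cooccur.get(word)
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]


# Hypothetical miniature "existing texts" in two languages.
pairs = [
    ("white house", "casa blanca"),
    ("white table", "mesa blanca"),
    ("big house", "casa grande"),
    ("big table", "mesa grande"),
]
table = learn_translations(pairs)
print(most_likely(table, "house"))
print(most_likely(table, "white"))
```

The sketch also shows the coverage limit mentioned above: a word that never appears in the paired texts gets no translation at all.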

In the "innovative uses for an iPad" contest, there is a new entry for assisting human translators. Helge told me about a friend who translates between English and Spanish in court. When she translates consecutively (waiting for the speaker to finish a segment before translating), she has enough time to use an iPad to look up slang terms online. (I'll also nominate the human translator for a multitasking award.)

American Sign Language (ASL) is the fourth most-used language in the U.S. ASL combines hand configuration, hand position, body position, body movement and facial expression, and is used for communication with and among deaf people. Its grammar is significantly different from that of English. The National Science Foundation has sponsored projects on machine translation of ASL. DePaul University is a leader in developing a program to convert spoken English into an animated ASL avatar (which they call a "synthetic corpora") that includes the important context of the translator's facial expressions.

Context is a recurring theme in machine translation, but humans can usually weed out the inaccurate machine translations caused by missing context. Context matters even more for breaking the barrier that keeps robots and other AIs from understanding the "natural language" spoken by humans, which would make them more useful to us. Rather than bulldozing down the data-intensive path of educating robots in millions of contextual details, machine translation suggests another angle. Advances in AI translation are already resulting in new understandings of language. This could lead to new methods for teaching humans to understand, read and speak new languages, perhaps by more quickly developing the new human neural pathways needed. Those of you who are less squeamish than I am might be envisioning a neural implant that helps us learn new languages faster. We could dispense with machine translation altogether by enhancing and augmenting our human abilities.

I'm reminded that some geeks claim to be so attuned to their computers that they can read and program in machine language (the lowest level representation of a computer program), even though the venerable U.S. Copyright Office insists that it's unreadable. Here's a wild idea, but you read it here first: rather than teaching our robots to understand our human natural language, would it be easier — and safer — for us to converse with them in a simple, invented, spoken machine language?

Sources and additional information:

Simon Singh, The Code Book, Random House, 1999.

Doris A. Paul, The Navajo Code Talkers, Dorrance, 1993. (There are also fictional and documentary films about the code talkers.)

"Indus Script Encodes Language," ScienceDaily, April 30, 2009,
www.sciencedaily.com/releases/2009/04/090423142316.htm

Larry Hardesty, "Computer automatically deciphers ancient language," MITnews, June 30, 2010,
web.mit.edu/newsoffice/2010/ugaritic-barzilay-0630.html

"Machine Translation," AITopics, Association for the Advancement of Artificial Intelligence,
www.aaai.org/AITopics/pmwiki/pmwiki.php/AITopics/MachineTranslation

The DePaul University American Sign Language Project, asl.cs.depaul.edu/


Kathy Garges is a member of MLMUG.


©2011 by Kathy Garges & MLMUG
Posted 03/07/11