Facebook's new polyglot AI can translate between 100 languages
The news: Facebook is open-sourcing a new AI language model called M2M-100 that can translate between any pair of 100 languages. Of the 4,450 possible language combinations, it translates 1,100 directly. This is in contrast to previous multilingual models, which relied heavily on English as an intermediary. A translation from Chinese to French, for example, typically passes from Chinese to English and then from English to French, which increases the likelihood of errors.
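To see why an extra hop can hurt, here is a minimal back-of-the-envelope sketch. It assumes, purely for illustration, that each translation step preserves the meaning of a sentence with some fixed probability and that the steps fail independently. The 0.9 figure and the function name are hypothetical and do not come from Facebook's results.

```python
def chain_accuracy(per_hop_accuracy: float, hops: int) -> float:
    """Chance that a sentence survives `hops` translation steps intact,
    assuming (hypothetically) that each step is correct with the same
    probability and that errors are independent."""
    return per_hop_accuracy ** hops


# Hypothetical per-step accuracy, chosen only for illustration.
PER_HOP = 0.9

direct = chain_accuracy(PER_HOP, hops=1)   # Chinese -> French
pivoted = chain_accuracy(PER_HOP, hops=2)  # Chinese -> English -> French

print(f"direct: {direct:.2f}, via English: {pivoted:.2f}")
```

Under these assumptions, the pivoted route is always less reliable than the direct one, since it multiplies two chances of error together.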
Data curation: The model was trained on 7.5 billion sentence pairs. To put together such a large data set, the researchers relied heavily on automated curation. They used web crawlers to scrape billions of sentences from the web and used another language model, called FastText, to identify the language of each one. (They didn't use any Facebook data.) Then they used a program called LASER 2.0, previously developed by Facebook's AI research lab, which uses unsupervised learning (machine learning that doesn't require manually labeled data) to match sentences across languages by their meaning.
LASER 2.0 creates so-called "embeddings" from large, unstructured data sets. It trains on the available sentence examples in each language and maps out their relationships to one another based on how often and how close together they are used. These embeddings help the machine learning model approximate the meaning of each sentence, which lets LASER 2.0 automatically pair up sentences that share the same meaning in different languages.
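The idea of pairing sentences by embedding can be sketched in a few lines. This is a simplified nearest-neighbour illustration, not LASER 2.0's actual procedure: the three-dimensional vectors are made up by hand, and the `match_sentences` function and its 0.9 threshold are hypothetical. A real system would get its vectors from a trained encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def match_sentences(source, target, threshold=0.9):
    """Pair each source sentence with its most similar target sentence,
    keeping the pair only if the similarity clears the threshold.
    `source` and `target` map sentence text to an embedding vector."""
    pairs = []
    for src_text, src_vec in source.items():
        best_text, best_score = None, -1.0
        for tgt_text, tgt_vec in target.items():
            score = cosine(src_vec, tgt_vec)
            if score > best_score:
                best_text, best_score = tgt_text, score
        if best_score >= threshold:
            pairs.append((src_text, best_text))
    return pairs


# Toy embeddings, invented for this example only.
english = {
    "The cat sleeps.": [0.9, 0.1, 0.0],
    "It is raining.": [0.0, 0.2, 0.9],
}
french = {
    "Il pleut.": [0.1, 0.2, 0.95],
    "Le chat dort.": [0.85, 0.15, 0.05],
    "Bonjour.": [0.3, 0.9, 0.1],
}

matched = match_sentences(english, french)
print(matched)
```

Sentences whose vectors point in nearly the same direction are treated as translations of each other. "Bonjour." has no close English counterpart in this toy set, so it is left unpaired.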
Pairing languages: The researchers focused on the language combinations they believed would be requested most often. They grouped languages by linguistic, geographic, and cultural similarity, on the assumption that people living in the same region communicate with one another more often. For example, one language group included the most widely spoken languages in India, including Bengali, Hindi, Tamil, and Urdu. LASER 2.0 then targeted its search for sentence pairs at all possible language pairs within each group.
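As a rough sketch of what "all possible language pairs within each group" means, the snippet below enumerates the pairs for a single group. The group contains only the four Indian languages named above (the real group may include more), and the `language_groups` structure and `pairs_within_groups` function are hypothetical names used for illustration.

```python
from itertools import combinations

# Only the four languages named in the article; the actual group
# used by the researchers may be larger.
language_groups = {
    "india": ["Bengali", "Hindi", "Tamil", "Urdu"],
}


def pairs_within_groups(groups):
    """List every unordered language pair that shares a group."""
    pairs = []
    for members in groups.values():
        pairs.extend(combinations(sorted(members), 2))
    return pairs


group_pairs = pairs_within_groups(language_groups)
for first, second in group_pairs:
    print(f"{first} <-> {second}")
```

Restricting the search to pairs inside a group keeps the number of combinations manageable: four languages yield six pairs, whereas pairing every language with every other grows roughly with the square of the number of languages.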
Ongoing challenges: Languages spoken in places like Africa and Southeast Asia still suffer from translation quality issues because too little language data is available to be scraped from the web, says Angela Fan, the project's lead researcher. Given the reliance on web data, researchers also need to develop techniques to identify and eliminate embedded sexism, racism, and other discriminatory biases. Right now, the researchers use a profanity filter to clean up some particularly egregious language, but it is mostly limited to English.
Research only: Facebook currently has no plans to use the model in its products. M2M-100 is for research purposes only, says Fan. Ultimately, however, the goal of the model is to improve and expand Facebook's existing translation capabilities. Applications could include user communication (for example, the feature that allows users to translate posts into their native language) and possibly content moderation.