The internet excludes Asian Americans who don’t speak English


Chen said the content moderation policies of Facebook, Twitter, and others managed to filter out some of the most obvious English-language disinformation. However, the systems often miss such content when it is in other languages. That work instead had to be done by volunteers like her team, who looked for disinformation and were trained to defuse it and minimize its spread. “These mechanisms, which are supposed to capture certain words and things, don’t necessarily capture this disinformation and misinformation when it is written in a different language,” she says.

Google’s translation services and technologies like Translatotron and real-time translation headphones use artificial intelligence to convert between languages. For Xiong, however, these tools are insufficient for Hmong, an extremely complex language in which context is incredibly important. “I think we are very complacent and dependent on advanced systems like Google,” she says. “They claim to be ‘linguistically accessible’ and then I read it and it says something completely different.”

(A Google spokesperson acknowledged that smaller languages “are a more difficult translation task,” but said the company has “invested in research that particularly benefits resource-poor language translations” by using machine learning and community feedback.)

All the way down

The challenges of language online extend beyond the United States, and down, quite literally, to the underlying code. Yudhanjaya Wijeratne is a researcher and data scientist at the Sri Lankan think tank LIRNEasia. In 2018, he began tracking bot networks whose social media activity was promoting violence against Muslims: in February and March of that year, a series of riots by Sinhalese Buddhists targeted Muslims and mosques in the cities of Ampara and Kandy. His team documented the “hunting logic” of the bots, cataloged hundreds of thousands of Sinhalese social media posts, and brought the results to Twitter and Facebook. “They’d say all kinds of nice and well-intentioned things – canned statements, basically,” he says. (In a statement, Twitter says it uses human scrutiny and automated systems to “apply our rules impartially to anyone on duty, regardless of background, ideology, or placement in the political spectrum.”)

When contacted by MIT Technology Review, a Facebook spokesperson said that the company had commissioned an independent human rights assessment of the platform’s role in violence in Sri Lanka, published in May 2020, and made changes in the wake of the attacks, including hiring dozens of Sinhala- and Tamil-speaking content moderators. “We have used proactive hate speech detection technology in Sinhala to help us identify potentially harmful content faster and more effectively,” they said.

“What I can do in English with three lines of code in Python has literally taken me two years to look at 28 million Sinhala words.”

Yudhanjaya Wijeratne, LIRNEasia

When the bot behavior continued, Wijeratne became skeptical of the platitudes. He decided to look at the code libraries and software tools the companies were using and found that the mechanisms to monitor hate speech in most non-English languages were not yet in place.

“Much of the research for many languages like ours just hasn’t been done,” says Wijeratne. “What I can do with three lines of code in Python in English literally took me two years of looking at 28 million Sinhala words to create the core corpuses, create the core tools, and then get things up to the level where I could possibly do this level of text analysis.”

After suicide bombers attacked churches in Colombo, the capital of Sri Lanka, in April 2019, Wijeratne built a tool to analyze hate speech and misinformation in Sinhala and Tamil. The system, called Watchdog, is a free mobile application that gathers news and attaches warnings to false stories. The warnings come from volunteers trained in fact-checking.

Wijeratne emphasizes that this work goes far beyond translation.

“Many of the algorithms that we take for granted and that are frequently cited in research, especially in natural language processing, show excellent results for English,” he says. “And yet many identical algorithms, even when used for languages that are only a few degrees apart – whether they are West Germanic or come from the Romance language tree – can produce completely different results.”

Natural language processing is the basis for automated content moderation systems. Wijeratne published a paper in 2019 examining the discrepancies in the accuracy of these systems across different languages. He argues that the more computational resources exist for a language, such as data sets and web pages, the better the algorithms can work. Languages from poorer countries or communities are disadvantaged.

“For example, if you’re building the Empire State Building for English, you have the blueprints. You have the materials,” he says. “You have everything at hand and just have to put this stuff together. You don’t have blueprints for any other language.

“You have no idea where the concrete will come from. You have no steel and you have no workers. So you will sit there and knock off one stone at a time, hoping that maybe your grandchildren will complete the project.”

Deep-seated issues

The movement to make these blueprints available is known as linguistic justice, and it is not new. The American Bar Association describes linguistic justice as a “framework” that preserves people’s rights “to communicate, understand and be understood in the language they prefer and feel most articulate and powerful.”


Steven Gregory