Biased GPT? Singapore builds AI model to 'represent' Southeast Asians

By: REUTERS
Updated on: Feb 08 2024, 12:20 IST
A Southeast Asian large language model (LLM) called SEA-LION has been created by a Singapore government-led initiative to provide better representation for the region. (REUTERS)

Like millions worldwide, Southeast Asians have been trying out large language models such as Meta's Llama 2 and Mistral AI - but in their native Bahasa Indonesia or Thai. The result has usually been gibberish in English.

This leaves them at a disadvantage, tech experts warn, as generative artificial intelligence transforms education, work and governance worldwide.

A Singapore government-led initiative aims to correct the imbalance with a Southeast Asian LLM, the first in a family of models named SEA-LION - Southeast Asian Languages in One Network - trained in the region's languages and cultural norms.

Trained on data in 11 Southeast Asian languages including Vietnamese, Thai and Bahasa Indonesia, the open-sourced model is a cheaper and more efficient option for the region's businesses, governments and academia, said Leslie Teo at AI Singapore.

"Do we want to force every person in Southeast Asia to adapt to the machine, or do we want to make it more accessible so people in the region can make full use of the technology without having to be an English speaker?" he said.

"We are not trying to compete with the big LLMs; we are trying to complement them, so there can be better representation of us," Teo, senior director for AI products, told the Thomson Reuters Foundation.

There are over 7,000 languages spoken worldwide. Yet LLMs including OpenAI's GPT-4 and Meta's Llama 2, which are used to build AI systems such as chatbots and other tools, have largely been developed for, and trained on, the English language.

Governments and tech firms are trying to bridge this gap, with India creating datasets in local languages, an LLM in the United Arab Emirates powering generative AI tools in Arabic, and AI models in China, Japan and Vietnam in local languages.

These models can help local populations participate more equitably in the global AI economy that is largely dominated by big tech firms, said Nuurrianti Jalli, an assistant professor at Oklahoma State University's school of communications.

"Regional LLMs are also needed because they support technology self-reliance," she said. “Less reliance on Western LLMs could provide better privacy for local populations, and also align better with national or regional interest.”

VERIFY AND FILTER

Multilingual language models, which are trained on text from several languages at once, can infer semantic and grammatical connections between high-resource languages, which have more data, and low-resource languages, researchers say.

These models can be used in a variety of applications, from translation and customer-service chatbots to content moderation on social media platforms that have struggled to identify hate speech in low-resource languages such as Burmese or Amharic.

About 13% of SEA-LION's data is sourced from Southeast Asian languages - more than any other major LLM, said Teo. More than 9% of its data is from Chinese text, and about 63% from English.

Multilingual language models often train on translated text and other poor-quality data that may contain errors, so AI Singapore is "careful" about the data used in training SEA-LION, Teo said in his office at the National University of Singapore.

"The age of pristine data has passed - a lot of the stuff on the internet now is material that is generated by LLMs, so we need to verify and filter," he said.

"We cannot be perfect, but we also cannot take out everything we consider to be bad," he added.

More governments are contributing data, and businesses are testing SEA-LION, which, owing to its smaller size, can be deployed faster and is cheaper to fine-tune and adopt, Teo said.

At Indonesian e-commerce company Tokopedia, a majority of customer interactions are in Bahasa Indonesia, so models "with that local fluency will enhance our ability to connect with customers and improve their experiences," said Paul Condylis, Tokopedia's associate vice president of data science.

BIAS IN THE DATA

As more countries and regions build their own LLMs, digital and human rights experts fret that they will reproduce only the dominant views expressed online, which can be particularly problematic in nations with authoritarian governments or strict media censorship, or those lacking a strong civil society.

Chinese social media platforms, for example, censor references to the Tiananmen Square uprising and criticism of the government, while several Southeast Asian nations have enacted laws to curb content that authorities deem misleading.

"Training models on such data risks perpetuating biased, prejudiced, incomplete and even misleading narratives," said Jalli.

"The models may fail to surface important socio-political issues like human rights abuse, corruption, or valid criticism of political powers," she said.

In response to a query on Indonesian former president Suharto, for example, Llama 2 and GPT-4 mentioned his spotty human rights record, while SEA-LION's response focused largely on his achievements.

If a model is only trained on favourable articles about a government, then the model is "likely to adopt a worldview where the government is wholly positive and leave behind dissenting viewpoints," said Aliya Bhatia, a policy analyst at the Center for Democracy & Technology, a U.S. non-profit.

"Regional LLMs may better reflect the linguistic and cultural nuances of local language speakers, but they may also have less information about the world in general," she added.

"There is a real risk of government-backed models instilling a revisionist view of history and undermining democratic values."

But the alternative - relying entirely on Western LLMs with "disproportionately large influences" from wealthy, liberal, Western democracies - means perpetuating different biases related to cultural values, political beliefs and social norms, according to AI Singapore.

"These LLMs have a very particular West Coast American bias - they are very woke. They do not represent us," said Teo.

“We are not saying ours is the only perspective - we are just trying to rebalance it.”

First Published Date: 08 Feb, 12:19 IST
