Multilingual/Bilingual Large Language Models (LLMs): Tailoring AI Applications for Southeast Asia

Written by Benjamin Chan | Nov 22, 2024 5:00:00 AM

Translating something from one language to another can be tricky. Translations are prone to mistakes, as errors in book translations and movie subtitles show. Our world is incredibly diverse, with more than 7,000 languages spoken worldwide. In Southeast Asia, one of the most populous regions, more than 1,200 different languages are used.

The region is also undergoing a massive digital transformation, with Generative Artificial Intelligence (Gen AI) at the forefront of the novel technologies leveraged by businesses and governments. Historically, Southeast Asian languages, cultural values, social norms, customs, and other elements of a country’s identity have been largely excluded from the data used to train Large Language Models (LLMs). LLMs have billions of parameters and are trained on vast numbers of tokens, drawn mostly from Western sources of information. Unsurprisingly, Gen AI application outputs tend to be biased.

To overcome this challenge, governments and technology leaders in Southeast Asia are heavily focused on multilingual/bilingual LLMs. While these LLMs continue to support English, the global language of business, they also support languages such as Thai, Indonesian, Lao, Vietnamese, and Mandarin. Beyond that, multilingual LLMs are increasingly being trained on country-specific data, such as literature and local news sources. As a result, developers can fine-tune Gen AI apps to capture the nuances of specific populations.

Growing Support for Multilingual/Bilingual LLMs

Southeast Asia is extremely diverse, with numerous languages and cultural differences. Therefore, LLMs must be trained on local data to reflect local values and put information in the right context. General-purpose LLMs like GPT-4 and BERT are trained primarily on English-language text that reflects Western cultural characteristics. App developers in Indonesia, for example, would be better served by a region-specific LLM like WIZ.AI when building an intuitive Artificial Intelligence (AI) chatbot. Although WIZ.AI is mostly trained on Western data sources, it also leverages 10 billion Indonesian tokens to ensure the models account for cultural nuances. As this example illustrates, LLMs in Southeast Asia are not doing away with Western-based pre-training. Rather, they are adding significantly more inputs from countries in the region.

Hyperscaler LLMs can often support various languages, but the outputs are not always ideal. These LLMs typically favor the ethical and equity frameworks, languages, and culture of the platform’s country of origin (usually a Western nation). Without localized LLMs, developers cannot achieve the accuracy, reliability, and applicability required for country-tailored AI applications.

Governments and tech companies in the Southeast Asian region have allocated significant time and resources to developing local LLMs. These LLMs are explicitly trained for certain cultures and can be multilingual and/or bilingual.

Table 1 lists the nine publicly available LLM families in the region, along with their variants.

Table 1: LLMs in Southeast Asia (2024)

(Source: ABI Research)

LLMs in Southeast Asia	Parameters (Billions)	Tokens (Billions)	Architecture	Date Launched	Languages
Climind	Not Stated	Not Stated	Not Stated	Dec-22	2 (EN, ZH)
WIZ.AI
WIZ.AI-7B	7	10	Not Stated	Apr-23	2 (EN, ID)
WIZ.AI-13B	13	10	Not Stated	Nov-23	3 (EN, TH)
SEA-LION
SEA-LION 3B	3	980	MPT	Nov-23	11*
SEA-LION 7B	7	980	MPT	Nov-23	11*
SEA-LION 7B Instruct	7	980	MPT	Nov-23	11*
SEA-LION v2	8	48	Llama 3	Aug-24	5 (EN, ID, TH, VN, TA)
SeaLLM
SeaLLM-7B-v1	7	150	Llama-2-7B	Dec-23	10**
SeaLLM-13B-v1	13	150	Llama-2-13B	Dec-23	10**
SeaLLM-7B-v2	7	150	Mistral-7B	Dec-23	10**
SeaLLM-7B-v2.5	7	150	Gemma-7B	Dec-23	10**
VinaLLaMA
VinaLLaMA-2.7B	2.7	800	LLaMA-2	Dec-23	2 (EN, VN)
VinaLLaMA-7B	7	800	LLaMA-2	Dec-23	2 (EN, VN)
CompassLLM
CompassLLM-SFT	7	1700	LLaMA	Apr-24	3 (EN, ZH, ID)
CompassLLM-DPO	7	1700	LLaMA	Apr-24	3 (EN, ZH, ID)
Yellow.AI	Not Stated	Not Stated	Llama2	Apr-24	3 (EN, ZH, ID)
Sailor
Sailor-0.5B	0.5	400	Qwen1.5	Apr-24	7***
Sailor-1.8B	1.8	200	Qwen1.5	Apr-24	7***
Sailor-4B	4	200	Qwen1.5	Apr-24	7***
Sailor-7B	7	200	Qwen1.5	Apr-24	7***
Sailor-14B	14	200	Qwen1.5	Apr-24	7***
Typhoon
Typhoon-7B	7	186	Mistral-7B	Dec-23	2 (EN, TH)
Typhoon-1.5 8B	8	Not Stated	Llama3	May-24	2 (EN, TH)
Typhoon-1.5 70B	70	Not Stated	Llama3	May-24	2 (EN, TH)
Typhoon-1.5X 72B	72	Not Stated	Qwen1.5	May-24	2 (EN, TH)

Note for Table 1: LLMs developed for regional usage, organized by launch date and listed with parameter count, training tokens, architecture, launch date, and supported languages. *11 major regional languages: Indonesian, Thai, Vietnamese, Filipino, Burmese, Khmer, English, Mandarin, Malay, Tamil, and Lao. **10 official languages used in Southeast Asia. ***7 languages: Indonesian, Thai, Vietnamese, Malay, Lao, English, and Mandarin.
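
For developers shortlisting a model, the rows of Table 1 can be treated as data and filtered by target language and size. The Python sketch below is illustrative only: it encodes a handful of rows whose languages are spelled out in the table or its note, the `RegionalModel` class and `models_supporting` helper are hypothetical names, and the codes "MS" (Malay) and "LO" (Lao) are assumptions because the table does not abbreviate those languages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionalModel:
    name: str
    params_b: float       # parameters, in billions (from Table 1)
    languages: frozenset  # language codes; "MS" and "LO" are assumed codes


# Subset of Table 1. Rows whose language list is not fully spelled out
# in the table or its note are deliberately left out.
MODELS = [
    RegionalModel("WIZ.AI-7B", 7, frozenset({"EN", "ID"})),
    RegionalModel("SEA-LION v2", 8, frozenset({"EN", "ID", "TH", "VN", "TA"})),
    RegionalModel("VinaLLaMA-7B", 7, frozenset({"EN", "VN"})),
    RegionalModel("CompassLLM-SFT", 7, frozenset({"EN", "ZH", "ID"})),
    RegionalModel(
        "Sailor-7B", 7,
        frozenset({"ID", "TH", "VN", "MS", "LO", "EN", "ZH"}),
    ),
    RegionalModel("Typhoon-7B", 7, frozenset({"EN", "TH"})),
]


def models_supporting(language, max_params_b=None):
    """Return names of models listing `language`, optionally capped by size."""
    code = language.upper()
    return [
        m.name
        for m in MODELS
        if code in m.languages
        and (max_params_b is None or m.params_b <= max_params_b)
    ]


print(models_supporting("TH"))
print(models_supporting("ID", max_params_b=7))
```

A language match is only a first filter. As the rest of this article argues, the amount and origin of the training data behind each language matters as much as whether the language appears on the list.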

LLMs with Multilingual Capabilities

SEA-LION, SeaLLM, CompassLLM, and Sailor are multilingual LLMs used by developers in Southeast Asia. These LLMs are trained with regional cultures in mind, enabling users to develop personalized AI applications.

Mandarin and Indonesian are the languages most commonly supported by LLMs in Southeast Asia. For example, the four main multilingual LLMs in the region (SEA-LION, SeaLLM, CompassLLM, and Sailor) were all trained with Indonesian resources. Languages that lack substantial literature and other resources, such as Lao, receive less LLM support.

SEA-LION, which stands for Southeast Asian Languages In One Network, was trained in 11 regional languages: Indonesian, Thai, Vietnamese, Filipino, Burmese, Khmer, English, Mandarin, Malay, Tamil, and Lao. Because SEA-LION is trained on content generated in specific Southeast Asian contexts, enterprises that use it can expect more accurate output for the region. SEA-LION’s training data included 26X more Southeast Asian language content than that of Western-made LLMs like Llama 2.
SeaLLM was developed by Chinese e-commerce juggernaut Alibaba. According to Hugging Face, the newest iteration, SeaLLM-v3, processes 12 regional languages: English, Mandarin, Indonesian, Vietnamese, Thai, Tagalog, Malay, Burmese, Khmer, Lao, Tamil, and Javanese. SeaLLM can reportedly process up to 9X longer non-Latin text than ChatGPT and handle more complex tasks.
CompassLLM was also developed by an e-commerce company, the Singapore-based Shopee Group. Its pre-training datasets, totaling 1.7 trillion tokens, are mainly in English, Mandarin, and Indonesian.
Sailor was co-developed by Sea AI Lab and the Singapore University of Technology and Design (SUTD). Built by continuing pre-training from base language models like Qwen 1.5 on publicly available data sources, Sailor supports 70% of Southeast Asian languages. Its training tokens are primarily in Indonesian, Vietnamese, Thai, English, and Mandarin.
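
SeaLLM's reported advantage on non-Latin text can be partly illustrated with plain character encoding. This is an assumption about the underlying cause, not something stated above: when a tokenizer has little vocabulary for a script and falls back toward raw bytes, every Thai or Lao character costs three UTF-8 bytes, while an English letter costs one, so the same context window holds less text. The Python sketch below measures that byte cost only. It uses no real tokenizer, and `utf8_bytes_per_char` is a hypothetical helper name.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per Unicode code point in `text`."""
    if not text:
        raise ValueError("text must not be empty")
    return len(text.encode("utf-8")) / len(text)


# Short greetings, used purely as sample strings.
samples = {
    "English": "hello",
    "Thai": "สวัสดี",
    "Lao": "ສະບາຍດີ",
}

for language, text in samples.items():
    # English letters take 1 byte each; Thai and Lao characters take 3.
    print(f"{language}: {utf8_bytes_per_char(text):.1f} bytes per character")
```

Production tokenizers learn subword vocabularies, so the real gap depends on how much text in each script the tokenizer saw during training. The byte count is best read as a rough worst case, which is why regional models that extend their vocabularies can fit more non-Latin text into the same window.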

Bilingual LLMs are also helpful for AI application developers in the Asia-Pacific region. These local LLMs are tailored even more closely to a specific country than multilingual LLMs. They support English-based content and a secondary language. VinaLLaMA and Typhoon are prominent examples of bilingual LLMs.

Accurate LLMs Ensure a Smooth Digital Transformation Journey

Many businesses have high hopes for Gen AI, and our recent whitepaper outlines many of the opportunities. However, Gen AI will only be as effective as the data used to train the LLMs on which applications are built. In this regard, businesses require multilingual/bilingual LLMs that account for cultural, linguistic, and value differences.

To date, LLMs have primarily been trained on Western-centric sources. This bias means an AI-based chatbot might not detect a Vietnamese citizen's use of slang during a conversation, resulting in poor customer service outcomes. Or a Gen AI application might fail to account for regional banking laws in Indonesia. Without localizing the context of LLM training data, the list of potential issues with AI will be extensive.

With Southeast Asia fast becoming a technology hub—thanks partly to being the largest manufacturing region worldwide—AI innovators must increasingly leverage country-specific sources to construct local LLMs. Only then will the true value of digital transformation be realized in these growing economies.

This content is part of ABI Research's Next-Gen Hybrid Cloud Solutions and Southeast Asia Digital Transformation services.

About the Author

Benjamin Chan, Research Analyst

Research Analyst Benjamin Chan is a member of the Asia-Pacific Advisory team focused on issues related to Artificial Intelligence (AI) and Machine Learning (ML) implementation and digital transformation. Benjamin also focuses on key technological developments within the Southeast Asian region.
