Language is more than words. It is a window into a culture’s distinct perspective. In Asia, artificial intelligence (AI) could be the tool that flings this window wide open.

In Google’s virtual “AI Now” session on Wednesday, August 6, the technology giant shone a spotlight on its partnerships with AI-focused projects in Southeast Asia (SEA), India, and Japan.

Together with Google, these projects hope to bridge the language barrier and promote these nations’ cultures by expanding the corpus of Asian languages available to AI or by leveraging existing tools.

Training AI to be Asian

Google kicked off its discussion by diving into SEA’s Project SEALD and India’s Project Vaani. These initiatives are working to build training datasets composed of Asian languages, which will provide a foundation for future AI tools steeped in Asian contexts.

Why, though, is there an emphasis on gathering Asian languages? Simple. English dominates the Internet and with it, so do Western cultural views.

According to Google, over 99% of today’s high-quality, open Web data used on Common Crawl and by today’s leading Large Language Model (LLM) developers is in English. This English majority limits how well existing AI tools process diverse Asian languages and how useful they are in Asian cultural contexts.

Case in point: Pratyusha Mukherjee, Asia Pacific lead for Google’s Gen AI and SEA Research Partnerships, revealed that much of the less than 1% of content in these data repositories that counts as SEA-language content consists of ads and gambling messages. Not exactly high-quality content reflective of SEA culture.

Project SEALD and Project Vaani aim to fill the information gaps by collecting and digitizing underrepresented linguistic and cultural data.

Project SEALD, short for Southeast Asian Languages in One Network Data, is a research collaboration between AI Singapore (AISG) and Google Research working to grow datasets intended to train, fine-tune, and evaluate LLMs in Southeast Asian languages.

Over the past few years, Project SEALD has concentrated its efforts on gathering tens of thousands of “heavily cultured data” responses in SEA languages, aided by partners and communities across the region. Covering topics such as social norms and cultural identity, these data are considered high priority because they can easily be lost as communities age.

On top of Project SEALD’s targeted effort at gathering cultural data, Google has helped the project and its other regional partners create Aquarium, an open data platform for SEA languages, including Tagalog and Cebuano.

“[Aquarium] is envisioned to be a one stop portal for people to discuss, contribute, and gather data about South East Asian languages. It will also give a map of the latest state of language data in the region so we can minimize duplicated [data gathering] effort,” said William Tjhi, AI Singapore’s head of applied research, during the meeting.

As its datasets grow, Aquarium aims to facilitate the development of AI tools that authentically deliver the region’s cultural and linguistic nuances.

Moving on to Project Vaani, this massive data collection venture is a collaboration between the Indian Institute of Science (IISc) and Google. It aims to collect high-quality audio recordings in more than 1,300 languages across 773 target districts in India, totaling roughly 156,000 hours of speech.

“Project Vaani started with the motivation of [filling] the existing speech and language data gaps,” explained Prasanta Ghosh, an associate professor at IISc.

“Current, publicly available speech data sets for Indian languages are fragmented and cover less than 5% of the linguistic landscape, without sufficient depth for robust AI model training.”

The project started three years ago and has gathered 21,500 hours of speech audio and 835 hours of transcribed speech from more than 112,000 speakers, representing 86 unique languages across 120 districts.

Notably, the initiative’s dataset is being made freely available as part of India’s national language translation mission, Bhashini, to help develop Indian-language speech recognition, translation, and natural language understanding tools. Moreover, as of 2024, data has already been released and leveraged by researchers and startups.

Ghosh held up SandLogic as an example: the Bangalore-based AI startup is using Project Vaani’s datasets to enable enterprise-grade, multilingual automatic speech recognition and to train a text-to-speech model capable of expressive speech.

AI translates uncharted comedy territory

Beyond localizing AI for Asian culture, the technology is also being used to promote Asian culture to other countries. Google is working on one such project in partnership with Japan’s largest entertainment agency, the Yoshimoto Group.

Since 2024, Google and the Yoshimoto Group have joined forces to translate the wide variety of Japanese comedy formats through CHAD 2, a subtitle service specializing in comedy and powered by Google’s Gemini 2.0 Flash.

Just by uploading a movie file to CHAD 2, the Yoshimoto Group can automatically get a subtitle file in English, Chinese, and Korean.

CHAD 2 has reportedly cut translation time from a month to moments, with minimal corrections compared to traditional methods, even taking into account the precise phrasing, timing, and delivery essential to conveying Japanese humor.

“Gemini achieved superior accuracy with transcription and translation rates of approximately 90% accuracy, which significantly outperforms other models with 60 to 75% accuracy,” said Chad Mullane, a Tokyo-based comedian involved with the project, during the meeting.

Mullane exclaimed: “This truly is amazing. When Google Translate first came out, we used to joke about how bad these translations were. Now, Gemini doesn’t only understand those jokes, it can make comedians funnier in another language!”

With Google’s support, initiatives like Project SEALD, Project Vaani, and CHAD 2 are using AI to foster language use and promote culture.

Whether by localizing AI so that communities can take full advantage of its capabilities or by using it to bring previously inaccessible parts of Asian culture to global audiences, these projects are proving that technology can strengthen cultural expression rather than leave it behind.