
A non-profit initiative led by Cohere For AI is developing a large language model (LLM) called Aya that can converse in more than 100 languages. This ambitious project aims to address the current language bias in AI, which heavily favors English and underrepresents many of the world's languages.
The English bias in AI technology
Modern AI chatbots, powered by LLMs such as OpenAI's GPT-4, display a considerable bias towards English. These models are far less proficient in other languages, especially underrepresented ones like Malagasy. The issue stems from the fact that LLMs are typically trained on internet data, 64% of which is in English. Consequently, other languages, particularly those from countries with lower internet use, are neglected. The problem is most acute for languages like Malagasy, which barely figure in language model training at all, widening the technology gap.
To address this language bias, Cohere For AI, a non-profit research lab, has embarked on a project to develop Aya, a new LLM capable of conversing in 101 languages. Unlike existing AI models, Aya aims to provide equal proficiency across languages, including those often overlooked, such as Somali, Yoruba, Malay, Telugu, and Vietnamese. In doing so, the project seeks to level the playing field, making the technology more inclusive and accessible to non-English speakers around the world.
The Aya project stands out not only for its ambitious goals but also for its collaborative, open-source approach. Rather than being developed by a single private entity, Aya is a collective effort involving more than a thousand contributors, including independent researchers and computer-science students. By encouraging open-source research and collaboration, the project embodies a shared vision of making AI more inclusive and representative of the world's languages.
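Because Aya is being released openly, interacting with it should look much like interacting with any other publicly available model. The snippet below is a minimal sketch, assuming the checkpoint is published on the Hugging Face Hub under an identifier such as "CohereForAI/aya-101" and exposes a standard sequence-to-sequence interface; neither detail comes from this article.

```python
# Minimal sketch: querying an openly released multilingual model with the
# Hugging Face `transformers` library. The model identifier and the
# sequence-to-sequence interface are assumptions, not details from the article.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_ID = "CohereForAI/aya-101"  # assumed public checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)

# The same model handles prompts involving several of the languages the
# article mentions, without switching checkpoints.
prompts = [
    "Translate to Somali: Where is the nearest hospital?",
    "Translate to Telugu: What time does the market open?",
]

for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```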
Volunteers: The backbone of Aya's multilingual capabilities
The development of Aya is not only a technological challenge but also a linguistic one. To become proficient in each language, Aya needs thousands of text examples to learn from. Volunteers play a critical role in improving existing multilingual datasets, checking examples for coherence and grammar, and contributing new ones. Each language's nuances and colloquialisms are noted and taken into account, ensuring that Aya is not just multilingual but also culturally sensitive and accurate.
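As a rough illustration of that workflow, the sketch below shows one way a contributed example and a simple quality check could be represented; the field names and checks are hypothetical, not the project's actual schema.

```python
# Hypothetical sketch of a crowd-sourced multilingual instruction example
# and a basic quality check; the schema is illustrative, not Aya's actual format.
from dataclasses import dataclass


@dataclass
class InstructionExample:
    language: str      # language code, e.g. "mg" for Malagasy
    prompt: str        # instruction written by a volunteer
    completion: str    # reference answer, checked for coherence and grammar
    reviewed: bool     # whether a second volunteer has verified the pair


def basic_checks(example: InstructionExample) -> list[str]:
    """Return a list of problems; an empty list means the example passes."""
    problems = []
    if not example.prompt.strip() or not example.completion.strip():
        problems.append("empty prompt or completion")
    if not example.reviewed:
        problems.append("awaiting second review")
    return problems


# Placeholder content shown in English here; real contributions would be
# written in the target language itself.
example = InstructionExample(
    language="mg",
    prompt="What is the capital of Madagascar?",
    completion="Antananarivo is the capital of Madagascar.",
    reviewed=True,
)
print(basic_checks(example))  # [] -> ready to be added to the dataset
```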
Language bias and safety risks in AI
The underrepresentation of certain languages in AI not only restricts access to technology for non-English speakers but also poses potential safety risks. According to a study from Brown University, safety measures that prevent AI chatbots from providing harmful information can be easily bypassed if the request is first translated into a low-resource language. This vulnerability highlights the urgent need for more comprehensive and inclusive language models in AI.
Aya: A catalyst for change in the AI industry
By addressing the language bias in AI, the Aya project also hopes to spark broader change within the industry and among governments: better representation of the world's languages and stronger support for open-source AI development. Success in that endeavor could pave the way for more inclusive technology, ensuring that the benefits of AI are shared more equitably across the globe.