In the southwestern Indian state of Karnataka, villagers recently took part in a groundbreaking project to build the country's first AI-based chatbot for tuberculosis. The project aims to address India's linguistic diversity: 121 languages are spoken by 10,000 people or more, yet very few are covered by natural language processing (NLP), the branch of artificial intelligence that enables computers to understand text and spoken words. This gap leaves a significant share of the Indian population cut off from useful information and economic opportunities.
Kalika Bali, principal researcher at Microsoft Research India, highlighted the need for AI tools that cater to people who don't speak English or other widely spoken languages. However, collecting enough data in Indian languages to train a large language model such as GPT from scratch could take another 10 years. To get around this, researchers are instead building layers on top of existing generative AI models such as ChatGPT and Llama.
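One common way to add such a layer is parameter-efficient fine-tuning, in which small trainable adapter weights are attached to a frozen, pre-trained model and trained on a modest amount of local-language text. The sketch below illustrates the idea with Hugging Face's transformers and peft libraries; the base model name, the kannada_sentences.jsonl corpus, and the hyperparameters are illustrative assumptions, not details from the projects described here.

```python
# A minimal sketch (not from the article) of attaching LoRA adapter layers to an
# existing open LLM and fine-tuning them on a small, crowdsourced Kannada corpus.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

BASE_MODEL = "meta-llama/Llama-2-7b-hf"   # assumed base model, not specified in the article

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# The "layer on top": small trainable adapters; the base model's weights stay frozen.
model = get_peft_model(
    model,
    LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"),
)

# Hypothetical JSONL file with one crowdsourced Kannada sentence per line: {"text": "..."}
dataset = load_dataset("json", data_files="kannada_sentences.jsonl")["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="llama-kannada-lora",
        per_device_train_batch_size=2,
        num_train_epochs=1,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()
```

Because only the adapter weights are updated, a relatively small corpus of crowdsourced sentences can, in principle, nudge an existing model toward a new language without retraining it from scratch.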
Thousands of speakers of different Indian languages, including the villagers in Karnataka, are generating speech data for tech firm Karya. The data sets being built by Karya, as well as those compiled by the Indian government, aim to train AI models for education, healthcare, and other services. The government's platform, called Bhashini, is an AI-led language translation system that includes a crowdsourcing initiative through which people contribute sentences in various languages, validate transcriptions, translate texts, and label images. Tens of thousands of Indians have already contributed to Bhashini.
Currently, out of the more than 7,000 living languages in the world, fewer than 100 are covered by major NLP systems, with English being the most advanced. Governments and start-ups are working to bridge this gap. Grassroots organization Masakhane is strengthening NLP research in African languages, and in the United Arab Emirates, a large language model called Jais is being developed for Arabic. India, renowned for its crowdsourcing capabilities, is applying this approach to collect speech and language data at scale.
Kalika Bali emphasizes the importance of ethical crowdsourcing to capture linguistic nuances and minimize bias. Awareness of gender, ethnic, and socio-economic biases is crucial; workers should be educated and paid, and efforts should be made to collect data from smaller languages. Safiya Husain, co-founder of Karya, explains that the rapid growth of AI has created demand for languages that were previously overlooked. By empowering workers and offering them royalties on the data they generate, Karya aims to create economic value for them, particularly in areas like healthcare and farming.
In India, where less than 11% of the population speaks English, several AI efforts focus on speech data and speech recognition. The Google-funded Project Vaani is collecting speech data from around 1 million Indians and open-sourcing it for use in automatic speech recognition and speech-to-speech translation. The EkStep Foundation's AI-based translation tools are being used at the Supreme Court, and the government-backed AI4Bharat center has launched Jugalbandi, an AI-based chatbot for welfare schemes accessible via WhatsApp.
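For the speech-recognition side, a crowdsourced recording can be transcribed with an off-the-shelf multilingual ASR model. The sketch below uses Hugging Face's transformers pipeline with an assumed Whisper checkpoint and a hypothetical 16 kHz Kannada recording; it is not the stack used by Project Vaani, Bhashini, or AI4Bharat, just an illustration of the kind of task these data sets support.

```python
# A minimal sketch (not from the article): transcribing one Kannada audio clip
# with an off-the-shelf multilingual speech-recognition model.
from transformers import pipeline

# Assumed checkpoint; the projects named in the article may use different models.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

# Hypothetical 16 kHz WAV recording of a crowdsourced Kannada sentence.
result = asr(
    "kannada_sample.wav",
    generate_kwargs={"language": "kannada", "task": "transcribe"},
)
print(result["text"])
```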
Overall, the use of AI-based language processing is becoming increasingly important to address the linguistic diversity of India. By leveraging crowdsourcing and developing language data sets, AI tools can be made more inclusive, enabling a larger portion of the population to access information and services. This not only benefits individuals but also has the potential to drive economic growth and empowerment in various sectors.