Indian colleges expand work on Indic languages gen AI

Even as technology companies roll out platforms like ChatGPT, Bing and Bard, top engineering colleges in India are taking up a growing number of generative artificial intelligence (AI) research projects, most of which aim to understand how the technology can help create tools similar to OpenAI’s ChatGPT, but in Indian languages.

Generative AI platforms have been all the rage since the second half of last year, with Microsoft and Google pushing these programs into their existing services. Even the Ministry of Electronics and Information Technology (MeitY), on 3 February, said it is “cognizant” of the emergence and proliferation of generative AI and noted that AI can be a “kinetic enabler” for growth in India.

However, researchers at these institutes underline a host of challenges for generative AI projects in academia, the biggest of which lie in sourcing enough Indic-language data, the cost of such projects, and the scale of computing power needed. Indian researchers have been working on such projects for more than three years.

“In academia, we’re using techniques from language models, namely the transformer architecture, for different tasks such as classification of data, answering questions, machine translation and building chatbots,” said Tapas Kumar Mishra, assistant professor of computer science engineering at National Institute of Technology (NIT), Rourkela.

The transformer model is the underlying algorithm for generative AI tools. Such models can process conversational human-language inputs and generate output after understanding the context. While global platforms work mostly in English, Mishra said researchers under him are working on languages like Hindi, Bangla and Kannada, creating models that can take questions in these languages and generate output in English. They aren’t using OpenAI’s tools for this but have achieved “very good” scores on the industry-standard Bilingual Evaluation Understudy (BLEU) test.

He said NIT Rourkela has achieved scores of between 25 and 30 on Hindi to English, and 19 on Bangla to English. For reference, OpenAI’s GPT-4 model has a score of 22.9 on English-to-French output. The institute published a research paper on translations from Hindi to English last month with the Association for Computing Machinery, a US scientific and educational society that publishes research work on natural language processing (NLP).
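To give a sense of what a BLEU score measures, here is a minimal sketch in Python. It is not the evaluation code used by NIT Rourkela or OpenAI; the function names `ngram_counts` and `bleu` are illustrative, and it assumes a single reference translation, whitespace tokenization and no smoothing. Published scores are normally computed over a whole test set with standardized tooling, so this toy version only shows the idea: overlap of word sequences between a machine translation and a human reference, scaled to 0-100.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU on a 0-100 scale (illustrative only)."""
    cand = candidate.split()
    ref = reference.split()
    if not cand:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        total = sum(cand_counts.values())
        # Clip each n-gram's count at the number of times the reference has it.
        clipped = sum(min(count, ref_counts[gram])
                      for gram, count in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing: any zero precision gives a zero score
        log_precisions.append(math.log(clipped / total))

    # Brevity penalty: penalize candidates shorter than the reference.
    if len(cand) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))

    # Geometric mean of the n-gram precisions, times the brevity penalty.
    return 100 * brevity_penalty * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    print(bleu("the cat sat on the mat", reference))  # identical sentences
    print(bleu("the cat sat on a mat", reference))    # one word differs
```

An exact match scores 100 and a translation sharing no words with the reference scores 0. Scores are only comparable when measured on the same language pair and test set.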

NIT Rourkela isn’t the only one doing this either. Students from the Indian Institute of Technology (IIT) Madras have also taken up such projects. Harish Guruprasad, assistant professor of computer science engineering at IIT Madras, said that one such project includes “better translated YouTube videos in Tamil”. “Students mostly took this up to compare their own research language models with GPT-4, and eventually publish a paper on new approaches of translating videos into Indian languages,” he added. Generative AI projects have also been a part of research initiatives beyond Indic languages.

For instance, Debanga Raj Neog, assistant professor of data science and AI at IIT Guwahati, said the institute is presently working on creating “affordable visual animation models that study eyes and facial movements from open-source visual databases, and use this to replicate the process.” IIT Guwahati, too, is working on a research paper on this.

Professor Mausam, the founding head of the Yardi School of Artificial Intelligence at IIT Delhi, said that in 2022, he, along with Anoop Krishnan, associate professor, and a team of students, created a language model called ‘MatSciBert’, built specifically for the field of material science research. “The eventual goal is to discover new materials with the help of AI. The first step is to process scientific articles and extract from them knowledge about materials and their properties. We developed MatSciBert in 2022; it is a language model skilled in reading material science papers more effectively than other generic language models like Bert. MatSciBert has been downloaded almost 100,000 times in the last year and has been found useful for various material science tasks by numerous groups all over the world,” said Mausam, who goes by one name.

The key problem for most researchers, though, is computing power. NIT Rourkela has 13 machines with 24GB graphics processing units (GPUs) each. Mausam noted that the scale of computing power required is “exorbitant and prohibitive”.

“For instance, one training run of GPT-3 would cost $4.6 million, not accounting for any errors and re-trials during training. No academic institution or any Indian company, apart from the top tech firms, can afford training such large models regularly. Looking to train India-specific language models is therefore premature unless we create massive compute infrastructure in the country,” IIT Delhi’s Mausam said.

A senior executive, who formerly worked on government tech projects, said on condition of anonymity that there is “a lack of clarity in terms of enabling access to India’s supercomputer infrastructure owned by the MeitY-backed Centre for Development of Advanced Computing (C-DAC).” As Mint reported on 6 July last year, India’s supercomputing power is also well behind global systems. The executive added that while multiple top institutes, including IIT Delhi, have been consulted on using the infrastructure for their research initiatives, not much progress has taken place in this regard.

Availability of data is another problem for India. For instance, NIT Rourkela uses various public datasets, such as the Samanantar database released by IIT Madras. “This consisted of low-resource language pairs of Indic languages. We’re also using our own datasets by scraping newspapers and converting to various languages, and then working on that. We’re also using publicly available data, such as state government-backed local language data repositories,” said Mishra. To accelerate AI research in India, MeitY launched ‘Bhashini’ in May last year, an Indic language database that can be tapped by institutes.

However, access to the scale of data needed for such projects remains an issue. “When a language has a huge amount of data available, transformer architectures can produce great efficiency of translation. But, with small amounts of data, this is difficult to work with. For instance, translating from Odiya to Hindi, such models are not very efficient,” IIT Madras’ Guruprasad said.