India's AI Push Driven by Local Language Data: Thousands Contribute to Building LLMs

AI & Web 3 | Sep 24, 2024

A major initiative is underway across India as organizations strive to develop foundational large language models (LLMs) trained on extensive datasets in Indian languages. Several firms, including DesiCrew Solutions, Karya, KeyPoint Technologies, and CSTS, have mobilized thousands of individuals to gather multimodal data (text, images, video, and voice) to train artificial intelligence models.

For many contributors, the task offers a secondary income, while for others, it's a way to preserve and promote their native languages. DesiCrew Solutions, based in Tamil Nadu and incubated at IIT Madras, is a key player in this AI ecosystem. As a partner in the AI4Bharat initiative, the company has collected over 13,000 hours of voice data from 25,000+ speakers across 221 districts, covering 22 official languages and numerous dialects.

"India may have 22 official languages, but there are over 300 dialects. We’ve gathered between 1,000-2,000 hours of data for each language," said Manivannan JK, CEO of DesiCrew. The company emphasizes diversity in its dataset, ensuring voices are sourced from varied social groups, genders, and regions, capturing the rich linguistic diversity of India.

DesiCrew’s workforce is largely female, with 70% of its staff spread across Tamil Nadu and Karnataka. Employees visit homes and villages to record voices, limiting each participant to 30 minutes of recording to keep the voice samples unique.

Other organizations, like the Centre for Studies of Tradition and Systems (CSTS), are also playing a vital role. CSTS, led by Savita Jha, has been working to promote marginalized languages like Maithili, which has a rich literary history but limited visibility. CSTS collaborated with IISc's Respin project, aimed at building speech recognition systems for underserved sectors like agriculture and finance. Within weeks, CSTS mobilized a team of 400-500 people from Bihar to contribute data for this initiative.

Organizations like Bengaluru-based Karya are also empowering rural, low-income communities by providing them opportunities to contribute to AI development. Karya’s workforce focuses on building datasets in crucial sectors such as healthcare, agriculture, and banking.

The large-scale participation in these projects highlights a sense of pride in preserving and promoting regional languages. While the financial incentives are modest, the impact on local economies and communities is significant, with many participants using their earnings to improve their livelihoods.

As India continues to develop AI systems rooted in its linguistic diversity, the efforts of these organizations and their contributors are driving a digital transformation that could have far-reaching implications for AI on a global scale.
