Manan Uppadhyay


2026

Developing culturally grounded multilingual AI systems remains challenging, particularly for low-resource languages. While synthetic data offers promise, its effectiveness in multilingual and multicultural contexts is underexplored. We investigate bottom-up synthetic data generation using large open-source LLMs (>= 235B parameters) grounded in language-specific Wikipedia content, complementing the dominant top-down approaches that translate from English. We introduce Updesh, a high-quality, large-scale synthetic instruction-following dataset comprising 9.5M data points across 13 Indian languages and English. It encompasses diverse reasoning and generative tasks, with an emphasis on enhancing long-context and multi-turn capabilities and on improving alignment with Indian cultural contexts. Comprehensive evaluation using automated metrics and 10K human assessments confirms high data quality. Downstream evaluations, in which we fine-tune models on various datasets and assess them on 13 diverse multilingual datasets and through comparative model evaluations, demonstrate that models trained on Updesh consistently obtain significant improvements on NLG tasks and remain competitive on NLU tasks. Improvements are most pronounced for low- and medium-resource languages, effectively narrowing performance gaps with high-resource languages. Our findings provide empirical evidence that effective multilingual AI development requires multi-faceted, culturally grounded data curation strategies beyond translation-based approaches.