Saurabh Singh
2025
Field to Model: Pairing Community Data Collection with Scalable NLP through the LiFE Suite
Karthick Narayanan R | Siddharth Singh | Saurabh Singh | Aryan Mathur | Ritesh Kumar | Shyam Ratan | Bornini Lahiri | Benu Pareek | Neerav Mathur | Amalesh Gope | Meiraba Takhellambam | Yogesh Dawer
Proceedings of the Fourth Workshop on NLP Applications to Field Linguistics
We present LiFE Suite as a “Field-to-Model” pipeline, designed to bridge community-centred data collection with scalable language model development. This paper describes the various tools integrated into the LiFE Suite that make this unified pipeline possible. Atekho, a mobile-first data collection platform, is designed to empower communities to assert their rights over their data. MATra-Lab, a web-based data processing and annotation tool, supports the management of field data and the creation of NLP-ready datasets with support from existing state-of-the-art NLP models. LiFE Model Studio, built on top of Hugging Face AutoTrain, offers a no-code solution for building scalable language models using the field data. This end-to-end integration ensures that every dataset collected in the field retains its linguistic, cultural, and metadata context, all the way through to deployable AI models and archive-ready datasets.
2017
All that is English may be Hindi: Enhancing language identification through automatic ranking of the likeliness of word borrowing in social media
Jasabanta Patro | Bidisha Samanta | Saurabh Singh | Abhipsa Basu | Prithwish Mukherjee | Monojit Choudhury | Animesh Mukherjee
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing
In this paper, we present a set of computational methods to identify the likeliness of a word being borrowed, based on signals from social media. In terms of Spearman’s correlation values, our methods perform more than two times better (∼ 0.62) in predicting the borrowing likeliness compared to the best performing baseline (∼ 0.26) reported in the literature. Based on this likeliness estimate, we asked annotators to re-annotate the language tags of foreign words in predominantly native contexts. In 88% of cases, the annotators felt that the foreign language tag should be replaced by the native language tag, thus indicating a huge scope for improvement of automatic language identification systems.
Co-authors
- Abhipsa Basu 1
- Monojit Choudhury 1
- Yogesh Dawer 1
- Amalesh Gope 1
- Ritesh Kumar 1