Neerav Mathur
2025
Field to Model: Pairing Community Data Collection with Scalable NLP through the LiFE Suite
Karthick Narayanan R | Siddharth Singh | Saurabh Singh | Aryan Mathur | Ritesh Kumar | Shyam Ratan | Bornini Lahiri | Benu Pareek | Neerav Mathur | Amalesh Gope | Meiraba Takhellambam | Yogesh Dawer
Proceedings of the Fourth Workshop on NLP Applications to Field Linguistics
We present LiFE Suite as a “Field-to-Model” pipeline, designed to bridge community-centred data collection with scalable language model development. This paper describes the various tools integrated into the LiFE Suite that make this unified pipeline possible. Atekho, a mobile-first data collection platform, is designed to empower communities to assert their rights over their data. MATra-Lab, a web-based data processing and annotation tool, supports the management of field data and the creation of NLP-ready datasets with support from existing state-of-the-art NLP models. LiFE Model Studio, built on top of Hugging Face AutoTrain, offers a no-code solution for building scalable language models using the field data. This end-to-end integration ensures that every dataset collected in the field retains its linguistic, cultural, and metadata context, all the way through to deployable AI models and archive-ready datasets.
2023
An Open-source Web-based Application for Development of Resources and Technologies in Underresourced Languages
Siddharth Singh | Shyam Ratan | Neerav Mathur | Ritesh Kumar
Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023)
The paper discusses the Linguistic Field Data Management and Analysis System (LiFE), a new open-source, web-based application that systematises the storage, management, annotation, analysis and sharing of linguistic data gathered from the field as well as data crawled from web sources such as YouTube, Twitter, Facebook, Instagram, blogs, newspapers and Wikipedia. The app supports two broad workflows: (a) the field linguists' workflow, in which data is collected directly from speakers in the field and analysed further to produce grammatical descriptions, lexicons, educational materials and possibly language technologies; and (b) the computational linguists' workflow, in which data is collected from the web using automated crawlers or digitised using manual or semi-automatic means, annotated for various tasks, and then used to develop different kinds of language technologies. In addition to supporting these workflows, the app provides several further features: (a) it allows multiple users to work collaboratively on the same project through granular access control and sharing options; (b) it allows data to be exported to various formats, including CSV, TSV, JSON, XLSX, PDF, Textgrid and RDF (in different serialisation formats), as appropriate; (c) it allows data to be imported from various formats, including LIFT XML, XLSX, JSON, CSV, TSV and Textgrid; (d) it allows users to start working in the app at any stage of their work, either by creating a new project from scratch or by deriving a new project from an existing project in the app. The app is currently available for use and testing on our server (http://life.unreal-tece.co.in/), and its source code has been released under the AGPL license on our GitHub repository (https://github.com/unrealtecellp/life). It is licensed under separate, specific conditions for commercial usage.
Co-authors
- Ritesh Kumar 2
- Shyam Ratan 2
- Siddharth Singh 2
- Yogesh Dawer 1
- Amalesh Gope 1