Shyam Ratan

2025

We present LiFE Suite as a “Field-to-Model” pipeline, designed to bridge community-centred data collection with scalable language model development. This paper describes the various tools integrated into the LiFE Suite that make this unified pipeline possible. Atekho, a mobile-first data collection platform, is designed to empower communities to assert their rights over their data. MATra-Lab, a web-based data processing and annotation tool, supports the management of field data and the creation of NLP-ready datasets with support from existing state-of-the-art NLP models. LiFE Model Studio, built on top of Hugging Face AutoTrain, offers a no-code solution for building scalable language models using the field data. This end-to-end integration ensures that every dataset collected in the field retains its linguistic, cultural, and metadata context, all the way through to deployable AI models and archive-ready datasets.

2024

Proceedings of the Fourth Workshop on Threat, Aggression & Cyberbullying @ LREC-COLING-2024
Ritesh Kumar | Atul Kr. Ojha | Shervin Malmasi | Bharathi Raja Chakravarthi | Bornini Lahiri | Siddharth Singh | Shyam Ratan
Proceedings of the Fourth Workshop on Threat, Aggression & Cyberbullying @ LREC-COLING-2024

2023

An Open-source Web-based Application for Development of Resources and Technologies in Underresourced Languages
Siddharth Singh | Shyam Ratan | Neerav Mathur | Ritesh Kumar
Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023)

The paper discusses the Linguistic Field Data Management and Analysis System (LiFE), a new open-source, web-based software that systematises storage, management, annotation, analysis and sharing of linguistic data gathered from the field as well as data crawled from various sources on the web such as YouTube, Twitter, Facebook, Instagram, blogs, newspapers, Wikipedia, etc. The app supports two broad workflows: (a) the field linguists’ workflow, in which data is collected directly from the speakers in the field and analysed further to produce grammatical descriptions, lexicons, educational materials and possibly language technologies; and (b) the computational linguists’ workflow, in which data is collected from the web using automated crawlers or digitised using manual or semi-automatic means, annotated for various tasks and then used for developing different kinds of language technologies. In addition to supporting these workflows, the app provides some further features: (a) it allows multiple users to work collaboratively on the same project via its granular access control and sharing option; (b) it allows the data to be exported to various formats including CSV, TSV, JSON, XLSX, PDF, Textgrid, RDF (different serialisation formats), etc., as appropriate; (c) it allows data import from various formats, viz. LIFT XML, XLSX, JSON, CSV, TSV, Textgrid, etc.; (d) it allows users to start working in the app at any stage of their work by giving the option to either create a new project from scratch or derive a new project from an existing project in the app. The app is currently available for use and testing on our server (http://life.unreal-tece.co.in/) and its source code has been released under the AGPL license on our GitHub repository (https://github.com/unrealtecellp/life). It is licensed under separate, specific conditions for commercial usage.

2022

Towards a Unified Tool for the Management of Data and Technologies in Field Linguistics and Computational Linguistics - LiFE
Siddharth Singh | Ritesh Kumar | Shyam Ratan | Sonal Sinha
Proceedings of the Workshop on Resources and Technologies for Indigenous, Endangered and Lesser-resourced Languages in Eurasia within the 13th Language Resources and Evaluation Conference

The paper presents new software for endangered and low-resourced languages, the Linguistic Field Data Management and Analysis System (LiFE): an open-source, web-based linguistic data analysis and management application allowing systematic storage, management, usage and sharing of linguistic data collected from the field. The application enables users to store lexical items, sentences, paragraphs and audio-visual content including photographs, video clips, speech recordings, etc., with rich glossing and annotation. For field linguists, it provides facilities to generate interactive and print dictionaries; for NLP practitioners, it provides data storage and representation in standard formats such as RDF, JSON and CSV. The tool provides a one-click interface to train NLP models for various tasks using the data stored in the system and then use them for assistance in further storage of the data (especially for the field linguists). At the same time, the tool also provides the facility of using models trained outside of the tool for data storage, transcription, annotation and other tasks. The web-based application allows for seamless collaboration among multiple persons and sharing of data, models, etc. with each other.

The Universal Morphology (UniMorph) project is a collaborative effort providing broad-coverage instantiated normalized morphological inflection tables for hundreds of diverse world languages. The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation, and a type-level resource of annotated data in diverse languages realizing that schema. This paper presents the expansions and improvements on several fronts that were made in the last couple of years (since McCarthy et al. (2020)). Collaborative efforts by numerous linguists have added 66 new languages, including 24 endangered languages. We have implemented several improvements to the extraction pipeline to tackle some issues, e.g., missing gender and macrons information. We have amended the schema to use a hierarchical structure that is needed for morphological phenomena like multiple-argument agreement and case stacking, while adding some missing morphological features to make the schema more inclusive. In light of the last UniMorph release, we also augmented the database with morpheme segmentation for 16 languages. Lastly, this new release makes a push towards inclusion of derivational morphology in UniMorph by enriching the data and annotation schema with instances representing derivational processes from MorphyNet.

In this paper, we discuss the development of a multilingual dataset annotated with a hierarchical, fine-grained tagset marking different types of aggression and the “context” in which they occur. The context, here, is defined by the conversational thread in which a specific comment occurs and also the “type” of discursive role that the comment is performing with respect to the previous comment. The initial dataset discussed here consists of a total of 59,152 annotated comments in four languages - Meitei, Bangla, Hindi, and Indian English - collected from various social media platforms such as YouTube, Facebook, Twitter and Telegram. As is usual on social media websites, a large number of these comments are multilingual, mostly code-mixed with English. The paper gives a detailed description of the tagset being used for annotation and also the process of developing a multi-label, fine-grained tagset that has been used for marking comments with aggression and bias of various kinds including sexism (called gender bias in the tagset), religious intolerance (called communal bias in the tagset), class/caste bias and ethnic/racial bias. We also define and discuss the tags that have been used for marking the different discursive roles being performed through the comments, such as attack, defend, etc. Finally, we present a basic statistical analysis of the dataset. The dataset is being incrementally made publicly available on the project website.

2021

Multilingual Protest News Detection - Shared Task 1, CASE 2021
Ali Hürriyetoğlu | Osman Mutlu | Erdem Yörük | Farhana Ferdousi Liza | Ritesh Kumar | Shyam Ratan
Proceedings of the 4th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE 2021)

Benchmarking state-of-the-art text classification and information extraction systems in multilingual, cross-lingual, few-shot, and zero-shot settings for socio-political event information collection is achieved in the scope of the shared task Socio-political and Crisis Events Detection at the workshop CASE @ ACL-IJCNLP 2021. Socio-political event data is utilized for national and international policy- and decision-making. Therefore, the reliability and validity of these datasets are of the utmost importance. We split the shared task into three parts to address the three aspects of data collection (Task 1), fine-grained semantic classification (Task 2), and evaluation (Task 3). Task 1, which is the focus of this report, is on multilingual protest news detection and comprises four subtasks: document classification (subtask 1), sentence classification (subtask 2), event sentence coreference identification (subtask 3), and event extraction (subtask 4). All subtasks had English, Portuguese, and Spanish for both training and evaluation data. Data in Hindi was available only for the evaluation of subtask 1. The majority of the submissions, which are 238 in total, are created using multi- and cross-lingual approaches. Best scores are above 77.27 F1-macro for subtask 1, above 85.32 F1-macro for subtask 2, above 84.23 CoNLL 2012 average score for subtask 3, and above 66.20 F1-macro for subtask 4 in all evaluation settings. The performance of the best system for subtask 4 is above 66.20 F1 for all available languages. Although there is still significant room for improvement in cross-lingual and zero-shot settings, the best submissions for each evaluation scenario yield remarkable results. Monolingual models outperformed the multilingual models in a few evaluation scenarios.

Demo of the Linguistic Field Data Management and Analysis System - LiFE
Siddharth Singh | Ritesh Kumar | Shyam Ratan | Sonal Sinha
Proceedings of the 18th International Conference on Natural Language Processing (ICON)

In the proposed demo, we will present new software, the Linguistic Field Data Management and Analysis System (LiFE): an open-source, web-based linguistic data management and analysis application that allows for systematic storage, management, sharing and usage of linguistic data collected from the field. The application allows users to store lexical items, sentences, paragraphs and audio-visual content including photographs, video clips, speech recordings, etc., along with rich glossing / annotation; generate interactive and print dictionaries; and also train and use natural language processing tools and models for various purposes using this data. Since it is a web-based application, it also allows for seamless collaboration among multiple persons and sharing of data, models, etc. with each other. The system uses the Python-based Flask framework and MongoDB (as the database) in the backend and HTML, CSS and JavaScript at the frontend. The interface allows the creation of multiple projects that can be shared with other users. At the backend, the application stores the data in RDF format so as to allow its release as Linked Data over the web using semantic web technologies. As of now, it makes use of OntoLex-Lemon for storing the lexical data and Ligt for storing the interlinear glossed text, and then internally links it to other linked lexicons and databases such as DBpedia and WordNet. Furthermore, it provides support for training NLP systems using the scikit-learn and HuggingFace Transformers libraries, as well as for making use of any model trained using these libraries. While the user interface itself provides limited options for tuning the system, an externally trained model can be easily incorporated within the application; similarly, the dataset itself can be easily exported into a standard machine-readable format like JSON or CSV that can be consumed by other programs and pipelines. The system is built as an online platform; however, since we are making the source code available, it can also be installed by users on their internal / personal servers.

Proceedings of the 18th International Conference on Natural Language Processing: Shared Task on Multilingual Gender Biased and Communal Language Identification
Ritesh Kumar | Siddharth Singh | Enakshi Nandi | Shyam Ratan | Laishram Niranjana Devi | Bornini Lahiri | Akanksha Bansal | Akash Bhagat | Yogesh Dawer
Proceedings of the 18th International Conference on Natural Language Processing: Shared Task on Multilingual Gender Biased and Communal Language Identification

ComMA@ICON: Multilingual Gender Biased and Communal Language Identification Task at ICON-2021
Ritesh Kumar | Shyam Ratan | Siddharth Singh | Enakshi Nandi | Laishram Niranjana Devi | Akash Bhagat | Yogesh Dawer | Bornini Lahiri | Akanksha Bansal
Proceedings of the 18th International Conference on Natural Language Processing: Shared Task on Multilingual Gender Biased and Communal Language Identification

This paper presents the findings of the ICON-2021 shared task on Multilingual Gender Biased and Communal Language Identification, which aims to identify aggression, gender bias, and communal bias in data presented in four languages: Meitei, Bangla, Hindi and English. The participants were given the option of approaching the task as three separate classification tasks, a multi-label classification task or a structured classification task. If approached as three separate classification tasks, the task includes three sub-tasks: aggression identification (sub-task A), gender bias identification (sub-task B), and communal bias identification (sub-task C). For this task, the participating teams were provided with a dataset of approximately 12,000 comments in total, with 3,000 comments in each of the four languages, sourced from popular social media sites such as YouTube, Twitter, Facebook and Telegram, and with the three labels presented as a single tuple. For testing the systems, approximately 1,000 comments were provided in each language for every sub-task. We attracted a total of 54 registrations in the task, out of which 11 teams submitted their test runs. The best system obtained an overall instance-F1 of 0.371 on the multilingual test set (it was simply a combined test set of the instances in each individual language). In the individual sub-tasks, the best micro F1 scores are 0.539, 0.767 and 0.834 for sub-tasks A, B and C respectively. The best overall, averaged micro F1 is 0.713. The results show that while systems have managed to perform reasonably well in individual sub-tasks, especially the gender bias and communal bias tasks, it is substantially more difficult to do a 3-class classification of aggression level and even more difficult to build a system that classifies everything correctly. It is only in slightly over 1/3 of the instances that most of the systems predicted the correct class across the board, despite the fact that there was a significant overlap across the three sub-tasks.

Developing Universal Dependencies Treebanks for Magahi and Braj
Mohit Raj | Shyam Ratan | Deepak Alok | Ritesh Kumar | Atul Kr. Ojha
Proceedings of the First Workshop on Parsing and its Applications for Indian Languages

In this paper, we discuss the development of treebanks for two low-resourced Indian languages - Magahi and Braj - based on the Universal Dependencies framework. The Magahi treebank contains 945 sentences and the Braj treebank around 500 sentences marked with their lemmas, part-of-speech tags, morphological features and universal dependencies. This paper gives a description of the different dependency relationships found in the two languages and gives some statistics of the two treebanks. The dataset will be made publicly available in the Universal Dependencies (UD) repository in the next (v2.10) release.

This year’s iteration of the SIGMORPHON Shared Task on morphological reinflection focuses on typological diversity and cross-lingual variation of morphosyntactic features. In terms of the task, we enrich UniMorph with new data for 32 languages from 13 language families, with most of them being under-resourced: Kunwinjku, Classical Syriac, Arabic (Modern Standard, Egyptian, Gulf), Hebrew, Amharic, Aymara, Magahi, Braj, Kurdish (Central, Northern, Southern), Polish, Karelian, Livvi, Ludic, Veps, Võro, Evenki, Xibe, Tuvan, Sakha, Turkish, Indonesian, Kodi, Seneca, Asháninka, Yanesha, Chukchi, Itelmen, Eibela. We evaluate six systems on the new data and conduct an extensive error analysis of the systems’ predictions. Transformer-based models generally demonstrate superior performance on the majority of languages, achieving >90% accuracy on 65% of them. The languages on which systems yielded low accuracy are mainly under-resourced, with a limited amount of data. Most errors made by the systems are due to allomorphy, honorificity, and form variation. In addition, we observe that systems especially struggle to inflect multiword lemmas. The systems also produce misspelled forms or end up in repetitive loops (e.g., RNN-based models). Finally, we report a large drop in systems’ performance on previously unseen lemmas.