2025
pdf
bib
abs
Your Model is Overconfident, and Other Lies We Tell Ourselves
Timothee Mickus
|
Aman Sinha
|
Raúl Vázquez
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The difficulty intrinsic to a given example, rooted in its inherent ambiguity, is a key yet often overlooked factor in evaluating neural NLP models. We investigate the interplay and divergence among various metrics for assessing intrinsic difficulty, including annotator dissensus, training dynamics, and model confidence. Through a comprehensive analysis using 29 models on three datasets, we reveal that while correlations exist among these metrics, their relationships are neither linear nor monotonic. By disentangling these dimensions of uncertainty, we aim to refine our understanding of data complexity and its implications for evaluating and improving NLP models.
pdf
bib
abs
Fossils at SemEval-2025 Task 9: Tasting Loss Functions for Food Hazard Detection in Text Reports
Aman Sinha
|
Federica Gamba
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)
Food hazard detection is an emerging field where NLP solutions are being explored. Despite the recent accessibility of powerful language models, one of the key challenges that still persists is the high class imbalance within datasets, often referred to in the literature as the “long tail problem”. In this work, we present a study exploring different loss functions borrowed from the field of visual recognition to tackle long-tailed class imbalance for food hazard detection in text reports. Our submission to SemEval-2025 Task 9 on the Food Hazard Detection Challenge shows how re-weighting mechanisms in loss functions prove beneficial in class imbalance scenarios. In particular, we empirically show that class-balanced and focal loss functions outperform all other loss strategies for Subtasks 1 and 2, respectively.
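As a sketch of the two loss families named above (not the authors' implementation; gamma, beta, and the class counts are illustrative defaults), focal loss down-weights easy examples while class-balanced re-weighting up-weights rare classes:

```python
import math

def focal_loss(p_true, gamma=2.0):
    """Focal loss for the true-class probability p_true:
    the (1 - p)^gamma factor down-weights easy examples."""
    return -((1.0 - p_true) ** gamma) * math.log(p_true)

def class_balanced_weight(n_c, beta=0.999):
    """Class-balanced re-weighting via the 'effective number' of samples:
    w_c = (1 - beta) / (1 - beta^n_c); rare classes get larger weights."""
    return (1.0 - beta) / (1.0 - beta ** n_c)

# A confident correct prediction contributes almost nothing to the loss,
# while a hard (low-probability) example dominates it.
easy, hard = focal_loss(0.95), focal_loss(0.30)

# A rare class (10 samples) is weighted far more than a frequent one (10,000).
rare_w, frequent_w = class_balanced_weight(10), class_balanced_weight(10_000)
```

Both mechanisms shift gradient mass toward the long tail, which is the behavior that matters under heavy class imbalance.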
pdf
bib
abs
SemEval-2025 Task 3: Mu-SHROOM, the Multilingual Shared-task on Hallucinations and Related Observable Overgeneration Mistakes
Raul Vazquez
|
Timothee Mickus
|
Elaine Zosa
|
Teemu Vahtola
|
Jörg Tiedemann
|
Aman Sinha
|
Vincent Segonne
|
Fernando Sanchez-Vega
|
Alessandro Raganato
|
Jindřich Libovický
|
Jussi Karlgren
|
Shaoxiong Ji
|
Jindřich Helcl
|
Liane Guillou
|
Ona De Gibert
|
Jaione Bengoetxea
|
Joseph Attieh
|
Marianna Apidianaki
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)
We present the Mu-SHROOM shared task which is focused on detecting hallucinations and other overgeneration mistakes in the output of instruction-tuned large language models (LLMs). Mu-SHROOM addresses general-purpose LLMs in 14 languages, and frames the hallucination detection problem as a span-labeling task. We received 2,618 submissions from 43 participating teams employing diverse methodologies. The very high number of submissions highlights the interest of the community in hallucination detection. We present the results of the participating systems and provide an empirical analysis in order to better understand the factors that can lead to strong performance in this task. We also underscore current challenges, notably the varying degree of hallucinations across languages and the high annotator disagreement when labeling hallucination spans.
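The span-labeling framing can be sketched as character-level span scoring; the example text, offsets, and the IoU metric below are illustrative, not the official Mu-SHROOM data or scorer:

```python
def char_labels(text, spans):
    """Binary per-character labels: 1 if the character lies in any span."""
    labels = [0] * len(text)
    for start, end in spans:
        for i in range(start, min(end, len(text))):
            labels[i] = 1
    return labels

def span_iou(text, pred_spans, gold_spans):
    """Character-level intersection-over-union between predicted
    and gold hallucination spans."""
    p = char_labels(text, pred_spans)
    g = char_labels(text, gold_spans)
    inter = sum(a & b for a, b in zip(p, g))
    union = sum(a | b for a, b in zip(p, g))
    return inter / union if union else 1.0

text = "The Eiffel Tower was built in 1999 by Gustave Eiffel."
gold = [(30, 34)]   # "1999" is the hallucinated span (the tower dates to 1889)
pred = [(27, 34)]   # a system that over-predicts slightly
score = span_iou(text, pred, gold)
```

Character-level comparison like this makes partial credit for overlapping spans natural, which matters when annotators themselves disagree on exact span boundaries.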
2024
pdf
bib
abs
Domain-specific or Uncertainty-aware models: Does it really make a difference for biomedical text classification?
Aman Sinha
|
Timothee Mickus
|
Marianne Clausel
|
Mathieu Constant
|
Xavier Coubez
Proceedings of the 23rd Workshop on Biomedical Natural Language Processing
The success of pretrained language models (PLMs) across a spate of use-cases has led to significant investment from the NLP community towards building domain-specific foundational models. On the other hand, in mission critical settings such as biomedical applications, other aspects also factor in—chief of which is a model’s ability to produce reasonable estimates of its own uncertainty. In the present study, we discuss these two desiderata through the lens of how they shape the entropy of a model’s output probability distribution. We find that domain specificity and uncertainty awareness can often be successfully combined, but the exact task at hand weighs in much more strongly.
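The entropy of a model's output probability distribution, the quantity the study examines, is straightforward to compute; this is a minimal sketch, not code from the paper:

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of an output probability distribution;
    higher entropy means a more uncertain prediction."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# A peaked (overconfident) distribution has low entropy...
confident = entropy([0.97, 0.01, 0.01, 0.01])
# ...while a uniform one is maximally uncertain: log(4) nats for 4 classes.
uncertain = entropy([0.25, 0.25, 0.25, 0.25])
```

Domain-specific fine-tuning tends to sharpen this distribution while uncertainty-aware training tends to flatten it, which is why the two desiderata can pull the entropy in opposite directions.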
pdf
bib
abs
Retrieve, Generate, Evaluate: A Case Study for Medical Paraphrases Generation with Small Language Models
Ioana Buhnila
|
Aman Sinha
|
Mathieu Constant
Proceedings of the 1st Workshop on Towards Knowledgeable Language Models (KnowLLM 2024)
The recent surge in the accessibility of large language models (LLMs) to the general population can lead to untrackable use of such models for medical recommendations. Language generation via LLMs has two key problems: firstly, they are prone to hallucination and therefore, for any medical purpose, they require scientific and factual grounding; secondly, LLMs pose tremendous challenges to computational resources due to their gigantic size. In this work, we introduce pRAGe, a Pipeline for Retrieval Augmented Generation and Evaluation of medical paraphrases generation using Small Language Models (SLMs). We study the effectiveness of SLMs and the impact of an external knowledge base for medical paraphrase generation in French.
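A retrieval-augmented generation pipeline of the kind pRAGe implements can be sketched in miniature; the toy lexical retriever, knowledge base, and prompt template below are illustrative stand-ins, not pRAGe's actual components:

```python
def retrieve(query, knowledge_base, k=2):
    """Toy lexical retriever: rank KB entries by word overlap with the query."""
    def overlap(doc):
        return len(set(query.lower().split()) & set(doc.lower().split()))
    return sorted(knowledge_base, key=overlap, reverse=True)[:k]

def build_prompt(term, evidence):
    """Ground the paraphrase request in the retrieved evidence."""
    context = "\n".join(f"- {doc}" for doc in evidence)
    return (f"Using only the facts below, paraphrase '{term}' "
            f"in plain language:\n{context}")

kb = [
    "Hypertension is persistently elevated arterial blood pressure.",
    "Aspirin is used to reduce pain, fever, or inflammation.",
    "Blood pressure is measured in millimetres of mercury (mmHg).",
]
evidence = retrieve("hypertension blood pressure", kb)
prompt = build_prompt("hypertension", evidence)
```

Grounding the generator in retrieved evidence addresses the hallucination problem, while keeping the generator a small model addresses the compute problem; a real system would use a dense retriever and an SLM in place of these stand-ins.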
2023
pdf
bib
abs
What shall we read: the article or the citations? - A case study on scientific language understanding
Aman Sinha
|
Sam Bigeard
|
Marianne Clausel
|
Mathieu Constant
Actes de CORIA-TALN 2023. Actes de l'atelier "Analyse et Recherche de Textes Scientifiques" (ARTS)@TALN 2023
The number of scientific articles is increasing tremendously across all domains, to such an extent that it has become hard for researchers to remain up-to-date. With the advancement of Natural Language Processing (NLP) techniques, scientific language understanding and Information Extraction (IE) systems are evidently benefiting the needs of users. The majority of practices for building such systems are data-driven, advocating the idea of “the more, the better”. In this work, we revisit this paradigm, questioning which type of data, text (title, abstract) or citations, has more impact on the performance of scientific language understanding systems.
2022
pdf
bib
abs
IAI @ SocialDisNER: Catch me if you can! Capturing complex disease mentions in tweets
Aman Sinha
|
Cristina Garcia Holgado
|
Marianne Clausel
|
Matthieu Constant
Proceedings of the Seventh Workshop on Social Media Mining for Health Applications, Workshop & Shared Task
Biomedical NER is an active research area today. Despite the availability of state-of-the-art models for standard NER tasks, their performance degrades on biomedical data due to OOV entities and the challenges encountered in specialized domains. We use the Flair-NER framework to investigate the effectiveness of various contextual and static embeddings for NER on Spanish tweets, in particular to capture complex disease mentions.
pdf
bib
abs
Word Sense Disambiguation of French Lexicographical Examples Using Lexical Networks
Aman Sinha
|
Sandrine Ollinger
|
Mathieu Constant
Proceedings of TextGraphs-16: Graph-based Methods for Natural Language Processing
This paper focuses on the task of word sense disambiguation (WSD) on lexicographic examples relying on the French Lexical Network (fr-LN). For this purpose, we exploit the lexical and relational properties of the network, that we integrated in a feedforward neural WSD model on top of pretrained French BERT embeddings. We provide a comparative study with various models and further show the impact of our approach regarding polysemic units.
2020
pdf
bib
abs
C-Net: Contextual Network for Sarcasm Detection
Amit Kumar Jena
|
Aman Sinha
|
Rohit Agarwal
Proceedings of the Second Workshop on Figurative Language Processing
Automatic Sarcasm Detection in conversations is a difficult and tricky task. Classifying an utterance as sarcastic or not in isolation can be futile, since most of the time the sarcastic nature of a sentence heavily relies on its context. This paper presents our proposed model, C-Net, which takes the contextual information of a sentence in a sequential manner to classify it as sarcastic or non-sarcastic. Our model showcases competitive performance in the Sarcasm Detection shared task organised on CodaLab, achieving a 75.0% F1-score on the Twitter dataset and a 66.3% F1-score on the Reddit dataset.
pdf
bib
abs
DSC IIT-ISM at SemEval-2020 Task 6: Boosting BERT with Dependencies for Definition Extraction
Aadarsh Singh
|
Priyanshu Kumar
|
Aman Sinha
Proceedings of the Fourteenth Workshop on Semantic Evaluation
We explore the performance of Bidirectional Encoder Representations from Transformers (BERT) at definition extraction. We further propose a joint model of BERT and a Text Level Graph Convolutional Network so as to incorporate dependencies into the model. Our proposed model produces better results than BERT and achieves comparable results to BERT with a fine-tuned language model in DeftEval (Task 6 of SemEval 2020), a shared task of classifying whether a sentence contains a definition or not (Subtask 1).
pdf
bib
abs
DSC IIT-ISM at SemEval-2020 Task 8: Bi-Fusion Techniques for Deep Meme Emotion Analysis
Pradyumna Gupta
|
Himanshu Gupta
|
Aman Sinha
Proceedings of the Fourteenth Workshop on Semantic Evaluation
Memes have become a ubiquitous social media entity, and the processing and analysis of such multimodal data is currently an active area of research. This paper presents our work on the Memotion Analysis shared task of SemEval 2020, which involves the sentiment and humor analysis of memes. We propose a system which uses different bimodal fusion techniques to leverage the inter-modal dependency for sentiment and humor classification tasks. Out of all our experiments, the best system improved the baseline with macro F1 scores of 0.357 on Sentiment Classification (Task A), 0.510 on Humor Classification (Task B) and 0.312 on Scales of Semantic Classes (Task C).
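Two common bimodal fusion schemes of the kind the paper explores can be sketched as follows; the feature vectors and fusion choices are illustrative, not the submitted system:

```python
def concat_fusion(text_feat, image_feat):
    """Early fusion: concatenate the two modality feature vectors."""
    return text_feat + image_feat  # list concatenation

def multiplicative_fusion(text_feat, image_feat):
    """Bilinear-style fusion: element-wise (Hadamard) product of the modalities."""
    return [t * v for t, v in zip(text_feat, image_feat)]

text_feat = [0.2, -0.5, 0.1, 0.7]    # e.g. a sentence embedding (toy values)
image_feat = [0.9, 0.3, -0.4, 0.0]   # e.g. an image embedding (toy values)

fused_concat = concat_fusion(text_feat, image_feat)
fused_mult = multiplicative_fusion(text_feat, image_feat)
```

Concatenation preserves each modality's features independently, whereas the multiplicative form lets the classifier see explicit inter-modal interactions, which is the dependency the abstract refers to.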
pdf
bib
abs
DSC-IIT ISM at WNUT-2020 Task 2: Detection of COVID-19 informative tweets using RoBERTa
Sirigireddy Dhana Laxmi
|
Rohit Agarwal
|
Aman Sinha
Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020)
Social media platforms such as Twitter are hotspots of user-generated information. In the ongoing COVID-19 pandemic, there has been an abundance of data on social media which can be classified as informative and uninformative content. In this paper, we present our work to detect informative COVID-19 English tweets using the RoBERTa model, as a part of the W-NUT workshop 2020. We show the efficacy of our model on a public dataset with an F1-score of 0.89 on the validation dataset and 0.87 on the leaderboard.