2024
pdf
abs
LLM Self-Correction with DeCRIM: Decompose, Critique, and Refine for Enhanced Following of Instructions with Multiple Constraints
Thomas Palmeira Ferraz
|
Kartik Mehta
|
Yu-Hsiang Lin
|
Haw-Shiuan Chang
|
Shereen Oraby
|
Sijia Liu
|
Vivek Subramanian
|
Tagyoung Chung
|
Mohit Bansal
|
Nanyun Peng
Findings of the Association for Computational Linguistics: EMNLP 2024
Instruction following is a key capability for LLMs. However, recent studies have shown that LLMs often struggle with instructions containing multiple constraints (e.g. a request to create a social media post “in a funny tone” with “no hashtag”). Despite this, most evaluations focus solely on synthetic data. To address this, we introduce RealInstruct, the first benchmark designed to evaluate LLMs’ ability to follow real-world multi-constrained instructions by leveraging queries real users asked AI assistants. We also investigate model-based evaluation as a cost-effective alternative to human annotation for this task. Our findings reveal that even the proprietary GPT-4 model fails to meet at least one constraint on over 21% of instructions, highlighting the limitations of state-of-the-art models. To address the performance gap between open-source and proprietary models, we propose the Decompose, Critique and Refine (DeCRIM) self-correction pipeline, which enhances LLMs’ ability to follow constraints. DeCRIM works by decomposing the original instruction into a list of constraints and using a Critic model to decide when and where the LLM’s response needs refinement. Our results show that DeCRIM improves Mistral’s performance by 7.3% on RealInstruct and 8.0% on IFEval even with weak feedback. Moreover, we demonstrate that with strong feedback, open-source LLMs with DeCRIM can outperform GPT-4 on both benchmarks.
pdf
abs
Redefining Proactivity for Information Seeking Dialogue
Jing Yang Lee
|
Seokhwan Kim
|
Kartik Mehta
|
Jiun-Yu Kao
|
Yu-Hsiang Lin
|
Arpit Gupta
Proceedings of the Second Workshop on Social Influence in Conversations (SICon 2024)
Humans pay careful attention to the interlocutor’s internal state in dialogues. For example, in recommendation dialogues, we make recommendations while estimating the seeker’s internal state, such as his/her level of knowledge and interest. Since there are no existing annotated resources for the analysis and experiment, we constructed RecomMind, a movie recommendation dialogue dataset with annotations of the seeker’s internal state at the entity level. Each entity has a first-person label annotated by the seeker and a second-person label annotated by the recommender. Our analysis based on RecomMind reveals that the success of recommendations is enhanced when recommenders mention entities that seekers do not know but are interested in. We also propose a response generation framework that explicitly considers the seeker’s internal state, utilizing the chain-of-thought prompting. The human evaluation results show that our proposed method outperforms the baseline method in both consistency and the success of recommendations.
2022
pdf
abs
NER-MQMRC: Formulating Named Entity Recognition as Multi Question Machine Reading Comprehension
Anubhav Shrimal
|
Avi Jain
|
Kartik Mehta
|
Promod Yenigalla
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Track
NER has been traditionally formulated as a sequence labeling task. However, there has been recent trend in posing NER as a machine reading comprehension task (Wang et al., 2020; Mengge et al., 2020), where entity name (or other information) is considered as a question, text as the context and entity value in text as answer snippet. These works consider MRC based on a single question (entity) at a time. We propose posing NER as a multi-question MRC task, where multiple questions (one question per entity) are considered at the same time for a single text. We propose a novel BERT-based multi-question MRC (NER-MQMRC) architecture for this formulation. NER-MQMRC architecture considers all entities as input to BERT for learning token embeddings with self-attention and leverages BERT-based entity representation for further improving these token embeddings for NER task. Evaluation on three NER datasets show that our proposed architecture leads to average 2.5 times faster training and 2.3 times faster inference as compared to NER-SQMRC framework based models by considering all entities together in a single pass. Further, we show that our model performance does not degrade compared to single-question based MRC (NER-SQMRC) (Devlin et al., 2019) leading to F1 gain of +0.41%, +0.32% and +0.27% for AE-Pub, Ecommerce5PT and Twitter datasets respectively. We propose this architecture primarily to solve large scale e-commerce attribute (or entity) extraction from unstructured text of a magnitude of 50k+ attributes to be extracted on a scalable production environment with high performance and optimised training and inference runtimes.
2021
pdf
abs
Scalable Approach for Normalizing E-commerce Text Attributes (SANTA)
Ravi Shankar Mishra
|
Kartik Mehta
|
Nikhil Rasiwasia
Proceedings of the 4th Workshop on e-Commerce and NLP
In this paper, we present SANTA, a scalable framework to automatically normalize E-commerce attribute values (e.g. “Win 10 Pro”) to a fixed set of pre-defined canonical values (e.g. “Windows 10”). Earlier works on attribute normalization focused on fuzzy string matching (also referred as syntactic matching in this paper). In this work, we first perform an extensive study of nine syntactic matching algorithms and establish that ‘cosine’ similarity leads to best results, showing 2.7% improvement over commonly used Jaccard index. Next, we show that string similarity alone is not sufficient for attribute normalization as many surface forms require going beyond syntactic matching (e.g. “720p” and “HD” are synonyms). While semantic techniques like unsupervised embeddings (e.g. word2vec/fastText) have shown good results in word similarity tasks, we observed that they perform poorly to distinguish between close canonical forms, as these close forms often occur in similar contexts. We propose to learn token embeddings using a twin network with triplet loss. We propose an embedding learning task leveraging raw attribute values and product titles to learn these embeddings in a self-supervised fashion. We show that providing supervision using our proposed task improves over both syntactic and unsupervised embeddings based techniques for attribute normalization. Experiments on a real-world dataset of 50 attributes show that the embeddings trained using our proposed approach obtain 2.3% improvement over best string similarity and 19.3% improvement over best unsupervised embeddings.
pdf
abs
LATEX-Numeric: Language Agnostic Text Attribute Extraction for Numeric Attributes
Kartik Mehta
|
Ioana Oprea
|
Nikhil Rasiwasia
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Papers
In this paper, we present LATEX-Numeric - a high-precision fully-automated scalable framework for extracting E-commerce numeric attributes from unstructured product text like product description. Most of the past work on attribute extraction is not scalable as they rely on manually curated training data, either with or without use of active learning. We rely on distant supervision for training data generation, removing dependency on manual labels. One issue with distant supervision is that it leads to incomplete training annotation due to missing attribute values while matching. We propose a multi-task learning architecture to deal with missing labels in the training data, leading to F1 improvement of 9.2% for numeric attributes over state-of-the-art single-task architecture. While multi-task architecture benefits both numeric and non-numeric attributes, we present automated techniques to further improve the numeric attributes extraction models. Numeric attributes require a list of units (or aliases) for better matching with distant supervision. We propose an automated algorithm for alias creation using unstructured text and attribute values, leading to a 20.2% F1 improvement. Extensive experiments on real world datasets for 20 numeric attributes across 5 product categories and 3 English marketplaces show that LATEX-numeric achieves a high F1-score, without any manual intervention, making it suitable for practical applications. Finally we show that the improvements are language-agnostic and LATEX-Numeric achieves 13.9% F1 improvement for 3 non-English languages.
2019
pdf
abs
Improving Answer Selection and Answer Triggering using Hard Negatives
Sawan Kumar
|
Shweta Garg
|
Kartik Mehta
|
Nikhil Rasiwasia
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)
In this paper, we establish the effectiveness of using hard negatives, coupled with a siamese network and a suitable loss function, for the tasks of answer selection and answer triggering. We show that the choice of sampling strategy is key for achieving improved performance on these tasks. Evaluating on recent answer selection datasets - InsuranceQA, SelQA, and an internal QA dataset, we show that using hard negatives with relatively simple model architectures (bag of words and LSTM-CNN) drives significant performance gains. On InsuranceQA, this strategy alone improves over previously reported results by a minimum of 1.6 points in P@1. Using hard negatives with a Transformer encoder provides a further improvement of 2.3 points. Further, we propose to use quadruplet loss for answer triggering, with the aim of producing globally meaningful similarity scores. We show that quadruplet loss function coupled with the selection of hard negatives enables bag-of-words models to improve F1 score by 2.3 points over previous baselines, on SelQA answer triggering dataset. Our results provide key insights into answer selection and answer triggering tasks.