This is an internal, incomplete preview of a proposed change to the ACL Anthology.
For efficiency reasons, we don't generate MODS or Endnote formats, and the preview may be incomplete in other ways, or contain mistakes.
Do not treat this content as an official publication.
KhyatiMahajan
Fixing paper assignments
Please select all papers that belong to the same person.
Indicate below which author they should be assigned to.
Given the inherent subjectivity of similarity in text, fully unsupervised text clustering is unlikely to produce groupings that work across a variety of use cases. Traditional techniques to guide clustering rely on costly, time-consuming human feedback and/or pre-existing labels. Leveraging recent advancements in LLMs and decoder-only embedding models, we present techniques to effectively control text embeddings with minimal human input: prefix instructions and LLM preprocessing. We evaluate clustering performance for datasets with multiple independent ground-truth labels, or perspectives, and find that these techniques can be used to improve clustering for one perspective or use case, at the cost of a tradeoff in performance for another use case.
Collecting instruction fine-tuning (IFT) data is a resource and time intensive task especially in multilingual setting where finding proficient native speakers is challenging. Moreover, traditional data collection is prone to privacy risks, toxicity and lacks scalability. While, fully synthetic datasets are a promising alternative, research on their use in multilingual domain is limited as existing approaches still rely on machine translation to improve multilingual performance. To bridge this gap we introduce M2Lingual, the first fully synthetic, multi-turn multilingual dataset having 175K conversations across 70 languages with a balanced mix of high, low and mid-resourced languages. M2Lingual is constructed using a cost-efficient and scalable method that uses our novel two-step Evol prompt taxonomy to transform a small set of human written instructions to complex and challenging conversations. Results across three model families, six baseline datasets and evaluation spanning 31 languages demonstrates the effectiveness of M2Lingual over other datasets.
Multilingual LLMs have achieved remarkable benchmark performance, but we find they continue to underperform on non-Latin script languages across contemporary LLM families. This discrepancy arises from the fact that LLMs are pretrained with orthographic scripts, which are dominated by Latin characters that obscure their shared phonology with non-Latin scripts. We propose leveraging phonemic transcriptions as complementary signals to induce script-invariant representations. Our study demonstrates that integrating phonemic signals improves performance across both non-Latin and Latin languages, with a particularly significant impact on closing the performance gap between the two. Through detailed experiments, we show that phonemic and orthographic scripts retrieve distinct examples for in-context learning (ICL). This motivates our proposed Mixed-ICL retrieval strategy, where further aggregation leads to our significant performance improvements for both Latin script languages (up to 12.6%) and non-Latin script languages (up to 15.1%) compared to randomized ICL retrieval.
Modeling interlocutor information is essential towards modeling multi-party conversations to account for the presence of multiple participants. We investigate the role of including the persona attributes of both the speaker and addressee relevant to each utterance, collected via 3 distinct mock social media experiments. The participants were recruited via MTurk, and were unaware of the persona attributes of the other users they interacted with on the platform. Our main contributions include 1) a multi-party conversation dataset with rich associated metadata (including persona), and 2) a persona-aware heterogeneous graph transformer response generation model. We find that PersonaHeterMPC provides a good baseline towards persona-aware generation for multi-party conversation modeling, generating responses which are relevant and consistent with the interlocutor personas relevant to the conversation.
Socratic questioning is a teaching strategy where the student is guided towards solving a problem on their own, instead of being given the solution directly. In this paper, we introduce a dataset of Socratic conversations where an instructor helps a novice programmer fix buggy solutions to simple computational problems. The dataset is then used for benchmarking the Socratic debugging abilities of GPT-based language models. While GPT-4 is observed to perform much better than GPT-3.5, its precision, and recall still fall short of human expert abilities, motivating further work in this area.
We present our work on augmenting dialog act recognition capabilities utilizing synthetically generated data. Our work is motivated by the limitations of current dialog act datasets, and the need to adapt for new domains as well as ambiguity in utterances written by humans. We list our observations and findings towards how synthetically generated data can contribute meaningfully towards more robust dialogue act recognition models extending to new domains. Our major finding shows that synthetic data, which is linguistically varied, can be very useful towards this goal and increase the performance from (0.39, 0.16) to (0.85, 0.88) for AFFIRM and NEGATE dialog acts respectively.
The events that took place at the Unite the Right rally held in Charlottesville, Virginia on August 11-12, 2017 caused intense reaction on social media from users across the political spectrum. We present a novel application of psycholinguistics - specifically, construal level theory - to analyze the language on social media around this event of social import through topic models. We find that including psycholinguistic measures of concreteness as covariates in topic models can lead to informed analysis of the language surrounding an event of political import.
We introduce GEM, a living benchmark for natural language Generation (NLG), its Evaluation, and Metrics. Measuring progress in NLG relies on a constantly evolving ecosystem of automated metrics, datasets, and human evaluation standards. Due to this moving target, new models often still evaluate on divergent anglo-centric corpora with well-established, but flawed, metrics. This disconnect makes it challenging to identify the limitations of current models and opportunities for progress. Addressing this limitation, GEM provides an environment in which models can easily be applied to a wide set of tasks and in which evaluation strategies can be tested. Regular updates to the benchmark will help NLG research become more multilingual and evolve the challenge alongside models. This paper serves as the description of the data for the 2021 shared task at the associated GEM Workshop.
In this paper, we describe our approach towards utilizing pre-trained models for the task of hope speech detection. We participated in Task 2: Hope Speech Detection for Equality, Diversity and Inclusion at LT-EDI-2021 @ EACL2021. The goal of this task is to predict the presence of hope speech, along with the presence of samples that do not belong to the same language in the dataset. We describe our approach to fine-tuning RoBERTa for Hope Speech detection in English and our approach to fine-tuning XLM-RoBERTa for Hope Speech detection in Tamil and Malayalam, two low resource Indic languages. We demonstrate the performance of our approach on classifying text into hope-speech, non-hope and not-language. Our approach ranked 1st in English (F1 = 0.93), 1st in Tamil (F1 = 0.61) and 3rd in Malayalam (F1 = 0.83).
We present a comprehensive survey of available corpora for multi-party dialogue. We survey over 300 publications related to multi-party dialogue and catalogue all available corpora in a novel taxonomy. We analyze methods of data collection for multi-party dialogue corpora and identify several lacunae in existing data collection approaches used to collect such dialogue. We present this survey, the first survey to focus exclusively on multi-party dialogue corpora, to motivate research in this area. Through our discussion of existing data collection methods, we identify desiderata and guiding principles for multi-party data collection to contribute further towards advancing this area of dialogue research.
We highlight the contribution of emotional and moral language towards information contagion online. We find that retweet count on Twitter is significantly predicted by the use of negative emotions with negative moral language. We find that a tweet is less likely to be retweeted (hence less engagement and less potential for contagion) when it has emotional language expressed as anger along with a specific type of moral language, known as authority-vice. Conversely, when sadness is expressed with authority-vice, the tweet is more likely to be retweeted. Our findings indicate how emotional and moral language can interact in predicting information contagion.
We study emoji usage patterns across two social media platforms, one of them considered a fringe community called Gab, and the other Twitter. We find that Gab tends to comparatively use more emotionally charged emoji, but also seems more apathetic towards the violence during the event, while Twitter takes a more empathetic approach to the event.