Yassine Benajiba

2023

NLP models often degrade in performance when real world data distributions differ markedly from training data. However, existing dataset drift metrics in NLP have generally not considered specific dimensions of linguistic drift that affect model performance, and they have not been validated in their ability to predict model performance at the individual example level, where such metrics are often used in practice. In this paper, we propose three dimensions of linguistic dataset drift: vocabulary, structural, and semantic drift. These dimensions correspond to content word frequency divergences, syntactic divergences, and meaning changes not captured by word frequencies (e.g. lexical semantic change). We propose interpretable metrics for all three drift dimensions, and we modify past performance prediction methods to predict model performance at both the example and dataset level for English sentiment classification and natural language inference. We find that our drift metrics are more effective than previous metrics at predicting out-of-domain model accuracies (mean 16.8% root mean square error decrease), particularly when compared to popular fine-tuned embedding distances (mean 47.7% error decrease). Fine-tuned embedding distances are much more effective at ranking individual examples by expected performance, but decomposing into vocabulary, structural, and semantic drift produces the best example rankings of all considered model-agnostic drift metrics (mean 6.7% ROC AUC increase).

Opinion summarization provides an important solution for summarizing opinions expressed among a large number of reviews. However, generating aspect-specific and general summaries is challenging due to the lack of annotated data. In this work, we propose two simple yet effective unsupervised approaches to generate both aspect-specific and general opinion summaries by training on synthetic datasets constructed with aspect-related review contents. Our first approach, Seed Words Based Leave-One-Out (SW-LOO), identifies aspect-related portions of reviews simply by exact-matching aspect seed words and outperforms existing methods by 3.4 ROUGE-L points on Space and 0.5 ROUGE-1 point on Oposum+ for aspect-specific opinion summarization. Our second approach, Natural Language Inference Based Leave-One-Out (NLI-LOO) identifies aspect-related sentences utilizing an NLI model in a more general setting without using seed words and outperforms existing approaches by 1.2 ROUGE-L points on Space for aspect-specific opinion summarization and remains competitive on other metrics.

Sequence-to-sequence state-of-the-art systems for dialogue state tracking (DST) use the full dialogue history as input, represent the current state as a list with all the slots, and generate the entire state from scratch at each dialogue turn. This approach is inefficient, especially when the number of slots is large and the conversation is long. We propose Diable, a new task formalisation that simplifies the design and implementation of efficient DST systems and allows one to easily plug and play large language models. We represent the dialogue state as a table and formalise DST as a table manipulation task. At each turn, the system updates the previous state by generating table operations based on the dialogue context. Extensive experimentation on the MultiWoz datasets demonstrates that Diable (i) outperforms strong efficient DST baselines, (ii) is 2.4x more time efficient than current state-of-the-art methods while retaining competitive Joint Goal Accuracy, and (iii) is robust to noisy data annotations due to the table operations approach.

Training a Named Entity Recognition (NER) model often involves fixing a taxonomy of entity types. However, requirements evolve and we might need the NER model to recognize additional entity types. A simple approach is to re-annotate entire dataset with both existing and additional entity types and then train the model on the re-annotated dataset. However, this is an extremely laborious task. To remedy this, we propose a novel approach called Partial Label Model (PLM) that uses only partially annotated datasets. We experiment with 6 diverse datasets and show that PLM consistently performs better than most other approaches (0.5 - 2.5 F1), including in novel settings for taxonomy expansion not considered in prior work. The gap between PLM and all other approaches is especially large in settings where there is limited data available for the additional entity types (as much as 11 F1), thus suggesting a more cost effective approaches to taxonomy expansion.

pdf abs
Dynamic Benchmarking of Masked Language Models on Temporal Concept Drift with Multiple Views
Katerina Margatina | Shuai Wang | Yogarshi Vyas | Neha Anna John | Yassine Benajiba | Miguel Ballesteros
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Temporal concept drift refers to the problem of data changing over time. In the field of NLP, that would entail that language (e.g. new expressions, meaning shifts) and factual knowledge (e.g. new concepts, updated facts) evolve over time. Focusing on the latter, we benchmark 11 pretrained masked language models (MLMs) on a series of tests designed to evaluate the effect of temporal concept drift, as it is crucial that widely used language models remain up-to-date with the ever-evolving factual updates of the real world. Specifically, we provide a holistic framework that (1) dynamically creates temporal test sets of any time granularity (e.g. month, quarter, year) of factual data from Wikidata, (2) constructs fine-grained splits of tests (e.g. updated, new, unchanged facts) to ensure comprehensive analysis, and (3) evaluates MLMs in three distinct ways (single-token probing, multi-token generation, MLM scoring). In contrast to prior work, our framework aims to unveil how robust an MLM is over time and thus to provide a signal in case it has become outdated, by leveraging multiple views of evaluation.

Aspect-based Sentiment Analysis (ABSA) is a fine-grained sentiment analysis task which involves four elements from user-generated texts:aspect term, aspect category, opinion term, and sentiment polarity. Most computational approaches focus on some of the ABSA sub-taskssuch as tuple (aspect term, sentiment polarity) or triplet (aspect term, opinion term, sentiment polarity) extraction using either pipeline or joint modeling approaches. Recently, generative approaches have been proposed to extract all four elements as (one or more) quadrupletsfrom text as a single task. In this work, we take a step further and propose a unified framework for solving ABSA, and the associated sub-tasksto improve the performance in few-shot scenarios. To this end, we fine-tune a T5 model with instructional prompts in a multi-task learning fashion covering all the sub-tasks, as well as the entire quadruple prediction task. In experiments with multiple benchmark datasets, we show that the proposed multi-task prompting approach brings performance boost (by absolute 8.29 F1) in the few-shot learning setting.

With increasing demand for and adoption of virtual assistants, recent work has investigated ways to accelerate bot schema design through the automatic induction of intents or the induction of slots and dialogue states. However, a lack of dedicated benchmarks and standardized evaluation has made progress difficult to track and comparisons between systems difficult to make. This challenge track, held as part of the Eleventh Dialog Systems Technology Challenge, introduces a benchmark that aims to evaluate methods for the automatic induction of customer intents in a realistic setting of customer service interactions between human agents and customers. We propose two subtasks for progressively tackling the automatic induction of intents and corresponding evaluation methodologies. We then present three datasets suitable for evaluating the tasks and propose simple baselines. Finally, we summarize the submissions and results of the challenge track, for which we received submissions from 34 teams.

2021

pdf abs
ODIST: Open World Classification via Distributionally Shifted Instances
Lei Shu | Yassine Benajiba | Saab Mansour | Yi Zhang
Findings of the Association for Computational Linguistics: EMNLP 2021

In this work, we address the open-world classification problem with a method called ODIST, open world classification via distributionally shifted instances. This novel and straightforward method can create out-of-domain instances from the in-domain training instances with the help of a pre-trained generative language model. Experimental results show that ODIST performs better than state-of-the-art decision boundary finding method.

2020

pdf abs
Aspect On: an Interactive Solution for Post-Editing the Aspect Extraction based on Online Learning
Mara Chinea-Rios | Marc Franco-Salvador | Yassine Benajiba
Proceedings of the Twelfth Language Resources and Evaluation Conference

The task of aspect extraction is an important component of aspect-based sentiment analysis. However, it usually requires an expensive human post-processing to ensure quality. In this work we introduce Aspect On, an interactive solution based on online learning that allows users to post-edit the aspect extraction with little effort. The Aspect On interface shows the aspects extracted by a neural model and, given a dataset, annotates its words with the corresponding aspects. Thanks to the online learning, Aspect On updates the model automatically and continuously improves the quality of the aspects displayed to the user. Experimental results show that Aspect On dramatically reduces the number of user clicks and effort required to post-edit the aspects extracted by the model.

2019

pdf abs
SymantoResearch at SemEval-2019 Task 3: Combined Neural Models for Emotion Classification in Human-Chatbot Conversations
Angelo Basile | Marc Franco-Salvador | Neha Pawar | Sanja Štajner | Mara Chinea Rios | Yassine Benajiba
Proceedings of the 13th International Workshop on Semantic Evaluation

In this paper, we present our participation to the EmoContext shared task on detecting emotions in English textual conversations between a human and a chatbot. We propose four neural systems and combine them to further improve the results. We show that our neural ensemble systems can successfully distinguish three emotions (SAD, HAPPY, and ANGRY) and separate them from the rest (OTHERS) in a highly-imbalanced scenario. Our best system achieved a 0.77 F1-score and was ranked fourth out of 165 submissions.

2017

pdf abs
MainiwayAI at IJCNLP-2017 Task 2: Ensembles of Deep Architectures for Valence-Arousal Prediction
Yassine Benajiba | Jin Sun | Yong Zhang | Zhiliang Weng | Or Biran
Proceedings of the IJCNLP 2017, Shared Tasks

This paper introduces Mainiway AI Labs submitted system for the IJCNLP 2017 shared task on Dimensional Sentiment Analysis of Chinese Phrases (DSAP), and related experiments. Our approach consists of deep neural networks with various architectures, and our best system is a voted ensemble of networks. We achieve a Mean Absolute Error of 0.64 in valence prediction and 0.68 in arousal prediction on the test set, both placing us as the 5th ranked team in the competition.

pdf abs
The Sentimental Value of Chinese Sub-Character Components
Yassine Benajiba | Or Biran | Zhiliang Weng | Yong Zhang | Jin Sun
Proceedings of the 9th SIGHAN Workshop on Chinese Language Processing

Sub-character components of Chinese characters carry important semantic information, and recent studies have shown that utilizing this information can improve performance on core semantic tasks. In this paper, we hypothesize that in addition to semantic information, sub-character components may also carry emotional information, and that utilizing it should improve performance on sentiment analysis tasks. We conduct a series of experiments on four Chinese sentiment data sets and show that we can significantly improve the performance in various tasks over that of a character-level embeddings baseline. We then focus on qualitatively assessing multiple examples and trying to explain how the sub-character components affect the results in each case.

2012

pdf
Grading the Quality of Medical Evidence
Binod Gyawali | Thamar Solorio | Yassine Benajiba
BioNLP: Proceedings of the 2012 Workshop on Biomedical Natural Language Processing

2010

pdf
Arabic Named Entity Recognition: Using Features Extracted from Noisy Data
Yassine Benajiba | Imed Zitouni | Mona Diab | Paolo Rosso
Proceedings of the ACL 2010 Conference Short Papers

pdf
Enhancing Mention Detection Using Projection via Aligned Corpora
Yassine Benajiba | Imed Zitouni
Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing

pdf
Arabic Mention Detection: Toward Better Unit of Analysis
Yassine Benajiba | Imed Zitouni
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

pdf abs
Arabic Word Segmentation for Better Unit of Analysis
Yassine Benajiba | Imed Zitouni
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

The Arabic language has a very rich morphology where a word is composed of zero or more prefixes, a stem and zero or more suffixes. This makes Arabic data sparse compared to other languages, such as English, and consequently word segmentation becomes very important for many Natural Language Processing tasks that deal with the Arabic language. We present in this paper two segmentation schemes that are morphological segmentation and Arabic TreeBank segmentation and we show their impact on an important natural language processing task that is mention detection. Experiments on Arabic TreeBank corpus show 98.1% accuracy on morphological segmentation and 99.4% on morphological segmentation. We also discuss the importance of segmenting the text; experiments show up to 6F points improvement of the mention detection system performance when morphological segmentation is used instead of not segmenting the text. Obtained results also show up to 3F points improvement is achieved when the appropriate segmentation style is used.