Proceedings of the 3rd Workshop on Figurative Language Processing (FLP)

Debanjan Ghosh, Beata Beigman Klebanov, Smaranda Muresan, Anna Feldman, Soujanya Poria, Tuhin Chakrabarty (Editors)

Abu Dhabi, United Arab Emirates (Hybrid)
Association for Computational Linguistics

TEDB System Description to a Shared Task on Euphemism Detection 2022
Peratham Wiriyathammabhum

In this report, we describe our Transformers for euphemism detection baseline (TEDB) submissions to a shared task on euphemism detection 2022. We cast the task of predicting euphemism as text classification. We considered Transformer-based models, which are the current state-of-the-art methods for text classification. We explored different training schemes, pretrained models, and model architectures. Our best result of 0.816 F1-score (0.818 precision and 0.814 recall) comes from a euphemism-detection-finetuned TweetEval/TimeLMs-pretrained RoBERTa model as a feature extractor frontend with a KimCNN classifier backend trained end-to-end using a cosine annealing scheduler. We observed that models pretrained on sentiment analysis and offensiveness detection correlate with higher F1-scores, while pretraining on other tasks, such as sarcasm detection, produces lower F1-scores. Also, adding more word vector channels does not improve performance in our experiments.
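
As a quick consistency check on the figures quoted above, the following minimal sketch recomputes F1 as the harmonic mean of the reported precision and recall. It assumes the reported score is the plain harmonic mean of those two numbers (if the task used a macro-averaged F1, the two need not agree exactly); the function name `f1_score` is chosen here for illustration only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as quoted in the abstract.
print(round(f1_score(0.818, 0.814), 3))  # 0.816
```

Under that assumption, the three reported numbers are mutually consistent.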

A Prompt Based Approach for Euphemism Detection
Abulimiti Maimaitituoheti | Yang Yong | Fan Xiaochao

Euphemism is an indirect way to express sensitive topics. People can comfortably communicate with each other about sensitive topics or taboos by using euphemisms. The Euphemism Detection Shared Task in the Third Workshop on Figurative Language Processing co-located with EMNLP 2022 provided a euphemism detection dataset that was divided into a training set and a test set. We conducted euphemism detection experiments by prompt tuning pre-trained language models on the dataset. We used RoBERTa as the pre-trained language model and created suitable templates and verbalizers for the euphemism detection task. Our approach achieved the third-best score in the euphemism detection shared task. This paper describes our model participating in the task.

Transfer Learning Parallel Metaphor using Bilingual Embeddings
Maria Berger

Automated metaphor detection in languages other than English is highly restricted as training corpora are comparably rare. One way to overcome this problem is transfer learning. This paper gives an overview of transfer learning techniques applied to NLP. We first introduce types of transfer learning, then we present work focusing on: i) transfer learning with cross-lingual embeddings; ii) transfer learning in machine translation; and iii) transfer learning using pre-trained transformer models. The paper is complemented by first experiments that make use of bilingual embeddings generated from different sources of parallel data: We i) present the preparation of a parallel Gold corpus; ii) examine the embedding spaces to search for metaphoric words cross-lingually; iii) run first experiments in transfer learning German metaphor from English labeled data only. Results show that finding data sources for bilingual embeddings training and the vocabulary covered by these embeddings is critical for learning metaphor cross-lingually.

Ring That Bell: A Corpus and Method for Multimodal Metaphor Detection in Videos
Khalid Alnajjar | Mika Hämäläinen | Shuo Zhang

We present the first openly available multimodal metaphor annotated corpus. The corpus consists of videos including audio and subtitles that have been annotated by experts. Furthermore, we present a method for detecting metaphors in the new dataset based on the textual content of the videos. The method achieves a high F1-score (62%) for metaphorical labels. We also experiment with other modalities and multimodal methods; however, these methods did not out-perform the text-based model. In our error analysis, we do identify that there are cases where video could help in disambiguating metaphors, however, the visual cues are too subtle for our model to capture. The data is available on Zenodo.

Picard understanding Darmok: A Dataset and Model for Metaphor-Rich Translation in a Constructed Language
Peter A. Jansen | Jordan Boyd-Graber

Tamarian, a fictional language introduced in the Star Trek episode Darmok, communicates meaning through utterances of metaphorical references, such as “Darmok and Jalad at Tanagra” instead of “We should work together.” This work assembles a Tamarian-English dictionary of utterances from the original episode and several follow-on novels, and uses this to construct a parallel corpus of 456 English-Tamarian utterances. A machine translation system based on a large language model (T5) is trained using this parallel corpus, and is shown to produce an accuracy of 76% when translating from English to Tamarian on known utterances.

The Secret of Metaphor on Expressing Stronger Emotion
Yucheng Li | Frank Guerin | Chenghua Lin

Metaphors are proven to have stronger emotional impact than literal expressions. Although this conclusion is shown to be promising in benefiting various NLP applications, the reasons behind this phenomenon are not well studied. This paper conducts the first study in exploring how metaphors convey stronger emotion than their literal counterparts. We find that metaphors are generally more specific than literal expressions. The more specific property of metaphor can be one of the reasons for metaphors’ superiority in emotion expression. When we compare metaphors with literal expressions at the same specificity level, the gap in emotion-expressing ability between the two narrows significantly. In addition, we observe that specificity is crucial in literal language as well, as literal language can express stronger emotion when it is made more specific.

Drum Up SUPPORT: Systematic Analysis of Image-Schematic Conceptual Metaphors
Lennart Wachowiak | Dagmar Gromann | Chao Xu

Conceptual metaphors represent a cognitive mechanism to transfer knowledge structures from one onto another domain. Image-schematic conceptual metaphors (ISCMs) specialize in transferring sensorimotor experiences to abstract domains. Natural language is believed to provide evidence of such metaphors. However, approaches to verify this hypothesis largely rely on top-down methods, gathering examples by way of introspection, or on manual corpus analyses. In order to contribute towards a method that is systematic and can be replicated, we propose to bring together existing processing steps in a pipeline to detect ISCMs, exemplified for the image schema SUPPORT in the COVID-19 domain. This pipeline consists of neural metaphor detection, dependency parsing to uncover construction patterns, clustering, and BERT-based frame annotation of dependent constructions to analyse ISCMs.

Effective Cross-Task Transfer Learning for Explainable Natural Language Inference with T5
Irina Bigoulaeva | Rachneet Singh Sachdeva | Harish Tayyar Madabushi | Aline Villavicencio | Iryna Gurevych

We compare sequential fine-tuning with a model for multi-task learning in the context where we are interested in boosting performance on two of the tasks, one of which depends on the other. We test these models on the FigLang2022 shared task which requires participants to predict language inference labels on figurative language along with corresponding textual explanations of the inference predictions. Our results show that while sequential multi-task learning can be tuned to be good at the first of two target tasks, it performs less well on the second and additionally struggles with overfitting. Our findings show that simple sequential fine-tuning of text-to-text models is an extraordinarily powerful method of achieving cross-task knowledge transfer while simultaneously predicting multiple interdependent targets. So much so, that our best model achieved the (tied) highest score on the task.

Detecting Euphemisms with Literal Descriptions and Visual Imagery
Ilker Kesen | Aykut Erdem | Erkut Erdem | Iacer Calixto

This paper describes our two-stage system for the Euphemism Detection shared task hosted by the 3rd Workshop on Figurative Language Processing in conjunction with EMNLP 2022. Euphemisms tone down expressions about sensitive or unpleasant issues like addiction and death. The ambiguous nature of euphemistic words or expressions makes it challenging to detect their actual meaning within a context. In the first stage, we seek to mitigate this ambiguity by incorporating literal descriptions into input text prompts to our baseline model. It turns out that this kind of direct supervision yields remarkable performance improvement. In the second stage, we integrate visual supervision into our system using visual imageries, two sets of images generated by a text-to-image model by taking terms and descriptions as input. Our experiments demonstrate that visual supervision also gives a statistically significant performance boost. Our system achieved the second place with an F1 score of 87.2%, only about 0.9% worse than the best submission.

Distribution-Based Measures of Surprise for Creative Language: Experiments with Humor and Metaphor
Razvan C. Bunescu | Oseremen O. Uduehi

Novelty or surprise is a fundamental attribute of creative output. As such, we postulate that a writer’s creative use of language leads to word choices and, more importantly, corresponding semantic structures that are unexpected for the reader. In this paper we investigate measures of surprise that rely solely on word distributions computed by language models and show empirically that creative language such as humor and metaphor is strongly correlated with surprise. Surprisingly at first, information content is observed to be at least as good a predictor of creative language as any of the surprise measures investigated. However, the best prediction performance is obtained when information and surprise measures are combined, showing that surprise measures capture an aspect of creative language that goes beyond information content.
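
To make the notion of information content mentioned above concrete, here is a minimal sketch that computes it as the negative base-2 log probability of a word, together with the entropy of the distribution (the expected information content). The toy next-word distribution is invented for illustration and does not come from the paper, and the paper's own surprise measures are not reproduced here.

```python
import math


def information_content(prob: float) -> float:
    """Information content in bits: -log2 of the word's probability."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(prob)


def entropy(dist: dict) -> float:
    """Expected information content of a distribution, in bits."""
    return sum(p * information_content(p) for p in dist.values() if p > 0)


# Hypothetical next-word distribution standing in for a language model's output.
toy_dist = {"bank": 0.5, "river": 0.25, "moon": 0.125, "spoon": 0.125}

for word, prob in toy_dist.items():
    print(word, information_content(prob))
print("entropy:", entropy(toy_dist))
```

Less probable words carry more bits, which is why information content alone is already a plausible predictor of unexpected word choices.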

Euphemism Detection by Transformers and Relational Graph Attention Network
Yuting Wang | Yiyi Liu | Ruqing Zhang | Yixing Fan | Jiafeng Guo

Euphemism is a type of figurative language broadly adopted in social media and daily conversations. People use euphemism for politeness or to conceal what they are discussing. Euphemism detection is a challenging task because of its obscure and figurative nature. Even humans may not agree on whether a word expresses euphemism. In this paper, we propose to employ bidirectional encoder representations from transformers (BERT) and a relational graph attention network in order to model the semantic and syntactic relations between the target words and the input sentence. Our best-performing method reaches a Macro-F1 score of 84.0 on the euphemism detection dataset of the third workshop on figurative language processing shared task 2022.

Just-DREAM-about-it: Figurative Language Understanding with DREAM-FLUTE
Yuling Gu | Yao Fu | Valentina Pyatkin | Ian Magnusson | Bhavana Dalvi Mishra | Peter Clark

Figurative language (e.g., “he flew like the wind”) is challenging to understand, as it is hard to tell what implicit information is being conveyed from the surface form alone. We hypothesize that to perform this task well, the reader needs to mentally elaborate the scene being described to identify a sensible meaning of the language. We present DREAM-FLUTE, a figurative language understanding system that does this, first forming a “mental model” of situations described in a premise and hypothesis before making an entailment/contradiction decision and generating an explanation. DREAM-FLUTE uses an existing scene elaboration model, DREAM, for constructing its “mental model.” In the FigLang2022 Shared Task evaluation, DREAM-FLUTE achieved (joint) first place (Acc@60=63.3%), and can perform even better with ensemble techniques, demonstrating the effectiveness of this approach. More generally, this work suggests that adding a reflective component to pretrained language models can improve their performance beyond standard fine-tuning (3.3% improvement in Acc@60).

Bayes at FigLang 2022 Euphemism Detection shared task: Cost-Sensitive Bayesian Fine-tuning and Venn-Abers Predictors for Robust Training under Class Skewed Distributions
Paul Trust | Kadusabe Provia | Kizito Omala

Transformers have achieved state-of-the-art performance across most natural language processing tasks. However, the performance of these models degrades when they are trained on skewed class distributions (class imbalance) because training tends to be biased towards head classes with most of the data points. Classical methods that have been proposed to handle this problem (re-sampling and re-weighting) often suffer from unstable performance, poor applicability and poor calibration. In this paper, we propose to use Bayesian methods and Venn-Abers predictors for well calibrated and robust training against class imbalance. Our proposed approach improves the F1-score of the baseline RoBERTa (A Robustly Optimized Bidirectional Embedding from Transformers Pretraining Approach) model by about 6 points (79.0% against 72.6%) when training with class imbalanced data.

Food for Thought: How can we exploit contextual embeddings in the translation of idiomatic expressions?
Lukas Santing | Ryan Jean-Luc Sijstermans | Giacomo Anerdi | Pedro Jeuris | Marijn ten Thij | Riza Batista-Navarro

Idiomatic expressions (or idioms) are phrases where the meaning of the phrase cannot be determined from the meaning of the individual words in the expression. Translating idioms between languages is therefore a challenging task. Transformer models based on contextual embeddings have advanced the state-of-the-art across many domains in the field of natural language processing. While research using transformers has advanced both idiom detection as well as idiom disambiguation, idiom translation has not seen a similar advancement. In this work, we investigate two approaches to fine-tuning a pretrained Text-to-Text Transfer Transformer (T5) model to perform idiom translation from English to German. The first approach directly translates English idiom-containing sentences to German, while the second is underpinned by idiom paraphrasing, firstly paraphrasing English idiomatic expressions to their simplified English versions before translating them to German. Results of our evaluation show that each of the approaches is able to generate adequate translations.

EUREKA: EUphemism Recognition Enhanced through Knn-based methods and Augmentation
Sedrick Scott Keh | Rohit Bharadwaj | Emmy Liu | Simone Tedeschi | Varun Gangal | Roberto Navigli

We introduce EUREKA, an ensemble-based approach for performing automatic euphemism detection. We (1) identify and correct potentially mislabelled rows in the dataset, (2) curate an expanded corpus called EuphAug, (3) leverage model representations of Potentially Euphemistic Terms (PETs), and (4) explore using representations of semantically close sentences to aid in classification. Using our augmented dataset and kNN-based methods, EUREKA was able to achieve state-of-the-art results on the public leaderboard of the Euphemism Detection Shared Task, ranking first with a macro F1 score of 0.881.

An insulin pump? Identifying figurative links in the construction of the drug lexicon
Antonio Reyes | Rafael Saldivar

One of the remarkable characteristics of the drug lexicon is its elusive nature. In order to communicate information related to drugs or drug trafficking, the community uses several terms that are mostly unknown to regular people, or even to the authorities. For instance, the terms jolly green, joystick, or jive are used to refer to marijuana. The selection of such terms is not necessarily a random or senseless process, but a communicative strategy in which figurative language plays a relevant role. In this study, we describe ongoing research to identify drug-related terms by applying machine learning techniques. To this end, a data set regarding drug trafficking in Spanish was built. This data set was used to train a word embedding model to identify terms used by the community to creatively refer to drugs and related matters. The initial findings show an interesting repository of terms created to consciously veil drug-related contents by using figurative language devices, such as metaphor or metonymy. These findings can provide preliminary evidence to be applied by law agencies in order to address actions against crime, drug transactions on the internet, illicit activities, or human trafficking.

Can Yes-No Question-Answering Models be Useful for Few-Shot Metaphor Detection?
Lena Dankin | Kfir Bar | Nachum Dershowitz

Metaphor detection has been a challenging task in the NLP domain both before and after the emergence of transformer-based language models. The difficulty lies in subtle semantic nuances that are required to detect metaphor and in the scarcity of labeled data. We explore few-shot setups for metaphor detection, and also introduce new question answering data that can enhance classifiers that are trained on a small amount of data. We formulate the classification task as a question-answering one, and train a question-answering model. We perform extensive few-shot experiments on several architectures and report the results of several strong baselines. Thus, the answer to the question posed in the title is a definite “Yes!”

An Exploration of Linguistically-Driven and Transfer Learning Methods for Euphemism Detection
Devika Tiwari | Natalie Parde

Euphemisms are often used to drive rhetoric, but their automated recognition and interpretation are under-explored. We investigate four methods for detecting euphemisms in sentences containing potentially euphemistic terms. The first three linguistically-motivated methods rest on an understanding of (1) euphemism’s role to attenuate the harsh connotations of a taboo topic and (2) euphemism’s metaphorical underpinnings. In contrast, the fourth method follows recent innovations in other tasks and employs transfer learning from a general-domain pre-trained language model. While the latter method ultimately (and perhaps surprisingly) performed best (F1 = 0.74), we comprehensively evaluate all four methods to derive additional useful insights from the negative results.

Back to the Roots: Predicting the Source Domain of Metaphors using Contrastive Learning
Meghdut Sengupta | Milad Alshomary | Henning Wachsmuth

Metaphors frame a given target domain using concepts from another, usually more concrete, source domain. Previous research in NLP has focused on the identification of metaphors and the interpretation of their meaning. In contrast, this paper studies to what extent the source domain can be predicted computationally from a metaphorical text. Given a dataset with metaphorical texts from a finite set of source domains, we propose a contrastive learning approach that ranks source domains by their likelihood of being referred to in a metaphorical text. In experiments, it achieves reasonable performance even for rare source domains, clearly outperforming a classification baseline.

SBU Figures It Out: Models Explain Figurative Language
Yash Kumar Lal | Mohaddeseh Bastan

Figurative language is ubiquitous in human communication. However, current NLP models are unable to demonstrate a significant understanding of instances of this phenomenon. The EMNLP 2022 shared task on figurative language understanding posed the problem of predicting and explaining the relation between a premise and a hypothesis containing an instance of the use of figurative language. We experiment with different variations of using T5-large for this task and build a model that significantly outperforms the task baseline. Treating it as a new task for T5 and simply finetuning on the data achieves the best score on the defined evaluation. Furthermore, we find that hypothesis-only models are able to achieve most of the performance.

NLP@UIT at FigLang-EMNLP 2022: A Divide-and-Conquer System For Shared Task On Understanding Figurative Language
Khoa Thi-Kim Phan | Duc-Vu Nguyen | Ngan Luu-Thuy Nguyen

This paper describes our submissions to the EMNLP 2022 shared task on Understanding Figurative Language as part of the Figurative Language Workshop (FigLang 2022). Our systems, based on the pre-trained language model T5, are divide-and-conquer models that can address both requirements of the task: 1) classification, and 2) generation. In this paper, we introduce different approaches, each of which employs a processing strategy on the model input. We also emphasize the influence of the types of figurative language on our systems.

Adversarial Perturbations Augmented Language Models for Euphemism Identification
Guneet Kohli | Prabsimran Kaur | Jatin Bedi

Euphemisms are mild words or expressions used instead of harsh or direct words while talking to someone to avoid discussing something unpleasant, embarrassing, or offensive. However, they are often ambiguous, which makes detecting them a challenging task. The Third Workshop on Figurative Language Processing, co-located with EMNLP 2022, organized a shared task on Euphemism Detection to better understand euphemisms. We used an adversarial augmentation technique to construct new data. Two language models, BERT and Longformer, were then trained on this augmented data. To further enhance the overall performance, various combinations of the results obtained using Longformer and BERT were passed through a voting ensemble. We achieved an F1 score of 71.5 using the combination of two adversarial Longformers, two adversarial BERT models, and one non-adversarial BERT model.
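
The voting step described above can be sketched as a simple majority vote over per-model label predictions. This is a minimal illustration assuming hard (label-level) voting, which the abstract does not specify; the function name `majority_vote` and the example predictions are hypothetical and are not taken from the paper.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists into one list by majority vote.

    `predictions` holds one list of labels per model; all lists must
    cover the same examples in the same order.
    """
    n_examples = len(predictions[0])
    if any(len(model_preds) != n_examples for model_preds in predictions):
        raise ValueError("all models must predict the same number of examples")
    # zip(*predictions) groups the labels that all models gave to one example.
    return [Counter(labels).most_common(1)[0][0] for labels in zip(*predictions)]


# Hypothetical outputs of five models on three sentences (1 = euphemistic).
model_outputs = [
    [1, 0, 1],
    [1, 0, 0],
    [0, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
]
print(majority_vote(model_outputs))  # [1, 0, 1]
```

With an odd number of models and binary labels, as in the five-model combination reported above, a majority always exists, so no tie-breaking rule is needed.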

FigurativeQA: A Test Benchmark for Figurativeness Comprehension for Question Answering
Geetanjali Rakshit | Jeffrey Flanigan

Figurative language is widespread in human language (Lakoff and Johnson, 2008) posing potential challenges in NLP applications. In this paper, we investigate the effect of figurative language on the task of question answering (QA). We construct FigQA, a test set of 400 yes-no questions with figurative and non-figurative contexts, extracted from product reviews and restaurant reviews. We demonstrate that a state-of-the-art RoBERTa QA model has considerably lower performance in question answering when the contexts are figurative rather than literal, indicating a gap in current models. We propose a general method for improving the performance of QA models by converting the figurative contexts into non-figurative by prompting GPT-3, and demonstrate its effectiveness. Our results indicate a need for building QA models infused with figurative language understanding capabilities.

Exploring Euphemism Detection in Few-Shot and Zero-Shot Settings
Sedrick Scott Keh

This work builds upon the Euphemism Detection Shared Task proposed in the EMNLP 2022 FigLang Workshop, and extends it to few-shot and zero-shot settings. We demonstrate a few-shot and zero-shot formulation using the dataset from the shared task, and we conduct experiments in these settings using RoBERTa and GPT-3. Our results show that language models are able to classify euphemistic terms relatively well even on new terms unseen during training, indicating that they are able to capture higher-level concepts related to euphemisms.

On the Cusp of Comprehensibility: Can Language Models Distinguish Between Metaphors and Nonsense?
Bernadeta Griciūtė | Marc Tanti | Lucia Donatelli

Utterly creative texts can sometimes be difficult to understand, balancing on the edge of comprehensibility. However, good language skills and common sense allow advanced language users both to interpret creative texts and to reject some linguistic input as nonsense. The goal of this paper is to evaluate whether the current language models are also able to make the distinction between a creative language use and nonsense. To test this, we have computed mean rank and pseudo-log-likelihood score (PLL) of metaphorical and nonsensical sentences, and fine-tuned several pretrained models (BERT, RoBERTa) for binary classification between the two categories. There was a significant difference in the mean ranks and PLL scores of the categories, and the classifier reached around 85.5% accuracy. The results raise further questions on what could have led to such satisfactory performance.
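
For reference, the pseudo-log-likelihood of a sentence under a masked language model is commonly defined as below. This is the standard formulation, given here as an assumption; the abstract does not state which exact variant the paper uses. Here $W = (w_1, \dots, w_{|W|})$ is the token sequence and $W_{\setminus t}$ is the same sequence with token $w_t$ masked.

```latex
\mathrm{PLL}(W) \;=\; \sum_{t=1}^{|W|} \log P_{\mathrm{MLM}}\!\left(w_t \mid W_{\setminus t}\right)
```

Each token is masked in turn and scored given the rest of the sentence, so a higher (less negative) PLL indicates a sentence the model finds more plausible.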

A Report on the FigLang 2022 Shared Task on Understanding Figurative Language
Arkadiy Saakyan | Tuhin Chakrabarty | Debanjan Ghosh | Smaranda Muresan

We present the results of the Shared Task on Understanding Figurative Language that we conducted as a part of the 3rd Workshop on Figurative Language Processing (FigLang 2022) at EMNLP 2022. The shared task is based on the FLUTE dataset (Chakrabarty et al., 2022), which consists of NLI pairs containing figurative language along with free text explanations for each NLI instance. The task challenged participants to build models that are able to not only predict the right label for a figurative NLI instance, but also generate a convincing free-text explanation. The participants were able to significantly improve upon provided baselines in both automatic and human evaluation settings. We further summarize the submitted systems and discuss the evaluation results.

A Report on the Euphemisms Detection Shared Task
Patrick Lee | Anna Feldman | Jing Peng

This paper presents the Shared Task on Euphemism Detection for the Third Workshop on Figurative Language Processing (FigLang 2022) held in conjunction with EMNLP 2022. Participants were invited to investigate the euphemism detection task: given input text, identify whether it contains a euphemism. The input data is a corpus of sentences containing potentially euphemistic terms (PETs) collected from the GloWbE corpus, each human-annotated as containing either a euphemistic or literal usage of a PET. In this paper, we present the results and analyze the common themes, methods and findings of the participating teams.