Proceedings of the 29th International Conference on Computational Linguistics

Nicoletta Calzolari, Chu-Ren Huang, Hansaem Kim, James Pustejovsky, Leo Wanner, Key-Sun Choi, Pum-Mo Ryu, Hsin-Hsi Chen, Lucia Donatelli, Heng Ji, Sadao Kurohashi, Patrizia Paggio, Nianwen Xue, Seokhwan Kim, Younggyun Hahm, Zhong He, Tony Kyungil Lee, Enrico Santus, Francis Bond, Seung-Hoon Na (Editors)

Gyeongju, Republic of Korea
International Committee on Computational Linguistics

Do Language Models Make Human-like Predictions about the Coreferents of Italian Anaphoric Zero Pronouns?
James A. Michaelov | Benjamin K. Bergen

Some languages allow arguments to be omitted in certain contexts. Yet human language comprehenders reliably infer the intended referents of these zero pronouns, in part because they construct expectations about which referents are more likely. We ask whether Neural Language Models also extract the same expectations. We test whether 12 contemporary language models display expectations that reflect human behavior when exposed to sentences with zero pronouns from five behavioral experiments conducted in Italian by Carminati (2005). We find that three models - XGLM 2.9B, 4.5B, and 7.5B - capture the human behavior from all the experiments, with others successfully modeling some of the results. This result suggests that human expectations about coreference can be derived from exposure to language, and also indicates features of language models that allow them to better reflect human behavior.

Language Acquisition through Intention Reading and Pattern Finding
Jens Nevens | Jonas Doumen | Paul Van Eecke | Katrien Beuls

One of AI’s grand challenges consists in the development of autonomous agents with communication systems offering the robustness, flexibility and adaptivity found in human languages. While the processes through which children acquire language are by now relatively well understood, a faithful computational operationalisation of the underlying mechanisms is still lacking. Two main cognitive processes are involved in child language acquisition. First, children need to reconstruct the intended meaning of observed utterances, a process called intention reading. Then, they can gradually abstract away from concrete utterances in a process called pattern finding and acquire productive schemata that generalise over form and meaning. In this paper, we introduce a mechanistic model of the intention reading process and its integration with pattern finding capacities. Concretely, we present an agent-based simulation in which an agent learns a grammar that enables them to ask and answer questions about a scene. This involves the reconstruction of queries that correspond to observed questions based on the answer and scene alone, and the generalization of linguistic schemata based on these reconstructed question-query pairs. The result is a productive grammar which can be used to map between natural language questions and queries without ever having observed the queries.

Stability of Syntactic Dialect Classification over Space and Time
Jonathan Dunn | Sidney Wong

This paper analyses the degree to which dialect classifiers based on syntactic representations remain stable over space and time. While previous work has shown that the combination of grammar induction and geospatial text classification produces robust dialect models, we do not know what influence both changing grammars and changing populations have on dialect models. This paper constructs a test set for 12 dialects of English that spans three years at monthly intervals with a fixed spatial distribution across 1,120 cities. Syntactic representations are formulated within the usage-based Construction Grammar paradigm (CxG). The decay rate of classification performance for each dialect over time allows us to identify regions undergoing syntactic change. And the distribution of classification accuracy within dialect regions allows us to identify the degree to which the grammar of a dialect is internally heterogeneous. The main contribution of this paper is to show that a rigorous evaluation of dialect classification models can be used to find both variation over space and change over time.

Subject Verb Agreement Error Patterns in Meaningless Sentences: Humans vs. BERT
Karim Lasri | Olga Seminck | Alessandro Lenci | Thierry Poibeau

Both humans and neural language models are able to perform subject verb number agreement (SVA). In principle, semantics shouldn’t interfere with this task, which only requires syntactic knowledge. In this work we test whether meaning interferes with this type of agreement in English in syntactic structures of various complexities. To do so, we generate both semantically well-formed and nonsensical items. We compare the performance of BERT-base to that of humans, obtained with a psycholinguistic online crowdsourcing experiment. We find that BERT and humans are both sensitive to our semantic manipulation: They fail more often when presented with nonsensical items, especially when their syntactic structure features an attractor (a noun phrase between the subject and the verb that does not have the same number as the subject). We also find that the effect of meaningfulness on SVA errors is stronger for BERT than for humans, showing higher lexical sensitivity of the former on this task.

Measuring Morphological Fusion Using Partial Information Decomposition
Michaela Socolof | Jacob Louis Hoover | Richard Futrell | Alessandro Sordoni | Timothy J. O’Donnell

Morphological systems across languages vary when it comes to the relation between form and meaning. In some languages, a single meaning feature corresponds to a single morpheme, whereas in other languages, multiple meaning features are bundled together into one morpheme. The two types of languages have been called agglutinative and fusional, respectively, but this distinction does not capture the graded nature of the phenomenon. We provide a mathematically precise way of characterizing morphological systems using partial information decomposition, a framework for decomposing mutual information into three components: unique, redundant, and synergistic information. We show that highly fusional languages are characterized by high levels of synergy.
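The decomposition named in this abstract can be written out for the standard two-source case. This is a sketch under assumed notation: the symbols $X_1$, $X_2$ (two sources), $Y$ (target), and $U_1$, $U_2$, $R$, $S$ (unique, redundant, and synergistic information) do not appear in the abstract, and the paper's own setup and choice of redundancy measure may differ.

```latex
\begin{align}
I(Y; X_1, X_2) &= U_1 + U_2 + R + S \\
I(Y; X_1)      &= U_1 + R \\
I(Y; X_2)      &= U_2 + R
\end{align}
```

Under this reading, the synergy term $S$ is the information about $Y$ that is available only from $X_1$ and $X_2$ jointly, which is the quantity the abstract associates with highly fusional languages.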

Smells like Teen Spirit: An Exploration of Sensorial Style in Literary Genres
Osama Khalid | Padmini Srinivasan

Numerous studies in psychology, neuroscience, and sensorial linguistics have established that sensory perception and language are interconnected. Set in this rich context, we ask whether the use of sensorial language in writing is part of linguistic style. This question is important from the view of stylometrics research, where a rich set of language features has been explored, but with insufficient attention given to features related to sensorial language. With this goal, we explore several angles about sensorial language and style in collections of lyrics, novels, and poetry. We find, for example, that individual use of sensorial language is not a random phenomenon; choice is likely involved. Also, sensorial style is generally stable over time - the shifts are extremely small. Moreover, style can be extracted from just a few hundred sentences that have sensorial terms. We also identify representative and distinctive features within each genre. For example, we observe that 4 of the top 6 representative features in the novels collection involved individuals using olfactory language where we expected them to use non-olfactory language.

Metaphorical Polysemy Detection: Conventional Metaphor Meets Word Sense Disambiguation
Rowan Hall Maudslay | Simone Teufel

Linguists distinguish between novel and conventional metaphor, a distinction which the metaphor detection task in NLP does not take into account. Instead, metaphoricity is formulated as a property of a token in a sentence, regardless of metaphor type. In this paper, we investigate the limitations of treating conventional metaphors in this way, and advocate for an alternative which we name ‘metaphorical polysemy detection’ (MPD). In MPD, only conventional metaphoricity is treated, and it is formulated as a property of word senses in a lexicon. We develop the first MPD model, which learns to identify conventional metaphors in the English WordNet. To train it, we present a novel training procedure that combines metaphor detection with ‘word sense disambiguation’ (WSD). For evaluation, we manually annotate metaphor in two subsets of WordNet. Our model significantly outperforms a strong baseline based on a state-of-the-art metaphor detection model, attaining an ROC-AUC score of .78 (compared to .65) on one of the sets. Additionally, when paired with a WSD model, our approach outperforms a state-of-the-art metaphor detection model at identifying conventional metaphors in text (.659 F1 compared to .626).

Machine Reading, Fast and Slow: When Do Models “Understand” Language?
Sagnik Ray Choudhury | Anna Rogers | Isabelle Augenstein

Two of the most fundamental issues in Natural Language Understanding (NLU) at present are: (a) how it can be established whether deep learning-based models score highly on NLU benchmarks for the “right” reasons; and (b) what those reasons would even be. We investigate the behavior of reading comprehension models with respect to two linguistic “skills”: coreference resolution and comparison. We propose a definition for the reasoning steps expected from a system that would be “reading slowly”, and compare that with the behavior of five models of the BERT family of various sizes, observed through saliency scores and counterfactual explanations. We find that for comparison (but not coreference) the systems based on larger encoders are more likely to rely on the “right” information, but even they struggle with generalization, suggesting that they still learn specific lexical patterns rather than the general principles of comparison.

Hierarchical Attention Network for Explainable Depression Detection on Twitter Aided by Metaphor Concept Mappings
Sooji Han | Rui Mao | Erik Cambria

Automatic depression detection on Twitter can help individuals privately and conveniently understand their mental health status in the early stages before seeing mental health professionals. Most existing black-box-like deep learning methods for depression detection largely focused on improving classification performance. However, explaining model decisions is imperative in health research because decision-making can often be high-stakes and life-and-death. Reliable automatic diagnosis of mental health problems including depression should be supported by credible explanations justifying models’ predictions. In this work, we propose a novel explainable model for depression detection on Twitter. It comprises a novel encoder combining hierarchical attention mechanisms and feed-forward neural networks. To support psycholinguistic studies, our model leverages metaphorical concept mappings as input. Thus, it not only detects depressed individuals, but also identifies features of such users’ tweets and associated metaphor concept mappings.

Multi-view and Cross-view Brain Decoding
Subba Reddy Oota | Jashn Arora | Manish Gupta | Raju S. Bapi

Can we build multi-view decoders that can decode concepts from brain recordings corresponding to any view (picture, sentence, word cloud) of stimuli? Can we build a system that can use brain recordings to automatically describe what a subject is watching using keywords or sentences? How about a system that can automatically extract important keywords from sentences that a subject is reading? Previous brain decoding efforts have focused only on single view analysis and hence cannot help us build such systems. As a first step toward building such systems, inspired by Natural Language Processing literature on multi-lingual and cross-lingual modeling, we propose two novel brain decoding setups: (1) multi-view decoding (MVD) and (2) cross-view decoding (CVD). In MVD, the goal is to build an MV decoder that can take brain recordings for any view as input and predict the concept. In CVD, the goal is to train a model which takes brain recordings for one view as input and decodes a semantic vector representation of another view. Specifically, we study practically useful CVD tasks like image captioning, image tagging, keyword extraction, and sentence formation. Our extensive experiments lead to MVD models with ~0.68 average pairwise accuracy across view pairs, and also CVD models with ~0.8 average pairwise accuracy across tasks. Analysis of the contribution of different brain networks reveals exciting cognitive insights: (1) Models trained on picture or sentence view of stimuli are better MV decoders than a model trained on word cloud view. (2) Our extensive analysis across 9 broad regions, 11 language sub-regions and 16 visual sub-regions of the brain helps us localize, for the first time, the parts of the brain involved in cross-view tasks like image captioning, image tagging, sentence formation and keyword extraction. We make the code publicly available.

Visio-Linguistic Brain Encoding
Subba Reddy Oota | Jashn Arora | Vijay Rowtula | Manish Gupta | Raju S. Bapi

Brain encoding aims at reconstructing fMRI brain activity given a stimulus. There exists a plethora of neural encoding models which study brain encoding for single mode stimuli: visual (pretrained CNNs) or text (pretrained language models). A few recent papers have also obtained separate visual and text representation models and performed late-fusion using simple heuristics. However, previous work has failed to explore co-attentive multi-modal modeling for visual and text reasoning. In this paper, we systematically explore the efficacy of image and multi-modal Transformers for brain encoding. Extensive experiments on two popular datasets, BOLD5000 and Pereira, provide the following insights. (1) We find that VisualBERT, a multi-modal Transformer, significantly outperforms previously proposed single-mode CNNs, image Transformers as well as other previously proposed multi-modal models, thereby establishing a new state of the art. (2) Regions such as LPTG, LMTG, LIFG, and STS, which have dual functionalities for language and vision, have higher correlation with multi-modal models, which reinforces the fact that these models are good at mimicking human brain behavior. (3) The supremacy of visio-linguistic models raises the question of whether the responses elicited in the visual regions are affected implicitly by linguistic processing even when passively viewing images. Future fMRI tasks can verify this computational insight in an appropriate experimental setting. We make our code publicly available.

Gestures Are Used Rationally: Information Theoretic Evidence from Neural Sequential Models
Yang Xu | Yang Cheng | Riya Bhatia

Verbal communication is accompanied by rich non-verbal signals. The usage of gestures, poses, and facial expressions facilitates information transmission in the verbal channel. However, few computational studies have explored the non-verbal channels through a finer theoretical lens. We extract gesture representations from monologue video data and train neural sequential models, in order to study the degree to which non-verbal signals can effectively transmit information. We focus on examining whether gestures demonstrate a pattern of entropy rate constancy (ERC) similar to that found in words, as predicted by Information Theory. Positive results are shown to support the assumption, which leads to the conclusion that speakers indeed use simple gestures to convey information that enhances verbal communication, and that the production of non-verbal information is rationally organized.

Revisiting Statistical Laws of Semantic Shift in Romance Cognates
Yoshifumi Kawasaki | Maëlys Salingre | Marzena Karpinska | Hiroya Takamura | Ryo Nagata

This article revisits statistical relationships across Romance cognates between lexical semantic shift and six intra-linguistic variables, such as frequency and polysemy. Cognates are words that are derived from a common etymon, in this case, a Latin ancestor. Despite their shared etymology, some cognate pairs have experienced semantic shift. The degree of semantic shift is quantified using cosine distance between the cognates’ corresponding word embeddings. In the previous literature, frequency and polysemy have been reported to be correlated with semantic shift; however, the understanding of their effects needs revision because of various methodological defects. In the present study, we perform regression analysis under improved experimental conditions, and demonstrate a genuine negative effect of frequency and positive effect of polysemy on semantic shift. Furthermore, we reveal that morphologically complex etyma are more resistant to semantic shift and that the cognates that have been in use over a longer timespan are prone to greater shift in meaning. These findings add to our understanding of the historical process of semantic change.
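The shift measure described in this abstract, cosine distance between two cognates' embeddings, can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the function name and the toy vectors are hypothetical, and it assumes both embeddings already live in a shared vector space.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


# Toy vectors (hypothetical, not real embeddings).
# Same direction: no shift under this measure.
shift_parallel = cosine_distance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
# Orthogonal directions: distance of 1.
shift_orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
```

Because the measure depends only on direction, scaling either vector leaves the distance unchanged.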

Character Jacobian: Modeling Chinese Character Meanings with Deep Learning Model
Yu-Hsiang Tseng | Shu-Kai Hsieh

Compounding, a prevalent word-formation process, presents an interesting challenge for computational models. Indeed, the relations between compounds and their constituents are often complicated. It is particularly so in Chinese morphology, where each character is almost simultaneously bound and free when treated as a morpheme. To model such a word-formation process, we propose the Notch (NOnlinear Transformation of CHaracter embeddings) model and the character Jacobians. The Notch model first learns the non-linear relations between the constituents and words, and the character Jacobians further describe the character’s role in each word. In a series of experiments, we show that the Notch model predicts the embeddings of the real words from their constituents but helps account for the behavioral data of the pseudowords. Moreover, we also demonstrate that character Jacobians reflect the characters’ meanings. Taken together, the Notch model and character Jacobians may provide a new perspective on studying the word-formation process and morphology with modern deep learning.

COMMA: Modeling Relationship among Motivations, Emotions and Actions in Language-based Human Activities
Yuqiang Xie | Yue Hu | Wei Peng | Guanqun Bi | Luxi Xing

Motivations, emotions, and actions are inter-related essential factors in human activities. While motivations and emotions have long been considered at the core of exploring how people take actions in human activities, there has been relatively little research supporting analyzing the relationship between human mental states and actions. We present the first study that investigates the viability of modeling motivations, emotions, and actions in language-based human activities, named COMMA (Cognitive Framework of Human Activities). Guided by COMMA, we define three natural language processing tasks (emotion understanding, motivation understanding and conditioned action generation), and build a challenging dataset Hail through automatically extracting samples from Story Commonsense. Experimental results on NLP applications prove the effectiveness of modeling the relationship. Furthermore, our models inspired by COMMA can better reveal the essential relationship among motivations, emotions and actions than existing methods.

Exploring Semantic Spaces for Detecting Clustering and Switching in Verbal Fluency
Özge Alacam | Simeon Schüz | Martin Wegrzyn | Johanna Kißler | Sina Zarrieß

In this work, we explore the fitness of various word/concept representations in analyzing an experimental verbal fluency dataset providing human responses to 10 different category enumeration tasks. Based on human annotations of so-called clusters and switches between sub-categories in the verbal fluency sequences, we analyze whether lexical semantic knowledge represented in word embedding spaces (GloVe, fastText, ConceptNet, BERT) is suitable for detecting these conceptual clusters and switches within and across different categories. Our results indicate that ConceptNet embeddings, a distributional semantics method enriched with taxonomical relations, outperform other semantic representations by a large margin. Moreover, category-specific analysis suggests that individual thresholds per category are better suited to the analysis of clustering and switching in a particular embedding sub-space than a one-fits-all cross-category solution. The results point to interesting directions for future work on probing word embedding models on the verbal fluency task.

Neuro-Symbolic Visual Dialog
Adnen Abdessaied | Mihai Bâce | Andreas Bulling

We propose Neuro-Symbolic Visual Dialog (NSVD), the first method to combine deep learning and symbolic program execution for multi-round visually-grounded reasoning. NSVD significantly outperforms existing purely-connectionist methods on two key challenges inherent to visual dialog: long-distance co-reference resolution as well as vanishing question-answering performance. We demonstrate the latter by proposing a more realistic and stricter evaluation scheme in which we use predicted answers for the full dialog history when calculating accuracy. We describe two variants of our model and show that using this new scheme, our best model achieves an accuracy of 99.72% on CLEVR-Dialog, a relative improvement of more than 10% over the state of the art, while only requiring a fraction of training data. Moreover, we demonstrate that our neuro-symbolic models have a higher mean first failure round, are more robust against incomplete dialog histories, and generalise better not only to dialogs that are up to three times longer than those seen during training but also to unseen question types and scenes.

LINGUIST: Language Model Instruction Tuning to Generate Annotated Utterances for Intent Classification and Slot Tagging
Andy Rosenbaum | Saleh Soltan | Wael Hamza | Yannick Versley | Markus Boese

We present LINGUIST, a method for generating annotated data for Intent Classification and Slot Tagging (IC+ST), via fine-tuning AlexaTM 5B, a 5-billion-parameter multilingual sequence-to-sequence (seq2seq) model, on a flexible instruction prompt. In a 10-shot novel intent setting for the SNIPS dataset, LINGUIST surpasses state-of-the-art approaches (Back-Translation and Example Extrapolation) by a wide margin, showing absolute improvement for the target intents of +1.9 points on IC Recall and +2.5 points on ST F1 Score. In the zero-shot cross-lingual setting of the mATIS++ dataset, LINGUIST out-performs a strong baseline of Machine Translation with Slot Alignment by +4.14 points absolute on ST F1 Score across 6 languages, while matching performance on IC. Finally, we verify our results on an internal large-scale multilingual dataset for conversational agent IC+ST and show significant improvements over a baseline which uses Back-Translation, Paraphrasing and Slot Catalog Resampling. To our knowledge, we are the first to demonstrate instruction fine-tuning of a large-scale seq2seq model to control the outputs of multilingual intent- and slot-labeled data generation.

Adaptive Natural Language Generation for Task-oriented Dialogue via Reinforcement Learning
Atsumoto Ohashi | Ryuichiro Higashinaka

When a natural language generation (NLG) component is implemented in a real-world task-oriented dialogue system, it is necessary to generate not only natural utterances as learned on training data but also utterances adapted to the dialogue environment (e.g., noise from environmental sounds) and the user (e.g., users with low levels of understanding ability). Inspired by recent advances in reinforcement learning (RL) for language generation tasks, we propose ANTOR, a method for Adaptive Natural language generation for Task-Oriented dialogue via Reinforcement learning. In ANTOR, a natural language understanding (NLU) module, which corresponds to the user’s understanding of system utterances, is incorporated into the objective function of RL. If the NLG’s intentions are correctly conveyed to the NLU, which understands a system’s utterances, the NLG is given a positive reward. We conducted experiments on the MultiWOZ dataset, and we confirmed that ANTOR could generate adaptive utterances against speech recognition errors and the different vocabulary levels of users.

TAKE: Topic-shift Aware Knowledge sElection for Dialogue Generation
Chenxu Yang | Zheng Lin | Jiangnan Li | Fandong Meng | Weiping Wang | Lanrui Wang | Jie Zhou

Knowledge-grounded dialogue generation consists of two subtasks: knowledge selection and response generation. The knowledge selector generally constructs a query based on the dialogue context and selects the most appropriate knowledge to help response generation. Recent work finds that realizing who (the user or the agent) holds the initiative and utilizing the role-initiative information to instruct the query construction can help select knowledge. The role is assigned according to whether the knowledge connection between two adjacent rounds is smooth. However, under this scheme the user is judged to take the initiative only when there is a strong semantic transition between two rounds, which probably leads to initiative misjudgment. Therefore, it is necessary to seek a more sensitive signal beyond the initiative role for knowledge selection. To address the above problem, we propose a Topic-shift Aware Knowledge sElector (TAKE). Specifically, we first annotate the topic shift and topic inheritance labels in multi-round dialogues with distant supervision. Then, we alleviate the noise problem in pseudo labels through curriculum learning and knowledge distillation. Extensive experiments on WoW show that TAKE performs better than strong baselines.

Dynamic Dialogue Policy for Continual Reinforcement Learning
Christian Geishauser | Carel van Niekerk | Hsien-chin Lin | Nurul Lubis | Michael Heck | Shutong Feng | Milica Gašić

Continual learning is one of the key components of human learning and a necessary requirement of artificial intelligence. As dialogue can potentially span infinitely many topics and tasks, a task-oriented dialogue system must have the capability to continually learn, dynamically adapting to new challenges while preserving the knowledge it already acquired. Despite the importance, continual reinforcement learning of the dialogue policy has remained largely unaddressed. The lack of a framework with training protocols, baseline models and suitable metrics, has so far hindered research in this direction. In this work we fill precisely this gap, enabling research in dialogue policy optimisation to go from static to dynamic learning. We provide a continual learning algorithm, baseline architectures and metrics for assessing continual learning models. Moreover, we propose the dynamic dialogue policy transformer (DDPT), a novel dynamic architecture that can integrate new knowledge seamlessly, is capable of handling large state spaces and obtains significant zero-shot performance when being exposed to unseen domains, without any growth in network parameter size. We validate the strengths of DDPT in simulation with two user simulators as well as with humans.

GRAVL-BERT: Graphical Visual-Linguistic Representations for Multimodal Coreference Resolution
Danfeng Guo | Arpit Gupta | Sanchit Agarwal | Jiun-Yu Kao | Shuyang Gao | Arijit Biswas | Chien-Wei Lin | Tagyoung Chung | Mohit Bansal

Learning from multimodal data has become a popular research topic in recent years. Multimodal coreference resolution (MCR) is an important task in this area. MCR involves resolving the references across different modalities, e.g., text and images, which is a crucial capability for building next-generation conversational agents. MCR is challenging as it requires encoding information from different modalities and modeling associations between them. Although significant progress has been made for visual-linguistic tasks such as visual grounding, most current work involves single-turn utterances and focuses on simple coreference resolution. In this work, we propose an MCR model that resolves coreferences made in multi-turn dialogues with scene images. We present GRAVL-BERT, a unified MCR framework which combines visual relationships between objects, background scenes, dialogue, and metadata by integrating Graph Neural Networks with VL-BERT. We present results on the SIMMC 2.0 multimodal conversational dataset, achieving rank 1 on the DSTC-10 SIMMC 2.0 MCR challenge with an F1 score of 0.783. Our code is available at

Learning to Improve Persona Consistency in Multi-party Dialogue Generation via Text Knowledge Enhancement
Dongshi Ju | Shi Feng | Pengcheng Lv | Daling Wang | Yifei Zhang

In an open-domain dialogue system, a consistent persona is a key factor in generating realistic and coherent dialogues. Existing methods suffer from persona tags that are not comprehensive and that have unique and obscure meanings for describing a person’s personality. Besides, the addressee information, which is closely related to the expression of personality in multi-party dialogues, has been neglected. In this paper, we construct a multi-party personalized dialogue dataset and propose a graph convolution network model (PersonaTKG) with an addressee selection mechanism that integrates personas, dialogue utterances, and external text knowledge in a unified graph. Extensive experiments have shown that PersonaTKG outperforms the baselines by large margins and effectively improves persona consistency in the generated responses.

Improving Top-K Decoding for Non-Autoregressive Semantic Parsing via Intent Conditioning
Geunseob Oh | Rahul Goel | Chris Hidey | Shachi Paul | Aditya Gupta | Pararth Shah | Rushin Shah

Semantic parsing (SP) is a core component of modern virtual assistants like Google Assistant and Amazon Alexa. While sequence-to-sequence based auto-regressive (AR) approaches are common for conversational SP, recent studies employ non-autoregressive (NAR) decoders and reduce inference latency while maintaining competitive parsing quality. However, a major drawback of NAR decoders is the difficulty of generating top-k (i.e., k-best) outputs with approaches such as beam search. To address this challenge, we propose a novel NAR semantic parser that introduces intent conditioning on the decoder. Inspired by the traditional intent and slot tagging parsers, we decouple the top-level intent prediction from the rest of a parse. As the top-level intent largely governs the syntax and semantics of a parse, the intent conditioning allows the model to better control beam search and improves the quality and diversity of top-k outputs. We introduce a hybrid teacher-forcing approach to avoid training and inference mismatch. We evaluate the proposed NAR on conversational SP datasets, TOP & TOPv2. Like the existing NAR models, we maintain the O(1) decoding time complexity while generating more diverse outputs and improving top-3 exact match (EM) by 2.4 points. In comparison with AR models, our model speeds up beam search inference by 6.7 times on CPU with competitive top-k EM.

Autoregressive Entity Generation for End-to-End Task-Oriented Dialog
Guanhuan Huang | Xiaojun Quan | Qifan Wang

Task-oriented dialog (TOD) systems are often required to interact with an external knowledge base (KB) to retrieve the necessary entity information (e.g., about restaurants) to support their response generation. Most current end-to-end TOD systems either retrieve the KB information explicitly or embed it into model parameters for implicit access. While the first approach demands scanning the KB at each turn of response generation, which is inefficient when the KB scales up, the second approach shows higher flexibility and efficiency. In either approach, the response should contain attributes of the same entity; however, the systems may generate a response with conflicting entities. To address this, we propose to generate the entity autoregressively before leveraging it to guide the response generation in an end-to-end system. To ensure entity consistency, we impose a trie constraint on the decoding of an entity. We also introduce a logit concatenation strategy to facilitate gradient backpropagation for end-to-end training. Experiments on MultiWOZ 2.1 single and CAMREST show that our system can generate higher-quality and more entity-consistent responses in an end-to-end manner.
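The trie constraint mentioned in this abstract can be illustrated with a small sketch of trie-constrained greedy decoding. It shows the general technique only and is not the authors' implementation: the entity names, the `END` marker, the helper names, and the toy scoring function (standing in for a model's next-token scores) are all hypothetical.

```python
END = "<end>"  # hypothetical end-of-entity marker


def build_trie(sequences):
    """Build a nested-dict trie from token sequences, one per KB entity."""
    root = {}
    for seq in sequences:
        node = root
        for tok in seq:
            node = node.setdefault(tok, {})
        node[END] = {}  # mark that a complete entity ends here
    return root


def allowed_next_tokens(trie, prefix):
    """Return the tokens that may follow `prefix` (empty if prefix is invalid)."""
    node = trie
    for tok in prefix:
        if tok not in node:
            return set()
        node = node[tok]
    return set(node.keys())


def constrained_greedy_decode(trie, score_fn):
    """Greedily pick the best-scoring token among those the trie allows."""
    prefix = []
    while True:
        candidates = allowed_next_tokens(trie, prefix)
        # sorted() makes tie-breaking deterministic
        best = max(sorted(candidates), key=lambda tok: score_fn(prefix, tok))
        if best == END:
            return prefix
        prefix.append(best)


# Toy knowledge base of entity names (hypothetical).
entities = [["golden", "house"], ["golden", "wok"], ["curry", "garden"]]
trie = build_trie(entities)

# Toy context-independent scores standing in for model logits.
toy_scores = {"golden": 0.9, "curry": 0.1, "house": 0.2,
              "wok": 0.7, "garden": 0.9, END: 0.5}
decoded = constrained_greedy_decode(trie, lambda prefix, tok: toy_scores[tok])
```

In this toy run, "garden" has the highest raw score at the second step, but the trie excludes it after "golden", so the decoder can only emit a token sequence that corresponds to a single existing entity.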

Continual Few-shot Intent Detection
Guodun Li | Yuchen Zhai | Qianglong Chen | Xing Gao | Ji Zhang | Yin Zhang

Intent detection is at the core of task-oriented dialogue systems. Existing intent detection systems are typically trained with a large amount of data over a predefined set of intent classes. However, newly emerged intents in multiple domains are commonplace in the real world, and it is time-consuming and impractical for dialogue systems to re-collect enough annotated data and re-train the model. These limitations call for an intent detection system that can continually recognize new intents with very few labeled examples. In this work, we study the Continual Few-shot Intent Detection (CFID) problem and construct a benchmark consisting of nine tasks with multiple domains and imbalanced classes. To address the key challenges of (a) catastrophic forgetting during continuous learning and (b) negative knowledge transfer across tasks, we propose the Prefix-guided Lightweight Encoder (PLE) with three auxiliary strategies, namely Pseudo Samples Replay (PSR), Teacher Knowledge Transfer (TKT) and Dynamic Weighting Replay (DWR). Extensive experiments demonstrate the effectiveness and efficiency of our method in preventing catastrophic forgetting and encouraging positive knowledge transfer across tasks.

“Mama Always Had a Way of Explaining Things So I Could Understand”: A Dialogue Corpus for Learning to Construct Explanations
Henning Wachsmuth | Milad Alshomary

As AI is more and more pervasive in everyday life, humans have an increasing demand to understand its behavior and decisions. Most research on explainable AI builds on the premise that there is one ideal explanation to be found. In fact, however, everyday explanations are co-constructed in a dialogue between the person explaining (the explainer) and the specific person being explained to (the explainee). In this paper, we introduce a first corpus of dialogical explanations to enable NLP research on how humans explain as well as on how AI can learn to imitate this process. The corpus consists of 65 transcribed English dialogues from the Wired video series 5 Levels, explaining 13 topics to five explainees of different proficiency. All 1550 dialogue turns have been manually labeled by five independent professionals for the topic discussed as well as for the dialogue act and the explanation move performed. We analyze linguistic patterns of explainers and explainees, and we explore differences across proficiency levels. BERT-based baseline results indicate that sequence information helps predict topics, acts, and moves effectively.

Schema Encoding for Transferable Dialogue State Tracking
Hyunmin Jeon | Gary Geunbae Lee

Dialogue state tracking (DST) is an essential sub-task for task-oriented dialogue systems. Recent work has focused on deep neural models for DST. However, the neural models require a large dataset for training. Furthermore, applying them to another domain requires a new dataset because the neural models are generally trained to imitate the given dataset. In this paper, we propose Schema Encoding for Transferable Dialogue State Tracking (SET-DST), which is a neural DST method for effective transfer to new domains. Transferable DST could assist the development of dialogue systems even with little data in target domains. We use a schema encoder not just to imitate the dataset but to comprehend the schema of the dataset. We aim to transfer the model to new domains by encoding new schemas and using them for DST in multi-domain settings. As a result, SET-DST improved the joint accuracy by 1.46 points on MultiWOZ 2.1.

A Personalized Dialogue Generator with Implicit User Persona Detection
Itsugun Cho | Dongyang Wang | Ryota Takahashi | Hiroaki Saito

Current works in the generation of personalized dialogue primarily contribute to the agent presenting a consistent personality and driving a more informative response. However, we found that the generated responses from most previous models tend to be self-centered, with little care for the user in the dialogue. Moreover, we consider that human-like conversation is essentially built based on inferring information about the persona of the other party. Motivated by this, we propose a novel personalized dialogue generator by detecting an implicit user persona. Because it is hard to collect a large number of detailed personas for each user, we attempted to model the user’s potential persona and its representation from dialogue history, with no external knowledge. The perception and fader variables were conceived using conditional variational inference. The two latent variables simulate the process of people being aware of each other’s persona and producing a corresponding expression in conversation. Finally, posterior-discriminated regularization was presented to enhance the training procedure. Empirical studies demonstrate that, compared to state-of-the-art methods, our approach is more concerned with the user’s persona and achieves a considerable boost across both automatic metrics and human evaluations.

Incorporating Causal Analysis into Diversified and Logical Response Generation
Jiayi Liu | Wei Wei | Zhixuan Chu | Xing Gao | Ji Zhang | Tan Yan | Yulin Kang

Although the Conditional Variational Auto-Encoder (CVAE) model can generate more diversified responses than the traditional Seq2Seq model, the responses often have low relevance to the input words or are illogical with respect to the question. A causal analysis is carried out to study the underlying reasons, and a methodology of searching for the mediators and mitigating the confounding bias in dialogues is provided. Specifically, we propose to predict the mediators to preserve relevant information and auto-regressively incorporate the mediators into the generation process. Besides, a dynamic topic graph guided conditional variational auto-encoder (TGG-CVAE) model is utilized to complement the semantic space and reduce the confounding bias in responses. Extensive experiments demonstrate that the proposed model is able to generate both relevant and informative responses, and outperforms the state-of-the-art in terms of automatic metrics and human evaluations.

Reciprocal Learning of Knowledge Retriever and Response Ranker for Knowledge-Grounded Conversations
Jiazhan Feng | Chongyang Tao | Zhen Li | Chang Liu | Tao Shen | Dongyan Zhao

Grounding dialogue agents with knowledge documents has sparked increased attention in both academia and industry. Recently, a growing body of work is trying to build retrieval-based knowledge-grounded dialogue systems. While promising, these approaches require collecting pairs of dialogue context and the corresponding ground-truth knowledge sentences that contain the information regarding the dialogue context. Unfortunately, hand-labeling data to that end is time-consuming, and many datasets and applications lack such knowledge annotations. In this paper, we propose a reciprocal learning approach to jointly optimize a knowledge retriever and a response ranker for knowledge-grounded response retrieval without ground-truth knowledge labels. Specifically, the knowledge retriever uses the feedback from the response ranker as pseudo supervised signals of knowledge retrieval for updating its parameters, while the response ranker also receives the top-ranked knowledge sentences from knowledge retriever for optimization. Evaluation results on two public benchmarks show that our model can significantly outperform previous state-of-the-art methods.

CR-GIS: Improving Conversational Recommendation via Goal-aware Interest Sequence Modeling
Jinfeng Zhou | Bo Wang | Zhitong Yang | Dongming Zhao | Kun Huang | Ruifang He | Yuexian Hou

Conversational recommendation systems (CRS) aim to determine a goal item by sequentially tracking users’ interests through multi-turn conversation. In CRS, implicit patterns of the user interest sequence guide the smooth transition of dialog utterances to the goal item. However, with the convenient explicit knowledge of a knowledge graph (KG), existing KG-based CRS methods over-rely on the explicit separate KG links to model the user interests but ignore the rich goal-aware implicit interest sequence patterns in a dialog. In addition, the interest sequence is also not fully used to generate smoothly transitioning utterances. We propose CR-GIS with a parallel star framework. First, an interest-level star graph is designed to model the goal-aware implicit user interest sequence. Second, a hierarchical Star Transformer is designed to guide the multi-turn utterance generation with the interest-level star graph. Extensive experiments verify the effectiveness of CR-GIS in achieving more accurate recommended items with more fluent and coherent dialog utterances.

GRASP: Guiding Model with RelAtional Semantics Using Prompt for Dialogue Relation Extraction
Junyoung Son | Jinsung Kim | Jungwoo Lim | Heuiseok Lim

The dialogue-based relation extraction (DialogRE) task aims to predict the relations between argument pairs that appear in dialogue. Most previous studies utilize fine-tuning pre-trained language models (PLMs) only with extensive features to supplement the low information density of the dialogue by multiple speakers. To effectively exploit inherent knowledge of PLMs without extra layers and consider scattered semantic cues on the relation between the arguments, we propose a Guiding model with RelAtional Semantics using Prompt (GRASP). We adopt a prompt-based fine-tuning approach and capture relational semantic clues of a given dialogue with 1) an argument-aware prompt marker strategy and 2) the relational clue detection task. In the experiments, GRASP achieves state-of-the-art performance in terms of both F1 and F1c scores on a DialogRE dataset even though our method only leverages PLMs without adding any extra layers.

PEPDS: A Polite and Empathetic Persuasive Dialogue System for Charity Donation
Kshitij Mishra | Azlaan Mustafa Samad | Palak Totala | Asif Ekbal

Persuasive conversations for a social cause often require influencing another person’s attitude or intention, which may fail even with compelling arguments. The use of emotions and different types of polite tones as needed with facts may enhance the persuasiveness of a message. To incorporate these two aspects, we propose a polite, empathetic persuasive dialogue system (PEPDS). First, in a Reinforcement Learning setting, a Maximum Likelihood Estimation loss based model is fine-tuned by designing an efficient reward function consisting of five different sub-rewards, viz. Persuasion, Emotion, Politeness-Strategy Consistency, Dialogue-Coherence and Non-repetitiveness. Then, to generate empathetic utterances for non-empathetic ones, an Empathetic transfer model is built upon the RL fine-tuned model. Due to the unavailability of an appropriate dataset, by utilizing the PERSUASIONFORGOOD dataset, we create two datasets, viz. EPP4G and ETP4G. EPP4G is used to train three transformer-based classification models for persuasiveness, emotion and politeness strategy to provide the respective reward feedback. The ETP4G dataset is used to train an empathetic transfer model. Our experimental results demonstrate that PEPDS increases the rate of persuasive responses with emotion and politeness acknowledgement compared to the current state-of-the-art dialogue models, while also enhancing the dialogue’s engagement and maintaining the linguistic quality.

DialAug: Mixing up Dialogue Contexts in Contrastive Learning for Robust Conversational Modeling
Lahari Poddar | Peiyao Wang | Julia Reinspach

Retrieval-based conversational systems learn to rank response candidates for a given dialogue context by computing the similarity between their vector representations. However, training on a single textual form of the multi-turn context limits the ability of a model to learn representations that generalize to natural perturbations seen during inference. In this paper we propose a framework that incorporates augmented versions of a dialogue context into the learning objective. We utilize contrastive learning as an auxiliary objective to learn robust dialogue context representations that are invariant to perturbations injected through the augmentation method. We experiment with four benchmark dialogue datasets and demonstrate that our framework combines well with existing augmentation methods and can significantly improve over baseline BERT-based ranking architectures. Furthermore, we propose a novel data augmentation method, ConMix, that adds token level perturbations through stochastic mixing of tokens from other contexts in the batch. We show that our proposed augmentation method outperforms previous data augmentation approaches, and provides dialogue representations that are more robust to common perturbations seen during inference.

A Closer Look at Few-Shot Out-of-Distribution Intent Detection
Li-Ming Zhan | Haowen Liang | Lu Fan | Xiao-Ming Wu | Albert Y.S. Lam

We consider few-shot out-of-distribution (OOD) intent detection, a practical and important problem for the development of task-oriented dialogue systems. Despite its importance, this problem is seldom studied in the literature, let alone examined in a systematic way. In this work, we take a closer look at this problem and identify key issues for research. In our pilot study, we reveal the reason why existing OOD intent detection methods are not adequate in dealing with this problem. Based on the observation, we propose a promising approach to tackle this problem based on latent representation generation and self-supervision. Comprehensive experiments on three real-world intent detection benchmark datasets demonstrate the high effectiveness of our proposed approach and its great potential in improving state-of-the-art methods for few-shot OOD intent detection.

CGIM: A Cycle Guided Interactive Learning Model for Consistency Identification in Task-oriented Dialogue
Libo Qin | Qiguang Chen | Tianbao Xie | Qian Liu | Shijue Huang | Wanxiang Che | Zhou Yu

Consistency identification in task-oriented dialog (CI-ToD) usually consists of three subtasks, aiming to identify inconsistency between the current system response and the current user response, the dialog history, and the corresponding knowledge base. This work aims to solve the CI-ToD task by introducing an explicit interaction paradigm, the Cycle Guided Interactive learning Model (CGIM), which exchanges information explicitly among all three tasks. Specifically, CGIM relies on two core insights, referred to as the guided multi-head attention module and the cycle interactive mechanism, that work together. On the one hand, every pair of tasks is linked with the guided multi-head attention module, aiming to explicitly model the interaction across two related tasks. On the other hand, we further introduce a cycle interactive mechanism that helps the model exchange information among the three correlated sub-tasks in a cyclic manner. Experimental results on the CI-ToD benchmark show that our model achieves state-of-the-art performance, pushing the overall score to 56.3% (a 5.0 percentage-point absolute improvement). In addition, we find that CGIM is robust to the initial task flow order.

CorefDiffs: Co-referential and Differential Knowledge Flow in Document Grounded Conversations
Lin Xu | Qixian Zhou | Jinlan Fu | Min-Yen Kan | See-Kiong Ng

Knowledge-grounded dialog systems need to incorporate smooth transitions among knowledge selected for generating responses, to ensure that dialog flows naturally. For document-grounded dialog systems, the inter- and intra-document knowledge relations can be used to model such conversational flows. We develop a novel Multi-Document Co-Referential Graph (Coref-MDG) to effectively capture the inter-document relationships based on commonsense and similarity and the intra-document co-referential structures of knowledge segments within the grounding documents. We propose CorefDiffs, a Co-referential and Differential flow management method, to linearize the static Coref-MDG into conversational sequence logic. CorefDiffs performs knowledge selection by accounting for contextual graph structures and the knowledge difference sequences. CorefDiffs significantly outperforms the state-of-the-art by 9.5%, 7.4% and 8.2% on three public benchmarks. This demonstrates that the effective modeling of co-reference and knowledge difference for dialog flows is critical for transitions in document-grounded conversation.

SelF-Eval: Self-supervised Fine-grained Dialogue Evaluation
Longxuan Ma | Ziyu Zhuang | Weinan Zhang | Mingda Li | Ting Liu

This paper introduces a novel Self-supervised Fine-grained Dialogue Evaluation framework (SelF-Eval). The core idea is to model the correlation between turn quality and the entire dialogue quality. We first propose a novel automatic data construction method that can automatically assign fine-grained scores for arbitrary dialogue data. Then we train SelF-Eval with a multi-level contrastive learning schema which helps to distinguish different score levels. Experimental results on multiple benchmarks show that SelF-Eval is highly consistent with human evaluations and better than the state-of-the-art models. We give a detailed analysis of the experiments in this paper. Our code is available on GitHub.

Open-Domain Dialog Evaluation Using Follow-Ups Likelihood
Maxime De Bruyn | Ehsan Lotfi | Jeska Buhmann | Walter Daelemans

Automatic evaluation of open-domain dialogs remains an unsolved problem. Existing methods do not correlate strongly with human annotations. In this paper, we present a new automated evaluation method based on the use of follow-ups. We measure the probability that a language model will continue the conversation with a fixed set of follow-ups (e.g., "not really relevant here", "what are you trying to say?"). When compared against twelve existing methods, our new evaluation achieves the highest correlation with human evaluations.
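
The follow-up idea can be sketched in a few lines, under loud assumptions: `followup_score` is a hypothetical helper, `log_prob` is a placeholder for a real language model's log-likelihood of a continuation given a context, and averaging over the follow-ups is one plausible aggregation rather than the one the paper necessarily uses. The toy scorer below exists only to make the sketch runnable.

```python
from typing import Callable, Sequence

# Follow-ups quoted in the abstract; a real setup would use a larger fixed set.
FOLLOW_UPS = ["not really relevant here", "what are you trying to say?"]


def followup_score(
    context: str,
    followups: Sequence[str],
    log_prob: Callable[[str, str], float],
) -> float:
    """Mean log-likelihood of the follow-ups given the dialog context.

    With negative follow-ups like the ones above, a higher value suggests
    the language model finds complaints more plausible, i.e. a worse dialog.
    """
    return sum(log_prob(context, f) for f in followups) / len(followups)


def toy_log_prob(context: str, followup: str) -> float:
    """Hypothetical stand-in for a language model (not a real scorer)."""
    return -1.0 if "banana" in context else -5.0


coherent = "A: Any film tips? B: Try a classic comedy tonight."
off_topic = "A: Any film tips? B: banana banana banana."
print(followup_score(coherent, FOLLOW_UPS, toy_log_prob))
print(followup_score(off_topic, FOLLOW_UPS, toy_log_prob))
```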

Joint Goal Segmentation and Goal Success Prediction on Multi-Domain Conversations
Meiguo Wang | Benjamin Yao | Bin Guo | Xiaohu Liu | Yu Zhang | Tuan-Hung Pham | Chenlei Guo

To evaluate the performance of a multi-domain goal-oriented Dialogue System (DS), it is important to understand what the users’ goals are for the conversations and whether those goals are successfully achieved. The success rate of goals directly correlates with user satisfaction and perceived usefulness of the DS. In this paper, we propose a novel automatic dialogue evaluation framework that jointly performs two tasks: goal segmentation and goal success prediction. We extend the RoBERTa-IQ model (Gupta et al., 2021) by adding multi-task learning heads for goal segmentation and success prediction. Using an annotated dataset from a commercial DS, we demonstrate that our proposed model reaches an accuracy that is on par with single-pass human annotation when compared against a three-pass gold annotation benchmark.

Slot Dependency Modeling for Zero-Shot Cross-Domain Dialogue State Tracking
Qingyue Wang | Yanan Cao | Piji Li | Yanhe Fu | Zheng Lin | Li Guo

Section-Aware Commonsense Knowledge-Grounded Dialogue Generation with Pre-trained Language Model
Sixing Wu | Ying Li | Ping Xue | Dawei Zhang | Zhonghai Wu

In knowledge-grounded dialogue generation, pre-trained language models (PLMs) can be expected to deepen the fusing of dialogue context and knowledge because of their superior ability of semantic understanding. Unlike plain-text knowledge, structural commonsense knowledge is difficult to leverage with PLMs because most PLMs can only operate on plain text. Thus, linearizing commonsense knowledge facts into plain text is a necessary step. However, a dialogue is always aligned to many retrieved fact candidates; as a result, the linearized text is always lengthy and significantly increases the burden of using PLMs. To address this issue, we propose a novel two-stage framework SAKDP. In the first pre-screening stage, we use a ranking network PriorRanking to estimate the relevance of a retrieved knowledge fact. Thus, facts can be clustered into three sections of different priorities. As priority decreases, the relevance decreases, and the number of included facts increases. In the next dialogue generation stage, we use section-aware strategies to encode the linearized knowledge. The powerful but expensive PLM is only used for a few facts in the higher priority sections, reaching the performance-efficiency balance. Both the automatic and human evaluation demonstrate the superior performance of this work.

Using Multi-Encoder Fusion Strategies to Improve Personalized Response Selection
Souvik Das | Sougata Saha | Rohini K. Srihari

Personalized response selection systems are generally grounded on persona. However, a correlation exists between persona and empathy, which these systems do not explore well. Also, when a contradictory or off-topic response is selected, faithfulness to the conversation context plunges. This paper attempts to address these issues by proposing a suite of fusion strategies that capture the interaction between persona, emotion, and entailment information of the utterances. Ablation studies on the Persona-Chat dataset show that incorporating emotion and entailment improves the accuracy of response selection. We combine our fusion strategies and concept-flow encoding to train a BERT-based model which outperforms the previous methods by margins larger than 2.3% on original personas and 1.9% on revised personas in terms of hits@1 (top-1 accuracy), achieving a new state-of-the-art performance on the Persona-Chat dataset.

A Multi-Dimensional, Cross-Domain and Hierarchy-Aware Neural Architecture for ISO-Standard Dialogue Act Tagging
Stefano Mezza | Wayne Wobcke | Alan Blair

Dialogue Act tagging with the ISO 24617-2 standard is a difficult task that involves multi-label text classification across a diverse set of labels covering semantic, syntactic and pragmatic aspects of dialogue. The lack of an adequately sized training set annotated with this standard is a major problem when using the standard in practice. In this work we propose a neural architecture to increase classification accuracy, especially on low-frequency fine-grained tags. Our model takes advantage of the hierarchical structure of the ISO taxonomy and utilises syntactic information in the form of Part-Of-Speech and dependency tags, in addition to contextual information from previous turns. We train our architecture on an aggregated corpus of conversations from different domains, which provides a variety of dialogue interactions and linguistic registers. Our approach achieves state-of-the-art tagging results on the DialogBank benchmark data set, providing empirical evidence that this architecture can successfully generalise to different domains.

SPACE-2: Tree-Structured Semi-Supervised Contrastive Pre-training for Task-Oriented Dialog Understanding
Wanwei He | Yinpei Dai | Binyuan Hui | Min Yang | Zheng Cao | Jianbo Dong | Fei Huang | Luo Si | Yongbin Li

Pre-training methods with contrastive learning objectives have shown remarkable success in dialog understanding tasks. However, current contrastive learning solely considers the self-augmented dialog samples as positive samples and treats all other dialog samples as negative ones, which enforces dissimilar representations even for dialogs that are semantically related. In this paper, we propose SPACE-2, a tree-structured pre-trained conversation model, which learns dialog representations from limited labeled dialogs and large-scale unlabeled dialog corpora via semi-supervised contrastive pre-training. Concretely, we first define a general semantic tree structure (STS) to unify the inconsistent annotation schema across different dialog datasets, so that the rich structural information stored in all labeled data can be exploited. Then we propose a novel multi-view score function to increase the relevance of all possible dialogs that share similar STSs and only push away other completely different dialogs during supervised contrastive pre-training. To fully exploit unlabeled dialogs, a basic self-supervised contrastive loss is also added to refine the learned representations. Experiments show that our method can achieve new state-of-the-art results on the DialoGLUE benchmark consisting of seven datasets and four popular dialog understanding tasks.

ET5: A Novel End-to-end Framework for Conversational Machine Reading Comprehension
Xiao Zhang | Heyan Huang | Zewen Chi | Xian-Ling Mao

Conversational machine reading comprehension (CMRC) aims to assist computers to understand a natural language text and thereafter engage in a multi-turn conversation to answer questions related to the text. Existing methods typically require three steps: (1) decision making based on entailment reasoning; (2) span extraction if required by the above decision; (3) question rephrasing based on the extracted span. However, for nearly all these methods, the span extraction and question rephrasing steps cannot fully exploit the fine-grained entailment reasoning information in the decision-making step because of their relative independence, which will further enlarge the information gap between decision making and question phrasing. Thus, to tackle this problem, we propose a novel end-to-end framework for conversational machine reading comprehension based on a shared parameter mechanism, called entailment reasoning T5 (ET5). Despite the lightweight design of our proposed framework, experimental results show that the proposed ET5 achieves new state-of-the-art results on the ShARC leaderboard with a BLEU-4 score of 55.2. Our model and code are publicly available.

CoHS-CQG: Context and History Selection for Conversational Question Generation
Xuan Long Do | Bowei Zou | Liangming Pan | Nancy F. Chen | Shafiq Joty | Ai Ti Aw

Conversational question generation (CQG) is a vital task for machines that assist humans through conversation, for example in interactive reading comprehension. Compared to traditional single-turn question generation (SQG), CQG is more challenging in the sense that the generated question is required not only to be meaningful, but also to align with the provided conversation. Previous studies mainly focus on how to model the flow and alignment of the conversation, but do not thoroughly study which parts of the context and history are necessary for the model. We believe that shortening the context and history is crucial as it can help the model to optimise more on the conversational alignment property. To this end, we propose CoHS-CQG, a two-stage CQG framework, which adopts a novel CoHS module to shorten the context and history of the input. In particular, it selects the top-p sentences and history turns by calculating their relevance scores. Our model achieves state-of-the-art performance on CoQA in both the answer-aware and answer-unaware settings.

Semantic-based Pre-training for Dialogue Understanding
Xuefeng Bai | Linfeng Song | Yue Zhang

Pre-trained language models have made great progress on dialogue tasks. However, these models are typically trained on surface dialogue text and have thus been shown to be weak in understanding the main semantic meaning of a dialogue context. We investigate Abstract Meaning Representation (AMR) as explicit semantic knowledge for pre-training models to capture the core semantic information in dialogues during pre-training. In particular, we propose a semantic-based pre-training framework that extends the standard pre-training framework (Devlin et al., 2019) by three tasks for learning 1) core semantic units, 2) semantic relations and 3) the overall semantic representation according to AMR graphs. Experiments on the understanding of both chit-chats and task-oriented dialogues show the superiority of our model. To our knowledge, we are the first to leverage a deep semantic representation for dialogue pre-training.

Distribution Calibration for Out-of-Domain Detection with Bayesian Approximation
Yanan Wu | Zhiyuan Zeng | Keqing He | Yutao Mou | Pei Wang | Weiran Xu

Out-of-Domain (OOD) detection is a key component in a task-oriented dialog system, which aims to identify whether a query falls outside the predefined supported intent set. Previous softmax-based detection algorithms have been shown to be overconfident for OOD samples. In this paper, our analysis suggests that this overconfidence on OOD samples stems from distribution uncertainty caused by the mismatch between the training and test distributions, which prevents the model from making confident predictions and thus likely causes abnormal softmax scores. We propose a Bayesian OOD detection framework to calibrate distribution uncertainty using Monte-Carlo Dropout. Our method is flexible and easily pluggable into existing softmax-based baselines, and it improves OOD F1 by 33.33% while increasing inference time by only 0.41% compared to MSP. Further analyses show the effectiveness of Bayesian learning for OOD detection.
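
As a rough illustration of Monte-Carlo Dropout for OOD detection, the sketch below keeps dropout active at inference, averages the softmax outputs of several stochastic passes, and thresholds the maximum averaged probability. Everything concrete here is an assumption for illustration: the tiny linear classifier, its weights, the dropout rate, the number of passes, and the threshold are hypothetical, and the paper's actual calibration procedure may differ.

```python
import math
import random
from typing import List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward_with_dropout(features: List[float], weights: List[List[float]],
                         p_drop: float, rng: random.Random) -> List[float]:
    """One stochastic pass of a toy linear classifier with inverted dropout."""
    keep = 1.0 - p_drop  # requires p_drop < 1
    dropped = [x / keep if rng.random() < keep else 0.0 for x in features]
    logits = [sum(w * x for w, x in zip(row, dropped)) for row in weights]
    return softmax(logits)


def mc_dropout_predict(features: List[float], weights: List[List[float]],
                       p_drop: float = 0.5, passes: int = 50,
                       seed: int = 0) -> List[float]:
    """Average the class probabilities over several dropout passes."""
    rng = random.Random(seed)
    mean = [0.0] * len(weights)
    for _ in range(passes):
        probs = forward_with_dropout(features, weights, p_drop, rng)
        for i, p in enumerate(probs):
            mean[i] += p / passes
    return mean


def is_ood(mean_probs: List[float], threshold: float = 0.7) -> bool:
    """Flag a query as OOD when the averaged confidence is below the threshold."""
    return max(mean_probs) < threshold


# Hypothetical 3-intent classifier over 4 input features.
toy_weights = [[2.0, 0.0, 0.0, 0.5], [0.0, 2.0, 0.0, 0.5], [0.0, 0.0, 2.0, 0.5]]
mean_probs = mc_dropout_predict([1.0, 0.2, 0.1, 0.3], toy_weights)
print(mean_probs, is_ood(mean_probs))
```

A single deterministic pass would correspond to a plain maximum-softmax-probability score; the averaging step is what the dropout passes add.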

Tracking Satisfaction States for Customer Satisfaction Prediction in E-commerce Service Chatbots
Yang Sun | Liangqing Wu | Shuangyong Song | Xiaoguang Yu | Xiaodong He | Guohong Fu

Due to the increasing use of service chatbots in E-commerce platforms in recent years, customer satisfaction prediction (CSP) is gaining more and more attention. CSP is dedicated to evaluating subjective customer satisfaction in conversational service and thus helps improve customer service experience. However, previous methods focus on modeling customer-chatbot interaction across different turns, which makes it hard to represent the important dynamic satisfaction states throughout the customer journey. In this work, we investigate the problem of satisfaction states tracking and its effects on CSP in E-commerce service chatbots. To this end, we propose a dialogue-level classification model named DialogueCSP to track satisfaction states for CSP. In particular, we explore a novel two-step interaction module to represent the dynamic satisfaction states at each turn. In order to capture dialogue-level satisfaction states for CSP, we further introduce dialogue-aware attentions to integrate historical informative cues into the interaction module. To evaluate the proposed approach, we also build a Chinese E-commerce dataset for CSP. Experimental results demonstrate that our model significantly outperforms multiple baselines, illustrating the benefits of satisfaction states tracking on CSP.

Towards Multi-label Unknown Intent Detection
Yawen Ouyang | Zhen Wu | Xinyu Dai | Shujian Huang | Jiajun Chen

Multi-class unknown intent detection has made remarkable progress recently. However, it has a strong assumption that each utterance has only one intent, which does not conform to reality because utterances often have multiple intents. In this paper, we propose a more desirable task, multi-label unknown intent detection, to detect whether the utterance contains an unknown intent, in which each utterance may contain multiple intents. In this task, the unique utterances simultaneously containing known and unknown intents cause existing multi-class methods to fail easily. To address this issue, we propose an intuitive and effective method to recognize whether All Intents contained in the utterance are Known (AIK). Our high-level idea is to predict the utterance’s intent number, then check whether the utterance contains the same number of known intents. If the number of known intents is less than the number of intents, it implies that the utterance also contains unknown intents. We benchmark AIK over existing methods, and empirical results suggest that our method obtains state-of-the-art performance. For example, on the MultiWOZ 2.3 dataset, AIK significantly reduces the FPR95 by 12.25% compared to the best baseline.
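
The counting rule described in this abstract can be written down directly. The sketch below assumes two upstream components that are not specified here: a predictor of the utterance's intent number and a multi-label classifier over known intents. The function name, the probability threshold of 0.5, and the example intents are hypothetical.

```python
from typing import Dict


def contains_unknown_intent(predicted_intent_count: int,
                            known_intent_probs: Dict[str, float],
                            threshold: float = 0.5) -> bool:
    """Return True when fewer known intents are detected than intents predicted.

    predicted_intent_count: output of a (hypothetical) intent-number predictor.
    known_intent_probs: per-intent probabilities from a (hypothetical)
        multi-label classifier over the known intent set.
    """
    detected_known = [name for name, p in known_intent_probs.items() if p >= threshold]
    return len(detected_known) < predicted_intent_count


probs = {"book_hotel": 0.9, "find_taxi": 0.8, "find_train": 0.1}
# Three intents predicted but only two known intents detected: one is unknown.
print(contains_unknown_intent(3, probs))
# Two intents predicted and two known intents detected: all intents are known.
print(contains_unknown_intent(2, probs))
```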

Pan More Gold from the Sand: Refining Open-domain Dialogue Training with Noisy Self-Retrieval Generation
Yihe Wang | Yitong Li | Yasheng Wang | Fei Mi | Pingyi Zhou | Xin Wang | Jin Liu | Xin Jiang | Qun Liu

Real human conversation data are complicated, heterogeneous, and noisy, which makes building open-domain dialogue systems from them a challenging task. Yet such dialogue data still contain a wealth of information and knowledge that has not been fully explored. In this paper, we show that existing open-domain dialogue generation methods, which memorize context-response paired data with autoregressive or encoder-decoder language models, underutilize the training data. Unlike current approaches that use external knowledge, we explore a retrieval-generation training framework that can take advantage of the heterogeneous and noisy training data by considering them as “evidence”. In particular, we use BERTScore for retrieval, which improves the quality of both the evidence and the generation. Experiments on publicly available datasets demonstrate that our method can help models generate better responses, even though such training data are usually regarded as low-quality. This performance gain is comparable to, or even better than, that obtained by enlarging the training set. We also found that the model performance has a positive correlation with the relevance of the retrieved evidence. Moreover, our method performed well in zero-shot experiments, which indicates that it can be more robust to real-world data.

MulZDG: Multilingual Code-Switching Framework for Zero-shot Dialogue Generation
Yongkang Liu | Shi Feng | Daling Wang | Yifei Zhang

Building dialogue generation systems in a zero-shot scenario remains a huge challenge, since the typical zero-shot approaches in dialogue generation rely heavily on large-scale pre-trained language generation models such as GPT-3 and T5. Research on zero-shot dialogue generation without cumbersome language models is limited due to the lack of corresponding parallel dialogue corpora. In this paper, we propose a simple but effective Multilingual learning framework for Zero-shot Dialogue Generation (dubbed MulZDG) that can effectively transfer knowledge from an English corpus with large-scale training samples to a non-English corpus with zero samples. In addition, MulZDG can be viewed as a multilingual data augmentation method to improve the performance of the resource-rich language. First, we construct multilingual code-switching dialogue datasets by translating utterances randomly selected from monolingual English datasets. Then we employ MulZDG to train a unified multilingual dialogue model based on the code-switching datasets. MulZDG can conduct implicit semantic alignment between different languages. Experiments on the DailyDialog and DSTC7 datasets demonstrate that MulZDG not only achieves competitive performance in the zero-shot case compared to training with sufficient examples but also greatly improves the performance of the source language.
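
The dataset construction step (translating randomly selected utterances of an English dialogue) can be sketched as follows. This is only an illustration under stated assumptions: `code_switch_dialogue`, `switch_ratio`, and the stand-in `toy_translate` function are hypothetical names, and a real pipeline would call an actual machine translation system instead of the stub.

```python
import random
from typing import Callable, List


def code_switch_dialogue(
    dialogue: List[str],
    translate: Callable[[str], str],
    switch_ratio: float = 0.5,
    seed: int = 0,
) -> List[str]:
    """Translate a random subset of utterances, leaving the rest in English.

    switch_ratio is an assumed knob for the fraction of utterances to switch.
    """
    rng = random.Random(seed)
    k = round(len(dialogue) * switch_ratio)
    # Pick k utterance positions uniformly at random, without replacement.
    chosen = set(rng.sample(range(len(dialogue)), k))
    return [translate(u) if i in chosen else u for i, u in enumerate(dialogue)]


def toy_translate(utterance: str) -> str:
    """Stand-in for a real translation system; it only tags the utterance."""
    return f"[de] {utterance}"


dialogue = [
    "Hi, how are you?",
    "Fine, thanks. And you?",
    "Pretty good. Any plans today?",
    "I might go hiking.",
]
mixed = code_switch_dialogue(dialogue, toy_translate, switch_ratio=0.5)
for utterance in mixed:
    print(utterance)
```

Fixing the seed keeps the selection reproducible, which makes it easier to compare models trained on the same code-switched data.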

Target-Guided Open-Domain Conversation Planning
Yosuke Kishinami | Reina Akama | Shiki Sato | Ryoko Tokuhisa | Jun Suzuki | Kentaro Inui

Prior studies addressing target-oriented conversational tasks lack a crucial notion that has been intensively studied in the context of goal-oriented artificial intelligence agents, namely, planning. In this study, we propose the Target-Guided Open-Domain Conversation Planning (TGCP) task to evaluate whether neural conversational agents have goal-oriented conversation planning abilities. Using the TGCP task, we investigate the conversation planning abilities of existing retrieval models and recent strong generative models. The experimental results reveal the challenges facing current technology.

Does GPT-3 Generate Empathetic Dialogues? A Novel In-Context Example Selection Method and Automatic Evaluation Metric for Empathetic Dialogue Generation
Young-Jun Lee | Chae-Gyun Lim | Ho-Jin Choi

Since empathy plays a crucial role in increasing social bonding between people, many studies have designed their own dialogue agents to be empathetic using the well-established method of fine-tuning. However, they do not use prompt-based in-context learning, which has shown powerful performance in various natural language processing (NLP) tasks, for empathetic dialogue generation. Although several studies have investigated few-shot in-context learning for empathetic dialogue generation, an in-depth analysis of the generation of empathetic dialogue with in-context learning remains unclear, especially in GPT-3 (Brown et al., 2020). In this study, we explore whether GPT-3 can generate empathetic dialogues through prompt-based in-context learning in both zero-shot and few-shot settings. To enhance performance, we propose two new in-context example selection methods, called SITSM and EMOSITSM, that utilize emotion and situational information. We also introduce a new automatic evaluation method, DIFF-EPITOME, which reflects the human tendency to express empathy. From the analysis, we reveal that our DIFF-EPITOME is effective in measuring the degree of human empathy. We show that GPT-3 achieves competitive performance with Blender 90M, a state-of-the-art dialogue generative model, on both automatic and human evaluation. Our code is available at

DialogueEIN: Emotion Interaction Network for Dialogue Affective Analysis
Yuchen Liu | Jinming Zhao | Jingwen Hu | Ruichen Li | Qin Jin

Emotion Recognition in Conversation (ERC) has attracted increasing attention in the affective computing research field. Previous works have mainly focused on modeling the semantic interactions in the dialogue and implicitly inferring the evolution of the speakers’ emotional states. Few works have considered the emotional interactions, which directly reflect the emotional evolution of speakers in the dialogue. According to psychological and behavioral studies, the emotional inertia and emotional stimulus are important factors that affect the speaker’s emotional state in conversations. In this work, we propose a novel Dialogue Emotion Interaction Network, DialogueEIN, to explicitly model the intra-speaker, inter-speaker, global and local emotional interactions to respectively simulate the emotional inertia, emotional stimulus, global and local emotional evolution in dialogues. Extensive experiments on four ERC benchmark datasets, IEMOCAP, MELD, EmoryNLP and DailyDialog, show that our proposed DialogueEIN considering emotional interaction factors can achieve superior or competitive performance compared to state-of-the-art methods. Our codes and models are released.

Towards Enhancing Health Coaching Dialogue in Low-Resource Settings
Yue Zhou | Barbara Di Eugenio | Brian Ziebart | Lisa Sharp | Bing Liu | Ben Gerber | Nikolaos Agadakos | Shweta Yadav

Health coaching helps patients identify and accomplish lifestyle-related goals, effectively improving the control of chronic diseases and mitigating mental health conditions. However, health coaching is cost-prohibitive due to its highly personalized and labor-intensive nature. In this paper, we propose to build a dialogue system that converses with the patients, helps them create and accomplish specific goals, and can address their emotions with empathy. However, building such a system is challenging since real-world health coaching datasets are limited and empathy is subtle. Thus, we propose a modularized health coaching dialogue with simplified NLU and NLG frameworks combined with mechanism-conditioned empathetic response generation. Through automatic and human evaluation, we show that our system generates more empathetic, fluent, and coherent responses and outperforms the state-of-the-art in NLU tasks while requiring less annotation. We view our approach as a key step towards building automated and more accessible health coaching systems.

Generalized Intent Discovery: Learning from Open World Dialogue System
Yutao Mou | Keqing He | Yanan Wu | Pei Wang | Jingang Wang | Wei Wu | Yi Huang | Junlan Feng | Weiran Xu

Traditional intent classification models are based on a pre-defined intent set and only recognize limited in-domain (IND) intent classes. But users may input out-of-domain (OOD) queries in a practical dialogue system. Such OOD queries can provide directions for future improvement. In this paper, we define a new task, Generalized Intent Discovery (GID), which aims to extend an IND intent classifier to an open-world intent set including IND and OOD intents. We hope to simultaneously classify a set of labeled IND intent classes while discovering and recognizing new unlabeled OOD types incrementally. We construct three public datasets for different application scenarios and propose two kinds of frameworks, pipeline-based and end-to-end for future work. Further, we conduct exhaustive experiments and qualitative analysis to comprehend key challenges and provide new guidance for future GID research.

DialMed: A Dataset for Dialogue-based Medication Recommendation
Zhenfeng He | Yuqiang Han | Zhenqiu Ouyang | Wei Gao | Hongxu Chen | Guandong Xu | Jian Wu

Medication recommendation is a crucial task for intelligent healthcare systems. Previous studies mainly recommend medications with electronic health records (EHRs). However, some details of interactions between doctors and patients, which are essential for automatic medication recommendation, may be ignored or omitted in EHRs. Therefore, we make the first attempt to recommend medications with the conversations between doctors and patients. In this work, we construct DIALMED, the first high-quality dataset for the medical dialogue-based medication recommendation task. It contains 11,996 medical dialogues related to 16 common diseases from 3 departments and 70 corresponding common medications. Furthermore, we propose a Dialogue structure and Disease knowledge aware Network (DDN), where a QA Dialogue Graph mechanism is designed to model the dialogue structure and the knowledge graph is used to introduce external disease knowledge. The extensive experimental results demonstrate that the proposed method is a promising solution to recommend medications with medical dialogues. The dataset and code are available at

Speaker Clustering in Textual Dialogue with Pairwise Utterance Relation and Cross-corpus Dialogue Act Supervision
Zhihua Su | Qiang Zhou

We propose a speaker clustering model for textual dialogues, which groups the utterances of a multi-party dialogue without speaker annotations, so that the actual speakers are identical inside each cluster. We find that, without knowing the speakers, the interactions between utterances are still implied in the text, which suggest the relations between speakers. In this work, we model the semantic content of utterance with a pre-trained language model, and the relations between speakers with an utterance-level pairwise matrix. The semantic content representation can be further instructed by cross-corpus dialogue act modeling. The speaker labels are finally generated by spectral clustering. Experiments show that our model outperforms the sequence classification baseline, and benefits from the auxiliary dialogue act classification task. We also discuss the detail of determining the number of speakers (clusters), eliminating the interference caused by semantic similarity, and the impact of utterance distance.

TopKG: Target-oriented Dialog via Global Planning on Knowledge Graph
Zhitong Yang | Bo Wang | Jinfeng Zhou | Yue Tan | Dongming Zhao | Kun Huang | Ruifang He | Yuexian Hou

Target-oriented dialog aims to reach a global target through multi-turn conversation. The key to the task is global planning towards the target, which flexibly guides the dialog according to the context. However, existing target-oriented dialog works take a local and greedy strategy for response generation, where global planning is absent. In this work, we propose global planning for target-oriented dialog on a commonsense knowledge graph (KG). We design a global reinforcement learning method that uses the planned paths to flexibly adjust the local response generation model towards the global target. We also propose a KG-based method to collect target-oriented samples automatically from the chit-chat corpus for model training. Experiments show that our method can reach the target with a higher success rate, fewer turns, and more coherent responses.

Extractive Summarisation for German-language Data: A Text-level Approach with Discourse Features
Freya Hewett | Manfred Stede

We examine the link between facets of Rhetorical Structure Theory (RST) and the selection of content for extractive summarisation, for German-language texts. For this purpose, we produce a set of extractive summaries for a dataset of German-language newspaper commentaries, a corpus which already has several layers of annotation. We provide an in-depth analysis of the connection between summary sentences and several RST-based features and transfer these insights to various automated summarisation models. Our results show that RST features are informative for the task of extractive summarisation, particularly nuclearity and relations at sentence-level.

End-to-End Neural Bridging Resolution
Hideo Kobayashi | Yufang Hou | Vincent Ng

The state of bridging resolution research is rather unsatisfactory: not only are state-of-the-art resolvers evaluated in unrealistic settings, but the neural models underlying these resolvers are weaker than those used for entity coreference resolution. In light of these problems, we evaluate bridging resolvers in an end-to-end setting, strengthen them with better encoders, and attempt to gain a better understanding of them via perturbation experiments and a manual analysis of their outputs.

Investigating the Performance of Transformer-Based NLI Models on Presuppositional Inferences
Jad Kabbara | Jackie Chi Kit Cheung

Presuppositions are assumptions that are taken for granted by an utterance, and identifying them is key to a pragmatic interpretation of language. In this paper, we investigate the capabilities of transformer models to perform NLI on cases involving presupposition. First, we present simple heuristics to create alternative “contrastive” test cases based on the ImpPres dataset and investigate the model performance on those test cases. Second, to better understand how the model is making its predictions, we analyze samples from sub-datasets of ImpPres and examine model performance on them. Overall, our findings suggest that NLI-trained transformer models seem to be exploiting specific structural and lexical cues as opposed to performing some kind of pragmatic reasoning.

Re-Examining FactBank: Predicting the Author’s Presentation of Factuality
John Murzaku | Peter Zeng | Magdalena Markowska | Owen Rambow

We present a corrected version of a subset of the FactBank data set. Previously published results on FactBank are no longer valid. We perform experiments on FactBank using multiple training paradigms, data smoothing techniques, and polarity classifiers. We argue that f-measure is an important alternative evaluation metric for factuality. We provide new state-of-the-art results for four corpora including FactBank. We perform an error analysis on FactBank combined with two similar corpora.

The Role of Context and Uncertainty in Shallow Discourse Parsing
Katherine Atwell | Remi Choi | Junyi Jessy Li | Malihe Alikhani

Discourse parsing has proven to be useful for a number of NLP tasks that require complex reasoning. However, over a decade since the advent of the Penn Discourse Treebank, predicting implicit discourse relations in text remains challenging. There are several possible reasons for this, and we hypothesize that models should be exposed to more context as it plays an important role in accurate human annotation; meanwhile adding uncertainty measures can improve model accuracy and calibration. To thoroughly investigate this phenomenon, we perform a series of experiments to determine 1) the effects of context on human judgments, and 2) the effect of quantifying uncertainty with annotator confidence ratings on model accuracy and calibration (which we measure using the Brier score (Brier et al, 1950)). We find that including annotator accuracy and confidence improves model accuracy, and incorporating confidence in the model’s temperature function can lead to models with significantly better-calibrated confidence measures. We also find some insightful qualitative results regarding human and model behavior on these datasets.
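
For readers unfamiliar with the calibration metric named in this abstract, the Brier score in its multi-class form is the mean squared difference between the predicted probability vector and the one-hot true label. The sketch below is a generic illustration, not the authors' evaluation code; the function name and toy inputs are assumptions.

```python
from typing import List, Sequence


def brier_score(probs: Sequence[Sequence[float]], labels: List[int]) -> float:
    """Multi-class Brier score: mean over examples of the summed squared error
    between predicted class probabilities and the one-hot true label.
    Lower is better; 0.0 means perfectly confident and correct predictions.
    """
    if len(probs) != len(labels) or not labels:
        raise ValueError("probs and labels must be non-empty and equally long")
    total = 0.0
    for p, y in zip(probs, labels):
        # Squared distance between the prediction and the one-hot target.
        total += sum((p_k - (1.0 if k == y else 0.0)) ** 2 for k, p_k in enumerate(p))
    return total / len(labels)


# Toy predictions over two relation classes; the true class is 0 in each case.
confident_right = brier_score([[1.0, 0.0]], [0])
uncertain = brier_score([[0.5, 0.5]], [0])
confident_wrong = brier_score([[0.0, 1.0]], [0])
print(confident_right, uncertain, confident_wrong)
```

The score penalizes a confidently wrong prediction more than an uncertain one, which is why it is used to assess calibration as well as accuracy.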

Improving Commonsense Contingent Reasoning by Pseudo-data and Its Application to the Related Tasks
Kazumasa Omura | Sadao Kurohashi

Contingent reasoning is one of the essential abilities in natural language understanding, and many language resources annotated with contingent relations have been constructed. However, despite the recent advances in deep learning, the task of contingent reasoning is still difficult for computers. In this study, we focus on the reasoning of contingent relation between basic events. Based on the existing data construction method, we automatically generate large-scale pseudo-problems and incorporate the generated data into training. We also investigate the generality of contingent knowledge through quantitative evaluation by performing transfer learning on the related tasks: discourse relation analysis, the Japanese Winograd Schema Challenge, and the JCommonsenseQA. The experimental results show the effectiveness of utilizing pseudo-problems for both the commonsense contingent reasoning task and the related tasks, which suggests the importance of contingent reasoning.

A Survey in Automatic Irony Processing: Linguistic, Cognitive, and Multi-X Perspectives
Qingcheng Zeng | An-Ran Li

Irony is a ubiquitous form of figurative language in daily communication. Previously, many researchers have approached irony from linguistic, cognitive science, and computational aspects. Recently, some progress has been witnessed in automatic irony processing due to the rapid development of deep neural models in natural language processing (NLP). In this paper, we provide a comprehensive overview of computational irony, insights from linguistic theory and cognitive science, as well as its interactions with downstream NLP tasks and newly proposed multi-X irony processing perspectives.

Towards Identifying Alternative-Lexicalization Signals of Discourse Relations
René Knaebel | Manfred Stede

The task of shallow discourse parsing in the Penn Discourse Treebank (PDTB) framework has traditionally been restricted to identifying those relations that are signaled by a discourse connective (“explicit”) and those that have no signal at all (“implicit”). The third type, the more flexible group of “AltLex” realizations, has been neglected because of its small number of occurrences in the PDTB2 corpus. Their number has grown significantly in the recent PDTB3, and in this paper, we present the first approaches for recognizing these “alternative lexicalizations”. We compare the performance of a pattern-based approach and a sequence labeling model, add an experiment on the pre-classification of candidate sentences, and provide an initial qualitative analysis of the error cases made by both models.

Topicalization in Language Models: A Case Study on Japanese
Riki Fujihara | Tatsuki Kuribayashi | Kaori Abe | Ryoko Tokuhisa | Kentaro Inui

Humans use different wordings depending on the context to facilitate efficient communication. For example, instead of completely new information, information related to the preceding context is typically placed at the sentence-initial position. In this study, we analyze whether neural language models (LMs) can capture such discourse-level preferences in text generation. Specifically, we focus on a particular aspect of discourse, namely the topic-comment structure. To analyze the linguistic knowledge of LMs separately, we chose the Japanese language, a topic-prominent language, for designing probing tasks, and we created human topicalization judgment data by crowdsourcing. Our experimental results suggest that LMs have different generalizations from humans; LMs exhibited less context-dependent behaviors toward topicalization judgment. These results highlight the need for additional inductive biases to guide LMs to achieve successful discourse-level generalization.

“No, They Did Not”: Dialogue Response Dynamics in Pre-trained Language Models
Sanghee J. Kim | Lang Yu | Allyson Ettinger

A critical component of competence in language is being able to identify relevant components of an utterance and reply appropriately. In this paper we examine the extent of such dialogue response sensitivity in pre-trained language models, conducting a series of experiments with a particular focus on sensitivity to dynamics involving phenomena of at-issueness and ellipsis. We find that models show clear sensitivity to a distinctive role of embedded clauses, and a general preference for responses that target main clause content of prior utterances. However, the results indicate mixed and generally weak trends with respect to capturing the full range of dynamics involved in targeting at-issue versus not-at-issue content. Additionally, models show fundamental limitations in grasp of the dynamics governing ellipsis, and response selections show clear interference from superficial factors that outweigh the influence of principled discourse constraints.

New or Old? Exploring How Pre-Trained Language Models Represent Discourse Entities
Sharid Loáiciga | Anne Beyer | David Schlangen

Recent research shows that pre-trained language models, built to generate text conditioned on some context, learn to encode syntactic knowledge to a certain degree. This has motivated researchers to move beyond the sentence-level and look into their ability to encode less studied discourse-level phenomena. In this paper, we add to the body of probing research by investigating discourse entity representations in large pre-trained language models in English. Motivated by early theories of discourse and key pieces of previous work, we focus on the information-status of entities as discourse-new or discourse-old. We present two probing models, one based on binary classification and another one on sequence labeling. The results of our experiments show that pre-trained language models do encode information on whether an entity has been introduced before or not in the discourse. However, this information alone is not sufficient to find the entities in a discourse, opening up interesting questions about the definition of entities for future work.

Dialo-AP: A Dependency Parsing Based Argument Parser for Dialogues
Sougata Saha | Souvik Das | Rohini K. Srihari

While neural approaches to argument mining (AM) have advanced considerably, most of the recent work has been limited to parsing monologues. With an urgent interest in the use of conversational agents for broader societal applications, there is a need to advance the state-of-the-art in argument parsers for dialogues. This enables progress towards more purposeful conversations involving persuasion, debate and deliberation. This paper discusses Dialo-AP, an end-to-end argument parser that constructs argument graphs from dialogues. We formulate AM as dependency parsing of elementary and argumentative discourse units; the system is trained using extensive pre-training and curriculum learning comprising nine diverse corpora. Dialo-AP is capable of generating argument graphs from dialogues by performing all sub-tasks of AM. Compared to existing state-of-the-art baselines, Dialo-AP achieves significant improvements across all tasks, which is further validated through rigorous human evaluation.

ConnPrompt: Connective-cloze Prompt Learning for Implicit Discourse Relation Recognition
Wei Xiang | Zhenglin Wang | Lu Dai | Bang Wang

Implicit Discourse Relation Recognition (IDRR) aims to detect and classify the relation sense between two text segments without an explicit connective. The vanilla pre-train and fine-tuning paradigm builds upon a Pre-trained Language Model (PLM) with a task-specific neural network. However, the task objective functions are often not in accordance with that of the PLM. Furthermore, this paradigm cannot fully exploit some linguistic evidence embedded in the pre-training process. The recent pre-train, prompt, and predict paradigm selects appropriate prompts to reformulate downstream tasks, so as to utilize the PLM itself for prediction. However, for successful application, the prompts, the verbalizer, and model training should still be carefully designed for different tasks. As the first trial of using this new paradigm for IDRR, this paper develops a Connective-cloze Prompt (ConnPrompt) to transform the relation prediction task into a connective-cloze task. Specifically, we design two styles of ConnPrompt template, Insert-cloze Prompt (ICP) and Prefix-cloze Prompt (PCP), and construct an answer space mapping to the relation senses based on the hierarchy sense tags and implicit connectives. Furthermore, we use a multi-prompt ensemble to fuse predictions from different prompting results. Experiments on the PDTB corpus show that our method significantly outperforms the state-of-the-art algorithms, even with less training data.

A Distance-Aware Multi-Task Framework for Conversational Discourse Parsing
Yaxin Fan | Peifeng Li | Fang Kong | Qiaoming Zhu

Conversational discourse parsing aims to construct an implicit utterance dependency tree to reflect the turn-taking in a multi-party conversation. Existing works are generally divided into two lines: graph-based and transition-based paradigms, which perform well for short-distance and long-distance dependency links, respectively. However, no study has considered combining the advantages of both paradigms to facilitate conversational discourse parsing. We therefore propose a distance-aware multi-task framework, DAMT, that incorporates the strengths of the transition-based paradigm to facilitate the graph-based paradigm in both the encoding and decoding processes. To promote multi-task learning on the two paradigms, we first introduce an Encoding Interactive Module (EIM) to enhance the flow of semantic information between the two paradigms during the encoding step. We then apply a Distance-Aware Graph Convolutional Network (DAGCN) in the decoding process, which can incorporate the different-distance dependency links predicted by the transition-based paradigm to facilitate the decoding of the graph-based paradigm. The experimental results on the datasets STAC and Molweni show that our method can significantly improve the performance of the SOTA graph-based paradigm on long-distance dependency links.

Linguistically Motivated Features for Classifying Shorter Text into Fiction and Non-Fiction Genre
Arman Kazmi | Sidharth Ranjan | Arpit Sharma | Rajakrishnan Rajkumar

This work deploys linguistically motivated features to classify paragraph-level text into fiction and non-fiction genres using a logistic regression model and infers lexical and syntactic properties that distinguish the two genres. Previous works have focused on classifying document-level text into fiction and non-fiction genres, while in this work, we deal with shorter texts, which are closer to real-world applications like sentiment analysis of tweets. Going beyond the simple POS tag ratios proposed in Qureshi et al. (2019) for document-level classification, we extracted multiple linguistically motivated features belonging to four categories: Lexical features, POS ratio features, Syntactic features and Raw features. For the task of short-text classification, a model containing the 28 best features (selected via Recursive feature elimination with cross-validation; RFECV) confers an accuracy jump of 15.56% over a baseline model consisting of 2 POS-ratio features found effective in previous work (cited above). The efficacy of the above model containing a linguistically motivated feature set also transfers over to another dataset, viz. the Baby BNC corpus. We also compared the classification accuracy of the logistic regression model with two deep-learning models. A 1D CNN model gives an increase of 2% accuracy over the logistic regression classifier on both corpora. The BERT-base-uncased model gives the best classification accuracy of 97% on the Brown corpus and 98% on the Baby BNC corpus. Although both deep learning models give better results in terms of classification accuracy, the problem of interpreting these models remains unsolved. In contrast, regression model coefficients revealed that fiction texts tend to have more character-level diversity and lower lexical density (quantified using content-function word ratios) compared to non-fiction texts. Moreover, subtle differences in word order exist between the two genres, i.e., in fiction texts Verbs precede Adverbs (inter alia).

Semantic Sentence Matching via Interacting Syntax Graphs
Chen Xu | Jun Xu | Zhenhua Dong | Ji-Rong Wen

Studies have shown that a sentence’s syntactic structures are important for semantic sentence matching. A typical approach is encoding each sentence’s syntactic structure into an embedding vector, which can be combined with other features to predict the final matching scores. Though successes have been observed, embedding the whole syntactic structure as one vector inevitably overlooks the fine-grained syntax matching patterns, e.g. the alignment of specific term dependency relations in the two input sentences. In this paper, we formalize the task of semantic sentence matching as a problem of graph matching in which each sentence is represented as a directed graph according to its syntactic structures. The syntax matching patterns (i.e. similar syntactic structures) between two sentences, therefore, can be extracted as the sub-graph structure alignments. The proposed method, referred to as Interacted Syntax Graphs (ISG), represents two sentences’ syntactic alignments as well as their semantic matching signals in one association graph. After that, the neural quadratic assignment programming (QAP) is adapted to extract syntactic matching patterns from the association graph. In this way, the syntactic structures fully interact at a fine granularity during the matching process. Experimental results on three public datasets demonstrated that ISG can outperform the state-of-the-art baselines effectively and efficiently. The empirical analysis also showed that ISG can match sentences in an interpretable way.

Hierarchical Information Matters: Text Classification via Tree Based Graph Neural Network
Chong Zhang | He Zhu | Xingyu Peng | Junran Wu | Ke Xu

Text classification is a primary task in natural language processing (NLP). Recently, graph neural networks (GNNs) have developed rapidly and been applied to text classification tasks. As a special kind of graph data, the tree has a simpler data structure and can provide rich hierarchical information for text classification. Inspired by the structural entropy, we construct the coding tree of the graph by minimizing the structural entropy and propose HINT, which aims to make full use of the hierarchical information contained in the text for the task of text classification. Specifically, we first establish a dependency parsing graph for each text. Then we designed a structural entropy minimization algorithm to decode the key information in the graph and convert each graph to its corresponding coding tree. Based on the hierarchical structure of the coding tree, the representation of the entire graph is obtained by updating the representation of non-leaf nodes in the coding tree layer by layer. Finally, we present the effectiveness of hierarchical information in text classification. Experimental results show that HINT outperforms the state-of-the-art methods on popular benchmarks while having a simple structure and few parameters.

SelfMix: Robust Learning against Textual Label Noise with Self-Mixup Training
Dan Qiao | Chenchen Dai | Yuyang Ding | Juntao Li | Qiang Chen | Wenliang Chen | Min Zhang

The conventional success of textual classification relies on annotated data, and the new paradigm of pre-trained language models (PLMs) still requires some labeled data for downstream tasks. However, in real-world applications, label noise inevitably exists in training data, damaging the effectiveness, robustness, and generalization of the models constructed on such data. Recently, remarkable achievements have been made to mitigate this dilemma in visual data, while only a few works explore textual data. To fill this gap, we present SelfMix, a simple yet effective method to handle label noise in text classification tasks. SelfMix uses the Gaussian Mixture Model to separate samples and leverages semi-supervised learning. Unlike previous works requiring multiple models, our method utilizes the dropout mechanism on a single model to reduce the confirmation bias in self-training and introduces a textual-level mixup training strategy. Experimental results on three text classification benchmarks with different types of text show that our proposed method outperforms strong baselines designed for both textual and visual data under different noise ratios and noise types. Our anonymous code is available at
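
As a rough illustration of the mixup idea mentioned above, the sketch below interpolates two sentence representations and their label distributions. It shows generic mixup only, under the assumption that mixing happens on fixed-size sentence vectors; the function name `mixup` and the toy vectors are hypothetical and are not taken from the SelfMix code.

```python
from typing import List, Tuple


def mixup(
    x_a: List[float], y_a: List[float],
    x_b: List[float], y_b: List[float],
    lam: float,
) -> Tuple[List[float], List[float]]:
    """Convex combination of two (representation, label distribution) pairs.

    lam is the mixing weight in [0, 1]; in practice it is often sampled from
    a Beta distribution, which is left to the caller here.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if len(x_a) != len(x_b) or len(y_a) != len(y_b):
        raise ValueError("paired inputs must have matching sizes")
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x_mix, y_mix


# Toy usage: two 2-dimensional sentence vectors with one-hot labels.
x_mix, y_mix = mixup([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], lam=0.75)
```

The mixed label stays a valid probability distribution because both inputs are distributions and the weights sum to one.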

Community Topic: Topic Model Inference by Consecutive Word Community Discovery
Eric Austin | Osmar R. Zaïane | Christine Largeron

We present our novel, hyperparameter-free topic modelling algorithm, Community Topic. Our algorithm is based on mining communities from term co-occurrence networks. We empirically evaluate and compare Community Topic with Latent Dirichlet Allocation and the recently developed top2vec algorithm. We find that Community Topic runs faster than the competitors and produces topics that achieve higher coherence scores. Community Topic can discover coherent topics at various scales. The network representation used by Community Topic results in a natural relationship between topics and a topic hierarchy. This allows sub- and super-topics to be found on demand. These features make Community Topic the ideal tool for downstream applications such as applied research and conversational agents.

Where to Attack: A Dynamic Locator Model for Backdoor Attack in Text Classifications
Heng-yang Lu | Chenyou Fan | Jun Yang | Cong Hu | Wei Fang | Xiao-jun Wu

Nowadays, deep-learning based NLP models are usually trained with large-scale third-party data, into which malicious backdoors can easily be injected. Thus, the study of BackDoor Attacks (BDA) has become a trending research topic that helps promote the robustness of NLP systems. Text-based BDA aims to train a poisoned model with both clean and poisoned texts so that it performs normally on clean inputs while being misled into predicting trigger-embedded texts as target labels set by attackers. Previous works usually choose fixed Positions-to-Poison (P2P) first, then add triggers at those positions, such as letter insertion or deletion. However, since the positions of words with important semantics may vary in different contexts, fixed P2P models are severely limited in flexibility and performance. We study text-based BDA from the perspective of automatically and dynamically selecting P2P from contexts. We design a novel Locator model which can predict P2P dynamically without human intervention. Based on the predicted P2P, four effective strategies are introduced to show the BDA performance. Experiments on two public datasets show both a smaller test accuracy gap on clean data and a higher attack success rate on poisoned data. Human evaluation with volunteers also shows that the P2P predicted by our model are important for classification. Source code is available at

Locally Distributed Activation Vectors for Guided Feature Attribution
Housam K. B. Bashier | Mi-Young Kim | Randy Goebel

Explaining the predictions of a deep neural network (DNN) is a challenging problem. Many attempts at interpreting those predictions have focused on attribution-based methods, which assess the contributions of individual features to each model prediction. However, attribution-based explanations do not always provide faithful explanations to the target model, e.g., noisy gradients can result in unfaithful feature attribution for back-propagation methods. We present a method to learn explanations-specific representations while constructing deep network models for text classification. These representations can be used to faithfully interpret black-box predictions, i.e., highlighting the most important input features and their role in any particular prediction. We show that learning specific representations improves model interpretability across various tasks, for both qualitative and quantitative evaluations, while preserving predictive performance.

Addressing Leakage in Self-Supervised Contextualized Code Retrieval
Johannes Villmow | Viola Campos | Adrian Ulges | Ulrich Schwanecke

We address contextualized code retrieval, the search for code snippets helpful to fill gaps in a partial input program. Our approach facilitates a large-scale self-supervised contrastive training by splitting source code randomly into contexts and targets. To combat leakage between the two, we suggest a novel approach based on mutual identifier masking, dedentation, and the selection of syntax-aligned targets. Our second contribution is a new dataset for direct evaluation of contextualized code retrieval, based on a dataset of manually aligned subpassages of code clones. Our experiments demonstrate that the proposed approach improves retrieval substantially, and yields new state-of-the-art results for code clone and defect detection.
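
The context/target splitting and dedentation described above can be sketched roughly as follows. This is a simplified illustration, not the authors' pipeline: it cuts a random block of contiguous lines instead of a syntax-aligned target, it omits identifier masking, and the `<MASK>` placeholder and the function name are assumptions.

```python
import random
import textwrap
from typing import Tuple


def split_context_target(
    source: str, target_len: int, rng: random.Random
) -> Tuple[str, str]:
    """Cut `target_len` contiguous lines out of `source`.

    Returns (context, target). The context keeps a placeholder where the
    target was removed; the target is dedented so that its indentation
    level does not reveal where in the context it came from.
    """
    lines = source.splitlines()
    if not 0 < target_len <= len(lines):
        raise ValueError("target_len must be between 1 and the number of lines")
    start = rng.randrange(0, len(lines) - target_len + 1)
    target_lines = lines[start:start + target_len]
    context_lines = lines[:start] + ["<MASK>"] + lines[start + target_len:]
    target = textwrap.dedent("\n".join(target_lines))
    return "\n".join(context_lines), target


SOURCE = """def f(x):
    if x:
        y = x + 1
        return y
    return 0"""

context, target = split_context_target(SOURCE, target_len=2, rng=random.Random(0))
```

Dedenting the target removes one obvious shortcut: without it, a retriever could match context and target on indentation depth alone.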

A Domain Knowledge Enhanced Pre-Trained Language Model for Vertical Search: Case Study on Medicinal Products
Kesong Liu | Jianhui Jiang | Feifei Lyu

We present a biomedical knowledge enhanced pre-trained language model for medicinal product vertical search. Following ELECTRA’s replaced token detection (RTD) pre-training, we leverage biomedical entity masking (EM) strategy to learn better contextual word representations. Furthermore, we propose a novel pre-training task, product attribute prediction (PAP), to inject product knowledge into the pre-trained language model efficiently by leveraging medicinal product databases directly. By sharing the parameters of PAP’s transformer encoder with that of RTD’s main transformer, these two pre-training tasks are jointly learned. Experiments demonstrate the effectiveness of PAP task for pre-trained language model on medicinal product vertical search scenario, which includes query-title relevance, query intent classification, and named entity recognition in query.

CONCRETE: Improving Cross-lingual Fact-checking with Cross-lingual Retrieval
Kung-Hsiang Huang | ChengXiang Zhai | Heng Ji

Fact-checking has gained increasing attention due to the spread of falsified information. Most fact-checking approaches focus only on claims made in English due to the data scarcity issue in other languages. The lack of fact-checking datasets in low-resource languages calls for an effective cross-lingual transfer technique for fact-checking. Additionally, trustworthy information in different languages can be complementary and helpful in verifying facts. To this end, we present the first fact-checking framework augmented with cross-lingual retrieval that aggregates evidence retrieved from multiple languages through a cross-lingual retriever. Given the absence of cross-lingual information retrieval datasets with claim-like queries, we train the retriever with our proposed Cross-lingual Inverse Cloze Task (X-ICT), a self-supervised algorithm that creates training instances by translating the title of a passage. The goal of X-ICT is to learn cross-lingual retrieval in which the model learns to identify the passage corresponding to a given translated title. On the X-Fact dataset, our approach achieves 2.23% absolute F1 improvement in the zero-shot cross-lingual setup over prior systems. The source code and data are publicly available at
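
The way X-ICT builds training instances can be pictured with the short sketch below. It only illustrates the pairing of a translated title with its passage; the `translate` callable, the field names `title` and `text`, and the toy lookup table are hypothetical stand-ins, and any real machine translation system would take their place.

```python
from typing import Callable, Dict, List, Tuple


def build_xict_instances(
    passages: List[Dict[str, str]],
    translate: Callable[[str, str], str],
    target_langs: List[str],
) -> List[Tuple[str, str]]:
    """Create (query, positive passage) pairs from translated titles.

    Each passage title is translated into every target language and used
    as a query whose positive passage is the original passage text.
    """
    instances = []
    for passage in passages:
        for lang in target_langs:
            query = translate(passage["title"], lang)
            instances.append((query, passage["text"]))
    return instances


# Toy translator backed by a lookup table (an assumption for illustration).
TOY_TABLE = {("Eiffel Tower", "fr"): "Tour Eiffel"}


def toy_translate(text: str, lang: str) -> str:
    return TOY_TABLE.get((text, lang), f"[{lang}] {text}")


passages = [{"title": "Eiffel Tower", "text": "The Eiffel Tower is in Paris."}]
pairs = build_xict_instances(passages, toy_translate, ["fr", "de"])
```

Because the pairs come from titles and passages that already exist, no human relevance labels are needed, which is what makes the task self-supervised.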

E-VarM: Enhanced Variational Word Masks to Improve the Interpretability of Text Classification Models
Ling Ge | ChunMing Hu | Guanghui Ma | Junshuang Wu | Junfan Chen | JiHong Liu | Hong Zhang | Wenyi Qin | Richong Zhang

Enhancing the interpretability of text classification models can help increase the reliability of these models in real-world applications. Currently, most researchers focus on extracting task-specific words from inputs to improve the interpretability of the model. The competitive approaches exploit the Variational Information Bottleneck (VIB) to improve the performance of word masking at the word embedding layer to obtain task-specific words. However, these approaches ignore the multi-level semantics of the text, which can impair the interpretability of the model, and do not consider the risk of representation overlap caused by the VIB, which can impair the classification performance. In this paper, we propose an enhanced variational word masks approach, named E-VarM, to solve these two issues effectively. The E-VarM combines multi-level semantics from all hidden layers of the model to mask out task-irrelevant words and uses contrastive learning to readjust the distances between representations. Empirical studies on ten benchmark text classification datasets demonstrate that our approach outperforms the SOTA methods in simultaneously improving the interpretability and accuracy of the model.

Attribute Injection for Pretrained Language Models: A New Benchmark and an Efficient Method
Reinald Kim Amplayo | Kang Min Yoo | Sang-Woo Lee

Metadata attributes (e.g., user and product IDs from reviews) can be incorporated as additional inputs to neural-based NLP models, by expanding the architecture of the models to improve performance. However, recent models rely on pretrained language models (PLMs), in which previously used techniques for attribute injection are either nontrivial or cost-ineffective. In this paper, we introduce a benchmark for evaluating attribute injection models, which comprises eight datasets across a diverse range of tasks and domains and six synthetically sparsified ones. We also propose a lightweight and memory-efficient method to inject attributes into PLMs. We extend adapters, i.e. tiny plug-in feed-forward modules, to include attributes both independently of or jointly with the text. We use approximation techniques to parameterize the model efficiently for domains with large attribute vocabularies, and training mechanisms to handle multi-labeled and sparse attributes. Extensive experiments and analyses show that our method outperforms previous attribute injection methods and achieves state-of-the-art performance on all datasets.

Towards Robust Neural Retrieval with Source Domain Synthetic Pre-Finetuning
Revanth Gangi Reddy | Vikas Yadav | Md Arafat Sultan | Martin Franz | Vittorio Castelli | Heng Ji | Avirup Sil

Research on neural IR has so far been focused primarily on standard supervised learning settings, where it outperforms traditional term matching baselines. Many practical use cases of such models, however, may involve previously unseen target domains. In this paper, we propose to improve the out-of-domain generalization of Dense Passage Retrieval (DPR) - a popular choice for neural IR - through synthetic data augmentation only in the source domain. We empirically show that pre-finetuning DPR with additional synthetic data in its source domain (Wikipedia), which we generate using a fine-tuned sequence-to-sequence generator, can be a low-cost yet effective first step towards its generalization. Across five different test sets, our augmented model shows more robust performance than DPR in both in-domain and zero-shot out-of-domain evaluation.

Parameter-Efficient Neural Reranking for Cross-Lingual and Multilingual Retrieval
Robert Litschko | Ivan Vulić | Goran Glavaš

State-of-the-art neural (re)rankers are notoriously data-hungry which – given the lack of large-scale training data in languages other than English – makes them rarely used in multilingual and cross-lingual retrieval settings. Current approaches therefore commonly transfer rankers trained on English data to other languages and cross-lingual setups by means of multilingual encoders: they fine-tune all parameters of pretrained massively multilingual Transformers (MMTs, e.g., multilingual BERT) on English relevance judgments, and then deploy them in the target language(s). In this work, we show that two parameter-efficient approaches to cross-lingual transfer, namely Sparse Fine-Tuning Masks (SFTMs) and Adapters, allow for a more lightweight and more effective zero-shot transfer to multilingual and cross-lingual retrieval tasks. We first train language adapters (or SFTMs) via Masked Language Modelling and then train retrieval (i.e., reranking) adapters (SFTMs) on top, while keeping all other parameters fixed. At inference, this modular design allows us to compose the ranker by applying the (re)ranking adapter (or SFTM) trained with source language data together with the language adapter (or SFTM) of a target language. We carry out a large scale evaluation on the CLEF-2003 and HC4 benchmarks and additionally, as another contribution, extend the former with queries in three new languages: Kyrgyz, Uyghur and Turkish. The proposed parameter-efficient methods outperform standard zero-shot transfer with full MMT fine-tuning, while being more modular and reducing training times. The gains are particularly pronounced for low-resource languages, where our approaches also substantially outperform the competitive machine translation-based rankers.

LIME: Weakly-Supervised Text Classification without Seeds
Seongmin Park | Jihwa Lee

In weakly-supervised text classification, only label names act as sources of supervision. Predominant approaches to weakly-supervised text classification utilize a two-phase framework, where test samples are first assigned pseudo-labels and are then used to train a neural text classifier. In most previous work, the pseudo-labeling step is dependent on obtaining seed words that best capture the relevance of each class label. We present LIME, a framework for weakly-supervised text classification that entirely replaces the brittle seed-word generation process with entailment-based pseudo-classification. We find that combining weakly-supervised classification and textual entailment mitigates shortcomings of both, resulting in a more streamlined and effective classification pipeline. With just an off-the-shelf textual entailment model, LIME outperforms recent baselines in weakly-supervised text classification and achieves state-of-the-art in 4 benchmarks.

Multi-Stage Framework with Refinement Based Point Set Registration for Unsupervised Bi-Lingual Word Alignment
Silviu Vlad Oprea | Sourav Dutta | Haytham Assem

Cross-lingual alignment of word embeddings is important in knowledge transfer across languages, for improving machine translation and other multi-lingual applications. Current unsupervised approaches, which rely on learning structure-preserving transformations using adversarial networks and refinement strategies, suffer from instability and convergence issues. This paper proposes BioSpere, a novel multi-stage framework for unsupervised mapping of bi-lingual word embeddings onto a shared vector space, by combining adversarial initialization, a refinement procedure and point set registration. Experiments on parallel dictionary induction and word similarity demonstrate state-of-the-art unsupervised results for BioSpere on diverse languages, showcasing robustness against variable adversarial performance.

EM-PERSONA: EMotion-assisted Deep Neural Framework for PERSONAlity Subtyping from Suicide Notes
Soumitra Ghosh | Dhirendra Kumar Maurya | Asif Ekbal | Pushpak Bhattacharyya

The World Health Organization has emphasised the need to step up suicide prevention efforts to meet the United Nations’ Sustainable Development Goal target of 2030 (Goal 3: Good health and well-being). We address the challenging task of personality subtyping from suicide notes. Most research on personality subtyping has relied on statistical analysis and feature engineering. Moreover, state-of-the-art transformer models have received relatively little attention in the automated personality subtyping problem. We develop a novel EMotion-assisted PERSONAlity Detection Framework (EM-PERSONA). We annotate the benchmark CEASE-v2.0 suicide notes dataset with personality traits across four dichotomies: Introversion (I)-Extraversion (E), Intuition (N)-Sensing (S), Thinking (T)-Feeling (F), Judging (J)-Perceiving (P). Our proposed method outperforms all baselines on comprehensive evaluation using multiple state-of-the-art systems. Across the four dichotomies, EM-PERSONA improved accuracy by 2.04%, 3.69%, 4.52%, and 3.42%, respectively, over the highest-performing single-task systems.

Dense Template Retrieval for Customer Support
Tiago Mesquita | Bruno Martins | Mariana Almeida

Templated answers are used extensively in customer support scenarios, providing an efficient way to cover a plethora of topics, with an easily maintainable collection of templates. However, the number of templates is often too high for an agent to manually search. Automatically suggesting the correct template for a given question can thus improve the service efficiency, reducing costs and leading to a better customer satisfaction. In this work, we propose a dense retrieval framework for the customer support scenario, adapting a standard in-batch negatives technique to support unpaired sampling of queries and templates. We also propose a novel loss that extends the typical query-centric similarity, exploiting other similarity relations in the training data. Experiments show that our approach achieves considerable improvements, in terms of performance and training speed, over more standard dense retrieval methods. This includes methods such as DPR, and also ablated versions of the proposed approach.
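
For readers unfamiliar with in-batch negatives, the sketch below shows the standard paired form of the loss that the abstract takes as its starting point. It is not the unpaired variant or the novel loss proposed in the paper; the function names and the toy vectors are assumptions for illustration.

```python
import math
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def in_batch_negatives_loss(
    queries: List[Sequence[float]], templates: List[Sequence[float]]
) -> float:
    """Mean softmax cross-entropy over a batch of (query, template) pairs.

    Template i is the positive for query i; every other template in the
    batch serves as a negative for that query.
    """
    if len(queries) != len(templates) or not queries:
        raise ValueError("need a non-empty batch of aligned pairs")
    total = 0.0
    for i, query in enumerate(queries):
        scores = [dot(query, template) for template in templates]
        top = max(scores)  # subtract the max for numerical stability
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_norm - scores[i]
    return total / len(queries)


queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = in_batch_negatives_loss(queries, aligned)
loss_swapped = in_batch_negatives_loss(queries, swapped)
```

In this paired form each query needs exactly one template at the same batch index, which is the restriction the abstract says it relaxes to allow unpaired sampling.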

Exploring Label Hierarchy in a Generative Way for Hierarchical Text Classification
Wei Huang | Chen Liu | Bo Xiao | Yihua Zhao | Zhaoming Pan | Zhimin Zhang | Xinyun Yang | Guiquan Liu

Hierarchical Text Classification (HTC), which aims to predict text labels organized in hierarchical space, is a significant task lacking in investigation in natural language processing. Existing methods usually encode the entire hierarchical structure and fail to construct a robust label-dependent model, making it hard to make accurate predictions on sparse lower-level labels and achieving low Macro-F1. In this paper, we explore the level dependency and path dependency of the label hierarchy in a generative way for building the knowledge of upper-level labels of current path into lower-level ones, and thus propose a novel PAAM-HiA-T5 model for HTC: a hierarchy-aware T5 model with path-adaptive attention mechanism. Specifically, we generate a multi-level sequential label structure to exploit hierarchical dependency across different levels with Breadth-First Search (BFS) and T5 model. To further improve label dependency prediction within each path, we then propose an original path-adaptive attention mechanism (PAAM) to lead the model to adaptively focus on the path where the currently generated label is located, shielding the noise from other paths. Comprehensive experiments on three benchmark datasets show that PAAM-HiA-T5 greatly outperforms all state-of-the-art HTC approaches especially in Macro-F1.
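
One plausible reading of the multi-level sequential label structure is a breadth-first traversal that groups a document's gold labels by depth. The sketch below follows that reading; the toy hierarchy, the function name, and the `" / "` level separator are assumptions, not the exact target format used with T5 in the paper.

```python
from collections import deque
from typing import Dict, List, Set


def bfs_label_levels(
    children: Dict[str, List[str]], root: str, gold: Set[str]
) -> List[List[str]]:
    """Group the gold labels of one document by depth, in BFS order.

    Only gold labels are expanded, so the traversal follows the label
    paths that actually apply to the document.
    """
    levels: List[List[str]] = []
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        for child in children.get(node, []):
            if child in gold:
                if len(levels) <= depth:
                    levels.append([])
                levels[depth].append(child)
                queue.append((child, depth + 1))
    return levels


hierarchy = {
    "root": ["Science", "Sports"],
    "Science": ["Physics", "Biology"],
    "Sports": ["Tennis"],
}
gold_labels = {"Science", "Physics", "Sports", "Tennis"}
levels = bfs_label_levels(hierarchy, "root", gold_labels)
# Flatten into one target string, upper levels first.
target_sequence = " / ".join(" ".join(level) for level in levels)
```

Emitting upper levels first means a generative model sees the parent labels before it has to produce the sparser lower-level ones.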

MuSeCLIR: A Multiple Senses and Cross-lingual Information Retrieval Dataset
Wing Yan Li | Julie Weeds | David Weir

This paper addresses a deficiency in existing cross-lingual information retrieval (CLIR) datasets and provides a robust evaluation of CLIR systems’ disambiguation ability. CLIR is commonly tackled by combining translation and traditional IR. Due to translation ambiguity, the problem of ambiguity is worse in CLIR than in monolingual IR. But existing auto-generated CLIR datasets are dominated by searches for named entity mentions, which does not provide a good measure for disambiguation performance, as named entity mentions can often be transliterated across languages and tend not to have multiple translations. Therefore, we introduce a new evaluation dataset (MuSeCLIR) to address this inadequacy. The dataset focusses on polysemous common nouns with multiple possible translations. MuSeCLIR is constructed from multilingual Wikipedia and supports searches on documents written in European (French, German, Italian) and Asian (Chinese, Japanese) languages. We provide baseline statistical and neural model results on MuSeCLIR which show that MuSeCLIR has a higher requirement on the ability of systems to disambiguate query terms.

Complicate Then Simplify: A Novel Way to Explore Pre-trained Models for Text Classification
Xu Zhang | Zejie Liu | Yanzheng Xiang | Deyu Zhou

With the development of pre-trained models (PTMs), the performance of text classification has been continuously improved by directly employing the features generated by PTMs. However, this approach might not fully explore the knowledge in PTMs, as it is constrained by the difficulty of the task: compared to a difficult task, learning algorithms tend to saturate early on a simple task. Moreover, the native sentence representations derived from BERT are prone to collapse, and directly employing such representations for text classification might fail to fully capture discriminative features. To address these issues, we propose a novel framework for text classification which implements a two-stage training strategy. In the pre-training stage, auxiliary labels are introduced to increase the task difficulty and to fully exploit the knowledge in the pre-trained model. In the fine-tuning stage, the textual representation learned in the pre-training stage is employed and the classifier is fine-tuned to obtain better classification performance. Experiments were conducted on six text classification corpora, and the results show that the proposed framework outperforms several state-of-the-art baselines.

Adaptive Feature Discrimination and Denoising for Asymmetric Text Matching
Yan Li | Chenliang Li | Junjun Guo

Asymmetric text matching has become increasingly indispensable for many downstream tasks (e.g., IR and NLP). Here, asymmetry means that the documents involved in matching hold different amounts of information, e.g., a short query against a relatively longer document. The existing solutions mainly focus on modeling the feature interactions between asymmetric texts, but rarely go one step further to recognize discriminative features and perform feature denoising to enhance relevance learning. In this paper, we propose a novel adaptive feature discrimination and denoising model for asymmetric text matching, called ADDAX. For each asymmetric text pair, ADDAX is devised to explicitly distinguish discriminative features and filter out irrelevant features in a context-aware fashion. Concretely, a matching-adapted gating siamese cell (MAGS) is first devised to identify discriminative features and produce the corresponding hybrid representations for a text pair. Afterwards, we introduce a locality-constrained hashing denoiser to perform feature-level denoising by learning discriminative low-dimensional binary codes for the redundantly longer text. Extensive experiments on four real-world datasets from different downstream tasks demonstrate that the proposed ADDAX obtains substantial performance gain over 36 up-to-date state-of-the-art alternatives.

Rethinking Data Augmentation in Text-to-text Paradigm
Yanan Chen | Yang Liu

As manually labelling data can be costly, some recent studies augment the training data to improve the generalization power of machine learning models, an approach known as data augmentation (DA). With the rise of pre-trained language models (PLMs), some recent works on DA try to synthesize new samples by benefiting from the knowledge learned during PLM pre-training. Along the same direction, in this paper we propose to integrate text-to-text language models and construct a new two-phase framework for augmentation: 1) a fine-tuning phase where PLMs are well adapted to downstream classification with the help of two novel schemes, and 2) a generation phase where the fine-tuned models are leveraged to create new samples for performance lifting. This paradigm opens up a new way of designing fine-tuning schemes to better serve DA in an easy-to-implement manner, and can be easily extended to other desired tasks. We evaluate our proposal on two public classification datasets and demonstrate its effectiveness with remarkable gains.

ConTextING: Granting Document-Wise Contextual Embeddings to Graph Neural Networks for Inductive Text Classification
Yen-Hao Huang | Yi-Hsin Chen | Yi-Shin Chen

Graph neural networks (GNNs) have recently been applied in natural language processing. Various GNN studies have been proposed to learn node interactions within the local graph of each document, which contains words, sentences, or topics, for inductive text classification. However, most inductive GNNs that are built on a word graph generally take global word embeddings as node features, without referring to document-wise contextual information. Consequently, we find that BERT models can perform better than inductive GNNs. An intuitive follow-up approach is to enrich GNNs with contextual embeddings from BERT, yet there is a lack of related research. In this work, we propose a simple yet effective unified model, coined ConTextING, with a joint training mechanism to learn from both document embeddings and contextual word interactions simultaneously. Our experiments show that ConTextING outperforms pure inductive GNNs and BERT-style models. The analyses also highlight the benefits of the sub-word graph and joint training with separated classifiers.

Virtual Knowledge Graph Construction for Zero-Shot Domain-Specific Document Retrieval
Yeon Seonwoo | Seunghyun Yoon | Franck Dernoncourt | Trung Bui | Alice Oh

Domain-specific documents cover terminologies and specialized knowledge. This has been the main challenge of domain-specific document retrieval systems. Previous approaches propose domain-adaptation and transfer learning methods to alleviate this problem. However, these approaches still follow the same document representation method as earlier work: a document is embedded into a single vector. In this study, we propose VKGDR. VKGDR represents a given corpus as a graph of entities and their relations (known as a virtual knowledge graph) and computes the relevance between queries and documents based on the graph representation. We conduct three experiments: 1) domain-specific document retrieval, 2) comparison of our virtual knowledge graph construction method with previous approaches, and 3) an ablation study on each component of our virtual knowledge graph. From the results, we see that unsupervised VKGDR outperforms baselines in a zero-shot setting and even outperforms a fully-supervised bi-encoder. We also verify that our virtual knowledge graph construction method results in better retrieval performance than previous approaches.

MICO: Selective Search with Mutual Information Co-training
Zhanyu Wang | Xiao Zhang | Hyokun Yun | Choon Hui Teo | Trishul Chilimbi

In contrast to traditional exhaustive search, in which a query is run against all documents, selective search first clusters documents into several groups so that the search for a query can be limited to one group or only a few groups. Selective search is designed to reduce the latency and computation in modern large-scale search systems. In this study, we propose MICO, a Mutual Information CO-training framework for selective search with minimal supervision using the search logs. After training, MICO not only clusters the documents, but also routes unseen queries to the relevant clusters for efficient retrieval. In our empirical experiments, MICO significantly improves the performance on multiple metrics of selective search and outperforms a number of existing competitive baselines.
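
The routing step of selective search can be sketched as below: a query is sent to the most promising clusters and only their documents are scored. Nearest-centroid routing is used here purely as a stand-in; MICO learns its clustering and routing through mutual information co-training, which this sketch does not implement. All names and vectors are hypothetical.

```python
from typing import List, Sequence, Tuple

Doc = Tuple[str, Sequence[float]]  # (document id, document vector)


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def selective_search(
    query: Sequence[float],
    centroids: List[Sequence[float]],
    clusters: List[List[Doc]],
    top_clusters: int = 1,
) -> List[str]:
    """Score only the documents in the clusters closest to the query.

    Clusters are ranked by the dot product between the query and each
    centroid; documents outside the selected clusters are never scored.
    """
    ranked = sorted(
        range(len(centroids)),
        key=lambda i: dot(query, centroids[i]),
        reverse=True,
    )
    candidates = [doc for i in ranked[:top_clusters] for doc in clusters[i]]
    candidates.sort(key=lambda doc: dot(query, doc[1]), reverse=True)
    return [doc_id for doc_id, _ in candidates]


centroids = [[1.0, 0.0], [0.0, 1.0]]
clusters = [
    [("a", [0.9, 0.1]), ("b", [0.8, 0.3])],
    [("c", [0.1, 0.9])],
]
hits = selective_search([1.0, 0.0], centroids, clusters, top_clusters=1)
```

Raising `top_clusters` trades latency for recall: more clusters are searched, so fewer relevant documents are missed.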

DPTDR: Deep Prompt Tuning for Dense Passage Retrieval
Zhengyang Tang | Benyou Wang | Ting Yao

Deep prompt tuning (DPT) has gained great success in most natural language processing (NLP) tasks. However, it is not well-investigated in dense retrieval where fine-tuning (FT) still dominates. When deploying multiple retrieval tasks using the same backbone model (e.g., RoBERTa), FT-based methods are unfriendly in terms of deployment cost: each new retrieval model needs to repeatedly deploy the backbone model without reuse. To reduce the deployment cost in such a scenario, this work investigates applying DPT in dense retrieval. The challenge is that directly applying DPT in dense retrieval largely underperforms FT methods. To compensate for the performance drop, we propose two model-agnostic and task-agnostic strategies for DPT-based retrievers, namely retrieval-oriented intermediate pretraining and unified negative mining, as a general approach that could be compatible with any pre-trained language model and retrieval task. The experimental results show that the proposed method (called DPTDR) outperforms previous state-of-the-art models on both MS-MARCO and Natural Questions. We also conduct ablation studies to examine the effectiveness of each strategy in DPTDR. We believe this work facilitates the industry, as it saves enormous efforts and costs of deployment and increases the utility of computing resources. Our code is available at

BERT-Flow-VAE: A Weakly-supervised Model for Multi-Label Text Classification
Ziwen Liu | Josep Grau-Bove | Scott Allan Orr

Multi-label Text Classification (MLTC) is the task of categorizing documents into one or more topics. Considering the large volumes of data and varying domains of such tasks, fully supervised learning requires fully manually annotated datasets, which are costly and time-consuming to produce. In this paper, we propose BERT-Flow-VAE (BFV), a Weakly-Supervised Multi-Label Text Classification (WSMLTC) model that reduces the need for full supervision. This new model (1) produces BERT sentence embeddings and calibrates them using a flow model, (2) generates an initial topic-document matrix by averaging the results of a seeded sparse topic model and a textual entailment model, which only require the surface names of topics and 4-6 seed words per topic, and (3) adopts a VAE framework to reconstruct the embeddings under the guidance of the topic-document matrix. Finally, (4) it uses the means produced by the encoder model in the VAE architecture as predictions for MLTC. Experimental results on 6 multi-label datasets show that BFV can substantially outperform other baseline WSMLTC models in key metrics and achieve approximately 84% of the performance of a fully-supervised model.

Welcome to the Modern World of Pronouns: Identity-Inclusive Natural Language Processing beyond Gender
Anne Lauscher | Archie Crowley | Dirk Hovy

The world of pronouns is changing – from a closed word class with few members to an open set of terms to reflect identities. However, Natural Language Processing (NLP) barely reflects this linguistic shift, resulting in the possible exclusion of non-binary users, even though recent work outlined the harms of gender-exclusive language technology. The current modeling of 3rd person pronouns is particularly problematic. It largely ignores various phenomena like neopronouns, i.e., novel pronoun sets that are not (yet) widely established. This omission contributes to the discrimination of marginalized and underrepresented groups, e.g., non-binary individuals. It thus prevents gender equality, one of the UN’s sustainable development goals (goal 5). Further, other identity-expressions beyond gender are ignored by current NLP technology. This paper provides an overview of 3rd person pronoun issues for NLP. Based on our observations and ethical considerations, we define a series of five desiderata for modeling pronouns in language technology, which we validate through a survey. We evaluate existing and novel modeling approaches w.r.t. these desiderata qualitatively and quantify the impact of a more discrimination-free approach on an established benchmark dataset.

Threat Scenarios and Best Practices to Detect Neural Fake News
Artidoro Pagnoni | Martin Graciarena | Yulia Tsvetkov

In this work, we discuss different threat scenarios from neural fake news generated by state-of-the-art language models. Through our experiments, we assess the performance of generated text detection systems under these threat scenarios. For each scenario, we also identify the minimax strategy for the detector that minimizes its worst-case performance. This constitutes a set of best practices that practitioners can rely on. In our analysis, we find that detectors are prone to shortcut learning (lack of out-of-distribution generalization) and discuss approaches to mitigate this problem and improve detectors more broadly. Finally, we argue that strong detectors should be released along with new generators.
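
The minimax selection mentioned in the abstract can be illustrated with a minimal sketch. It assumes that performance is recorded as an error rate per detector and threat scenario, so that the minimax choice is the detector with the smallest worst-case error. The detector names, scenario names, and numbers are hypothetical and are not taken from the paper.

```python
def minimax_detector(error_rates):
    """Pick the detector whose worst-case error across scenarios is smallest.

    error_rates maps detector name -> {scenario name -> error rate}.
    Returns (detector name, its worst-case error).
    """
    # Worst case for each detector = its highest error over all scenarios.
    worst_case = {
        detector: max(per_scenario.values())
        for detector, per_scenario in error_rates.items()
    }
    # Minimax choice = detector with the lowest worst-case error.
    best = min(worst_case, key=worst_case.get)
    return best, worst_case[best]


# Hypothetical numbers, for illustration only.
example_errors = {
    "det_a": {"scenario_1": 0.05, "scenario_2": 0.40},
    "det_b": {"scenario_1": 0.15, "scenario_2": 0.20},
}
choice = minimax_detector(example_errors)
```

In this toy table, `det_a` is better in the first scenario, but `det_b` is selected because its worst case (0.20) is lower than that of `det_a` (0.40).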

From Polarity to Intensity: Mining Morality from Semantic Space
Chunxu Zhao | Pengyuan Liu | Dong Yu

Most works on computational morality focus on moral polarity recognition, i.e., distinguishing right from wrong. However, a discrete polarity label is not informative enough to reflect morality as it does not contain any degree or intensity information. Existing approaches to compute moral intensity are limited to word-level measurement and heavily rely on human labelling. In this paper, we propose MoralScore, a weakly-supervised framework that can automatically measure moral intensity from text. It only needs moral polarity labels, which are more robust and easier to acquire. Besides, the framework can capture latent moral information not only from words but also from sentence-level semantics which can provide a more comprehensive measurement. To evaluate the performance of our method, we introduce a set of evaluation metrics and conduct extensive experiments. Results show that our method achieves good performance on both automatic and human evaluations.

SOS: Systematic Offensive Stereotyping Bias in Word Embeddings
Fatma Elsafoury | Steve R. Wilson | Stamos Katsigiannis | Naeem Ramzan

Systematic Offensive Stereotyping (SOS) in word embeddings could lead to associating marginalised groups with hate speech and profanity, which might lead to blocking and silencing those groups, especially on social media platforms. In this work, we introduce a quantitative measure of the SOS bias, validate it in the most commonly used word embeddings, and investigate if it explains the performance of different word embeddings on the task of hate speech detection. Results show that SOS bias exists in almost all examined word embeddings and that the proposed SOS bias metric correlates positively with the statistics of published surveys on online extremism. We also show that the proposed metric reveals distinct information compared to established social bias metrics. However, we do not find evidence that SOS bias explains the performance of hate speech detection models based on the different word embeddings.

Bigger Data or Fairer Data? Augmenting BERT via Active Sampling for Educational Text Classification
Lele Sha | Yuheng Li | Dragan Gasevic | Guanliang Chen

Pretrained Language Models (PLMs), though popular, have been diagnosed to encode bias against protected groups in the representations they learn, which may harm the prediction fairness of downstream models. Given that such bias is believed to be related to the amount of demographic information carried in the learned representations, this study aimed to quantify the awareness that a PLM (i.e., BERT) has regarding people’s protected attributes and augment BERT to improve prediction fairness of downstream models by inhibiting this awareness. Specifically, we developed a method to dynamically sample data to continue the pretraining of BERT and enable it to generate representations carrying minimal demographic information, which can be directly used as input to downstream models for fairer predictions. By experimenting on the task of classifying educational forum posts and measuring fairness between students of different gender or first-language backgrounds, we showed that, compared to a baseline without any additional pretraining, our method improved not only fairness (with a maximum improvement of 52.33%) but also accuracy (with a maximum improvement of 2.53%). Our method can be generalized to any PLM and demographic attributes. All the codes used in this study can be accessed via

Debiasing Word Embeddings with Nonlinear Geometry
Lu Cheng | Nayoung Kim | Huan Liu

Debiasing word embeddings has been largely limited to individual and independent social categories. However, real-world corpora typically present multiple social categories that possibly correlate or intersect with each other. For instance, “hair weaves” is stereotypically associated with African American females, but neither African American nor females alone. Therefore, this work studies biases associated with multiple social categories: joint biases induced by the union of different categories and intersectional biases that do not overlap with the biases of the constituent categories. We first empirically observe that individual biases intersect non-trivially (i.e., over a one-dimensional subspace). Drawing from the intersectional theory in social science and the linguistic theory, we then construct an intersectional subspace to debias for multiple social categories using the nonlinear geometry of individual biases. Empirical evaluations corroborate the efficacy of our approach.

Debiasing Isn’t Enough! – on the Effectiveness of Debiasing MLMs and Their Social Biases in Downstream Tasks
Masahiro Kaneko | Danushka Bollegala | Naoaki Okazaki

We study the relationship between task-agnostic intrinsic and task-specific extrinsic social bias evaluation measures for MLMs, and find that there exists only a weak correlation between these two types of evaluation measures. Moreover, we find that MLMs debiased using different methods still re-learn social biases during fine-tuning on downstream tasks. We identify the social biases in both training instances as well as their assigned labels as reasons for the discrepancy between intrinsic and extrinsic bias evaluation measurements. Overall, our findings highlight the limitations of existing MLM bias evaluation measures and raise concerns on the deployment of MLMs in downstream applications using those measures.

Quantifying Bias from Decoding Techniques in Natural Language Generation
Mayukh Das | Wolf Tilo Balke

Natural language generation (NLG) models can propagate social bias towards particular demographics. Though several studies have investigated bias from data and models, the NLG task distinctively uses a stochastic decoder that can positively or negatively impact the bias-sensitive tokens initially predicted by the model. To address this gap in research, we present an extensive analysis of bias from decoding techniques for open-domain language generation considering the entire decoding space. We analyze to what extent bias metrics like toxicity and sentiment are impacted by the individual components of decoder algorithms. To this end, we also analyze the trade-off between bias scores and human-annotated generation quality throughout the decoder space. Together, these methods reveal the imperative of testing inference time bias and provide evidence on the usefulness of inspecting the entire decoding spectrum.

A Study of Implicit Bias in Pretrained Language Models against People with Disabilities
Pranav Narayanan Venkit | Mukund Srinath | Shomir Wilson

Pretrained language models (PLMs) have been shown to exhibit sociodemographic biases, such as against gender and race, raising concerns of downstream biases in language technologies. However, PLMs’ biases against people with disabilities (PWDs) have received little attention, in spite of their potential to cause similar harms. Using perturbation sensitivity analysis, we test an assortment of popular word embedding-based and transformer-based PLMs and show significant biases against PWDs in all of them. The results demonstrate how models trained on large corpora widely favor ableist language.

Social Norms-Grounded Machine Ethics in Complex Narrative Situation
Tao Shen | Xiubo Geng | Daxin Jiang

Ethical judgment aims to determine if a person in a narrative situation acts in accordance with the social norms of a culture, so it is crucial to understand actions in narratives and achieve machine ethics. Recent works depend on data-driven methods to directly judge the ethics of complex real-world narratives but face two major challenges. First, they cannot handle dilemma situations well due to a lack of basic knowledge about social norms. Second, they focus merely on sparse situation-level judgment regardless of the social norms involved during the judgment, leading to a black box. In this work, inspired by previous knowledge-grounded and -augmented paradigms, we propose to complement a complex situation with grounded social norms. Besides a norm-grounding knowledge model, we present a novel norm-supported ethical judgment model in line with neural module networks to alleviate dilemma situations and improve norm-level explainability. Empirically, our model improves state-of-the-art performance on two narrative judgment benchmarks.

Bias at a Second Glance: A Deep Dive into Bias for German Educational Peer-Review Data Modeling
Thiemo Wambsganss | Vinitra Swamy | Roman Rietsche | Tanja Käser

Natural Language Processing (NLP) has become increasingly utilized to provide adaptivity in educational applications. However, recent research has highlighted a variety of biases in pre-trained language models. While existing studies investigate bias in different domains, they are limited in addressing fine-grained analysis on educational corpora and text that is not English. In this work, we analyze bias across text and through multiple architectures on a corpus of 9,165 German peer-reviews collected from university students over five years. Notably, our corpus includes labels such as helpfulness, quality, and critical aspect ratings from the peer-review recipient as well as demographic attributes. We conduct a Word Embedding Association Test (WEAT) analysis on (1) our collected corpus in connection with the clustered labels, (2) the most common pre-trained German language models (T5, BERT, and GPT-2) and GloVe embeddings, and (3) the language models after fine-tuning on our collected data-set. In contrast to our initial expectations, we found that our collected corpus does not reveal many biases in the co-occurrence analysis or in the GloVe embeddings. However, the pre-trained German language models find substantial conceptual, racial, and gender bias and have significant changes in bias across conceptual and racial axes during fine-tuning on the peer-review data. With our research, we aim to contribute to the fourth UN sustainability goal (quality education) with a novel dataset, an understanding of biases in natural language education data, and the potential harms of not counteracting biases in language models for educational tasks.
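
The Word Embedding Association Test (WEAT) named in the abstract can be summarized with a minimal sketch of its effect-size statistic. The sketch assumes embeddings are given as plain lists of floats and uses the sample standard deviation in the denominator, which is one common convention. The toy vectors are hypothetical and unrelated to the paper's data.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def association(w, attrs_a, attrs_b):
    """s(w, A, B): mean similarity of w to set A minus mean similarity to set B."""
    mean_a = sum(cosine(w, a) for a in attrs_a) / len(attrs_a)
    mean_b = sum(cosine(w, b) for b in attrs_b) / len(attrs_b)
    return mean_a - mean_b


def weat_effect_size(targets_x, targets_y, attrs_a, attrs_b):
    """Difference of mean associations of X and Y, divided by the
    standard deviation of associations over all target words."""
    s_x = [association(x, attrs_a, attrs_b) for x in targets_x]
    s_y = [association(y, attrs_a, attrs_b) for y in targets_y]
    s_all = s_x + s_y
    mean_all = sum(s_all) / len(s_all)
    # Sample standard deviation (n - 1); an assumption of this sketch.
    std = math.sqrt(sum((s - mean_all) ** 2 for s in s_all) / (len(s_all) - 1))
    return (sum(s_x) / len(s_x) - sum(s_y) / len(s_y)) / std


# Hypothetical 2-d "embeddings": X leans towards A, Y leans towards B.
X = [[1.0, 0.0], [0.9, 0.1]]
Y = [[0.0, 1.0], [0.1, 0.9]]
A = [[1.0, 0.0]]
B = [[0.0, 1.0]]
effect = weat_effect_size(X, Y, A, B)
```

A positive effect size indicates that the target set X is more strongly associated with attribute set A than Y is. Swapping X and Y flips the sign.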

Dynamic Relevance Graph Network for Knowledge-Aware Question Answering
Chen Zheng | Parisa Kordjamshidi

This work investigates the challenge of learning and reasoning for Commonsense Question Answering given an external source of knowledge in the form of a knowledge graph (KG). We propose a novel graph neural network architecture, called Dynamic Relevance Graph Network (DRGN). DRGN operates on a given KG subgraph based on the question and answer entities and uses the relevance scores between the nodes to establish new edges dynamically for learning node representations in the graph network. This explicit usage of relevance as graph edges has the following advantages: a) the model can exploit the existing relationships, re-scale the node weights, and influence the way the neighborhood nodes’ representations are aggregated in the KG subgraph, and b) it potentially recovers the missing edges in the KG that are needed for reasoning. Moreover, as a byproduct, our model improves the handling of negative questions by considering the relevance between the question node and the graph entities. Our proposed approach shows competitive performance on two QA benchmarks, CommonsenseQA and OpenbookQA, compared to the state-of-the-art published results.

SISER: Semantic-Infused Selective Graph Reasoning for Fact Verification
Eunhwan Park | Jong-Hyeon Lee | DongHyeon Jeon | Seonhoon Kim | Inho Kang | Seung-Hoon Na

This study proposes Semantic-Infused SElective Graph Reasoning (SISER) for fact verification, which newly presents semantic-level graph reasoning and injects its reasoning-enhanced representation into other types of graph-based and sequence-based reasoning methods. SISER combines three reasoning types: 1) semantic-level graph reasoning, which uses a semantic graph from evidence sentences, whose nodes are elements of a triple – <Subject, Verb, Object>, 2) “semantic-infused” sentence-level “selective” graph reasoning, which combines semantic-level and sentence-level representations and performs graph reasoning in a selective manner using the node selection mechanism, and 3) sequence reasoning, which concatenates all evidence sentences and performs attention-based reasoning. Experiment results on a large-scale dataset for Fact Extraction and VERification (FEVER) show that SISER outperforms the previous graph-based approaches and achieves state-of-the-art performance.

Answering Numerical Reasoning Questions in Table-Text Hybrid Contents with Graph-based Encoder and Tree-based Decoder
Fangyu Lei | Shizhu He | Xiang Li | Jun Zhao | Kang Liu

Perform like an Engine: A Closed-Loop Neural-Symbolic Learning Framework for Knowledge Graph Inference
Guanglin Niu | Bo Li | Yongfei Zhang | Shiliang Pu

Knowledge graph (KG) inference aims to address the natural incompleteness of KGs, including rule learning-based and KG embedding (KGE) models. However, the rule learning-based models suffer from low efficiency and generalization while KGE models lack interpretability. To address these challenges, we propose a novel and effective closed-loop neural-symbolic learning framework EngineKG via incorporating our developed KGE and rule learning modules. KGE module exploits symbolic rules and paths to enhance the semantic association between entities and relations for improving KG embeddings and interpretability. A novel rule pruning mechanism is proposed in the rule learning module by leveraging paths as initial candidate rules and employing KG embeddings together with concepts for extracting more high-quality rules. Experimental results on four real-world datasets show that our model outperforms the relevant baselines on link prediction tasks, demonstrating the superiority of our KG inference model in a neural-symbolic learning fashion. The source code and datasets of this paper are available at

Table-based Fact Verification with Self-labeled Keypoint Alignment
Guangzhen Zhao | Peng Yang

Table-based fact verification aims to verify whether a statement sentence is trusted or fake. Most existing methods rely on graph features or data augmentation but fail to investigate evidence correlation between the statement and table effectively. In this paper, we propose a self-Labeled Keypoint Alignment model, named LKA, to explore the correlation between the two. Specifically, a dual-view alignment module based on the statement and table views is designed to discriminate the salient words through multiple interactions, where one regular and one adversarial alignment network cooperatively characterize the alignment discrepancy. Considering the interaction characteristic inherent in the alignment module, we introduce a novel mixture-of-experts block to elaborately integrate the interacted information for supporting the alignment and final classification. Furthermore, a contrastive learning loss is utilized to learn the precise representation of the structure-involved words, encouraging the words closer to words with the same table attribute and farther from the words with the unrelated attribute. Experimental results on three widely-studied datasets show that our model can outperform the state-of-the-art baselines and capture interpretable evidence words.

IMCI: Integrate Multi-view Contextual Information for Fact Extraction and Verification
Hao Wang | Yangguang Li | Zhen Huang | Yong Dou

With the rapid development of automatic fake news detection technology, fact extraction and verification (FEVER) has been attracting more attention. The task aims to extract the most related fact evidences from millions of open-domain Wikipedia documents and then verify the credibility of corresponding claims. Although several strong models have been proposed for the task and they have made great progress, we argue that they fail to utilize multi-view contextual information and thus cannot obtain better performance. In this paper, we propose to integrate multi-view contextual information (IMCI) for fact extraction and verification. For each evidence sentence, we define two kinds of context, i.e. intra-document context and inter-document context. Intra-document context consists of the document title and all the other sentences from the same document. Inter-document context consists of all other evidences which may come from different documents. Then we integrate the multi-view contextual information to encode the evidence sentences to handle the task. Our experimental results on the FEVER 1.0 shared task show that our IMCI framework makes great progress on both fact extraction and verification, and achieves state-of-the-art performance with a winning FEVER score of 73.96% and label accuracy of 77.25% on the online blind test set. We also conduct an ablation study to detect the impact of multi-view contextual information.
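
The two reported metrics differ in that the FEVER score also checks the retrieved evidence. The following is a minimal sketch of how the two can be computed, not the official shared-task scorer: it assumes each evidence sentence is identified by a (page, sentence index) pair and omits the official cap on the number of predicted evidence sentences. The record layout and the toy data are hypothetical.

```python
NEI = "NOT ENOUGH INFO"


def fever_metrics(predictions, gold):
    """Return (fever_score, label_accuracy) over paired prediction/gold records.

    predictions: list of {"label": str, "evidence": set of (page, sent_id)}
    gold:        list of {"label": str, "evidence_sets": list of sets}
    """
    strict_hits = 0
    label_hits = 0
    for pred, ref in zip(predictions, gold):
        if pred["label"] != ref["label"]:
            continue
        label_hits += 1
        if ref["label"] == NEI:
            # No evidence is required for NOT ENOUGH INFO claims.
            strict_hits += 1
        elif any(gold_set <= pred["evidence"] for gold_set in ref["evidence_sets"]):
            # At least one complete gold evidence set was retrieved.
            strict_hits += 1
    total = len(gold)
    return strict_hits / total, label_hits / total


# Hypothetical toy data: three claims.
gold_records = [
    {"label": "SUPPORTS", "evidence_sets": [{("Page_A", 0)}]},
    {"label": "REFUTES", "evidence_sets": [{("Page_B", 2), ("Page_C", 1)}]},
    {"label": NEI, "evidence_sets": []},
]
predicted_records = [
    {"label": "SUPPORTS", "evidence": {("Page_A", 0), ("Page_A", 3)}},
    {"label": "REFUTES", "evidence": {("Page_B", 2)}},  # incomplete evidence
    {"label": NEI, "evidence": set()},
]
fever, accuracy = fever_metrics(predicted_records, gold_records)
```

In the toy data all three labels are correct, but the second claim lacks part of its gold evidence set, so the FEVER score (2/3) is lower than the label accuracy (1.0).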

Prompt Combines Paraphrase: Teaching Pre-trained Models to Understand Rare Biomedical Words
Haochun Wang | Chi Liu | Nuwa Xi | Sendong Zhao | Meizhi Ju | Shiwei Zhang | Ziheng Zhang | Yefeng Zheng | Bing Qin | Ting Liu

Prompt-based fine-tuning for pre-trained models has proven effective for many natural language processing tasks under few-shot settings in general domain. However, tuning with prompt in biomedical domain has not been investigated thoroughly. Biomedical words are often rare in general domain, but quite ubiquitous in biomedical contexts, which dramatically deteriorates the performance of pre-trained models on downstream biomedical applications even after fine-tuning, especially in low-resource scenarios. We propose a simple yet effective approach to helping models learn rare biomedical words during tuning with prompt. Experimental results show that our method can achieve up to 6% improvement in biomedical natural language inference task without any extra parameters or training steps using few-shot vanilla prompt settings.

Self-Supervised Intermediate Fine-Tuning of Biomedical Language Models for Interpreting Patient Case Descriptions
Israa Alghanmi | Luis Espinosa-Anke | Steven Schockaert

Interpreting patient case descriptions has emerged as a challenging problem for biomedical NLP, where the aim is typically to predict diagnoses, to recommend treatments, or to answer questions about cases more generally. Previous work has found that biomedical language models often lack the knowledge that is needed for such tasks. In this paper, we aim to improve their performance through a self-supervised intermediate fine-tuning strategy based on PubMed abstracts. Our solution builds on the observation that many of these abstracts are case reports, and thus essentially patient case descriptions. As a general strategy, we propose to fine-tune biomedical language models on the task of predicting masked medical concepts from such abstracts. We find that the success of this strategy crucially depends on the selection of the medical concepts to be masked. By ensuring that these concepts are sufficiently salient, we can substantially boost the performance of biomedical language models, achieving state-of-the-art results on two benchmarks.

Evaluating and Mitigating Inherent Linguistic Bias of African American English through Inference
Jamell Dacon | Haochen Liu | Jiliang Tang

Recent studies show that NLP models trained on standard English texts tend to produce biased outcomes against underrepresented English varieties. In this work, we conduct a pioneering study of the English variety use of African American English (AAE) in NLI task. First, we propose CodeSwitch, a greedy unidirectional morphosyntactically-informed rule-based translation method for data augmentation. Next, we use CodeSwitch to present a preliminary study to determine if demographic language features do in fact influence models to produce false predictions. Then, we conduct experiments on two popular datasets and propose two simple, yet effective and generalizable debiasing methods. Our findings show that NLI models (e.g. BERT) trained under our proposed frameworks outperform traditional large language models while maintaining or even improving the prediction performance. In addition, we intend to release CodeSwitch, in hopes of promoting dialectal language diversity in training data to both reduce the discriminatory societal impacts and improve model robustness of downstream NLP tasks.

Can We Guide a Multi-Hop Reasoning Language Model to Incrementally Learn at Each Single-Hop?
Jesus Lovon-Melgarejo | Jose G. Moreno | Romaric Besançon | Olivier Ferret | Lynda Tamine

Despite the success of state-of-the-art pre-trained language models (PLMs) on a series of multi-hop reasoning tasks, they still suffer from their limited abilities to transfer learning from simple to complex tasks and vice-versa. We argue that one step forward to overcome this limitation is to better understand the behavioral trend of PLMs at each hop over the inference chain. Our critical underlying idea is to mimic human-style reasoning: we envision the multi-hop reasoning process as a sequence of explicit single-hop reasoning steps. To endow PLMs with incremental reasoning skills, we propose a set of inference strategies on relevant facts and distractors allowing us to build automatically generated training datasets. Using the SHINRA and ConceptNet resources jointly, we empirically show the effectiveness of our proposal on multiple-choice question answering and reading comprehension, with a relative improvement in terms of accuracy of 68.4% and 16.0% w.r.t. classic PLMs, respectively.

Modeling Hierarchical Reasoning Chains by Linking Discourse Units and Key Phrases for Reading Comprehension
Jialin Chen | Zhuosheng Zhang | Hai Zhao

Machine reading comprehension (MRC) poses new challenges to logical reasoning, which aims to understand the implicit logical relations entailed in the given contexts and perform inference over them. Due to the complexity of logic, logical connections exist at different granularity levels. However, most existing methods of logical reasoning individually focus on either entity-aware or discourse-based information but ignore the hierarchical relations that may even have mutual effects. This paper proposes a holistic graph network (HGN) that deals with context at both discourse-level and word-level as the basis for logical reasoning to provide a more fine-grained relation extraction. Specifically, node-level and type-level relations, which can be interpreted as bridges in the reasoning process, are modeled by a hierarchical interaction mechanism to improve the interpretation of MRC systems. Experimental results on logical reasoning QA datasets (ReClor and LogiQA) and natural language inference datasets (SNLI and ANLI) show the effectiveness and generalization of our method, and in-depth analysis verifies its capability to understand complex logical relations.

Hierarchical Representation-based Dynamic Reasoning Network for Biomedical Question Answering
Jianguo Mao | Jiyuan Zhang | Zengfeng Zeng | Weihua Peng | Wenbin Jiang | Xiangdong Wang | Hong Liu | Yajuan Lyu

Recently, Biomedical Question Answering (BQA) has attracted growing attention due to its application value and technical challenges. Most existing works treat it as a semantic matching task that predicts answers by computing confidence among questions, options and evidence sentences, which is insufficient for scenarios that require complex reasoning based on a deep understanding of biomedical evidences. We propose a novel model termed Hierarchical Representation-based Dynamic Reasoning Network (HDRN) to tackle this problem. It first constructs the hierarchical representations for biomedical evidences to learn semantics within and among evidences. It then performs dynamic reasoning based on the hierarchical representations of evidences to solve complex biomedical problems. Compared with the existing state-of-the-art model, the proposed model improves by more than 4.5%, 3% and 1.3% on three mainstream BQA datasets, PubMedQA, MedQA-USMLE and NLPEC, respectively. The ablation study demonstrates the superiority of each improvement of our model. The code will be released after the paper is published.

ArT: All-round Thinker for Unsupervised Commonsense Question Answering
Jiawei Wang | Hai Zhao

Without labeled question-answer pairs for necessary training, unsupervised commonsense question-answering (QA) appears to be extremely challenging due to its indispensable unique prerequisite on commonsense sources like knowledge bases (KBs), which are usually highly resource consuming in construction. Recently, pre-trained language models (PLMs) have shown effectiveness as an alternative for commonsense clues when they play the role of knowledge generator. However, existing work either relies on large-scale in-domain or out-of-domain labeled data, or fails to generate knowledge of high quality in a general way. Motivated by human thinking experience, we propose an approach of All-round Thinker (ArT) by fully taking association during knowledge generating. In detail, our model first focuses on key parts in the given context, and then generates highly related knowledge on such a basis in an association way like human thinking. Besides, for causal reasoning, a reverse thinking mechanism is especially added to further enhance bidirectional inferring between cause and effect. ArT is totally unsupervised and KBs-free. We evaluate it on three commonsense QA benchmarks: COPA, SocialIQA and SCT. On all scales of PLM backbones, ArT shows its brilliant performance and outperforms previous advanced unsupervised models.

Teaching Neural Module Networks to Do Arithmetic
Jiayi Chen | Xiao-Yu Guo | Yuan-Fang Li | Gholamreza Haffari

Answering complex questions that require multi-step multi-type reasoning over raw text is challenging, especially when conducting numerical reasoning. Neural Module Networks (NMNs) follow the programmer-interpreter framework and design trainable modules to learn different reasoning skills. However, NMNs only have limited reasoning abilities, and lack numerical reasoning capability. We upgrade NMNs by: (a) bridging the gap between its interpreter and the complex questions; (b) introducing addition and subtraction modules that perform numerical reasoning over numbers. On a subset of DROP, experimental results show that our proposed methods enhance NMNs’ numerical reasoning skills with a 17.7% improvement in F1 score and significantly outperform previous state-of-the-art models.

An Augmented Benchmark Dataset for Geometric Question Answering through Dual Parallel Text Encoding
Jie Cao | Jing Xiao

Automatic math problem solving has attracted much attention from NLP researchers recently. However, most works focus on solving Math Word Problems (MWPs). In this paper, we study geometric problem solving based on neural networks. Solving geometric problems requires the integration of text and diagram information as well as the knowledge of the relevant theorems. The lack of high-quality datasets and efficient neural geometric solvers impedes the development of automatic geometric problem solving. Based on GeoQA, we newly annotate 2,518 geometric problems with richer types and greater difficulty to form an augmented benchmark dataset GeoQA+, containing 6,027 problems in the training set and 7,528 in total. We further apply data augmentation to expand the training set to 12,054. Besides, we design a Dual Parallel text Encoder DPE to efficiently encode long and medium-length problem text. The experimental results validate the effectiveness of GeoQA+ and the DPE module, and the accuracy of automatic geometric problem solving is improved to 66.09%.

Competence-based Question Generation
Jingxuan Tu | Kyeongmin Rim | James Pustejovsky

Models of natural language understanding often rely on question answering and logical inference benchmark challenges to evaluate the performance of a system. While informative, such task-oriented evaluations do not assess the broader semantic abilities that humans have as part of their linguistic competence when speaking and interpreting language. We define competence-based (CB) question generation, and focus on queries over lexical semantic knowledge involving implicit argument and subevent structure of verbs. We present a method to generate such questions and a dataset of English cooking recipes we use for implementing the generation method. Our primary experiment shows that even large pretrained language models perform poorly on CB questions until they are provided with additional contextualized semantic information. The data and the source code is available at: https: //

Coalescing Global and Local Information for Procedural Text Understanding
Kaixin Ma | Filip Ilievski | Jonathan Francis | Eric Nyberg | Alessandro Oltramari

Procedural text understanding is a challenging language reasoning task that requires models to track entity states across the development of a narrative. We identify three core aspects required for modeling this task, namely the local and global view of the inputs, as well as the global view of outputs. Prior methods have considered a subset of these aspects, which leads to either low precision or low recall. In this paper, we propose a new model Coalescing Global and Local Information (CGLI), which builds entity- and timestep-aware input representations (local input) considering the whole context (global input), and we jointly model the entity states with a structured prediction objective (global output). Thus, CGLI simultaneously optimizes for both precision and recall. Moreover, we extend CGLI with additional output layers and integrate it into a story reasoning framework. Extensive experiments on a popular procedural text understanding dataset show that our model achieves state-of-the-art results, while experiments on a story reasoning benchmark show the positive impact of our model on downstream reasoning.

Original Content Is All You Need! an Empirical Study on Leveraging Answer Summary for WikiHowQA Answer Selection Task
Liang Wen | Juan Li | Houfeng Wang | Yingwei Luo | Xiaolin Wang | Xiaodong Zhang | Zhicong Cheng | Dawei Yin

The answer selection task requires finding appropriate answers to questions from informative but crowdsourced candidates. A key factor impeding its solution by current answer selection approaches is the redundancy and lengthiness of crowdsourced answers. Recently, Deng et al. (2020) constructed a new dataset, WikiHowQA, which contains a corresponding reference summary for each original lengthy answer. Their experiments show that leveraging the answer summaries helps to attend to the essential information in original lengthy answers and improve the answer selection performance under certain circumstances. However, when given a question and a set of long candidate answers, human beings could effortlessly identify the correct answer without the aid of additional answer summaries since the original answers contain all the information volume that answer summaries contain. In addition, pretrained language models have been shown superior or comparable to human beings on many natural language processing tasks. Motivated by those, we design a series of neural models, either pretraining-based or non-pretraining-based, to check whether the additional answer summaries are helpful for ranking the relevancy degrees of question-answer pairs on the WikiHowQA dataset. Extensive automated experiments and hand analysis show that the additional answer summaries are not useful for achieving the best performance.

Case-Based Abductive Natural Language Inference
Marco Valentino | Mokanarangan Thayaparan | André Freitas

Most of the contemporary approaches for multi-hop Natural Language Inference (NLI) construct explanations considering each test case in isolation. However, this paradigm is known to suffer from semantic drift, a phenomenon that causes the construction of spurious explanations leading to wrong conclusions. In contrast, this paper proposes an abductive framework for multi-hop NLI exploring the retrieve-reuse-refine paradigm in Case-Based Reasoning (CBR). Specifically, we present Case-Based Abductive Natural Language Inference (CB-ANLI), a model that addresses unseen inference problems by analogical transfer of prior explanations from similar examples. We empirically evaluate the abductive framework on commonsense and scientific question answering tasks, demonstrating that CB-ANLI can be effectively integrated with sparse and dense pre-trained encoders to improve multi-hop inference, or adopted as an evidence retriever for Transformers. Moreover, an empirical analysis of semantic drift reveals that the CBR paradigm boosts the quality of the most challenging explanations, a feature that has a direct impact on robustness and accuracy in downstream inference tasks.

Semantic Structure Based Query Graph Prediction for Question Answering over Knowledge Graph
Mingchen Li | Shihao Ji

Building query graphs from natural language questions is an important step in complex question answering over knowledge graphs (Complex KGQA). In general, a question can be correctly answered if its query graph is built correctly and the right answer is then retrieved by issuing the query graph against the KG. Therefore, this paper focuses on query graph generation from natural language questions. Existing approaches for query graph generation ignore the semantic structure of a question, resulting in a large number of noisy query graph candidates that undermine prediction accuracy. In this paper, we define six semantic structures from common questions in KGQA and develop a novel Structure-BERT to predict the semantic structure of a question. By doing so, we can first filter out noisy candidate query graphs by the predicted semantic structures, and then rank the remaining candidates with a BERT-based ranking model. Extensive experiments on two popular benchmarks, MetaQA and WebQuestionsSP (WSP), demonstrate the effectiveness of our method compared to state-of-the-art methods.
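
The filter-then-rank pipeline described in this abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption for illustration: the `Candidate` class, the structure labels, and the precomputed `rank_score` (standing in for the output of the BERT-based ranking model) are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    query_graph: str   # hypothetical textual form of a candidate query graph
    structure: str     # hypothetical semantic-structure label of the candidate
    rank_score: float  # stand-in for a score produced by a ranking model


def filter_then_rank(candidates, predicted_structure):
    """Keep candidates whose structure matches the prediction, best score first."""
    kept = [c for c in candidates if c.structure == predicted_structure]
    return sorted(kept, key=lambda c: c.rank_score, reverse=True)


# Toy candidates (made up for illustration only).
candidates = [
    Candidate("q1: (film)-[directed_by]->(?x)", "one_hop", 0.40),
    Candidate("q2: (film)-[starred]->(?y)-[born_in]->(?x)", "two_hop", 0.90),
    Candidate("q3: (film)-[written_by]->(?x)", "one_hop", 0.75),
]

ranked = filter_then_rank(candidates, predicted_structure="one_hop")
best = ranked[0].query_graph if ranked else None
```

In this toy run the two-hop candidate is discarded before ranking even though it has the highest score, which is the intended effect of filtering by predicted structure.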

Repo4QA: Answering Coding Questions via Dense Retrieval on GitHub Repositories
Minyu Chen | Guoqiang Li | Chen Ma | Jingyang Li | Hongfei Fu

Open-source platforms such as GitHub and Stack Overflow both play significant roles in current software ecosystems. It is crucial but time-consuming for developers to raise programming questions in coding forums such as Stack Overflow and be navigated to actual solutions on GitHub repositories. In this paper, we aim to accelerate this process. We find that traditional information retrieval-based methods fail to handle the long and complex questions in coding forums, and thus cannot find suitable coding repositories. To effectively and efficiently bridge the semantic gap between repositories and real-world coding questions, we introduce a specialized dataset named Repo4QA, which includes over 12,000 question-repository pairs constructed from Stack Overflow and GitHub. Furthermore, we propose QuRep, a CodeBERT-based model that jointly learns the representation of both questions and repositories. Experimental results demonstrate that our model simultaneously captures the semantic features in both questions and repositories through supervised contrastive loss and hard negative sampling. We report that our approach outperforms existing state-of-the-art methods by 3%-8% on MRR and 5%-8% on P@1.
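
For readers unfamiliar with the two reported metrics, the following sketch shows how MRR (mean reciprocal rank) and P@1 (precision at rank 1) are commonly computed for a retrieval task with one relevant item per query. The function names and the toy rankings are assumptions for illustration; this is not the evaluation code of the paper.

```python
def reciprocal_rank(ranked_ids, relevant_id):
    """Return 1/rank of the relevant item, or 0.0 if it was not retrieved."""
    for rank, item in enumerate(ranked_ids, start=1):
        if item == relevant_id:
            return 1.0 / rank
    return 0.0


def mrr(rankings, gold):
    """Mean reciprocal rank over all queries."""
    return sum(reciprocal_rank(r, g) for r, g in zip(rankings, gold)) / len(gold)


def precision_at_1(rankings, gold):
    """Fraction of queries whose top-ranked item is the relevant one."""
    hits = sum(1 for r, g in zip(rankings, gold) if r and r[0] == g)
    return hits / len(gold)


# Toy data: three questions, each with one relevant repository (made up).
rankings = [
    ["repo_a", "repo_b", "repo_c"],  # relevant item at rank 1
    ["repo_b", "repo_a", "repo_c"],  # relevant item at rank 2
    ["repo_c", "repo_b", "repo_a"],  # relevant item at rank 3
]
gold = ["repo_a", "repo_a", "repo_a"]

mrr_value = mrr(rankings, gold)            # (1 + 1/2 + 1/3) / 3
p1_value = precision_at_1(rankings, gold)  # 1 / 3
```

MRR rewards placing the relevant repository near the top even when it is not first, whereas P@1 only counts exact top-1 hits.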

Addressing Limitations of Encoder-Decoder Based Approach to Text-to-SQL
Octavian Popescu | Irene Manotas | Ngoc Phuoc An Vo | Hangu Yeo | Elahe Khorashani | Vadim Sheinin

Most attempts at the Text-to-SQL task using an encoder-decoder approach suffer from a dramatic decline in performance on new databases. For the popular Spider dataset, although models achieve 70% accuracy on its development or test sets, the same models drop below 20% accuracy on unseen databases. The root causes of this problem are complex and cannot be easily fixed by adding more manually created training data. In this paper, we address the problem and propose a hybrid system that uses an automated training-data augmentation technique. Our system consists of a rule-based component and a deep learning component that interact to understand crucial information in a given query and produce correct SQL as a result. It achieves a double-digit percentage improvement for databases that are not part of the Spider corpus.

Mintaka: A Complex, Natural, and Multilingual Dataset for End-to-End Question Answering
Priyanka Sen | Alham Fikri Aji | Amir Saffari

We introduce Mintaka, a complex, natural, and multilingual dataset designed for experimenting with end-to-end question-answering models. Mintaka is composed of 20,000 question-answer pairs collected in English, annotated with Wikidata entities, and translated into Arabic, French, German, Hindi, Italian, Japanese, Portuguese, and Spanish for a total of 180,000 samples. Mintaka includes 8 types of complex questions, including superlative, intersection, and multi-hop questions, which were naturally elicited from crowd workers. We run baselines over Mintaka, the best of which achieves 38% hits@1 in English and 31% hits@1 multilingually, showing that existing models have room for improvement. We release Mintaka at

Can Edge Probing Tests Reveal Linguistic Knowledge in QA Models?
Sagnik Ray Choudhury | Nikita Bhutani | Isabelle Augenstein

There have been many efforts to try to understand what grammatical knowledge (e.g., ability to understand the part of speech of a token) is encoded in large pre-trained language models (LM). This is done through ‘Edge Probing’ (EP) tests: supervised classification tasks to predict the grammatical properties of a span (whether it has a particular part of speech) using only the token representations coming from the LM encoder. However, most NLP applications fine-tune these LM encoders for specific tasks. Here, we ask: if an LM is fine-tuned, does the encoding of linguistic information in it change, as measured by EP tests? Specifically, we focus on the task of Question Answering (QA) and conduct experiments on multiple datasets. We find that EP test results do not change significantly when the fine-tuned model performs well or in adversarial situations where the model is forced to learn wrong correlations. From a similar finding, some recent papers conclude that fine-tuning does not change linguistic knowledge in encoders but they do not provide an explanation. We find that EP models are susceptible to exploiting spurious correlations in the EP datasets. When this dataset bias is corrected, we do see an improvement in the EP test results as expected.

Conversational QA Dataset Generation with Answer Revision
Seonjeong Hwang | Gary Geunbae Lee

Conversational question-answer generation is a task that automatically generates a large-scale conversational question answering dataset based on input passages. In this paper, we introduce a novel framework that extracts question-worthy phrases from a passage and then generates corresponding questions considering previous conversations. In particular, our framework revises the extracted answers after generating questions so that answers exactly match paired questions. Experimental results show that our simple answer revision approach leads to significant improvement in the quality of synthetic data. Moreover, we prove that our framework can be effectively utilized for domain adaptation of conversational question answering.

DABERT: Dual Attention Enhanced BERT for Semantic Matching
Sirui Wang | Di Liang | Jian Song | Yuntao Li | Wei Wu

Transformer-based pre-trained language models such as BERT have achieved remarkable results in Semantic Sentence Matching. However, existing models still suffer from insufficient ability to capture subtle differences. Minor noise like word addition, deletion, and modification of sentences may cause flipped predictions. To alleviate this problem, we propose a novel Dual Attention Enhanced BERT (DABERT) to enhance the ability of BERT to capture fine-grained differences in sentence pairs. DABERT comprises (1) a Dual Attention module, which measures soft word matches by introducing a new dual channel alignment mechanism to model affinity and difference attention, and (2) an Adaptive Fusion module, which uses attention to learn the aggregation of difference and affinity features and generates a vector describing the matching details of sentence pairs. We conduct extensive experiments on well-studied semantic matching and robustness test datasets, and the experimental results show the effectiveness of our proposed method.

Locate Then Ask: Interpretable Stepwise Reasoning for Multi-hop Question Answering
Siyuan Wang | Zhongyu Wei | Zhihao Fan | Qi Zhang | Xuanjing Huang

Multi-hop reasoning requires aggregating multiple documents to answer a complex question. Existing methods usually decompose the multi-hop question into simpler single-hop questions to solve the problem for illustrating the explainable reasoning process. However, they ignore grounding on the supporting facts of each reasoning step, which tends to generate inaccurate decompositions. In this paper, we propose an interpretable stepwise reasoning framework to incorporate both single-hop supporting sentence identification and single-hop question generation at each intermediate step, and utilize the inference of the current hop for the next until reasoning out the final result. We employ a unified reader model for both intermediate hop reasoning and final hop inference and adopt joint optimization for more accurate and robust multi-hop reasoning. We conduct experiments on two benchmark datasets HotpotQA and 2WikiMultiHopQA. The results show that our method can effectively boost performance and also yields a better interpretable reasoning process without decomposition supervision.

Less Is Better: Recovering Intended-Feature Subspace to Robustify NLU Models
Ting Wu | Tao Gui

Datasets with significant proportions of bias present threats for training a trustworthy model on NLU tasks. Despite yielding great progress, current debiasing methods rely excessively on the knowledge of bias attributes. Definition of the attributes, however, is elusive and varies across different datasets. In addition, leveraging these attributes at the input level for bias mitigation may leave a gap between intrinsic properties and the underlying decision rule. To narrow this gap and remove the need for supervision on bias, we suggest extending bias mitigation into feature space. Therefore, a novel model, Recovering Intended-Feature Subspace with Knowledge-Free (RISK), is developed. Assuming that shortcut features caused by various biases are unintended for prediction, RISK views them as redundant features. When delving into a lower manifold to remove redundancies, RISK reveals that an extremely low-dimensional subspace with intended features can robustly represent the highly biased dataset. Empirical results demonstrate that our model consistently improves generalization to out-of-distribution sets and achieves new state-of-the-art performance.

CORN: Co-Reasoning Network for Commonsense Question Answering
Xin Guan | Biwei Cao | Qingqing Gao | Zheng Yin | Bo Liu | Jiuxin Cao

Commonsense question answering (QA) requires machines to utilize the QA content and external commonsense knowledge graph (KG) for reasoning when answering questions. Existing work uses two independent modules to model the QA contextual text representation and relationships between QA entities in KG, which prevents information sharing between modules for co-reasoning. In this paper, we propose a novel model, Co-Reasoning Network (CORN), which adopts a bidirectional multi-level connection structure based on Co-Attention Transformer. The structure builds bridges to connect each layer of the text encoder and graph encoder, which can introduce the QA entity relationship from KG to the text encoder and bring contextual text information to the graph encoder, so that these features can be deeply interactively fused to form comprehensive text and graph node representations. Meanwhile, we propose a QA-aware node based KG subgraph construction method. The QA-aware nodes aggregate the question entity nodes and the answer entity nodes, and further guide the expansion and construction process of the subgraph to enhance the connectivity and reduce the introduction of noise. We evaluate our model on QA benchmarks in the CommonsenseQA and OpenBookQA datasets, and CORN achieves state-of-the-art performance.

Logical Form Generation via Multi-task Learning for Complex Question Answering over Knowledge Bases
Xixin Hu | Xuan Wu | Yiheng Shu | Yuzhong Qu

Question answering over knowledge bases (KBQA) for complex questions is a challenging task in natural language processing. Recently, generation-based methods that translate natural language questions to executable logical forms have achieved promising performance. These methods use auxiliary information to augment the logical form generation of questions with unseen KB items or novel combinations, but the noise introduced can also lead to more incorrect results. In this work, we propose GMT-KBQA, a Generation-based KBQA method via Multi-Task learning, to better retrieve and utilize auxiliary information. GMT-KBQA first obtains candidate entities and relations through dense retrieval, and then introduces a multi-task model which jointly learns entity disambiguation, relation classification, and logical form generation. Experimental results show that GMT-KBQA achieves state-of-the-art results on both ComplexWebQuestions and WebQuestionsSP datasets. Furthermore, the detailed evaluation demonstrates that GMT-KBQA benefits from the auxiliary tasks and has a strong generalization capability.

CMQA: A Dataset of Conditional Question Answering with Multiple-Span Answers
Yiming Ju | Weikang Wang | Yuanzhe Zhang | Suncong Zheng | Kang Liu | Jun Zhao

Forcing the answer of the Question Answering (QA) task to be a single text span might be restrictive since the answer can be multiple spans in the context. Moreover, we found that multi-span answers often appear with two characteristics when building the QA system for a real-world application. First, multi-span answers might be caused by users lacking domain knowledge and asking ambiguous questions, which makes the question need to be answered with conditions. Second, there might be hierarchical relations among multiple answer spans. Some recent span-extraction QA datasets include multi-span samples, but they only contain unconditional and parallel answers, which cannot be used to tackle this problem. To bridge the gap, we propose a new task: conditional question answering with hierarchical multi-span answers, where both the hierarchical relations and the conditions need to be extracted. Correspondingly, we introduce CMQA, a Conditional Multiple-span Chinese Question Answering dataset to study the new proposed task. The final release of CMQA consists of 7,861 QA pairs and 113,089 labels, where all samples contain multi-span answers, 50.4% of samples are conditional, and 56.6% of samples are hierarchical. CMQA can serve as a benchmark to study the new proposed task and help study building QA systems for real-world applications. The low performance of models drawn from related literature shows that the new proposed task is challenging for the community to solve.

To What Extent Do Natural Language Understanding Datasets Correlate to Logical Reasoning? A Method for Diagnosing Logical Reasoning.
Yitian Li | Jidong Tian | Wenqing Chen | Caoyun Fan | Hao He | Yaohui Jin

Reasoning and knowledge-related skills are considered as two fundamental skills for natural language understanding (NLU) tasks such as machine reading comprehension (MRC) and natural language inference (NLI). However, it is not clear to what extent an NLU task defined on a dataset correlates to a specific NLU skill. On the one hand, evaluating the correlation requires an understanding of the significance of the NLU skill in a dataset. Significance judges whether a dataset includes sufficient material to help the model master this skill. On the other hand, it is also necessary to evaluate the dependence of the task on the NLU skill. Dependence is a measure of how much the task defined on a dataset depends on the skill. In this paper, we propose a systematic method to diagnose the correlations between an NLU dataset and a specific skill, and then take a fundamental reasoning skill, logical reasoning, as an example for analysis. The method adopts a qualitative indicator to indicate the significance while adopting a quantitative indicator to measure the dependence. We perform diagnosis on 8 MRC datasets (including two types) and 3 NLI datasets and acquire intuitively reasonable results. We then perform the analysis to further understand the results and the proposed indicators. Based on the analysis, although the diagnostic method has some limitations, it is still an effective method to perform a basic diagnosis of the correlation between the dataset and logical reasoning skill, which also can be generalized to other NLU skills.

ArcaneQA: Dynamic Program Induction and Contextualized Encoding for Knowledge Base Question Answering
Yu Gu | Yu Su

Question answering on knowledge bases (KBQA) poses a unique challenge for semantic parsing research due to two intertwined challenges: large search space and ambiguities in schema linking. Conventional ranking-based KBQA models, which rely on a candidate enumeration step to reduce the search space, struggle with flexibility in predicting complicated queries and have impractical running time. In this paper, we present ArcaneQA, a novel generation-based model that addresses both the large search space and the schema linking challenges in a unified framework with two mutually boosting ingredients: dynamic program induction for tackling the large search space and dynamic contextualized encoding for schema linking. Experimental results on multiple popular KBQA datasets demonstrate the highly competitive performance of ArcaneQA in both effectiveness and efficiency.

Unsupervised Question Answering via Answer Diversifying
Yuxiang Nie | Heyan Huang | Zewen Chi | Xian-Ling Mao

Unsupervised question answering is an attractive task due to its independence from labeled data. Previous works usually make use of heuristic rules as well as pre-trained models to construct data and train QA models. However, most of these works regard named entities (NE) as the only answer type, which ignores the high diversity of answers in the real world. To tackle this problem, we propose a novel unsupervised method by diversifying answers, named DiverseQA. Specifically, the proposed method is composed of three modules: data construction, data augmentation and denoising filter. Firstly, the data construction module extends the extracted named entity into a longer sentence constituent as the new answer span to construct a QA dataset with diverse answers. Secondly, the data augmentation module adopts an answer-type dependent data augmentation process via adversarial training at the embedding level. Thirdly, the denoising filter module is designed to alleviate the noise in the constructed data. Extensive experiments show that the proposed method outperforms previous unsupervised models on five benchmark datasets, including SQuADv1.1, NewsQA, TriviaQA, BioASQ, and DuoRC. Besides, the proposed method shows strong performance in the few-shot learning setting.

Weakly Supervised Formula Learner for Solving Mathematical Problems
Yuxuan Wu | Hideki Nakayama

Mathematical reasoning task is a subset of the natural language question answering task. Existing work suggested solving this task with a two-phase approach, where the model first predicts formulas from questions and then calculates answers from such formulas. This approach achieved desirable performance in existing work. However, its reliance on annotated formulas as intermediate labels throughout its training limited its application. In this work, we put forward the idea to enable models to learn optimal formulas autonomously. We proposed Weakly Supervised Formula Learner, a learning framework that drives the formula exploration with weak supervision from the final answers to mathematical problems. Our experiments are conducted on two representative mathematical reasoning datasets MathQA and Math23K. On MathQA, our method outperformed baselines trained on complete yet imperfect formula annotations. On Math23K, our method outperformed other weakly supervised learning methods.

Reducing Spurious Correlations for Answer Selection by Feature Decorrelation and Language Debiasing
Zeyi Zhong | Min Yang | Ruifeng Xu

Deep neural models have become the mainstream in answer selection, yielding state-of-the-art performance. However, these models tend to rely on spurious correlations between prediction labels and input features, which generally harms robustness and generalization. In this paper, we propose a novel Spurious Correlation reduction method to improve the robustness of the neural ANswer selection models (SCAN) from the sample and feature perspectives by removing the feature dependencies and language biases in answer selection. First, from the sample perspective, we propose a feature decorrelation module that learns a weight for each instance at the training phase to remove the feature dependencies and reduce the spurious correlations without prior knowledge of such correlations. Second, from the feature perspective, we propose a feature debiasing module with contrastive learning to alleviate the negative language biases (spurious correlations) and further improve the robustness of the AS models. Experimental results on three benchmark datasets show that SCAN achieves substantial improvements over strong baselines. For reproducibility, we will release our code and data upon the publication of this paper.

Understanding and Improving Zero-shot Multi-hop Reasoning in Generative Question Answering
Zhengbao Jiang | Jun Araki | Haibo Ding | Graham Neubig

Generative question answering (QA) models generate answers to questions either solely based on the parameters of the model (the closed-book setting) or additionally retrieving relevant evidence (the open-book setting). Generative QA models can answer some relatively complex questions, but the mechanism through which they do so is still poorly understood. We perform several studies aimed at better understanding the multi-hop reasoning capabilities of generative QA models. First, we decompose multi-hop questions into multiple corresponding single-hop questions, and find marked inconsistency in QA models’ answers on these pairs of ostensibly identical question chains. Second, we find that models lack zero-shot multi-hop reasoning ability: when trained only on single-hop questions, models generalize poorly to multi-hop questions. Finally, we demonstrate that it is possible to improve models’ zero-shot multi-hop reasoning capacity through two methods that approximate real multi-hop natural language (NL) questions by training on either concatenation of single-hop questions or logical forms (SPARQL). In sum, these results demonstrate that multi-hop reasoning does not emerge naturally in generative QA models, but can be encouraged by advances in training or modeling techniques. Code is available at

Domain Adaptation for Question Answering via Question Classification
Zhenrui Yue | Huimin Zeng | Ziyi Kou | Lanyu Shang | Dong Wang

Question answering (QA) has demonstrated impressive progress in answering questions from customized domains. Nevertheless, domain adaptation remains one of the most elusive challenges for QA systems, especially when QA systems are trained in a source domain but deployed in a different target domain. In this work, we investigate the potential benefits of question classification for QA domain adaptation. We propose a novel framework: Question Classification for Question Answering (QC4QA). Specifically, a question classifier is adopted to assign question classes to both the source and target data. Then, we perform joint training in a self-supervised fashion via pseudo-labeling. For optimization, inter-domain discrepancy between the source and target domain is reduced via maximum mean discrepancy (MMD) distance. We additionally minimize intra-class discrepancy among QA samples of the same question class for fine-grained adaptation performance. To the best of our knowledge, this is the first work in QA domain adaptation to leverage question classification with self-supervised adaptation. We demonstrate the effectiveness of the proposed QC4QA with consistent improvements against the state-of-the-art baselines on multiple datasets.
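
To make the maximum mean discrepancy (MMD) term concrete, here is a minimal sketch of the standard biased estimate of squared MMD between two sets of feature vectors. The RBF kernel, the `gamma` value, and the toy vectors are assumptions for illustration; the abstract does not state which kernel or estimator QC4QA uses.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mean_kernel(xs, ys, gamma):
    """Average kernel value over all pairs drawn from xs and ys."""
    total = sum(rbf_kernel(x, y, gamma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(source, target, gamma=1.0):
    """Biased estimate of squared MMD between source and target samples."""
    return (mean_kernel(source, source, gamma)
            + mean_kernel(target, target, gamma)
            - 2.0 * mean_kernel(source, target, gamma))


# Toy feature vectors (made up): one identical pair of sets, one shifted set.
source = [[0.0, 0.0], [1.0, 0.0]]
same = [[0.0, 0.0], [1.0, 0.0]]
shifted = [[5.0, 5.0], [6.0, 5.0]]

mmd_same = mmd_squared(source, same)
mmd_shifted = mmd_squared(source, shifted)
```

The estimate is zero when both sample sets coincide and grows as the two distributions move apart, which is why minimizing it pulls source and target representations together.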

Prompt-based Conservation Learning for Multi-hop Question Answering
Zhenyun Deng | Yonghua Zhu | Yang Chen | Qianqian Qi | Michael Witbrock | Patricia Riddle

Multi-hop question answering (QA) requires reasoning over multiple documents to answer a complex question and provide interpretable supporting evidence. However, providing supporting evidence is not enough to demonstrate that a model has performed the desired reasoning to reach the correct answer. Most existing multi-hop QA methods fail to answer a large fraction of sub-questions, even if their parent questions are answered correctly. In this paper, we propose the Prompt-based Conservation Learning (PCL) framework for multi-hop QA, which acquires new knowledge from multi-hop QA tasks while conserving old knowledge learned on single-hop QA tasks, mitigating forgetting. Specifically, we first train a model on existing single-hop QA tasks, and then freeze this model and expand it by allocating additional sub-networks for the multi-hop QA task. Moreover, to condition pre-trained language models to stimulate the kind of reasoning required for specific multi-hop questions, we learn soft prompts for the novel sub-networks to perform type-specific reasoning. Experimental results on the HotpotQA benchmark show that PCL is competitive for multi-hop QA and retains good performance on the corresponding single-hop sub-questions, demonstrating the efficacy of PCL in mitigating knowledge loss by forgetting.

GLAF: Global-to-Local Aggregation and Fission Network for Semantic Level Fact Verification
Zhiyuan Ma | Jianjun Li | Guohui Li | Yongjing Cheng

Accurate fact verification depends on performing fine-grained reasoning over crucial entities by capturing their latent logical relations hidden in multiple evidence clues, which is generally lacking in existing fact verification models. In this work, we propose a novel Global-to-Local Aggregation and Fission network (GLAF) to fill this gap. Instead of treating entire sentences or all semantic elements within them as nodes to construct a coarse-grained or unstructured evidence graph as in previous methods, GLAF constructs a fine-grained and structured evidence graph by parsing the rambling sentences into structural triple-level reasoning clues and regarding them as graph nodes to achieve fine-grained and interpretable evidence graph reasoning. Specifically, to capture latent logical relations between the clues, GLAF first employs a local fission reasoning layer to conduct fine-grained multi-hop reasoning, and then uses a global evidence aggregation layer to achieve information sharing and the interchange of evidence clues for final claim label prediction. Experimental results on the FEVER dataset demonstrate the effectiveness of GLAF, showing that it achieves the state-of-the-art performance by obtaining a 77.62% FEVER score.

Exploiting Hybrid Semantics of Relation Paths for Multi-hop Question Answering over Knowledge Graphs
Zile Qiao | Wei Ye | Tong Zhang | Tong Mo | Weiping Li | Shikun Zhang

Answering natural language questions on knowledge graphs (KGQA) remains a great challenge in terms of understanding complex questions via multi-hop reasoning. Previous efforts usually exploit large-scale entity-related text corpus or knowledge graph (KG) embeddings as auxiliary information to facilitate answer selection. However, the rich semantics implied in off-the-shelf relation paths between entities is far from well explored. This paper proposes improving multi-hop KGQA by exploiting relation paths’ hybrid semantics. Specifically, we integrate explicit textual information and implicit KG structural features of relation paths based on a novel rotate-and-scale entity link prediction framework. Extensive experiments on three existing KGQA datasets demonstrate the superiority of our method, especially in multi-hop scenarios. Further investigation confirms our method’s systematical coordination between questions and relation paths to identify answer entities.

Adaptive Threshold Selective Self-Attention for Chinese NER
Biao Hu | Zhen Huang | Minghao Hu | Ziwen Zhang | Yong Dou

Recently, Transformer has achieved great success in Chinese named entity recognition (NER) owing to its good parallelism and ability to model long-range dependencies, which utilizes self-attention to encode context. However, the fully connected way of self-attention may scatter the attention distribution and allow some irrelevant character information to be integrated, leading to entity boundaries being misidentified. In this paper, we propose a data-driven Adaptive Threshold Selective Self-Attention (ATSSA) mechanism that aims to dynamically select the most relevant characters to enhance the Transformer architecture for Chinese NER. In ATSSA, the attention score threshold of each query is automatically generated, and characters with attention score higher than the threshold are selected by the query while others are discarded, so as to address irrelevant attention integration. Experiments on four benchmark Chinese NER datasets show that the proposed ATSSA brings 1.68 average F1 score improvements to the baseline model and achieves state-of-the-art performance.
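
The selection step described above (keep only characters whose attention score exceeds a per-query threshold) can be sketched for a single query as follows. This is a simplified illustration, not the ATSSA mechanism itself: the abstract does not say how the threshold is generated, so the mean-of-weights threshold used here (`mean_threshold`) is a hypothetical stand-in, as are the renormalization and the fallback when nothing is selected.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_threshold(weights):
    """Hypothetical threshold rule: the mean attention weight of the query."""
    return sum(weights) / len(weights)


def selective_attention(score_row, threshold_fn=mean_threshold):
    """Zero out weights at or below the threshold, then renormalize the rest."""
    weights = softmax(score_row)
    threshold = threshold_fn(weights)
    kept = [w if w > threshold else 0.0 for w in weights]
    total = sum(kept)
    if total == 0.0:
        # Assumption: fall back to dense attention if no key is selected.
        return weights
    return [w / total for w in kept]


# Raw attention scores of one query over four characters (made up).
row = [2.0, 0.1, 0.0, 1.9]
selected = selective_attention(row)
```

With these toy scores only the first and last characters survive, so the query's attention mass is concentrated on them instead of being spread over all positions.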

Cluster-aware Pseudo-Labeling for Supervised Open Relation Extraction
Bin Duan | Shusen Wang | Xingxian Liu | Yajing Xu

Supervised open relation extraction aims to discover novel relations by leveraging supervised data of pre-defined relations. However, most existing methods do not achieve effective knowledge transfer from pre-defined relations to novel relations: they have difficulty generating high-quality pseudo-labels for unsupervised data of novel relations and usually suffer from error propagation. In this paper, we propose a Cluster-aware Pseudo-Labeling (CaPL) method to improve pseudo-label quality and transfer more knowledge for discovering novel relations. Specifically, the model is first pre-trained with the pre-defined relations to learn the relation representations. To improve pseudo-label quality, the distances between each instance and all cluster centers are used to generate the cluster-aware soft pseudo-labels for novel relations. To mitigate the catastrophic forgetting issue, we design a consistency regularization loss to make better use of the pseudo-labels and jointly train the model with both unsupervised and supervised data. Experimental results on two public datasets demonstrate that our proposed method achieves new state-of-the-art performance.
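
One common way to turn distances to cluster centers into soft pseudo-labels is a softmax over negative distances, sketched below. The Euclidean distance, the `temperature` parameter, and the toy centers are assumptions for illustration; the abstract does not give the exact formula used by CaPL.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def soft_pseudo_label(instance, centers, temperature=1.0):
    """Softmax over negative distances: closer centers get higher probability."""
    neg = [-euclidean(instance, c) / temperature for c in centers]
    m = max(neg)
    exps = [math.exp(v - m) for v in neg]
    total = sum(exps)
    return [e / total for e in exps]


# Three toy cluster centers in a 2-D embedding space (made up).
centers = [[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]]

near_first = soft_pseudo_label([0.5, 0.0], centers)
between_two = soft_pseudo_label([2.0, 0.0], centers)
```

An instance close to one center receives a peaked distribution, while an instance midway between two centers receives equal mass on both, so the soft label carries the uncertainty that a hard cluster assignment would discard.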

Few-shot Named Entity Recognition with Entity-level Prototypical Network Enhanced by Dispersedly Distributed Prototypes
Bin Ji | Shasha Li | Shaoduo Gan | Jie Yu | Jun Ma | Huijun Liu | Jing Yang

Few-shot named entity recognition (NER) enables us to build a NER system for a new domain using very few labeled examples. However, existing prototypical networks for this task suffer from roughly estimated label dependency and closely distributed prototypes, thus often causing misclassifications. To address the above issues, we propose EP-Net, an Entity-level Prototypical Network enhanced by dispersedly distributed prototypes. EP-Net builds entity-level prototypes and considers text spans to be candidate entities, so it no longer requires the label dependency. In addition, EP-Net trains the prototypes from scratch to distribute them dispersedly and aligns spans to prototypes in the embedding space using a space projection. Experimental results on two evaluation tasks and the Few-NERD settings demonstrate that EP-Net consistently outperforms the previous strong models in terms of overall performance. Extensive analyses further validate the effectiveness of EP-Net.

Different Data, Different Modalities! Reinforced Data Splitting for Effective Multimodal Information Extraction from Social Media Posts
Bo Xu | Shizhou Huang | Ming Du | Hongya Wang | Hui Song | Chaofeng Sha | Yanghua Xiao

Recently, multimodal information extraction from social media posts has gained increasing attention in the natural language processing community. Despite their success, current approaches overestimate the significance of images. In this paper, we argue that different social media posts should consider different modalities for multimodal information extraction. Multimodal models cannot always outperform unimodal models. Some posts are more suitable for the multimodal model, while others are more suitable for the unimodal model. Therefore, we propose a general data splitting strategy to divide the social media posts into two sets so that these two sets can achieve better performance under the information extraction models of the corresponding modalities. Specifically, for an information extraction task, we first propose a data discriminator that divides social media posts into a multimodal and a unimodal set. Then we feed these sets into the corresponding models. Finally, we combine the results of these two models to obtain the final extraction results. Due to the lack of explicit knowledge, we use reinforcement learning to train the data discriminator. Experiments on two different multimodal information extraction tasks demonstrate the effectiveness of our method. The source code of this paper can be found in

Augmentation, Retrieval, Generation: Event Sequence Prediction with a Three-Stage Sequence-to-Sequence Approach
Bo Zhou | Chenhao Wang | Yubo Chen | Kang Liu | Jun Zhao | Jiexin Xu | Xiaojian Jiang | Qiuxia Li

Being able to infer possible events related to a specific target is critical to natural language processing. One challenging task in this line is event sequence prediction, which aims at predicting a sequence of events given a goal. The existing approach models this task as a statistical induction problem, predicting a sequence of events by exploring the similarity between the given goal and the known sequences of events. However, this statistics-based approach is complex and predicts a limited variety of events. It also ignores the rich knowledge of external events that is important for predicting event sequences. In this paper, in order to predict more diverse events, we first reformulate the event sequence prediction problem as a sequence generation problem. Then, to leverage external event knowledge, we propose a three-stage model including augmentation, retrieval and generation. Experimental results on the event sequence prediction dataset show that our model outperforms existing methods, demonstrating the effectiveness of the proposed model.

Generating Temporally-ordered Event Sequences via Event Optimal Transport
Bo Zhou | Yubo Chen | Kang Liu | Jun Zhao | Jiexin Xu | Xiaojian Jiang | Qiuxia Li

Generating temporally-ordered event sequences in texts is important to natural language processing. Two emerging tasks in this direction are temporal event ordering (rearranging the set of events to correct order) and event infilling (generating an event at a specified position). To tackle the two related tasks, the existing method adopts a vanilla sequence-to-sequence model via maximum likelihood estimation (MLE). However, applying this approach to these tasks will cause two issues. One issue is that the MLE loss emphasizes strict local alignment and ignores the global semantics of the event. The other issue is that the model adopts a word-level objective to model events in texts, failing to evaluate the predicted results of the model from the perspective of event sequence. To alleviate these issues, we present a novel model to tackle the generation of temporally-ordered event sequences via Event Optimal Transport (EOT). First, we treat the events in the sequence as modeling units and explicitly extract the semantics of the events. Second, to provide event sequence-level evaluation of the predicted results of the model, we directly match events in sequences. Extensive experimental results show that our approach outperforms previous models on all evaluation datasets. In particular, the accuracy is improved by 7.7%, and the Macro F1 is improved by 7.2% on one of the datasets.

Improving Continual Relation Extraction through Prototypical Contrastive Learning
Chengwei Hu | Deqing Yang | Haoliang Jin | Zhen Chen | Yanghua Xiao

Continual relation extraction (CRE) aims to extract relations as new data arrive continuously and iteratively, and its major challenge is the catastrophic forgetting of old tasks. To alleviate this critical problem and enhance CRE performance, we propose a novel Continual Relation Extraction framework with Contrastive Learning, namely CRECL, which is built with a classification network and a prototypical contrastive network to achieve the incremental-class learning of CRE. Specifically, in the contrastive network a given instance is contrasted with the prototype of each candidate relation stored in the memory module. Such a contrastive learning scheme makes the data distributions of all tasks more distinguishable, so as to further alleviate catastrophic forgetting. Our experimental results not only demonstrate CRECL’s advantage over the state-of-the-art baselines on two public datasets, but also verify the effectiveness of CRECL’s contrastive learning in improving performance.

Prompt-based Text Entailment for Low-Resource Named Entity Recognition
Dongfang Li | Baotian Hu | Qingcai Chen

Pre-trained Language Models (PLMs) have been applied to NLP tasks and have achieved promising results. Nevertheless, the fine-tuning procedure needs labeled data of the target domain, making it difficult to learn in low-resource and non-trivial labeled scenarios. To address these challenges, we propose Prompt-based Text Entailment (PTE) for low-resource named entity recognition, which better leverages knowledge in the PLMs. We first reformulate named entity recognition as a text entailment task. The original sentence with entity type-specific prompts is fed into PLMs to get entailment scores for each candidate. The entity type with the top score is then selected as the final label. Then, we inject tagging labels into prompts and treat words as basic units instead of n-gram spans to reduce the time complexity of generating candidates by n-gram enumeration. Experimental results demonstrate that the proposed PTE method achieves competitive performance on the CoNLL03 dataset and outperforms fine-tuned counterparts on the MIT Movie and Few-NERD datasets in low-resource settings.

Key Mention Pairs Guided Document-Level Relation Extraction
Feng Jiang | Jianwei Niu | Shasha Mo | Shengda Fan

Document-level Relation Extraction (DocRE) aims at extracting relations between entities in a given document. Since different mention pairs may express different relations or even no relation, it is crucial to identify key mention pairs responsible for the entity-level relation labels. However, most recent studies treat different mentions equally while predicting the relations between entities, leading to sub-optimal performance. To this end, we propose a novel DocRE model called Key Mention pairs Guided Relation Extractor (KMGRE) to directly model mention-level relations, containing two modules: a mention-level relation extractor and a key instance classifier. These two modules could be iteratively optimized with an EM-based algorithm to enhance each other. We also propose a new method to solve the multi-label problem in optimizing the mention-level relation extractor. Experimental results on two public DocRE datasets demonstrate that the proposed model is effective and outperforms previous state-of-the-art models.

A Hybrid Model of Classification and Generation for Spatial Relation Extraction
Feng Wang | Peifeng Li | Qiaoming Zhu

Extracting spatial relations from texts is a fundamental task for natural language understanding. Previous studies regard it only as a classification task, ignoring spatial relations with null roles because they carry little information. To address this issue, we first view spatial relation extraction as a generation task and propose a novel hybrid model, HMCGR, for this task. HMCGR contains a generation model and a classification model: the former generates null-role relations and the latter extracts non-null-role relations, so that the two complement each other. Moreover, a reflexivity evaluation mechanism is applied to further improve the accuracy based on the reflexivity principle of spatial relations. Experimental results on SpaceEval show that HMCGR outperforms the SOTA baselines significantly.

Mining Health-related Cause-Effect Statements with High Precision at Large Scale
Ferdinand Schlatt | Dieter Bettin | Matthias Hagen | Benno Stein | Martin Potthast

An efficient assessment of the health relatedness of text passages is important to mine the web at scale to conduct health sociological analyses or to develop a health search engine. We propose a new efficient and effective termhood score for predicting the health relatedness of phrases and sentences, which achieves 69% recall at over 90% precision on a web dataset with cause-effect statements. It is more effective than state-of-the-art medical entity linkers and as effective but much faster than BERT-based approaches. Using our method, we compile the Webis Medical CauseNet 2022, a new resource of 7.8 million health-related cause-effect statements such as “Studies show that stress induces insomnia” in which the cause (‘stress’) and effect (‘insomnia’) are labeled.

Find the Funding: Entity Linking with Incomplete Funding Knowledge Bases
Gizem Aydin | Seyed Amin Tabatabaei | George Tsatsaronis | Faegheh Hasibi

Automatic extraction of funding information from academic articles adds significant value to industry and research communities, including tracking research outcomes by funding organizations, profiling researchers and universities based on the received funding, and supporting open access policies. Two major challenges of identifying and linking funding entities are: (i) sparse graph structure of the Knowledge Base (KB), which makes the commonly used graph-based entity linking approaches suboptimal for the funding domain, (ii) missing entities in KB, which (unlike recent zero-shot approaches) requires marking entity mentions without KB entries as NIL. We propose an entity linking model that can perform NIL prediction and overcome data scarcity issues in a time and data-efficient manner. Our model builds on a transformer-based mention detection and a bi-encoder model to perform entity linking. We show that our model outperforms strong existing baselines.

KiPT: Knowledge-injected Prompt Tuning for Event Detection
Haochen Li | Tong Mo | Hongcheng Fan | Jingkun Wang | Jiaxi Wang | Fuhao Zhang | Weiping Li

Event detection aims to detect events from the text by identifying and classifying event triggers (the most representative words). Most of the existing works rely heavily on complex downstream networks and require sufficient training data. Thus, those models may be structurally redundant and perform poorly when data is scarce. Prompt-based models are easy to build and are promising for few-shot tasks. However, current prompt-based methods may suffer from low precision because they have not introduced event-related semantic knowledge (e.g., part of speech, semantic correlation, etc.). To address these problems, this paper proposes a Knowledge-injected Prompt Tuning (KiPT) model. Specifically, the event detection task is formulated into a condition generation task. Then, knowledge-injected prompts are constructed using external knowledge bases, and a prompt tuning strategy is leveraged to optimize the prompts. Extensive experiments indicate that KiPT outperforms strong baselines, especially in few-shot scenarios.

OneEE: A One-Stage Framework for Fast Overlapping and Nested Event Extraction
Hu Cao | Jingye Li | Fangfang Su | Fei Li | Hao Fei | Shengqiong Wu | Bobo Li | Liang Zhao | Donghong Ji

Event extraction (EE) is an essential task of information extraction, which aims to extract structured event information from unstructured text. Most prior work focuses on extracting flat events while neglecting overlapped or nested ones. A few models for overlapped and nested EE include several successive stages to extract event triggers and arguments, which suffer from error propagation. Therefore, we design a simple yet effective tagging scheme and model to formulate EE as word-word relation recognition, called OneEE. The relations between trigger or argument words are simultaneously recognized in one stage with parallel grid tagging, thus yielding a very fast event extraction speed. The model is equipped with an adaptive event fusion module to generate event-aware representations and a distance-aware predictor to integrate relative distance information for word-word relation recognition, which are empirically demonstrated to be effective mechanisms. Experiments on 3 overlapped and nested EE benchmarks, namely FewFC, Genia11, and Genia13, show that OneEE achieves state-of-the-art (SOTA) results. Moreover, the inference speed of OneEE is faster than that of the baselines under the same conditions, and can be further substantially improved since it supports parallel inference.

Joint Language Semantic and Structure Embedding for Knowledge Graph Completion
Jianhao Shen | Chenguang Wang | Linyuan Gong | Dawn Song

The task of completing knowledge triplets has broad downstream applications. Both structural and semantic information play an important role in knowledge graph completion. Unlike previous approaches that rely on either the structures or semantics of the knowledge graphs, we propose to jointly embed the semantics in the natural language description of the knowledge triplets with their structure information. Our method embeds knowledge graphs for the completion task via fine-tuning pre-trained language models with respect to a probabilistic structured loss, where the forward pass of the language models captures semantics and the loss reconstructs structures. Our extensive experiments on a variety of knowledge graph benchmarks have demonstrated the state-of-the-art performance of our method. We also show that our method can significantly improve the performance in a low-resource regime, thanks to the better use of semantics. The code and datasets are available at

Event Detection with Dual Relational Graph Attention Networks
Jiaxin Mi | Po Hu | Peng Li

Event detection, which aims to identify instances of specific event types from pieces of text, is a fundamental task in information extraction. Most existing approaches leverage syntactic knowledge with a set of syntactic relations to enhance event detection. However, a side effect of these syntactic-based approaches is that they may confuse different syntactic relations and tend to introduce redundant or noisy information, which may lead to performance degradation. To this end, we propose a simple yet effective model named DualGAT (Dual Relational Graph Attention Networks), which exploits the complementary nature of syntactic and semantic relations to alleviate the problem. Specifically, we first construct a dual relational graph that aggregates both syntactic and semantic relations to the key nodes in the graph, so that event-relevant information can be comprehensively captured from multiple perspectives (i.e., syntactic and semantic views). We then adopt augmented relational graph attention networks to encode the graph and optimize its attention weights by introducing contextual information, which further improves the performance of event detection. Extensive experiments conducted on the standard ACE2005 benchmark dataset indicate that our method significantly outperforms the state-of-the-art methods and verify the superiority of DualGAT over existing syntactic-based methods.

A Multi-Format Transfer Learning Model for Event Argument Extraction via Variational Information Bottleneck
Jie Zhou | Qi Zhang | Qin Chen | Qi Zhang | Liang He | Xuanjing Huang

Event argument extraction (EAE) aims to extract arguments with given roles from texts and has been widely studied in natural language processing. Most previous works have achieved good performance on specific EAE datasets with dedicated neural architectures. However, these architectures are usually difficult to adapt to new datasets/scenarios with various annotation schemas or formats. Furthermore, they rely on large-scale labeled data for training, which is unavailable in most cases due to the high labelling cost. In this paper, we propose a multi-format transfer learning model with variational information bottleneck, which makes use of the information, especially the common knowledge, in existing datasets for EAE in new datasets. Specifically, we introduce a shared-specific prompt framework to learn both format-shared and format-specific knowledge from datasets with different formats. In order to further absorb the common knowledge for EAE and eliminate the irrelevant noise, we integrate variational information bottleneck into our architecture to refine the shared representation. We conduct extensive experiments on three benchmark datasets, and obtain new state-of-the-art performance on EAE.

RSGT: Relational Structure Guided Temporal Relation Extraction
Jie Zhou | Shenpo Dong | Hongkui Tu | Xiaodong Wang | Yong Dou

Temporal relation extraction aims to extract temporal relations between event pairs, which is crucial for natural language understanding. Few efforts have been devoted to capturing the global features. In this paper, we propose RSGT: Relational Structure Guided Temporal Relation Extraction to extract relational structure features that fit both inter-sentence and intra-sentence relations. Specifically, we construct a syntactic-and-semantic-based graph to extract relational structures. Then we present a graph neural network based model to learn the representation of this graph. After that, an auxiliary temporal neighbor prediction task is used to fine-tune the encoder to get more comprehensive node representations. Finally, we apply a conflict detection and correction algorithm to adjust the wrongly predicted labels. Experiments on two well-known datasets, MATRES and TB-Dense, demonstrate the superiority of our method (2.3% F1 improvement on MATRES, 3.5% F1 improvement on TB-Dense).

Learning Hierarchy-Aware Quaternion Knowledge Graph Embeddings with Representing Relations as 3D Rotations
Jinfa Yang | Xianghua Ying | Yongjie Shi | Xin Tong | Ruibin Wang | Taiyan Chen | Bowei Xing

Knowledge graph embedding aims to represent entities and relations as low-dimensional vectors, which is an effective way for predicting missing links. It is crucial for knowledge graph embedding models to model and infer various relation patterns, such as symmetry/antisymmetry. However, many existing approaches fail to model semantic hierarchies, which are common in the real world. We propose a new model called HRQE, which represents entities as pure quaternions. The relational embedding consists of two parts: (a) Using unit quaternions to represent the rotation part in 3D space, where the head entities are rotated by the corresponding relations through Hamilton product. (b) Using scale parameters to constrain the modulus of entities to make them have hierarchical distributions. To the best of our knowledge, HRQE is the first model that can encode symmetry/antisymmetry, inversion, composition, multiple relation patterns and learn semantic hierarchies simultaneously. Experimental results demonstrate the effectiveness of HRQE against some of the SOTA methods on four well-established knowledge graph completion benchmarks.
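
The two-part relational embedding can be illustrated with plain quaternion arithmetic. The sketch below rotates a pure quaternion (a 3D vector) by a unit quaternion using the Hamilton product and then rescales its modulus. The sandwich form q v q* and the single scalar `scale` are assumptions chosen for illustration; HRQE's exact scoring function is not given in the abstract.

```python
import math


def hamilton(p, q):
    """Hamilton product of two quaternions given as (a, b, c, d) tuples."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,
    )


def rotate_and_scale(head, rotation, scale=1.0):
    """Rotate a 3D vector `head` by quaternion `rotation`, then rescale it.

    ASSUMPTION: rotation uses the sandwich product q * v * conj(q), and the
    modulus constraint is modelled as one hypothetical scalar `scale`.
    """
    norm = math.sqrt(sum(x * x for x in rotation))
    a, b, c, d = (x / norm for x in rotation)  # normalise to a unit quaternion
    q = (a, b, c, d)
    q_conj = (a, -b, -c, -d)
    pure = (0.0, head[0], head[1], head[2])  # embed the vector as a pure quaternion
    _, x, y, z = hamilton(hamilton(q, pure), q_conj)
    return (scale * x, scale * y, scale * z)


if __name__ == "__main__":
    # A 90-degree rotation about the z-axis maps the x-axis onto the y-axis.
    quarter_turn = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
    print(rotate_and_scale((1.0, 0.0, 0.0), quarter_turn, scale=2.0))
```

The rotation leaves the modulus of the entity unchanged, so any change in modulus comes from the scale step alone. This is why a separate scale parameter is a natural place to encode hierarchy levels.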

Two Languages Are Better than One: Bilingual Enhancement for Chinese Named Entity Recognition
Jinzhong Ning | Zhihao Yang | Zhizheng Wang | Yuanyuan Sun | Hongfei Lin | Jian Wang

Chinese Named Entity Recognition (NER) has continued to attract research attention. However, most existing studies only explore the internal features of the Chinese language but neglect other lingual modal features. Actually, as another modal knowledge of the Chinese language, English contains rich prompts about entities that can potentially be applied to improve the performance of Chinese NER. Therefore, in this study, we explore the bilingual enhancement for Chinese NER and propose a unified bilingual interaction module called the Adapted Cross-Transformers with Global Sparse Attention (ACT-S) to capture the interaction of bilingual information. We utilize a model built upon several different ACT-Ss to integrate the rich English information into the Chinese representation. Moreover, our model can learn the interaction of information between bilinguals (inter-features) and the dependency information within Chinese (intra-features). Compared with existing Chinese NER methods, our proposed model can better handle entities with complex structures. The English text that enhances the model is automatically generated by machine translation, avoiding high labour costs. Experimental results on four well-known benchmark datasets demonstrate the effectiveness and robustness of our proposed model.

Read Extensively, Focus Smartly: A Cross-document Semantic Enhancement Method for Visual Documents NER
Jun Zhao | Xin Zhao | WenYu Zhan | Tao Gui | Qi Zhang | Liang Qiao | Zhanzhan Cheng | Shiliang Pu

The introduction of multimodal information and pretraining techniques has significantly improved entity recognition from visually-rich documents. However, most of the existing methods pay unnecessary attention to irrelevant regions of the current document while ignoring the potentially valuable information in related documents. To deal with this problem, this work proposes a cross-document semantic enhancement method, which consists of two modules: 1) To prevent distractions from irrelevant regions in the current document, we design a learnable attention mask mechanism, which is used to adaptively filter redundant information in the current document. 2) To further enrich the entity-related context, we propose a cross-document information awareness technique, which enables the model to collect more evidence across documents to assist in prediction. The experimental results on two document understanding benchmarks covering eight languages demonstrate that our method outperforms the SOTA methods.

STAD: Self-Training with Ambiguous Data for Low-Resource Relation Extraction
Junjie Yu | Xing Wang | Jiangjiang Zhao | Chunjie Yang | Wenliang Chen

We present a simple yet effective self-training approach, named STAD, for low-resource relation extraction. The approach first classifies the auto-annotated instances into two groups, confident instances and uncertain instances, according to the probabilities predicted by a teacher model. In contrast to most previous studies, which mainly use only the confident instances for self-training, we also make use of the uncertain instances. To this end, we propose a method to identify ambiguous but useful instances from the uncertain instances and then divide the relations into a candidate-label set and a negative-label set for each ambiguous instance. Next, we propose a set-negative training method on the negative-label sets for the ambiguous instances and a positive training method for the confident instances. Finally, a joint-training method is proposed to build the final relation extraction system on all data. Experimental results on two widely used datasets, SemEval2010 Task-8 and Re-TACRED, with low-resource settings demonstrate that this new self-training approach indeed achieves significant and consistent improvements when compared to several competitive self-training systems.
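
The grouping step can be sketched as a simple partition over a teacher model's predicted probabilities. Both thresholds and the function name below are hypothetical, and the sketch treats every uncertain instance as ambiguous, whereas the abstract describes an additional step that selects only the useful ambiguous instances.

```python
def partition_instance(probs, confident_threshold=0.9, candidate_threshold=0.1):
    """Classify one auto-annotated instance from teacher probabilities.

    `probs` maps relation name -> probability predicted by the teacher.
    ASSUMPTION: both thresholds are illustrative values, not from the paper.
    """
    best = max(probs, key=probs.get)
    if probs[best] >= confident_threshold:
        # Confident instance: used for ordinary (positive) training.
        return {"kind": "confident", "label": best}
    # Otherwise split the relations into plausible and implausible labels.
    candidates = {r for r, p in probs.items() if p >= candidate_threshold}
    negatives = set(probs) - candidates
    return {"kind": "ambiguous", "candidates": candidates, "negatives": negatives}


if __name__ == "__main__":
    print(partition_instance({"founder_of": 0.95, "employee_of": 0.03, "no_relation": 0.02}))
    print(partition_instance({"founder_of": 0.50, "employee_of": 0.45, "no_relation": 0.05}))
```

In the second call the teacher cannot decide between two relations, but it is fairly sure the instance is not `no_relation`. The negative-label set captures that information, and it can still serve as a training signal.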

Flat Multi-modal Interaction Transformer for Named Entity Recognition
Junyu Lu | Dixiang Zhang | Jiaxing Zhang | Pingjian Zhang

Multi-modal named entity recognition (MNER) aims at identifying entity spans and recognizing their categories in social media posts with the aid of images. However, in dominant MNER approaches, the interaction of different modalities is usually carried out through the alternation of self-attention and cross-attention or over-reliance on the gating machine, which results in imprecise and biased correspondence between fine-grained semantic units of text and image. To address this issue, we propose a Flat Multi-modal Interaction Transformer (FMIT) for MNER. Specifically, we first utilize noun phrases in sentences and general domain words to obtain visual cues. Then, we transform the fine-grained semantic representation of the vision and text into a unified lattice structure and design a novel relative position encoding to match different modalities in Transformer. Meanwhile, we propose to leverage entity boundary detection as an auxiliary task to alleviate visual bias. Experiments show that our methods achieve the new state-of-the-art performance on two benchmark datasets.

MetaSLRCL: A Self-Adaptive Learning Rate and Curriculum Learning Based Framework for Few-Shot Text Classification
Kailin Zhao | Xiaolong Jin | Saiping Guan | Jiafeng Guo | Xueqi Cheng

Due to the lack of labeled data in many realistic scenarios, a number of few-shot learning methods for text classification have been proposed, among which the meta learning based ones have recently attracted much attention. Such methods usually consist of a learner as the classifier and a meta learner for specializing the learner to different tasks. For the learner, learning rate is crucial to its performance. However, existing methods treat it as a hyper parameter and adjust it manually, which is time-consuming and laborious. Intuitively, for different tasks and neural network layers, the learning rates should be different and self-adaptive. For the meta learner, it requires a good generalization ability so as to quickly adapt to new tasks. Motivated by these issues, we propose a novel meta learning framework, called MetaSLRCL, for few-shot text classification. Specifically, we present a novel meta learning mechanism to obtain different learning rates for different tasks and neural network layers so as to enable the learner to quickly adapt to new training data. Moreover, we propose a task-oriented curriculum learning mechanism to help the meta learner achieve a better generalization ability by learning from different tasks with increasing difficulties. Extensive experiments on three benchmark datasets demonstrate the effectiveness of MetaSLRCL.

A Simple Temporal Information Matching Mechanism for Entity Alignment between Temporal Knowledge Graphs
Li Cai | Xin Mao | Meirong Ma | Hao Yuan | Jianchao Zhu | Man Lan

Entity alignment (EA) aims to find entities in different knowledge graphs (KGs) that refer to the same object in the real world. Recent studies incorporate temporal information to augment the representations of KGs. The existing methods for EA between temporal KGs (TKGs) utilize time-aware attention mechanisms to incorporate relational and temporal information into entity embeddings. These approaches outperform the previous methods by using temporal information. However, we believe that it is not necessary to learn the embeddings of temporal information in KGs since most TKGs have uniform temporal representations. Therefore, we propose a simple GNN model combined with a temporal information matching mechanism, which achieves better performance with less time and fewer parameters. Furthermore, since alignment seeds are difficult to label in real-world applications, we also propose a method to generate unsupervised alignment seeds via the temporal information of TKGs. Extensive experiments on public datasets indicate that our supervised method significantly outperforms the previous methods and the unsupervised one has competitive performance.

DCT-Centered Temporal Relation Extraction
Liang Wang | Peifeng Li | Sheng Xu

Most previous work on temporal relation extraction only focused on extracting the temporal relations among events or suffered from the issue of different expressions of events, timexes and Document Creation Time (DCT). Moreover, DCT can act as a hub to semantically connect the other events and timexes in a document. Unfortunately, previous work cannot benefit from such critical information. To address the above issues, we propose a unified DCT-centered Temporal Relation Extraction model DTRE to identify the relations among events, timexes and DCT. Specifically, sentence-style DCT representation is introduced to address the first issue and unify event expressions, timexes and DCT. Then, a DCT-aware graph is applied to obtain their contextual structural representations. Furthermore, a DCT-anchoring multi-task learning framework is proposed to jointly predict three types of temporal relations in a batch. Finally, we apply a DCT-guided global inference to further enhance the global consistency among different relations. Experimental results on three datasets show that our DTRE outperforms several SOTA baselines on E-E, E-T and E-D significantly.

Document-level Biomedical Relation Extraction Based on Multi-Dimensional Fusion Information and Multi-Granularity Logical Reasoning
Lishuang Li | Ruiyuan Lian | Hongbin Lu | Jingyao Tang

Document-level biomedical relation extraction (Bio-DocuRE) is an important branch of biomedical text mining that aims to automatically extract all relation facts from biomedical text. Since a considerable number of relations in biomedical documents need to be inferred from other existing relations, logical reasoning has become a research hotspot in the past two years. However, current reasoning models are single-granularity, relying on only one type of element information and ignoring the complementary nature of reasoning information at different granularities. In addition, obtaining rich document information is a prerequisite for logical reasoning, but most of the previous models cannot sufficiently utilize document information, which limits the reasoning ability of the model. In this paper, we propose a novel Bio-DocuRE model called FILR, based on Multi-Dimensional Fusion Information and Multi-Granularity Logical Reasoning. Specifically, FILR presents a multi-dimensional information fusion module MDIF to extract sufficient global document information. Then FILR proposes a multi-granularity reasoning module MGLR to obtain rich inference information through the reasoning of both entity-pairs and mention-pairs. We evaluate our FILR model on two widely used biomedical corpora, CDR and GDA. Experimental results show that FILR achieves state-of-the-art performance.

Simple Yet Powerful: An Overlooked Architecture for Nested Named Entity Recognition
Matias Rojas | Felipe Bravo-Marquez | Jocelyn Dunstan

Named Entity Recognition (NER) is an important task in Natural Language Processing that aims to identify text spans belonging to predefined categories. Traditional NER systems ignore nested entities, which are entities contained in other entity mentions. Although several methods have been proposed to address this case, most of them rely on complex task-specific structures and ignore potentially useful baselines for the task. We argue that this creates an overly optimistic impression of their performance. This paper revisits the Multiple LSTM-CRF (MLC) model, a simple, overlooked, yet powerful approach based on training independent sequence labeling models for each entity type. Extensive experiments with three nested NER corpora show that, despite the simplicity of this model, its performance is better than, or at least as good as, that of more sophisticated methods. Furthermore, we show that the MLC architecture achieves state-of-the-art results in the Chilean Waiting List corpus by including pre-trained language models. In addition, we implemented an open-source library that computes task-specific metrics for nested NER. The results suggest that metrics used in previous work do not adequately measure the ability of a model to detect nested entities, while our metrics provide new evidence on how existing approaches handle the task.

ERGO: Event Relational Graph Transformer for Document-level Event Causality Identification
Meiqi Chen | Yixin Cao | Kunquan Deng | Mukai Li | Kun Wang | Jing Shao | Yan Zhang

Document-level Event Causality Identification (DECI) aims to identify event-event causal relations in a document. Existing works usually build an event graph for global reasoning across multiple sentences. However, the edges between events have to be carefully designed through heuristic rules or external tools. In this paper, we propose a novel Event Relational Graph TransfOrmer (ERGO) framework for DECI, to ease the graph construction and improve it over the noisy edge issue. Different from conventional event graphs, we define a pair of events as a node and build a complete event relational graph without any prior knowledge or tools. This naturally formulates DECI as a node classification problem, and thus we capture the causation transitivity among event pairs via a graph transformer. Furthermore, we design a criss-cross constraint and an adaptive focal loss for the imbalanced classification, to alleviate the issues of false positives and false negatives. Extensive experiments on two benchmark datasets show that ERGO greatly outperforms previous state-of-the-art (SOTA) methods (12.8% F1 gains on average).
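
The abstract above mentions an adaptive focal loss for imbalanced classification. As a rough illustration of the underlying idea only, the sketch below implements the standard binary focal loss, not the adaptive variant proposed in the paper. The function name and the default values of `gamma` and `alpha` are assumptions chosen for illustration.

```python
import math


def binary_focal_loss(p: float, y: int, gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Standard binary focal loss for a single example.

    p     : predicted probability of the positive class
    y     : gold label, 0 or 1
    gamma : focusing parameter; larger values down-weight easy examples more
    alpha : weight of the positive class (1 - alpha is used for negatives)

    Illustrative defaults; this is NOT the adaptive variant from the paper.
    """
    eps = 1e-12
    # Probability assigned to the gold class.
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # Clamp to avoid log(0).
    p_t = min(max(p_t, eps), 1.0)
    # The (1 - p_t)^gamma factor shrinks the loss of well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# A confidently correct positive contributes far less than a badly missed one.
easy = binary_focal_loss(0.9, 1)
hard = binary_focal_loss(0.1, 1)
```

With `gamma = 0` and `alpha = 0.5` the expression reduces to half the ordinary cross-entropy, which is a convenient sanity check.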

DRK: Discriminative Rule-based Knowledge for Relieving Prediction Confusions in Few-shot Relation Extraction
Mengru Wang | Jianming Zheng | Fei Cai | Taihua Shao | Honghui Chen

Few-shot relation extraction aims to identify the relation type between entities in a given text in the low-resource scenario. Despite much progress, existing meta-learning methods still fall into prediction confusions owing to their limited inference ability over shallow text features. To relieve these confusions, this paper proposes a discriminative rule-based knowledge (DRK) method. Specifically, DRK adopts a logic-aware inference module to ease the word-overlap confusion, which introduces a logic rule to constrain the inference process, thereby avoiding the adverse effect of shallow text features. Also, DRK employs a discrimination finding module to alleviate the entity-type confusion, which explores distinguishable text features via hierarchical contrastive learning. We conduct extensive experiments on four types of meta tasks and the results show promising improvements from DRK (6.0% accuracy gains on average). Besides, error analyses reveal that word-overlap and entity-type errors are the main causes of mispredictions in few-shot relation extraction.

DocQueryNet: Value Retrieval with Arbitrary Queries for Form-like Documents
Mingfei Gao | Le Xue | Chetan Ramaiah | Chen Xing | Ran Xu | Caiming Xiong

We propose DocQueryNet, a value retrieval method with arbitrary queries for form-like documents, to reduce the human effort of processing forms. Unlike previous methods that only address a fixed set of field items, our method predicts the target value for an arbitrary query based on an understanding of the layout and semantics of a form. To further boost model performance, we propose a simple document language modeling (SimpleDLM) strategy to improve document understanding in large-scale model pre-training. Experimental results show that DocQueryNet outperforms previous designs significantly and that SimpleDLM further improves our performance on value retrieval by around 17% F1 score compared with the state-of-the-art pre-training method. Code is available.

DoSEA: A Domain-specific Entity-aware Framework for Cross-Domain Named Entity Recognition
Minghao Tang | Peng Zhang | Yongquan He | Yongxiu Xu | Chengpeng Chao | Hongbo Xu

Cross-domain named entity recognition aims to improve performance in a target domain with shared knowledge from a well-studied source domain. Previous sequence-labeling based methods focus on promoting model parameter sharing among domains. However, such a paradigm essentially ignores domain-specific information and suffers from entity type conflicts. To address these issues, we propose a novel machine reading comprehension based framework, named DoSEA, which can identify domain-specific semantic differences and mitigate the subtype conflicts between domains. Concretely, we introduce an entity existence discrimination task and an entity-aware training setting, to recognize inconsistent entity annotations in the source domain and bring additional reference to better share information across domains. Experiments on six datasets demonstrate the effectiveness of our DoSEA. Our source code is available.

Incremental Prompting: Episodic Memory Prompt for Lifelong Event Detection
Minqian Liu | Shiyu Chang | Lifu Huang

Lifelong event detection aims to incrementally update a model with new event types and data while retaining its capability on previously learned old types. One critical challenge is that the model would catastrophically forget old types when continually trained on new data. In this paper, we introduce Episodic Memory Prompts (EMP) to explicitly retain the learned task-specific knowledge. Our method adopts a continuous prompt for each task, and these prompts are optimized to instruct the model prediction and learn event-specific representations. The EMPs learned in previous tasks are carried along with the model in subsequent tasks, and can serve as a memory module that keeps the old knowledge and transfers it to new tasks. Experimental results demonstrate the effectiveness of our method. Furthermore, we also conduct a comprehensive analysis of the new and old event types in lifelong learning.

Recent Advances in Text-to-SQL: A Survey of What We Have and What We Expect
Naihao Deng | Yulong Chen | Yue Zhang

Text-to-SQL has attracted attention from both the natural language processing and database communities because of its ability to convert the semantics in natural language into SQL queries and its practical application in building natural language interfaces to database systems. The major challenges in text-to-SQL lie in encoding the meaning of natural utterances, decoding to SQL queries, and translating the semantics between these two forms. These challenges have been addressed to different extents by the recent advances. However, there is still a lack of comprehensive surveys for this task. To this end, we review recent progress on text-to-SQL for datasets, methods, and evaluation and provide this systematic survey, addressing the aforementioned challenges and discussing potential future directions. We hope this survey can serve as quick access to existing work and motivate future research.

An MRC Framework for Semantic Role Labeling
Nan Wang | Jiwei Li | Yuxian Meng | Xiaofei Sun | Han Qiu | Ziyao Wang | Guoyin Wang | Jun He

Semantic Role Labeling (SRL) aims at recognizing the predicate-argument structure of a sentence and can be decomposed into two subtasks: predicate disambiguation and argument labeling. Prior work deals with these two tasks independently, which ignores the semantic connection between the two tasks. In this paper, we propose to use the machine reading comprehension (MRC) framework to bridge this gap. We formalize predicate disambiguation as multiple-choice machine reading comprehension, where the descriptions of candidate senses of a given predicate are used as options to select the correct sense. The chosen predicate sense is then used to determine the semantic roles for that predicate, and these semantic roles are used to construct the query for another MRC model for argument labeling. In this way, we are able to leverage both the predicate semantics and the semantic role semantics for argument labeling. We also propose to select a subset of all the possible semantic roles for computational efficiency. Experiments show that the proposed framework achieves state-of-the-art or comparable results to previous work.

PCBERT: Parent and Child BERT for Chinese Few-shot NER
Peichao Lai | Feiyang Ye | Lin Zhang | Zhiwei Chen | Yanggeng Fu | Yingjie Wu | Yilei Wang

Achieving good performance on few-shot or zero-shot datasets has been a long-term challenge for NER. Conventional semantic transfer approaches for NER decrease model performance when the semantic distribution is quite different, especially in Chinese few-shot NER. Recently, prompt-tuning has been widely studied for low-resource tasks, but there is no effective prompt-tuning approach for Chinese few-shot NER. In this work, we propose a prompt-based Parent and Child BERT (PCBERT) for Chinese few-shot NER. We train an annotating model on high-resource datasets and then discover more implicit labels on low-resource datasets. We further design a label extension strategy to achieve label transfer from high-resource datasets. We evaluated our model on Weibo and three other sampled Chinese NER datasets, and the experimental results demonstrate our approach’s effectiveness in few-shot learning.

Label Smoothing for Text Mining
Peiyang Liu | Xiangyu Xi | Wei Ye | Shikun Zhang

Current text mining models are trained with 0-1 hard labels that indicate whether an instance belongs to a class, ignoring rich information about the degree of relevance. Soft labels, which assign each label a varying degree of relevance rather than a hard 0-1 value, are considered more suitable for describing instances. The process of generating soft labels from hard labels is defined as label smoothing (LS). Classical LS methods focus on universal data mining tasks and therefore ignore the valuable text features in text mining tasks. This paper presents a novel keyword-based LS method to automatically generate soft labels from hard labels by exploiting the relevance between labels and text instances. The generated soft labels are then incorporated into existing models as auxiliary targets during the training stage, improving the models without adding any extra parameters. Results of extensive experiments on text classification and large-scale text retrieval datasets demonstrate that soft labels generated by our method contain rich knowledge of text features, improving the performance of the corresponding models under both balanced and unbalanced settings.
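
To make the notion of label smoothing concrete, the sketch below shows the classical uniform scheme that the abstract contrasts against: a fraction `epsilon` of the probability mass is spread evenly over all classes. It is not the keyword-based method proposed in the paper; the function name and the value of `epsilon` are illustrative assumptions.

```python
from typing import List


def uniform_label_smoothing(hard: List[float], epsilon: float = 0.1) -> List[float]:
    """Classical uniform label smoothing (illustrative baseline only).

    hard    : one-hot (0-1) label vector over K classes
    epsilon : share of probability mass redistributed uniformly

    Each entry becomes (1 - epsilon) * hard_k + epsilon / K, so the
    result still sums to 1 when the input does.
    """
    k = len(hard)
    if k == 0:
        raise ValueError("label vector must not be empty")
    return [(1.0 - epsilon) * h + epsilon / k for h in hard]


# The gold class keeps most of the mass; every other class receives a small share.
soft = uniform_label_smoothing([0.0, 1.0, 0.0, 0.0], epsilon=0.1)
```

Because every non-gold class receives the same share regardless of the input text, this baseline cannot express that some labels are more relevant to an instance than others, which is the gap a text-aware smoothing method aims to fill.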

Diverse Multi-Answer Retrieval with Determinantal Point Processes
Poojitha Nandigam | Nikhil Rayaprolu | Manish Shrivastava

Questions provided to open-domain question answering systems are often ambiguous. Traditional QA systems that provide a single answer are incapable of answering ambiguous questions, since the question may be interpreted in several ways and may have multiple distinct answers. In this paper, we address multi-answer retrieval, which entails retrieving passages that can capture the majority of the diverse answers to the question. We propose a re-ranking based approach using Determinantal Point Processes with BERT-based kernels. Our method jointly considers query-passage relevance and passage-passage correlation to retrieve passages that are both query-relevant and diverse. Results demonstrate that our re-ranking technique outperforms the state-of-the-art method on the AmbigQA dataset.
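
The interplay of relevance and diversity in a determinantal point process can be sketched with a small greedy selection routine. This is a generic illustration under stated assumptions, not the authors' implementation: the kernel is built as L[i][j] = relevance[i] * similarity[i][j] * relevance[j], the relevance and similarity values are hand-written toy numbers rather than BERT outputs, and the greedy search is a common approximation to DPP MAP inference.

```python
from typing import List


def determinant(matrix: List[List[float]]) -> float:
    """Determinant via Gaussian elimination with partial pivoting."""
    n = len(matrix)
    a = [row[:] for row in matrix]
    det = 1.0
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[pivot][c]) < 1e-12:
            return 0.0
        if pivot != c:
            a[c], a[pivot] = a[pivot], a[c]
            det = -det
        det *= a[c][c]
        for r in range(c + 1, n):
            factor = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= factor * a[c][k]
    return det


def greedy_dpp_rerank(relevance: List[float],
                      similarity: List[List[float]],
                      k: int) -> List[int]:
    """Greedily pick up to k items that maximize det(L_S).

    relevance  : query-passage relevance scores (toy values here)
    similarity : symmetric passage-passage similarity, 1.0 on the diagonal
    """
    n = len(relevance)
    kernel = [[relevance[i] * similarity[i][j] * relevance[j] for j in range(n)]
              for i in range(n)]
    selected: List[int] = []
    for _ in range(min(k, n)):
        best, best_det = None, 0.0
        for i in range(n):
            if i in selected:
                continue
            idx = selected + [i]
            sub = [[kernel[r][c] for c in idx] for r in idx]
            d = determinant(sub)
            if d > best_det:
                best, best_det = i, d
        if best is None:  # no remaining item adds volume
            break
        selected.append(best)
    return selected


# Passages 0 and 1 are near-duplicates; passage 2 is less relevant but different.
toy_relevance = [0.9, 0.85, 0.6]
toy_similarity = [[1.0, 0.95, 0.1],
                  [0.95, 1.0, 0.1],
                  [0.1, 0.1, 1.0]]
picked = greedy_dpp_rerank(toy_relevance, toy_similarity, k=2)
```

In the toy data, a pure relevance ranking would return passages 0 and 1, whereas the determinant penalizes their redundancy and favors pairing passage 0 with passage 2.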

Improving Deep Embedded Clustering via Learning Cluster-level Representations
Qing Yin | Zhihua Wang | Yunya Song | Yida Xu | Shuai Niu | Liang Bai | Yike Guo | Xian Yang

Driven by recent advances in neural networks, various Deep Embedding Clustering (DEC) based short text clustering models are being developed. In these works, latent representation learning and text clustering are performed simultaneously. Although these methods are becoming increasingly popular, they use pure cluster-oriented objectives, which can produce meaningless representations. To alleviate this problem, several improvements have been developed to introduce additional learning objectives in the clustering process, such as models based on contrastive learning. However, existing efforts rely heavily on learning meaningful representations at the instance level. They have limited focus on learning global representations, which are necessary to capture the overall data structure at the cluster level. In this paper, we propose a novel DEC model, which we named the deep embedded clustering model with cluster-level representation learning (DECCRL) to jointly learn cluster and instance level representations. Here, we extend the embedded topic modelling approach to introduce reconstruction constraints to help learn cluster-level representations. Experimental results on real-world short text datasets demonstrate that our model produces meaningful clusters.

Decoupling Mixture-of-Graphs: Unseen Relational Learning for Knowledge Graph Completion by Fusing Ontology and Textual Experts
Ran Song | Shizhu He | Suncong Zheng | Shengxiang Gao | Kang Liu | Zhengtao Yu | Jun Zhao

Knowledge Graph Embedding (KGE) has been proposed and successfully applied to Knowledge Graph Completion (KGC). However, the classic KGE paradigm often fails to represent unseen relations. Previous studies mainly utilize the textual descriptions of relations and their neighbor relations to represent unseen relations. In fact, the semantics of a relation can be expressed by three kinds of graphs (a factual graph, an ontology graph, and a textual description graph), which can complement each other. A more common scenario in the real world is that seen and unseen relations appear at the same time. In this setting, the training set (only seen relations) and testing set (both seen and unseen relations) have different distributions, and this train-test inconsistency makes KGE methods easily overfit on seen relations and underperform on unseen relations. In this paper, we propose decoupling mixture-of-graph experts (DMoG) for unseen relation learning, which represents the unseen relations in the factual graph by fusing ontology and textual graphs, and decouples the fusing space and reasoning space to alleviate overfitting on seen relations. Experiments on two unseen-only public datasets and a mixture dataset verify the effectiveness of the proposed method, which improves on the state-of-the-art methods by 6.84% in Hits@10 on average.
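
For readers unfamiliar with the reported metric, Hits@10 is the fraction of test queries for which the correct entity is ranked among the top 10 candidates. A minimal sketch follows; the function name and the example ranks are illustrative and do not come from the paper.

```python
from typing import List


def hits_at_k(ranks: List[int], k: int = 10) -> float:
    """Fraction of queries whose correct answer has rank <= k.

    ranks : 1-based rank of the correct entity for each test query
    """
    if not ranks:
        raise ValueError("ranks must not be empty")
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Three of the five correct entities fall within the top 10.
example_ranks = [1, 3, 12, 10, 50]
score = hits_at_k(example_ranks, k=10)  # 3 / 5
```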

CETA: A Consensus Enhanced Training Approach for Denoising in Distantly Supervised Relation Extraction
Ruri Liu | Shasha Mo | Jianwei Niu | Shengda Fan

Distantly supervised relation extraction (DSRE) aims to extract relational facts from texts but suffers from noisy instances. Existing methods usually select reliable sentences by relying on potentially noisy labels, resulting in wrongly selecting many noisy training instances or underutilizing a large amount of valuable training data. This paper proposes a sentence-level DSRE method that goes beyond typical instance selection approaches by preventing samples from falling into the wrong classification region of the feature space. Specifically, a theorem for denoising and the corresponding implementation, named Consensus Enhanced Training Approach (CETA), are proposed in this paper. By training the model with CETA, samples of different classes are separated, and samples of the same class are closely clustered in the feature space. Thus the model can easily establish a robust classification boundary that prevents noisy labels from biasing wrongly labeled samples into the wrong classification region. This process is achieved by enhancing the classification consensus between two discrepant classifiers and does not depend on any potentially noisy labels, thus avoiding the above two limitations. Extensive experiments on widely-used benchmarks have demonstrated that CETA significantly outperforms the previous methods and achieves new state-of-the-art results.

MedDistant19: Towards an Accurate Benchmark for Broad-Coverage Biomedical Relation Extraction
Saadullah Amin | Pasquale Minervini | David Chang | Pontus Stenetorp | Guenter Neumann

Relation extraction in the biomedical domain is challenging due to the lack of labeled data and high annotation costs, needing domain experts. Distant supervision is commonly used to tackle the scarcity of annotated data by automatically pairing knowledge graph relationships with raw texts. Such a pipeline is prone to noise and has added challenges to scale for covering a large number of biomedical concepts. We investigated existing broad-coverage distantly supervised biomedical relation extraction benchmarks and found a significant overlap between training and test relationships ranging from 26% to 86%. Furthermore, we noticed several inconsistencies in the data construction process of these benchmarks, and where there is no train-test leakage, the focus is on interactions between narrower entity types. This work presents a more accurate benchmark MedDistant19 for broad-coverage distantly supervised biomedical relation extraction that addresses these shortcomings and is obtained by aligning the MEDLINE abstracts with the widely used SNOMED Clinical Terms knowledge base. Lacking thorough evaluation with domain-specific language models, we also conduct experiments validating general domain relation extraction findings to biomedical relation extraction.

Decorrelate Irrelevant, Purify Relevant: Overcome Textual Spurious Correlations from a Feature Perspective
Shihan Dou | Rui Zheng | Ting Wu | SongYang Gao | Junjie Shan | Qi Zhang | Yueming Wu | Xuanjing Huang

Natural language understanding (NLU) models tend to rely on spurious correlations (i.e., dataset bias) to achieve high performance on in-distribution datasets but poor performance on out-of-distribution ones. Most of the existing debiasing methods often identify and weaken these samples with biased features (i.e., superficial surface features that cause such spurious correlations). However, down-weighting these samples obstructs the model in learning from the non-biased parts of these samples. To tackle this challenge, in this paper, we propose to eliminate spurious correlations in a fine-grained manner from a feature space perspective. Specifically, we introduce Random Fourier Features and weighted re-sampling to decorrelate the dependencies between features to mitigate spurious correlations. After obtaining decorrelated features, we further design a mutual-information-based method to purify them, which forces the model to learn features that are more relevant to tasks. Extensive experiments on two well-studied NLU tasks demonstrate that our method is superior to other comparative approaches.

Event Causality Identification via Derivative Prompt Joint Learning
Shirong Shen | Heng Zhou | Tongtong Wu | Guilin Qi

This paper studies event causality identification, which aims at predicting the causality relation for a pair of events in a sentence. Regarding event causality identification as a supervised classification task, most existing methods suffer from the problem of insufficient annotated data. In this paper, we propose a new derivative prompt joint learning model for event causality identification, which leverages potential causal knowledge in the pre-trained language model to tackle the data scarcity problem. Specifically, rather than external data or knowledge augmentation, we derive two relevant prompt tasks from event causality identification to enhance the model’s ability to identify explicit and implicit causality. We evaluate our model on two benchmark datasets and the results show that our model has great advantages over previous methods.

Event Causality Extraction with Event Argument Correlations
Shiyao Cui | Jiawei Sheng | Xin Cong | Quangang Li | Tingwen Liu | Jinqiao Shi

Event Causality Identification (ECI), which aims to detect whether a causality relation exists between two given textual events, is an important task for event causality understanding. However, the ECI task ignores crucial event structure and cause-effect causality component information, which limits its usefulness for downstream applications. In this paper, we introduce a novel task, namely Event Causality Extraction (ECE), aiming to extract the cause-effect event causality pairs with their structured event information from plain texts. The ECE task is more challenging since each event can contain multiple event arguments, so that fine-grained correlations between events must be considered to decide the cause-effect event pair. Hence, we propose a method with a dual grid tagging scheme to capture the intra- and inter-event argument correlations for ECE. Further, we devise an event type-enhanced model architecture to realize the dual grid tagging scheme. Experiments demonstrate the effectiveness of our method, and extensive analyses point out several future directions for ECE.

SCL-RAI: Span-based Contrastive Learning with Retrieval Augmented Inference for Unlabeled Entity Problem in NER
Shuzheng Si | Shuang Zeng | Jiaxing Lin | Baobao Chang

Unlabeled Entity Problem (UEP) in Named Entity Recognition (NER) datasets seriously hinders the improvement of NER performance. This paper proposes SCL-RAI to cope with this problem. Firstly, we decrease the distance of span representations with the same label while increasing it for different ones via span-based contrastive learning, which relieves the ambiguity among entities and improves the robustness of the model over unlabeled entities. Then we propose retrieval augmented inference to mitigate the decision boundary shifting problem. Our method significantly outperforms the previous SOTA method by 4.21% and 8.64% F1-score on two real-world datasets.

A Relation Extraction Dataset for Knowledge Extraction from Web Tables
Siffi Singh | Alham Fikri Aji | Gaurav Singh | Christos Christodoulopoulos

Relational web-tables are significant sources of structural information that are widely used for relation extraction and population of facts into knowledge graphs. To transform the web-table data into knowledge, we need to identify the relations that exist between column pairs. Currently, there are only a handful of publicly available datasets with relations annotated against natural web-tables. Most datasets are constructed using synthetic tables that lack valuable metadata information, or are too limited in size to be considered a challenging evaluation set. In this paper, we present REDTab, the largest natural-table relation extraction dataset. We have annotated ~9K tables and ~22K column pairs using crowd-sourced annotators from MTurk, which is 50x more column pairs than the existing human-annotated benchmark. Our test set is specially designed to be challenging, as observed in our experimental results using TaBERT. We publicly release REDTab as a benchmark for evaluation in relation extraction.

Automatic Keyphrase Generation by Incorporating Dual Copy Mechanisms in Sequence-to-Sequence Learning
Siyu Wang | Jianhui Jiang | Yao Huang | Yin Wang

Keyphrase generation is a challenging task that aims to generate a set of keyphrases for a piece of text. Many previous studies generate keyphrases with a sequence-to-sequence model and introduce a copy mechanism to achieve good results. However, we observed that most keyphrases are composed of a few important words (seed words) in the source text, and if these words can be identified accurately and copied to create more keyphrases, the performance of the model might be improved. To address this challenge, we propose a DualCopyNet model, which introduces an additional sequence labeling layer for identifying seed words, and further copies these words to generate new keyphrases through dual copy mechanisms. Experimental results demonstrate that our model outperforms the baseline models and achieves a clear performance improvement.

Dependency-aware Prototype Learning for Few-shot Relation Classification
Tianshu Yu | Min Yang | Xiaoyan Zhao

Few-shot relation classification aims to classify the relation type between two given entities in a sentence by training with a few labeled instances for each relation. However, most existing models fail to distinguish multiple relations that co-exist in one sentence. This paper presents a novel dependency-aware prototype learning (DAPL) method for few-shot relation classification. Concretely, we utilize dependency trees and shortest dependency paths (SDP) as structural information to complement the contextualized representations of input sentences, using the dependency-aware embedding as attention input to learn attentive sentence representations. In addition, we introduce a gate-controlled update mechanism to update the dependency-aware representations according to the output of each network layer. Extensive experiments on the FewRel dataset show that DAPL achieves substantially better performance than strong baselines. For reproducibility, we will release our code and data upon the publication of this paper.

MECI: A Multilingual Dataset for Event Causality Identification
Viet Dac Lai | Amir Pouran Ben Veyseh | Minh Van Nguyen | Franck Dernoncourt | Thien Huu Nguyen

Event Causality Identification (ECI) is the task of detecting causal relations between events mentioned in a text. Although this task has been extensively studied for English materials, it is under-explored for many other languages. A major reason for this issue is the lack of multilingual datasets that provide consistent annotations for event causality relations in multiple non-English languages. To address this issue, we introduce a new multilingual dataset for ECI, called MECI. The dataset employs consistent annotation guidelines for five typologically different languages, i.e., English, Danish, Spanish, Turkish, and Urdu. Our dataset thus enables a new research direction on cross-lingual transfer learning for ECI. Our extensive experiments demonstrate the high quality of MECI, which can provide ample research challenges and directions for future research. We will publicly release MECI to promote research on multilingual ECI.

Method Entity Extraction from Biomedical Texts
Waqar Bin Kalim | Robert E. Mercer

In the field of Natural Language Processing (NLP), extracting method entities from biomedical text has been a challenging task. Scientific research papers commonly consist of complex keywords and domain-specific terminologies, and new terminologies are continuously appearing. In this research, we find method terminologies in biomedical text using both rule-based and machine learning techniques. We first use linguistic features to extract method sentence candidates from a large corpus of biomedical text. Then, we construct a silver standard biomedical corpus composed of these sentences. With a rule-based method that makes use of the Stanza dependency parsing module, we label the method entities in these sentences. Using this silver standard corpus we train two machine learning algorithms to automatically extract method entities from biomedical text. Our results show that it is possible to develop machine learning models that can automatically extract method entities to a reasonable accuracy without the need for a gold standard dataset.

Optimal Partial Transport Based Sentence Selection for Long-form Document Matching
Weijie Yu | Liang Pang | Jun Xu | Bing Su | Zhenhua Dong | Ji-Rong Wen

One typical approach to long-form document matching is to first conduct alignment between cross-document sentence pairs, and then aggregate all of the sentence-level matching signals. However, this approach can be problematic because the alignment between documents is partial: even when two documents as a whole are well matched, most of the sentences could still be dissimilar. Those dissimilar sentences lead to spurious sentence-level matching signals which may overwhelm the real ones, increasing the difficulty of learning the matching function. Therefore, accurately selecting the key sentences for document matching is becoming a challenging issue. To address the issue, we propose a novel matching approach that equips existing document matching models with an Optimal Partial Transport (OPT) based component, namely OPT-Match, which selects the sentences that play a major role in matching. Thanks to the partial transport properties of OPT, the selected key sentences can not only effectively enhance the matching accuracy, but also be interpreted as the rationales for the matching results. Extensive experiments on four publicly available datasets demonstrated that existing methods equipped with OPT-Match consistently outperformed the corresponding underlying methods. Evaluations also showed that the key sentences selected by OPT-Match were consistent with human-provided rationales.

LightNER: A Lightweight Tuning Paradigm for Low-resource NER via Pluggable Prompting
Xiang Chen | Lei Li | Shumin Deng | Chuanqi Tan | Changliang Xu | Fei Huang | Luo Si | Huajun Chen | Ningyu Zhang

Most NER methods rely on extensive labeled data for model training and therefore struggle in low-resource scenarios with limited training data. Existing dominant approaches usually suffer from the challenge that the target domain has different label sets compared with a resource-rich source domain, which can be summarized as class transfer and domain transfer. In this paper, we propose a lightweight tuning paradigm for low-resource NER via pluggable prompting (LightNER). Specifically, we construct a unified learnable verbalizer of entity categories to generate the entity span sequence and entity categories without any label-specific classifiers, thus addressing the class transfer issue. We further propose a pluggable guidance module that incorporates learnable parameters into the self-attention layer as guidance, which can re-modulate the attention and adapt pre-trained weights. Note that we tune only the inserted modules while keeping all parameters of the pre-trained language model fixed, thus making our approach lightweight and flexible for low-resource scenarios and better able to transfer knowledge across domains. Experimental results show that LightNER can obtain comparable performance in the standard supervised setting and outperform strong baselines in low-resource settings.

Cross-modal Contrastive Attention Model for Medical Report Generation
Xiao Song | Xiaodan Zhang | Junzhong Ji | Ying Liu | Pengxu Wei

Medical report automatic generation has gained increasing interest recently as a way to help radiologists write reports more efficiently. However, this image-to-text task is rather challenging due to the typical data biases: 1) Normal physiological structures dominate the images, with only tiny abnormalities; 2) Normal descriptions accordingly dominate the reports. Existing methods have attempted to solve these problems, but they neglect to exploit useful information from similar historical cases. In this paper, we propose a novel Cross-modal Contrastive Attention (CMCA) model to capture both visual and semantic information from similar cases, with mainly two modules: a Visual Contrastive Attention Module for refining the unique abnormal regions compared to the retrieved case images; a Cross-modal Attention Module for matching the positive semantic information from the case reports. Extensive experiments on two widely-used benchmarks, IU X-Ray and MIMIC-CXR, demonstrate that the proposed model outperforms the state-of-the-art methods on almost all metrics. Further analyses also validate that our proposed model is able to improve the reports with more accurate abnormal findings and richer descriptions.

Domain-Specific NER via Retrieving Correlated Samples
Xin Zhang | Yong Jiang | Xiaobin Wang | Xuming Hu | Yueheng Sun | Pengjun Xie | Meishan Zhang

Successful machine-learning-based Named Entity Recognition models can fail on texts from some special domains, for instance Chinese addresses and e-commerce titles, which require adequate background knowledge. Such texts are also difficult for human annotators. In fact, we can obtain potentially helpful information from correlated texts, which share some common entities, to help text understanding. One can then easily reason out the correct answer by referencing correlated samples. In this paper, we suggest enhancing NER models with correlated samples. We draw correlated samples with the sparse BM25 retriever from large-scale in-domain unlabeled data. To explicitly simulate the human reasoning process, we perform training-free entity type calibration by majority voting. To capture correlation features in the training stage, we suggest modeling correlated samples with a transformer-based multi-instance cross-encoder. Empirical results on datasets of the above two domains show the efficacy of our methods.
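
The retrieve-then-vote idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the Okapi BM25 scoring below is the textbook formula with common default parameters, and the function names, the toy address data, and the voting rule (the original prediction plus one vote per retrieved sample that contains the same mention) are assumptions made for illustration.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against the query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in set(query_tokens):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def retrieve(query_tokens, docs_tokens, top_k=3):
    """Return indices of the top_k documents with a positive BM25 score."""
    scores = bm25_scores(query_tokens, docs_tokens)
    ranked = sorted(range(len(docs_tokens)), key=lambda i: scores[i], reverse=True)
    return [i for i in ranked[:top_k] if scores[i] > 0]

def calibrate_type(mention, predicted_type, neighbor_predictions):
    """Majority vote over the types predicted for the same mention.

    neighbor_predictions is a list of {mention: type} dicts, one per
    retrieved sample (a hypothetical data layout for this sketch).
    """
    votes = Counter([predicted_type])
    for preds in neighbor_predictions:
        if mention in preds:
            votes[preds[mention]] += 1
    return votes.most_common(1)[0][0]

# Toy unlabeled pool and query (hypothetical data).
pool = [
    ["north", "road", "hangzhou"],
    ["west", "lake", "road", "hangzhou"],
    ["apple", "phone", "case"],
]
neighbors = retrieve(["west", "lake", "road"], pool, top_k=2)
label = calibrate_type(
    "west lake", "ORG", [{"west lake": "LOC"}, {"west lake": "LOC"}]
)
```

In this toy run the second pool entry ranks first because it shares all three query terms, and the vote overrides the initial "ORG" prediction because two retrieved samples agree on "LOC".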

Type-enriched Hierarchical Contrastive Strategy for Fine-Grained Entity Typing
Xinyu Zuo | Haijin Liang | Ning Jing | Shuang Zeng | Zhou Fang | Yu Luo

Fine-grained entity typing (FET) aims to deduce specific semantic types of the entity mentions in the text. Modern methods for FET mainly focus on learning what a certain type looks like, and few works directly model the type differences, that is, let models know the extent to which one type differs from others. To alleviate this problem, we propose a type-enriched hierarchical contrastive strategy for FET. Our method can directly model the differences between hierarchical types and improve the ability to distinguish multi-grained similar types. On the one hand, we embed type into entity contexts to make type information directly perceptible. On the other hand, we design a constrained contrastive strategy on the hierarchical structure to directly model the type differences, which can simultaneously perceive the distinguishability between types at different granularity. Experimental results on three benchmarks, BBN, OntoNotes, and FIGER, show that our method achieves significant performance on FET by effectively modeling type differences.

Document-Level Relation Extraction via Pair-Aware and Entity-Enhanced Representation Learning
Xiusheng Huang | Hang Yang | Yubo Chen | Jun Zhao | Kang Liu | Weijian Sun | Zuyu Zhao

Document-level relation extraction aims to recognize relations among multiple entity pairs from a whole document. Recent methods achieve considerable performance but still suffer from two challenges: a) the relational entity pairs are sparse, b) the representation of entity pairs is insufficient. In this paper, we propose the Pair-Aware and Entity-Enhanced (PAEE) model to solve these two challenges. For the first challenge, we design a Pair-Aware Representation module to predict potential relational entity pairs, which constrains the relation extraction to the predicted subset of entity pairs rather than all pairs; for the second, we introduce an Entity-Enhanced Representation module to assemble directional entity pairs and obtain a holistic understanding of the entire document. Experimental results show that our approach can obtain state-of-the-art performance on four benchmark datasets: DocRED, DWIE, CDR and GDA.

Improving Zero-Shot Entity Linking Candidate Generation with Ultra-Fine Entity Type Information
Xuhui Sui | Ying Zhang | Kehui Song | Baohang Zhou | Guoqing Zhao | Xin Wei | Xiaojie Yuan

Entity linking, which aims at aligning ambiguous entity mentions to their referent entities in a knowledge base, plays a key role in multiple natural language processing tasks. Recently, the zero-shot entity linking task, which links mentions to unseen entities to challenge generalization ability, has become a research hotspot. For this task, the training set and test set are from different domains, and thus entity linking models tend to overfit because they memorize the properties of entities that appear frequently in the training set. We argue that general ultra-fine-grained type information can help the linking models to learn contextual commonality and improve their generalization ability to tackle the overfitting problem. However, in the zero-shot entity linking setting, no type information is available and entities are only identified by textual descriptions. Thus, we first extract the ultra-fine entity type information from the entity textual descriptions. Then, we propose a hierarchical multi-task model to improve the high-level zero-shot entity linking candidate generation task by utilizing the entity typing task as an auxiliary low-level task, which introduces extracted ultra-fine type information into the candidate generation task. Experimental results demonstrate the effectiveness of utilizing the ultra-fine entity type information, and our proposed method achieves state-of-the-art performance.

CofeNet: Context and Former-Label Enhanced Net for Complicated Quotation Extraction
Yequan Wang | Xiang Li | Aixin Sun | Xuying Meng | Huaming Liao | Jiafeng Guo

Quotation extraction aims to extract quotations from written text. There are three components in a quotation: source refers to the holder of the quotation, cue is the trigger word(s), and content is the main body. Existing solutions for quotation extraction mainly utilize rule-based approaches and sequence labeling models. While rule-based approaches often lead to low recall, sequence labeling models cannot handle quotations with complicated structures well. In this paper, we propose the Context and Former-Label Enhanced Net (CofeNet) for quotation extraction. CofeNet is able to extract complicated quotations with components of variable lengths and complicated structures. On two public datasets and one proprietary dataset, we show that CofeNet achieves state-of-the-art performance on complicated quotation extraction.

Supporting Medical Relation Extraction via Causality-Pruned Semantic Dependency Forest
Yifan Jin | Jiangmeng Li | Zheng Lian | Chengbo Jiao | Xiaohui Hu

The Medical Relation Extraction (MRE) task aims to extract relations between entities in medical texts. Traditional relation extraction methods achieve impressive success by exploring syntactic information, e.g., the dependency tree. However, the quality of the 1-best dependency tree produced for medical texts by an out-of-domain parser is relatively limited, so the performance of medical relation extraction methods may degrade. To this end, we propose a method to jointly model semantic and syntactic information from medical texts based on causal explanation theory. We generate dependency forests consisting of the semantic-embedded 1-best dependency tree. Then, a task-specific causal explainer is adopted to prune the dependency forests, which are further fed into a designed graph convolutional network to learn the corresponding representation for the downstream task. Empirically, various comparisons on benchmark medical datasets demonstrate the effectiveness of our model.

Aspect-based Sentiment Analysis as Machine Reading Comprehension
Yifei Yang | Hai Zhao

Existing studies typically handle aspect-based sentiment analysis by stacking multiple neural modules, which inevitably results in severe error propagation. Instead, we propose a novel end-to-end framework, MRCOOL: MRC-PrOmpt mOdeL framework, where numerous sentiment aspects are elicited by a machine reading comprehension (MRC) model and their corresponding sentiment polarities are classified in a prompt learning way. Experiments show that our end-to-end framework consistently yields promising results on widely-used benchmark datasets, significantly outperforming existing state-of-the-art models or achieving comparable performance.

Nested Named Entity Recognition as Corpus Aware Holistic Structure Parsing
Yifei Yang | Zuchao Li | Hai Zhao

As a fundamental natural language processing task and one of the core knowledge extraction techniques, named entity recognition (NER) is widely used to extract information from texts for downstream tasks. Nested NER is a branch of NER in which the named entities (NEs) are nested within each other. However, most previous studies on nested NER apply a linear structure to model nested NEs, which are actually organized in a hierarchical structure. To address this mismatch, this work models all the nested NEs in a sentence as a holistic structure and proposes a holistic structure parsing algorithm to disclose all the NEs at once. Besides, there is currently no research on applying corpus-level information to NER. To make up for the loss of this information, we introduce Point-wise Mutual Information (PMI) and other frequency features from corpus-aware statistics for even better performance by holistic modeling from sentence-level to corpus-level. Experiments show that our model yields promising results on widely-used benchmarks which approach or even achieve state-of-the-art. Further empirical studies show that our proposed corpus-aware features can substantially improve NER domain adaptation, which demonstrates the surprising advantage of our proposed corpus-level holistic structure modeling.
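
As a small illustration of the corpus-level statistic named in this abstract, the sketch below estimates PMI(x, y) = log2(p(x, y) / (p(x) p(y))) for adjacent token pairs from raw counts. How the paper turns such values into model features is not stated here; the function name, the restriction to adjacent pairs, and the toy corpus are assumptions.

```python
import math
from collections import Counter

def pmi_table(sentences):
    """Point-wise mutual information of adjacent token pairs.

    Probabilities are maximum-likelihood estimates from unigram and
    bigram counts over the whole corpus (no smoothing).
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    table = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        table[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return table

# Toy corpus (hypothetical data).
corpus = [
    ["new", "york", "city"],
    ["new", "york", "times"],
    ["the", "city"],
]
pmi = pmi_table(corpus)
```

On this toy corpus the pair ("new", "york") receives a higher PMI than ("york", "city"), reflecting that the first pair co-occurs more often than its unigram frequencies alone would predict.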

DESED: Dialogue-based Explanation for Sentence-level Event Detection
Yinyi Wei | Shuaipeng Liu | Jianwei Lv | Xiangyu Xi | Hailei Yan | Wei Ye | Tong Mo | Fan Yang | Guanglu Wan

Many recent sentence-level event detection efforts focus on enriching sentence semantics, e.g., via multi-task or prompt-based learning. Despite the promising performance, these methods commonly depend on label-extensive manual annotations or require domain expertise to design sophisticated templates and rules. This paper proposes a new paradigm, named dialogue-based explanation, to enhance sentence semantics for event detection. By saying dialogue-based explanation of an event, we mean explaining it through a consistent information-intensive dialogue, with the original event description as the start utterance. We propose three simple dialogue generation methods, whose outputs are then fed into a hybrid attention mechanism to characterize the complementary event semantics. Extensive experimental results on two event detection datasets verify the effectiveness of our method and suggest promising research opportunities in the dialogue-based explanation paradigm.

Data Augmentation for Few-Shot Knowledge Graph Completion from Hierarchical Perspective
Yuanzhou Yao | Zhao Zhang | Yongjun Xu | Chao Li

Few-shot knowledge graph completion (FKGC) has become a new research focus in the field of knowledge graphs in recent years, which aims to predict the missing links for relations that only have a few associative triples. Existing models attempt to solve the problem via learning entity and relation representations. However, the limited training data severely hinders the performance of existing models. To this end, we propose to solve the FKGC problem with the data augmentation technique. Specifically, we perform data augmentation from two perspectives, i.e., inter-task view and intra-task view. The former generates new tasks for FKGC, while the latter enriches the support or query set for an individual task. It is worth noting that the proposed framework can be applied to a number of existing FKGC models. Experimental evaluation on two public datasets indicates our model is capable of achieving substantial improvements over baselines.

CLIO: Role-interactive Multi-event Head Attention Network for Document-level Event Extraction
Yubing Ren | Yanan Cao | Fang Fang | Ping Guo | Zheng Lin | Wei Ma | Yi Liu

Transforming the large amounts of unstructured text on the Internet into structured event knowledge is a critical, yet unsolved goal of NLP, especially when addressing document-level text. Existing methods struggle in Document-level Event Extraction (DEE) due to its two intrinsic challenges: (a) Nested arguments, which means one argument is the sub-string of another one. (b) Multiple events, which indicates we should identify multiple events and assemble the arguments for them. In this paper, we propose a role-interactive multi-event head attention network (CLIO) to solve these two challenges jointly. The key idea is to map different events to multiple subspaces (i.e. multi-event head). In each event subspace, we draw the semantic representation of each role closer to its corresponding arguments, then we determine whether the current event exists. To further optimize event representation, we propose an event representation enhancing strategy to regularize pre-trained embedding space to be more isotropic. Our experiments on two widely used DEE datasets show that CLIO achieves consistent improvements over previous methods.

COPNER: Contrastive Learning with Prompt Guiding for Few-shot Named Entity Recognition
Yucheng Huang | Kai He | Yige Wang | Xianli Zhang | Tieliang Gong | Rui Mao | Chen Li

Distance metric learning has become a popular solution for few-shot Named Entity Recognition (NER). The typical setup aims to learn a similarity metric for measuring the semantic similarity between test samples and referents, where each referent represents an entity class. The effect of this setup may, however, be compromised for two reasons. First, there is typically limited optimization exerted on the representations of entity tokens after they are initialized by pre-trained language models. Second, the referents may be far from representing corresponding entity classes due to the label scarcity in the few-shot setting. To address these challenges, we propose a novel approach named COntrastive learning with Prompt guiding for few-shot NER (COPNER). We introduce a novel prompt composed of class-specific words to COPNER to serve as 1) supervision signals for conducting contrastive learning to optimize token representations; 2) metric referents for distance-metric inference on test samples. Experimental results demonstrate that COPNER outperforms state-of-the-art models by a significant margin in most cases. Moreover, COPNER shows great potential in the zero-shot setting.

Few Clean Instances Help Denoising Distant Supervision
Yufang Liu | Ziyin Huang | Yijun Wang | Changzhi Sun | Man Lan | Yuanbin Wu | Xiaofeng Mou | Ding Wang

Existing distantly supervised relation extractors usually rely on noisy data for both model training and evaluation, which may lead to garbage-in-garbage-out systems. To alleviate the problem, we study whether a small clean dataset could help improve the quality of distantly supervised models. We show that besides getting a more convincing evaluation of models, a small clean dataset also helps us to build more robust denoising models. Specifically, we propose a new criterion for clean instance selection based on influence functions. It collects sample-level evidence for recognizing good instances (which is more informative than loss-level evidence). We also propose a teacher-student mechanism for controlling purity of intermediate results when bootstrapping the clean set. The whole approach is model-agnostic and demonstrates strong performances on both denoising real (NYT) and synthetic noisy datasets.

SEE-Few: Seed, Expand and Entail for Few-shot Named Entity Recognition
Zeng Yang | Linhai Zhang | Deyu Zhou

Few-shot named entity recognition (NER) aims at identifying named entities based on only a few labeled instances. Current few-shot NER methods focus on leveraging existing datasets in rich-resource domains, which might fail in a training-from-scratch setting where no source-domain data is used. To tackle the training-from-scratch setting, it is crucial to make full use of the annotation information (the boundaries and entity types). Therefore, in this paper, we propose a novel multi-task (Seed, Expand and Entail) learning framework, SEE-Few, for few-shot NER without using source-domain data. The seeding and expanding modules are responsible for providing candidate spans that are as accurate as possible for the entailing module. The entailing module reformulates span classification as a textual entailment task, leveraging both the contextual clues and entity type information. All three modules share the same text encoder and are jointly learned. Experimental results on several benchmark datasets under the training-from-scratch setting show that the proposed method outperforms several state-of-the-art few-shot NER methods by a large margin. Our code is available at

Ruleformer: Context-aware Rule Mining over Knowledge Graph
Zezhong Xu | Peng Ye | Hui Chen | Meng Zhao | Huajun Chen | Wen Zhang

Rule mining is an effective approach for reasoning over knowledge graphs (KGs). Existing works mainly concentrate on mining rules. However, several rules might be applicable when reasoning about one relation, and how to select appropriate rules for the completion of different triples has not been discussed. In this paper, we propose to take the context information into consideration, which helps select suitable rules for the inference tasks. Based on this idea, we propose a transformer-based rule mining approach, Ruleformer. It consists of two blocks: 1) an encoder that extracts the context information from the subgraph of head entities with a modified attention mechanism, and 2) a decoder that aggregates the subgraph information from the encoder output and generates the probability of relations for each step of reasoning. The basic idea behind Ruleformer is to regard the rule mining process as a sequence-to-sequence task. To make the subgraph a sequence input to the encoder and retain the graph structure, we devise a relational attention mechanism in the Transformer. The experimental results show the necessity of considering this information in the rule mining task and the effectiveness of our model.

Are People Located in the Places They Mention in Their Tweets? A Multimodal Approach
Zhaomin Xiao | Eduardo Blanco

This paper introduces the problem of determining whether people are located in the places they mention in their tweets. In particular, we investigate the role of text and images to solve this challenging problem. We present a new corpus of tweets that contain both text and images. Our analyses show that this problem is multimodal at its core: human judgments depend on whether annotators have access to the text, the image, or both. Experimental results show that a neural architecture that combines both modalities yields better results. We also conduct an error analysis to provide insights into why and when each modality is beneficial.

Multi-modal Contrastive Representation Learning for Entity Alignment
Zhenxi Lin | Ziheng Zhang | Meng Wang | Yinghui Shi | Xian Wu | Yefeng Zheng

Multi-modal entity alignment aims to identify equivalent entities between two different multi-modal knowledge graphs, which consist of structural triples and images associated with entities. Most previous works focus on how to utilize and encode information from different modalities, while it is not trivial to leverage multi-modal knowledge in entity alignment because of the modality heterogeneity. In this paper, we propose MCLEA, a Multi-modal Contrastive Learning based Entity Alignment model, to obtain effective joint representations for multi-modal entity alignment. Different from previous works, MCLEA considers task-oriented modality and models the inter-modal relationships for each entity representation. In particular, MCLEA firstly learns multiple individual representations from multiple modalities, and then performs contrastive learning to jointly model intra-modal and inter-modal interactions. Extensive experimental results show that MCLEA outperforms state-of-the-art baselines on public datasets under both supervised and unsupervised settings.

Nonparametric Forest-Structured Neural Topic Modeling
Zhihong Zhang | Xuewen Zhang | Yanghui Rao

Neural topic models have been widely used in discovering the latent semantics from a corpus. Recently, there have been several studies on hierarchical neural topic models, since the relationships among topics are valuable for data analysis and exploration. However, the existing hierarchical neural topic models are limited to generating a single topic tree. In this study, we present a nonparametric forest-structured neural topic model by first applying the self-attention mechanism to capture parent-child topic relationships, and then building a sparse directed acyclic graph to form a topic forest. Experiments indicate that our model can automatically learn a forest-structured topic hierarchy with indefinite numbers of trees and leaves, and significantly outperforms the baseline models on topic hierarchical rationality and affinity.

KGE-CL: Contrastive Learning of Tensor Decomposition Based Knowledge Graph Embeddings
Zhiping Luo | Wentao Xu | Weiqing Liu | Jiang Bian | Jian Yin | Tie-Yan Liu

Learning the embeddings of knowledge graphs (KG) is vital in artificial intelligence, and can benefit various downstream applications, such as recommendation and question answering. In recent years, much research effort has been devoted to knowledge graph embedding (KGE). However, most previous KGE methods ignore the semantic similarity between the related entities and entity-relation couples in different triples, since they optimize each triple separately with the scoring function. To address this problem, we propose a simple yet efficient contrastive learning framework for tensor decomposition based (TDB) KGE, which can shorten the semantic distance of the related entities and entity-relation couples in different triples and thus improve the performance of KGE. We evaluate our proposed method on three standard KGE datasets: WN18RR, FB15k-237 and YAGO3-10. Our method can yield some new state-of-the-art results, achieving 51.2% MRR and 46.8% Hits@1 on the WN18RR dataset, 37.8% MRR and 28.6% Hits@1 on the FB15k-237 dataset, and 59.1% MRR and 51.8% Hits@1 on the YAGO3-10 dataset.
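
For readers unfamiliar with the two metrics reported in this abstract, the sketch below shows how mean reciprocal rank (MRR) and Hits@k are conventionally computed from the rank of the correct entity in each test query. The rank list is hypothetical and unrelated to the paper's results, and details of the paper's evaluation protocol (for example, filtered versus raw ranking) are not covered here.

```python
def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all test queries (ranks are 1-based)."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at_k(ranks, k):
    """Fraction of test queries whose correct entity is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Hypothetical ranks of the correct entity for four test triples.
example_ranks = [1, 2, 4, 10]
mrr = mean_reciprocal_rank(example_ranks)   # (1 + 1/2 + 1/4 + 1/10) / 4
hits1 = hits_at_k(example_ranks, 1)
hits3 = hits_at_k(example_ranks, 3)
```

Both metrics lie in (0, 1] and are usually reported as percentages, as in the abstract above.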

A Coarse-to-fine Cascaded Evidence-Distillation Neural Network for Explainable Fake News Detection
Zhiwei Yang | Jing Ma | Hechang Chen | Hongzhan Lin | Ziyang Luo | Yi Chang

Existing fake news detection methods aim to classify a piece of news as true or false and provide veracity explanations, achieving remarkable performance. However, they often tailor automated solutions to manually fact-checked reports, suffering from limited news coverage and debunking delays. When a piece of news has not yet been fact-checked or debunked, a certain number of relevant raw reports are usually disseminated on various media outlets, containing the wisdom of crowds to verify the news claim and explain its verdict. In this paper, we propose a novel Coarse-to-fine Cascaded Evidence-Distillation (CofCED) neural network for explainable fake news detection based on such raw reports, alleviating the dependency on fact-checked ones. Specifically, we first utilize a hierarchical encoder for web text representation, and then develop two cascaded selectors to select the most explainable sentences for verdicts on top of the selected top-K reports in a coarse-to-fine manner. Besides, we construct two explainable fake news datasets, which are publicly available. Experimental results demonstrate that our model significantly outperforms state-of-the-art detection baselines and generates high-quality explanations from diverse evaluation perspectives.

Document-level Event Factuality Identification via Machine Reading Comprehension Frameworks with Transfer Learning
Zhong Qian | Heng Zhang | Peifeng Li | Qiaoming Zhu | Guodong Zhou

Document-level Event Factuality Identification (DEFI) predicts the factuality of a specific event based on a document from which the event can be derived, which is a fundamental and crucial task in Natural Language Processing (NLP). However, most previous studies considered only the sentence-level task and did not adopt document-level knowledge. Moreover, they modelled DEFI as a typical text classification task that depends heavily on annotated information and is limited to the task-specific corpus, which resulted in data scarcity. To tackle these issues, we propose a new framework formulating DEFI as Machine Reading Comprehension (MRC) tasks considering both Span-Extraction (Ext) and Multiple-Choice (Mch). Our model does not employ any other explicit annotated information, and utilizes Transfer Learning (TL) to extract knowledge from universal large-scale MRC corpora for cross-domain data augmentation. The empirical results on the DLEFM corpus demonstrate that the proposed model outperforms several state-of-the-art baselines.

Unregulated Chinese-to-English Data Expansion Does NOT Work for Neural Event Detection
Zhongqiu Li | Yu Hong | Jie Wang | Shiming He | Jianmin Yao | Guodong Zhou

We leverage cross-language data expansion and retraining to enhance neural Event Detection (ED) on the English ACE corpus. Machine translation is utilized to expand the English ED training set with the Chinese one. However, experimental results illustrate that such a strategy actually results in performance degradation. A survey of the translations suggests that the mistakenly aligned triggers in the expanded data negatively influence the retraining process. We refer to this phenomenon as “trigger falsification”. To overcome the issue, we apply heuristic rules for regulating the expanded data, fixing the distracting samples that contain the falsified triggers. The supplementary experiments show that the rule-based regulation is beneficial, yielding an improvement of about 1.6% F1-score for ED. We additionally show that, instead of transfer learning from the translated ED data, the straight data combination by random pouring surprisingly performs better.

Finding Influential Instances for Distantly Supervised Relation Extraction
Zifeng Wang | Rui Wen | Xi Chen | Shao-Lun Huang | Ningyu Zhang | Yefeng Zheng

Distant supervision (DS) is a strong way to expand the datasets for enhancing relation extraction (RE) models but often suffers from high label noise. Current works based on attention, reinforcement learning, or GAN are black-box models, so they provide neither a meaningful interpretation of sample selection in DS nor stability across different domains. In contrast, this work proposes a novel model-agnostic instance sampling method for DS by influence function (IF), namely REIF. Our method identifies favorable/unfavorable instances in the bag based on IF, then does dynamic instance sampling. We design a fast influence sampling algorithm that reduces the computational complexity from 𝒪(mn) to 𝒪(1), and analyze its robustness on the selected sampling function. Experiments show that by simply sampling the favorable instances during training, REIF is able to win over a series of baselines which have complicated architectures. We also demonstrate that REIF can support interpretable instance selection.

A Simple Model for Distantly Supervised Relation Extraction
Ziqin Rao | Fangxiang Feng | Ruifan Li | Xiaojie Wang

Distantly supervised relation extraction is challenging due to the noise within data. Recent methods focus on exploiting bag representations based on deep neural networks with complex de-noising schemes to achieve remarkable performance. In this paper, we propose a simple but effective BERT-based Graph convolutional network Model (i.e., BGM). Our BGM comprises an instance embedding module and a bag representation module. The instance embedding module uses a BERT-based pretrained language model to extract key information from each instance. The bag representation module constructs the corresponding bag graph and then applies a convolutional operation to obtain the bag representation. Our BGM model achieves a considerable improvement on two benchmark datasets, i.e., NYT10 and GDS.

Augmenting Legal Judgment Prediction with Contrastive Case Relations
Dugang Liu | Weihao Du | Lei Li | Weike Pan | Zhong Ming

Existing legal judgment prediction methods usually only consider one single case fact description as input, which may not fully utilize the information in the data such as case relations and frequency. In this paper, we propose a new perspective that introduces some contrastive case relations to construct case triples as input, and a corresponding judgment prediction framework with case triples modeling (CTM). Our CTM can more effectively utilize beneficial information to refine the encoding and decoding processes through three customized modules, including the case triple module, the relational attention module, and the category decoder module. Finally, we conduct extensive experiments on two public datasets to verify the effectiveness of our CTM, including overall evaluation, compatibility analysis, ablation studies, analysis of gain source and visualization of case representations.

Constrained Regeneration for Cross-Lingual Query-Focused Extractive Summarization
Elsbeth Turcan | David Wan | Faisal Ladhak | Petra Galuscakova | Sukanta Sen | Svetlana Tchistiakova | Weijia Xu | Marine Carpuat | Kenneth Heafield | Douglas Oard | Kathleen McKeown

Query-focused summaries of retrieved foreign-language documents can help a user understand whether a document is actually relevant to the query term. A standard approach to this problem is to first translate the source documents and then perform extractive summarization to find relevant snippets. However, in a cross-lingual setting, the query term does not necessarily appear in the translations of relevant documents. In this work, we show that constrained machine translation and constrained post-editing can improve human relevance judgments by including a query term in a summary when its translation appears in the source document. We also present several strategies for selecting only certain documents for regeneration, which yield further improvements.

Programmable Annotation with Diversed Heuristics and Data Denoising
Ernie Chang | Alex Marin | Vera Demberg

Neural natural language generation (NLG) and understanding (NLU) models are costly and require massive amounts of annotated data to be competitive. Recent data programming frameworks address this bottleneck by allowing human supervision to be provided as a set of labeling functions to construct generative models that synthesize weak labels at scale. However, these labeling functions are difficult to build from scratch for NLG/NLU models, as they often require complex rule sets to be specified. To this end, we propose a novel data programming framework that can jointly construct labeled data for language generation and understanding tasks – by allowing the annotators to modify an automatically-inferred alignment rule set between sequence labels and text, instead of writing rules from scratch. Further, to mitigate the effect of poor quality labels, we propose a dually-regularized denoising mechanism for optimizing the NLU and NLG models. On two benchmarks we show that the framework can generate high-quality data that comes within a 1.48 BLEU and 6.42 slot F1 of the 100% human-labeled data (42k instances) with just 100 labeled data samples – outperforming benchmark annotation frameworks and other semi-supervised approaches.

Text-to-Text Extraction and Verbalization of Biomedical Event Graphs
Giacomo Frisoni | Gianluca Moro | Lorenzo Balzani

Biomedical events represent complex, graphical, and semantically rich interactions expressed in the scientific literature. Almost all contributions in the event realm orbit around semantic parsing, usually employing discriminative architectures and cumbersome multi-step pipelines limited to a small number of target interaction types. We present the first lightweight framework to solve both event extraction and event verbalization with a unified text-to-text approach, allowing us to fuse all the resources so far designed for different tasks. To this end, we present a new event graph linearization technique and release highly comprehensive event-text paired datasets, covering more than 150 event types from multiple biology subareas (English language). By streamlining parsing and generation to translations, we propose baseline transformer model results according to multiple biomedical text mining benchmarks and NLG metrics. Our extractive models achieve greater state-of-the-art performance than single-task competitors and show promising capabilities for the controlled generation of coherent natural language utterances from structured data.

Multimodal Semi-supervised Learning for Disaster Tweet Classification
Iustin Sirbu | Tiberiu Sosea | Cornelia Caragea | Doina Caragea | Traian Rebedea

During natural disasters, people often use social media platforms, such as Twitter, to post information about casualties and damage produced by disasters. This information can help relief authorities gain situational awareness in nearly real time, and enable them to quickly distribute resources where most needed. However, annotating data for this purpose can be burdensome, subjective and expensive. In this paper, we investigate how to leverage the copious amounts of unlabeled data generated on social media by disaster eyewitnesses and affected individuals during disaster events. To this end, we propose a semi-supervised learning approach to improve the performance of neural models on several multimodal disaster tweet classification tasks. Our approach yields significant gains in F-1: up to 7.7% in low-data regimes and 1.9% when using the entire training data. We make our code and data publicly available at

Automated Essay Scoring via Pairwise Contrastive Regression
Jiayi Xie | Kaiwei Cai | Li Kong | Junsheng Zhou | Weiguang Qu

Automated essay scoring (AES) involves the prediction of a score relating to the writing quality of an essay. Most existing works in AES use either regression objectives or ranking objectives. However, the two types of methods are highly complementary. To this end, in this paper we take inspiration from contrastive learning and propose a novel unified Neural Pairwise Contrastive Regression (NPCR) model in which both objectives are optimized simultaneously as a single loss. Specifically, we first design a neural pairwise ranking model to guarantee the global ranking order in a large list of essays, and then we further extend this pairwise ranking model to predict the relative scores between an input essay and several reference essays. Additionally, a multi-sample voting strategy is employed for inference. We use Quadratic Weighted Kappa to evaluate our model on the public Automated Student Assessment Prize (ASAP) dataset, and the experimental results demonstrate that NPCR outperforms previous methods by a large margin, achieving the state-of-the-art average performance for the AES task.
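For readers unfamiliar with the evaluation metric named above, the following is a minimal sketch of Quadratic Weighted Kappa in its standard textbook form. It is not taken from the NPCR implementation; the function name, argument names, and the integer score range are assumptions made for illustration.

```python
from collections import Counter


def quadratic_weighted_kappa(human, predicted, min_score, max_score):
    """Agreement between two lists of integer scores.

    Disagreements are penalised by the squared distance between the two
    scores, normalised by the squared width of the score range.
    Assumes max_score > min_score and that the expected disagreement is
    non-zero (i.e. the scores are not all identical in both lists).
    """
    n = max_score - min_score + 1
    total = len(human)
    # Observed joint counts and per-rater marginal counts, shifted to 0..n-1.
    observed = Counter((h - min_score, p - min_score) for h, p in zip(human, predicted))
    hist_h = Counter(h - min_score for h in human)
    hist_p = Counter(p - min_score for p in predicted)

    numerator = 0.0    # weighted observed disagreement
    denominator = 0.0  # weighted disagreement expected by chance
    for i in range(n):
        for j in range(n):
            weight = ((i - j) ** 2) / ((n - 1) ** 2)
            numerator += weight * observed[(i, j)]
            denominator += weight * hist_h[i] * hist_p[j] / total
    return 1.0 - numerator / denominator
```

A value of 1 indicates perfect agreement, 0 indicates chance-level agreement, and negative values indicate systematic disagreement.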

Medical Question Understanding and Answering with Knowledge Grounding and Semantic Self-Supervision
Khalil Mrini | Harpreet Singh | Franck Dernoncourt | Seunghyun Yoon | Trung Bui | Walter W. Chang | Emilia Farcas | Ndapa Nakashole

Current medical question answering systems have difficulty processing long, detailed and informally worded questions submitted by patients, called Consumer Health Questions (CHQs). To address this issue, we introduce a medical question understanding and answering system with knowledge grounding and semantic self-supervision. Our system is a pipeline that first summarizes a long, medical, user-written question, using a supervised summarization loss. Then, our system performs a two-step retrieval to return answers. The system first matches the summarized user question with an FAQ from a trusted medical knowledge base, and then retrieves a fixed number of relevant sentences from the corresponding answer document. In the absence of labels for question matching or answer relevance, we design 3 novel, self-supervised and semantically-guided losses. We evaluate our model against two strong retrieval-based question answering baselines. Evaluators ask their own questions and rate the answers retrieved by the baselines and by our system according to their relevance. They find that our system retrieves more relevant answers, while achieving speeds 20 times faster. Our self-supervised losses also help the summarizer achieve higher scores in ROUGE, as well as in human evaluation metrics.

A Progressive Framework for Role-Aware Rumor Resolution
Lei Chen | Guanying Li | Zhongyu Wei | Yang Yang | Baohua Zhou | Qi Zhang | Xuanjing Huang

Existing works on rumor resolution have shown great potential in recognizing word appearance and user participation. However, they ignore the intrinsic propagation mechanisms of rumors and adapt poorly when unprecedented news emerges. To exploit the fine-grained rumor diffusion patterns and generalize rumor resolution methods, we formulate a predecessor task to identify triggering posts, and then exploit their characteristics to facilitate rumor verification. We design a tree-structured annotation interface and extend the PHEME dataset with labels on the message level. Data analysis shows that triggers play a critical role in verifying rumors and present similar linguistic patterns across unrelated events. We propose a graph-based model considering the direction and interaction of information flow to implement role-aware rumor resolution. Experimental results demonstrate the effectiveness of our proposed model and progressive scheme.

Uncertainty-aware Propagation Structure Reconstruction for Fake News Detection
Lingwei Wei | Dou Hu | Wei Zhou | Songlin Hu

The widespread dissemination of fake news has detrimental societal effects. Recent works model information propagation as a graph structure and aggregate structural features from user interactions for fake news detection. However, they usually neglect a broader propagation uncertainty issue, caused by missing and unreliable interactions during actual spreading, and struggle to learn accurate and diverse structural properties. In this paper, we propose a novel dual graph-based model, Uncertainty-aware Propagation Structure Reconstruction (UPSR), for improving fake news detection. Specifically, after the original propagation modeling, we introduce propagation structure reconstruction to fully explore latent interactions in the actual propagation. We design a novel Gaussian Propagation Estimation to refine the original deterministic node representation with multiple Gaussian distributions and to surface latent interactions using the KL divergence between distributions in a multi-facet manner. Extensive experiments on two real-world datasets demonstrate the effectiveness and superiority of our model.
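The abstract does not specify how the KL divergence between node distributions is computed. As a hedged illustration only, the sketch below gives the standard closed form for two Gaussians with diagonal covariance; the diagonal-covariance assumption and all names are hypothetical and are not taken from UPSR.

```python
import math


def kl_diag_gaussians(mu_p, sigma_p, mu_q, sigma_q):
    """KL(P || Q) for Gaussians with diagonal covariance.

    Each argument is a sequence with one entry per dimension; sigma values
    are standard deviations and must be strictly positive.
    """
    total = 0.0
    for mp, sp, mq, sq in zip(mu_p, sigma_p, mu_q, sigma_q):
        # Per-dimension closed form: log(sq/sp) + (sp^2 + (mp-mq)^2) / (2 sq^2) - 1/2
        total += math.log(sq / sp) + (sp ** 2 + (mp - mq) ** 2) / (2 * sq ** 2) - 0.5
    return total
```

The divergence is zero when the two distributions coincide and is asymmetric in general, so a model that scores candidate interactions this way has to choose a direction or symmetrise the measure.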

A Unified Propagation Forest-based Framework for Fake News Detection
Lingwei Wei | Dou Hu | Yantong Lai | Wei Zhou | Songlin Hu

The rapid propagation of fake news on social media brings severe social ramifications and economic damage. Previous fake news detection methods usually learn semantic and structural patterns within a single target propagation tree. However, they are limited to narrow signals since they do not consider latent information across other propagation trees. Motivated by a common phenomenon that most fake news is published around a specific hot event/topic, this paper develops a new concept of propagation forest to naturally combine propagation trees in a semantic-aware clustering. We propose a novel Unified Propagation Forest-based framework (UniPF) to fully explore latent correlations between propagation trees to improve fake news detection. Besides, we design a root-induced training strategy, which encourages representations of propagation trees to be closer to their prototypical root nodes. Extensive experiments on four benchmarks consistently suggest the effectiveness and scalability of UniPF.

CLoSE: Contrastive Learning of Subframe Embeddings for Political Bias Classification of News Media
Michelle YoungJin Kim | Kristen Marie Johnson

Framing is a political strategy in which journalists and politicians emphasize certain aspects of a societal issue in order to influence and sway public opinion. Frameworks for detecting framing in news articles or social media posts are critical in understanding the spread of biased information in our society. In this paper, we propose CLoSE, a multi-task BERT-based model which uses contrastive learning to embed indicators of frames from news articles in order to predict political bias. We evaluate the performance of our proposed model on subframes and political bias classification tasks. We also demonstrate the model’s classification accuracy on zero-shot and few-shot learning tasks, providing a promising avenue for framing detection in unlabeled data.

Grammatical Error Correction: Are We There Yet?
Muhammad Reza Qorib | Hwee Tou Ng

There has been much recent progress in natural language processing, and grammatical error correction (GEC) is no exception. We found that state-of-the-art GEC systems (T5 and GECToR) outperform humans by a wide margin on the CoNLL-2014 test set, a benchmark GEC test corpus, as measured by the standard F0.5 evaluation metric. However, a careful examination of their outputs reveals that there are still classes of errors that they fail to correct. This suggests that creating new test data that more accurately measure the true performance of GEC systems constitutes important future work.
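As a reminder of what the F0.5 metric rewards, here is a minimal sketch of the general F-beta formula computed from counts of true positives, false positives, and false negatives. It is not the official CoNLL-2014 scorer; the function name and the count-based interface are assumptions for illustration.

```python
def f_beta(tp, fp, fn, beta=0.5):
    """F-beta score from raw counts; beta < 1 weights precision above recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

With beta set to 0.5, a system that proposes few but accurate corrections scores higher than one with the same F1 that proposes many corrections of mixed quality.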

CXR Data Annotation and Classification with Pre-trained Language Models
Nina Zhou | Ai Ti Aw | Zhuo Han Liu | Cher heng Tan | Yonghan Ting | Wen Xiang Chen | Jordan sim zheng Ting

Clinical data annotation has been one of the major obstacles for applying machine learning approaches in clinical NLP. Open-source tools such as NegBio and CheXpert are usually designed on data from specific institutions, which limits their applicability to other institutions due to differences in writing style, structure, language use, and label definition. In this paper, we propose a new weak supervision annotation framework with two improvements compared to existing annotation frameworks: 1) we propose to select representative samples for efficient manual annotation; 2) we propose to auto-annotate the remaining samples, both leveraging a self-trained sentence encoder. This framework also provides a function for identifying inconsistent annotation errors. Our proposed weak supervision annotation framework is applicable to any given data annotation task, and it provides an efficient form of sample selection and data auto-annotation with better classification results for real applications.

uChecker: Masked Pretrained Language Models as Unsupervised Chinese Spelling Checkers
Piji Li

The task of Chinese Spelling Check (CSC) aims to detect and correct spelling errors found in text. Because manually annotating a high-quality dataset is expensive and time-consuming, the scale of the training dataset is usually very small (e.g., SIGHAN15 only contains 2339 samples for training). Supervised-learning based models therefore usually suffer from data sparsity and over-fitting, especially in the era of big language models. In this paper, we investigate the unsupervised paradigm to address the CSC problem and propose a framework named uChecker to conduct unsupervised spelling error detection and correction. Masked pretrained language models such as BERT are introduced as the backbone model considering their powerful language diagnosis capability. Benefiting from the various and flexible MASKing operations, we propose a Confusionset-guided masking strategy to fine-train the masked language model to further improve the performance of unsupervised detection and correction. Experimental results on standard datasets demonstrate the effectiveness of our proposed model uChecker in terms of character-level and sentence-level Accuracy, Precision, Recall, and F1-Measure on the tasks of spelling error detection and correction.

Boosting Deep CTR Prediction with a Plug-and-Play Pre-trainer for News Recommendation
Qijiong Liu | Jieming Zhu | Quanyu Dai | Xiao-Ming Wu

Understanding news content is critical to improving the quality of news recommendation. To achieve this goal, recent studies have attempted to apply pre-trained language models (PLMs) such as BERT for semantic-enhanced news recommendation. Despite their great success in offline evaluation, it is still a challenge to apply such large PLMs in real-time ranking models due to the stringent requirements on inference and updating time. To bridge this gap, we propose a plug-and-play pre-trainer, namely PREC, to learn both user and news encoders through multi-task pre-training. Instead of directly leveraging sophisticated PLMs for end-to-end inference, we focus on how to use the derived user and item representations to boost the performance of conventional lightweight models for click-through-rate prediction. This enables efficient online inference as well as compatibility with conventional models, which would significantly ease practical deployment. We validate the effectiveness of PREC through both offline evaluation on public datasets and online A/B testing in an industrial application.

Improving Fake News Detection of Influential Domain via Domain- and Instance-Level Transfer
Qiong Nan | Danding Wang | Yongchun Zhu | Qiang Sheng | Yuhui Shi | Juan Cao | Jintao Li

Social media spreads both real news and fake news in various domains including politics, health, entertainment, etc. It is crucial to automatically detect fake news, especially for news of influential domains like politics and health because they may lead to serious social impact, e.g., panic in the COVID-19 pandemic. Some studies indicate the correlation between domains and perform multi-domain fake news detection. However, these multi-domain methods suffer from a seesaw problem: the performance of some domains is often improved at the cost of other domains, which could lead to unsatisfying performance in the specific target domains. To address this issue, we propose a Domain- and Instance-level Transfer Framework for Fake News Detection (DITFEND), which could improve the performance of specific target domains. To transfer coarse-grained domain-level knowledge, we train a general model with data of all domains from the meta-learning perspective. To transfer fine-grained instance-level knowledge and adapt the general model to a target domain, a language model is trained on the target domain to evaluate the transferability of each data instance in source domains and re-weight the instance’s contribution. Experiments on two real-world datasets demonstrate the effectiveness of DITFEND. According to both offline and online experiments, DITFEND shows superior effectiveness for fake news detection.

Student Surpasses Teacher: Imitation Attack for Black-Box NLP APIs
Qiongkai Xu | Xuanli He | Lingjuan Lyu | Lizhen Qu | Gholamreza Haffari

Machine-learning-as-a-service (MLaaS) has attracted millions of users to their splendid large-scale models. Although published as black-box APIs, the valuable models behind these services are still vulnerable to imitation attacks. Recently, a series of works have demonstrated that attackers manage to steal or extract the victim models. Nonetheless, none of the previous stolen models can outperform the original black-box APIs. In this work, we conduct unsupervised domain adaptation and multi-victim ensemble to show that attackers could potentially surpass victims, which is beyond previous understanding of model extraction. Extensive experiments on both benchmark datasets and real-world APIs validate that the imitators can succeed in outperforming the original black-box models on transferred domains. We consider our work as a milestone in research on imitation attacks, especially on NLP APIs, as the superior performance could influence the defense or even publishing strategy of API providers.

Combining Compressions for Multiplicative Size Scaling on Natural Language Tasks
Rajiv Movva | Jinhao Lei | Shayne Longpre | Ajay Gupta | Chris DuBois

Quantization, knowledge distillation, and magnitude pruning are among the most popular methods for neural network compression in NLP. Independently, these methods reduce model size and can accelerate inference, but their relative benefit and combinatorial interactions have not been rigorously studied. For each of the eight possible subsets of these techniques, we compare accuracy vs. model size tradeoffs across six BERT architecture sizes and eight GLUE tasks. We find that quantization and distillation consistently provide greater benefit than pruning. Surprisingly, except for the pair of pruning and quantization, using multiple methods together rarely yields diminishing returns. Instead, we observe complementary and super-multiplicative reductions to model size. Our work quantitatively demonstrates that combining compression methods can synergistically reduce model size, and that practitioners should prioritize (1) quantization, (2) knowledge distillation, and (3) pruning to maximize accuracy vs. model size tradeoffs.
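To make the notion of a multiplicative baseline concrete, the sketch below enumerates the eight subsets of the three compression methods and computes the size reduction each would give if the methods composed exactly multiplicatively. The per-method reduction factors are hypothetical values chosen for illustration and are not results from the paper; a measured reduction above this baseline would correspond to the super-multiplicative behavior described in the abstract.

```python
from itertools import combinations
from math import prod

# Hypothetical size-reduction factors, chosen only for illustration.
REDUCTION = {"quantization": 4.0, "distillation": 2.0, "pruning": 1.5}


def all_subsets(methods):
    """Return every subset of the methods, including the empty one."""
    names = sorted(methods)
    return [frozenset(c) for r in range(len(names) + 1) for c in combinations(names, r)]


def multiplicative_baseline(subset, reduction=REDUCTION):
    """Size reduction expected if the chosen methods compose exactly multiplicatively."""
    return prod(reduction[m] for m in subset)


SUBSETS = all_subsets(REDUCTION)
```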

PlugAT: A Plug and Play Module to Defend against Textual Adversarial Attack
Rui Zheng | Rong Bao | Qin Liu | Tao Gui | Qi Zhang | Xuanjing Huang | Rui Xie | Wei Wu

Adversarial training, which minimizes the loss of adversarially perturbed examples, has received considerable attention. However, these methods require modifying all model parameters and optimizing the model from scratch, which is parameter inefficient and unfriendly to already deployed models. As an alternative, we propose a pluggable defense module, PlugAT, to provide robust predictions by adding a few trainable parameters to the model inputs while keeping the original model frozen. To reduce the potential side effects of using defense modules, we further propose a novel forgetting restricted adversarial training, which filters out bad adversarial examples that impair the performance of original ones. The PlugAT-equipped BERT model substantially improves robustness over several strong baselines on various text classification tasks, whilst training only 9.1% of the parameters. We observe that defense modules trained under the same model architecture have domain adaptation ability between similar text classification datasets.

Automatic ICD Coding Exploiting Discourse Structure and Reconciled Code Embeddings
Shurui Zhang | Bozheng Zhang | Fuxin Zhang | Bo Sang | Wanchun Yang

The International Classification of Diseases (ICD) is the foundation of global health statistics and epidemiology. The ICD is designed to translate health conditions into alphanumeric codes. A number of approaches have been proposed for automatic ICD coding, since manual coding is labor-intensive and there is a global shortage of healthcare workers. However, existing studies did not exploit the discourse structure of clinical notes, which provides rich contextual information for code assignment. In this paper, we exploit the discourse structure by leveraging section type classification and section type embeddings. We also focus on the class-imbalanced problem and the heterogeneous writing style between clinical notes and ICD code definitions. The proposed reconciled embedding approach is able to tackle them simultaneously. Experimental results on the MIMIC dataset show that our model outperforms all previous state-of-the-art models by a large margin. The source code is available at

Towards Summarizing Healthcare Questions in Low-Resource Setting
Shweta Yadav | Cornelia Caragea

The current advancement in abstractive document summarization depends to a large extent on a considerable amount of human-annotated datasets. However, the creation of large-scale datasets is often not feasible in closed domains, such as medical and healthcare domains, where human annotation requires domain expertise. This paper presents a novel data selection strategy to generate diverse and semantic questions in a low-resource setting with the aim to summarize healthcare questions. Our method exploits the concept of guided semantic-overlap and diversity-based objective functions to optimally select the informative and diverse set of synthetic samples for data augmentation. Our extensive experiments on benchmark healthcare question summarization datasets demonstrate the effectiveness of our proposed data selection strategy by achieving new state-of-the-art results. Our human evaluation shows that our method generates diverse, fluent, and informative summarized questions.

Doc-GCN: Heterogeneous Graph Convolutional Networks for Document Layout Analysis
Siwen Luo | Yihao Ding | Siqu Long | Josiah Poon | Soyeon Caren Han

Recognizing the layout of unstructured digital documents is crucial when parsing the documents into a structured, machine-readable format for downstream applications. Recent studies in Document Layout Analysis usually rely on visual cues to understand documents while ignoring other information, such as contextual information or the relationships between document layout components, which are vital to improving layout analysis performance. Our Doc-GCN presents an effective way to harmonize and integrate heterogeneous aspects for Document Layout Analysis. We construct different graphs to capture the four main feature aspects of document layout components, including syntactic, semantic, density, and appearance features. Then, we apply graph convolutional networks to enhance each aspect of features and apply node-level pooling for integration. Finally, we concatenate features of all aspects and feed them into 2-layer MLPs for document layout component classification. Our Doc-GCN achieves state-of-the-art results on three widely used DLA datasets: PubLayNet, FUNSD, and DocBank. The code will be released at

Analytic Automated Essay Scoring Based on Deep Neural Networks Integrating Multidimensional Item Response Theory
Takumi Shibata | Masaki Uto

Essay exams have been attracting attention as a way of measuring the higher-order abilities of examinees, but they have two major drawbacks in that grading them is expensive and raises questions about fairness. As an approach to overcome these problems, automated essay scoring (AES) is increasingly needed. Many AES models based on deep neural networks have been proposed in recent years and have achieved high accuracy, but most of these models are designed to predict only a single overall score. However, to provide detailed feedback in practical situations, we often require not only the overall score but also analytic scores corresponding to various aspects of the essay. Several neural AES models that can predict both the analytic scores and the overall score have also been proposed for this very purpose. However, conventional models are designed to have complex neural architectures for each analytic score, which makes interpreting the score prediction difficult. To improve the interpretability of the prediction while maintaining scoring accuracy, we propose a new neural model for automated analytic scoring that integrates a multidimensional item response theory model, which is a popular psychometric model.

DP-Rewrite: Towards Reproducibility and Transparency in Differentially Private Text Rewriting
Timour Igamberdiev | Thomas Arnold | Ivan Habernal

Text rewriting with differential privacy (DP) provides concrete theoretical guarantees for protecting the privacy of individuals in textual documents. In practice, existing systems may lack the means to validate their privacy-preserving claims, leading to problems of transparency and reproducibility. We introduce DP-Rewrite, an open-source framework for differentially private text rewriting which aims to solve these problems by being modular, extensible, and highly customizable. Our system incorporates a variety of downstream datasets, models, pre-training procedures, and evaluation metrics to provide a flexible way to lead and validate private text rewriting research. To demonstrate our software in practice, we provide a set of experiments as a case study on the ADePT DP text rewriting system, detecting a privacy leak in its pre-training approach. Our system is publicly available, and we hope that it will help the community to make DP text rewriting research more accessible and transparent.

Harnessing Abstractive Summarization for Fact-Checked Claim Detection
Varad Bhatnagar | Diptesh Kanojia | Kameswari Chebrolu

Social media platforms have become new battlegrounds for anti-social elements, with misinformation being the weapon of choice. Fact-checking organizations try to debunk as many claims as possible while staying true to their journalistic processes but cannot cope with the rapid dissemination of misinformation. We believe that the solution lies in partial automation of the fact-checking life cycle, saving human time for tasks which require high cognition. We propose a new workflow for efficiently detecting previously fact-checked claims that uses abstractive summarization to generate crisp queries. These queries can then be executed on a general-purpose retrieval system associated with a collection of previously fact-checked claims. We curate an abstractive text summarization dataset comprising noisy claims from Twitter and their gold summaries. Compared to verbatim querying, retrieval performance improves 2x when using popular out-of-the-box summarization models and 3x when fine-tuning them on the accompanying dataset. Our approach achieves Recall@5 and MRR of 35% and 0.3, compared to baseline values of 10% and 0.1, respectively. Our dataset, code, and models are available publicly:
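The two retrieval metrics reported above can be sketched as follows. This is a minimal illustration that assumes exactly one relevant fact-checked claim per query; that assumption, along with the function and parameter names, is hypothetical and does not describe the authors' evaluation code.

```python
def recall_at_k(ranked_lists, relevant_ids, k=5):
    """Fraction of queries whose relevant item appears in the top k results."""
    hits = sum(1 for ranked, rel in zip(ranked_lists, relevant_ids) if rel in ranked[:k])
    return hits / len(ranked_lists)


def mean_reciprocal_rank(ranked_lists, relevant_ids):
    """Average of 1/rank of the relevant item; a query contributes 0 if it is never retrieved."""
    total = 0.0
    for ranked, rel in zip(ranked_lists, relevant_ids):
        if rel in ranked:
            total += 1.0 / (ranked.index(rel) + 1)
    return total / len(ranked_lists)
```

Recall@k only asks whether the relevant claim appears in the top k, whereas MRR also rewards placing it nearer the top of the list.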

Learning to Generate Explanation from e-Hospital Services for Medical Suggestion
Wei-Lin Chen | An-Zi Yen | Hen-Hsen Huang | Hsin-Hsi Chen

Explaining the reasoning of neural models has attracted attention in recent years. Providing highly-accessible and comprehensible explanations in natural language helps humans understand a model’s prediction results. In this work, we present a pilot study to investigate explanation generation with a narrative and causal structure for the scenario of health consulting. Our model generates a medical suggestion regarding the patient’s concern and provides an explanation as the outline of the reasoning. To align the generated explanation with the suggestion, we propose a novel discourse-aware mechanism with multi-task learning. Experimental results show that our model achieves promising performance in both quantitative and human evaluation.

DeltaNet: Conditional Medical Report Generation for COVID-19 Diagnosis
Xian Wu | Shuxin Yang | Zhaopeng Qiu | Shen Ge | Yangtian Yan | Xingwang Wu | Yefeng Zheng | S. Kevin Zhou | Li Xiao

Fast screening and diagnosis are critical in COVID-19 patient treatment. In addition to the gold standard RT-PCR, radiological imaging like X-ray and CT also works as an important means in patient screening and follow-up. However, due to the excessive number of patients, writing reports becomes a heavy burden for radiologists. To reduce the workload of radiologists, we propose DeltaNet to generate medical reports automatically. Different from typical image captioning approaches that generate reports with an encoder and a decoder, DeltaNet applies a conditional generation process. In particular, given a medical image, DeltaNet employs three steps to generate a report: 1) first retrieving related medical reports, i.e., the historical reports from the same or similar patients; 2) then comparing retrieved images and current image to find the differences; 3) finally generating a new report to accommodate identified differences based on the conditional report. We evaluate DeltaNet on a COVID-19 dataset, where DeltaNet outperforms state-of-the-art approaches. Besides COVID-19, the proposed DeltaNet can be applied to other diseases as well. We validate its generalization capabilities on the public IU-Xray and MIMIC-CXR datasets for chest-related diseases.

MCS: An In-battle Commentary System for MOBA Games
Xiaofeng Qi | Chao Li | Zhongping Liang | Jigang Liu | Cheng Zhang | Yuanxin Wei | Lin Yuan | Guang Yang | Lanxiao Huang | Min Li

This paper introduces a generative system for in-battle real-time commentary in mobile MOBA games. Event commentary is important for battles in MOBA games and is applicable to a wide range of scenarios like live streaming, e-sports commentary and combat information analysis. The system takes real-time match statistics and events as input, and an effective transform method is designed to convert match statistics and utterances into a consistent encoding space. This paper presents the general framework and implementation details of the proposed system, and provides experimental results on large-scale real-world match data.

A Two Stage Adaptation Framework for Frame Detection via Prompt Learning
Xinyi Mou | Zhongyu Wei | Changjian Jiang | Jiajie Peng

Framing is a communication strategy to bias discussion by selecting and emphasizing. Frame detection aims to automatically analyze framing strategy. Previous works on frame detection mainly focus on a single scenario or issue, ignoring the special characteristics of frame detection that new events emerge continuously and the policy agenda changes dynamically. To better deal with various contexts and frame typologies across different issues, we propose a two-stage adaptation framework. In the first stage, framing domain adaptation from pre-training, we design two tasks based on pivots and prompts to learn a transferable encoder, verbalizer, and prompts. In the second stage, downstream scenario generalization, the transferable components are applied to new issues and label sets. Experimental results demonstrate the effectiveness of our framework in different scenarios and show its superiority in both full-resource and low-resource conditions.

Summarizing Patients’ Problems from Hospital Progress Notes Using Pre-trained Sequence-to-Sequence Models
Yanjun Gao | Dmitriy Dligach | Timothy Miller | Dongfang Xu | Matthew M. M. Churpek | Majid Afshar

Automatically summarizing patients’ main problems from daily progress notes using natural language processing methods helps to battle against information and cognitive overload in hospital settings and potentially assists providers with computerized diagnostic decision support. Problem list summarization requires a model to understand, abstract, and generate clinical documentation. In this work, we propose a new NLP task that aims to generate a list of problems in a patient’s daily care plan using input from the provider’s progress notes during hospitalization. We investigate the performance of T5 and BART, two state-of-the-art seq2seq transformer architectures, in solving this problem. We provide a corpus built on publicly available electronic health record progress notes in the Medical Information Mart for Intensive Care (MIMIC)-III. T5 and BART are trained on general domain text, and we experiment with a data augmentation method and a domain adaptation pre-training method to increase exposure to medical vocabulary and knowledge. Evaluation methods include ROUGE, BERTScore, cosine similarity on sentence embedding, and F-score on medical concepts. Results show that T5 with domain adaptive pre-training achieves significant performance gains compared to a rule-based system and general domain pre-trained language models, indicating a promising direction for tackling the problem summarization task.

Human-in-the-loop Robotic Grasping Using BERT Scene Representation
Yaoxian Song | Penglei Sun | Pengfei Fang | Linyi Yang | Yanghua Xiao | Yue Zhang

Current NLP techniques have been widely applied in different domains. In this paper, we propose a human-in-the-loop framework for robotic grasping in cluttered scenes, investigating a language interface to the grasping process, which allows the user to intervene by natural language commands. This framework is constructed on a state-of-the-art grasping baseline, where we substitute a scene-graph representation with a text representation of the scene using BERT. Experiments in both simulation and on a physical robot show that the proposed method outperforms conventional object-agnostic and scene-graph based methods in the literature. In addition, we find that with human intervention, performance can be significantly improved. Our dataset and code are available on our project website

Automated Chinese Essay Scoring from Multiple Traits
Yaqiong He | Feng Jiang | Xiaomin Chu | Peifeng Li

Automatic Essay Scoring (AES) is the task of using the computer to evaluate the quality of essays automatically. Current research on AES focuses on scoring the overall quality or single trait of prompt-specific essays. However, users expect not only an overall score but also instant feedback on different traits to help their writing in the real world. Therefore, we first annotate a multi-trait dataset, ACEA, including 1220 argumentative essays and covering four traits, i.e., essay organization, topic, logic, and language. We then design a hierarchical multi-task trait scorer, HMTS, to evaluate the quality of writing by modeling these four traits. Moreover, we propose an inter-sequence attention mechanism to enhance information interaction between different tasks and design the trait-specific features for various tasks in AES. The experimental results on ACEA show that our HMTS can effectively score essays from multiple traits, outperforming several strong models.

Semantic-Preserving Adversarial Code Comprehension
Yiyang Li | Hongqiu Wu | Hai Zhao

Based on the tremendous success of pre-trained language models (PrLMs) for source code comprehension tasks, current literature studies either ways to further improve the performance (generalization) of PrLMs, or their robustness against adversarial attacks. However, they have to compromise on the trade-off between the two aspects and none of them consider improving both sides in an effective and practical way. To fill this gap, we propose Semantic-Preserving Adversarial Code Embeddings (SPACE) to find the worst-case semantic-preserving attacks while forcing the model to predict the correct labels under these worst cases. Experiments and analysis demonstrate that SPACE can stay robust against state-of-the-art attacks while boosting the performance of PrLMs for code.

Continually Detection, Rapidly React: Unseen Rumors Detection Based on Continual Prompt-Tuning
Yuhui Zuo | Wei Zhu | Guoyong GUET Cai

Since open social platforms allow for a large and continuous flow of unverified information, rumors can emerge unexpectedly and spread quickly. However, existing rumor detection (RD) models often assume the same training and testing distributions and cannot cope with the continuously changing social network environment. This paper proposes a Continual Prompt-Tuning RD (CPT-RD) framework, which avoids catastrophic forgetting (CF) of upstream tasks during sequential task learning and enables bidirectional knowledge transfer between domain tasks. Specifically, we propose the following strategies: (a) Our design explicitly decouples shared and domain-specific knowledge, thus reducing the interference among different domains during optimization; (b) Several technologies aim to transfer knowledge of upstream tasks to deal with emergencies; (c) A task-conditioned prompt-wise hypernetwork (TPHNet) is used to consolidate past domains. In addition, CPT-RD avoids CF without the necessity of a rehearsal buffer. Finally, CPT-RD is evaluated on English and Chinese RD datasets and is effective and efficient compared to prior state-of-the-art methods.

AiM: Taking Answers in Mind to Correct Chinese Cloze Tests in Educational Applications
Yusen Zhang | Zhongli Li | Qingyu Zhou | Ziyi Liu | Chao Li | Mina Ma | Yunbo Cao | Hongzhi Liu

To automatically correct handwritten assignments, the traditional approach is to use an OCR model to recognize characters and compare them to answers. The OCR model easily gets confused on recognizing handwritten Chinese characters, and the textual information of the answers is missing during the model inference. However, teachers always have these answers in mind to review and correct assignments. In this paper, we focus on the Chinese cloze tests correction and propose a multimodal approach (named AiM). The encoded representations of answers interact with the visual information of students’ handwriting. Instead of predicting ‘right’ or ‘wrong’, we perform the sequence labeling on the answer text to infer which answer character differs from the handwritten content in a fine-grained way. We take samples of OCR datasets as the positive samples for this task, and develop a negative sample augmentation method to scale up the training data. Experimental results show that AiM outperforms OCR-based methods by a large margin. Extensive studies demonstrate the effectiveness of our multimodal approach.

TreeMAN: Tree-enhanced Multimodal Attention Network for ICD Coding
Zichen Liu | Xuyuan Liu | Yanlong Wen | Guoqing Zhao | Fen Xia | Xiaojie Yuan

ICD coding is designed to assign the disease codes to electronic health records (EHRs) upon discharge, which is crucial for billing and clinical statistics. In an attempt to improve the effectiveness and efficiency of manual coding, many methods have been proposed to automatically predict ICD codes from clinical notes. However, most previous works ignore the decisive information contained in structured medical data in EHRs, which is hard to be captured from the noisy clinical notes. In this paper, we propose a Tree-enhanced Multimodal Attention Network (TreeMAN) to fuse tabular features and textual features into multimodal representations by enhancing the text representations with tree-based features via the attention mechanism. Tree-based features are constructed according to decision trees learned from structured multimodal medical data, which capture the decisive information about ICD coding. We can apply the same multi-label classifier from previous text models to the multimodal representations to predict ICD codes. Experiments on two MIMIC datasets show that our method outperforms prior state-of-the-art ICD coding approaches. The code is available at

Gated Mechanism Enhanced Multi-Task Learning for Dialog Routing
Ziming Huang | Zhuoxuan Jiang | Ke Wang | Juntao Li | Shanshan Feng | Xian-Ling Mao

Currently, human-bot symbiosis dialog systems, e.g. pre- and after-sales in E-commerce, are ubiquitous, and the dialog routing component is essential to improve the overall efficiency, reduce human resource cost and increase user experience. To satisfy this requirement, existing methods are mostly heuristic and cannot obtain high-quality performance. In this paper, we investigate the important problem by thoroughly mining both the data-to-task and task-to-task knowledge among various kinds of dialog data. To achieve the above target, we propose a comprehensive and general solution with multi-task learning framework, specifically including a novel dialog encoder and two tailored gated mechanism modules. The proposed Gated Mechanism enhanced Multi-task Model (G3M) can play the role of hierarchical information filtering and is non-invasive to the existing dialog systems. Experiments on two datasets collected from the real world demonstrate our method’s effectiveness and the results achieve the state-of-the-art performance by relatively increasing 8.7%/11.8% on RMSE metric and 2.2%/4.4% on F1 metric.

Negation, Coordination, and Quantifiers in Contextualized Language Models
Aikaterini-Lida Kalouli | Rita Sevastjanova | Christin Beck | Maribel Romero

With the success of contextualized language models, much research explores what these models really learn and in which cases they still fail. Most of this work focuses on specific NLP tasks and on the learning outcome. Little research has attempted to decouple the models’ weaknesses from specific tasks and focus on the embeddings per se and their mode of learning. In this paper, we take up this research opportunity: based on theoretical linguistic insights, we explore whether the semantic constraints of function words are learned and how the surrounding context impacts their embeddings. We create suitable datasets, provide new insights into the inner workings of LMs vis-a-vis function words and implement an assisting visual web interface for qualitative analysis.

Tales and Tropes: Gender Roles from Word Embeddings in a Century of Children’s Books
Anjali Adukia | Patricia Chiril | Callista Christ | Anjali Das | Alex Eble | Emileigh Harrison | Hakizumwami Birali Runesha

The manner in which gender is portrayed in materials used to teach children conveys messages about people’s roles in society. In this paper, we measure the gendered depiction of central domains of social life in 100 years of highly influential children’s books. We make two main contributions: (1) we find that the portrayal of gender in these books reproduces traditional gender norms in society, and (2) we publish StoryWords 1.0, the first word embeddings trained on such a large body of children’s literature. We find that, relative to males, females are more likely to be represented in relation to their appearance than in relation to their competence; second, they are more likely to be represented in relation to their role in the family than their role in business. Finally, we find that non-binary or gender-fluid individuals are rarely mentioned. Our analysis advances understanding of the different messages contained in content commonly used to teach children, with immediate applications for practice, policy, and research.

CLOWER: A Pre-trained Language Model with Contrastive Learning over Word and Character Representations
Borun Chen | Hongyin Tang | Jiahao Bu | Kai Zhang | Jingang Wang | Qifan Wang | Hai-Tao Zheng | Wei Wu | Liqian Yu

Pre-trained Language Models (PLMs) have achieved remarkable performance gains across numerous downstream tasks in natural language understanding. Various Chinese PLMs have been successively proposed for learning better Chinese language representation. However, most current models use Chinese characters as inputs and are not able to encode semantic information contained in Chinese words. While recent pre-trained models incorporate both words and characters simultaneously, they usually suffer from deficient semantic interactions and fail to capture the semantic relation between words and characters. To address the above issues, we propose a simple yet effective PLM CLOWER, which adopts the Contrastive Learning Over Word and charactER representations. In particular, CLOWER implicitly encodes the coarse-grained information (i.e., words) into the fine-grained representations (i.e., characters) through contrastive learning on multi-grained information. CLOWER is of great value in realistic scenarios since it can be easily incorporated into any existing fine-grained based PLMs without modifying the production pipelines. Extensive experiments conducted on a range of downstream tasks demonstrate the superior performance of CLOWER over several state-of-the-art baselines.

On the Nature of BERT: Correlating Fine-Tuning and Linguistic Competence
Federica Merendi | Felice Dell’Orletta | Giulia Venturi

Several studies in the literature on the interpretation of Neural Language Models (NLM) focus on the linguistic generalization abilities of pre-trained models. However, little attention is paid to how the linguistic knowledge of the models changes during the fine-tuning steps. In this paper, we contribute to this line of research by showing to what extent a wide range of linguistic phenomena are forgotten across 50 epochs of fine-tuning, and how the preserved linguistic knowledge is correlated with the resolution of the fine-tuning task. To this end, we considered a quite understudied task where linguistic information plays the main role, i.e. the prediction of the evolution of written language competence of native language learners. In addition, we investigate whether it is possible to predict the fine-tuned NLM accuracy across the 50 epochs solely relying on the assessed linguistic competence. Our results are encouraging and show a high relationship between the model’s linguistic competence and its ability to solve a linguistically-based downstream task.

LayerConnect: Hypernetwork-Assisted Inter-Layer Connector to Enhance Parameter Efficiency
Haoxiang Shi | Rongsheng Zhang | Jiaan Wang | Cen Wang | Yinhe Zheng | Tetsuya Sakai

Pre-trained Language Models (PLMs) are the cornerstone of modern Natural Language Processing (NLP). However, as PLMs become heavier, fine-tuning all their parameters becomes inefficient. Existing parameter-efficient methods generally focus on reducing the trainable parameters in PLMs but neglect the inference speed, which limits the ability to deploy PLMs. In this paper, we propose LayerConnect (hypernetwork-assisted inter-layer connectors) to enhance inference efficiency. Specifically, a light-weight connector with a linear structure is inserted between two Transformer layers, and the parameters inside each connector are tuned by a hypernetwork comprising an interpolator and a down-sampler. We perform extensive experiments on the widely used GLUE benchmark. The experimental results verify the inference efficiency of our model. Compared to Adapter, our model parameters are reduced to approximately 11.75%, while the performance degradation is kept to less than 5% (2.5 points on average).

Effect of Post-processing on Contextualized Word Representations
Hassan Sajjad | Firoj Alam | Fahim Dalvi | Nadir Durrani

Post-processing of static embedding has been shown to improve their performance on both lexical and sequence-level tasks. However, post-processing for contextualized embeddings is an under-studied problem. In this work, we question the usefulness of post-processing for contextualized embeddings obtained from different layers of pre-trained language models. More specifically, we standardize individual neuron activations using z-score, min-max normalization, and by removing top principal components using the all-but-the-top method. Additionally, we apply unit length normalization to word representations. On a diverse set of pre-trained models, we show that post-processing unwraps vital information present in the representations for both lexical tasks (such as word similarity and analogy) and sequence classification tasks. Our findings raise interesting points in relation to the research studies that use contextualized representations, and suggest z-score normalization as an essential step to consider when using them in an application.
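The four post-processing operations named in this abstract can be illustrated with a minimal NumPy sketch. It assumes the representations are stored as a matrix with one word per row and one neuron per column; the function names, the `eps` guard, the number of removed components `d`, and the random toy data are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def zscore(X, eps=1e-12):
    """Standardize each neuron (column) to zero mean and unit variance."""
    return (X - X.mean(axis=0)) / (X.std(axis=0) + eps)

def minmax(X, eps=1e-12):
    """Rescale each neuron (column) to the range [0, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo + eps)

def all_but_the_top(X, d=1):
    """Center X, then remove its projection onto the top-d principal components."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    top = Vt[:d]
    return Xc - Xc @ top.T @ top

def unit_length(X, eps=1e-12):
    """Scale each word vector (row) to unit L2 norm."""
    return X / (np.linalg.norm(X, axis=1, keepdims=True) + eps)

# Toy stand-in for layer activations: 100 words, 8 neurons (assumed data).
rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=2.0, size=(100, 8))
Z = zscore(X)
```

Note that z-score and min-max act per neuron (column-wise), whereas unit-length normalization acts per word vector (row-wise).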

Does BERT Rediscover a Classical NLP Pipeline?
Jingcheng Niu | Wenjie Lu | Gerald Penn

Does BERT store surface knowledge in its bottom layers, syntactic knowledge in its middle layers, and semantic knowledge in its upper layers? In re-examining Jawahar et al. (2019) and Tenney et al.’s (2019a) probes into the structure of BERT, we have found that the pipeline-like separation that they asserted lacks conclusive empirical support. BERT’s structure is, however, linguistically founded, although perhaps in a way that is more nuanced than can be explained by layers alone. We introduce a novel probe, called GridLoc, through which we can also take into account token positions, training rounds, and random seeds. Using GridLoc, we are able to detect other, stronger regularities that suggest that pseudo-cognitive appeals to layer depth may not be the preferable mode of explanation for BERT’s inner workings.

HG2Vec: Improved Word Embeddings from Dictionary and Thesaurus Based Heterogeneous Graph
Qitong Wang | Mohammed J Zaki

Learning word embeddings is an essential topic in natural language processing. Most existing works use a vast corpus as a primary source while training, but this requires massive time and space for data pre-processing and model training. We propose a new model, HG2Vec, that learns word embeddings utilizing only dictionaries and thesauri. Our model reaches the state of the art on multiple word similarity and relatedness benchmarks. We demonstrate that dictionaries and thesauri are effective resources to learn word embeddings. In addition, we exploit a new context-focused loss that models transitive relationships between word pairs and balances the performance between similarity and relatedness benchmarks, yielding superior results.

Transferring Knowledge from Structure-aware Self-attention Language Model to Sequence-to-Sequence Semantic Parsing
Ran Ji | Jianmin Ji

Semantic parsing considers the task of mapping a natural language sentence into a target formal representation, where various sophisticated sequence-to-sequence (seq2seq) models have been applied with promising results. Generally, these target representations follow a syntax formalism that limits permitted forms. However, it is neither easy nor flexible to explicitly integrate this syntax formalism into a neural seq2seq model. In this paper, we present a structure-aware self-attention language model to capture structural information of target representations and propose a knowledge distillation based approach to incorporating the target language model into a seq2seq model, where grammar rules or sketches are not required in the training process. An ablation study shows that the proposed language model can notably improve the performance of the baseline model. The experiments show that our method achieves new state-of-the-art performance among neural approaches on four semantic parsing (ATIS, GEO) and Python code generation (Django, CoNaLa) tasks.

Enhancing Contextual Word Representations Using Embedding of Neighboring Entities in Knowledge Graphs
Ryoko Tokuhisa | Keisuke Kawano | Akihiro Nakamura | Satoshi Koide

Pre-trained language models (PLMs) such as BERT and RoBERTa have dramatically improved the performance of various natural language processing tasks. Although these models are trained on large amounts of raw text, they have no explicit grounding in real-world entities. Knowledge graphs (KGs) are manually annotated with factual knowledge and store the relations between nodes corresponding to entities as labeled edges. This paper proposes a mechanism called KG-attention, which integrates the structure of a KG into recent PLM architectures. Unlike the existing PLM+KG integration methods, KG-attention generalizes the embeddings of neighboring entities using the relation embeddings; accordingly, it can handle relations between unconnected entities in the KG. Experimental results demonstrated that our method achieved significant improvements in a relation classification task, an entity typing task, and several language comprehension tasks.

Generic Overgeneralization in Pre-trained Language Models
Sello Ralethe | Jan Buys

Generic statements such as “ducks lay eggs” make claims about kinds, e.g., ducks as a category. The generic overgeneralization effect refers to the inclination to accept false universal generalizations such as “all ducks lay eggs” or “all lions have manes” as true. In this paper, we investigate the generic overgeneralization effect in pre-trained language models experimentally. We show that pre-trained language models suffer from overgeneralization and tend to treat quantified generic statements such as “all ducks lay eggs” as if they were true generics. Furthermore, we demonstrate how knowledge embedding methods can lessen this effect by injecting factual knowledge about kinds into pre-trained language models. To this end, we source factual knowledge about two types of generics, minority characteristic generics and majority characteristic generics, and inject this knowledge using a knowledge embedding model. Our results show that knowledge injection reduces, but does not eliminate, generic overgeneralization, and that majority characteristic generics of kinds are more susceptible to overgeneralization bias.

How about Time? Probing a Multilingual Language Model for Temporal Relations
Tommaso Caselli | Irene Dini | Felice Dell’Orletta

This paper presents a comprehensive set of probing experiments using a multilingual language model, XLM-R, for temporal relation classification between events in four languages. Results show an advantage of contextualized embeddings over static ones and a detrimental role of sentence level embeddings. While obtaining competitive results against state-of-the-art systems, our probes indicate a lack of suitable encoded information to properly address this task.

CogBERT: Cognition-Guided Pre-trained Language Models
Xiao Ding | Bowen Chen | Li Du | Bing Qin | Ting Liu

We study the problem of integrating cognitive language processing signals (e.g., eye-tracking or EEG data) into pre-trained language models like BERT. Existing methods typically fine-tune pre-trained models on cognitive data, ignoring the semantic gap between the texts and cognitive signals. To fill the gap, we propose CogBERT, a framework that can induce fine-grained cognitive features from cognitive data and incorporate cognitive features into BERT by adaptively adjusting the weight of cognitive features for different NLP tasks. Extensive experiments show that: (1) Cognition-guided pre-trained models can consistently perform better than basic pre-trained models on ten NLP tasks. (2) Different cognitive features contribute differently to different NLP tasks. Based on this observation, we give a fine-grained explanation of why cognitive data is helpful for NLP. (3) Different transformer layers of pre-trained models should encode different cognitive features, with word-level cognitive features at the bottom and semantic-level cognitive features at the top. (4) Attention visualization demonstrates that CogBERT aligns with human gaze patterns and improves its natural language comprehension ability.

Can Transformers Process Recursive Nested Constructions, Like Humans?
Yair Lakretz | Théo Desbordes | Dieuwke Hupkes | Stanislas Dehaene

Recursive processing is considered a hallmark of human linguistic abilities. A recent study evaluated recursive processing in recurrent neural language models (RNN-LMs) and showed that such models perform below chance level on embedded dependencies within nested constructions – a prototypical example of recursion in natural language. Here, we study if state-of-the-art Transformer LMs do any better. We test eight different Transformer LMs on two different types of nested constructions, which differ in whether the embedded (inner) dependency is short or long range. We find that Transformers achieve near-perfect performance on short-range embedded dependencies, significantly better than previous results reported for RNN-LMs and humans. However, on long-range embedded dependencies, Transformers’ performance sharply drops below chance level. Remarkably, the addition of only three words to the embedded dependency caused Transformers to fall from near-perfect to below-chance performance. Taken together, our results reveal how brittle syntactic processing is in Transformers, compared to humans.

NSP-BERT: A Prompt-based Few-Shot Learner through an Original Pre-training Task —— Next Sentence Prediction
Yi Sun | Yu Zheng | Chao Hao | Hangping Qiu

Using prompts to utilize language models to perform various downstream tasks, also known as prompt-based learning or prompt-learning, has lately gained significant success in comparison to the pre-train and fine-tune paradigm. Nonetheless, most prompt-based methods are token-level, such as PET, which is based on the masked language model (MLM). In this paper, we attempt to accomplish several NLP tasks in the zero-shot and few-shot scenarios using an original BERT pre-training task abandoned by RoBERTa and other models: Next Sentence Prediction (NSP). Unlike token-level techniques, our sentence-level prompt-based method NSP-BERT does not need to fix the length of the prompt or the position to be predicted, allowing it to handle tasks such as entity linking with ease. NSP-BERT can be applied to a variety of tasks based on its properties. We present an NSP-tuning approach with binary cross-entropy loss for single-sentence classification tasks that is competitive compared to PET and EFL. By continuing to train BERT on RoBERTa’s corpus, the model’s performance improved significantly, which indicates that the pre-training corpus is another important determinant of few-shot performance besides model size and prompt method.

MetaPrompting: Learning to Learn Better Prompts
Yutai Hou | Hongyuan Dong | Xinghao Wang | Bohan Li | Wanxiang Che

Prompting is regarded as one of the crucial advances in few-shot natural language processing. Recent research on prompting moves from discrete-token-based “hard prompts” to continuous “soft prompts”, which employ learnable vectors as pseudo prompt tokens and achieve better performance. Though showing promising prospects, these soft-prompting methods are observed to rely heavily on good initialization to take effect. Unfortunately, obtaining a perfect initialization for soft prompts requires understanding of the inner workings of language models and elaborate design, which is no easy task and has to restart from scratch for each new task. To remedy this, we propose a generalized soft prompting method called MetaPrompting, which adopts the well-recognized model-agnostic meta-learning algorithm to automatically find better prompt initialization that facilitates fast adaptation to new prompting tasks. Extensive experiments show MetaPrompting tackles the soft prompt initialization problem and brings significant improvement on three different datasets (over 7 points improvement in accuracy for the 1-shot setting), achieving new state-of-the-art performance.

Parameter-Efficient Mixture-of-Experts Architecture for Pre-trained Language Models
Ze-Feng Gao | Peiyu Liu | Wayne Xin Zhao | Zhong-Yi Lu | Ji-Rong Wen

Recently, Mixture-of-Experts (short as MoE) architecture has achieved remarkable success in increasing the model capacity of large-scale language models. However, MoE requires incorporating significantly more parameters than the base model being extended. In this paper, we propose building a parameter-efficient MoE architecture by sharing information across experts. We adopt matrix product operator (MPO, a tensor decomposition from quantum many-body physics) to reconstruct the parameter matrix in the expert layer and increase model capacity for pre-trained language models by sharing parameters of the central tensor (containing the core information) among different experts while enabling the specificity through the auxiliary tensors (complementing the central tensor) of different experts. To address the unbalanced optimization issue, we further design the gradient mask strategy for the MPO-based MoE architecture. Extensive experiments based on T5 and GPT-2 show improved performance and efficiency of the pre-trained language model (27.2x reduction in total parameters for the superior model performance, compared with the Switch Transformers). Our code is publicly available at

Pre-trained Token-replaced Detection Model as Few-shot Learner
Zicheng Li | Shoushan Li | Guodong Zhou

Pre-trained masked language models have demonstrated remarkable ability as few-shot learners. In this paper, as an alternative, we propose a novel approach to few-shot learning with pre-trained token-replaced detection models like ELECTRA. In this approach, we reformulate a classification or a regression task as a token-replaced detection problem. Specifically, we first define a template and label description words for each task and put them into the input to form a natural language prompt. Then, we employ the pre-trained token-replaced detection model to predict which label description word is the most original (i.e., least replaced) among all label description words in the prompt. A systematic evaluation on 16 datasets demonstrates that our approach outperforms few-shot learners with pre-trained masked language models in both one-sentence and two-sentence learning tasks.
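The decision rule described in this abstract can be sketched as follows. This is an illustrative outline only: the template, the label description words, and the `replaced_prob` scorer are hypothetical. In practice the scorer would be the per-token output of a pre-trained token-replaced detection model such as an ELECTRA discriminator, which is replaced here by a toy stand-in so that the example runs on its own.

```python
from typing import Callable, Dict, List

def classify_by_replaced_detection(
    text: str,
    label_words: Dict[str, str],
    replaced_prob: Callable[[List[str], int], float],
    template: str = "It was",  # hypothetical template
) -> str:
    """Return the label whose description word looks least 'replaced' in the prompt."""
    tokens = text.split() + template.split()
    positions = {}
    # Append every label description word to the same prompt, remembering where it sits.
    for label, word in label_words.items():
        positions[label] = len(tokens)
        tokens.append(word)
    # The predicted label is the one whose word gets the lowest replaced probability.
    return min(positions, key=lambda label: replaced_prob(tokens, positions[label]))

def toy_replaced_prob(tokens: List[str], index: int) -> float:
    """Toy stand-in for a discriminator: a word counts as 'original' if a cue word co-occurs."""
    cues = {"great": {"loved", "wonderful"}, "terrible": {"hated", "boring"}}
    return 0.1 if cues.get(tokens[index], set()) & set(tokens) else 0.9

labels = {"positive": "great", "negative": "terrible"}
prediction = classify_by_replaced_detection("I loved this film", labels, toy_replaced_prob)
```

With the toy scorer, `prediction` is `"positive"`, because "great" is the only label word judged original in the prompt.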

Evaluating Diversity of Multiword Expressions in Annotated Text
Adam Lion-Bouton | Yagmur Ozturk | Agata Savary | Jean-Yves Antoine

Diversity can be decomposed into three distinct concepts, namely: variety, balance and disparity. This paper borrows from the extensive formalization and measures of diversity developed in ecology in order to evaluate the variety and balance of multiword expression annotation produced by automatic annotation systems. The measures of richness, normalized richness, and two variations of Hill’s evenness are considered in this paper. We observe how these measures behave against increasingly smaller samples of gold annotations of multiword expressions and use their comportment to validate or invalidate their pertinence for multiword expressions in annotated texts. We apply the validated measures to annotations in 14 languages produced by systems during the PARSEME shared task on automatic identification of multiword expressions and on the gold versions of the corpora. We also explore the limits of such evaluation by studying the impact of lemmatization errors in the Turkish corpus used in the shared task.
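As a rough illustration of the variety and balance measures mentioned above, the sketch below computes richness and a Hill-number-based evenness over a list of annotated expression types. The abstract does not say which two variations of Hill's evenness are used, so the ratio form shown here (one Hill number divided by another) is an assumption, and the toy type labels are hypothetical.

```python
import math
from collections import Counter

def richness(types):
    """Variety: number of distinct types observed."""
    return len(set(types))

def hill_number(types, q):
    """Hill number of order q (effective number of types)."""
    counts = Counter(types)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    if q == 1:
        # Limit case: exponential of the Shannon entropy.
        return math.exp(-sum(p * math.log(p) for p in probs))
    return sum(p ** q for p in probs) ** (1.0 / (1.0 - q))

def hill_evenness(types, a, b):
    """Balance as a ratio of two Hill numbers, e.g. (a, b) = (1, 0) or (2, 1)."""
    return hill_number(types, a) / hill_number(types, b)

# Hypothetical annotations: same richness, different balance.
balanced = ["take_off", "give_up", "by_and_large", "kick_the_bucket"]
skewed = ["take_off"] * 7 + ["give_up", "by_and_large", "kick_the_bucket"]
```

Both samples have a richness of 4, but only the balanced one reaches an evenness of 1; the skewed sample scores lower because one type dominates.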

CausalQA: A Benchmark for Causal Question Answering
Alexander Bondarenko | Magdalena Wolska | Stefan Heindorf | Lukas Blübaum | Axel-Cyrille Ngonga Ngomo | Benno Stein | Pavel Braslavski | Matthias Hagen | Martin Potthast

At least 5% of questions submitted to search engines ask about cause-effect relationships in some way. To support the development of tailored approaches that can answer such questions, we construct Webis-CausalQA-22, a benchmark corpus of 1.1 million causal questions with answers. We distinguish different types of causal questions using a novel typology derived from a data-driven, manual analysis of questions from ten large question answering (QA) datasets. Using high-precision lexical rules, we extract causal questions of each type from these datasets to create our corpus. As an initial baseline, the state-of-the-art QA model UnifiedQA achieves a ROUGE-L F1 score of 0.48 on our new benchmark.
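For readers unfamiliar with the metric reported above, a ROUGE-L F1 score can be sketched from the longest common subsequence (LCS) of a generated answer and a reference answer. This minimal version assumes lowercased whitespace tokenization and a balanced F1; the benchmark's actual evaluation script may tokenize or stem differently, and the example sentences are invented.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(candidate, reference):
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

# Invented example: LCS is "smoking lung cancer" (3 tokens out of 4 and 7).
score = rouge_l_f1("smoking causes lung cancer", "smoking is a cause of lung cancer")
```

Here precision is 3/4 and recall is 3/7, so the F1 is 6/11, roughly 0.55.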

MACRONYM: A Large-Scale Dataset for Multilingual and Multi-Domain Acronym Extraction
Amir Pouran Ben Veyseh | Nicole Meister | Seunghyun Yoon | Rajiv Jain | Franck Dernoncourt | Thien Huu Nguyen

Acronym extraction is the task of identifying acronyms and their expanded forms in texts that is necessary for various NLP applications. Despite major progress for this task in recent years, one limitation of existing AE research is that they are limited to the English language and certain domains (i.e., scientific and biomedical). Challenges of AE in other languages and domains are mainly unexplored. As such, lacking annotated datasets in multiple languages and domains has been a major issue to prevent research in this direction. To address this limitation, we propose a new dataset for multilingual and multi-domain AE. Specifically, 27,200 sentences in 6 different languages and 2 new domains, i.e., legal and scientific, are manually annotated for AE. Our experiments on the dataset show that AE in different languages and learning settings has unique challenges, emphasizing the necessity of further research on multilingual and multi-domain AE.

Curating a Large-Scale Motivational Interviewing Dataset Using Peer Support Forums
Anuradha Welivita | Pearl Pu

A significant limitation in developing therapeutic chatbots to support people going through psychological distress is the lack of high-quality, large-scale datasets capturing conversations between clients and trained counselors. As a remedy, researchers have focused their attention on scraping conversational data from peer support platforms such as Reddit. But the extent to which the responses from peers align with responses from trained counselors is understudied. We address this gap by analyzing the differences between responses from counselors and peers by getting trained counselors to annotate ≈17K such responses using Motivational Interviewing Treatment Integrity (MITI) code, a well-established behavioral coding system that differentiates between favorable and unfavorable responses. We developed an annotation pipeline with several stages of quality control. Due to its design, this method was able to achieve 97% of coverage, meaning that out of the 17.3K responses we successfully labeled 16.8K with a moderate agreement. We use this data to conclude the extent to which conversational data from peer support platforms align with real therapeutic conversations and discuss in what ways they can be exploited to train therapeutic chatbots.

CCTC: A Cross-Sentence Chinese Text Correction Dataset for Native Speakers
Baoxin Wang | Xingyi Duan | Dayong Wu | Wanxiang Che | Zhigang Chen | Guoping Hu

Chinese text correction (CTC) focuses on detecting and correcting Chinese spelling errors and grammatical errors. Most existing datasets of Chinese spelling check (CSC) and Chinese grammatical error correction (GEC) are focused on a single sentence written by Chinese-as-a-second-language (CSL) learners. We find that errors made by native speakers differ significantly from those produced by non-native speakers. These differences make it inappropriate to use the existing test sets directly to evaluate text correction systems for native speakers. Some errors also require cross-sentence information to be identified and corrected. In this paper, we propose a cross-sentence Chinese text correction dataset for native speakers. Concretely, we manually annotated 1,500 texts written by native speakers. The dataset consists of 30,811 sentences and more than 1,000,000 Chinese characters. It contains four types of errors: spelling errors, redundant words, missing words, and word ordering errors. We also test some state-of-the-art models on the dataset. The experimental results show that even the model with the best performance is 20 points lower than humans, which indicates that there is still much room for improvement. We hope that the new dataset can fill the gap in cross-sentence text correction for native Chinese speakers.

RealMedDial: A Real Telemedical Dialogue Dataset Collected from Online Chinese Short-Video Clips
Bo Xu | Hongtong Zhang | Jian Wang | Xiaokun Zhang | Dezhi Hao | Linlin Zong | Hongfei Lin | Fenglong Ma

Intelligent medical services have attracted great research interest for providing automated medical consultation. However, the lack of corpora, particularly data from real scenarios, has become a main obstacle to related research. In this paper, we construct RealMedDial, a Chinese medical dialogue dataset based on real medical consultation. RealMedDial contains 2,637 medical dialogues and 24,255 utterances obtained from Chinese short-video clips of real medical consultations. We collected and annotated a wide range of meta-data with respect to medical dialogue, including doctor profiles, hospital departments, diseases and symptoms, for fine-grained analysis of language usage patterns and clinical diagnosis. We evaluate the performance of medical response generation, department routing and doctor recommendation on RealMedDial. Results show that RealMedDial is applicable to a wide range of NLP tasks with respect to medical dialogue.

TempoWiC: An Evaluation Benchmark for Detecting Meaning Shift in Social Media
Daniel Loureiro | Aminette D’Souza | Areej Nasser Muhajab | Isabella A. White | Gabriel Wong | Luis Espinosa-Anke | Leonardo Neves | Francesco Barbieri | Jose Camacho-Collados

Language evolves over time, and word meaning changes accordingly. This is especially true in social media, since its dynamic nature leads to faster semantic shifts, making it challenging for NLP models to deal with new content and trends. However, datasets and models that specifically address the dynamic nature of these social platforms are scarce. To bridge this gap, we present TempoWiC, a new benchmark especially aimed at accelerating research in social media-based meaning shift. Our results show that TempoWiC is a challenging benchmark, even for recently-released language models specialized in social media.

Automatic Generation of Large-scale Multi-turn Dialogues from Reddit
Daniil Huryn | William M. Hutsell | Jinho D. Choi

This paper presents novel methods to automatically convert posts and their comments from discussion forums such as Reddit into multi-turn dialogues. Our methods are generalizable to any forum; thus, they allow us to generate a massive amount of dialogues for diverse topics that can be used to pretrain language models. Four methods are introduced, Greedy_Baseline, Greedy_Advanced, Beam Search and Threading, which are applied to posts from 10 subreddits and assessed. Each method makes a noticeable improvement over its predecessor such that the best method shows an improvement of 36.3% over the baseline for appropriateness. Our best method is applied to posts from those 10 subreddits for the creation of a corpus comprising 10,098 dialogues (3.3M tokens), 570 of which are compared against dialogues in three other datasets, Blended Skill Talk, Daily Dialogue, and Topical Chat. Our dialogues are found to be more engaging but slightly less natural than the ones in the other datasets, while it costs a fraction of human labor and money to generate our corpus compared to the others. To the best of our knowledge, this is the first work to create a large multi-turn dialogue corpus from Reddit that can advance neural dialogue systems.

ConFiguRe: Exploring Discourse-level Chinese Figures of Speech
Dawei Zhu | Qiusi Zhan | Zhejian Zhou | Yifan Song | Jiebin Zhang | Sujian Li

Figures of speech, such as metaphor and irony, are ubiquitous in literary works and colloquial conversations. This poses a great challenge for natural language understanding, since figures of speech usually deviate from their ostensible meanings to express deeper semantic implications. Previous research lays emphasis on the literary aspect of figures and seldom provides a comprehensive exploration from the viewpoint of computational linguistics. In this paper, we first propose the concept of figurative unit, which is the carrier of a figure. Then we select 12 types of figures commonly used in Chinese, and build a Chinese corpus for Contextualized Figure Recognition (ConFiguRe). Different from previous token-level or sentence-level counterparts, ConFiguRe aims at extracting a figurative unit from discourse-level context, and classifying the figurative unit into the right figure type. On ConFiguRe, three tasks, i.e., figure extraction, figure type classification and figure recognition, are designed and the state-of-the-art techniques are utilized to implement the benchmarks. We conduct thorough experiments and show that all three tasks are challenging for existing models, thus requiring further research. Our dataset and code are publicly available at

Twitter Topic Classification
Dimosthenis Antypas | Asahi Ushio | Jose Camacho-Collados | Vitor Silva | Leonardo Neves | Francesco Barbieri

Social media platforms host discussions about a wide variety of topics that arise every day. Making sense of all the content and organising it into categories is an arduous task. A common way to deal with this issue is relying on topic modeling, but topics discovered using this technique are difficult to interpret and can differ from corpus to corpus. In this paper, we present a new task based on tweet topic classification and release two associated datasets. Given a wide range of topics covering the most important discussion points in social media, we provide training and testing data from recent time periods that can be used to evaluate tweet classification models. Moreover, we perform a quantitative evaluation and analysis of current general- and domain-specific language models on the task, which provides more insights into the challenges and nature of the task.

Layer or Representation Space: What Makes BERT-based Evaluation Metrics Robust?
Doan Nam Long Vu | Nafise Sadat Moosavi | Steffen Eger

The evaluation of recent embedding-based evaluation metrics for text generation is primarily based on measuring their correlation with human evaluations on standard benchmarks. However, these benchmarks are mostly from similar domains to those used for pretraining word embeddings. This raises concerns about the (lack of) generalization of embedding-based metrics to new and noisy domains that contain a different vocabulary than the pretraining data. In this paper, we examine the robustness of BERTScore, one of the most popular embedding-based metrics for text generation. We show that (a) an embedding-based metric that has the highest correlation with human evaluations on a standard benchmark can have the lowest correlation if the amount of input noise or unknown tokens increases, (b) taking embeddings from the first layer of pretrained models improves the robustness of all metrics, and (c) the highest robustness is achieved when using character-level embeddings, instead of token-based embeddings, from the first layer of the pretrained model.

Evaluating the Performance of Transformer-based Language Models for Neuroatypical Language
Duanchen Liu | Zoey Liu | Qingyun Yang | Yujing Huang | Emily Prud’hommeaux

Difficulties with social aspects of language are among the hallmarks of autism spectrum disorder (ASD). These communication differences are thought to contribute to the challenges that adults with ASD experience when seeking employment, underscoring the need for interventions that focus on improving areas of weakness in pragmatic and social language. In this paper, we describe a transformer-based framework for identifying linguistic features associated with social aspects of communication using a corpus of conversations between adults with and without ASD and neurotypical conversational partners produced while engaging in collaborative tasks. While our framework yields strong accuracy overall, performance is significantly worse for the language of participants with ASD, suggesting that they use a more diverse set of strategies for some social linguistic functions. These results, while showing promise for the development of automated language analysis tools to support targeted language interventions for ASD, also reveal weaknesses in the ability of large contextualized language models to model neuroatypical language.

TERMinator: A System for Scientific Texts Processing
Elena Bruches | Olga Tikhobaeva | Yana Dementyeva | Tatiana Batura

This paper is devoted to the extraction of entities and semantic relations between them from scientific texts, where we consider scientific terms as entities. In this paper, we present a dataset that includes annotations for two tasks and develop a system called TERMinator for the study of the influence of language models on term recognition and comparison of different approaches for relation extraction. Experiments show that language models pre-trained on the target language do not always show the best performance. In addition, adding some heuristic approaches may improve the overall quality on a particular task. The developed tool and the annotated corpus are publicly available at and may be useful for other researchers.

LipKey: A Large-Scale News Dataset for Absent Keyphrases Generation and Abstractive Summarization
Fajri Koto | Timothy Baldwin | Jey Han Lau

Summaries, keyphrases, and titles are different ways of concisely capturing the content of a document. While most previous work has released the datasets of keyphrases and summarization separately, in this work, we introduce LipKey, the largest news corpus with human-written abstractive summaries, absent keyphrases, and titles. We jointly use the three elements via multi-task training and training as joint structured inputs, in the context of document summarization. We find that including absent keyphrases and titles as additional context to the source document improves transformer-based summarization models.

Understanding Attention for Vision-and-Language Tasks
Feiqi Cao | Soyeon Caren Han | Siqu Long | Changwei Xu | Josiah Poon

The attention mechanism has been used as an important component across Vision-and-Language (VL) tasks in order to bridge the semantic gap between visual and textual features. While attention has been widely used in VL tasks, the capability of different attention alignment calculations to bridge the semantic gap between visual and textual clues has not been examined. In this research, we conduct a comprehensive analysis of the role of attention alignment by looking into the attention score calculation methods and checking how they actually represent the significance of visual regions and textual tokens for the global assessment. We also analyse the conditions under which an attention score calculation mechanism would be more (or less) interpretable, and which may impact the model performance on three different VL tasks, including visual question answering, text-to-image generation, and text-and-image matching (both sentence and image retrieval). Our analysis is the first of its kind and provides useful insights into the importance of each attention alignment score calculation when applied at the training phase of VL tasks, commonly ignored in attention-based cross modal models, and/or pretrained models. Our code is available at:

Effective Data Augmentation for Sentence Classification Using One VAE per Class
Frédéric Piedboeuf | Philippe Langlais

In recent years, data augmentation has become an important field of machine learning. While images can use simple techniques such as cropping or rotating, textual data augmentation needs more complex manipulations to ensure that the generated examples are useful. The variational auto-encoder (VAE) and its conditional variant, the Conditional-VAE (CVAE), are often used to generate new textual data, both relying on a good enough training of the generator so that it doesn’t create examples of the wrong class. In this paper, we explore a simpler way to use VAE for data augmentation: the training of one VAE per class. We show on several dataset sizes, as well as on four different binary classification tasks, that it systematically outperforms other generative data augmentation techniques.

NLG-Metricverse: An End-to-End Library for Evaluating Natural Language Generation
Giacomo Frisoni | Antonella Carbonaro | Gianluca Moro | Andrea Zammarchi | Marco Avagnano

Driven by deep learning breakthroughs, natural language generation (NLG) models have been at the center of steady progress in the last few years, with a ubiquitous task influence. However, since our ability to generate human-indistinguishable artificial text lags behind our capacity to assess it, it is paramount to develop and apply even better automatic evaluation metrics. To help researchers judge the effectiveness of their models broadly, we introduce NLG-Metricverse, an end-to-end open-source library for NLG evaluation based on Python. Our framework provides a living collection of NLG metrics in a unified and easy-to-use environment, supplying tools to efficiently apply, analyze, compare, and visualize them. This includes (i) extensive support for heterogeneous automatic metrics with n-arity management, (ii) meta-evaluation of individual performance, metric-metric and metric-human correlations, (iii) graphical interpretations for helping humans better gain score intuitions, and (iv) formal categorization and convenient documentation to accelerate metrics understanding. NLG-Metricverse aims to increase the comparability and replicability of NLG research, hopefully stimulating new contributions in the area.

TestAug: A Framework for Augmenting Capability-based NLP Tests
Guanqun Yang | Mirazul Haque | Qiaochu Song | Wei Yang | Xueqing Liu

The recently proposed capability-based NLP testing allows model developers to test the functional capabilities of NLP models, revealing functional failures for models with good held-out evaluation scores. However, existing work on capability-based testing requires the developer to compose each individual test template from scratch. Such an approach thus requires extensive manual effort and is less scalable. In this paper, we investigate a different approach that requires the developer to only annotate a few test templates, while leveraging the GPT-3 engine to generate the majority of test cases. While our approach saves manual effort by design, it guarantees the correctness of the generated suites with a validity checker. Moreover, our experimental results show that the test suites generated by GPT-3 are more diverse than the manually created ones; they can also be used to detect more errors compared to manually created counterparts. Our test suites can be downloaded at

KoCHET: A Korean Cultural Heritage Corpus for Entity-related Tasks
Gyeongmin Kim | Jinsung Kim | Junyoung Son | Heuiseok Lim

As digitized traditional cultural heritage documents have rapidly increased, resulting in an increased need for preservation and management, practical recognition of entities and typification of their classes has become essential. To achieve this, we propose KoCHET - a Korean cultural heritage corpus for the typical entity-related tasks, i.e., named entity recognition (NER), relation extraction (RE), and entity typing (ET). Advised by cultural heritage experts based on the data construction guidelines of government-affiliated organizations, KoCHET consists of 112,362, 38,765, and 113,198 examples for the NER, RE, and ET tasks, respectively, covering all entity types related to Korean cultural heritage. Moreover, unlike the existing public corpora, modified redistribution is allowed for both domestic and foreign researchers. Our experimental results make the practical usability of KoCHET more valuable in terms of cultural heritage. We also provide practical insights into KoCHET in terms of statistical and linguistic analysis. Our corpus is freely available at

MonoByte: A Pool of Monolingual Byte-level Language Models
Hugo Abonizio | Leandro Rodrigues de Souza | Roberto Lotufo | Rodrigo Nogueira

The zero-shot cross-lingual ability of models pretrained on multilingual and even monolingual corpora has spurred many hypotheses to explain this intriguing empirical result. However, due to the costs of pretraining, most research uses public models whose pretraining methodology, such as the choice of tokenization, corpus size, and computational budget, might differ drastically. When researchers pretrain their own models, they often do so under a constrained budget, and the resulting models might underperform significantly compared to SOTA models. These experimental differences led to various inconsistent conclusions about the nature of the cross-lingual ability of these models. To help further research on the topic, we released 10 monolingual byte-level models rigorously pretrained under the same configuration with a large compute budget (equivalent to 420 days on a V100) and corpora that are 4 times larger than the original BERT’s. Because they are tokenizer-free, the problem of unseen token embeddings is eliminated, thus allowing researchers to try a wider range of cross-lingual experiments in languages with different scripts. Additionally, we release two models pretrained on non-natural language texts that can be used in sanity-check experiments. Experiments on QA and NLI tasks show that our monolingual models achieve performance competitive with the multilingual one, and hence can serve to strengthen our understanding of cross-lingual transferability in language models.

Wizard of Tasks: A Novel Conversational Dataset for Solving Real-World Tasks in Conversational Settings
Jason Ingyu Choi | Saar Kuzi | Nikhita Vedula | Jie Zhao | Giuseppe Castellucci | Marcus Collins | Shervin Malmasi | Oleg Rokhlenko | Eugene Agichtein

Conversational Task Assistants (CTAs) are conversational agents whose goal is to help humans perform real-world tasks. CTAs can help in exploring available tasks, answering task-specific questions and guiding users through step-by-step instructions. In this work, we present Wizard of Tasks, the first corpus of such conversations in two domains: Cooking and Home Improvement. We crowd-sourced a total of 549 conversations (18,077 utterances) with an asynchronous Wizard-of-Oz setup, relying on recipes from WholeFoods Market for the cooking domain, and WikiHow articles for the home improvement domain. We present a detailed data analysis and show that the collected data can be a valuable and challenging resource for CTAs in two tasks: Intent Classification (IC) and Abstractive Question Answering (AQA). While on IC we acquired a high performing model (>85% F1), on AQA the performance is far from being satisfactory (~27% BertScore-F1), suggesting that more work is needed to solve the task of low-resource AQA.

K-MHaS: A Multi-label Hate Speech Detection Dataset in Korean Online News Comment
Jean Lee | Taejun Lim | Heejun Lee | Bogeun Jo | Yangsok Kim | Heegeun Yoon | Soyeon Caren Han

Online hate speech detection has become an important issue due to the growth of online content, but resources in languages other than English are extremely limited. We introduce K-MHaS, a new multi-label dataset for hate speech detection that effectively handles Korean language patterns. The dataset consists of 109k utterances from news comments and provides a multi-label classification using 1 to 4 labels, and handles subjectivity and intersectionality. We evaluate strong baselines on K-MHaS. KR-BERT with a sub-character tokenizer outperforms others, recognizing decomposed characters in each hate speech class.

Domain- and Task-Adaptation for VaccinChatNL, a Dutch COVID-19 FAQ Answering Corpus and Classification Model
Jeska Buhmann | Maxime De Bruyn | Ehsan Lotfi | Walter Daelemans

FAQs are important resources to find information. However, especially if a FAQ concerns many question-answer pairs, it can be a difficult and time-consuming job to find the answer you are looking for. A FAQ chatbot can ease this process by automatically retrieving the relevant answer to a user’s question. We present VaccinChatNL, a Dutch FAQ corpus on the topic of COVID-19 vaccination. Starting with 50 question-answer pairs we built VaccinChat, a FAQ chatbot, which we used to gather more user questions that were also annotated with the appropriate or new answer classes. This iterative process of gathering user questions, annotating them, and retraining the model with the increased data set led to a corpus that now contains 12,883 user questions divided over 181 answers. We provide the first publicly available Dutch FAQ answering data set of this size with large groups of semantically equivalent human-paraphrased questions. Furthermore, our study shows that before fine-tuning a classifier, continued pre-training of Dutch language models with task- and/or domain-specific data improves classification results. In addition, we show that large groups of semantically similar questions are important for obtaining well-performing intent classification models.

Benchmarking Automated Clinical Language Simplification: Dataset, Algorithm, and Evaluation
Junyu Luo | Junxian Lin | Chi Lin | Cao Xiao | Xinning Gui | Fenglong Ma

Patients with low health literacy usually have difficulty understanding medical jargon and the complex structure of professional medical language. Although some studies have proposed to automatically translate expert language into layperson-understandable language, only a few of them focus on both accuracy and readability aspects simultaneously in the clinical domain. Thus, simplification of clinical language is still a challenging task that has not yet been fully addressed in previous work. To benchmark this task, we construct a new dataset named MedLane to support the development and evaluation of automated clinical language simplification approaches. Besides, we propose a new model called DECLARE that follows the human annotation procedure and achieves state-of-the-art performance compared with eight strong baselines. To fairly evaluate the performance, we also propose three specific evaluation metrics. Experimental results demonstrate the utility of the annotated MedLane dataset and the effectiveness of the proposed model DECLARE.

WikiHan: A New Comparative Dataset for Chinese Languages
Kalvin Chang | Chenxuan Cui | Youngmin Kim | David R. Mortensen

Most comparative datasets of Chinese varieties are not digital; however, Wiktionary includes a wealth of transcriptions of words from these varieties. The usefulness of these data is limited by the fact that they use a wide range of variety-specific romanizations, making data difficult to compare. The current work collects this data into a single consistent (IPA, or International Phonetic Alphabet) and structured (TSV) form for use in comparative linguistics and Chinese NLP. At the time of writing, the dataset contains 67,943 entries across 8 varieties and Middle Chinese. The dataset is validated on a protoform reconstruction task using an encoder-decoder cross-attention architecture (Meloni et al., 2021), achieving an accuracy of 54.11%, a PER (phoneme error rate) of 17.69%, and a FER (feature error rate) of 6.60%.
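For readers unfamiliar with the PER metric mentioned in this abstract, the sketch below shows one common way to compute a phoneme error rate: the total Levenshtein edit distance between predicted and reference phoneme sequences, divided by the total reference length. This is an assumption about the general form of the metric, not the authors' evaluation code; the function names and the toy phoneme sequences are made up for illustration.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(refs: List[Sequence[str]],
                       hyps: List[Sequence[str]]) -> float:
    """Total edits divided by total reference phonemes (corpus-level PER)."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same length")
    total_len = sum(len(r) for r in refs)
    if total_len == 0:
        raise ValueError("references contain no phonemes")
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return total_edits / total_len


# Hypothetical toy data: one substitution over five reference phonemes.
references = [["k", "a", "n"], ["m", "a"]]
predictions = [["k", "a", "ŋ"], ["m", "a"]]
per = phoneme_error_rate(references, predictions)
print(per)  # prints 0.2
```

A feature error rate would follow the same pattern but compare phonological feature vectors rather than whole phoneme symbols, which is why it is typically lower than the PER.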

Visual Recipe Flow: A Dataset for Learning Visual State Changes of Objects with Recipe Flows
Keisuke Shirai | Atsushi Hashimoto | Taichi Nishimura | Hirotaka Kameko | Shuhei Kurita | Yoshitaka Ushiku | Shinsuke Mori

We present a new multimodal dataset called Visual Recipe Flow, which enables us to learn a cooking action result for each object in a recipe text. The dataset consists of object state changes and the workflow of the recipe text. The state change is represented as an image pair, while the workflow is represented as a recipe flow graph. We developed a web interface to reduce human annotation costs. The dataset allows us to try various applications, including multimodal information retrieval.

IMPARA: Impact-Based Metric for GEC Using Parallel Data
Koki Maeda | Masahiro Kaneko | Naoaki Okazaki

Automatic evaluation of grammatical error correction (GEC) is essential in developing useful GEC systems. Existing methods for automatic evaluation require multiple reference sentences or manual scores. However, such resources are expensive, thereby hindering automatic evaluation for various domains and correction styles. This paper proposes an Impact-based Metric for GEC using PARAllel data, IMPARA, which utilizes correction impacts computed by parallel data comprising pairs of grammatical/ungrammatical sentences. As parallel data is cheaper than manually assessing evaluation scores, IMPARA can reduce the cost of data creation for automatic evaluation. Correlations between IMPARA and human scores indicate that IMPARA is comparable to or better than existing evaluation methods. Furthermore, we find that IMPARA, when trained on various parallel data, can perform evaluations that fit different domains and correction styles.

Evons: A Dataset for Fake and Real News Virality Analysis and Prediction
Kriste Krstovski | Angela Soomin Ryu | Bruce Kogut

We present a novel collection of news articles originating from fake and real news media sources for the analysis and prediction of news virality. Unlike existing fake news datasets, which contain either claims or news article headlines and bodies, in this collection each article is supported with a Facebook engagement count, which we consider as an indicator of the article's virality. In addition, we provide the article description and thumbnail image with which the article was shared on Facebook. These images were automatically annotated with object tags and color attributes. Using cloud-based vision analysis tools, thumbnail images were also analyzed for faces, and detected faces were annotated with facial attributes. We empirically investigate the use of this collection on an example task of article virality prediction.

Are Pretrained Multilingual Models Equally Fair across Languages?
Laura Cabello Piqueras | Anders Søgaard

Pretrained multilingual language models can help bridge the digital language divide, enabling high-quality NLP models for lower-resourced languages. Studies of multilingual models have so far focused on performance, consistency, and cross-lingual generalisation. However, with their wide-spread application in the wild and downstream societal impact, it is important to put multilingual models under the same scrutiny as monolingual models. This work investigates the group fairness of multilingual models, asking whether these models are equally fair across languages. To this end, we create a new four-way multilingual dataset of parallel cloze test examples (MozArt), equipped with demographic information (balanced with regard to gender and native tongue) about the test participants. We evaluate three multilingual models on MozArt (mBERT, XLM-R, and mT5) and show that across the four target languages, the three models exhibit different levels of group disparity, e.g., exhibiting near-equal risk for Spanish, but high levels of disparity for German.

Possible Stories: Evaluating Situated Commonsense Reasoning under Multiple Possible Scenarios
Mana Ashida | Saku Sugawara

The possible consequences for the same context may vary depending on the situation we refer to. However, current studies in natural language processing do not focus on situated commonsense reasoning under multiple possible scenarios. This study frames this task by asking multiple questions with the same set of possible endings as candidate answers, given a short story text. Our resulting dataset, Possible Stories, consists of more than 4.5K questions over 1.3K story texts in English. We discover that even current strong pretrained language models struggle to answer the questions consistently, highlighting that the highest accuracy in an unsupervised setting (60.2%) is far behind human accuracy (92.5%). Through a comparison with existing datasets, we observe that the questions in our dataset contain minimal annotation artifacts in the answer options. In addition, our dataset includes examples that require counterfactual reasoning, as well as those requiring readers’ reactions and fictional information, suggesting that our dataset can serve as a challenging testbed for future studies on situated commonsense reasoning.

DiaBiz.Kom - towards a Polish Dialogue Act Corpus Based on ISO 24617-2 Standard
Marcin Oleksy | Jan Wieczorek | Dorota Drużyłowska | Julia Klyus | Aleksandra Domogała | Krzysztof Hwaszcz | Hanna Kędzierska | Daria Mikoś | Anita Wróż

This article presents the specification and evaluation of DiaBiz.Kom – the corpus of dialogue texts in Polish. The corpus contains transcriptions of telephone conversations conducted according to a prepared scenario. The transcripts of conversations have been manually annotated with a layer of information concerning communicative functions. DiaBiz.Kom is the first corpus of this type prepared for the Polish language and will be used to develop a system of dialog analysis and modules for creating advanced chatbots.

Towards Explainable Evaluation of Language Models on the Semantic Similarity of Visual Concepts
Maria Lymperaiou | George Manoliadis | Orfeas Menis Mastromichalakis | Edmund G. Dervakos | Giorgos Stamou

Recent breakthroughs in NLP research, such as the advent of Transformer models, have indisputably contributed to major advancements in several tasks. However, few works research robustness and explainability issues of their evaluation strategies. In this work, we examine the behavior of high-performing pre-trained language models, focusing on the task of semantic similarity for visual vocabularies. First, we address the need for explainable evaluation metrics, necessary for understanding the conceptual quality of retrieved instances. Our proposed metrics provide valuable insights at the local and global levels, showcasing the shortcomings of widely used approaches. Second, adversarial interventions on salient query semantics expose vulnerabilities of opaque metrics and highlight patterns in learned linguistic representations.

Establishing Annotation Quality in Multi-label Annotations
Marian Marchal | Merel Scholman | Frances Yung | Vera Demberg

In many linguistic fields requiring annotated data, multiple interpretations of a single item are possible. Multi-label annotations more accurately reflect this possibility. However, allowing for multi-label annotations also affects the chance that two coders agree with each other. Calculating inter-coder agreement for multi-label datasets is therefore not trivial. In the current contribution, we evaluate different metrics for calculating agreement on multi-label annotations: agreement on the intersection of annotated labels, an augmented version of Cohen’s Kappa, and precision, recall and F1. We propose a bootstrapping method to obtain chance agreement for each measure, which allows us to obtain an adjusted agreement coefficient that is more interpretable. We demonstrate how various measures affect estimates of agreement on simulated datasets and present a case study of discourse relation annotations. We also show how the proportion of double labels and the entropy of the label distribution influence the measures outlined above and how a bootstrapped adjusted agreement can make agreement measures more comparable across datasets in multi-label scenarios.
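The idea of a bootstrapped, chance-adjusted agreement coefficient described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: two coders are taken to agree on an item when their label sets intersect, chance agreement is estimated by repeatedly pairing annotations drawn at random (with replacement) from each coder's pool, and the adjustment uses the familiar kappa-style form (observed − chance) / (1 − chance). All function names and the toy labels are hypothetical.

```python
import random
from typing import List, Set


def intersection_agreement(a: List[Set[str]], b: List[Set[str]]) -> float:
    """Fraction of items on which the two coders share at least one label."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty annotation lists of equal length")
    return sum(1 for x, y in zip(a, b) if x & y) / len(a)


def bootstrap_chance(a: List[Set[str]], b: List[Set[str]],
                     n_iter: int = 1000, seed: int = 0) -> float:
    """Estimate chance agreement by randomly re-pairing annotations."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(a)
    total = 0.0
    for _ in range(n_iter):
        sample_a = [rng.choice(a) for _ in range(n)]
        sample_b = [rng.choice(b) for _ in range(n)]
        total += intersection_agreement(sample_a, sample_b)
    return total / n_iter


def adjusted_agreement(a: List[Set[str]], b: List[Set[str]],
                       n_iter: int = 1000, seed: int = 0) -> float:
    """Kappa-style coefficient: (observed - chance) / (1 - chance)."""
    observed = intersection_agreement(a, b)
    chance = bootstrap_chance(a, b, n_iter=n_iter, seed=seed)
    if chance >= 1.0:
        raise ValueError("chance agreement is 1; coefficient is undefined")
    return (observed - chance) / (1.0 - chance)


# Hypothetical discourse-relation labels from two coders.
coder_a = [{"cause"}, {"contrast"}, {"cause", "result"}, {"concession"}]
coder_b = [{"cause"}, {"contrast", "concession"}, {"result"}, {"contrast"}]

observed = intersection_agreement(coder_a, coder_b)
adjusted = adjusted_agreement(coder_a, coder_b)
print(f"observed={observed:.2f} adjusted={adjusted:.2f}")
```

Because items with two labels are more likely to overlap by accident, the estimated chance level rises with the proportion of double labels, which is the effect the adjusted coefficient is meant to correct for.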

Biographically Relevant Tweets – a New Dataset, Linguistic Analysis and Classification Experiments
Michael Wiegand | Rebecca Wilm | Katja Markert

We present a new dataset comprising tweets for the novel task of detecting biographically relevant utterances. Biographically relevant utterances are all those utterances that reveal some persistent and non-trivial information about the author of a tweet, e.g. habits, (dis)likes, family status, physical appearance, employment information, health issues etc. Unlike previous research we do not restrict biographical relevance to a small fixed set of pre-defined relations. Next to classification experiments employing state-of-the-art classifiers to establish strong baselines for future work, we carry out a linguistic analysis that compares the predictiveness of various high-level features. We also show that the task is different from established tasks, such as aspectual classification or sentiment analysis.

BECEL: Benchmark for Consistency Evaluation of Language Models
Myeongjun Jang | Deuk Sin Kwon | Thomas Lukasiewicz

Behavioural consistency is a critical condition for a language model (LM) to become trustworthy like humans. Despite its importance, however, there is little consensus on the definition of LM consistency, resulting in different definitions across many studies. In this paper, we first propose the idea of LM consistency based on behavioural consistency and establish a taxonomy that classifies previously studied consistencies into several sub-categories. Next, we create a new benchmark that allows us to evaluate a model on 19 test cases, distinguished by multiple types of consistency and diverse downstream tasks. Through extensive experiments on the new benchmark, we ascertain that none of the modern pre-trained language models (PLMs) performs well in every test case, while exhibiting high inconsistency in many cases. Our experimental results suggest that a unified benchmark that covers broad aspects (i.e., multiple consistency types and tasks) is essential for a more precise evaluation.

KoBEST: Korean Balanced Evaluation of Significant Tasks
Myeongjun Jang | Dohyung Kim | Deuk Sin Kwon | Eric Davis

A well-formulated benchmark plays a critical role in spurring advancements in the natural language processing (NLP) field, as it allows objective and precise evaluation of diverse models. As modern language models (LMs) have become more elaborate and sophisticated, more difficult benchmarks that require linguistic knowledge and reasoning have been proposed. However, most of these benchmarks only support English, and great effort is necessary to construct benchmarks for other, lower-resource languages. To this end, we propose a new benchmark named Korean balanced evaluation of significant tasks (KoBEST), which consists of five Korean-language downstream tasks. Professional Korean linguists designed the tasks, which require advanced Korean linguistic knowledge. Moreover, our data is annotated entirely by humans and thoroughly reviewed to guarantee high data quality. We also provide baseline models and human performance results. Our dataset is available on Hugging Face.

A New Public Corpus for Clinical Section Identification: MedSecId
Paul Landes | Kunal Patel | Sean S. Huang | Adam Webb | Barbara Di Eugenio | Cornelia Caragea

The process by which sections in a document are demarcated and labeled is known as section identification. Such sections are helpful to the reader when searching for information and contextualizing specific topics. The goal of this work is to segment the sections of clinical medical domain documentation. The primary contribution of this work is MedSecId, a publicly available set of 2,002 fully annotated medical notes from MIMIC-III. We include several baselines, source code, a pretrained model, and an analysis of the data showing a relationship between medical concepts across sections using principal component analysis.

A Data-driven Approach to Named Entity Recognition for Early Modern French
Pedro Ortiz Suarez | Simon Gabay

Named entity recognition has become an increasingly useful tool for digital humanities research, especially when it comes to historical texts. However, historical texts pose a wide range of challenges to both named entity recognition and natural language processing in general that are still difficult to address even with modern neural methods. In this article we focus on named entity recognition for historical French, and in particular for Early Modern French (16th-18th c.), i.e. Ancien Régime French. However, instead of developing a specialised architecture to tackle the particularities of this state of language, we opt for a data-driven approach by developing a new corpus with fine-grained entity annotation, covering three centuries of literature corresponding to the early modern period; we try to annotate as much data as possible, producing a corpus that is many times bigger than the most popular NER evaluation corpora for both Contemporary English and French. We then fine-tune existing state-of-the-art architectures for Early Modern and Contemporary French, obtaining results that are on par with those of the current state-of-the-art NER systems for Contemporary English. Both the corpus and the fine-tuned models are released.

Reproducibility and Automation of the Appraisal Taxonomy
Pradeesh Parameswaran | Andrew Trotman | Veronica Liesaputra | David Eyers

There is a lack of reproducibility in results from experiments that apply the Appraisal taxonomy. Appraisal is widely used by linguists to study how people judge things or people. Automating Appraisal could be beneficial for use cases such as moderating online comments. Past work in Appraisal annotation has been descriptive in nature, and the lack of publicly available data sets hinders progress on automation. In this work, we are interested in two things. First, we measure the performance of automated approaches to Appraisal classification on the publicly available Australasian Language Technology Association (ALTA) Shared Task Challenge data set. Second, we reproduce the annotation of the ALTA data set. Four additional annotators, each with a different linguistics background, were employed to re-annotate the data set. Our results show a poor level of agreement at more detailed Appraisal categories (Fleiss’ Kappa = 0.059) and a fair level of agreement (Kappa = 0.372) at coarse-level categories. We find similar results when using automated approaches that are available publicly. Our empirical evidence suggests that at present, automating classification is practical only when considering coarse-level categories of the taxonomy.
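
For readers unfamiliar with the agreement statistic reported in this abstract, the sketch below computes Fleiss' kappa from its standard definition. It is an illustrative example and not the authors' code; the input layout (one row per item, one column per category, each cell holding the number of annotators who chose that category) and the toy numbers are assumptions.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table counts[item][category] of rater counts.

    Every item must be rated by the same number of raters (at least two).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])

    # Observed agreement per item, then averaged over items.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items

    # Expected agreement from the marginal category proportions.
    p_cats = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_e = sum(p * p for p in p_cats)

    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_bar - p_e) / (1.0 - p_e)


# Toy table: 3 items, 5 annotators, 3 coarse categories (hypothetical data).
toy = [
    [5, 0, 0],
    [2, 3, 0],
    [0, 1, 4],
]
print(round(fleiss_kappa(toy), 3))
```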

Few-Shot Table Understanding: A Benchmark Dataset and Pre-Training Baseline
Ruixue Liu | Shaozu Yuan | Aijun Dai | Lei Shen | Tiangang Zhu | Meng Chen | Xiaodong He

Few-shot table understanding is a critical and challenging problem in real-world scenarios, as annotations over large numbers of tables are usually costly. Pre-trained language models (PLMs), which have recently flourished on tabular data, have demonstrated their effectiveness for table understanding tasks. However, few-shot table understanding is rarely explored due to the lack of public table pre-training corpora and well-defined downstream benchmark tasks, especially in Chinese. In this paper, we establish a benchmark dataset, FewTUD, which consists of 5 different tasks with human annotations, to systematically explore few-shot table understanding in depth. Since few Chinese tables are publicly available, we also collect a large-scale, multi-domain tabular corpus to facilitate future Chinese table pre-training, which includes one million tables and related natural language text with auxiliary supervised interaction signals. Finally, we present FewTPT, a novel table PLM with rich interactions over tabular data, and evaluate its performance comprehensively on the benchmark. Our dataset and model will be released to the public soon.

Tafsir Dataset: A Novel Multi-Task Benchmark for Named Entity Recognition and Topic Modeling in Classical Arabic Literature
Sajawel Ahmed | Rob van der Goot | Misbahur Rehman | Carl Kruse | Ömer Özsoy | Alexander Mehler | Gemma Roig

Various historical languages, which used to be the lingua franca of science and arts, deserve the attention of current NLP research. In this work, we take the first data-driven steps along this research line for Classical Arabic (CA) by addressing named entity recognition (NER) and topic modeling (TM) on the example of CA literature. We manually annotate the encyclopedic work of Tafsir Al-Tabari with span-based NEs, sentence-based topics, and span-based subtopics, thus creating the Tafsir Dataset with over 51,000 sentences, the first large-scale multi-task benchmark for CA. Next, we analyze our newly generated dataset, which we make available as open source, with current language models (lightweight BiLSTM, transformer-based MaChAmP) along with a novel script compression method, thereby achieving state-of-the-art performance for our target task CA-NER. We also show that CA-TM from the perspective of historical topic models, which are central to Arabic studies, is very challenging. With this interdisciplinary work, we lay the foundations for future research on automatic analysis of CA literature.

Resource of Wikipedias in 31 Languages Categorized into Fine-Grained Named Entities
Satoshi Sekine | Kouta Nakayama | Masako Nomoto | Maya Ando | Asuka Sumida | Koji Matsuda

This paper describes a resource of Wikipedias in 31 languages categorized into Extended Named Entity (ENE), which has 219 fine-grained NE categories. We first categorized 920K Japanese Wikipedia pages according to the ENE scheme using machine learning, followed by manual validation. We then organized a shared task of Wikipedia categorization in 30 languages. The training data were provided by the Japanese categorization and the language links, and the task was to categorize the Wikipedia pages in 30 languages that have no language links from Japanese Wikipedia (20M pages in total). Thirteen groups with 24 systems participated in the 2020 and 2021 tasks, sharing their outputs for resource-building. The Japanese categorization accuracy was 98.5%, and the best performance among the 30 languages ranges from 80 to 93 in F-measure. Using ensemble learning, we created outputs with an average F-measure of 86.8, which is 1.7 better than the best single systems. The total size of the resource is 32.5M pages, including the training data. We call this resource creation scheme “Resource by Collaborative Contribution (RbCC)”. We also constructed structuring tasks (attribute extraction and link prediction) using RbCC under our ongoing project, “SHINRA.”

Accuracy meets Diversity in a News Recommender System
Shaina Raza | Syed Raza Bashir | Usman Naseem

News recommender systems face certain challenges. These challenges arise from users’ evolving preferences over dynamically created news articles. Diversity is necessary for a news recommender system to expose users to a variety of information. We propose a deep neural network based on a two-tower architecture that learns news representations through a news item tower and users’ representations through a query tower. We customize an augmented vector for each query and news item to introduce information interaction between the two towers. We introduce diversity in the proposed architecture by considering a category loss function that aligns items’ representations of uneven news categories. Experimental results on two news datasets reveal that our proposed architecture is more effective than the state-of-the-art methods and achieves a balance between accuracy and diversity.

Dynamic Nonlinear Mixup with Distance-based Sample Selection
Shaokang Zhang | Lei Jiang | Jianlong Tan

Data augmentation with mixup has been shown to be effective on NLP tasks. Despite its great success, mixup still has shortcomings. First, vanilla mixup randomly selects one sample to generate the mixup sample for a given sample. It remains unclear how best to choose the input samples for mixup. Second, linear interpolation limits the space of synthetic data and its regularization effect. In this paper, we propose dynamic nonlinear mixup with distance-based sample selection, which not only generates multiple sample pairs based on the distance between samples but also enlarges the space of synthetic samples. Specifically, we compute the distance between input samples by cosine similarity and select multiple samples for a given sample. Then we use dynamic nonlinear mixup to fuse sample pairs. It does not use a linear, scalar mixing strategy, but a nonlinear interpolation strategy, where the mixing strategy is adaptively updated for the input and label pairs. Experiments on multiple public datasets demonstrate that dynamic nonlinear mixup outperforms state-of-the-art methods.
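
To make the two ingredients named in this abstract concrete, the sketch below shows cosine-similarity-based partner selection followed by vanilla linear mixup, which is the baseline the paper improves on. It is not the authors' method: the abstract does not say whether near or far samples are selected, so choosing the k most similar samples is an assumption, and the dynamic nonlinear mixing itself is not reproduced. All names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def select_partners(index, features, k):
    """Indices of the k samples most similar to features[index] (assumed rule)."""
    scored = [
        (cosine_similarity(features[index], f), j)
        for j, f in enumerate(features)
        if j != index
    ]
    scored.sort(reverse=True)
    return [j for _, j in scored[:k]]


def linear_mixup(x_i, y_i, x_j, y_j, lam):
    """Vanilla mixup: one scalar weight interpolates both inputs and labels."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
    return x, y


# Toy sentence representations with one-hot labels (hypothetical data).
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
labels = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

for j in select_partners(0, features, k=2):
    mixed_x, mixed_y = linear_mixup(features[0], labels[0], features[j], labels[j], lam=0.7)
    print(j, mixed_x, mixed_y)
```

Selecting several partners per sample yields several mixed pairs per input, in contrast to vanilla mixup's single random partner.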

MultiCoNER: A Large-scale Multilingual Dataset for Complex Named Entity Recognition
Shervin Malmasi | Anjie Fang | Besnik Fetahu | Sudipta Kar | Oleg Rokhlenko

We present MultiCoNER, a large multilingual dataset for Named Entity Recognition that covers 3 domains (Wiki sentences, questions, and search queries) across 11 languages, as well as multilingual and code-mixing subsets. This dataset is designed to represent contemporary challenges in NER, including low-context scenarios (short and uncased text), syntactically complex entities like movie titles, and long-tail entity distributions. The 26M token dataset is compiled from public resources using techniques such as heuristic-based sentence sampling, template extraction and slotting, and machine translation. We tested the performance of two NER models on our dataset: a baseline XLM-RoBERTa model, and a state-of-the-art NER GEMNET model that leverages gazetteers. The baseline achieves moderate performance (macro-F1=54%). GEMNET, which uses gazetteers, improves significantly on the baseline (average improvement of macro-F1=+30%), which demonstrates the difficulty of our dataset. MultiCoNER poses challenges even for large pre-trained language models, and we believe that it can help further research in building robust NER systems.

Extracting a Knowledge Base of COVID-19 Events from Social Media
Shi Zong | Ashutosh Baheti | Wei Xu | Alan Ritter

We present a manually annotated corpus of 10,000 tweets containing public reports of five COVID-19 events, including positive and negative tests, deaths, denied access to testing, claimed cures and preventions. We designed slot-filling questions for each event type and annotated a total of 28 fine-grained slots, such as the location of events, recent travel, and close contacts. We show that our corpus can support fine-tuning BERT-based classifiers to automatically extract publicly reported events, which can be further collected for building a knowledge base. Our knowledge base is constructed over Twitter data covering two years and currently covers over 4.2M events. It can answer complex queries with high precision, such as “Which organizations have employees that tested positive in Philadelphia?” We believe our proposed methodology could be quickly applied to develop knowledge bases for new domains in response to an emerging crisis, including natural disasters or future disease outbreaks.

Accounting for Language Effect in the Evaluation of Cross-lingual AMR Parsers
Shira Wein | Nathan Schneider

Cross-lingual Abstract Meaning Representation (AMR) parsers are currently evaluated in comparison to gold English AMRs, despite parsing a language other than English, due to the lack of multilingual AMR evaluation metrics. This evaluation practice is problematic because of the established effect of source language on AMR structure. In this work, we present three multilingual adaptations of monolingual AMR evaluation metrics and compare the performance of these metrics to sentence-level human judgments. We then use our most highly correlated metric to evaluate the output of state-of-the-art cross-lingual AMR parsers, finding that Smatch may still be a useful metric in comparison to gold English AMRs, while our multilingual adaptation of S2match (XS2match) is best for comparison with gold in-language AMRs.

QSTS: A Question-Sensitive Text Similarity Measure for Question Generation
Sujatha Das Gollapalli | See-Kiong Ng

While question generation (QG) has received significant focus in conversation modeling and text generation research, the problems of comparing questions and evaluating QG models have remained inadequately addressed. Indeed, QG models continue to be evaluated using traditional measures such as BLEU, METEOR, and ROUGE scores, which were designed for other text generation problems. We propose QSTS, a novel Question-Sensitive Text Similarity measure for comparing two questions by characterizing their target intent based on question class, named-entity, and semantic similarity information from the two questions. We show that QSTS addresses several shortcomings of existing measures that depend on n-gram overlap scores and obtains superior results compared to traditional measures on publicly available QG datasets. We also collect a novel dataset, SimQG, for enabling question similarity research in QG contexts. SimQG contains questions generated by state-of-the-art QG models along with human judgements on their relevance with respect to the passage context they were generated for, as well as when compared to the given reference question. Using SimQG, we showcase the key aspect of QSTS that differentiates it from all existing measures. QSTS is not only able to characterize similarity between two questions, but is also able to score questions with respect to passage contexts. Thus QSTS is, to our knowledge, the first metric that enables the measurement of QG performance in a reference-free manner.

Noun-MWP: Math Word Problems Meet Noun Answers
Taehun Cha | Jaeheun Jung | Donghun Lee

We introduce a new type of problem for math word problem (MWP) solvers, named Noun-MWPs, whose answer is a non-numerical string containing a noun from the problem text. We present a novel method to empower existing MWP solvers to handle Noun-MWPs, and apply the method to the Expression-Pointer Transformer (EPT). Our model, N-EPT, solves Noun-MWPs significantly better than other models, and at the same time solves conventional MWPs as well. Solving Noun-MWPs may help bridge MWP solvers and traditional question-answering NLP models.

ViNLI: A Vietnamese Corpus for Studies on Open-Domain Natural Language Inference
Tin Van Huynh | Kiet Van Nguyen | Ngan Luu-Thuy Nguyen

Over the past decade, the research field of computational linguistics has witnessed the growth of corpora and models for natural language inference (NLI) for rich-resource languages such as English and Chinese. A large-scale and high-quality corpus is necessary for studies on NLI for Vietnamese, which can be considered a low-resource language. In this paper, we introduce ViNLI (Vietnamese Natural Language Inference), an open-domain and high-quality corpus for evaluating Vietnamese NLI models, which is created and evaluated with a strict process of quality control. ViNLI comprises over 30,000 human-annotated premise-hypothesis sentence pairs extracted from more than 800 online news articles on 13 distinct topics. We also introduce the guidelines for corpus creation, which take into account the specific characteristics of the Vietnamese language in expressing entailment and contradiction. To evaluate the difficulty of our corpus, we conduct experiments with state-of-the-art deep neural networks and pre-trained models on our dataset. The best system performance is still far from human performance (a 14.20% gap in accuracy). The ViNLI corpus is a challenging corpus to accelerate progress in Vietnamese computational linguistics. Our corpus is available publicly for research purposes.

InferES: A Natural Language Inference Corpus for Spanish Featuring Negation-Based Contrastive and Adversarial Examples
Venelin Kovatchev | Mariona Taulé

In this paper we present InferES, an original corpus for Natural Language Inference (NLI) in European Spanish. We propose, implement, and analyze a variety of corpus-creating strategies utilizing expert linguists and crowd workers. The objectives behind InferES are to provide high-quality data, and at the same time to facilitate the systematic evaluation of automated systems. Specifically, we focus on measuring and improving the performance of machine learning systems on negation-based adversarial examples and their ability to generalize across out-of-distribution topics. We train two transformer models on InferES (8,055 gold examples) in a variety of scenarios. Our best model obtains 72.8% accuracy, leaving a lot of room for improvement. The “hypothesis-only” baseline performs only 2%-5% higher than majority, indicating far fewer annotation artifacts than prior work. We show that models trained on InferES generalize very well across topics (both in- and out-of-distribution) and perform moderately well on negation-based adversarial examples.

ParaZh-22M: A Large-Scale Chinese Parabank via Machine Translation
Wenjie Hao | Hongfei Xu | Deyi Xiong | Hongying Zan | Lingling Mu

Paraphrasing, i.e., restating the same meaning in different ways, is an important data augmentation approach for natural language processing (NLP). Zhang et al. (2019b) propose to extract sentence-level paraphrases from multiple Chinese translations of the same source texts, and construct the PKU Paraphrase Bank of 0.5M sentence pairs. However, despite being the largest Chinese parabank to date, the PKU parabank is limited in size by the availability of one-to-many sentence translation data, and cannot adequately support the training of large Chinese paraphrasers. In this paper, we remove the dependence on one-to-many sentence translation data and construct ParaZh-22M, a larger Chinese parabank composed of 22M sentence pairs, based on one-to-one bilingual sentence translation data and machine translation (MT). In our data augmentation experiments, we show that paraphrasing based on ParaZh-22M can bring about consistent and significant improvements over several strong baselines on a wide range of Chinese NLP tasks, including a number of Chinese natural language understanding benchmarks (CLUE) and low-resource machine translation.

ESimCSE: Enhanced Sample Building Method for Contrastive Learning of Unsupervised Sentence Embedding
Xing Wu | Chaochen Gao | Liangjun Zang | Jizhong Han | Zhongyuan Wang | Songlin Hu

Contrastive learning has been attracting much attention for learning unsupervised sentence embeddings. The current state-of-the-art unsupervised method is the unsupervised SimCSE (unsup-SimCSE). Unsup-SimCSE takes dropout as a minimal data augmentation method, and passes the same input sentence to a pre-trained Transformer encoder (with dropout turned on) twice to obtain the two corresponding embeddings to build a positive pair. As the length information of a sentence will generally be encoded into the sentence embeddings due to the use of position embeddings in the Transformer, each positive pair in unsup-SimCSE actually contains the same length information. Thus, unsup-SimCSE trained with these positive pairs is probably biased, tending to consider sentences of the same or similar length to be more similar in semantics. Through statistical observations, we find that unsup-SimCSE does have such a problem. To alleviate it, we apply a simple repetition operation to modify the input sentence, and then pass the input sentence and its modified counterpart to the pre-trained Transformer encoder, respectively, to get the positive pair. Additionally, we draw inspiration from the computer vision community and introduce momentum contrast, enlarging the number of negative pairs without additional calculations. The two proposed modifications are applied to positive and negative pairs separately, and build a new sentence embedding method, termed Enhanced Unsup-SimCSE (ESimCSE). We evaluate the proposed ESimCSE on several benchmark datasets w.r.t. the semantic text similarity (STS) task. Experimental results show that ESimCSE outperforms the state-of-the-art unsup-SimCSE by an average Spearman correlation of 2.02% on BERT-base.
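
The repetition operation mentioned in this abstract can be illustrated with a short sketch. This is one plausible reading, not the authors' code: it duplicates a random subset of tokens in place so that the positive pair differs in length while keeping the same content. The `dup_rate` parameter, the way the number of duplicated tokens is sampled, and the function name are all assumptions.

```python
import random


def repeat_words(tokens, dup_rate=0.3, rng=None):
    """Return a copy of tokens in which a random subset is repeated in place."""
    rng = rng or random.Random()
    n = len(tokens)
    if n == 0:
        return []
    # Upper bound on how many tokens may be duplicated (assumed rule).
    max_dup = min(n, max(1, int(dup_rate * n)))
    dup_len = rng.randint(0, max_dup)
    dup_positions = set(rng.sample(range(n), dup_len))

    augmented = []
    for i, token in enumerate(tokens):
        augmented.append(token)
        if i in dup_positions:
            augmented.append(token)  # repeat this token once
    return augmented


sentence = "the cat sat on the mat".split()
positive = repeat_words(sentence, dup_rate=0.3, rng=random.Random(0))
# `sentence` and `positive` would then be encoded separately to form a positive pair.
print(sentence)
print(positive)
```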

Measuring Robustness for NLP
Yu Yu | Abdul Rafae Khan | Jia Xu

The quality of Natural Language Processing (NLP) models is typically measured by the accuracy or error rate on a predefined test set. Because the evaluation and optimization of these measures are narrowed down to a specific domain like news and cannot be generalized to other domains like Twitter, we often observe that a system reported with human parity results generates surprising errors in real-life use scenarios. We address this weakness with a new approach that uses an NLP quality measure based on robustness. Unlike previous work that has defined robustness using Minimax to bound worst cases, we measure robustness based on the consistency of cross-domain accuracy and introduce the coefficient of variation and (epsilon, gamma)-Robustness. Our measures demonstrate higher agreement with human evaluation than accuracy scores like BLEU on ranking Machine Translation (MT) systems. Our experiments on sentiment analysis and MT tasks show that incorporating our robustness measures into learning objectives significantly enhances the final NLP prediction accuracy over various domains, such as biomedical and social media.
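
As a small worked example of the coefficient of variation named in this abstract, the sketch below divides the standard deviation of per-domain accuracies by their mean, so that a lower value indicates more consistent cross-domain performance. Using the population standard deviation is an assumption, the domain names and scores are made up, and the (epsilon, gamma)-Robustness measure is not reproduced because the abstract does not define it.

```python
import statistics


def coefficient_of_variation(accuracies):
    """Population standard deviation of the scores divided by their mean."""
    mean = statistics.fmean(accuracies)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return statistics.pstdev(accuracies) / mean


# Hypothetical per-domain accuracies for two systems.
system_a = {"news": 0.90, "twitter": 0.60, "biomedical": 0.75}
system_b = {"news": 0.80, "twitter": 0.74, "biomedical": 0.77}

cv_a = coefficient_of_variation(list(system_a.values()))
cv_b = coefficient_of_variation(list(system_b.values()))
# The system with the smaller value is the more consistent one across domains.
print(cv_a, cv_b)
```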

CSL: A Large-scale Chinese Scientific Literature Dataset
Yudong Li | Yuqing Zhang | Zhe Zhao | Linlin Shen | Weijie Liu | Weiquan Mao | Hui Zhang

Scientific literature serves as a high-quality corpus, supporting a lot of Natural Language Processing (NLP) research. However, existing datasets are centered around the English language, which restricts the development of Chinese scientific NLP. In this work, we present CSL, a large-scale Chinese Scientific Literature dataset, which contains the titles, abstracts, keywords and academic fields of 396k papers. To our knowledge, CSL is the first scientific document dataset in Chinese. CSL can serve as a Chinese corpus. In addition, its semi-structured data provides natural annotations from which many supervised NLP tasks can be constructed. Based on CSL, we present a benchmark to evaluate the performance of models across scientific domain tasks, i.e., summarization, keyword generation and text classification. We analyze the behavior of existing text-to-text models on the evaluation tasks and reveal the challenges for Chinese scientific NLP tasks, which provides a valuable reference for future research. Data and code will be publicly available.

Singlish Message Paraphrasing: A Joint Task of Creole Translation and Text Normalization
Zhengyuan Liu | Shikang Ni | Ai Ti Aw | Nancy F. Chen

Within the natural language processing community, English is by far the most resource-rich language. There is emerging interest in conducting translation via computational approaches to conform its dialects or creole languages back to standard English. This computational approach paves the way to leverage generic English language backbones, which are beneficial for various downstream tasks. However, in practical online communication scenarios, the use of language varieties is often accompanied by noisy user-generated content, making this translation task more challenging. In this work, we introduce a joint paraphrasing task of creole translation and text normalization of Singlish messages, which can shed light on how to process other language varieties and dialects. We formulate the task in three different linguistic dimensions: lexical level normalization, syntactic level editing, and semantic level rewriting. We build an annotated dataset of Singlish-to-Standard English messages, and report performance on a perturbation-resilient sequence-to-sequence model. Experimental results show that the model produces reasonable generation results, and can improve the performance of downstream tasks like stance detection.

CINO: A Chinese Minority Pre-trained Language Model
Ziqing Yang | Zihang Xu | Yiming Cui | Baoxin Wang | Min Lin | Dayong Wu | Zhigang Chen

Multilingual pre-trained language models have shown impressive performance on cross-lingual tasks. They greatly facilitate the application of natural language processing to low-resource languages. However, there are still some languages on which the current multilingual models do not perform well. In this paper, we propose CINO (Chinese Minority Pre-trained Language Model), a multilingual pre-trained language model for Chinese minority languages. It covers Standard Chinese, Yue Chinese, and six other ethnic minority languages. To evaluate the cross-lingual ability of the multilingual model on ethnic minority languages, we collect documents from Wikipedia and news websites, and construct two text classification datasets, WCM (Wiki-Chinese-Minority) and CMNews (Chinese-Minority-News). We show that CINO notably outperforms the baselines on various classification tasks. The CINO model and the datasets are publicly available at

One Word, Two Sides: Traces of Stance in Contextualized Word Representations
Aina Garí Soler | Matthieu Labeau | Chloé Clavel

The way we use words is influenced by our opinion. We investigate whether this is reflected in contextualized word embeddings. For example, is the representation of “animal” different between people who would abolish zoos and those who would not? We explore this question from a Lexical Semantic Change standpoint. Our experiments with BERT embeddings derived from datasets with stance annotations reveal small but significant differences in word representations between opposing stances.

Prepositions Matter in Quantifier Scope Disambiguation
Aleksander Leczkowski | Justyna Grudzińska | Manuel Vargas Guzmán | Aleksander Wawer | Aleksandra Siemieniuk

Although it is widely agreed that world knowledge plays a significant role in quantifier scope disambiguation (QSD), there has been only very limited work on how to integrate this knowledge into a QSD model. This paper contributes to this scarce line of research by incorporating into a machine learning model our knowledge about relations, as conveyed by a manageable closed class of function words: prepositions. For data, we use a scope-disambiguated corpus created by AnderBois, Brasoveanu and Henderson, which is additionally annotated with prepositional senses using Schneider et al.’s Semantic Network of Adposition and Case Supersenses (SNACS) scheme. By applying Manshadi and Allen’s method to the corpus, we were able to inspect the information gain provided by prepositions for the QSD task. Statistical analysis of the performance of the classifiers, trained in scenarios with and without preposition information, supports the claim that prepositional senses have a strong positive impact on the learnability of automatic QSD systems.

Modelling Commonsense Properties Using Pre-Trained Bi-Encoders
Amit Gajbhiye | Luis Espinosa-Anke | Steven Schockaert

Grasping the commonsense properties of everyday concepts is an important prerequisite to language understanding. While contextualised language models are reportedly capable of predicting such commonsense properties with human-level accuracy, we argue that such results have been inflated because of the high similarity between training and test concepts. This means that models which capture concept similarity can perform well, even if they do not capture any knowledge of the commonsense properties themselves. In settings where there is no overlap between the properties that are considered during training and testing, we find that the empirical performance of standard language models drops dramatically. To address this, we study the possibility of fine-tuning language models to explicitly model concepts and their properties. In particular, we train separate concept and property encoders on two types of readily available data: extracted hyponym-hypernym pairs and generic sentences. Our experimental results show that the resulting encoders allow us to predict commonsense properties with much higher accuracy than is possible by directly fine-tuning language models. We also present experimental results for the related task of unsupervised hypernym discovery.

COIN – an Inexpensive and Strong Baseline for Predicting Out of Vocabulary Word Embeddings
Andrew Schneider | Lihong He | Zhijia Chen | Arjun Mukherjee | Eduard Dragut

Social media is the ultimate challenge for many natural language processing tools. The constant emergence of linguistic constructs challenges even the most sophisticated NLP tools. Predicting word embeddings for out of vocabulary words is one of those challenges. Word embedding models only include terms that occur a sufficient number of times in their training corpora. Word embedding vector models are unable to directly provide any useful information about a word not in their vocabularies. We propose a fast method for predicting vectors for out of vocabulary terms that makes use of the surrounding terms of the unknown term and the hidden context layer of the word2vec model. We propose this method as a strong baseline in the sense that 1) while it does not surpass all state-of-the-art methods, it surpasses several techniques for vector prediction on benchmark tasks, 2) even when it underperforms, the margin is very small, retaining competitive performance in downstream tasks, and 3) it is inexpensive to compute, requiring no additional training stage. We also show that our technique can be incorporated into existing methods to achieve a new state-of-the-art on the word vector prediction problem.

DynGL-SDP: Dynamic Graph Learning for Semantic Dependency Parsing
Bin Li | Miao Gao | Yunlong Fan | Yikemaiti Sataer | Zhiqiang Gao | Yaocheng Gui

A recent success in semantic dependency parsing shows that graph neural networks can make significant accuracy improvements, owing to their powerful ability to learn expressive graph representations. However, this work learns graph representations based on a static graph constructed by an existing parser, suffering from two drawbacks: (1) the static graph might be error-prone (e.g., noisy or incomplete), and (2) the graph construction stage and the graph representation learning stage are disjoint, so errors introduced in the graph construction stage cannot be corrected and might accumulate in later stages. To address these two drawbacks, we propose a dynamic graph learning framework and apply it to semantic dependency parsing, for jointly learning graph structure and graph representations. Experimental results show that our parser outperforms the previous parsers on the SemEval-2015 Task 18 dataset in three languages (English, Chinese, and Czech).

Knowledge Is Flat: A Seq2Seq Generative Framework for Various Knowledge Graph Completion
Chen Chen | Yufei Wang | Bing Li | Kwok-Yan Lam

Knowledge Graph Completion (KGC) has been recently extended to multiple knowledge graph (KG) structures, initiating new research directions, e.g. static KGC, temporal KGC and few-shot KGC. Previous works often design KGC models closely coupled with specific graph structures, which inevitably results in two drawbacks: 1) structure-specific KGC models are mutually incompatible; 2) existing KGC methods are not adaptable to emerging KGs. In this paper, we propose KG-S2S, a Seq2Seq generative framework that could tackle different verbalizable graph structures by unifying the representation of KG facts into “flat” text, regardless of their original form. To remedy the KG structure information loss from the “flat” text, we further improve the input representations of entities and relations, and the inference algorithm in KG-S2S. Experiments on five benchmarks show that KG-S2S outperforms many competitive baselines, setting new state-of-the-art performance. Finally, we analyze KG-S2S’s ability on the different relations and the Non-entity Generations.

Modelling Frequency, Attestation, and Corpus-Based Information with OntoLex-FrAC
Christian Chiarcos | Elena-Simona Apostol | Besim Kabashi | Ciprian-Octavian Truică

OntoLex-Lemon has become a de facto standard for lexical resources in the web of data. This paper provides the first overall description of the emerging OntoLex module for Frequency, Attestations, and Corpus-Based Information (OntoLex-FrAC) that is intended to complement OntoLex-Lemon with the necessary vocabulary to represent major types of information found in or automatically derived from corpora, for applications in both language technology and the language sciences.

Contrast Sets for Stativity of English Verbs in Context
Daniel Chen | Alexis Palmer

For the task of classifying verbs in context as dynamic or stative, current models approach human performance, but only for particular data sets. To better understand the performance of such models, and how well they are able to generalize beyond particular test sets, we apply the contrast set (Gardner et al., 2020) methodology to stativity classification. We create nearly 300 contrastive pairs by perturbing test set instances just enough to change their labels from one class to the other, while preserving coherence, meaning, and well-formedness. Contrastive evaluation shows that a model with near-human performance on an in-distribution test set degrades substantially when applied to transformed examples, showing that the stative vs. dynamic classification task is more complex than the model performance might otherwise suggest. Code and data are freely available.

Multilingual and Multimodal Topic Modelling with Pretrained Embeddings
Elaine Zosa | Lidia Pivovarova

This paper presents M3L-Contrast, a novel multimodal multilingual (M3L) neural topic model for comparable data that maps texts from multiple languages and images into a shared topic space. Our model is trained jointly on texts and images and takes advantage of pretrained document and image embeddings to abstract the complexities between different languages and modalities. As a multilingual topic model, it produces aligned language-specific topics, and as a multimodal model, it infers textual representations of semantic concepts in images. We demonstrate that our model is competitive with a zero-shot topic model in predicting topic distributions for comparable multilingual data and significantly outperforms a zero-shot model in predicting topic distributions for comparable texts and images. We also show that our model performs almost as well on unaligned embeddings as it does on aligned embeddings.

Zero-shot Script Parsing
Fangzhou Zhai | Vera Demberg | Alexander Koller

Script knowledge is useful to a variety of NLP tasks. However, existing resources only cover a small number of activities, limiting their practical usefulness. In this work, we propose a zero-shot learning approach to script parsing, the task of tagging texts with scenario-specific event and participant types, which enables us to acquire script knowledge without domain-specific annotations. We (1) learn representations of potential event and participant mentions by promoting cluster consistency according to the annotated data; (2) perform clustering on the event / participant candidates from unannotated texts that belong to an unseen scenario. The model achieves 68.1/74.4 average F1 for event / participant parsing, respectively, outperforming a previous CRF model that, in contrast, has access to scenario-specific supervision. We also evaluate the model by testing on a different corpus, where it achieved 55.5/54.0 average F1 for event / participant parsing.

Word Sense Disambiguation with Knowledge-Enhanced and Local Self-Attention-based Extractive Sense Comprehension
Guobiao Zhang | Wenpeng Lu | Xueping Peng | Shoujin Wang | Baoshuo Kan | Rui Yu

Word sense disambiguation (WSD), identifying the most suitable meaning of ambiguous words in the given contexts according to a predefined sense inventory, is one of the most classical and challenging tasks in natural language processing. Benefiting from the powerful ability of deep neural networks, WSD has achieved great advances in recent years. Reformulating WSD as a text span extraction task is an effective approach, which accepts a sentence context of an ambiguous word together with all definitions of its candidate senses simultaneously, and requires extracting the text span corresponding to the right sense. However, the approach merely depends on a short definition to learn sense representation, which neglects abundant semantic knowledge from related senses and leads to data-inefficient learning and suboptimal WSD performance. To address these limitations, we propose a novel WSD method with Knowledge-Enhanced and Local Self-Attention-based Extractive Sense Comprehension (KELESC). Specifically, a knowledge-enhanced method is proposed to enrich semantic representation by incorporating additional examples and definitions of the related senses in WordNet. Then, in order to avoid the huge computing complexity induced by the additional information, a local self-attention mechanism is utilized to constrain attention to be local, which allows longer input texts without large-scale computing burdens. Extensive experimental results demonstrate that KELESC achieves better performance than baseline models on public benchmark datasets.

A Novel Multi-Task Learning Approach for Context-Sensitive Compound Type Identification in Sanskrit
Jivnesh Sandhan | Ashish Gupta | Hrishikesh Terdalkar | Tushar Sandhan | Suvendu Samanta | Laxmidhar Behera | Pawan Goyal

The phenomenon of compounding is ubiquitous in Sanskrit. It serves for achieving brevity in expressing thoughts, while simultaneously enriching the lexical and structural formation of the language. In this work, we focus on the Sanskrit Compound Type Identification (SaCTI) task, where we consider the problem of identifying semantic relations between the components of a compound word. Earlier approaches solely rely on the lexical information obtained from the components and ignore the most crucial contextual and syntactic information useful for SaCTI. However, the SaCTI task is challenging primarily due to the implicitly encoded context-sensitive semantic relation between the compound components. Thus, we propose a novel multi-task learning architecture which incorporates the contextual information and enriches the complementary syntactic information using morphological tagging and dependency parsing as two auxiliary tasks. Experiments on the benchmark datasets for SaCTI show 6.1 points (Accuracy) and 7.7 points (F1-score) absolute gain compared to the state-of-the-art system. Further, our multi-lingual experiments demonstrate the efficacy of the proposed architecture in English and Marathi languages.

Testing Large Language Models on Compositionality and Inference with Phrase-Level Adjective-Noun Entailment
Lorenzo Bertolini | Julie Weeds | David Weir

Previous work has demonstrated that pre-trained large language models (LLM) acquire knowledge during pre-training which enables reasoning over relationships between words (e.g., hyponymy) and more complex inferences over larger units of meaning such as sentences. Here, we investigate whether lexical entailment (LE, i.e. hyponymy or the is-a relation between words) can be generalised in a compositional manner. Accordingly, we introduce PLANE (Phrase-Level Adjective-Noun Entailment), a new benchmark to test models on fine-grained compositional entailment using adjective-noun phrases. Our experiments show that knowledge extracted via in-context and transfer learning is not enough to solve PLANE. However, an LLM trained on PLANE can generalise well to out-of-distribution sets, since the required knowledge can be stored in the representations of subword (SW) tokens.

Does BERT Recognize an Agent? Modeling Dowty’s Proto-Roles with Contextual Embeddings
Mattia Proietti | Gianluca Lebani | Alessandro Lenci

Contextual embeddings build multidimensional representations of word tokens based on their context of occurrence. Such models have been shown to achieve state-of-the-art performance on a wide variety of tasks. Yet, the community struggles to understand what kind of semantic knowledge these representations encode. We report a series of experiments aimed at investigating to what extent one such model, BERT, is able to infer the semantic relations that, according to Dowty’s Proto-Roles theory, a verbal argument receives by virtue of its role in the event described by the verb. This hypothesis was put to the test by learning a linear mapping from BERT’s verb embeddings to an interpretable space of semantic properties built from the linguistic dataset by White et al. (2016). In a first experiment we tested whether the semantic properties inferred from a typed version of the BERT embeddings would be more linguistically plausible than those produced by relying on static embeddings. We then evaluated the semantic properties inferred from the contextual embeddings both against those available in the original dataset and by assessing their ability to model the semantic properties possessed by the agent of the verbs participating in the so-called causative alternation.

Towards Structure-aware Paraphrase Identification with Phrase Alignment Using Sentence Encoders
Qiwei Peng | David Weir | Julie Weeds

Previous works have demonstrated the effectiveness of utilising pre-trained sentence encoders based on their sentence representations for meaning comparison tasks. Though such representations are shown to capture hidden syntax structures, the direct similarity comparison between them exhibits weak sensitivity to word order and structural differences in given sentences. A single similarity score further makes the comparison process hard to interpret. Therefore, we here propose to combine sentence encoders with an alignment component by representing each sentence as a list of predicate-argument spans (where their span representations are derived from sentence encoders), and decomposing the sentence-level meaning comparison into the alignment between their spans for paraphrase identification tasks. Empirical results show that the alignment component brings in both improved performance and interpretability for various sentence encoders. After closer investigation, the proposed approach indicates increased sensitivity to structural difference and enhanced ability to distinguish non-paraphrases with high lexical overlap.

CILex: An Investigation of Context Information for Lexical Substitution Methods
Sandaru Seneviratne | Elena Daskalaki | Artem Lenskiy | Hanna Suominen

Lexical substitution, which aims to generate substitutes for a target word given a context, is an important natural language processing task useful in many applications. Due to the paucity of annotated data, existing methods for lexical substitution tend to rely on manually curated lexical resources and contextual word embedding models. Methods based on lexical resources are likely to miss relevant substitutes, whereas relying only on contextual word embedding models fails to provide adequate information on the impact of a substitute in the entire context and the overall meaning of the input. We propose CILex, which uses contextual sentence embeddings along with methods that capture additional context information, complementing contextual word embeddings for lexical substitution. This ensures the semantic consistency of a substitute with the target word while maintaining the overall meaning of the sentence. Our experimental comparisons with previously proposed methods indicate that our solution is now the state-of-the-art on both the widely used LS07 and CoInCo datasets, with P@1 scores of 55.96% and 57.25% for lexical substitution. The implementation of the proposed approach is available under the MIT license.

Emotion Enriched Retrofitted Word Embeddings
Sapan Shah | Sreedhar Reddy | Pushpak Bhattacharyya

Word embeddings learned using the distributional hypothesis (e.g., GloVe, Word2vec) are good at encoding various lexical-semantic relations. However, they do not capture the emotion aspects of words. We present a novel retrofitting method for updating the vectors of emotion bearing words like fun, offence, angry, etc. The retrofitted embeddings achieve better inter-cluster and intra-cluster distance for words having the same emotions, e.g., the joy cluster containing words like fun, happiness, etc., and the anger cluster with words like offence, rage, etc., as evaluated through different cluster quality metrics. For the downstream tasks on sentiment analysis and sarcasm detection, simple classification models, such as SVM and Attention Net, learned using our retrofitted embeddings perform better than their pre-trained counterparts (about 1.5 % improvement in F1-score) as well as other benchmarks. Furthermore, the difference in performance is more pronounced in the limited data setting.

Metaphor Detection via Linguistics Enhanced Siamese Network
Shenglong Zhang | Ying Liu

In this paper we present MisNet, a novel model for word-level metaphor detection. MisNet converts two linguistic rules, i.e., Metaphor Identification Procedure (MIP) and Selectional Preference Violation (SPV), into semantic matching tasks. The MIP module computes the similarity between the contextual meaning and the basic meaning of a target word. The SPV module perceives the incongruity between target words and their contexts. To better represent basic meanings, MisNet utilizes dictionary resources. Empirical results indicate that MisNet achieves competitive performance on several datasets.

Fast and Accurate End-to-End Span-based Semantic Role Labeling as Word-based Graph Parsing
Shilin Zhou | Qingrong Xia | Zhenghua Li | Yu Zhang | Yu Hong | Min Zhang

This paper proposes to cast end-to-end span-based SRL as a word-based graph parsing task. The major challenge is how to represent spans at the word level. Borrowing ideas from research on Chinese word segmentation and named entity recognition, we propose and compare four different schemata of graph representation, i.e., BES, BE, BIES, and BII, among which we find that the BES schema performs the best. We further gain interesting insights through detailed analysis. Moreover, we propose a simple constrained Viterbi procedure to ensure the legality of the output graph according to the constraints of the SRL structure. We conduct experiments on two widely used benchmark datasets, i.e., CoNLL05 and CoNLL12. Results show that our word-based graph parsing approach achieves consistently better performance than previous results, under all settings of end-to-end and predicate-given, without and with pre-trained language models (PLMs). More importantly, our model can parse 669/252 sentences per second, without and with PLMs respectively.
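As an illustration of how a span can be represented at the word level, the following minimal sketch encodes an argument span as per-word BIES labels and decodes it again. It assumes the conventional reading of the tags from word segmentation and named entity recognition (B = begin, I = inside, E = end, S = single-word span); the function names are hypothetical, and the sketch is not the graph representation or the constrained Viterbi procedure used in the paper.

```python
def span_to_bies(start, end, role):
    """Encode the inclusive word span [start, end] as per-word BIES labels.

    Assumption: B/I/E/S carry their conventional meaning
    (begin, inside, end, single-word span).
    """
    if start == end:
        return {start: f"S-{role}"}
    labels = {start: f"B-{role}", end: f"E-{role}"}
    for i in range(start + 1, end):
        labels[i] = f"I-{role}"
    return labels


def bies_to_spans(labels):
    """Recover (start, end, role) spans from per-word BIES labels."""
    spans = []
    open_start = None
    for i in sorted(labels):
        # Split only on the first hyphen so roles such as "AM-TMP" survive.
        tag, role = labels[i].split("-", 1)
        if tag == "S":
            spans.append((i, i, role))
        elif tag == "B":
            open_start = i
        elif tag == "E" and open_start is not None:
            spans.append((open_start, i, role))
            open_start = None
    return spans


# Toy example: a three-word argument and a single-word modifier.
labels = {**span_to_bies(2, 4, "A1"), **span_to_bies(6, 6, "AM-TMP")}
spans = bies_to_spans(labels)
```

Dropping the I tag from this encoding gives a BE-style variant in which only span boundaries are labelled.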

Unsupervised Lexical Substitution with Decontextualised Embeddings
Takashi Wada | Timothy Baldwin | Yuji Matsumoto | Jey Han Lau

We propose a new unsupervised method for lexical substitution using pre-trained language models. Compared to previous approaches that use the generative capability of language models to predict substitutes, our method retrieves substitutes based on the similarity of contextualised and decontextualised word embeddings, i.e. the average contextual representation of a word in multiple contexts. We conduct experiments in English and Italian, and show that our method substantially outperforms strong baselines and establishes a new state-of-the-art without any explicit supervision or fine-tuning. We further show that our method performs particularly well at predicting low-frequency substitutes, and also generates a diverse list of substitute candidates, reducing morphophonetic or morphosyntactic biases induced by article-noun agreement.
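The retrieval idea described above can be sketched in a few lines: a decontextualised embedding is the average of a word's contextual vectors over several contexts, and candidates are ranked by their similarity to the target word's contextualised vector. This is a minimal sketch with toy two-dimensional vectors; the function names are hypothetical, cosine similarity is an assumed choice of similarity measure, and in practice the vectors would come from a pre-trained language model.

```python
import numpy as np


def decontextualise(contextual_vectors):
    """Average a word's contextual vectors over multiple contexts."""
    return np.mean(np.asarray(contextual_vectors, dtype=float), axis=0)


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def rank_substitutes(target_in_context, candidates):
    """Rank candidate words by similarity between the target's contextualised
    vector and each candidate's decontextualised vector (best first)."""
    scores = {
        word: cosine(target_in_context, decontextualise(vectors))
        for word, vectors in candidates.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy data: each candidate has two contextual vectors from different contexts.
target = np.array([1.0, 0.0])
candidates = {
    "near": [np.array([0.9, 0.1]), np.array([1.0, -0.1])],
    "far": [np.array([0.0, 1.0]), np.array([0.1, 0.9])],
}
ranking = rank_substitutes(target, candidates)
```

Because the candidate vectors are averaged once and reused, ranking needs no generation step at query time.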

Transparent Semantic Parsing with Universal Dependencies Using Graph Transformations
Wessel Poelman | Rik van Noord | Johan Bos

Even though many recent semantic parsers are based on deep learning methods, we should not forget that rule-based alternatives might offer advantages over neural approaches with respect to transparency, portability, and explainability. Taking advantage of existing off-the-shelf Universal Dependency parsers, we present a method that maps a syntactic dependency tree to a formal meaning representation based on Discourse Representation Theory. Rather than using lambda calculus to manage variable bindings, our approach is novel in that it consists of using a series of graph transformations. The resulting UD semantic parser shows good performance for English, German, Italian and Dutch, with F-scores over 75%, outperforming a neural semantic parser for the lower-resourced languages. Unlike neural semantic parsers, our UD semantic parser does not hallucinate output, is relatively easy to port to other languages, and is completely transparent.

Multilingual Word Sense Disambiguation with Unified Sense Representation
Ying Su | Hongming Zhang | Yangqiu Song | Tong Zhang

As a key natural language processing (NLP) task, word sense disambiguation (WSD) evaluates how well NLP models can understand the fine-grained semantics of words under specific contexts. Benefiting from large-scale annotation, current WSD systems have achieved impressive performance in English by combining supervised learning with lexical knowledge. However, such success is hard to replicate in other languages, where we only have very limited annotations. In this paper, based on the fact that the multilingual lexicon BabelNet describes the same set of concepts across languages, we propose to build knowledge- and supervision-based Multilingual Word Sense Disambiguation (MWSD) systems. We build unified sense representations for multiple languages and address the annotation scarcity problem for MWSD by transferring annotations from resource-rich languages. With the unified sense representations, annotations from multiple languages can be jointly trained to benefit the MWSD tasks. Evaluations on the SemEval-13 and SemEval-15 datasets demonstrate the effectiveness of our methodology.

A Transition-based Method for Complex Question Understanding
Yu Xia | Wenbin Jiang | Yajuan Lyu | Sujian Li

Complex Question Understanding (CQU) parses complex questions to Question Decomposition Meaning Representation (QDMR) which is a sequence of atomic operators. Existing works are based on end-to-end neural models which do not explicitly model the intermediate states and lack interpretability for the parsing process. Besides, they predict QDMR in a mismatched granularity and do not model the step-wise information which is an essential characteristic of QDMR. To alleviate the issues, we treat QDMR as a computational graph and propose a transition-based method where a decider predicts a sequence of actions to build the graph node-by-node. In this way, the partial graph at each step enables better representation of the intermediate states and better interpretability. At each step, the decider encodes the intermediate state with specially designed encoders and predicts several candidates of the next action and its confidence. For inference, a searcher seeks the optimal graph based on the predictions of the decider to alleviate the error propagation. Experimental results demonstrate the parsing accuracy of our method against several strong baselines. Moreover, our method has transparent and human-readable intermediate results, showing improved interpretability.

Semantic Role Labeling as Dependency Parsing: Exploring Latent Tree Structures inside Arguments
Yu Zhang | Qingrong Xia | Shilin Zhou | Yong Jiang | Guohong Fu | Min Zhang

Semantic role labeling (SRL) is a fundamental yet challenging task in the NLP community. Recent works on SRL mainly fall into two lines: 1) BIO-based; 2) span-based. Despite their ubiquity, they share the intrinsic drawback of not considering internal argument structures, potentially hindering the model’s expressiveness. The key challenge is that arguments are flat structures, and there are no determined subtree realizations for words inside arguments. To remedy this, in this paper, we propose to regard flat argument spans as latent subtrees, accordingly reducing SRL to a tree parsing task. In particular, we equip our formulation with a novel span-constrained TreeCRF to make tree structures span-aware and further extend it to the second-order case. We conduct extensive experiments on CoNLL05 and CoNLL12 benchmarks. Results reveal that our methods perform better than all previous syntax-agnostic works, achieving new state-of-the-art under both the end-to-end and the w/ gold predicates settings.

Noisy Label Regularisation for Textual Regression
Yuxia Wang | Timothy Baldwin | Karin Verspoor

Training with noisy labelled data is known to be detrimental to model performance, especially for high-capacity neural network models in low-resource domains. Our experiments suggest that standard regularisation strategies, such as weight decay and dropout, are ineffective in the face of noisy labels. We propose a simple noisy label detection method that prevents error propagation from the input layer. The approach is based on the observation that the projection of noisy labels is learned through memorisation at advanced stages of learning, and that the Pearson correlation is sensitive to outliers. Extensive experiments over real-world human-disagreement annotations as well as randomly-corrupted and data-augmented labels, across various tasks and domains, demonstrate that our method is effective, regularising noisy labels and improving generalisation performance.

Detecting Suicide Risk in Online Counseling Services: A Study in a Low-Resource Language
Amir Bialer | Daniel Izmaylov | Avi Segal | Oren Tsur | Yossi Levi-Belz | Kobi Gal

With the increased awareness of situations of mental crisis and their societal impact, online services providing emergency support are becoming commonplace in many countries. Computational models, trained on discussions between help-seekers and providers, can support suicide prevention by identifying at-risk individuals. However, the lack of domain-specific models, especially in low-resource languages, poses a significant challenge for the automatic detection of suicide risk. We propose a model that combines pre-trained language models (PLM) with a fixed set of manually crafted (and clinically approved) suicidal cues, followed by a two-stage fine-tuning process. Our model achieves 0.91 ROC-AUC and an F2-score of 0.55, significantly outperforming an array of strong baselines even early on in the conversation, which is critical for real-time detection in the field. Moreover, the model performs well across genders and age groups.
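For readers unfamiliar with the F2-score reported above, the sketch below shows the standard F-beta definition, in which beta = 2 weights recall more heavily than precision. The precision and recall values in the example are illustrative only and are not taken from the paper.

```python
def f_beta(precision, recall, beta=2.0):
    """Standard F-beta score; beta > 1 favours recall over precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Illustrative values only: a high-recall, lower-precision classifier
# scores higher under F2 than under F1.
f2 = f_beta(0.5, 1.0, beta=2.0)
f1 = f_beta(0.5, 1.0, beta=1.0)
```

Weighting recall is a natural choice in risk detection, where a missed at-risk individual is costlier than a false alarm.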

Does Meta-learning Help mBERT for Few-shot Question Generation in a Cross-lingual Transfer Setting for Indic Languages?
Aniruddha Roy | Rupak Kumar Thakur | Isha Sharma | Ashim Gupta | Amrith Krishna | Sudeshna Sarkar | Pawan Goyal

Few-shot Question Generation (QG) is an important and challenging problem in the Natural Language Generation (NLG) domain. Multilingual BERT (mBERT) has been successfully used in various Natural Language Understanding (NLU) applications. However, the question of how to utilize mBERT for few-shot QG, possibly with cross-lingual transfer, remains. In this paper, we try to explore how mBERT performs in few-shot QG (cross-lingual transfer) and also whether applying meta-learning on mBERT further improves the results. In our setting, we consider mBERT as the base model and fine-tune it using a seq-to-seq language modeling framework in a cross-lingual setting. Further, we apply the model agnostic meta-learning approach to our base model. We evaluate our model for two low-resource Indian languages, Bengali and Telugu, using the TyDi QA dataset. The proposed approach consistently improves the performance of the base model in few-shot settings and even works better than some heavily parameterized models. Human evaluation also confirms the effectiveness of our approach.

Revisiting Syllables in Language Modelling and Their Application on Low-Resource Machine Translation
Arturo Oncevay | Kervy Dante Rivas Rojas | Liz Karen Chavez Sanchez | Roberto Zariquiey

Language modelling and machine translation tasks mostly use subword or character inputs, but syllables are seldom used. Syllables provide shorter sequences than characters, require less-specialised extracting rules than morphemes, and their segmentation is not impacted by the corpus size. In this study, we first explore the potential of syllables for open-vocabulary language modelling in 21 languages. We use rule-based syllabification methods for six languages and address the rest with hyphenation, which works as a syllabification proxy. With a comparable perplexity, we show that syllables outperform characters and other subwords. Moreover, we study the importance of syllables on neural machine translation for a non-related and low-resource language-pair (Spanish–Shipibo-Konibo). In pairwise and multilingual systems, syllables outperform unsupervised subwords, and further morphological segmentation methods, when translating into a highly synthetic language with a transparent orthography (Shipibo-Konibo). Finally, we perform some human evaluation, and discuss limitations and opportunities.

Aligning Multilingual Embeddings for Improved Code-switched Natural Language Understanding
Barah Fazili | Preethi Jyothi

Multilingual pretrained models, while effective on monolingual data, need additional training to work well with code-switched text. In this work, we present a novel idea of training multilingual models with alignment objectives using parallel text so as to explicitly align word representations with the same underlying semantics across languages. Such an explicit alignment step has a positive downstream effect and improves performance on multiple code-switched NLP tasks. We explore two alignment strategies and report improvements of up to 7.32%, 0.76% and 1.9% on Hindi-English Sentiment Analysis, Named Entity Recognition and Question Answering tasks compared to a competitive baseline model.

Fashioning Local Designs from Generic Speech Technologies in an Australian Aboriginal Community
Éric Le Ferrand | Steven Bird | Laurent Besacier

An increasing number of papers have been addressing issues related to low-resource languages and the transcription bottleneck paradigm. After several years spent in Northern Australia, where some of the strongest Aboriginal languages are spoken, we could observe a gap between the motivations depicted in research contributions in this space and the Northern Australian context. In this paper, we address this gap in research by exploring the potential of speech recognition in an Aboriginal community. We describe our work from training a spoken term detection system to its implementation in an activity with Aboriginal participants. We report here on one side how speech recognition technologies can find their place in an Aboriginal context and, on the other, methodological paths that allowed us to reach better comprehension and engagement from Aboriginal participants.

Few-Shot Pidgin Text Adaptation via Contrastive Fine-Tuning
Ernie Chang | Jesujoba O. Alabi | David Ifeoluwa Adelani | Vera Demberg

The surging demand for multilingual dialogue systems often requires a costly labeling process for each language addition. For low resource languages, human annotators are continuously tasked with the adaptation of resource-rich language utterances for each new domain. However, this prohibitive and impractical process can often be a bottleneck for low resource languages that still lack proper translation systems and parallel corpora. In particular, it is difficult to obtain task-specific low resource language annotations for the English-derived creoles (e.g. Nigerian and Cameroonian Pidgin). To address this issue, we utilize a pretrained language model, BART, which has shown great potential in language generation and understanding: we propose to finetune the BART model to generate utterances in Pidgin by leveraging the proximity of the source and target languages, and utilizing positive and negative examples in contrastive training objectives. We collected and released the first parallel Pidgin-English conversation corpus in two dialogue domains and showed that this simple and effective technique suffices to yield impressive results for English-to-Pidgin generation, where the two languages are closely related.

Penalizing Divergence: Multi-Parallel Translation for Low-Resource Languages of North America
Garrett Nicolai | Changbing Yang | Miikka Silfverberg

This paper explores a special case in multilingual machine translation: so-called multi-parallel translation, where the target data for all language pairs are identical. While multi-parallelism offers benefits which are not available in a standard translation setting, translation models can easily overfit when training data are limited. We introduce a regularizer, the divergence penalty, which penalizes the translation model when it represents source sentences with identical target translations in divergent ways. Experiments on very low-resourced Indigenous North American languages show that an initially deficient multilingual translator can improve by 4.9 BLEU through mBART pre-training, and 5.5 BLEU points with the strategic addition of monolingual data, and that a divergence penalty leads to further increases of 0.4 BLEU. Further experiments on Germanic languages demonstrate an improvement of 0.5 BLEU when applying the divergence penalty. An investigation of the neural encoder representations learned by our translation models shows that the divergence penalty encourages models to learn a unified neural interlingua.
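
The idea of a divergence penalty can be illustrated with a minimal sketch. The abstract does not give the exact form of the regularizer, so the version below is an assumption: it measures the mean squared distance of each source-sentence representation from the centroid of all representations that share one target translation. The names `divergence_penalty`, `regularized_loss`, and `weight` are hypothetical.

```python
from typing import List


def divergence_penalty(reps: List[List[float]]) -> float:
    """Mean squared distance of each representation from the group centroid.

    `reps` holds one encoder vector per source language, all for sentences
    that share the same target translation (assumed form, not the paper's).
    """
    n = len(reps)
    dim = len(reps[0])
    centroid = [sum(r[d] for r in reps) / n for d in range(dim)]
    total = sum(
        sum((r[d] - centroid[d]) ** 2 for d in range(dim)) for r in reps
    )
    return total / n


def regularized_loss(translation_loss: float,
                     reps: List[List[float]],
                     weight: float = 0.1) -> float:
    """Translation loss plus the weighted penalty (weight is illustrative)."""
    return translation_loss + weight * divergence_penalty(reps)
```

Identical representations give a penalty of zero, so the regularizer only acts when the encoder separates sentences that should map to the same output.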

Assessing Digital Language Support on a Global Scale
Gary F. Simons | Abbey L. L. Thomas | Chad K. K. White

The users of endangered languages struggle to thrive in a digitally-mediated world. We have developed an automated method for assessing how well every language recognized by ISO 639 is faring in terms of digital language support. The assessment is based on scraping the names of supported languages from the websites of 143 digital tools selected to represent a full range of ways that digital technology can support languages. The method uses Mokken scale analysis to produce an explainable model for quantifying digital language support and monitoring it on a global scale.

Persian Natural Language Inference: A Meta-learning Approach
Heydar Soudani | Mohammad Hassan Mojab | Hamid Beigy

Incorporating information from other languages can improve the results of tasks in low-resource languages. A powerful method of building functional natural language processing systems for low-resource languages is to combine multilingual pre-trained representations with cross-lingual transfer learning. In general, however, shared representations are learned separately, either across tasks or across languages. This paper proposes a meta-learning approach for inferring natural language in Persian. Alternately, meta-learning uses different task information (such as QA in Persian) or other language information (such as natural language inference in English). Also, we investigate the role of task augmentation strategy for forming additional high-quality tasks. We evaluate the proposed method using four languages and an auxiliary task. Compared to the baseline approach, the proposed model consistently outperforms it, improving accuracy by roughly six percent. We also examine the effect of finding appropriate initial parameters using zero-shot evaluation and CCA similarity.

Global Readiness of Language Technology for Healthcare: What Would It Take to Combat the Next Pandemic?
Ishani Mondal | Kabir Ahuja | Mohit Jain | Jacki O’Neill | Kalika Bali | Monojit Choudhury

The COVID-19 pandemic has brought out both the best and worst of language technology (LT). On one hand, conversational agents for information dissemination and basic diagnosis have seen widespread use, and arguably, had an important role in fighting against the pandemic. On the other hand, it has also become clear that such technologies are readily available for only a handful of languages, and the vast majority of the global south is completely bereft of these benefits. What is the state of LT, especially conversational agents, for healthcare across the world’s languages? And, what would it take to ensure global readiness of LT before the next pandemic? In this paper, we try to answer these questions through a survey of existing literature and resources, as well as through a rapid chatbot building exercise for 15 Asian and African languages with varying amounts of resource availability. The study confirms the pitiful state of LT even for languages with large speaker bases, such as Sinhala and Hausa, and identifies the gaps that could help us prioritize research and investment strategies in LT for healthcare.

Adapting Pre-trained Language Models to African Languages via Multilingual Adaptive Fine-Tuning
Jesujoba O. Alabi | David Ifeoluwa Adelani | Marius Mosbach | Dietrich Klakow

Multilingual pre-trained language models (PLMs) have demonstrated impressive performance on several downstream tasks for both high-resourced and low-resourced languages. However, there is still a large performance drop for languages unseen during pre-training, especially African languages. One of the most effective approaches to adapt to a new language is language adaptive fine-tuning (LAFT), that is, fine-tuning a multilingual PLM on monolingual texts of a language using the pre-training objective. However, adapting to each target language individually takes large disk space and limits the cross-lingual transfer abilities of the resulting models because they have been specialized for a single language. In this paper, we perform multilingual adaptive fine-tuning (MAFT) on the 17 most-resourced African languages and three other high-resource languages widely spoken on the African continent to encourage cross-lingual transfer learning. To further specialize the multilingual PLM, we removed vocabulary tokens from the embedding layer that correspond to non-African writing scripts before MAFT, thus reducing the model size by around 50%. Our evaluation on two multilingual PLMs (AfriBERTa and XLM-R) and three NLP tasks (NER, news topic classification, and sentiment classification) shows that our approach is competitive with applying LAFT on individual languages while requiring significantly less disk space. Additionally, we show that our adapted PLM also improves the zero-shot cross-lingual transfer abilities of parameter efficient fine-tuning methods.

Noun Class Disambiguation in Runyankore and Related Languages
Joan Byamugisha

Bantu languages are spoken by communities in more than half of the countries on the African continent by an estimated third of a billion people. Despite this large population and the amount of high quality linguistic research done over the years, Bantu languages are still computationally under-resourced. The biggest limitation to the development of computational methods for processing Bantu language text is their complex grammatical structure, chiefly in the system of noun classes. We investigated the use of a combined syntactic and semantic method to disambiguate among singular nouns with the same class prefix but belonging to different noun classes. This combination uses the semantic generalizations of the types of nouns in each class to overcome the limitations of relying only on the prefixes they take. We used the nearest neighbors of a query word as semantic generalizations, and developed a tool to determine the noun class based on resources in Runyankore, a Bantu language indigenous to Uganda. We also investigated whether, with the same Runyankore resources, our method had utility in other Bantu languages: Luganda, indigenous to Uganda, and Kinyarwanda, indigenous to Rwanda. For all three languages, the combined approach resulted in an improvement in accuracy, as compared to using only the syntactic or the semantic approach.

Improving Low-resource RRG Parsing with Cross-lingual Self-training
Kilian Evang | Laura Kallmeyer | Jakub Waszczuk | Kilu von Prince | Tatiana Bladier | Simon Petitjean

This paper considers the task of parsing low-resource languages in a scenario where parallel English data and also a limited seed of annotated sentences in the target language are available, as for example in bootstrapping parallel treebanks. We focus on constituency parsing using Role and Reference Grammar (RRG), a theory that has so far been understudied in computational linguistics but that is widely used in typological research, in particular in the context of low-resource languages. Starting from an existing RRG parser, we propose two strategies for low-resource parsing: first, we extend the parsing model into a cross-lingual parser, exploiting the parallel data in the high-resource language and unsupervised word alignments by providing internal states of the source-language parser to the target-language parser. Second, we adopt self-training, thereby iteratively expanding the training data, starting from the seed, by including the most confident new parses in each round. Both in simulated scenarios and with a real low-resource language (Daakaka), we find substantial and complementary improvements from both self-training and cross-lingual parsing. Moreover, we also experimented with using gloss embeddings in addition to token embeddings in the target language, and this also improves results. Finally, starting from what we have for Daakaka, we also consider parsing a related language (Dalkalaen) where glosses and English translations are available but no annotated trees at all, i.e., a no-resource scenario with respect to syntactic annotations. We start with a cross-lingual parser trained on Daakaka with glosses and use self-training to adapt it to Dalkalaen. The results are surprisingly good.
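
The self-training strategy described above (start from a seed, add the most confident new parses in each round, retrain) can be sketched generically. This is not the authors' parser: `train` and `predict` are hypothetical callables supplied by the caller, and `rounds` and `top_k` are illustrative parameters.

```python
from typing import Any, Callable, List, Tuple

Example = Tuple[str, Any]  # (sentence, analysis)


def self_train(
    seed: List[Example],
    unlabeled: List[str],
    train: Callable[[List[Example]], Any],
    predict: Callable[[Any, str], Tuple[Any, float]],
    rounds: int = 3,
    top_k: int = 2,
) -> Tuple[Any, List[Example]]:
    """Iteratively grow the training set with the most confident predictions.

    `predict(model, sentence)` must return (analysis, confidence).
    """
    labeled = list(seed)
    pool = list(unlabeled)
    model = train(labeled)
    for _ in range(rounds):
        if not pool:
            break
        # Score every remaining sentence with the current model.
        scored = [(predict(model, s), s) for s in pool]
        scored.sort(key=lambda item: item[0][1], reverse=True)
        chosen = scored[:top_k]
        labeled.extend((s, analysis) for (analysis, _), s in chosen)
        chosen_sentences = {s for _, s in chosen}
        pool = [s for s in pool if s not in chosen_sentences]
        # Retrain on the seed plus everything added so far.
        model = train(labeled)
    return model, labeled
```

A fixed `top_k` per round is only one possible selection rule; a confidence threshold would be an equally plausible choice.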

A Simple and Effective Method to Improve Zero-Shot Cross-Lingual Transfer Learning
Kunbo Ding | Weijie Liu | Yuejian Fang | Weiquan Mao | Zhe Zhao | Tao Zhu | Haoyan Liu | Rong Tian | Yiren Chen

Existing zero-shot cross-lingual transfer methods rely on parallel corpora or bilingual dictionaries, which are expensive and impractical for low-resource languages. To disengage from these dependencies, researchers have explored training multilingual models on English-only resources and transferring them to low-resource languages. However, its effect is limited by the gap between embedding clusters of different languages. To address this issue, we propose Embedding-Push, Attention-Pull, and Robust targets to transfer English embeddings to virtual multilingual embeddings without semantic loss, thereby improving cross-lingual transferability. Experimental results on mBERT and XLM-R demonstrate that our method significantly outperforms previous works on the zero-shot cross-lingual text classification task and can obtain a better multilingual alignment.

Towards Multi-Sense Cross-Lingual Alignment of Contextual Embeddings
Linlin Liu | Thien Hai Nguyen | Shafiq Joty | Lidong Bing | Luo Si

Cross-lingual word embeddings (CLWE) have been proven useful in many cross-lingual tasks. However, most existing approaches to learn CLWE including the ones with contextual embeddings are sense agnostic. In this work, we propose a novel framework to align contextual embeddings at the sense level by leveraging cross-lingual signal from bilingual dictionaries only. We operationalize our framework by first proposing a novel sense-aware cross entropy loss to model word senses explicitly. The monolingual ELMo and BERT models pretrained with our sense-aware cross entropy loss demonstrate significant performance improvement for word sense disambiguation tasks. We then propose a sense alignment objective on top of the sense-aware cross entropy loss for cross-lingual model pretraining, and pretrain cross-lingual models for several language pairs (English to German/Spanish/Japanese/Chinese). Compared with the best baseline results, our cross-lingual models achieve 0.52%, 2.09% and 1.29% average performance improvements on zero-shot cross-lingual NER, sentiment classification and XNLI tasks, respectively.

How to Parse a Creole: When Martinican Creole Meets French
Ludovic Mompelat | Daniel Dakota | Sandra Kübler

We investigate methods to develop a parser for Martinican Creole, a highly under-resourced language, using a French treebank. We compare transfer learning and multi-task learning models and examine different input features and strategies to handle the massive size imbalance between the treebanks. Surprisingly, we find that a simple concatenated (French + Martinican Creole) baseline yields optimal results even though it has access to only 80 Martinican Creole sentences. POS embeddings work better than lexical ones, but they suffer from negative transfer.

Byte-based Multilingual NMT for Endangered Languages
Mengjiao Zhang | Jia Xu

Multilingual neural machine translation (MNMT) jointly trains a shared model for translation with multiple language pairs. However, traditional subword-based MNMT approaches suffer from out-of-vocabulary (OOV) issues and representation bottleneck, which often degrades translation performance on certain language pairs. While byte tokenization is used to tackle the OOV problems in neural machine translation (NMT), until now its capability has not been validated in MNMT. Additionally, existing work has not studied how byte encoding can benefit endangered language translation to our knowledge. We propose a byte-based multilingual neural machine translation system (BMNMT) to alleviate the representation bottleneck and improve translation performance in endangered languages. Furthermore, we design a random byte mapping method with an ensemble prediction to enhance our model robustness. Experimental results show that our BMNMT consistently and significantly outperforms subword/word-based baselines on twelve language pairs up to +18.5 BLEU points, an 840% relative improvement.
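
To see why byte tokenization avoids out-of-vocabulary tokens, consider a minimal sketch that uses plain UTF-8 encoding. This illustrates byte-level input in general, not the BMNMT system or its random byte mapping; the function names are hypothetical.

```python
from typing import List


def to_byte_tokens(text: str) -> List[int]:
    """Encode text as a sequence of UTF-8 byte values (0..255)."""
    return list(text.encode("utf-8"))


def from_byte_tokens(tokens: List[int]) -> str:
    """Decode a sequence of byte values back to text."""
    return bytes(tokens).decode("utf-8")


# Any string in any script maps into the same fixed 256-symbol vocabulary,
# so no input can fall outside the vocabulary.
tokens = to_byte_tokens("añ")
print(tokens)                    # [97, 195, 177]
print(from_byte_tokens(tokens))  # añ
```

The cost is longer sequences: a non-ASCII character occupies two or more byte tokens.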

BRCC and SentiBahasaRojak: The First Bahasa Rojak Corpus for Pretraining and Sentiment Analysis Dataset
Nanda Putri Romadhona | Sin-En Lu | Bo-Han Lu | Richard Tzong-Han Tsai

Code-mixing refers to the mixed use of multiple languages. It is prevalent in multilingual societies and is also one of the most challenging natural language processing tasks. In this paper, we study Bahasa Rojak, a dialect popular in Malaysia that consists of English, Malay, and Chinese. Aiming to establish a model to deal with the code-mixing phenomena of Bahasa Rojak, we use data augmentation to automatically construct the first Bahasa Rojak corpus for pre-training language models, which we name the Bahasa Rojak Crawled Corpus (BRCC). We also develop a new pre-trained model called “Mixed XLM”. The model can tag the language of the input token automatically to process code-mixing input. Finally, to test the effectiveness of the Mixed XLM model pre-trained on BRCC for social media scenarios where code-mixing is found frequently, we compile a new Bahasa Rojak sentiment analysis dataset, SentiBahasaRojak, with a Kappa value of 0.77.
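
The abstract reports a Kappa value without naming the variant. As a point of reference only, and assuming the two-annotator case, Cohen's kappa can be computed as in the sketch below; the labels in the usage example are made up for illustration and are not from SentiBahasaRojak.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is the agreement expected by chance from each annotator's label
    distribution. Undefined (division by zero) when p_e equals 1.
    """
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    labels = set(count_a) | set(count_b)
    expected = sum(count_a[l] * count_b[l] for l in labels) / (n * n)
    return (observed - expected) / (1 - expected)


# Illustrative labels only.
ann1 = ["pos", "pos", "neg", "neg"]
ann2 = ["pos", "neg", "neg", "neg"]
print(cohens_kappa(ann1, ann2))  # 0.5
```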

WordNet-QU: Development of a Lexical Database for Quechua Varieties
Nelsi Melgarejo | Rodolfo Zevallos | Hector Gomez | John E. Ortega

In the effort to minimize the risk of extinction of a language, linguistic resources are fundamental. Quechua, a low-resource language from South America, is a language spoken by millions but, despite several efforts in the past, still lacks the resources necessary to build high-performance computational systems. In this article, we present WordNet-QU which signifies the inclusion of Quechua in a well-known lexical database called wordnet. We propose WordNet-QU to be included as an extension to wordnet after demonstrating a manually-curated collection of multiple digital resources for lexical use in Quechua. Our work uses the synset alignment algorithm to compare Quechua to its geographically nearest high-resource language, Spanish. Altogether, we propose a total of 28,582 unique synset IDs divided according to region as follows: 20,510 for Southern Quechua, 5,993 for Central Quechua, 1,121 for Northern Quechua, and 958 for Amazonian Quechua.

When the Student Becomes the Master: Learning Better and Smaller Monolingual Models from mBERT
Pranaydeep Singh | Els Lefever

In this research, we present pilot experiments to distil monolingual models from a jointly trained model for 102 languages (mBERT). We demonstrate that it is possible for the target language to outperform the original model, even with a basic distillation setup. We evaluate our methodology for 6 languages with varying amounts of resources and language families.

Zero-shot Disfluency Detection for Indian Languages
Rohit Kundu | Preethi Jyothi | Pushpak Bhattacharyya

Disfluencies that appear in the transcriptions from automatic speech recognition systems tend to impair the performance of downstream NLP tasks. Disfluency correction models can help alleviate this problem. However, the unavailability of labeled data in low-resource languages impairs progress. We propose using a pretrained multilingual model, finetuned only on English disfluencies, for zero-shot disfluency detection in Indian languages. We present a detailed pipeline to synthetically generate disfluent text and create evaluation datasets for four Indian languages: Bengali, Hindi, Malayalam, and Marathi. Even in the zero-shot setting, we obtain F1 scores of 75 and higher on five disfluency types across all four languages. We also show the utility of synthetically generated disfluencies by evaluating on real disfluent text in Bengali, Hindi, and Marathi. Finetuning the multilingual model on additional synthetic Hindi disfluent text nearly doubles the number of exact matches and yields a 20-point boost in F1 scores when evaluated on real Hindi disfluent text, compared to training with only English disfluent text.

Evaluating Word Embeddings in Extremely Under-Resourced Languages: A Case Study in Bribri
Rolando Coto-Solano

Word embeddings are critical for numerous NLP tasks but their evaluation in actual under-resourced settings needs further examination. This paper presents a case study in Bribri, a Chibchan language from Costa Rica. Four experiments were adapted from English: word similarities, WordSim353 correlations, odd-one-out tasks and analogies. Here we discuss their adaptation to an under-resourced Indigenous language and we use them to measure semantic and morphological learning. We trained 96 word2vec models with different hyperparameter combinations. The best models for this under-resourced scenario were Skip-grams with an intermediate size (100 dimensions) and large window sizes (10). These had an average correlation of r=0.28 with WordSim353, a 76% accuracy in semantic odd-one-out and 70% accuracy in structural/morphological odd-one-out. The performance was lower for the analogies: the best models could find the appropriate semantic target amongst the first 25 results approximately 60% of the time, but could only find the morphological/structural target 11% of the time. Future research needs to further explore the patterns of morphological/structural learning, to examine the behavior of deep learning embeddings, and to establish a human baseline. This project seeks to improve Bribri NLP and ultimately help in its maintenance and revitalization.
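
An odd-one-out task over word embeddings can be sketched as follows. The abstract does not state how the odd word is scored, so this version assumes a common choice: pick the word with the lowest mean cosine similarity to the other words in the set. The vectors and the labels `w1` to `w3` are placeholders, not Bribri data.

```python
import math
from typing import Dict, List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def odd_one_out(vectors: Dict[str, List[float]]) -> str:
    """Return the word least similar, on average, to the rest of the set."""
    def mean_similarity(word: str) -> float:
        others = [w for w in vectors if w != word]
        return sum(cosine(vectors[word], vectors[w]) for w in others) / len(others)

    return min(vectors, key=mean_similarity)


# Placeholder vectors: w1 and w2 point in nearly the same direction.
toy = {"w1": [1.0, 0.0], "w2": [0.9, 0.1], "w3": [0.0, 1.0]}
print(odd_one_out(toy))  # w3
```

Accuracy on such a task is then the fraction of sets for which the returned word matches the annotated odd item.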

Applying Natural Annotation and Curriculum Learning to Named Entity Recognition for Under-Resourced Languages
Valeriy Lobov | Alexandra Ivoylova | Serge Sharoff

Current practices in building new NLP models for low-resourced languages rely either on Machine Translation of training sets from better resourced languages or on cross-lingual transfer from them. Still we can see a considerable performance gap between the models originally trained within better resourced languages and the models transferred from them. In this study we test the possibility of (1) using natural annotation to build synthetic training sets from resources not initially designed for the target downstream task and (2) employing curriculum learning methods to select the most suitable examples from synthetic training sets. We test this hypothesis across seven Slavic languages and across three curriculum learning strategies on Named Entity Recognition as the downstream task. We also test the possibility of fine-tuning the synthetic resources to reflect linguistic properties, such as the grammatical case and gender, both of which are important for the Slavic languages. We demonstrate the possibility to achieve the mean F1 score of 0.78 across the three basic entities types for Belarusian starting from zero resources in comparison to the baseline of 0.63 using the zero-shot transfer from English. For comparison, the English model trained on the original set achieves the mean F1-score of 0.75. The experimental results are available from

Taking Actions Separately: A Bidirectionally-Adaptive Transfer Learning Method for Low-Resource Neural Machine Translation
Xiaolin Xing | Yu Hong | Minhan Xu | Jianmin Yao | Guodong Zhou

Training Neural Machine Translation (NMT) models suffers from sparse parallel data, in the infrequent translation scenarios towards low-resource source languages. The existing solutions primarily concentrate on the utilization of Parent-Child (PC) transfer learning. It transfers well-trained NMT models on high-resource languages (namely Parent NMT) to low-resource languages, so as to produce Child NMT models by fine-tuning. It has been carefully demonstrated that a variety of PC variants yield significant improvements for low-resource NMT. In this paper, we intend to enhance PC-based NMT by a bidirectionally-adaptive learning strategy. Specifically, we divide inner constituents (6 transformers) of Parent encoder into two “teams”, i.e., T1 and T2. During representation learning, T1 learns to encode low-resource languages conditioned on bilingual shareable latent space. Generative adversarial network and masked language modeling are used for space-shareable encoding. On the other hand, T2 is straightforwardly transferred to low-resource languages, and fine-tuned together with T1 for low-resource translation. Briefly, T1 and T2 take actions separately for different goals. The former aims to adapt to characteristics of low-resource languages during encoding, while the latter adapts to translation experiences learned from high-resource languages. We experiment on benchmark corpora SETIMES, conducting low-resource NMT for Albanian (Sq), Macedonian (Mk), Croatian (Hr) and Romanian (Ro). Experimental results show that our method yields substantial improvements, which allows the NMT performance to reach BLEU4-scores of 62.24%, 56.93%, 50.53% and 54.65% for Sq, Mk, Hr and Ro, respectively.

HCLD: A Hierarchical Framework for Zero-shot Cross-lingual Dialogue System
Zhanyu Ma | Jian Ye | Xurui Yang | Jianfeng Liu

Recently, many task-oriented dialogue systems need to serve users in different languages. However, it is time-consuming to collect enough data in each language for training. Thus, zero-shot adaptation of cross-lingual task-oriented dialog systems has been studied. Most existing methods use word-level alignments to conduct the two main tasks of a task-oriented dialogue system, i.e., intent detection and slot filling, and they rarely explore the dependency relations between these two tasks. In this paper, we propose a hierarchical framework that classifies the pre-defined intents at the high level and performs slot filling under the guidance of the intent at the low level. In particular, we incorporate sentence-level alignment among different languages to enhance the performance of intent detection. Extensive experiments show that our proposed method achieves SOTA performance on a public task-oriented dialog dataset.

GraDA: Graph Generative Data Augmentation for Commonsense Reasoning
Adyasha Maharana | Mohit Bansal

Recent advances in commonsense reasoning have been fueled by the availability of large-scale human annotated datasets. Manual annotation of such datasets, many of which are based on existing knowledge bases, is expensive and not scalable. Moreover, it is challenging to build augmentation data for commonsense reasoning because the synthetic questions need to adhere to real-world scenarios. Hence, we present GraDA, a graph-generative data augmentation framework to synthesize factual data samples from knowledge graphs for commonsense reasoning datasets. First, we train a graph-to-text model for conditional generation of questions from graph entities and relations. Then, we train a generator with GAN loss to generate distractors for synthetic questions. Our approach improves performance for SocialIQA, CODAH, HellaSwag and CommonsenseQA, and works well for generative tasks like ProtoQA. We show improvement in robustness to semantic adversaries after training with GraDA and provide human evaluation of the quality of synthetic datasets in terms of factuality and answerability. Our work provides evidence and encourages future research into graph-based generative data augmentation.

Eureka: Neural Insight Learning for Knowledge Graph Reasoning
Alex X. Zhang | Xun Liang | Bo Wu | Xiangping Zheng | Sensen Zhang | Yuhui Guo | Jun Wang | Xinyao Liu

The human recognition system has presented the remarkable ability to effortlessly learn novel knowledge from only a few trigger events based on prior knowledge, which is called insight learning. Mimicking such behavior on Knowledge Graph Reasoning (KGR) is an interesting and challenging research problem with many practical applications. Simultaneously, existing works, such as knowledge embedding and few-shot learning models, have been limited to conducting KGR in either “seen-to-seen” or “unseen-to-unseen” scenarios. To this end, we propose a neural insight learning framework named Eureka to bridge the “seen” to “unseen” gap. Eureka is empowered to learn the seen relations with sufficient training triples while providing the flexibility of learning unseen relations given only one trigger without sacrificing its performance on seen relations. Eureka meets our expectation of the model to acquire seen and unseen relations at no extra cost, and eliminate the need to retrain when encountering emerging unseen relations. Experimental results on two real-world datasets demonstrate that the proposed framework also outperforms various state-of-the-art baselines on datasets of both seen and unseen relations.

CitRet: A Hybrid Model for Cited Text Span Retrieval
Amit Pandey | Avani Gupta | Vikram Pudi

The paper aims to identify cited text spans in the reference paper related to the given citance in the citing paper. We refer to it as cited text span retrieval (CTSR). Most current methods attempt this task by relying on pre-trained off-the-shelf deep learning models like SciBERT. Though these models are pre-trained on large datasets, they under-perform in out-of-domain settings. We introduce CitRet, a novel hybrid model for CTSR that leverages unique semantic and syntactic structural characteristics of scientific documents. This enables us to use significantly less data for finetuning. We use only 1040 documents for finetuning. Our model augments mildly-trained SBERT-based contextual embeddings with pre-trained non-contextual Word2Vec embeddings to calculate semantic textual similarity. We demonstrate the performance of our model on the CLSciSumm shared tasks. It improves the state-of-the-art results by over 15% on the F1 score evaluation.

A Weak Supervision Approach for Predicting Difficulty of Technical Interview Questions
Arpita Kundu | Subhasish Ghosh | Pratik Saini | Tapas Nayak | Indrajit Bhattacharya

Predicting difficulty of questions is crucial for technical interviews. However, such questions are long-form and more open-ended than factoid and multiple choice questions explored so far for question difficulty prediction. Existing models also require large volumes of candidate response data for training. We study weak-supervision and use unsupervised algorithms for both question generation and difficulty prediction. We create a dataset of interview questions with difficulty scores for deep learning and use it to evaluate SOTA models for question difficulty prediction trained using weak supervision. Our analysis brings out the task’s difficulty as well as the promise of weak supervision for it.

Reinforcement Learning with Large Action Spaces for Neural Machine Translation
Asaf Yehudai | Leshem Choshen | Lior Fox | Omri Abend

Applying Reinforcement learning (RL) following maximum likelihood estimation (MLE) pre-training is a versatile method for enhancing neural machine translation (NMT) performance. However, recent work has argued that the gains produced by RL for NMT are mostly due to promoting tokens that have already received a fairly high probability in pre-training. We hypothesize that the large action space is a main obstacle to RL’s effectiveness in MT, and conduct two sets of experiments that lend support to our hypothesis. First, we find that reducing the size of the vocabulary improves RL’s effectiveness. Second, we find that effectively reducing the dimension of the action space without changing the vocabulary also yields notable improvement as evaluated by BLEU, semantic similarity, and human evaluation. Indeed, by initializing the network’s final fully connected layer (that maps the network’s internal dimension to the vocabulary dimension), with a layer that generalizes over similar actions, we obtain a substantial improvement in RL performance: 1.5 BLEU points on average.

Noise Learning for Text Classification: A Benchmark
Bo Liu | Wandi Xu | Yuejia Xiang | Xiaojun Wu | Lejian He | Bowen Zhang | Li Zhu

Noise learning is important in the task of text classification, which depends on massive labeled data that could be error-prone. However, we find that noise learning in text classification is relatively underdeveloped: (1) many methods that have been proven effective in the image domain are not explored in text classification, and (2) it is difficult to conduct a fair comparison between previous studies as they do experiments in different noise settings. In this work, we adapt four state-of-the-art methods of noise learning from the image domain to text classification. Moreover, we conduct comprehensive experiments on our benchmark of noise learning with seven commonly-used methods, four datasets, and five noise modes. Additionally, most previous works are based on an implicit hypothesis that the commonly-used datasets such as TREC, Ag-News and Chnsenticorp contain no errors. However, these datasets in fact contain 0.61% to 15.77% noisy labels, which we define as intrinsic noise that can cause inaccurate evaluation. Therefore, we build a new dataset, Golden-Chnsenticorp (G-Chnsenticorp), without intrinsic noise to more accurately compare the effects of different noise learning methods. To the best of our knowledge, this is the first benchmark of noise learning for text classification.

Mitigating the Diminishing Effect of Elastic Weight Consolidation
Canasai Kruengkrai | Junichi Yamagishi

Elastic weight consolidation (EWC, Kirkpatrick et al. 2017) is a promising approach to addressing catastrophic forgetting in sequential training. We find that the effect of EWC can diminish when fine-tuning large-scale pre-trained language models on different datasets. We present two simple objective functions to mitigate this problem by rescaling the components of EWC. Experiments on natural language inference and fact-checking tasks indicate that our methods require much smaller values for the trade-off parameters to achieve results comparable to EWC.
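
For context, the standard EWC objective from Kirkpatrick et al. (2017) adds a quadratic penalty that anchors each parameter to its value after the previous task, weighted by its Fisher information. The sketch below shows only that standard penalty; the two rescaled objectives proposed in this paper are not specified in the abstract and are not reproduced here. The function names are hypothetical.

```python
from typing import Sequence


def ewc_penalty(params: Sequence[float],
                old_params: Sequence[float],
                fisher: Sequence[float],
                lam: float) -> float:
    """Standard EWC penalty: (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2.

    `lam` is the trade-off parameter between the new task and the old one.
    """
    return 0.5 * lam * sum(
        f * (p - q) ** 2 for p, q, f in zip(params, old_params, fisher)
    )


def ewc_loss(task_loss: float,
             params: Sequence[float],
             old_params: Sequence[float],
             fisher: Sequence[float],
             lam: float) -> float:
    """New-task loss plus the EWC penalty."""
    return task_loss + ewc_penalty(params, old_params, fisher, lam)
```

Because the penalty scales with both `lam` and the magnitude of the Fisher values, the useful range of `lam` depends on how those values are scaled, which is the kind of sensitivity the trade-off parameter discussion above concerns.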

Token and Head Adaptive Transformers for Efficient Natural Language Processing
Chonghan Lee | Md Fahim Faysal Khan | Rita Brugarolas Brufau | Ke Ding | Vijaykrishnan Narayanan

While pre-trained language models like BERT have achieved impressive results on various natural language processing tasks, deploying them on resource-restricted devices is challenging due to their intensive computational cost and memory footprint. Previous approaches mainly focused on training smaller versions of a BERT model with competitive accuracy under limited computational resources. In this paper, we extend Length Adaptive Transformer and propose to design Token and Head Adaptive Transformer, which can compress and accelerate various BERT-based models via simple fine-tuning. We train a transformer with a progressive token and head pruning scheme, eliminating a large number of redundant tokens and attention heads in the later layers. Then, we conduct a multi-objective evolutionary search with the overall number of floating point operations (FLOPs) as its efficiency constraint to find joint token and head pruning strategies that maximize accuracy and efficiency under various computational budgets. Empirical studies show that a large portion of tokens and attention heads could be pruned while achieving superior performance compared to the baseline BERT-based models and Length Adaptive Transformers in various downstream NLP tasks. MobileBERT trained with our joint token and head pruning scheme achieves a GLUE score of 83.0, which is 1.4 higher than Length Adaptive Transformer and 2.9 higher than the original model.

Don’t Judge a Language Model by Its Last Layer: Contrastive Learning with Layer-Wise Attention Pooling
Dongsuk Oh | Yejin Kim | Hodong Lee | H. Howie Huang | Heuiseok Lim

Recent pre-trained language models (PLMs) have achieved great success on many natural language processing tasks by learning linguistic features and contextualized sentence representations. Since the attributes captured in the stacked layers of PLMs are not clearly identified, straightforward approaches such as embedding the last layer are commonly preferred to derive sentence representations from PLMs. This paper introduces an attention-based pooling strategy, which enables the model to preserve layer-wise signals captured in each layer and learn digested linguistic features for downstream tasks. The contrastive learning objective can adapt the layer-wise attention pooling to both unsupervised and supervised settings. This regularizes the anisotropic space of pre-trained embeddings and makes it more uniform. We evaluate our model on standard semantic textual similarity (STS) and semantic search tasks. Our method improves on the performance of the contrastively trained BERT-base baseline and its variants.
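The idea of pooling over layers with attention can be sketched as a softmax-weighted sum of per-layer sentence vectors. This is a minimal illustration, not the paper's architecture: the abstract does not specify how attention scores are computed, so the dot product against a single learned `query` vector is an assumption, as are all function names.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def layerwise_attention_pool(
    layer_vectors: Sequence[Sequence[float]],
    query: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Pool one sentence vector per layer into a single vector.

    layer_vectors : one sentence vector per transformer layer (same dimension)
    query         : hypothetical learned vector used to score each layer
    Returns (pooled_vector, layer_weights).
    """
    # Score each layer by its dot product with the query (assumed scoring rule).
    scores = [sum(q * v for q, v in zip(query, vec)) for vec in layer_vectors]
    weights = softmax(scores)
    dim = len(layer_vectors[0])
    # Weighted sum over layers, computed per dimension.
    pooled = [
        sum(w * vec[d] for w, vec in zip(weights, layer_vectors))
        for d in range(dim)
    ]
    return pooled, weights
```

With a zero `query`, every layer gets the same weight and the result reduces to mean pooling over layers. Taking only the last layer corresponds to putting all the weight on one entry.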

SHAP-Based Explanation Methods: A Review for NLP Interpretability
Edoardo Mosca | Ferenc Szigeti | Stella Tragianni | Daniel Gallagher | Georg Groh

Model explanations are crucial for the transparent, safe, and trustworthy deployment of machine learning models. The SHapley Additive exPlanations (SHAP) framework is considered by many to be a gold standard for local explanations thanks to its solid theoretical background and general applicability. In the years following its publication, several variants appeared in the literature—presenting adaptations in the core assumptions and target applications. In this work, we review all relevant SHAP-based interpretability approaches available to date and provide instructive examples as well as recommendations regarding their applicability to NLP use cases.

A Simple Log-based Loss Function for Ordinal Text Classification
François Castagnos | Martin Mihelich | Charles Dognin

The cross-entropy loss function is widely used and generally considered the default loss function for text classification. In ordinal text classification, where there is an ordinal relationship between labels, the cross-entropy is not optimal as it does not incorporate the ordinal character into its feedback. In this paper, we propose a new simple loss function called ordinal log-loss (OLL). We show that this loss function outperforms previously introduced state-of-the-art losses on four benchmark text classification datasets.
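To illustrate how a log-based loss can take label order into account, the sketch below penalizes the probability assigned to each wrong class in proportion to its distance from the true label. The abstract does not state the OLL formula, so this particular form, the absolute-difference distance, and the `alpha` exponent are assumptions made for illustration.

```python
import math
from typing import Sequence


def ordinal_log_loss(
    probs: Sequence[float],
    target: int,
    alpha: float = 1.0,
    eps: float = 1e-12,
) -> float:
    """Distance-weighted log loss for ordinal labels (illustrative form).

    loss = - sum_i log(1 - p_i) * |i - target| ** alpha

    probs  : predicted class probabilities, ordered by label rank
    target : index of the true label
    alpha  : exponent on the distance, assumed > 0 (hypothetical parameter)
    eps    : floor that keeps log() finite when some p_i equals 1
    """
    loss = 0.0
    for i, p in enumerate(probs):
        distance = abs(i - target) ** alpha
        loss -= math.log(max(1.0 - p, eps)) * distance
    return loss
```

Unlike cross-entropy, which depends only on the probability of the true class, this form assigns a larger loss when the misplaced probability mass sits on a label further from the target.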

Ask Question First for Enhancing Lifelong Language Learning
Han Wang | Ruiliu Fu | Xuejun Zhang | Jun Zhou | Qingwei Zhao

Lifelong language learning aims to learn a stream of NLP tasks while retaining knowledge of previous tasks. Previous works based on the language model and following data-free constraint approaches have explored formatting all data as “begin token (B) + context (C) + question (Q) + answer (A)” for different tasks. However, they still suffer from catastrophic forgetting, which is exacerbated when the previous task’s pseudo data is insufficient, for the following reasons: (1) the model has difficulty generating task-corresponding pseudo data, and (2) A is prone to error when A and C are separated by Q, because the information in C is diminished before A is generated. Therefore, we propose Ask Question First and Replay Question (AQF-RQ), which includes a novel data format “BQCA” and a new training task to train pseudo questions of previous tasks. Experimental results demonstrate that AQF-RQ makes it easier for the model to generate more pseudo data that match the corresponding tasks, and is more robust to both sufficient and insufficient pseudo data, whether the task boundary is clear or unclear. AQF-RQ performs only 0.36% lower than multi-task learning.

DoubleMix: Simple Interpolation-Based Data Augmentation for Text Classification
Hui Chen | Wei Han | Diyi Yang | Soujanya Poria

This paper proposes a simple yet effective interpolation-based data augmentation approach, termed DoubleMix, to improve the robustness of models in text classification. DoubleMix first leverages a couple of simple augmentation operations to generate several perturbed samples for each training sample, and then uses the perturbed data and original data to carry out a two-step interpolation in the hidden space of neural models. Concretely, it first mixes the perturbed data into a synthetic sample and then mixes the original data with the synthetic perturbed data. DoubleMix enhances models’ robustness by learning the “shifted” features in hidden space. On six text classification benchmark datasets, our approach outperforms several popular text augmentation methods including token-level, sentence-level, and hidden-level data augmentation techniques. Also, experiments in low-resource settings show our approach consistently improves models’ performance when the training data is scarce. Extensive ablation studies and case studies confirm that each component of our approach contributes to the final performance and show that our approach exhibits superior performance on challenging counterexamples. Additionally, visual analysis shows that text features generated by our approach are highly interpretable.
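The two-step interpolation described above can be sketched on plain vectors standing in for hidden states. This is an illustration under stated assumptions, not the authors' implementation: how the mixing weights are chosen is not given in the abstract, so `weights` and `lam` are passed in explicitly, and the function name `double_mix` is hypothetical.

```python
from typing import List, Sequence


def double_mix(
    h_orig: Sequence[float],
    h_perturbed: Sequence[Sequence[float]],
    weights: Sequence[float],
    lam: float,
) -> List[float]:
    """Two-step interpolation of hidden representations.

    Step 1: mix the perturbed samples into one synthetic sample
            using convex weights (non-negative, summing to 1).
    Step 2: mix the original sample with the synthetic sample,
            keeping a share `lam` of the original.
    """
    if len(weights) != len(h_perturbed):
        raise ValueError("need one weight per perturbed sample")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")

    dim = len(h_orig)
    # Step 1: convex combination of the perturbed hidden states.
    synthetic = [
        sum(w * h[d] for w, h in zip(weights, h_perturbed)) for d in range(dim)
    ]
    # Step 2: interpolate between the original and the synthetic sample.
    return [lam * h_orig[d] + (1.0 - lam) * synthetic[d] for d in range(dim)]
```

For example, `double_mix([1.0, 1.0], [[0.0, 0.0], [4.0, 0.0]], [0.5, 0.5], 0.5)` first forms the synthetic sample `[2.0, 0.0]` and then averages it with the original. Setting `lam` closer to 1 keeps the result near the original representation.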

Large Sequence Representation Learning via Multi-Stage Latent Transformers
Ionut-Catalin Sandu | Daniel Voinea | Alin-Ionut Popa

We present LANTERN, a multi-stage transformer architecture for named-entity recognition (NER) designed to operate on indefinitely large text sequences (i.e. > 512 elements). For a given image of a form with structured text, our method uses language and spatial features to predict the entity tags of each text element. It breaks the quadratic computational constraints of the attention mechanism by operating over a learned latent space representation which encodes the input sequence via the cross-attention mechanism while having the multi-stage encoding component as a refinement over the NER predictions. As a proxy task, we propose RADAR, an LSTM classifier operating at character level, which predicts the relevance of a word with respect to the entity-recognition task. Additionally, we formulate a challenging novel NER use case, nutritional information extraction from food product labels. We created the TREAT dataset, which contains 11,926 images depicting food product labels with fully detailed annotations. Our method achieves superior performance against two competitive models designed for long sequences on the proposed TREAT dataset.

MockingBERT: A Method for Retroactively Adding Resilience to NLP Models
Jan Jezabek | Akash Singh

Protecting NLP models against misspellings whether accidental or adversarial