Proceedings of the Second Workshop on Data Science with Human in the Loop: Language Advances

Eduard Dragut, Yunyao Li, Lucian Popa, Slobodan Vucetic (Editors)

Association for Computational Linguistics

Leveraging Wikipedia Navigational Templates for Curating Domain-Specific Fuzzy Conceptual Bases
Krati Saxena | Tushita Singh | Ashwini Patil | Sagar Sunkle | Vinay Kulkarni

Domain-specific conceptual bases use key concepts to capture domain scope and relevant information. Conceptual bases serve as a foundation for various downstream tasks, including ontology construction, information mapping, and analysis. However, building conceptual bases requires domain awareness and takes time. Wikipedia navigational templates group multiple articles on the same or similar domain, so the templates can be used to recognize the fundamental concepts that shape the domain. Earlier work in this area used Wikipedia’s structured and unstructured data to construct open-domain ontologies, domain terminologies, and knowledge bases. In this work, we present a novel method for leveraging navigational templates to create domain-specific fuzzy conceptual bases. Our system generates knowledge graphs from the articles mentioned in the template, which we then process using Wikidata and machine learning algorithms. We filter important concepts using fuzzy logic on network metrics to create a crude conceptual base. Finally, a domain expert refines the conceptual base. We demonstrate our system using an example of RNA virus antiviral drugs.

It is better to Verify: Semi-Supervised Learning with a human in the loop for large-scale NLU models
Verena Weber | Enrico Piovano | Melanie Bradford

When an NLU model is updated, new utterances must be annotated to be included for training. However, manual annotation is very costly. We evaluate a semi-supervised learning workflow with a human in the loop in a production environment. The previous NLU model predicts the annotation of the new utterances, and a human then reviews the predicted annotation. Only when the NLU prediction is assessed as incorrect is the utterance sent for human annotation. Experimental results show that the proposed workflow boosts the performance of the NLU model while significantly reducing the annotation volume. Specifically, in our setup, we see improvements of up to 14.16% for a recall-based metric and up to 9.57% for an F1-score based metric, while reducing the annotation volume by 97% and overall cost by 60% for each iteration.
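The verify-before-annotate loop described in this abstract can be sketched roughly as follows. This is a minimal illustration only, not the authors' production system: the function name and the `predict`, `verify`, and `annotate` callables are hypothetical stand-ins for the previous NLU model, the human reviewer, and the human annotator.

```python
from typing import Callable, Iterable, List, Tuple


def verify_then_annotate(
    utterances: Iterable[str],
    predict: Callable[[str], str],        # hypothetical: previous NLU model
    verify: Callable[[str, str], bool],   # hypothetical: human accepts or rejects
    annotate: Callable[[str], str],       # hypothetical: full human annotation
) -> Tuple[List[Tuple[str, str]], int]:
    """Label utterances, sending only rejected predictions to annotation."""
    labeled = []
    sent_for_annotation = 0
    for text in utterances:
        predicted = predict(text)
        if verify(text, predicted):
            # Reviewer accepted the model's prediction: keep it as the label.
            labeled.append((text, predicted))
        else:
            # Reviewer rejected the prediction: fall back to manual annotation.
            labeled.append((text, annotate(text)))
            sent_for_annotation += 1
    return labeled, sent_for_annotation


if __name__ == "__main__":
    gold = {"play jazz": "PlayMusic", "stop": "Stop"}
    labeled, n_manual = verify_then_annotate(
        gold,
        predict=lambda t: "PlayMusic",       # toy model that always says PlayMusic
        verify=lambda t, p: gold[t] == p,    # toy reviewer with access to gold
        annotate=lambda t: gold[t],
    )
    print(labeled, n_manual)
```

The point of the design is that verification is a cheaper judgment than annotation from scratch, so the manual step runs only on the rejected predictions.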

ViziTex: Interactive Visual Sense-Making of Text Corpora
Natraj Raman | Sameena Shah | Tucker Balch | Manuela Veloso

Information visualization is critical to analytical reasoning and knowledge discovery. We present an interactive studio that integrates perceptive visualization techniques with powerful text analytics algorithms to assist humans in sense-making of large complex text corpora. The novel visual representations introduced here encode the features delivered by modern text mining models using advanced metaphors such as hypergraphs, nested topologies and tessellated planes. They enhance human-computer interaction experience for various tasks such as summarization, exploration, organization and labeling of documents. We demonstrate the ability of the visuals to surface the structure, relations and concepts from documents across different domains.

A Visualization Approach for Rapid Labeling of Clinical Notes for Smoking Status Extraction
Saman Enayati | Ziyu Yang | Benjamin Lu | Slobodan Vucetic

Labeling is typically the most human-intensive step during the development of supervised learning models. In this paper, we propose a simple and easy-to-implement visualization approach that reduces cognitive load and increases the speed of text labeling. The approach is fine-tuned for the task of extracting patient smoking status from clinical notes. The proposed approach consists of ordering the sentences that mention smoking, centering them at smoking tokens, and annotating them to highlight informative parts of the text. Our experiments on clinical notes from the MIMIC-III clinical database demonstrate that our visualization approach enables human annotators to label sentences up to 3 times faster than with a baseline approach.
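One way to picture the "centering at smoking tokens" idea is a keyword-in-context layout, sketched below. This is an assumption-laden illustration rather than the authors' tool: the keyword pattern, the `width` parameter, the bracket highlighting, and the sort key are all hypothetical choices.

```python
import re
from typing import List, Optional

# Hypothetical keyword pattern; the paper's actual token list is not given here.
SMOKING_PATTERN = re.compile(r"\b(smok\w*|tobacco|cigarette\w*)\b", re.IGNORECASE)


def center_on_keyword(sentence: str, width: int = 30) -> Optional[str]:
    """Return the sentence aligned so the first smoking token starts at a fixed column."""
    match = SMOKING_PATTERN.search(sentence)
    if match is None:
        return None
    left = sentence[: match.start()][-width:]   # keep at most `width` chars of left context
    right = sentence[match.end():][:width]      # keep at most `width` chars of right context
    # Right-align the left context so every keyword lands in the same column.
    return f"{left:>{width}}[{match.group(0)}]{right}"


def align_sentences(sentences: List[str], width: int = 30) -> List[str]:
    """Center all sentences that mention smoking and order them by right context."""
    rows = [row for row in (center_on_keyword(s, width) for s in sentences) if row]
    # Sorting on the text from the keyword onward groups similar phrasings together.
    return sorted(rows, key=lambda row: row[width:].lower())


if __name__ == "__main__":
    notes = ["Patient denies smoking.", "No fever.", "He quit tobacco in 2010."]
    for line in align_sentences(notes, width=20):
        print(line)
```

Aligning the keyword in one column lets an annotator scan a single vertical strip instead of reading each sentence in full.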

Semi-supervised Interactive Intent Labeling
Saurav Sahay | Eda Okur | Nagib Hakim | Lama Nachman

Building the Natural Language Understanding (NLU) modules of task-oriented Spoken Dialogue Systems (SDS) involves a definition of intents and entities, collection of task-relevant data, annotating the data with intents and entities, and then repeating the same process over and over again for adding any functionality/enhancement to the SDS. In this work, we showcase an Intent Bulk Labeling system where SDS developers can interactively label and augment training data from unlabeled utterance corpora using advanced clustering and visual labeling methods. We extend the Deep Aligned Clustering work with a better backbone BERT model, explore techniques to select the seed data for labeling, and develop a data balancing method using an oversampling technique that utilizes paraphrasing models. We also look at the effect of data augmentation on the clustering process. Our results show that we can achieve over 10% gain in clustering accuracy on some datasets using the combination of the above techniques. Finally, we extract utterance embeddings from the clustering model and plot the data to interactively bulk label the samples, reducing the time and effort for data labeling of the whole dataset significantly.

Human-In-The-Loop Entity Linking for Low Resource Domains
Jan-Christoph Klie | Richard Eckart de Castilho | Iryna Gurevych

Entity linking (EL) is concerned with disambiguating entity mentions in a text against knowledge bases (KB). To quickly annotate texts with EL even in low-resource domains and noisy text, we present a novel Human-In-The-Loop EL approach. We show that it greatly outperforms a strong baseline in simulation. In a user study, annotation time is reduced by 35% compared to annotating without interactive support; users report that they strongly prefer our system over ones without. An open-source and ready-to-use implementation based on the text annotation platform INCEpTION is made available.

Bridging Multi-disciplinary Collaboration Challenges in ML Development via Domain Knowledge Elicitation
Soya Park

Building a machine learning model in a sophisticated domain is a time-consuming process, partially due to the steep learning curve of domain knowledge for data scientists. We introduce Ziva, an interface for conveying domain knowledge from domain experts to data scientists in two ways: (1) a concept creation interface where domain experts extract important concepts of the domain and (2) five kinds of justification elicitation interfaces that solicit explanations of how the domain concepts are expressed in data instances.

Active learning and negative evidence for language identification
Thomas Lippincott | Ben Van Durme

Language identification (LID), the task of determining the natural language of a given text, is an essential first step in most NLP pipelines. While generally a solved problem for documents of sufficient length and languages with ample training data, the proliferation of microblogs and other social media has made it increasingly common to encounter use-cases that *don’t* satisfy these conditions. In these situations, the fundamental difficulty is the lack of, and cost of gathering, labeled data: unlike some annotation tasks, no single “expert” can quickly and reliably identify more than a handful of languages. This leads to a natural question: can we gain useful information when annotators are only able to *rule out* languages for a given document, rather than supply a positive label? What are the optimal choices for gathering and representing such *negative evidence* as a model is trained? In this paper, we demonstrate that using negative evidence can improve the performance of a simple neural LID model. This improvement is sensitive to policies of how the evidence is represented in the loss function, and for deciding which annotators to employ given the instance and model state. We consider simple policies and report experimental results that indicate the optimal choices for this task. We conclude with a discussion of future work to determine if and how the results generalize to other classification tasks.
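To make "negative evidence in the loss function" concrete, here is a minimal sketch of one possible representation. It is an assumption for illustration, not necessarily the policy the authors found optimal: when an annotator rules out a set of languages, the loss penalizes the probability mass the model places on that set. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over language logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def positive_loss(probs: Sequence[float], label: int) -> float:
    """Standard cross-entropy when the annotator supplies the language."""
    return -math.log(probs[label])


def negative_evidence_loss(
    probs: Sequence[float], ruled_out: Sequence[int], eps: float = 1e-12
) -> float:
    """Loss when the annotator only rules languages out.

    Assumed form: -log(1 - sum of probabilities on the ruled-out languages),
    which is zero when the model already puts no mass on them.
    """
    remaining = 1.0 - sum(probs[k] for k in ruled_out)
    return -math.log(max(remaining, eps))  # eps guards against log(0)


if __name__ == "__main__":
    probs = softmax([0.0, 0.0, 0.0, 0.0])  # uniform over four languages
    print(positive_loss(probs, 2))
    print(negative_evidence_loss(probs, [0, 1]))
```

Under this form, ruling out more languages yields a stronger training signal, while ruling out none contributes no loss at all.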

Towards integrated, interactive, and extensible text data analytics with Leam
Peter Griggs | Cagatay Demiralp | Sajjadur Rahman

From tweets to product reviews, text is ubiquitous on the web and often contains valuable information for both enterprises and consumers. However, the online text is generally noisy and incomplete, requiring users to process and analyze the data to extract insights. While there are systems effective for different stages of text analysis, users lack extensible platforms to support interactive text analysis workflows end-to-end. To facilitate integrated text analytics, we introduce LEAM, which aims at combining the strengths of spreadsheets, computational notebooks, and interactive visualizations. LEAM supports interactive analysis via GUI-based interactions and provides a declarative specification language, implemented based on a visual text algebra, to enable user-guided analysis. We evaluate LEAM through two case studies using two popular Kaggle text analytics workflows to understand the strengths and weaknesses of the system.

Data Cleaning Tools for Token Classification Tasks
Karthik Muthuraman | Frederick Reiss | Hong Xu | Bryan Cutler | Zachary Eichenberger

Human-in-the-loop systems for cleaning NLP training data rely on automated sieves to isolate potentially-incorrect labels for manual review. We have developed a novel technique for flagging potentially-incorrect labels with high sensitivity in named entity recognition corpora. We incorporated our sieve into an end-to-end system for cleaning NLP corpora, implemented as a modular collection of Jupyter notebooks built on extensions to the Pandas DataFrame library. We used this system to identify incorrect labels in the CoNLL-2003 corpus for English-language named entity recognition (NER), one of the most influential corpora for NER model research. Unlike previous work that only looked at a subset of the corpus’s validation fold, our automated sieve enabled us to examine the entire corpus in depth. Across the entire CoNLL-2003 corpus, we identified over 1300 incorrect labels (out of 35089 in the corpus). We have published our corrections, along with the code we used in our experiments. We are developing a repeatable version of the process we used on the CoNLL-2003 corpus as an open-source library.

Building Low-Resource NER Models Using Non-Speaker Annotations
Tatiana Tsygankova | Francesca Marini | Stephen Mayhew | Dan Roth

In low-resource natural language processing (NLP), the key problems are a lack of target language training data, and a lack of native speakers to create it. Cross-lingual methods have had notable success in addressing these concerns, but in certain common circumstances, such as insufficient pre-training corpora or languages far from the source language, their performance suffers. In this work we propose a complementary approach to building low-resource Named Entity Recognition (NER) models using “non-speaker” (NS) annotations, provided by annotators with no prior experience in the target language. We recruit 30 participants in a carefully controlled annotation experiment with Indonesian, Russian, and Hindi. We show that use of NS annotators produces results that are consistently on par with or better than cross-lingual methods built on modern contextual representations, and have the potential to outperform them with additional effort. We conclude with observations of common annotation patterns and recommended implementation practices, and motivate how NS annotations can be used in addition to prior methods for improved performance.

Evaluating and Explaining Natural Language Generation with GenX
Kayla Duskin | Shivam Sharma | Ji Young Yun | Emily Saldanha | Dustin Arendt

Current methods for evaluation of natural language generation models focus on measuring text quality but fail to probe model creativity, i.e., the model's ability to generate novel but coherent text sequences not seen in the training corpus. We present the GenX tool, which is designed to enable interactive exploration and explanation of natural language generation outputs with a focus on the detection of memorization. We demonstrate the utility of the tool on two domain-conditioned generation use cases: phishing emails and ACL abstracts.

CrossCheck: Rapid, Reproducible, and Interpretable Model Evaluation
Dustin Arendt | Zhuanyi Shaw | Prasha Shrestha | Ellyn Ayton | Maria Glenski | Svitlana Volkova

Evaluation beyond aggregate performance metrics, e.g. F1-score, is crucial to both establish an appropriate level of trust in machine learning models and identify avenues for future model improvements. In this paper we demonstrate CrossCheck, an interactive capability for rapid cross-model comparison and reproducible error analysis. We describe the tool, discuss design and implementation details, and present three NLP use cases (named entity recognition, reading comprehension, and clickbait detection) that show the benefits of using the tool for model evaluation. CrossCheck enables users to make informed decisions when choosing between multiple models, identify when the models are correct and for which examples, investigate whether the models are making the same mistakes as humans, evaluate models’ generalizability, and highlight models’ limitations, strengths and weaknesses. Furthermore, CrossCheck is implemented as a Jupyter widget, which allows for rapid and convenient integration into existing model development workflows.

TopGuNN: Fast NLP Training Data Augmentation using Large Corpora
Rebecca Iglesias-Flores | Megha Mishra | Ajay Patel | Akanksha Malhotra | Reno Kriz | Martha Palmer | Chris Callison-Burch

Acquiring training data for natural language processing systems can be expensive and time-consuming. Given a few training examples crafted by experts, large corpora can be mined for thousands of semantically similar examples that provide useful variability to improve model generalization. We present TopGuNN, a fast contextualized k-NN retrieval system that can efficiently index and search over contextual embeddings generated from large corpora. TopGuNN is demonstrated for a training data augmentation use case over the Gigaword corpus. Using approximate k-NN and an efficient architecture, TopGuNN performs queries over an embedding space of 4.63TB (approximately 1.5B embeddings) in less than a day.
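The retrieval step behind this kind of augmentation can be illustrated with a toy nearest-neighbor search. Note the substitution: TopGuNN is described as using approximate k-NN over contextual embeddings at very large scale, whereas the sketch below is an exact brute-force cosine search over a tiny in-memory index, with hypothetical function names and made-up vectors.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(
    query: Sequence[float], index: Dict[str, Sequence[float]], k: int = 2
) -> List[Tuple[float, str]]:
    """Exact k-NN: score every indexed embedding and keep the k most similar."""
    scored = ((cosine(query, vec), key) for key, vec in index.items())
    return heapq.nlargest(k, scored)


if __name__ == "__main__":
    # Toy 2-d "embeddings" keyed by an example identifier.
    index = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
    for score, key in top_k([1.0, 0.1], index, k=2):
        print(key, round(score, 3))
```

Brute force scales linearly with the number of embeddings per query, which is why a system searching on the order of a billion vectors would rely on an approximate index instead.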

Everyday Living Artificial Intelligence Hub
Raymond Finzel | Esha Singh | Martin Michalowski | Maria Gini | Serguei Pakhomov

We present the Everyday Living Artificial Intelligence (AI) Hub, a novel proof-of-concept framework for enhancing human health and wellbeing via a combination of tailored wearable and Conversational Agent (CA) solutions for non-invasive monitoring of physiological signals, assessment of behaviors through unobtrusive wearable devices, and the provision of personalized interventions to reduce stress and anxiety. We utilize recent advancements and industry standards in the Internet of Things (IoT) and AI technologies to develop this proof-of-concept framework.

A Computational Model for Interactive Transcription
William Lane | Mat Bettinson | Steven Bird

Transcribing low resource languages can be challenging in the absence of a good lexicon and trained transcribers. Accordingly, we seek a way to enable interactive transcription whereby the machine amplifies human efforts. This paper presents a data model and a system architecture for interactive transcription, supporting multiple modes of interactivity, increasing the likelihood of finding tasks that engage local participation in language work. The approach also supports other applications which are useful in our context, including spoken document retrieval and language learning.