2022
Proceedings of the Fourth Workshop on Data Science with Human-in-the-Loop (Language Advances)
Eduard Dragut | Yunyao Li | Lucian Popa | Slobodan Vucetic | Shashank Srivastava
OpenStance: Real-world Zero-shot Stance Detection
Hanzi Xu | Slobodan Vucetic | Wenpeng Yin
Proceedings of the 26th Conference on Computational Natural Language Learning (CoNLL)
Prior studies of zero-shot stance detection identify the attitude of texts towards unseen topics occurring in the same document corpus. This task formulation has three limitations: (i) a system is optimized on a particular dataset from a single domain, so it cannot work well on other datasets; (ii) the model is evaluated on a limited number of unseen topics; (iii) some of the topics are assumed to have rich annotations, which might be impossible in real-world applications. These drawbacks lead to an impractical stance detection system that fails to generalize to open domains and open-form topics. This work defines OpenStance: open-domain zero-shot stance detection, which aims to handle stance detection in an open world with neither domain constraints nor topic-specific annotations. The key challenge of OpenStance lies in open-domain generalization: learning a system from entirely non-specific supervision that can still generalize to any dataset. To solve OpenStance, we propose to combine indirect supervision from textual entailment datasets with weak supervision from data generated automatically by pre-trained language models. Our single system, without any topic-specific supervision, outperforms the supervised method on three popular datasets. To our knowledge, this is the first work that studies stance detection under the open-domain zero-shot setting. All data and code will be publicly released.
2021
Proceedings of the Second Workshop on Data Science with Human in the Loop: Language Advances
Eduard Dragut | Yunyao Li | Lucian Popa | Slobodan Vucetic
A Visualization Approach for Rapid Labeling of Clinical Notes for Smoking Status Extraction
Saman Enayati | Ziyu Yang | Benjamin Lu | Slobodan Vucetic
Proceedings of the Second Workshop on Data Science with Human in the Loop: Language Advances
Labeling is typically the most human-intensive step during the development of supervised learning models. In this paper, we propose a simple and easy-to-implement visualization approach that reduces cognitive load and increases the speed of text labeling. The approach is fine-tuned for the task of extracting patient smoking status from clinical notes. The proposed approach consists of ordering sentences that mention smoking, centering them on smoking tokens, and annotating them to highlight informative parts of the text. Our experiments on clinical notes from the MIMIC-III clinical database demonstrate that our visualization approach enables human annotators to label sentences up to 3 times faster than with a baseline approach.
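The centering and ordering steps described in this abstract can be illustrated with a minimal keyword-in-context sketch. It is not the authors' implementation: the keyword pattern `SMOKING`, the `width` parameter, the bracket highlighting, and the sort order (by the text to the right of the keyword) are all assumptions made for illustration.

```python
import re

# Assumed keyword list; the paper's actual set of smoking tokens is not given here.
SMOKING = re.compile(r"\b(smok\w*|tobacco|cigarette\w*)\b", re.IGNORECASE)

def center_on_keyword(sentence, width=30):
    """Return the sentence aligned so the first smoking token starts at column `width`.

    Returns None when the sentence does not mention smoking.
    """
    m = SMOKING.search(sentence)
    if m is None:
        return None
    left = sentence[:m.start()][-width:]   # keep at most `width` chars of left context
    right = sentence[m.end():][:width]     # keep at most `width` chars of right context
    # Right-align the left context so every keyword lands in the same column.
    return f"{left:>{width}}[{m.group()}]{right}"

def layout(sentences, width=30):
    """Center all smoking sentences and order them (assumed: by right-hand context)."""
    rows = [r for r in (center_on_keyword(s, width) for s in sentences) if r is not None]
    return sorted(rows, key=lambda r: r[width:].lower())
```

Printing the rows returned by `layout` stacks the bracketed keywords in one column, so an annotator can scan the neighboring context without rereading each sentence from the start.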
2020
Improving Word Embeddings through Iterative Refinement of Word- and Character-level Models
Phong Ha | Shanshan Zhang | Nemanja Djuric | Slobodan Vucetic
Proceedings of the 28th International Conference on Computational Linguistics
Embedding rare and out-of-vocabulary (OOV) words is an important open NLP problem. A popular solution is to train a character-level neural network to reproduce the embeddings from a standard word embedding model. The trained network is then used to assign vectors to any input string, including OOV and rare words. We enhance this approach and introduce an algorithm that iteratively refines and improves both word- and character-level models. We demonstrate that our method outperforms existing algorithms on five word similarity datasets, and that it can be successfully applied to job title normalization, an important problem in the e-recruitment domain that suffers from the OOV problem.
2019
Spatial Aggregation Facilitates Discovery of Spatial Topics
Aniruddha Maiti | Slobodan Vucetic
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
Spatial aggregation refers to the merging of documents created at the same spatial location. We show that by spatially aggregating a large collection of documents and applying a traditional topic discovery algorithm to the aggregated data, we can efficiently discover spatially distinct topics. Viewing topic discovery through the lens of matrix factorization, we show that spatial aggregation allows a low-rank approximation of the original document-word matrix in which spatially distinct topics are preserved and non-spatial topics are aggregated into a single topic. Our experiments on synthetic data confirm this observation. Our experiments on 4.7 million tweets collected during Hurricane Sandy in 2012 show that spatial and temporal aggregation allows rapid discovery of relevant spatial and temporal topics during that period. Our work indicates that different forms of document aggregation might be effective for rapid discovery of various types of distinct topics in large collections of documents.
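As a rough illustration of the aggregation step only, the sketch below merges documents that fall into the same latitude/longitude grid cell. The grid discretization, the `cell_size` parameter, and the `(lat, lon, text)` input format are assumptions; the abstract does not say how "same spatial location" is defined, and the topic model itself is not shown.

```python
from collections import defaultdict

def aggregate_by_cell(docs, cell_size=0.1):
    """Merge documents whose coordinates fall in the same grid cell.

    docs: iterable of (lat, lon, text) tuples.
    cell_size: assumed cell width in degrees (hypothetical parameter).
    Returns a dict mapping a cell key to one concatenated pseudo-document.
    """
    cells = defaultdict(list)
    for lat, lon, text in docs:
        # Floor division maps each coordinate to an integer cell index.
        key = (int(lat // cell_size), int(lon // cell_size))
        cells[key].append(text)
    # One merged document per cell, preserving input order within a cell.
    return {key: " ".join(texts) for key, texts in cells.items()}
```

The merged pseudo-documents could then be passed to any standard topic discovery algorithm in place of the original, much larger collection.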
2018
Regular Expression Guided Entity Mention Mining from Noisy Web Data
Shanshan Zhang | Lihong He | Slobodan Vucetic | Eduard Dragut
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
Many important entity types in web documents, such as dates, times, email addresses, and course numbers, follow or closely resemble patterns that can be described by Regular Expressions (REs). Due to the vast diversity of web documents and the ways in which they are generated, even seemingly straightforward tasks such as identifying mentions of dates in a document become very challenging. It is reasonable to claim that it is impossible to create an RE that is capable of identifying such entities in web documents with perfect precision and recall. Rather than abandoning REs as a go-to approach for entity detection, this paper explores ways to combine the expressive power of REs, the ability of deep learning to learn from large data, and a human-in-the-loop approach into a new integrated framework for entity identification from web data. The framework starts by creating or collecting existing REs for a particular entity type. Those REs are then applied to a large document corpus to collect weak labels for the entity mentions, and a neural network is trained to predict those RE-generated weak labels. Finally, a human expert is asked to label a small set of documents, and the neural network is fine-tuned on those documents. The experimental evaluation on several entity identification problems shows that the proposed framework achieves impressive accuracy while requiring very modest human effort.
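The weak-labeling step of this framework can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: the two date patterns in `DATE_PATTERNS` are hypothetical and deliberately imperfect, and the token-level 0/1 labeling scheme is one possible choice. Training and fine-tuning the neural network are not shown.

```python
import re

# Hypothetical, deliberately incomplete date patterns (not taken from the paper).
DATE_PATTERNS = [
    re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
    re.compile(
        r"\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.? \d{1,2}, \d{4}\b"
    ),
]

def weak_label(text):
    """Return (token, label) pairs; label is 1 if the token overlaps any regex match."""
    spans = [m.span() for p in DATE_PATTERNS for m in p.finditer(text)]
    labels = []
    for tok in re.finditer(r"\S+", text):
        start, end = tok.span()
        # Two half-open intervals overlap when each starts before the other ends.
        hit = any(start < s_end and s_start < end for s_start, s_end in spans)
        labels.append((tok.group(), int(hit)))
    return labels
```

Labels produced this way are noisy by construction, which is why the framework described above follows them with a small amount of expert-labeled data.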