Sajjadur Rahman
2022
Low-resource Entity Set Expansion: A Comprehensive Study on User-generated Text
Yutong Shao | Nikita Bhutani | Sajjadur Rahman | Estevam Hruschka
Findings of the Association for Computational Linguistics: NAACL 2022
Entity set expansion (ESE) aims at obtaining a more complete set of entities given a textual corpus and a seed set of entities of a concept. Although it is a critical task in many NLP applications, existing benchmarks are limited to well-formed text (e.g., Wikipedia) and well-defined concepts (e.g., countries and diseases). Furthermore, only a small number of predictions are evaluated compared to the actual size of an entity set. A rigorous assessment of ESE methods warrants more comprehensive benchmarks and evaluation. In this paper, we consider user-generated text to understand the generalizability of ESE methods. We develop new benchmarks and propose more rigorous evaluation metrics for assessing the performance of ESE methods. Additionally, we identify phenomena such as non-named entities, multifaceted entities, and vague concepts that are more prevalent in user-generated text than in well-formed text, and use them to profile ESE methods. We observe that the strong performance of state-of-the-art ESE methods does not generalize well to user-generated text. We conduct a comprehensive empirical analysis and draw insights from the findings.
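To make the ESE setup concrete: a method receives seed entities and a candidate pool, ranks candidates, and should be scored against the full gold set rather than only its top few predictions. The toy expander and metric below are hypothetical illustrations of that setup, not the benchmarks, methods, or metrics proposed in the paper.

```python
# Illustrative sketch only: a toy embedding-based expander and an average-precision
# score computed against the *full* gold entity set. All names and the scoring
# scheme are hypothetical, not the paper's methods or metrics.
from typing import Dict, List
import numpy as np


def expand(seeds: List[str], candidates: List[str],
           emb: Dict[str, np.ndarray], k: int = 10) -> List[str]:
    """Rank candidate entities by cosine similarity to the seed centroid."""
    centroid = np.mean([emb[s] for s in seeds], axis=0)
    centroid /= np.linalg.norm(centroid)

    def score(entity: str) -> float:
        v = emb[entity]
        return float(np.dot(v, centroid) / np.linalg.norm(v))

    ranked = sorted((c for c in candidates if c not in seeds),
                    key=score, reverse=True)
    return ranked[:k]


def average_precision(predictions: List[str], gold: set) -> float:
    """Average precision against the full gold entity set, so gold entities
    that were never retrieved still lower the score."""
    hits, precision_sum = 0, 0.0
    for rank, entity in enumerate(predictions, start=1):
        if entity in gold:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / max(len(gold), 1)
```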
Low-resource Interactive Active Labeling for Fine-tuning Language Models
Seiji Maekawa | Dan Zhang | Hannah Kim | Sajjadur Rahman | Estevam Hruschka
Findings of the Association for Computational Linguistics: EMNLP 2022
Recently, active learning (AL) methods have been used to effectively fine-tune pre-trained language models for various NLP tasks such as sentiment analysis and document classification. However, for the task of fine-tuning language models, the impact on AL methods of factors such as labeling cost, sample acquisition latency, and dataset diversity warrants a deeper investigation. This paper examines the performance of existing AL methods in a low-resource, interactive labeling setting. We observe that existing methods often underperform in such a setting while exhibiting higher latency and a lack of generalizability. To overcome these challenges, we propose TYROUGE, a novel active learning method that employs a hybrid sampling strategy to minimize labeling cost and acquisition latency while providing a framework for adapting to dataset diversity via user guidance. Through our experiments, we observe that, compared to SOTA methods, TYROUGE reduces the labeling cost by up to 43% and the acquisition latency by as much as 11X, while achieving comparable accuracy. Finally, we discuss the strengths and weaknesses of TYROUGE by exploring the impact of dataset characteristics.
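As a rough illustration of what a hybrid sampling strategy can look like in general (an uncertainty filter followed by a diversity selection), here is a schematic acquisition step. It is not TYROUGE's actual algorithm; every function name and parameter below is an assumption made for illustration.

```python
# Generic hybrid-sampling AL acquisition step: shrink the unlabeled pool with a
# cheap uncertainty filter, then pick a diverse batch from the shortlist.
# This is NOT TYROUGE's acquisition strategy; all names here are hypothetical.
import numpy as np


def hybrid_acquire(probs: np.ndarray, embeddings: np.ndarray,
                   batch_size: int, pool_factor: int = 5) -> list:
    """Pick `batch_size` unlabeled examples to send to the annotator.

    probs:      (n, num_classes) model probabilities on the unlabeled pool
    embeddings: (n, d) representations of the same pool
    """
    # 1) Uncertainty: keep the most entropic candidates. Shrinking the pool
    #    early keeps acquisition latency low.
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    pool = np.argsort(-entropy)[: batch_size * pool_factor]

    # 2) Diversity: greedy farthest-point selection inside the shortlist,
    #    so the labeled batch is not redundant.
    selected = [int(pool[0])]
    while len(selected) < min(batch_size, len(pool)):
        dists = np.min(
            np.linalg.norm(embeddings[pool][:, None, :] -
                           embeddings[selected][None, :, :], axis=-1),
            axis=1)
        selected.append(int(pool[int(np.argmax(dists))]))
    return selected
```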
2021
Towards integrated, interactive, and extensible text data analytics with Leam
Peter Griggs | Cagatay Demiralp | Sajjadur Rahman
Proceedings of the Second Workshop on Data Science with Human in the Loop: Language Advances
From tweets to product reviews, text is ubiquitous on the web and often contains valuable information for both enterprises and consumers. However, online text is generally noisy and incomplete, requiring users to process and analyze the data to extract insights. While effective systems exist for different stages of text analysis, users lack extensible platforms that support interactive text analysis workflows end-to-end. To facilitate integrated text analytics, we introduce LEAM, which aims at combining the strengths of spreadsheets, computational notebooks, and interactive visualizations. LEAM supports interactive analysis via GUI-based interactions and provides a declarative specification language, implemented over a visual text algebra, to enable user-guided analysis. We evaluate LEAM through two case studies based on popular Kaggle text analytics workflows to understand the strengths and weaknesses of the system.
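The general idea of driving the same analysis from GUI interactions and a declarative specification can be sketched generically as follows. This is not LEAM's actual specification language or visual text algebra; every class, operator, and parameter name below is hypothetical.

```python
# Schematic sketch of recording text-analytics steps as a declarative spec that
# both a GUI and typed commands could append to. NOT LEAM's actual API.
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Op:
    """One declarative step, e.g. produced by a GUI click or a typed command."""
    name: str                      # e.g. "clean", "featurize", "visualize"
    params: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Spec:
    """An ordered pipeline of operators over a text column."""
    column: str
    ops: List[Op] = field(default_factory=list)

    def add(self, name: str, **params: Any) -> "Spec":
        self.ops.append(Op(name, params))
        return self


# A GUI interaction ("remove punctuation", "plot top terms") and a typed command
# could both append the same operators to one shared spec:
spec = (Spec(column="review_text")
        .add("clean", lowercase=True, strip_punct=True)
        .add("featurize", method="tfidf", max_features=1000)
        .add("visualize", kind="bar", top_k=20))
```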