Ruler: Data Programming by Demonstration for Document Labeling

Sara Evensen, Chang Ge, Cagatay Demiralp


Abstract
Data programming aims to reduce the cost of curating training data by encoding domain knowledge as labeling functions over source data. As such, it requires not only domain expertise but also programming experience, a skill that many subject matter experts lack. Additionally, generating functions by enumerating rules is not only time-consuming but also inherently difficult, even for people with programming experience. In this paper, we introduce Ruler, an interactive system that synthesizes labeling rules using span-level interactive demonstrations over document examples. Ruler is a first-of-a-kind implementation of data programming by demonstration (DPBD). This new framework aims to relieve users of the burden of writing labeling functions, enabling them to focus on higher-level semantic analysis, such as identifying relevant signals for the labeling task. We compare Ruler with conventional data programming through a user study conducted with 10 data scientists who were asked to create labeling functions for sentiment and spam classification tasks. Results show that Ruler is easier to learn and to use, and that it offers higher overall user satisfaction while providing model performance comparable to that achieved by conventional data programming.
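To make the term "labeling function" concrete, the following is a minimal sketch of what a user of conventional data programming might write by hand for a spam classification task. It is an illustration only and is not taken from the paper: the function names, the keyword heuristics, the label constants, and the majority-vote combiner are all assumptions. In practice, data programming frameworks typically combine labeling function outputs with a learned label model rather than a plain vote, and Ruler's synthesized rules may look quite different.

```python
from collections import Counter

# Hypothetical label constants, chosen for this illustration only.
ABSTAIN, HAM, SPAM = -1, 0, 1


def lf_contains_link(text: str) -> int:
    """Vote SPAM if the text contains a URL-like token, otherwise abstain."""
    return SPAM if "http" in text.lower() else ABSTAIN


def lf_mentions_subscribe(text: str) -> int:
    """Vote SPAM if the text asks the reader to subscribe, otherwise abstain."""
    return SPAM if "subscribe" in text.lower() else ABSTAIN


def lf_short_comment(text: str) -> int:
    """Vote HAM for comments shorter than five words, otherwise abstain."""
    return HAM if len(text.split()) < 5 else ABSTAIN


# Hypothetical rule set; each function encodes one piece of domain knowledge.
LABELING_FUNCTIONS = [lf_contains_link, lf_mentions_subscribe, lf_short_comment]


def weak_label(text: str) -> int:
    """Combine labeling function votes by simple majority.

    Majority voting is a stand-in (an assumption of this sketch) for the
    learned label model normally used to aggregate noisy votes.
    Returns ABSTAIN when no function votes or when the top votes tie.
    """
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]
```

Even in this toy form, each rule has to be thought up, coded, and debugged individually, which is the burden the abstract says DPBD aims to lift by letting users demonstrate labels on spans instead.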
Anthology ID:
2020.findings-emnlp.181
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2020
Month:
November
Year:
2020
Address:
Online
Editors:
Trevor Cohn, Yulan He, Yang Liu
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
1996–2005
URL:
https://aclanthology.org/2020.findings-emnlp.181
DOI:
10.18653/v1/2020.findings-emnlp.181
Cite (ACL):
Sara Evensen, Chang Ge, and Cagatay Demiralp. 2020. Ruler: Data Programming by Demonstration for Document Labeling. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1996–2005, Online. Association for Computational Linguistics.
Cite (Informal):
Ruler: Data Programming by Demonstration for Document Labeling (Evensen et al., Findings 2020)
PDF:
https://preview.aclanthology.org/emnlp-22-attachments/2020.findings-emnlp.181.pdf