Shomir Wilson


STAPI: An Automatic Scraper for Extracting Iterative Title-Text Structure from Web Documents
Nan Zhang | Shomir Wilson | Prasenjit Mitra
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Formal documents often are organized into sections of text, each with a title, and extracting this structure remains an under-explored aspect of natural language processing. This iterative title-text structure is valuable data for building models for headline generation and section title generation, but there is no corpus that contains web documents annotated with titles and prose texts. Therefore, we propose the first title-text dataset on web documents that incorporates a wide variety of domains to facilitate downstream training. We also introduce STAPI (Section Title And Prose text Identifier), a two-step system for labeling section titles and prose text in HTML documents. To filter out unrelated content like document footers, its first step involves a filter that reads HTML documents and proposes a set of textual candidates. In the second step, a typographic classifier takes the candidates from the filter and categorizes each one into one of the three pre-defined classes (title, prose text, and miscellany). We show that STAPI significantly outperforms two baseline models in terms of title-text identification. We release our dataset along with a web application to facilitate supervised and semi-supervised training in this domain.

A Tale of Two Regulatory Regimes: Creation and Analysis of a Bilingual Privacy Policy Corpus
Siddhant Arora | Henry Hosseini | Christine Utz | Vinayshekhar Bannihatti Kumar | Tristan Dhellemmes | Abhilasha Ravichander | Peter Story | Jasmine Mangat | Rex Chen | Martin Degeling | Thomas Norton | Thomas Hupperich | Shomir Wilson | Norman Sadeh
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Over the past decade, researchers have started to explore the use of NLP to develop tools aimed at helping the public, vendors, and regulators analyze disclosures made in privacy policies. With the introduction of new privacy regulations, the language of privacy policies is also evolving, and disclosures made by the same organization are not always the same in different languages, especially when used to communicate with users who fall under different jurisdictions. This work explores the use of language technologies to capture and analyze these differences at scale. We introduce an annotation scheme designed to capture the nuances of two new landmark privacy regulations, namely the EU’s GDPR and California’s CCPA/CPRA. We then introduce the first bilingual corpus of mobile app privacy policies consisting of 64 privacy policies in English (292K words) and 91 privacy policies in German (478K words), respectively with manual annotations for 8K and 19K fine-grained data practices. The annotations are used to develop computational methods that can automatically extract “disclosures” from privacy policies. Analysis of a subset of 59 “semi-parallel” policies reveals differences that can be attributed to different regulatory regimes, suggesting that systematic analysis of policies using automated language technologies is indeed a worthwhile endeavor.

A Study of Implicit Bias in Pretrained Language Models against People with Disabilities
Pranav Narayanan Venkit | Mukund Srinath | Shomir Wilson
Proceedings of the 29th International Conference on Computational Linguistics

Pretrained language models (PLMs) have been shown to exhibit sociodemographic biases, such as against gender and race, raising concerns of downstream biases in language technologies. However, PLMs’ biases against people with disabilities (PWDs) have received little attention, in spite of their potential to cause similar harms. Using perturbation sensitivity analysis, we test an assortment of popular word embedding-based and transformer-based PLMs and show significant biases against PWDs in all of them. The results demonstrate how models trained on large corpora widely favor ableist language.


Breaking Down Walls of Text: How Can NLP Benefit Consumer Privacy?
Abhilasha Ravichander | Alan W Black | Thomas Norton | Shomir Wilson | Norman Sadeh
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Privacy plays a crucial role in preserving democratic ideals and personal autonomy. The dominant legal approach to privacy in many jurisdictions is the “Notice and Choice” paradigm, where privacy policies are the primary instrument used to convey information to users. However, privacy policies are long and complex documents that are difficult for users to read and comprehend. We discuss how language technologies can play an important role in addressing this information gap, reporting on initial progress towards helping three specific categories of stakeholders take advantage of digital privacy policies: consumers, enterprises, and regulators. Our goal is to provide a roadmap for the development and use of language technologies to empower users to reclaim control over their privacy, limit privacy harms, and rally research efforts from the community towards addressing an issue with large social impact. We highlight many remaining opportunities to develop language technologies that are more precise or nuanced in the way in which they use the text of privacy policies.

Privacy at Scale: Introducing the PrivaSeer Corpus of Web Privacy Policies
Mukund Srinath | Shomir Wilson | C Lee Giles
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Organisations disclose their privacy practices by posting privacy policies on their websites. Even though internet users often care about their digital privacy, they usually do not read privacy policies, since understanding them requires a significant investment of time and effort. Natural language processing has been used to create experimental tools to interpret privacy policies, but there has been a lack of large privacy policy corpora to facilitate the creation of large-scale semi-supervised and unsupervised models to interpret and simplify privacy policies. Thus, we present the PrivaSeer Corpus of 1,005,380 English language website privacy policies collected from the web. The number of unique websites represented in PrivaSeer is about ten times larger than the next largest public collection of web privacy policies, and it surpasses the aggregate of unique websites represented in all other publicly available privacy policy corpora combined. We describe a corpus creation pipeline with stages that include a web crawler, language detection, document classification, duplicate and near-duplicate removal, and content extraction. We employ an unsupervised topic modelling approach to investigate the contents of policy documents in the corpus and discuss the distribution of topics in privacy policies at web scale. We further investigate the relationship between privacy policy domain PageRanks and text features of the privacy policies. Finally, we use the corpus to pretrain PrivBERT, a transformer-based privacy policy language model, and obtain state of the art results on the data practice classification and question answering tasks.


Question Answering for Privacy Policies: Combining Computational and Legal Perspectives
Abhilasha Ravichander | Alan W Black | Shomir Wilson | Thomas Norton | Norman Sadeh
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Privacy policies are long and complex documents that are difficult for users to read and understand. Yet, they have legal effects on how user data can be collected, managed and used. Ideally, we would like to empower users to inform themselves about the issues that matter to them, and enable them to selectively explore these issues. We present PrivacyQA, a corpus consisting of 1750 questions about the privacy policies of mobile applications, and over 3500 expert annotations of relevant answers. We observe that a strong neural baseline underperforms human performance by almost 0.3 F1 on PrivacyQA, suggesting considerable room for improvement for future systems. Further, we use this dataset to categorically identify challenges to question answerability, with domain-general implications for any question answering system. The PrivacyQA corpus offers a challenging corpus for question answering, with genuine real world utility.


Supervised and Unsupervised Methods for Robust Separation of Section Titles and Prose Text in Web Documents
Abhijith Athreya Mysore Gopinath | Shomir Wilson | Norman Sadeh
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

The text in many web documents is organized into a hierarchy of section titles and corresponding prose content, a structure which provides potentially exploitable information on discourse structure and topicality. However, this organization is generally discarded during text collection, and collecting it is not straightforward: the same visual organization can be implemented in a myriad of different ways in the underlying HTML. To remedy this, we present a flexible system for automatically extracting the hierarchical section titles and prose organization of web documents irrespective of differences in HTML representation. This system uses features from syntax, semantics, discourse and markup to build two models which classify HTML text into section titles and prose text. When tested on three different domains of web text, our domain-independent system achieves an overall precision of 0.82 and a recall of 0.98. The domain-dependent variation produces very high precision (0.99) at the expense of recall (0.75). These results exhibit a robust level of accuracy suitable for enhancing question answering, information extraction, and summarization.


Identifying the Provision of Choices in Privacy Policy Text
Kanthashree Mysore Sathyendra | Shomir Wilson | Florian Schaub | Sebastian Zimmeck | Norman Sadeh
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Websites’ and mobile apps’ privacy policies, written in natural language, tend to be long and difficult to understand. Information privacy revolves around the fundamental principle of Notice and choice, namely the idea that users should be able to make informed decisions about what information about them can be collected and how it can be used. Internet users want control over their privacy, but their choices are often hidden in long and convoluted privacy policy texts. Moreover, little (if any) prior work has been done to detect the provision of choices in text. We address this challenge of enabling user choice by automatically identifying and extracting pertinent choice language in privacy policies. In particular, we present a two-stage architecture of classification models to identify opt-out choices in privacy policy text, labelling common varieties of choices with a mean F1 score of 0.735. Our techniques enable the creation of systems to help Internet users to learn about their choices, thereby effectuating notice and choice and improving Internet privacy.


The Creation and Analysis of a Website Privacy Policy Corpus
Shomir Wilson | Florian Schaub | Aswarth Abhilash Dara | Frederick Liu | Sushain Cherivirala | Pedro Giovanni Leon | Mads Schaarup Andersen | Sebastian Zimmeck | Kanthashree Mysore Sathyendra | N. Cameron Russell | Thomas B. Norton | Eduard Hovy | Joel Reidenberg | Norman Sadeh
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

This Table is Different: A WordNet-Based Approach to Identifying References to Document Entities
Shomir Wilson | Alan Black | Jon Oberlander
Proceedings of the 8th Global WordNet Conference (GWC)

Writing intended to inform frequently contains references to document entities (DEs), a mixed class that includes orthographically structured items (e.g., illustrations, sections, lists) and discourse entities (arguments, suggestions, points). Such references are vital to the interpretation of documents, but they often eschew identifiers such as “Figure 1” for inexplicit phrases like “in this figure” or “from these premises”. We examine inexplicit references to DEs, termed DE references, and recast the problem of their automatic detection into the determination of relevant word senses. We then show the feasibility of machine learning for the detection of DE-relevant word senses, using a corpus of human-labeled synsets from WordNet. We test cross-domain performance by gathering lemmas and synsets from three corpora: website privacy policies, Wikipedia articles, and Wikibooks textbooks. Identifying DE references will enable language technologies to use the information encoded by them, permitting the automatic generation of finely-tuned descriptions of DEs and the presentation of richly-structured information to readers.


Determiner-Established Deixis to Communicative Artifacts in Pedagogical Text
Shomir Wilson | Jon Oberlander
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)


Toward Automatic Processing of English Metalanguage
Shomir Wilson
Proceedings of the Sixth International Joint Conference on Natural Language Processing


The Creation of a Corpus of English Metalanguage
Shomir Wilson
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)


Distinguishing Use and Mention in Natural Language
Shomir Wilson
Proceedings of the NAACL HLT 2010 Student Research Workshop