Daniel Preoţiuc-Pietro

Also published as: Daniel Preotiuc-Pietro


2021

Identifying Named Entities as they are Typed
Ravneet Arora | Chen-Tse Tsai | Daniel Preotiuc-Pietro
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Identifying named entities in written text is an essential component of the text processing pipeline used in applications such as text editors to gain a better understanding of the semantics of the text. However, the typical experimental setup for evaluating Named Entity Recognition (NER) systems is not directly applicable to systems that process text in real time as the text is being typed. Evaluation is performed on a sentence level assuming the end-user is willing to wait until the entire sentence is typed for entities to be identified and further linked to identifiers or co-referenced. We introduce a novel experimental setup for NER systems for applications where decisions about named entity boundaries need to be performed in an online fashion. We study how state-of-the-art methods perform under this setup in multiple languages and propose adaptations to these models to suit this new experimental setup. Experimental results show that the best systems that are evaluated on each token after it is typed reach performance within 1–5 F1 points of systems that are evaluated at the end of the sentence. These results show that entity recognition can be performed in this setup and open up the development of other NLP tools in a similar setup.

Proceedings of the Natural Legal Language Processing Workshop 2021
Nikolaos Aletras | Ion Androutsopoulos | Leslie Barrett | Catalina Goanta | Daniel Preotiuc-Pietro
Proceedings of the Natural Legal Language Processing Workshop 2021

2020

Point-of-Interest Type Inference from Social Media Text
Danae Sánchez Villegas | Daniel Preotiuc-Pietro | Nikolaos Aletras
Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing

Physical places help shape how we perceive the experiences we have there. We study the relationship between social media text and the type of the place from where it was posted, whether a park, restaurant, or someplace else. To facilitate this, we introduce a novel data set of ~200,000 English tweets published from 2,761 different points-of-interest in the U.S., enriched with place type information. We train classifiers to predict the type of the location a tweet was sent from that reach a macro F1 of 43.67 across eight classes and uncover the linguistic markers associated with each type of place. The ability to predict semantic place information from a tweet has applications in recommendation systems, personalization services and cultural geography.

Fact vs. Opinion: the Role of Argumentation Features in News Classification
Tariq Alhindi | Smaranda Muresan | Daniel Preotiuc-Pietro
Proceedings of the 28th International Conference on Computational Linguistics

A 2018 study led by the Media Insight Project showed that most journalists think that a clear marking of what is news reporting and what is commentary or opinion (e.g., editorial, op-ed) is essential for gaining public trust. We present an approach to classify news articles into news stories (i.e., reporting of factual information) and opinion pieces using models that aim to supplement the article content representation with argumentation features. Our hypothesis is that the nature of argumentative discourse is important in distinguishing between news stories and opinion articles. We show that argumentation features outperform linguistic features used previously and improve on fine-tuned transformer-based models when tested on data from publishers unseen in training. Automatically flagging opinion pieces vs. news stories can aid applications such as fact-checking or event extraction.

Analyzing Political Parody in Social Media
Antonis Maronikolakis | Danae Sánchez Villegas | Daniel Preotiuc-Pietro | Nikolaos Aletras
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Parody is a figurative device used to imitate an entity for comedic or critical purposes and represents a widespread phenomenon in social media through many popular parody accounts. In this paper, we present the first computational study of parody. We introduce a new publicly available data set of tweets from real politicians and their corresponding parody accounts. We run a battery of supervised machine learning models for automatically detecting parody tweets with an emphasis on robustness by testing on tweets from accounts unseen in training, across different genders and across countries. Our results show that political parody tweets can be predicted with an accuracy up to 90%. Finally, we identify the markers of parody through a linguistic analysis. Beyond research in linguistics and political communication, accurately and automatically detecting parody is important to improving fact checking for journalists and analytics such as sentiment analysis through filtering out parodical utterances.

Temporally-Informed Analysis of Named Entity Recognition
Shruti Rijhwani | Daniel Preotiuc-Pietro
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Natural language processing models often have to make predictions on text data that evolves over time as a result of changes in language use or the information described in the text. However, evaluation results on existing data sets are seldom reported by taking the timestamp of the document into account. We analyze and propose methods that make better use of temporally-diverse training data, with a focus on the task of named entity recognition. To support these experiments, we introduce a novel data set of English tweets annotated with named entities. We empirically demonstrate the effect of temporal drift on performance, and how the temporal information of documents can be used to obtain better models compared to those that disregard temporal information. Our analysis gives insights into why this information is useful, in the hope of informing potential avenues of improvement for named entity recognition as well as other NLP tasks under similar experimental setups.

Multi-Domain Named Entity Recognition with Genre-Aware and Agnostic Inference
Jing Wang | Mayank Kulkarni | Daniel Preotiuc-Pietro
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Named entity recognition is a key component of many text processing pipelines and it is thus essential for this component to be robust to different types of input. However, domain transfer of NER models with data from multiple genres has not been widely studied. To this end, we conduct NER experiments in three predictive setups on data from: a) multiple domains; b) multiple domains where the genre label is unknown at inference time; c) domains not encountered in training. We introduce a new architecture tailored to this task by using shared and private domain parameters and multi-task learning. This consistently outperforms all other baseline and competitive methods on all three experimental setups, with differences ranging from +1.95 to +3.11 average F1 across multiple genres when compared to standard approaches. These results illustrate the challenges that need to be taken into account when building real-world NLP applications that are robust to various types of text and the methods that can help, at least partially, alleviate these issues.

2019

Multi-task Pairwise Neural Ranking for Hashtag Segmentation
Mounica Maddela | Wei Xu | Daniel Preoţiuc-Pietro
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Hashtags are often employed on social media and beyond to add metadata to a textual utterance with the goal of increasing discoverability, aiding search, or providing additional semantics. However, the semantic content of hashtags is not straightforward to infer as these represent ad-hoc conventions which frequently include multiple words joined together and can include abbreviations and unorthodox spellings. We build a dataset of 12,594 hashtags split into individual segments and propose a set of approaches for hashtag segmentation by framing it as a pairwise ranking problem between candidate segmentations. Our novel neural approaches demonstrate 24.6% error reduction in hashtag segmentation accuracy compared to the current state-of-the-art method. Finally, we demonstrate that a deeper understanding of hashtag semantics obtained through segmentation is useful for downstream applications such as sentiment analysis, for which we achieved a 2.6% increase in average recall on the SemEval 2017 sentiment analysis dataset.

Categorizing and Inferring the Relationship between the Text and Image of Twitter Posts
Alakananda Vempala | Daniel Preoţiuc-Pietro
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Text in social media posts is frequently accompanied by images in order to provide content, supply context, or to express feelings. This paper studies how the meaning of the entire tweet is composed through the relationship between its textual content and its image. We build and release a data set of image tweets annotated with four classes which express whether the text or the image provides additional information to the other modality. We show that by combining the text and image information, we can build a machine learning approach that accurately distinguishes between the relationship types. Further, we derive insights into how these relationships are materialized through text and image content analysis and how they are impacted by user demographic traits. These methods can be used in several downstream applications including pre-training image tagging models, collecting distantly supervised data for image captioning, and can be directly used in end-user applications to optimize screen estate.

Analyzing Linguistic Differences between Owner and Staff Attributed Tweets
Daniel Preoţiuc-Pietro | Rita Devlin Marier
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Research on social media has to date assumed that all posts from an account are authored by the same person. In this study, we challenge this assumption and study the linguistic differences between posts signed by the account owner or attributed to their staff. We introduce a novel data set of tweets posted by U.S. politicians who self-reported their tweets using a signature. We analyze the linguistic topics and style features that distinguish the two types of tweets. Predictive results show that we are able to predict owner and staff attributed tweets with good accuracy, even when not using any training data from that account.

Automatically Identifying Complaints in Social Media
Daniel Preoţiuc-Pietro | Mihaela Gaman | Nikolaos Aletras
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Complaining is a basic speech act regularly used in human and computer mediated communication to express a negative mismatch between reality and expectations in a particular situation. Automatically identifying complaints in social media is of utmost importance for organizations or brands to improve the customer experience or in developing dialogue systems for handling and responding to complaints. In this paper, we introduce the first systematic analysis of complaints in computational linguistics. We collect a new annotated data set of written complaints expressed on Twitter. We present an extensive linguistic analysis of complaining as a speech act in social media and train strong feature-based and neural models of complaints across nine domains achieving a predictive performance of up to 79 F1 using distant supervision.

Proceedings of the Natural Legal Language Processing Workshop 2019
Nikolaos Aletras | Elliott Ash | Leslie Barrett | Daniel Chen | Adam Meyers | Daniel Preotiuc-Pietro | David Rosenberg | Amanda Stent
Proceedings of the Natural Legal Language Processing Workshop 2019

2018

The Remarkable Benefit of User-Level Aggregation for Lexical-based Population-Level Predictions
Salvatore Giorgi | Daniel Preoţiuc-Pietro | Anneke Buffone | Daniel Rieman | Lyle Ungar | H. Andrew Schwartz
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Nowcasting based on social media text promises to provide unobtrusive and near real-time predictions of community-level outcomes. These outcomes are typically regarding people, but the data is often aggregated without regard to users in the Twitter populations of each community. This paper describes a simple yet effective method for building community-level models using Twitter language aggregated by user. Results on four different U.S. county-level tasks, spanning demographic, health, and psychological outcomes, show large and consistent improvements in prediction accuracies (e.g. from Pearson r=.73 to .82 for median income prediction or r=.37 to .47 for life satisfaction prediction) over the standard approach of aggregating all tweets. We make our aggregated and anonymized community-level data, derived from 37 billion tweets (over 1 billion of which were mapped to counties), available for research.

Why Swear? Analyzing and Inferring the Intentions of Vulgar Expressions
Eric Holgate | Isabel Cachola | Daniel Preoţiuc-Pietro | Junyi Jessy Li
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Vulgar words are employed in language use for several different functions, ranging from expressing aggression to signaling group identity or the informality of the communication. This versatility of usage of a restricted set of words is challenging for downstream applications and has yet to be studied quantitatively or using natural language processing techniques. We introduce a novel data set of 7,800 tweets from users with known demographic traits where all instances of vulgar words are annotated with one of the six categories of vulgar word use. Using this data set, we present the first analysis of the pragmatic aspects of vulgarity and how they relate to social factors. We build a model able to predict the category of a vulgar word based on the immediate context it appears in with 67.4 macro F1 across six classes. Finally, we demonstrate the utility of modeling the type of vulgar word use in context by using this information to achieve state-of-the-art performance in hate speech detection on a benchmark data set.

User-Level Race and Ethnicity Predictors from Twitter Text
Daniel Preoţiuc-Pietro | Lyle Ungar
Proceedings of the 27th International Conference on Computational Linguistics

User demographic inference from social media text has the potential to improve a range of downstream applications, including real-time passive polling or quantifying demographic bias. This study focuses on developing models for user-level race and ethnicity prediction. We introduce a data set of users who self-report their race/ethnicity through a survey, in contrast to previous approaches that use distantly supervised data or perceived labels. We develop predictive models from text which accurately predict the membership of a user to the four largest racial and ethnic groups with up to .884 AUC and make these available to the research community.

Expressively vulgar: The socio-dynamics of vulgarity and its effects on sentiment analysis in social media
Isabel Cachola | Eric Holgate | Daniel Preoţiuc-Pietro | Junyi Jessy Li
Proceedings of the 27th International Conference on Computational Linguistics

Vulgarity is a common linguistic expression and is used to perform several linguistic functions. Understanding its usage can aid the study of both linguistic and psychological phenomena as well as benefit downstream natural language processing applications such as sentiment analysis. This study performs a large-scale, data-driven empirical analysis of vulgar words using social media data. We analyze the socio-cultural and pragmatic aspects of vulgarity using tweets from users with known demographics. Further, we collect sentiment ratings for vulgar tweets to study the relationship between the use of vulgar words and perceived sentiment and show that explicitly modeling vulgar words can boost sentiment analysis performance.

2017

Controlling Human Perception of Basic User Traits
Daniel Preoţiuc-Pietro | Sharath Chandra Guntuku | Lyle Ungar
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Much of our online communication is text-mediated and, lately, more common with automated agents. Unlike interacting with humans, these agents currently do not tailor their language to the type of person they are communicating with. In this pilot study, we measure the extent to which human perception of basic user trait information – gender and age – is controllable through text. Using automatic models of gender and age prediction, we estimate which tweets posted by a user are more likely to mis-characterize their traits. We perform multiple controlled crowdsourcing experiments in which we show that we can reduce the human prediction accuracy of gender to almost random – an over 20% drop in accuracy. Our experiments show that it is practically feasible for multiple applications such as text generation, text summarization or machine translation to be tailored to specific traits and perceived as such.

Beyond Binary Labels: Political Ideology Prediction of Twitter Users
Daniel Preoţiuc-Pietro | Ye Liu | Daniel Hopkins | Lyle Ungar
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Automatic political orientation prediction from social media posts has to date proven successful only in distinguishing between publicly declared liberals and conservatives in the US. This study examines users’ political ideology using a seven-point scale which enables us to identify politically moderate and neutral users – groups which are of particular interest to political scientists and pollsters. Using a novel data set with political ideology labels self-reported through surveys, our goal is two-fold: a) to characterize the groups of politically engaged users through language use on Twitter; b) to build a fine-grained model that predicts political ideology of unseen users. Our results identify differences in both political leaning and engagement and the extent to which each group tweets using political keywords. Finally, we demonstrate how to improve ideology prediction accuracy by exploiting the relationships between the user groups.

Personality Driven Differences in Paraphrase Preference
Daniel Preoţiuc-Pietro | Jordan Carpenter | Lyle Ungar
Proceedings of the Second Workshop on NLP and Computational Social Science

Personality plays a decisive role in how people behave in different scenarios, including online social media. Researchers have used such data to study how personality can be predicted from language use. In this paper, we study phrase choice as a particular stylistic linguistic difference, as opposed to the mostly topical differences identified previously. Building on previous work on demographic preferences, we quantify differences in paraphrase choice from a massive Facebook data set with posts from over 115,000 users. We quantify the predictive power of phrase choice in user profiling and use phrase choice to study psycholinguistic hypotheses. This work is relevant to future applications that aim to personalize text generation to specific personality types.

Predicting Emotional Word Ratings using Distributional Representations and Signed Clustering
João Sedoc | Daniel Preoţiuc-Pietro | Lyle Ungar
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

Inferring the emotional content of words is important for text-based sentiment analysis, dialogue systems and psycholinguistics, but word ratings are expensive to collect at scale and across languages or domains. We develop a method that automatically extends word-level ratings to unrated words using signed clustering of vector space word representations along with affect ratings. We use our method to determine a word’s valence and arousal, which determine its position on the circumplex model of affect, the most popular dimensional model of emotion. Our method achieves superior out-of-sample word rating prediction on both affective dimensions across three different languages when compared to state-of-the-art word similarity based methods. Our method can assist building word ratings for new languages and improve downstream tasks such as sentiment analysis and emotion detection.

2016

Modelling Valence and Arousal in Facebook posts
Daniel Preoţiuc-Pietro | H. Andrew Schwartz | Gregory Park | Johannes Eichstaedt | Margaret Kern | Lyle Ungar | Elisabeth Shulman
Proceedings of the 7th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

Analyzing Biases in Human Perception of User Age and Gender from Text
Lucie Flekova | Jordan Carpenter | Salvatore Giorgi | Lyle Ungar | Daniel Preoţiuc-Pietro
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Exploring Stylistic Variation with Age and Income on Twitter
Lucie Flekova | Daniel Preoţiuc-Pietro | Lyle Ungar
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

An Empirical Exploration of Moral Foundations Theory in Partisan News Sources
Dean Fulgoni | Jordan Carpenter | Lyle Ungar | Daniel Preoţiuc-Pietro
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

News sources frame issues in different ways in order to appeal to or control the perception of their readers. We present a large-scale study of news articles from partisan sources in the US across a variety of different issues. We first highlight that differences between sides exist by predicting the political leaning of articles of unseen political bias. Framing can be driven by different types of morality that each group values. We emphasize differences in framing of different news building on the moral foundations theory quantified using hand-crafted lexicons. Our results show that partisan sources frame political issues differently both in terms of word usage and through the moral foundations they relate to.

Studying the Temporal Dynamics of Word Co-occurrences: An Application to Event Detection
Daniel Preoţiuc-Pietro | P. K. Srijith | Mark Hepple | Trevor Cohn
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Streaming media provides a number of unique challenges for computational linguistics. This paper studies the temporal variation in word co-occurrence statistics, with application to event detection. We develop a spectral clustering approach to find groups of mutually informative terms occurring in discrete time frames. Experiments on large datasets of tweets show that these groups identify key real world events as they occur in time, despite no explicit supervision. The performance of our method rivals state-of-the-art methods for event detection on F-score, obtaining higher recall at the expense of precision.

2015

The role of personality, age, and gender in tweeting about mental illness
Daniel Preoţiuc-Pietro | Johannes Eichstaedt | Gregory Park | Maarten Sap | Laura Smith | Victoria Tobolsky | H. Andrew Schwartz | Lyle Ungar
Proceedings of the 2nd Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality

Mental Illness Detection at the World Well-Being Project for the CLPsych 2015 Shared Task
Daniel Preoţiuc-Pietro | Maarten Sap | H. Andrew Schwartz | Lyle Ungar
Proceedings of the 2nd Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality

Analysing domain suitability of a sentiment lexicon by identifying distributionally bipolar words
Lucie Flekova | Daniel Preoţiuc-Pietro | Eugen Ruppert
Proceedings of the 6th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

An analysis of the user occupational class through Twitter content
Daniel Preoţiuc-Pietro | Vasileios Lampos | Nikolaos Aletras
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

2014

Extracting Socioeconomic Patterns from the News: Modelling Text and Outlet Importance Jointly
Vasileios Lampos | Daniel Preoţiuc-Pietro | Sina Samangooei | Douwe Gelling | Trevor Cohn
Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science

Gaussian Processes for Natural Language Processing
Trevor Cohn | Daniel Preoţiuc-Pietro | Neil Lawrence
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: Tutorials

Predicting and Characterising User Impact on Twitter
Vasileios Lampos | Nikolaos Aletras | Daniel Preoţiuc-Pietro | Trevor Cohn
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics

2013

A temporal model of text periodicities using Gaussian Processes
Daniel Preoţiuc-Pietro | Trevor Cohn
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

A user-centric model of voting intention from Social Media
Vasileios Lampos | Daniel Preoţiuc-Pietro | Trevor Cohn
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2012

Unsupervised document zone identification using probabilistic graphical models
Andrea Varga | Daniel Preoţiuc-Pietro | Fabio Ciravegna
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Document zone identification aims to automatically classify sequences of text-spans (e.g. sentences) within a document into predefined zone categories. Current approaches to document zone identification mostly rely on supervised machine learning methods, which require a large amount of annotated data that is often difficult and expensive to obtain. In order to overcome this bottleneck, we propose graphical models based on the popular Latent Dirichlet Allocation (LDA) model. The first model, which we call zoneLDA, aims to cluster the sentences into zone classes using only unlabelled data. We also study an extension of zoneLDA called zoneLDAb, which makes a distinction between common words and non-common words within the different zone types. We present results on two different domains: the scientific domain and the technical domain. For the latter, we propose a new document zone classification schema, which has been annotated over a collection of 689 documents, achieving a Kappa score of 85%. Overall, our experiments show promising results for both domains, outperforming the baseline model. Furthermore, on the technical domain the performance of the models is comparable to the supervised approach using the same feature sets. We thus believe that graphical models are a promising avenue of research for automatic document zoning.