Proceedings of the 2017 EMNLP Workshop: Natural Language Processing meets Journalism

Octavian Popescu, Carlo Strapparava (Editors)

Anthology ID:
Copenhagen, Denmark
Association for Computational Linguistics
Bib Export formats:

pdf bib
Proceedings of the 2017 EMNLP Workshop: Natural Language Processing meets Journalism
Octavian Popescu | Carlo Strapparava

pdf bib
Predicting News Values from Headline Text and Emotions
Maria Pia di Buono | Jan Šnajder | Bojana Dalbelo Bašić | Goran Glavaš | Martin Tutek | Natasa Milic-Frayling

We present a preliminary study on predicting news values from headline text and emotions. We perform a multivariate analysis on a dataset manually annotated with news values and emotions, discovering interesting correlations among them. We then train two competitive machine learning models – an SVM and a CNN – to predict news values from headline text and emotions as features. We find that, while both models yield a satisfactory performance, some news values are more difficult to detect than others, while some profit more from including emotion information.

pdf bib
Predicting User Views in Online News
Daniel Hardt | Owen Rambow

We analyze user viewing behavior on an online news site. We collect data from 64,000 news articles, and use text features to predict frequency of user views. We compare predictiveness of the headline and “teaser” (viewed before clicking) and the body (viewed after clicking). Both are predictive of clicking behavior, with the full article text being most predictive.

Tracking Bias in News Sources Using Social Media: the Russia-Ukraine Maidan Crisis of 2013–2014
Peter Potash | Alexey Romanov | Mikhail Gronas | Anna Rumshisky | Mikhail Gronas

This paper addresses the task of identifying the bias in news articles published during a political or social conflict. We create a silver-standard corpus based on the actions of users in social media. Specifically, we reconceptualize bias in terms of how likely a given article is to be shared or liked by each of the opposing sides. We apply our methodology to a dataset of links collected in relation to the Russia-Ukraine Maidan crisis from 2013-2014. We show that on the task of predicting which side is likely to prefer a given article, a Naive Bayes classifier can record 90.3% accuracy looking only at domain names of the news sources. The best accuracy of 93.5% is achieved by a feed forward neural network. We also apply our methodology to gold-labeled set of articles annotated for bias, where the aforementioned Naive Bayes classifier records 82.6% accuracy and a feed-forward neural networks records 85.6% accuracy.

What to Write? A topic recommender for journalists
Alessandro Cucchiarelli | Christian Morbidoni | Giovanni Stilo | Paola Velardi

In this paper we present a recommender system, What To Write and Why, capable of suggesting to a journalist, for a given event, the aspects still uncovered in news articles on which the readers focus their interest. The basic idea is to characterize an event according to the echo it receives in online news sources and associate it with the corresponding readers’ communicative and informative patterns, detected through the analysis of Twitter and Wikipedia, respectively. Our methodology temporally aligns the results of this analysis and recommends the concepts that emerge as topics of interest from Twitter andWikipedia, either not covered or poorly covered in the published news articles.

Comparing Attitudes to Climate Change in the Media using sentiment analysis based on Latent Dirichlet Allocation
Ye Jiang | Xingyi Song | Jackie Harrison | Shaun Quegan | Diana Maynard

News media typically present biased accounts of news stories, and different publications present different angles on the same event. In this research, we investigate how different publications differ in their approach to stories about climate change, by examining the sentiment and topics presented. To understand these attitudes, we find sentiment targets by combining Latent Dirichlet Allocation (LDA) with SentiWordNet, a general sentiment lexicon. Using LDA, we generate topics containing keywords which represent the sentiment targets, and then annotate the data using SentiWordNet before regrouping the articles based on topic similarity. Preliminary analysis identifies clearly different attitudes on the same issue presented in different news sources. Ongoing work is investigating how systematic these attitudes are between different publications, and how these may change over time.

Language-based Construction of Explorable News Graphs for Journalists
Rémi Bois | Guillaume Gravier | Eric Jamet | Emmanuel Morin | Pascale Sébillot | Maxime Robert

Faced with ever-growing news archives, media professionals are in need of advanced tools to explore the information surrounding specific events. This problem is most commonly answered by browsing news datasets, going from article to article and viewing unaltered original content. In this article, we introduce an efficient way to generate links between news items, allowing such browsing through an easily explorable graph, and enrich this graph by automatically typing links in order to inform the user on the nature of the relation between two news pieces. User evaluations are conducted on real world data with journalists in order to assess for the interest of both the graph representation and link typing in a press reviewing task, showing the system to be of significant help for their work.

Storyteller: Visual Analytics of Perspectives on Rich Text Interpretations
Maarten van Meersbergen | Piek Vossen | Janneke van der Zwaan | Antske Fokkens | Willem van Hage | Inger Leemans | Isa Maks

Complexity of event data in texts makes it difficult to assess its content, especially when considering larger collections in which different sources report on the same or similar situations. We present a system that makes it possible to visually analyze complex event and emotion data extracted from texts. We show that we can abstract from different data models for events and emotions to a single data model that can show the complex relations in four dimensions. The visualization has been applied to analyze 1) dynamic developments in how people both conceive and express emotions in theater plays and 2) how stories are told from the perspectyive of their sources based on rich event data extracted from news or biographies.

Analyzing the Revision Logs of a Japanese Newspaper for Article Quality Assessment
Hideaki Tamori | Yuta Hitomi | Naoaki Okazaki | Kentaro Inui

We address the issue of the quality of journalism and analyze daily article revision logs from a Japanese newspaper company. The revision logs contain data that can help reveal the requirements of quality journalism such as the types and number of edit operations and aspects commonly focused in revision. This study also discusses potential applications such as quality assessment and automatic article revision as our future research directions.

Improved Abusive Comment Moderation with User Embeddings
John Pavlopoulos | Prodromos Malakasiotis | Juli Bakagianni | Ion Androutsopoulos

Experimenting with a dataset of approximately 1.6M user comments from a Greek news sports portal, we explore how a state of the art RNN-based moderation method can be improved by adding user embeddings, user type embeddings, user biases, or user type biases. We observe improvements in all cases, with user embeddings leading to the biggest performance gains.

Incongruent Headlines: Yet Another Way to Mislead Your Readers
Sophie Chesney | Maria Liakata | Massimo Poesio | Matthew Purver

This paper discusses the problem of incongruent headlines: those which do not accurately represent the information contained in the article with which they occur. We emphasise that this phenomenon should be considered separately from recognised problematic headline types such as clickbait and sensationalism, arguing that existing natural language processing (NLP) methods applied to these related concepts are not appropriate for the automatic detection of headline incongruence, as an analysis beyond stylistic traits is necessary. We therefore suggest a number of alternative methodologies that may be appropriate to the task at hand as a foundation for future work in this area. In addition, we provide an analysis of existing data sets which are related to this work, and motivate the need for a novel data set in this domain.

Unsupervised Event Clustering and Aggregation from Newswire and Web Articles
Swen Ribeiro | Olivier Ferret | Xavier Tannier

In this paper, we present an unsupervised pipeline approach for clustering news articles based on identified event instances in their content. We leverage press agency newswire and monolingual word alignment techniques to build meaningful and linguistically varied clusters of articles from the web in the perspective of a broader event type detection task. We validate our approach on a manually annotated corpus of Web articles.

Semantic Storytelling, Cross-lingual Event Detection and other Semantic Services for a Newsroom Content Curation Dashboard
Julian Moreno-Schneider | Ankit Srivastava | Peter Bourgonje | David Wabnitz | Georg Rehm

We present a prototypical content curation dashboard, to be used in the newsroom, and several of its underlying semantic content analysis components (such as named entity recognition, entity linking, summarisation and temporal expression analysis). The idea is to enable journalists (a) to process incoming content (agency reports, twitter feeds, reports, blog posts, social media etc.) and (b) to create new articles more easily and more efficiently. The prototype system also allows the automatic annotation of events in incoming content for the purpose of supporting journalists in identifying important, relevant or meaningful events and also to adapt the content currently in production accordingly in a semi-automatic way. One of our long-term goals is to support journalists building up entire storylines with automatic means. In the present prototype they are generated in a backend service using clustering methods that operate on the extracted events.

Deception Detection in News Reports in the Russian Language: Lexics and Discourse
Dina Pisarevskaya

News verification and automated fact checking tend to be very important issues in our world. The research is initial. We collected a corpus for Russian (174 news reports, truthful and fake ones). We held two experiments, for both we applied SVMs algorithm (linear/rbf kernel) and Random Forest to classify the news reports into 2 classes: truthful/deceptive. In the first experiment, we used 18 markers on lexics level, mostly frequencies of POS tags in texts. In the second experiment, on discourse level we used frequencies of rhetorical relations types in texts. The classification task in the first experiment is solved better by SVMs (rbf kernel) (f-measure 0.65). The model based on RST features shows best results with Random Forest Classifier (f-measure 0.54) and should be modified. In the next research, the combination of different deception detection markers for the Russian language should be taken in order to make a better predictive model.

Fake news stance detection using stacked ensemble of classifiers
James Thorne | Mingjie Chen | Giorgos Myrianthous | Jiashu Pu | Xiaoxuan Wang | Andreas Vlachos

Fake news has become a hotly debated topic in journalism. In this paper, we present our entry to the 2017 Fake News Challenge which models the detection of fake news as a stance classification task that finished in 11th place on the leader board. Our entry is an ensemble system of classifiers developed by students in the context of their coursework. We show how we used the stacking ensemble method for this purpose and obtained improvements in classification accuracy exceeding each of the individual models’ performance on the development data. Finally, we discuss aspects of the experimental setup of the challenge.

From Clickbait to Fake News Detection: An Approach based on Detecting the Stance of Headlines to Articles
Peter Bourgonje | Julian Moreno Schneider | Georg Rehm

We present a system for the detection of the stance of headlines with regard to their corresponding article bodies. The approach can be applied in fake news, especially clickbait detection scenarios. The component is part of a larger platform for the curation of digital content; we consider veracity and relevancy an increasingly important part of curating online information. We want to contribute to the debate on how to deal with fake news and related online phenomena with technological means, by providing means to separate related from unrelated headlines and further classifying the related headlines. On a publicly available data set annotated for the stance of headlines with regard to their corresponding article bodies, we achieve a (weighted) accuracy score of 89.59.

‘Fighting’ or ‘Conflict’? An Approach to Revealing Concepts of Terms in Political Discourse
Linyuan Tang | Kyo Kageura

Previous work on the epistemology of fact-checking indicated the dilemma between the needs of binary answers for the public and ambiguity of political discussion. Determining concepts represented by terms in political discourse can be considered as a Word-Sense Disambiguation (WSD) task. The analysis of political discourse, however, requires identifying precise concepts of terms from relatively small data. This work attempts to provide a basic framework for revealing concepts of terms in political discourse with explicit contextual information. The framework consists of three parts: 1) extracting important terms, 2) generating concordance for each term with stipulative definitions and explanations, and 3) agglomerating similar information of the term by hierarchical clustering. Utterances made by Prime Minister Abe Shinzo in the Diet of Japan are used to examine our framework. Importantly, we revealed the conceptual inconsistency of the term Sonritsu-kiki-jitai. The framework was proved to work, but only for a small number of terms due to lack of explicit contextual information.

A News Chain Evaluation Methodology along with a Lattice-based Approach for News Chain Construction
Mustafa Toprak | Özer Özkahraman | Selma Tekir

Chain construction is an important requirement for understanding news and establishing the context. A news chain can be defined as a coherent set of articles that explains an event or a story. There’s a lack of well-established methods in this area. In this work, we propose a methodology to evaluate the “goodness” of a given news chain and implement a concept lattice-based news chain construction method by Hossain et al.. The methodology part is vital as it directly affects the growth of research in this area. Our proposed methodology consists of collected news chains from different studies and two “goodness” metrics, minedge and dispersion coefficient respectively. We assess the utility of the lattice-based news chain construction method by our proposed methodology.

Using New York Times Picks to Identify Constructive Comments
Varada Kolhatkar | Maite Taboada

We examine the extent to which we are able to automatically identify constructive online comments. We build several classifiers using New York Times Picks as positive examples and non-constructive thread comments from the Yahoo News Annotated Comments Corpus as negative examples of constructive online comments. We evaluate these classifiers on a crowd-annotated corpus containing 1,121 comments. Our best classifier achieves a top F1 score of 0.84.

An NLP Analysis of Exaggerated Claims in Science News
Yingya Li | Jieke Zhang | Bei Yu

The discrepancy between science and media has been affecting the effectiveness of science communication. Original findings from science publications may be distorted with altered claim strength when reported to the public, causing misinformation spread. This study conducts an NLP analysis of exaggerated claims in science news, and then constructed prediction models for identifying claim strength levels in science reporting. The results demonstrate different writing styles journal articles and news/press releases use for reporting scientific findings. Preliminary prediction models reached promising result with room for further improvement.