Sebastian Padó

Also published as: Sebastian Pado

2021

The analysis of public debates crucially requires the classification of political demands according to hierarchical claim ontologies (e.g. for immigration, a supercategory “Controlling Migration” might have subcategories “Asylum limit” or “Border installations”). A major challenge for automatic claim classification is the large number and low frequency of such subclasses. We address it by jointly predicting pairs of matching super- and subcategories. We operationalize this idea by (a) encoding soft constraints in the claim classifier and (b) imposing hard constraints via Integer Linear Programming. Our experiments with different claim classifiers on a German immigration newspaper corpus show consistent performance increases for joint prediction, in particular for infrequent categories and discuss the complementarity of the two approaches.

pdf bib abs
Emotion Ratings: How Intensity, Annotation Confidence and Agreements are Entangled
Enrica Troiano | Sebastian Padó | Roman Klinger
Proceedings of the Eleventh Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

When humans judge the affective content of texts, they also implicitly assess the correctness of such judgment, that is, their confidence. We hypothesize that people’s (in)confidence that they performed well in an annotation task leads to (dis)agreements among each other. If this is true, confidence may serve as a diagnostic tool for systematic differences in annotations. To probe our assumption, we conduct a study on a subset of the Corpus of Contemporary American English, in which we ask raters to distinguish neutral sentences from emotion-bearing ones, while scoring the confidence of their answers. Confidence turns out to approximate inter-annotator disagreements. Further, we find that confidence is correlated to emotion intensity: perceiving stronger affect in text prompts annotators to more certain classification performances. This insight is relevant for modelling studies of intensity, as it opens the question wether automatic regressors or classifiers actually predict intensity, or rather human’s self-perceived confidence.

pdf bib abs
Disentangling Document Topic and Author Gender in Multiple Languages: Lessons for Adversarial Debiasing
Erenay Dayanik | Sebastian Padó
Proceedings of the Eleventh Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

Text classification is a central tool in NLP. However, when the target classes are strongly correlated with other textual attributes, text classification models can pick up “wrong” features, leading to bad generalization and biases. In social media analysis, this problem surfaces for demographic user classes such as language, topic, or gender, which influence the generate text to a substantial extent. Adversarial training has been claimed to mitigate this problem, but thorough evaluation is missing. In this paper, we experiment with text classification of the correlated attributes of document topic and author gender, using a novel multilingual parallel corpus of TED talk transcripts. Our findings are: (a) individual classifiers for topic and author gender are indeed biased; (b) debiasing with adversarial training works for topic, but breaks down for author gender; (c) gender debiasing results differ across languages. We interpret the result in terms of feature space overlap, highlighting the role of linguistic surface realization of the target classes.

pdf bib abs
New Domain, Major Effort? How Much Data is Necessary to Adapt a Temporal Tagger to the Voice Assistant Domain
Touhidul Alam | Alessandra Zarcone | Sebastian Padó
Proceedings of the 14th International Conference on Computational Semantics (IWCS)

Reliable tagging of Temporal Expressions (TEs, e.g., Book a table at L’Osteria for Sunday evening) is a central requirement for Voice Assistants (VAs). However, there is a dearth of resources and systems for the VA domain, since publicly-available temporal taggers are trained only on substantially different domains, such as news and clinical text. Since the cost of annotating large datasets is prohibitive, we investigate the trade-off between in-domain data and performance in DA-Time, a hybrid temporal tagger for the English VA domain which combines a neural architecture for robust TE recognition, with a parser-based TE normalizer. We find that transfer learning goes a long way even with as little as 25 in-domain sentences: DA-Time performs at the state of the art on the news domain, and substantially outperforms it on the VA domain.

2020

pdf bib abs
Dissecting Span Identification Tasks with Performance Prediction
Sean Papay | Roman Klinger | Sebastian Padó
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Span identification (in short, span ID) tasks such as chunking, NER, or code-switching detection, ask models to identify and classify relevant spans in a text. Despite being a staple of NLP, and sharing a common structure, there is little insight on how these tasks’ properties influence their difficulty, and thus little guidance on what model families work well on span ID tasks, and why. We analyze span ID tasks via performance prediction, estimating how well neural architectures do on different tasks. Our contributions are: (a) we identify key properties of span ID tasks that can inform performance prediction; (b) we carry out a large-scale experiment on English data, building a model to predict performance for unseen span ID tasks that can support architecture choices; (c), we investigate the parameters of the meta model, yielding new insights on how model and task properties interact to affect span ID performance. We find, e.g., that span frequency is especially important for LSTMs, and that CRFs help when spans are infrequent and boundaries non-distinctive.

pdf bib abs
Swimming with the Tide? Positional Claim Detection across Political Text Types
Nico Blokker | Erenay Dayanik | Gabriella Lapesa | Sebastian Padó
Proceedings of the Fourth Workshop on Natural Language Processing and Computational Social Science

Manifestos are official documents of political parties, providing a comprehensive topical overview of the electoral programs. Voters, however, seldom read them and often prefer other channels, such as newspaper articles, to understand the party positions on various policy issues. The natural question to ask is how compatible these two formats (manifesto and newspaper reports) are in their representation of party positioning. We address this question with an approach that combines political science (manual annotation and analysis) and natural language processing (supervised claim identification) in a cross-text type setting: we train a classifier on annotated newspaper data and test its performance on manifestos. Our findings show a) strong performance for supervised classification even across text types and b) a substantive overlap between the two formats in terms of party positioning, with differences regarding the salience of specific issues.

pdf bib abs
Lost in Back-Translation: Emotion Preservation in Neural Machine Translation
Enrica Troiano | Roman Klinger | Sebastian Padó
Proceedings of the 28th International Conference on Computational Linguistics

Machine translation provides powerful methods to convert text between languages, and is therefore a technology enabling a multilingual world. An important part of communication, however, takes place at the non-propositional level (e.g., politeness, formality, emotions), and it is far from clear whether current MT methods properly translate this information. This paper investigates the specific hypothesis that the non-propositional level of emotions is at least partially lost in MT. We carry out a number of experiments in a back-translation setup and establish that (1) emotions are indeed partially lost during translation; (2) this tendency can be reversed almost completely with a simple re-ranking approach informed by an emotion classifier, taking advantage of diversity in the n-best list; (3) the re-ranking approach can also be applied to change emotions, obtaining a model for emotion style transfer. An in-depth qualitative analysis reveals that there are recurring linguistic changes through which emotions are toned down or amplified, such as change of modality.

pdf bib abs
RiQuA: A Corpus of Rich Quotation Annotation for English Literary Text
Sean Papay | Sebastian Padó
Proceedings of the 12th Language Resources and Evaluation Conference

We introduce RiQuA (RIch QUotation Annotations), a corpus that provides quotations, including their interpersonal structure (speakers and addressees) for English literary text. The corpus comprises 11 works of 19th-century literature that were manually doubly annotated for direct and indirect quotations. For each quotation, its span, speaker, addressee, and cue are identified (if present). This provides a rich view of dialogue structures not available from other available corpora. We detail the process of creating this dataset, discuss the annotation guidelines, and analyze the resulting corpus in terms of inter-annotator agreement and its properties. RiQuA, along with its annotations guidelines and associated scripts, are publicly available for use, modification, and experimentation.

DEbateNet-migr15 is a manually annotated dataset for German which covers the public debate on immigration in 2015. The building block of our annotation is the political science notion of a claim, i.e., a statement made by a political actor (a politician, a party, or a group of citizens) that a specific action should be taken (e.g., vacant flats should be assigned to refugees). We identify claims in newspaper articles, assign them to actors and fine-grained categories and annotate their polarity and date. The aim of this paper is two-fold: first, we release the full DEbateNet-mig15 corpus and document it by means of a quantitative and qualitative analysis; second, we demonstrate its application in a discourse network analysis framework, which enables us to capture the temporal dynamics of the political debate

pdf bib abs
Masking Actor Information Leads to Fairer Political Claims Detection
Erenay Dayanik | Sebastian Padó
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

A central concern in Computational Social Sciences (CSS) is fairness: where the role of NLP is to scale up text analysis to large corpora, the quality of automatic analyses should be as independent as possible of textual properties. We analyze the performance of a state-of-the-art neural model on the task of political claims detection (i.e., the identification of forward-looking statements made by political actors) and identify a strong frequency bias: claims made by frequent actors are recognized better. We propose two simple debiasing methods which mask proper names and pronouns during training of the model, thus removing personal information bias. We find that (a) these methods significantly decrease frequency bias while keeping the overall performance stable; and (b) the resulting models improve when evaluated in an out-of-domain setting.

2019

pdf bib abs
Who Sides with Whom? Towards Computational Construction of Discourse Networks for Political Debates
Sebastian Padó | Andre Blessing | Nico Blokker | Erenay Dayanik | Sebastian Haunss | Jonas Kuhn
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Understanding the structures of political debates (which actors make what claims) is essential for understanding democratic political decision making. The vision of computational construction of such discourse networks from newspaper reports brings together political science and natural language processing. This paper presents three contributions towards this goal: (a) a requirements analysis, linking the task to knowledge base population; (b) an annotated pilot corpus of migration claims based on German newspaper reports; (c) initial modeling results.

pdf bib abs
Crowdsourcing and Validating Event-focused Emotion Corpora for German and English
Enrica Troiano | Sebastian Padó | Roman Klinger
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Sentiment analysis has a range of corpora available across multiple languages. For emotion analysis, the situation is more limited, which hinders potential research on crosslingual modeling and the development of predictive models for other languages. In this paper, we fill this gap for German by constructing deISEAR, a corpus designed in analogy to the well-established English ISEAR emotion dataset. Motivated by Scherer’s appraisal theory, we implement a crowdsourcing experiment which consists of two steps. In step 1, participants create descriptions of emotional events for a given emotion. In step 2, five annotators assess the emotion expressed by the texts. We show that transferring an emotion classification model from the original English ISEAR to the German crowdsourced deISEAR via machine translation does not, on average, cause a performance drop.

pdf bib abs
An Environment for Relational Annotation of Political Debates
Andre Blessing | Nico Blokker | Sebastian Haunss | Jonas Kuhn | Gabriella Lapesa | Sebastian Padó
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations

This paper describes the MARDY corpus annotation environment developed for a collaboration between political science and computational linguistics. The tool realizes the complete workflow necessary for annotating a large newspaper text collection with rich information about claims (demands) raised by politicians and other actors, including claim and actor spans, relations, and polarities. In addition to the annotation GUI, the tool supports the identification of relevant documents, text pre-processing, user management, integration of external knowledge bases, annotation comparison and merging, statistical analysis, and the incorporation of machine learning models as “pseudo-annotators”.

pdf bib abs
Frame Identification as Categorization: Exemplars vs Prototypes in Embeddingland
Jennifer Sikos | Sebastian Padó
Proceedings of the 13th International Conference on Computational Semantics - Long Papers

Categorization is a central capability of human cognition, and a number of theories have been developed to account for properties of categorization. Even though many tasks in semantics also involve categorization of some kind, theories of categorization do not play a major role in contemporary research in computational linguistics. This paper follows the idea that embedding-based models of semantics lend themselves well to being formulated in terms of classical categorization theories. The benefit is a space of model families that enables (a) the formulation of hypotheses about the impact of major design decisions, and (b) a transparent assessment of these decisions. We instantiate this idea on the task of frame-semantic frame identification. We define four models that cross two design variables: (a) the choice of prototype vs. exemplar categorization, corresponding to different degrees of generalization applied to the input; and (b) the presence vs. absence of a fine-tuning step, corresponding to generic vs. task-adaptive categorization. We find that for frame identification, generalization and task-adaptive categorization both yield substantial benefits. Our prototype-based, fine-tuned model, which combines the best choices for these variables, establishes a new state of the art in frame identification.

pdf bib abs
Clustering-Based Article Identification in Historical Newspapers
Martin Riedl | Daniela Betz | Sebastian Padó
Proceedings of the 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

This article focuses on the problem of identifying articles and recovering their text from within and across newspaper pages when OCR just delivers one text file per page. We frame the task as a segmentation plus clustering step. Our results on a sample of 1912 New York Tribune magazine shows that performing the clustering based on similarities computed with word embeddings outperforms a similarity measure based on character n-grams and words. Furthermore, the automatic segmentation based on the text results in low scores, due to the low quality of some OCRed documents.

bib abs
Learning Trilingual Dictionaries for Urdu – Roman Urdu – English
Moiz Rauf | Sebastian Padó
Proceedings of the 2019 Workshop on Widening NLP

In this paper, we present an effort to generate a joint Urdu, Roman Urdu and English trilingual lexicon using automated methods. We make a case for using statistical machine translation approaches and parallel corpora for dictionary creation. To this purpose, we use word alignment tools on the corpus and evaluate translations using human evaluators. Despite different writing script and considerable noise in the corpus our results show promise with over 85% accuracy of Roman Urdu–Urdu and 45% English–Urdu pairs.

pdf bib abs
Modeling Paths for Explainable Knowledge Base Completion
Josua Stadelmaier | Sebastian Padó
Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP

A common approach in knowledge base completion (KBC) is to learn representations for entities and relations in order to infer missing facts by generalizing existing ones. A shortcoming of standard models is that they do not explain their predictions to make them verifiable easily to human inspection. In this paper, we propose the Context Path Model (CPM) which generates explanations for new facts in KBC by providing sets of context paths as supporting evidence for these triples. For example, a new triple (Theresa May, nationality, Britain) may be explained by the path (Theresa May, born in, Eastbourne, contained in, Britain). The CPM is formulated as a wrapper that can be applied on top of various existing KBC models. We evaluate it for the well-established TransE model. We observe that its performance remains very close despite the added complexity, and that most of the paths proposed as explanations provide meaningful evidence to assess the correctness.

pdf bib abs
Quotation Detection and Classification with a Corpus-Agnostic Model
Sean Papay | Sebastian Padó
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

The detection of quotations (i.e., reported speech, thought, and writing) has established itself as an NLP analysis task. However, state-of-the-art models have been developed on the basis of specific corpora and incorpo- rate a high degree of corpus-specific assumptions and knowledge, which leads to fragmentation. In the spirit of task-agnostic modeling, we present a corpus-agnostic neural model for quotation detection and evaluate it on three corpora that vary in language, text genre, and structural assumptions. The model (a) approaches the state-of-the-art on the corpora when using established feature sets and (b) shows reasonable performance even when us- ing solely word forms, which makes it applicable for non-standard (i.e., historical) corpora.

pdf bib abs
Text-Based Joint Prediction of Numeric and Categorical Attributes of Entities in Knowledge Bases
V Thejas | Abhijeet Gupta | Sebastian Padó
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

Collaboratively constructed knowledge bases play an important role in information systems, but are essentially always incomplete. Thus, a large number of models has been developed for Knowledge Base Completion, the task of predicting new attributes of entities given partial descriptions of these entities. Virtually all of these models either concentrate on numeric attributes (<Italy,GDP,2T$>) or they concentrate on categorical attributes (<Tim Cook,chairman,Apple>). In this paper, we propose a simple feed-forward neural architecture to jointly predict numeric and categorical attributes based on embeddings learned from textual occurrences of the entities in question. Following insights from multi-task learning, our hypothesis is that due to the correlations among attributes of different kinds, joint prediction improves over separate prediction. Our experiments on seven FreeBase domains show that this hypothesis is true of the two attribute types: we find substantial improvements for numeric attributes in the joint model, while performance remains largely unchanged for categorical attributes. Our analysis indicates that this is the case because categorical attributes, many of which describe membership in various classes, provide useful ‘background knowledge’ for numeric prediction, while this is true to a lesser degree in the inverse direction.

pdf bib
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations
Sebastian Padó | Ruihong Huang
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations

2018

pdf bib abs
DERE: A Task and Domain-Independent Slot Filling Framework for Declarative Relation Extraction
Heike Adel | Laura Ana Maria Bostan | Sean Papay | Sebastian Padó | Roman Klinger
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Most machine learning systems for natural language processing are tailored to specific tasks. As a result, comparability of models across tasks is missing and their applicability to new tasks is limited. This affects end users without machine learning experience as well as model developers. To address these limitations, we present DERE, a novel framework for declarative specification and compilation of template-based information extraction. It uses a generic specification language for the task and for data annotations in terms of spans and frames. This formalism enables the representation of a large variety of natural language processing challenges. The backend can be instantiated by different models, following different paradigms. The clear separation of frame specification and model backend will ease the implementation of new models and the evaluation of different models across different tasks. Furthermore, it simplifies transfer learning, joint learning across tasks and/or domains as well as the assessment of model generalizability. DERE is available as open-source software.

pdf bib abs
A Named Entity Recognition Shootout for German
Martin Riedl | Sebastian Padó
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

We ask how to practically build a model for German named entity recognition (NER) that performs at the state of the art for both contemporary and historical texts, i.e., a big-data and a small-data scenario. The two best-performing model families are pitted against each other (linear-chain CRFs and BiLSTM) to observe the trade-off between expressiveness and data requirements. BiLSTM outperforms the CRF when large datasets are available and performs inferior for the smallest dataset. BiLSTMs profit substantially from transfer learning, which enables them to be trained on multiple corpora, resulting in a new state-of-the-art model for German NER on two contemporary German corpora (CoNLL 2003 and GermEval 2014) and two historic corpora.

pdf bib abs
Lexical Substitution for Evaluating Compositional Distributional Models
Maja Buljan | Sebastian Padó | Jan Šnajder
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

Compositional Distributional Semantic Models (CDSMs) model the meaning of phrases and sentences in vector space. They have been predominantly evaluated on limited, artificial tasks such as semantic sentence similarity on hand-constructed datasets. This paper argues for lexical substitution (LexSub) as a means to evaluate CDSMs. LexSub is a more natural task, enables us to evaluate meaning composition at the level of individual words, and provides a common ground to compare CDSMs with dedicated LexSub models. We create a LexSub dataset for CDSM evaluation from a corpus with manual “all-words” LexSub annotation. Our experiments indicate that the Practical Lexical Function CDSM outperforms simple component-wise CDSMs and performs on par with the context2vec LexSub model using the same context.

pdf bib abs
Addressing Low-Resource Scenarios with Character-aware Embeddings
Sean Papay | Sebastian Padó | Ngoc Thang Vu
Proceedings of the Second Workshop on Subword/Character LEvel Models

Most modern approaches to computing word embeddings assume the availability of text corpora with billions of words. In this paper, we explore a setup where only corpora with millions of words are available, and many words in any new text are out of vocabulary. This setup is both of practical interests – modeling the situation for specific domains and low-resource languages – and of psycholinguistic interest, since it corresponds much more closely to the actual experiences and challenges of human language learning and use. We compare standard skip-gram word embeddings with character-based embeddings on word relatedness prediction. Skip-grams excel on large corpora, while character-based embeddings do well on small corpora generally and rare and complex words specifically. The models can be combined easily.

pdf bib abs
Using Embeddings to Compare FrameNet Frames Across Languages
Jennifer Sikos | Sebastian Padó
Proceedings of the First Workshop on Linguistic Resources for Natural Language Processing

Much interest in Frame Semantics is fueled by the substantial extent of its applicability across languages. At the same time, lexicographic studies have found that the applicability of individual frames can be diminished by cross-lingual divergences regarding polysemy, syntactic valency, and lexicalization. Due to the large effort involved in manual investigations, there are so far no broad-coverage resources with “problematic” frames for any language pair. Our study investigates to what extent multilingual vector representations of frames learned from manually annotated corpora can address this need by serving as a wide coverage source for such divergences. We present a case study for the language pair English — German using the FrameNet and SALSA corpora and find that inferences can be made about cross-lingual frame applicability using a vector space model.

2017

pdf bib abs
Distributed Prediction of Relations for Entities: The Easy, The Difficult, and The Impossible
Abhijeet Gupta | Gemma Boleda | Sebastian Padó
Proceedings of the 6th Joint Conference on Lexical and Computational Semantics (*SEM 2017)

Word embeddings are supposed to provide easy access to semantic relations such as “male of” (man–woman). While this claim has been investigated for concepts, little is known about the distributional behavior of relations of (Named) Entities. We describe two word embedding-based models that predict values for relational attributes of entities, and analyse them. The task is challenging, with major performance differences between relations. Contrary to many NLP tasks, high difficulty for a relation does not result from low frequency, but from (a) one-to-many mappings; and (b) lack of context patterns expressing the relation that are easy to pick up by word embeddings.

pdf bib abs
Does Free Word Order Hurt? Assessing the Practical Lexical Function Model for Croatian
Zoran Medić | Jan Šnajder | Sebastian Padó
Proceedings of the 6th Joint Conference on Lexical and Computational Semantics (*SEM 2017)

The Practical Lexical Function (PLF) model is a model of computational distributional semantics that attempts to strike a balance between expressivity and learnability in predicting phrase meaning and shows competitive results. We investigate how well the PLF carries over to free word order languages, given that it builds on observations of predicate-argument combinations that are harder to recover in free word order languages. We evaluate variants of the PLF for Croatian, using a new lexical substitution dataset. We find that the PLF works about as well for Croatian as for English, but demonstrate that its strength lies in modeling verbs, and that the free word order affects the less robust PLF variant.

pdf bib abs
Investigating the Relationship between Literary Genres and Emotional Plot Development
Evgeny Kim | Sebastian Padó | Roman Klinger
Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

Literary genres are commonly viewed as being defined in terms of content and stylistic features. In this paper, we focus on one particular class of lexical features, namely emotion information, and investigate the hypothesis that emotion-related information correlates with particular genres. Using genre classification as a testbed, we compare a model that computes lexicon-based emotion scores globally for complete stories with a model that tracks emotion arcs through stories on a subset of Project Gutenberg with five genres. Our main findings are: (a), the global emotion model is competitive with a large-vocabulary bag-of-words genre classifier (80%F1); (b), the emotion arc model shows a lower performance (59 % F1) but shows complementary behavior to the global model, as indicated by a very good performance of an oracle model (94 % F1) and an improved performance of an ensemble model (84 % F1); (c), genres differ in the extent to which stories follow the same emotional arcs, with particularly uniform behavior for anger (mystery) and fear (adventures, romance, humor, science fiction).

pdf bib abs
Annotation, Modelling and Analysis of Fine-Grained Emotions on a Stance and Sentiment Detection Corpus
Hendrik Schuff | Jeremy Barnes | Julian Mohme | Sebastian Padó | Roman Klinger
Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

There is a rich variety of data sets for sentiment analysis (viz., polarity and subjectivity classification). For the more challenging task of detecting discrete emotions following the definitions of Ekman and Plutchik, however, there are much fewer data sets, and notably no resources for the social media domain. This paper contributes to closing this gap by extending the SemEval 2016 stance and sentiment datasetwith emotion annotation. We (a) analyse annotation reliability and annotation merging; (b) investigate the relation between emotion annotation and the other annotation layers (stance, sentiment); (c) report modelling results as a baseline for future work.

pdf bib
Living a discrete life in a continuous world: Reference in cross-modal entity tracking
Gemma Boleda | Sebastian Padó | Nghia The Pham | Marco Baroni
IWCS 2017 — 12th International Conference on Computational Semantics — Short papers

pdf bib
Are doggies really nicer than dogs? The impact of morphological derivation on emotional valence in German
Gabriella Lapesa | Sebastian Padó | Tillmann Pross | Antje Rossdeutscher
IWCS 2017 — 12th International Conference on Computational Semantics — Short papers

pdf bib
Modeling Derivational Morphology in Ukrainian
Mariia Melymuka | Gabriella Lapesa | Max Kisselew | Sebastian Padó
IWCS 2017 — 12th International Conference on Computational Semantics — Short papers

pdf bib abs
Instances and concepts in distributional space
Gemma Boleda | Abhijeet Gupta | Sebastian Padó
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

Instances (“Mozart”) are ontologically distinct from concepts or classes (“composer”). Natural language encompasses both, but instances have received comparatively little attention in distributional semantics. Our results show that instances and concepts differ in their distributional properties. We also establish that instantiation detection (“Mozart – composer”) is generally easier than hypernymy detection (“chemist – scientist”), and that results on the influence of input representation do not transfer from hyponymy to instantiation.

2016

pdf bib
Predicting the Direction of Derivation in English Conversion
Max Kisselew | Laura Rimell | Alexis Palmer | Sebastian Padó
Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

pdf bib abs
Predictability of Distributional Semantics in Derivational Word Formation
Sebastian Padó | Aurélie Herbelot | Max Kisselew | Jan Šnajder
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Compositional distributional semantic models (CDSMs) have successfully been applied to the task of predicting the meaning of a range of linguistic constructions. Their performance on semi-compositional word formation process of (morphological) derivation, however, has been extremely variable, with no large-scale empirical investigation to date. This paper fills that gap, performing an analysis of CDSM predictions on a large dataset (over 30,000 German derivationally related word pairs). We use linear regression models to analyze CDSM performance and obtain insights into the linguistic factors that influence how predictable the distributional context of a derived word is going to be. We identify various such factors, notably part of speech, argument structure, and semantic regularity.

pdf bib
Model Architectures for Quotation Detection
Christian Scheible | Roman Klinger | Sebastian Padó
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Improving Zero-Shot-Learning for German Particle Verbs by using Training-Space Restrictions and Local Scaling
Maximilian Köper | Sabine Schulte im Walde | Max Kisselew | Sebastian Padó
Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics

2015

pdf bib
Obtaining a Better Understanding of Distributional Models of German Derivational Morphology
Max Kisselew | Sebastian Padó | Alexis Palmer | Jan Šnajder
Proceedings of the 11th International Conference on Computational Semantics

pdf bib
Combining Seemingly Incompatible Corpora for Implicit Semantic Role Labeling
Parvin Sadat Feizabadi | Sebastian Padó
Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics

pdf bib
Dissecting the Practical Lexical Function Model for Compositional Distributional Semantics
Abhijeet Gupta | Jason Utt | Sebastian Padó
Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics

pdf bib
Distributional vectors encode referential attributes
Abhijeet Gupta | Gemma Boleda | Marco Baroni | Sebastian Padó
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

2014

pdf bib abs
Polysemy Index for Nouns: an Experiment on Italian using the PAROLE SIMPLE CLIPS Lexical Database
Francesca Frontini | Valeria Quochi | Sebastian Padó | Monica Monachini | Jason Utt
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

An experiment is presented to induce a set of polysemous basic type alternations (such as Animal-Food, or Building-Institution) by deriving them from the sense alternations found in an existing lexical resource. The paper builds on previous work and applies those results to the Italian lexicon PAROLE SIMPLE CLIPS. The new results show how the set of frequent type alternations that can be induced from the lexicon is partly different from the set of polysemy relations selected and explicitely applied by lexicographers when building it. The analysis of mismatches shows that frequent type alternations do not always correpond to prototypical polysemy relations, nevertheless the proposed methodology represents a useful tool offered to lexicographers to systematically check for possible gaps in their resource.

pdf bib
Towards Semantic Validation of a Derivational Lexicon
Britta Zeller | Sebastian Padó | Jan Šnajder
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

pdf bib abs
Crosslingual and Multilingual Construction of Syntax-Based Vector Space Models
Jason Utt | Sebastian Padó
Transactions of the Association for Computational Linguistics, Volume 2

Syntax-based distributional models of lexical semantics provide a flexible and linguistically adequate representation of co-occurrence information. However, their construction requires large, accurately parsed corpora, which are unavailable for most languages. In this paper, we develop a number of methods to overcome this obstacle. We describe (a) a crosslingual approach that constructs a syntax-based model for a new language requiring only an English resource and a translation lexicon; and (b) multilingual approaches that combine crosslingual with monolingual information, subject to availability. We evaluate on two lexical semantic benchmarks in German and Croatian. We find that the models exhibit complementary profiles: crosslingual models yield higher accuracies while monolingual models provide better coverage. In addition, we show that simple multilingual models can successfully combine their strengths.

pdf bib
What Substitutes Tell Us - Analysis of an “All-Words” Lexical Substitution Corpus
Gerhard Kremer | Katrin Erk | Sebastian Padó | Stefan Thater
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics

pdf bib
Crowdsourcing Annotation of Non-Local Semantic Roles
Parvin Sadat Feizabadi | Sebastian Padó
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, volume 2: Short Papers

2013

pdf bib
A corpus study of clause combination
Olga Nikitina | Sebastian Padó
Proceedings of the 10th International Conference on Computational Semantics (IWCS 2013) – Long Papers

pdf bib
A Search Task Dataset for German Textual Entailment
Britta D. Zeller | Sebastian Padó
Proceedings of the 10th International Conference on Computational Semantics (IWCS 2013) – Long Papers

pdf bib
Fitting, Not Clashing! A Distributional Semantic Model of Logical Metonymy
Alessandra Zarcone | Alessandro Lenci | Sebastian Padó | Jason Utt
Proceedings of the 10th International Conference on Computational Semantics (IWCS 2013) – Short Papers

pdf bib
The Curious Case of Metonymic Verbs: A Distributional Characterization
Jason Utt | Alessandro Lenci | Sebastian Padó | Alessandra Zarcone
Proceedings of the IWCS 2013 Workshop Towards a Formal Distributional Semantics

pdf bib
Design and Realization of the EXCITEMENT Open Platform for Textual Entailment
Günter Neumann | Sebastian Padó
Proceedings of the Joint Symposium on Semantic Processing. Textual Inference and Structures in Corpora

pdf bib
Bridges Across the Language Divide — EU-BRIDGE Excitement: Exploring Customer Interactions through Textual EntailMENT
Ido Dagan | Bernardo Magnini | Guenter Neumann | Sebastian Pado
Proceedings of Machine Translation Summit XIV: European projects

pdf bib
Excitement: Exploring Customer Interactions through Textual EntailMENT
Ido Dagan | Bernardo Magnini | Guenter Neumann | Sebastian Pado
Proceedings of Machine Translation Summit XIV: European projects

pdf bib
DErivBase: Inducing and Evaluating a Derivational Morphology Resource for German
Britta Zeller | Jan Šnajder | Sebastian Padó
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Derivational Smoothing for Syntactic Distributional Semantics
Sebastian Padó | Jan Šnajder | Britta Zeller
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
Building and Evaluating a Distributional Memory for Croatian
Jan Šnajder | Sebastian Padó | Željko Agić
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2012

pdf bib
Modeling covert event retrieval in logical metonymy: probabilistic and distributional accounts
Alessandra Zarcone | Jason Utt | Sebastian Padó
Proceedings of the 3rd Workshop on Cognitive Modeling and Computational Linguistics (CMCL 2012)

pdf bib
Towards a model of formal and informal address in English
Manaal Faruqui | Sebastian Padó
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics

pdf bib
Regular polysemy: A distributional model
Gemma Boleda | Sebastian Padó | Jason Utt
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)

pdf bib abs
French and German Corpora for Audience-based Text Type Classification
Amalia Todirascu | Sebastian Padó | Jennifer Krisch | Max Kisselew | Ulrich Heid
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper presents some of the results of the CLASSYN project which investigated the classification of text according to audience-related text types. We describe the design principles and the properties of the French and German linguistically annotated corpora that we have created. We report on tools used to collect the data and on the quality of the syntactic annotation. The CLASSYN corpora comprise two text collections to investigate general text types difference between scientific and popular science text on the two domains of medical and computer science.

Dans le paradigme FrameNet, cet article aborde le problème de l’annotation précise et automatique de rôles sémantiques dans une langue sans lexique FrameNet existant. Nous évaluons la méthode proposée par Padó et Lapata (2005, 2006), fondée sur la projection de rôles et appliquée initialement à la paire anglais-allemand. Nous testons sa généralisabilité du point de vue (a) des langues, en l’appliquant à la paire (anglais-français) et (b) de la qualité de la source, en utilisant une annotation automatique du côté anglais. Les expériences montrent des résultats à la hauteur de ceux obtenus pour l’allemand, nous permettant de conclure que cette approche présente un grand potentiel pour réduire la quantité de travail nécessaire à la création de telles ressources dans de nombreuses langues.

pdf bib
Flexible, Corpus-Based Modelling of Human Plausibility Judgements
Sebastian Padó | Ulrike Padó | Katrin Erk
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

2006

pdf bib abs
The SALSA Corpus: a German Corpus Resource for Lexical Semantics
Aljoscha Burchardt | Katrin Erk | Anette Frank | Andrea Kowalski | Sebastian Padó | Manfred Pinkal
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

This paper describes the SALSA corpus, a large German corpus manually annotated with manual role-semantic annotation, based on the syntactically annotated TIGER newspaper corpus. The first release, comprising about 20,000 annotated predicate instances (about half the TIGER corpus), is scheduled for mid-2006. In this paper we discuss the annotation framework (frame semantics) and its cross-lingual applicability, problems arising from exhaustive annotation, strategies for quality control, and possible applications.

pdf bib abs
SALTO - A Versatile Multi-Level Annotation Tool
Aljoscha Burchardt | Katrin Erk | Anette Frank | Andrea Kowalski | Sebastian Pado
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

In this paper, we describe the SALTO tool. It was originally developed for the annotation of semantic roles in the frame semantics paradigm, but can be used for graphical annotation of treebanks with general relational information in a simple drag-and-drop fashion. The tool additionally supports corpus management and quality control.

pdf bib abs
Shalmaneser - A Toolchain For Shallow Semantic Parsing
Katrin Erk | Sebastian Padó
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

This paper presents Shalmaneser, a software package for shallow semantic parsing, the automatic assignment of semantic classes and roles to free text. Shalmaneser is a toolchain of independent modules communicating through a common XML format. System output can be inspected graphically. Shalmaneser can be used either as a black box to obtain semantic parses for new datasets (classifiers for English and German frame-semantic analysis are included), or as a research platform that can be extended to new parsers, languages, or classification paradigms.

pdf bib
Optimal Constituent Alignment with Edge Covers for Semantic Projection
Sebastian Padó | Mirella Lapata
Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics