Daniel Dakota

2021

Proceedings of the 20th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2021)
Daniel Dakota | Kilian Evang | Sandra Kübler
Proceedings of the 20th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2021)

Annotations Matter: Leveraging Multi-task Learning to Parse UD and SUD
Zeeshan Ali Sayyed | Daniel Dakota
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

What’s in a Span? Evaluating the Creativity of a Span-Based Neural Constituency Parser
Daniel Dakota | Sandra Kübler
Proceedings of the Society for Computation in Linguistics 2021

Examining the Effects of Preprocessing on the Detection of Offensive Language in German Tweets
Sebastian Reimann | Daniel Dakota
Proceedings of the 17th Conference on Natural Language Processing (KONVENS 2021)

Genres, Parsers, and BERT: The Interaction Between Parsers and BERT Models in Cross-Genre Constituency Parsing in English and Swedish
Daniel Dakota
Proceedings of the Second Workshop on Domain Adaptation for NLP

Genre and domain are often used interchangeably, but are two different properties of a text. Successful parser adaptation requires both cross-domain and cross-genre sensitivity (Rehbein and Bildhauer, 2017). While the impact that domain differences have on parser performance degradation is more easily measurable with respect to lexical differences, the impact of genre differences can be more nuanced. With the predominance of pre-trained language models (LMs; e.g., BERT (Devlin et al., 2019)), there are now additional complexities in developing cross-genre sensitive models due to the infusion of linguistic characteristics derived from, usually, a third genre. We perform a systematic set of experiments using two neural constituency parsers to examine how different parsers behave in combination with different BERT models with varying source and target genres in English and Swedish. We find that there is extensive difficulty in predicting the best source due to the complex interactions between genres, parsers, and LMs. Additionally, the data used to derive the underlying BERT model heavily influences how best to create more robust and effective cross-genre parsing models.

Bidirectional Domain Adaptation Using Weighted Multi-Task Learning
Daniel Dakota | Zeeshan Ali Sayyed | Sandra Kübler
Proceedings of the 17th International Conference on Parsing Technologies and the IWPT 2021 Shared Task on Parsing into Enhanced Universal Dependencies (IWPT 2021)

Domain adaptation in syntactic parsing is still a significant challenge. We address the issue of data imbalance between the in-domain and out-of-domain treebanks typically used for the problem. We define domain adaptation as a multi-task learning (MTL) problem, which allows us to train two parsers, one for each domain. Our results show that the MTL approach is beneficial for the smaller treebank. For the larger treebank, we need to use loss weighting in order to avoid a decrease in performance below the single task. In order to determine to what degree the data imbalance between two domains and the domain differences affect results, we also carry out an experiment with two imbalanced in-domain treebanks and show that loss weighting also improves performance in an in-domain setting. Given loss weighting in MTL, we can improve results for both parsers.

2019

Investigating Multilingual Abusive Language Detection: A Cautionary Tale
Kenneth Steimel | Daniel Dakota | Yue Chen | Sandra Kübler
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

Abusive language detection has received much attention in recent years, and recent approaches perform the task in a number of different languages. We investigate which factors have an effect on multilingual settings, focusing on the compatibility of data and annotations. In the current paper, we focus on English and German. Our findings show large differences in performance between the two languages. We find that the best performance is achieved by different classification algorithms. Sampling to address class imbalance issues is detrimental for German and beneficial for English. The only similarity that we find is that neither data set shows clear topics when we compare the results of topic modeling to the gold standard. Based on our findings, we can conclude that a multilingual optimization of classifiers is not possible even in settings where comparable data sets are used.

2018

Practical Parsing for Downstream Applications
Daniel Dakota | Sandra Kübler
Proceedings of the 27th International Conference on Computational Linguistics: Tutorial Abstracts

2017

Towards Replicability in Parsing
Daniel Dakota | Sandra Kübler
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

We investigate parsing replicability across 7 languages (and 8 treebanks), showing that choices concerning the use of grammatical functions in parsing or evaluation, the influence of the rare word threshold, as well as choices in test sentences and evaluation script options have considerable and often unexpected effects on parsing accuracies. All of those choices need to be carefully documented if we want to ensure replicability.

Non-Deterministic Segmentation for Chinese Lattice Parsing
Hai Hu | Daniel Dakota | Sandra Kübler
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

Parsing Chinese critically depends on correct word segmentation for the parser since incorrect segmentation inevitably causes incorrect parses. We investigate a pipeline approach to segmentation and parsing using word lattices as parser input. We compare CRF-based and lexicon-based approaches to word segmentation. Our results show that the lattice parser is capable of selecting the correct segmentation from thousands of options, thus drastically reducing the number of unparsed sentences. Lexicon-based parsing models have better coverage than the CRF-based approach, but the many options are more difficult to handle. We reach our best result by using a lexicon from the n-best CRF analyses, combined with highly probable words.

2016
