Claudia Hauff


Unsupervised Domain Adaptation for Question Generation with Domain Data Selection and Self-training
Peide Zhu | Claudia Hauff
Findings of the Association for Computational Linguistics: NAACL 2022

Question generation (QG) approaches based on large neural models require (i) large-scale and (ii) high-quality training data. These two requirements pose difficulties for specific application domains where training data is expensive and difficult to obtain. The trained QG models’ effectiveness can degrade significantly when they are applied to a different domain due to domain shift. In this paper, we explore an unsupervised domain adaptation approach that combats the lack of training data and the domain shift issue with domain data selection and self-training. We first present a novel answer-aware strategy for domain data selection that selects the data most similar to a new domain. The selected data are then used as pseudo-in-domain data to retrain the QG model. We then present generation-confidence-guided self-training with two generation confidence modeling methods: (i) the generated questions’ perplexity and (ii) the fluency score. We test our approaches on three large public datasets with different domain similarities, using a transformer-based pre-trained QG model. The results show that our proposed approaches outperform the baselines and demonstrate the viability of unsupervised domain adaptation with answer-aware data selection and self-training on the QG task.
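
The perplexity-guided selection step mentioned in this abstract can be illustrated with a minimal sketch. It assumes that each generated question comes with per-token natural-log probabilities from the QG model and that questions are kept when their perplexity is at or below a fixed threshold; the names `perplexity`, `select_confident`, and `max_perplexity`, the threshold value, and the toy data are hypothetical and not taken from the paper.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of one sequence from its per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def select_confident(candidates, max_perplexity):
    """Keep generated questions whose perplexity is at or below the threshold."""
    return [c for c in candidates if perplexity(c["logprobs"]) <= max_perplexity]


# Toy candidates (made-up values): the first is confidently generated, the second is not.
candidates = [
    {"question": "Who wrote the report?", "logprobs": [-0.1, -0.2, -0.1, -0.2]},
    {"question": "What the of is?", "logprobs": [-2.0, -3.0, -2.5, -2.5]},
]

# Only the low-perplexity question would be reused as pseudo-labeled training data.
pseudo_labeled = select_confident(candidates, max_perplexity=2.0)
```

In a self-training loop, the retained questions would be added to the training set for the next round; how the threshold is chosen is left open here.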

Answer Quality Aware Aggregation for Extractive QA Crowdsourcing
Peide Zhu | Zhen Wang | Claudia Hauff | Jie Yang | Avishek Anand
Findings of the Association for Computational Linguistics: EMNLP 2022

Quality control is essential for creating extractive question answering (EQA) datasets via crowdsourcing. Aggregating the answers (i.e., word spans within passages) annotated by different crowd workers is one major focus for ensuring dataset quality. However, crowd workers cannot reach a consensus on a considerable portion of questions. We introduce a simple yet effective answer aggregation method that takes into account the relations among the answer, question, and context passage. We evaluate answer quality from two views: that of a question answering model, to determine how confident the QA model is about each answer, and that of an answer verification model, to determine whether the answer is correct. Then we compute aggregation scores from each answer’s quality and its contextual embedding produced by pre-trained language models. Experiments on a large, real crowdsourced EQA dataset show that our framework outperforms baselines by around 16% on precision and effectively conducts answer aggregation for the extractive QA task.


On the Calibration and Uncertainty of Neural Learning to Rank Models for Conversational Search
Gustavo Penha | Claudia Hauff
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

According to the Probability Ranking Principle (PRP), ranking documents in decreasing order of their probability of relevance leads to an optimal document ranking for ad-hoc retrieval. The PRP holds when two conditions are met: [C1] the models are well calibrated, and [C2] the probabilities of relevance are reported with certainty. We know, however, that deep neural networks (DNNs) are often not well calibrated and have several sources of uncertainty, and thus [C1] and [C2] might not be satisfied by neural rankers. Given the success of neural Learning to Rank (LTR) approaches, especially BERT-based ones, we first analyze under which circumstances deterministic neural rankers are calibrated for conversational search problems. Then, motivated by our findings, we use two techniques to model the uncertainty of neural rankers, leading to the proposed stochastic rankers, which output a predictive distribution of relevance as opposed to point estimates. Our experimental results on the ad-hoc retrieval task of conversation response ranking reveal that (i) BERT-based rankers are not robustly calibrated and that stochastic BERT-based rankers yield better calibration; and (ii) uncertainty estimation is beneficial both for risk-aware neural ranking, i.e. taking into account the uncertainty when ranking documents, and for predicting unanswerable conversational contexts.
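
Condition [C1] can be made concrete with a small sketch of expected calibration error (ECE), one common way to quantify how far predicted relevance probabilities deviate from observed relevance rates. The abstract does not say which calibration metric the paper reports, so the choice of ECE, the equal-width binning, and the function name `expected_calibration_error` are assumptions made for illustration.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted average gap between mean predicted probability and observed
    relevance rate, computed over equal-width probability bins."""
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and of equal length")
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        # A probability of exactly 1.0 falls into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        confidence = sum(p for p, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / len(probs)) * abs(accuracy - confidence)
    return ece


# Toy predictions (made-up values): confident and correct, so the gap is small.
ece = expected_calibration_error([0.9, 0.9, 0.1, 0.1], [1, 1, 0, 0])
```

A perfectly calibrated ranker would have an ECE of zero; larger values indicate that the reported probabilities cannot be read as relevance rates.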


Slice-Aware Neural Ranking
Gustavo Penha | Claudia Hauff
Proceedings of the 5th International Workshop on Search-Oriented Conversational AI (SCAI)

Understanding when and why neural ranking models fail for an IR task via error analysis is an important part of the research cycle. Here we focus on the challenges of (i) identifying categories of difficult instances (a pair of question and response candidates) for which a neural ranker is ineffective and (ii) improving neural ranking for such instances. To address both challenges we resort to slice-based learning, for which the goal is to improve the effectiveness of neural models for slices (subsets) of data. We address challenge (i) by proposing different slicing functions (SFs) that select slices of the dataset; based on prior work, they heuristically capture different failures of neural rankers. Then, for challenge (ii), we adapt a neural ranking model to learn slice-aware representations, i.e. the adapted model learns to represent the question and responses differently based on the model’s prediction of which slices they belong to. Our experimental results across three different ranking tasks and four corpora (the source code and data are publicly available) show that slice-based learning improves effectiveness by an average of 2% over a neural ranker that is not slice-aware.
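
A slicing function, as described in this abstract, can be read as a predicate that decides whether an instance belongs to a slice. The sketch below illustrates that idea only; the two heuristics (`sf_long_question`, `sf_no_word_overlap`), the token threshold, and the toy instances are hypothetical and are not the SFs proposed in the paper.

```python
def sf_long_question(instance, min_tokens=6):
    """Hypothetical SF: the question has at least `min_tokens` whitespace tokens."""
    return len(instance["question"].split()) >= min_tokens


def sf_no_word_overlap(instance):
    """Hypothetical SF: question and response share no lowercase word."""
    question_words = set(instance["question"].lower().split())
    response_words = set(instance["response"].lower().split())
    return not (question_words & response_words)


def slice_membership(instances, slicing_functions):
    """Map each slicing function's name to the indices of the instances it selects."""
    return {
        sf.__name__: [i for i, inst in enumerate(instances) if sf(inst)]
        for sf in slicing_functions
    }


# Toy question-response pairs (made-up data).
instances = [
    {"question": "how do I reset my password",
     "response": "open settings and choose reset password"},
    {"question": "why is the sky blue",
     "response": "try restarting your router"},
]

slices = slice_membership(instances, [sf_long_question, sf_no_word_overlap])
```

Slices may overlap or leave instances unassigned; a slice-aware model would then consume these memberships as additional supervision.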


Feature Engineering for Second Language Acquisition Modeling
Guanliang Chen | Claudia Hauff | Geert-Jan Houben
Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications

Knowledge tracing serves as a keystone in delivering personalized education. However, few works have attempted to model students’ knowledge state in the setting of Second Language Acquisition. The Duolingo Shared Task on Second Language Acquisition Modeling provides students’ trace data, which we extensively analyze and engineer features from for the task of predicting whether a student will correctly solve a vocabulary exercise. Our analyses of students’ learning traces reveal that factors like exercise format and engagement impact their exercise performance to a large extent. Overall, we extracted 23 different features as input to a Gradient Tree Boosting framework, which resulted in an AUC score between 0.80 and 0.82 on the official test set.
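
For readers unfamiliar with the AUC metric reported above, the following sketch computes it directly from its pairwise definition: the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one, with ties counted as half. The function name `roc_auc` and the toy labels and scores are illustrative assumptions and do not come from the shared task data.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly,
    counting tied scores as half a correct pair."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy predictions (made-up values): three of the four pairs are ordered correctly.
auc = roc_auc([1, 0, 1, 0], [0.9, 0.3, 0.4, 0.6])
```

The quadratic pairwise loop is fine for illustration; on large test sets a rank-based implementation from an established library would be the practical choice.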


#SupportTheCause: Identifying Motivations to Participate in Online Health Campaigns
Dong Nguyen | Tijs van den Broek | Claudia Hauff | Djoerd Hiemstra | Michel Ehrenhard
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

On the Impact of Twitter-based Health Campaigns: A Cross-Country Analysis of Movember
Nugroho Dwi Prasetyo | Claudia Hauff | Dong Nguyen | Tijs van den Broek | Djoerd Hiemstra
Proceedings of the Sixth International Workshop on Health Text Mining and Information Analysis