2023
Enhancing Extreme Multi-Label Text Classification: Addressing Challenges in Model, Data, and Evaluation
Dan Li | Zi Long Zhu | Janneke van de Loo | Agnes Masip Gomez | Vikrant Yadav | Georgios Tsatsaronis | Zubair Afzal
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track
Extreme multi-label text classification is a prevalent task in industry, but from a machine learning perspective it frequently runs into challenges, including model limitations, data scarcity, and time-consuming evaluation. This paper aims to mitigate these issues by introducing novel approaches. First, we propose a label ranking model as an alternative to the conventional SciBERT-based classification model, enabling efficient handling of large-scale labels and accommodating new labels. Second, we present an active learning-based pipeline that addresses the data scarcity of new labels during the update of a classification system. Finally, we introduce ChatGPT to assist with model evaluation. Our experiments demonstrate the effectiveness of these techniques in enhancing the extreme multi-label text classification task.
2020
CORA: A Deep Active Learning Covid-19 Relevancy Algorithm to Identify Core Scientific Articles
Zubair Afzal | Vikrant Yadav | Olga Fedorova | Vaishnavi Kandala | Janneke van de Loo | Saber A. Akhondi | Pascal Coupet | George Tsatsaronis
Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020
Ever since the COVID-19 pandemic broke out, the academic and scientific research community, as well as industry and governments around the world, have joined forces in an unprecedented manner to fight the threat. Clinicians, biologists, chemists, bioinformaticians, nurses, data scientists, and all of the affiliated relevant disciplines have been mobilized to help discover efficient treatments for the infected population, as well as a vaccine to prevent further spread of the virus. In this combat against the virus responsible for the pandemic, the key to any advancement is the timely, accurate, peer-reviewed, and efficient communication of novel research findings. In this paper we present a novel framework to address the information need of efficiently filtering the scientific literature for work relevant to COVID-19. The contributions of the paper are summarized as follows: we define and describe the information need that encompasses the major requirements for COVID-19 article relevancy, we present and release an expert-curated benchmark set for the task, and we analyze the performance of several state-of-the-art machine learning classifiers that may distinguish the relevant from the non-relevant COVID-19 literature.
2013
A Self Learning Vocal Interface for Speech-impaired Users
Bart Ons | Netsanet Tessema | Janneke van de Loo | Jort Gemmeke | Guy De Pauw | Walter Daelemans | Hugo Van hamme
Proceedings of the Fourth Workshop on Speech and Language Processing for Assistive Technologies
2012
Towards a Self-Learning Assistive Vocal Interface: Vocabulary and Grammar Learning
Janneke van de Loo | Jort F. Gemmeke | Guy De Pauw | Joris Driesen | Hugo Van hamme | Walter Daelemans
Proceedings of the 1st Workshop on Speech and Multimodal Interaction in Assistive Environments
The Netlog Corpus. A Resource for the Study of Flemish Dutch Internet Language
Mike Kestemont | Claudia Peersman | Benny De Decker | Guy De Pauw | Kim Luyckx | Roser Morante | Frederik Vaassen | Janneke van de Loo | Walter Daelemans
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Although in recent years numerous forms of Internet communication, such as e-mail, blogs, chat rooms and social network environments, have emerged, balanced corpora of Internet speech with trustworthy meta-information (e.g. age and gender) or linguistic annotations are still limited. In this paper we present a large corpus of Flemish Dutch chat posts that were collected from the Belgian online social network Netlog. For all of these posts we also acquired the users' profile information, making this corpus a unique resource for computational and sociolinguistic research. However, for analyzing such a corpus on a large scale, NLP tools are required, e.g. for automatic POS tagging or lemmatization. Because many NLP tools fail to correctly analyze the surface forms of chat language usage, we propose to normalize this 'anomalous' input into a format suitable for existing NLP solutions for standard Dutch. Additionally, we have annotated a substantial part of the corpus (i.e. the Chatty subset) to provide a gold standard for the evaluation of future approaches to automatic (Flemish) chat language normalization.