The lack of annotated resources is a major challenge for natural language processing in astrophysics. To fill this gap, we present astroECR, an extension of the TDAC corpus (Time-Domain Astrophysics Corpus). Our corpus, made up of 300 observation reports in English, extends TDAC's initial annotation scheme by introducing five additional named entity classes specific to astrophysics. We enriched the annotations with coreferences and with semantic relations between celestial objects and their physical properties, and we normalized celestial object names against astronomical databases. We demonstrate the usefulness of our corpus by providing baseline scores on four tasks: named entity recognition, coreference resolution, relation detection, and celestial object name normalization. We release the corpus along with its annotation guide, source code, and associated models.
Interest in Astrophysical Natural Language Processing (NLP) has increased recently, fueled by the development of specialized language models for information extraction. However, the scarcity of annotated resources for this domain remains a significant challenge. Most existing corpora are limited to Named Entity Recognition (NER), leaving a gap in resource diversity. To address this gap and facilitate a broader spectrum of NLP research in astrophysics, we introduce astroECR, an extension of our previously built Time-Domain Astrophysics Corpus (TDAC). Our contributions extend it with annotations for named entities, coreferences, and astrophysical relations, together with normalized celestial object names. We demonstrate its practical utility through baseline models for four NLP tasks and give the research community access to our corpus, code, and models.
The growing interest in time-domain astronomy over the last decades has led to a substantial increase in the publication of observation reports, saturating astrophysicists' capacity to read, analyze, and classify information. Because detected astronomical events are short-lived, information characterizing new phenomena has to be communicated and analyzed very rapidly so that other observatories can react and conduct follow-up observations. This paper introduces TDAC, the first corpus in time-domain astrophysics, based on observation reports. We also present our named entity recognition experiments, based on our own annotations and on annotations from the WIESP NLP Challenge.
Detecting Entities in the Astrophysics Literature (DEAL) is a shared task proposed within the first Workshop on Information Extraction from Scientific Publications (WIESP) at AACL-IJCNLP 2022. It calls for systems that identify astrophysical named entities. This article presents our system, based on majority voting over an ensemble of multiple SciBERT models. Our system ranked second and outperformed the baseline provided by the organisers, achieving an F1 score of 0.7993 and a Matthews Correlation Coefficient (MCC) of 0.8978 in the testing phase.
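To illustrate the majority voting idea mentioned above, the following is a minimal sketch and not the system's actual implementation. It assumes each ensemble member has already produced one label per token; the function name `majority_vote`, the example labels, and the tie-breaking rule (the earliest model in the list wins) are all hypothetical choices made for illustration.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[str]]) -> List[str]:
    """Combine per-token label sequences from several models by majority vote.

    predictions[m][t] is the label that model m assigns to token t.
    Ties are broken in favour of the earliest model in the list
    (an assumption of this sketch, not something stated in the abstract).
    """
    if not predictions:
        return []
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all models must label the same number of tokens")

    voted = []
    for token_labels in zip(*predictions):
        counts = Counter(token_labels)
        best = max(counts.values())
        # The first label, in model order, that reaches the top count wins.
        voted.append(next(lab for lab in token_labels if counts[lab] == best))
    return voted


# Hypothetical outputs of three models on a four-token sentence.
model_outputs = [
    ["B-CelestialObject", "I-CelestialObject", "O", "B-Telescope"],
    ["B-CelestialObject", "O", "O", "B-Telescope"],
    ["B-CelestialObject", "I-CelestialObject", "O", "O"],
]
ensemble_labels = majority_vote(model_outputs)
print(ensemble_labels)
```

Voting independently per token keeps the sketch simple, but it can yield label sequences that are not valid under a BIO scheme; a real system might vote over entity spans or repair the sequence afterwards.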