2025
Proceedings of the Third Workshop for Artificial Intelligence for Scientific Publications
Alberto Accomazzi | Tirthankar Ghosal | Felix Grezes | Kelly Lockhart
Overview of the Third Workshop for Artificial Intelligence for Scientific Publications
Kelly Lockhart | Alberto Accomazzi | Felix Grezes | Tirthankar Ghosal
Proceedings of the Third Workshop for Artificial Intelligence for Scientific Publications
The Workshop for Artificial Intelligence for Scientific Publications (WASP), formerly the Workshop on Information Extraction from Scientific Publications (WIESP), started in 2022 to provide a platform for researchers to discuss research on information extraction, mining, generation, and knowledge discovery from scientific publications using Natural Language Processing and Machine Learning techniques. The third WASP workshop was held at the 14th International Joint Conference on Natural Language Processing and 4th Asia-Pacific Chapter of the Association for Computational Linguistics in Mumbai, India on December 23rd, 2025, as a hybrid event. The workshop saw great interest, with 29 submissions, of which 16 were accepted. The program consisted of contributed research talks, two keynote talks, a panel discussion, and one shared task, the Telescope Reference and Astronomy Categorization Shared task (TRACS).
Overview of TRACS: the Telescope Reference and Astronomy Categorization Dataset & Shared Task
Felix Grezes | Jennifer Lynn Bartlett | Kelly Lockhart | Alberto Accomazzi | Ethan Seefried | Anjali Pandiri | Tirthankar Ghosal
Proceedings of the Third Workshop for Artificial Intelligence for Scientific Publications
To evaluate the scientific influence of observational facilities, astronomers examine the body of publications that have utilized data from those facilities. This depends on curated bibliographies that annotate and connect data products to the corresponding literature, enabling bibliometric analyses to quantify data impact. Compiling such bibliographies is a demanding process that requires expert curators to scan the literature for relevant names, acronyms, and identifiers, and then to determine whether and how specific observations contributed to each publication. These bibliographies have value beyond impact assessment: for research scientists, explicit links between data and literature form an essential pathway for discovering and accessing data. Accordingly, by building on the work of librarians and archivists, telescope bibliographies can be repurposed to directly support scientific inquiry. In this context, we present the Telescope Reference and Astronomy Categorization Shared task (TRACS) and its accompanying dataset, which comprises more than 89,000 publicly available English-language texts drawn from space telescope bibliographies. These texts are labeled according to a new, compact taxonomy developed in consultation with experienced bibliographers.
2024
INDUS: Effective and Efficient Language Models for Scientific Applications
Bishwaranjan Bhattacharjee | Aashka Trivedi | Masayasu Muraoka | Muthukumaran Ramasubramanian | Takuma Udagawa | Iksha Gurung | Nishan Pantha | Rong Zhang | Bharath Dandala | Rahul Ramachandran | Manil Maskey | Kaylin Bugbee | Michael M. Little | Elizabeth Fancher | Irina Gerasimov | Armin Mehrabian | Lauren Sanders | Sylvain V. Costes | Sergi Blanco-Cuaresma | Kelly Lockhart | Thomas Allen | Felix Grezes | Megan Ansdell | Alberto Accomazzi | Yousef El-Kurdi | Davis Wertheimer | Birgit Pfitzmann | Cesar Berrospi Ramis | Michele Dolfi | Rafael Teixeira De Lima | Panagiotis Vagenas | S. Karthik Mukkavilli | Peter W. J. Staar | Sanaz Vahidinia | Ryan McGranaghan | Tsengdar J. Lee
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track
Large language models (LLMs) trained on general domain corpora have shown remarkable results on natural language processing (NLP) tasks. However, previous research demonstrated that LLMs trained on domain-focused corpora perform better on specialized tasks. Inspired by this insight, we developed INDUS, a comprehensive suite of LLMs tailored for the closely-related domains of Earth science, biology, physics, heliophysics, planetary sciences and astrophysics, and trained using curated scientific corpora drawn from diverse data sources. The suite of models includes: (1) an encoder model trained using domain-specific vocabulary and corpora to address NLP tasks, (2) a contrastive-learning based text embedding model trained using a diverse set of datasets to address information retrieval tasks and (3) smaller versions of these models created using knowledge distillation for applications which have latency or resource constraints. We also created three new scientific benchmark datasets, Climate-Change NER (entity recognition), NASA-QA (extractive QA) and NASA-IR (IR), to accelerate research in these multi-disciplinary fields. We show that our models outperform both general-purpose (RoBERTa) and domain-specific (SciBERT) encoders on these new tasks as well as existing tasks in the domains of interest. Furthermore, we demonstrate the use of these models in two industrial settings: as a retrieval model for large-scale vector search applications and in automatic content tagging systems.
astroECR: Enriching an Astrophysics Corpus with Named Entities, Coreferences, and Semantic Relations
Atilla Kaan Alkan | Felix Grezes | Cyril Grouin | Fabian Schüssler | Pierre Zweigenbaum
Actes de la 31ème Conférence sur le Traitement Automatique des Langues Naturelles, volume 1 : articles longs et prises de position
The lack of annotated resources is a major challenge for natural language processing in astrophysics. To fill this gap, we present astroECR, an extension of the TDAC (Time-Domain Astrophysics Corpus). Our corpus, consisting of 300 observation reports in English, extends the initial TDAC annotation scheme by introducing five additional astrophysics-specific named-entity classes. We enriched the annotations by including coreferences and semantic relations between celestial objects and their physical properties, and by normalizing celestial object names against astronomical databases. The usefulness of our corpus is demonstrated by providing baseline scores across four tasks: named entity recognition, coreference resolution, relation detection, and celestial object name normalization. We release the corpus along with its annotation guide, source code, and associated models.
Enriching a Time-Domain Astrophysics Corpus with Named Entity, Coreference and Astrophysical Relationship Annotations
Atilla Kaan Alkan | Felix Grezes | Cyril Grouin | Fabian Schussler | Pierre Zweigenbaum
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Interest in Astrophysical Natural Language Processing (NLP) has increased recently, fueled by the development of specialized language models for information extraction. However, the scarcity of annotated resources for this domain is still a significant challenge. Most existing corpora are limited to Named Entity Recognition (NER) tasks, leaving a gap in resource diversity. To address this gap and facilitate a broader spectrum of NLP research in astrophysics, we introduce astroECR, an extension of our previously built Time-Domain Astrophysics Corpus (TDAC). Our contributions involve expanding it to cover named entities, coreferences, annotations related to astrophysical relationships, and normalizing celestial object names. We showcase practical utility through baseline models for four NLP tasks and provide the research community access to our corpus, code, and models.
2023
Proceedings of the Second Workshop on Information Extraction from Scientific Publications
Tirthankar Ghosal | Felix Grezes | Thomas Allen | Kelly Lockhart | Alberto Accomazzi | Sergi Blanco-Cuaresma
Function of Citation in Astrophysics Literature (FOCAL): Findings of the Shared Task
Felix Grezes | Thomas Allen | Tirthankar Ghosal | Sergi Blanco-Cuaresma
Proceedings of the Second Workshop on Information Extraction from Scientific Publications
2022
Proceedings of the First Workshop on Information Extraction from Scientific Publications
Tirthankar Ghosal | Sergi Blanco-Cuaresma | Alberto Accomazzi | Robert M. Patton | Felix Grezes | Thomas Allen
Overview of the First Shared Task on Detecting Entities in the Astrophysics Literature (DEAL)
Felix Grezes | Sergi Blanco-Cuaresma | Thomas Allen | Tirthankar Ghosal
Proceedings of the First Workshop on Information Extraction from Scientific Publications
In this article, we present an overview of our shared task: Detecting Entities in the Astrophysics Literature (DEAL). The DEAL shared task was part of the Workshop on Information Extraction from Scientific Publications (WIESP) at AACL-IJCNLP 2022. Information extraction from scientific publications is critical for several downstream tasks such as identification of key entities, article summarization, and citation classification. The motivation of this shared task was to develop a community-wide effort for entity extraction from the astrophysics literature. Automated entity extraction would help to build knowledge bases, high-quality metadata for indexing and search, and several other use cases of interest. Thirty-three teams registered for DEAL, twelve of them participated in the system runs, and four teams submitted system descriptions. We analyze their systems and performance, and discuss the findings of DEAL.