Alberto Accomazzi
2025
Proceedings of the Third Workshop for Artificial Intelligence for Scientific Publications
Alberto Accomazzi | Tirthankar Ghosal | Felix Grezes | Kelly Lockhart
Proceedings of the Third Workshop for Artificial Intelligence for Scientific Publications
Overview of the Third Workshop for Artificial Intelligence for Scientific Publications
Kelly Lockhart | Alberto Accomazzi | Felix Grezes | Tirthankar Ghosal
Proceedings of the Third Workshop for Artificial Intelligence for Scientific Publications
The Workshop for Artificial Intelligence for Scientific Publications (WASP), formerly the Workshop on Information Extraction from Scientific Publications (WIESP), started in 2022 to provide a platform for researchers to discuss research on information extraction, mining, generation, and knowledge discovery from scientific publications using Natural Language Processing and Machine Learning techniques. The third WASP workshop was held at the 14th International Joint Conference on Natural Language Processing and 4th Asia-Pacific Chapter of the Association for Computational Linguistics in Mumbai, India, on December 23rd, 2025, as a hybrid event. The workshop drew 29 submissions, of which 16 were accepted. The program consisted of contributed research talks, two keynote talks, a panel discussion, and one shared task, the Telescope Reference and Astronomy Categorization Shared task (TRACS).
Overview of TRACS: the Telescope Reference and Astronomy Categorization Dataset & Shared Task
Felix Grezes | Jennifer Lynn Bartlett | Kelly Lockhart | Alberto Accomazzi | Ethan Seefried | Anjali Pandiri | Tirthankar Ghosal
Proceedings of the Third Workshop for Artificial Intelligence for Scientific Publications
To evaluate the scientific influence of observational facilities, astronomers examine the body of publications that have utilized data from those facilities. This depends on curated bibliographies that annotate and connect data products to the corresponding literature, enabling bibliometric analyses to quantify data impact. Compiling such bibliographies is a demanding process that requires expert curators to scan the literature for relevant names, acronyms, and identifiers, and then to determine whether and how specific observations contributed to each publication. These bibliographies have value beyond impact assessment: for research scientists, explicit links between data and literature form an essential pathway for discovering and accessing data. Accordingly, by building on the work of librarians and archivists, telescope bibliographies can be repurposed to directly support scientific inquiry. In this context, we present the Telescope Reference and Astronomy Categorization Shared task (TRACS) and its accompanying dataset, which comprises more than 89,000 publicly available English-language texts drawn from space telescope bibliographies. These texts are labeled according to a new, compact taxonomy developed in consultation with experienced bibliographers.
AstroMLab 5: Structured Summaries and Concept Extraction for 400,000 Astrophysics Papers
Yuan-Sen Ting | Alberto Accomazzi | Tirthankar Ghosal | Tuan Dung Nguyen | Rui Pan | Zechang Sun | Tijmen de Haan
Proceedings of the Third Workshop for Artificial Intelligence for Scientific Publications
We present a dataset of 408,590 astrophysics papers from arXiv (astro-ph), spanning 1992 through July 2025. Each paper has been processed through a multi-stage pipeline to produce: (1) structured summaries organized into six semantic sections (Background, Motivation, Methodology, Results, Interpretation, Implication), and (2) concept extraction yielding 9,999 unique concepts with detailed descriptions. The dataset contains 3.8 million paper-concept associations and includes semantic embeddings for all concepts. Comparison with traditional ADS keywords reveals that the concepts provide denser coverage and a more uniform distribution, while analysis of embedding space structure demonstrates that concepts are semantically dispersed within papers, enabling discovery through multiple diverse entry points. Concept vocabulary and embeddings are publicly released at https://github.com/tingyuansen/astro-ph_knowledge_graph.
2024
INDUS: Effective and Efficient Language Models for Scientific Applications
Bishwaranjan Bhattacharjee | Aashka Trivedi | Masayasu Muraoka | Muthukumaran Ramasubramanian | Takuma Udagawa | Iksha Gurung | Nishan Pantha | Rong Zhang | Bharath Dandala | Rahul Ramachandran | Manil Maskey | Kaylin Bugbee | Michael M. Little | Elizabeth Fancher | Irina Gerasimov | Armin Mehrabian | Lauren Sanders | Sylvain V. Costes | Sergi Blanco-Cuaresma | Kelly Lockhart | Thomas Allen | Felix Grezes | Megan Ansdell | Alberto Accomazzi | Yousef El-Kurdi | Davis Wertheimer | Birgit Pfitzmann | Cesar Berrospi Ramis | Michele Dolfi | Rafael Teixeira De Lima | Panagiotis Vagenas | S. Karthik Mukkavilli | Peter W. J. Staar | Sanaz Vahidinia | Ryan McGranaghan | Tsengdar J. Lee
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track
Large language models (LLMs) trained on general domain corpora showed remarkable results on natural language processing (NLP) tasks. However, previous research demonstrated that LLMs trained using domain-focused corpora perform better on specialized tasks. Inspired by this insight, we developed INDUS, a comprehensive suite of LLMs tailored for the closely related domains of Earth science, biology, physics, heliophysics, planetary sciences and astrophysics, and trained using curated scientific corpora drawn from diverse data sources. The suite of models includes: (1) an encoder model trained using domain-specific vocabulary and corpora to address NLP tasks, (2) a contrastive-learning-based text embedding model trained using a diverse set of datasets to address information retrieval tasks and (3) smaller versions of these models created using knowledge distillation for applications that have latency or resource constraints. We also created three new scientific benchmark datasets, Climate-Change NER (entity-recognition), NASA-QA (extractive QA) and NASA-IR (IR) to accelerate research in these multi-disciplinary fields. We show that our models outperform both general-purpose (RoBERTa) and domain-specific (SciBERT) encoders on these new tasks as well as existing tasks in the domains of interest. Furthermore, we demonstrate the use of these models in two industrial settings: as a retrieval model for large-scale vector search applications and in automatic content tagging systems.
2023
Proceedings of the Second Workshop on Information Extraction from Scientific Publications
Tirthankar Ghosal | Felix Grezes | Thomas Allen | Kelly Lockhart | Alberto Accomazzi | Sergi Blanco-Cuaresma
Proceedings of the Second Workshop on Information Extraction from Scientific Publications
AstroLLaMA: Towards Specialized Foundation Models in Astronomy
Tuan Dung Nguyen | Yuan-Sen Ting | Ioana Ciuca | Charles O’Neill | Ze-Chang Sun | Maja Jabłońska | Sandor Kruk | Ernest Perkowski | Jack Miller | Jason Jason Jingsh Li | Josh Peek | Kartheik Iyer | Tomasz Rozanski | Pranav Khetarpal | Sharaf Zaman | David Brodrick | Sergio J. Rodriguez Mendez | Thang Bui | Alyssa Goodman | Alberto Accomazzi | Jill Naiman | Jesse Cranney | Kevin Schawinski | Roberta Raileanu
Proceedings of the Second Workshop on Information Extraction from Scientific Publications
2022
Co-authors
- Tirthankar Ghosal 6
- Felix Grezes 6
- Kelly Lockhart 5
- Thomas Allen 3
- Sergi Blanco-Cuaresma 3
- Tuan Dung Nguyen 2
- Yuan-Sen Ting 2
- Megan Ansdell 1
- Jennifer Lynn Bartlett 1
- Cesar Berrospi Ramis 1
- Bishwaranjan Bhattacharjee 1
- David Brodrick 1
- Kaylin Bugbee 1
- Thang Bui 1
- Ioana Ciuca 1
- Sylvain V. Costes 1
- Jesse Cranney 1
- Bharath Dandala 1
- Rafael Teixeira De Lima 1
- Michele Dolfi 1
- Yousef El-Kurdi 1
- Elizabeth Fancher 1
- Irina Gerasimov 1
- Alyssa Goodman 1
- Iksha Gurung 1
- Kartheik Iyer 1
- Maja Jabłońska 1
- Pranav Khetarpal 1
- Sandor Kruk 1
- Tsengdar J. Lee 1
- Jason Jason Jingsh Li 1
- Michael M. Little 1
- Manil Maskey 1
- Ryan McGranaghan 1
- Armin Mehrabian 1
- Jack Miller 1
- S. Karthik Mukkavilli 1
- Masayasu Muraoka 1
- Jill Naiman 1
- Charles O’Neill 1
- Rui Pan 1
- Anjali Pandiri 1
- Nishan Pantha 1
- Robert M. Patton 1
- Josh Peek 1
- Ernest Perkowski 1
- Birgit Pfitzmann 1
- Roberta Raileanu 1
- Rahul Ramachandran 1
- Muthukumaran Ramasubramanian 1
- Sergio José Rodríguez Méndez 1
- Tomasz Rozanski 1
- Lauren Sanders 1
- Kevin Schawinski 1
- Ethan Seefried 1
- Peter W. J. Staar 1
- Ze-Chang Sun 1
- Zechang Sun 1
- Aashka Trivedi 1
- Takuma Udagawa 1
- Panagiotis Vagenas 1
- Sanaz Vahidinia 1
- Davis Wertheimer 1
- Sharaf Zaman 1
- Rong Zhang 1
- Tijmen de Haan 1