ORBIT: Cost-Effective Dataset Curation for Large Language Model Domain Adaptation with an Astronomy Case Study
Eric Modesitt, Ke Yang, Spencer Hulsey, Xin Liu, ChengXiang Zhai, Volodymyr Kindratenko
Abstract
Recent advances in language modeling demonstrate the need for high-quality domain-specific training data, especially for tasks that require specialized knowledge. General-purpose models, while versatile, often lack the depth needed for expert-level tasks because of limited domain-specific information. Domain adaptation training can enhance these models, but it demands substantial, high-quality data. To address this, we propose ORBIT, a cost-efficient methodology for curating massive, high-quality domain-specific datasets from noisy web sources, tailored for training specialist large language models. Using astronomy as a primary case study, we refined the 1.3T-token FineWeb-Edu dataset into a high-quality, 10B-token subset focused on astronomy. Fine-tuning LLaMA-3-8B on a 1B-token astronomy subset improved performance on the MMLU astronomy benchmark from 69% to 76% and achieved top results on AstroBench, an astronomy-specific benchmark. Moreover, our model (Orbit-LLaMA) outperformed LLaMA-3-8B-base, with GPT-4o evaluations preferring it in 73% of cases across 1000 astronomy-specific questions. Additionally, we validated ORBIT's generalizability by applying it to law and medicine, achieving a significant improvement in data quality over an unfiltered baseline. We open-source the ORBIT methodology, including the curated datasets, the codebase, and the resulting model.
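As a rough illustration of the kind of curation pipeline the abstract describes, the sketch below streams FineWeb-Edu and keeps documents whose embedding similarity to a small astronomy seed description exceeds a threshold. This is a minimal sketch under stated assumptions, not the paper's actual pipeline: the seed text, the embedding model (all-MiniLM-L6-v2), and the threshold are illustrative choices, and the released ORBIT codebase should be consulted for the real implementation.

```python
# Hypothetical sketch of a domain-relevance filter in the spirit of ORBIT.
# NOT the paper's actual pipeline: the seed prompt, embedding model, and
# threshold below are illustrative assumptions.
import numpy as np
from datasets import load_dataset
from sentence_transformers import SentenceTransformer

# A small embedding model keeps per-document cost low (the "cost-effective" goal).
encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Assumed domain seed: a short description whose embedding anchors "astronomy".
seed = ("Astronomy and astrophysics: stars, galaxies, cosmology, telescopes, "
        "stellar evolution, exoplanets, and observational techniques.")
seed_vec = encoder.encode(seed, normalize_embeddings=True)

THRESHOLD = 0.35  # assumed cutoff; in practice tuned on held-out labeled docs

def is_astronomy(texts):
    """Return a boolean mask: True where a document looks astronomy-related."""
    vecs = encoder.encode(texts, normalize_embeddings=True)
    return (vecs @ seed_vec) >= THRESHOLD  # cosine similarity of unit vectors

# Stream FineWeb-Edu so the full corpus never has to fit in memory.
stream = load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT",
                      split="train", streaming=True)

kept, batch = [], []
for doc in stream:
    batch.append(doc["text"])
    if len(batch) == 64:
        kept.extend(t for t, keep in zip(batch, is_astronomy(batch)) if keep)
        batch.clear()
    if len(kept) >= 100:  # stop early for this demo
        break

print(f"kept {len(kept)} astronomy-like documents")
```

A single similarity cutoff like this is only a first pass; a production pipeline would typically add a second-stage quality filter and deduplication before the retained subset is used for fine-tuning.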
- Anthology ID: 2025.findings-acl.51
- Volume: Findings of the Association for Computational Linguistics: ACL 2025
- Month: July
- Year: 2025
- Address: Vienna, Austria
- Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
- Venues: Findings | WS
- Publisher: Association for Computational Linguistics
- Pages: 907–926
- URL: https://preview.aclanthology.org/ingestion-acl-25/2025.findings-acl.51/
- Cite (ACL): Eric Modesitt, Ke Yang, Spencer Hulsey, Xin Liu, ChengXiang Zhai, and Volodymyr Kindratenko. 2025. ORBIT: Cost-Effective Dataset Curation for Large Language Model Domain Adaptation with an Astronomy Case Study. In Findings of the Association for Computational Linguistics: ACL 2025, pages 907–926, Vienna, Austria. Association for Computational Linguistics.
- Cite (Informal): ORBIT: Cost-Effective Dataset Curation for Large Language Model Domain Adaptation with an Astronomy Case Study (Modesitt et al., Findings 2025)
- PDF: https://preview.aclanthology.org/ingestion-acl-25/2025.findings-acl.51.pdf