Junxian Li
2025
Biology-Instructions: A Dataset and Benchmark for Multi-Omics Sequence Understanding Capability of Large Language Models
Haonan He | Yuchen Ren | Yining Tang | Ziyang Xu | Junxian Li | Minghao Yang | Di Zhang | Yuan Dong | Tao Chen | Shufei Zhang | Yuqiang Li | Nanqing Dong | Wanli Ouyang | Dongzhan Zhou | Peng Ye
Findings of the Association for Computational Linguistics: EMNLP 2025
Large language models (LLMs) have shown remarkable capabilities in general domains, but their application to multi-omics biology remains underexplored. To address this gap, we introduce Biology-Instructions, the first large-scale instruction-tuning dataset for multi-omics biological sequences, including DNA, RNA, proteins, and multi-molecules. This dataset bridges LLMs and complex biological sequence-related tasks, enhancing their versatility and reasoning while maintaining conversational fluency. We also highlight significant limitations of current state-of-the-art LLMs on multi-omics tasks without specialized training. To overcome this, we propose ChatMultiOmics, a strong baseline with a novel three-stage training pipeline, demonstrating superior biological understanding through Biology-Instructions. Both resources are publicly available, paving the way for better integration of LLMs in multi-omics analysis. Biology-Instructions is available at: https://github.com/hhnqqq/Biology-Instructions.
2024
Large Language Models for Automated Open-domain Scientific Hypotheses Discovery
Zonglin Yang | Xinya Du | Junxian Li | Jie Zheng | Soujanya Poria | Erik Cambria
Findings of the Association for Computational Linguistics: ACL 2024
Hypothetical induction is recognized as the main reasoning type when scientists make observations about the world and try to propose hypotheses to explain those observations. Past research on hypothetical induction is under a constrained setting: (1) the observation annotations in the dataset are carefully manually handpicked sentences (resulting in a closed-domain setting); and (2) the ground truth hypotheses are mostly commonsense knowledge, making the task less challenging. In this work, we tackle these problems by proposing the first dataset for social science academic hypotheses discovery, with the final goal of creating systems that automatically generate valid, novel, and helpful scientific hypotheses, given only a pile of raw web corpus. Unlike previous settings, the new dataset requires (1) using open-domain data (raw web corpus) as observations; and (2) proposing hypotheses even new to humanity. A multi-module framework is developed for the task, including three different feedback mechanisms to boost performance, which exhibits superior performance in terms of both GPT-4-based and expert-based evaluation. To the best of our knowledge, this is the first work showing that LLMs are able to generate novel ("not existing in literature") and valid ("reflecting reality") scientific hypotheses.