2025
ELAINE-medLLM: Lightweight English Japanese Chinese Trilingual Large Language Model for Bio-medical Domain
Ken Yano | Zheheng Luo | Jimin Huang | Qianqian Xie | Masaki Asada | Chenhan Yuan | Kailai Yang | Makoto Miwa | Sophia Ananiadou | Jun’ichi Tsujii
Proceedings of the 31st International Conference on Computational Linguistics
We propose ELAINE (EngLish-jApanese-chINesE)-medLLM, a trilingual (English, Japanese, Chinese) large language model adapted to the bio-medical domain based on Llama-3-8B. The training dataset was carefully curated in terms of volume and diversity to adapt the model to the biomedical domain and endow it with trilingual capability while preserving the knowledge and abilities of the base model. Training follows a two-stage path: continued pre-training followed by supervised fine-tuning (SFT). Our results demonstrate that ELAINE-medLLM exhibits superior trilingual capabilities compared to existing bilingual or multilingual medical LLMs without severely sacrificing the base model’s capability.
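The two-stage adaptation described above can be illustrated with a minimal PyTorch sketch; the tiny GRU language model, toy batches, and the response-only loss mask are illustrative assumptions, not the paper’s actual Llama-3-8B training setup.

```python
import torch
import torch.nn as nn

# Toy stand-in for a causal LM; the paper adapts Llama-3-8B, which is assumed here
# only conceptually -- any autoregressive LM with a token-level loss fits this loop.
class TinyLM(nn.Module):
    def __init__(self, vocab=100, dim=32):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab)

    def forward(self, ids):
        h, _ = self.rnn(self.emb(ids))
        return self.head(h)

model = TinyLM()
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss(reduction="none")

def step(ids, loss_mask):
    # Next-token prediction; loss_mask selects which positions contribute.
    logits = model(ids[:, :-1])
    loss = loss_fn(logits.transpose(1, 2), ids[:, 1:])
    loss = (loss * loss_mask[:, 1:]).sum() / loss_mask[:, 1:].sum()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Stage 1: continued pre-training -- loss on every token of domain text.
domain_batch = torch.randint(0, 100, (4, 16))
step(domain_batch, torch.ones_like(domain_batch, dtype=torch.float))

# Stage 2: SFT -- loss only on the response span of each (prompt, response) pair.
sft_batch = torch.randint(0, 100, (4, 16))
mask = torch.zeros_like(sft_batch, dtype=torch.float)
mask[:, 8:] = 1.0  # assume the last 8 tokens are the response
step(sft_batch, mask)
```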
VTechAGP: An Academic-to-General-Audience Text Paraphrase Dataset and Benchmark Models
Ming Cheng | Jiaying Gong | Chenhan Yuan | William A Ingram | Edward Fox | Hoda Eldardiry
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Existing text simplification or paraphrase datasets mainly focus on sentence-level text generation in a general domain. These datasets are typically developed without using domain knowledge. In this paper, we release a novel dataset, VTechAGP, which is the first academic-to-general-audience text paraphrase dataset, consisting of document-level thesis and dissertation academic and general-audience abstract pairs from 8 colleges authored over 25 years. We also propose a novel dynamic soft prompt generative language model, DSPT5. For training, we leverage a contrastive-generative loss function to learn the keyword vectors in the dynamic prompt. For inference, we adopt a crowd-sampling decoding strategy at both semantic and structural levels to further select the best output candidate. We evaluate DSPT5 and various state-of-the-art large language models (LLMs) from multiple perspectives. Results demonstrate that the SOTA LLMs do not provide satisfactory outcomes, while the lightweight DSPT5 achieves competitive results. To the best of our knowledge, we are the first to build a benchmark dataset and solutions for academic-to-general-audience text paraphrasing. Models will be made public after acceptance.
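One way to picture the crowd-sampling selection step is sketched below; the bag-of-words similarity, the length-based structural score, and the weights are illustrative assumptions, not the authors’ DSPT5 implementation.

```python
import numpy as np
from collections import Counter

def bow_vector(text, vocab):
    # Toy semantic representation: term-frequency vector over a shared vocabulary.
    counts = Counter(text.lower().split())
    return np.array([counts[w] for w in vocab], dtype=float)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def crowd_select(candidates, semantic_weight=0.7, structural_weight=0.3):
    """Pick the candidate most 'agreed upon' by the rest of the sampled pool."""
    vocab = sorted({w for c in candidates for w in c.lower().split()})
    vecs = [bow_vector(c, vocab) for c in candidates]
    lengths = np.array([len(c.split()) for c in candidates], dtype=float)
    scores = []
    for i in range(len(candidates)):
        others = [j for j in range(len(candidates)) if j != i]
        semantic = np.mean([cosine(vecs[i], vecs[j]) for j in others])
        # Structural agreement: closeness of length to the pool median.
        structural = 1.0 / (1.0 + abs(lengths[i] - np.median(lengths)))
        scores.append(semantic_weight * semantic + structural_weight * structural)
    return candidates[int(np.argmax(scores))]

candidates = [
    "The study explains how proteins fold in simple terms.",
    "Protein folding is described for a general audience.",
    "An unrelated sentence about the weather today.",
]
print(crowd_select(candidates))  # picks a candidate close to the pool consensus
```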
CAST: Corpus-Aware Self-similarity Enhanced Topic modelling
Yanan Ma | Chenghao Xiao | Chenhan Yuan | Sabine N Van Der Veer | Lamiece Hassan | Chenghua Lin | Goran Nenadic
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Topic modelling is a pivotal unsupervised machine learning technique for extracting valuable insights from large document collections. Existing neural topic modelling methods often encode contextual information of documents while ignoring contextual details of candidate centroid words, leading to inaccurate selection of topic words due to the *contextualization gap*. In parallel, functional words are frequently selected over topical words. To address these limitations, we introduce **CAST**: **C**orpus-**A**ware **S**elf-similarity Enhanced **T**opic modelling, a novel topic modelling method that builds on candidate centroid word embeddings contextualized on the dataset and a novel self-similarity-based method to filter out less meaningful tokens. Inspired by findings in contrastive learning that self-similarities of functional token embeddings in different contexts are much lower than those of topical tokens, we find self-similarity to be an effective metric for preventing functional words from acting as candidate topic words. Our approach significantly enhances the coherence and diversity of generated topics, as well as the topic model’s ability to handle noisy data. Experiments on news benchmark datasets and one Twitter dataset demonstrate the method’s superiority in generating coherent, diverse topics and handling noisy data, outperforming strong baselines.
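A minimal sketch of the self-similarity filter described above, assuming toy random vectors in place of real contextual embeddings from a pre-trained encoder; the 0.5 threshold is illustrative, not a value from the paper.

```python
import numpy as np

def self_similarity(contextual_embeddings):
    """Mean pairwise cosine similarity of one token's embeddings across contexts."""
    E = np.asarray(contextual_embeddings, dtype=float)
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    sims = E @ E.T
    n = len(E)
    # Average over off-diagonal pairs only (the diagonal is always 1).
    return (sims.sum() - n) / (n * (n - 1))

rng = np.random.default_rng(0)
# Toy setup: a "topical" token whose embeddings cluster around one direction,
# and a "functional" token whose embeddings scatter across contexts.
base = rng.normal(size=16)
topical = [base + 0.1 * rng.normal(size=16) for _ in range(20)]
functional = [rng.normal(size=16) for _ in range(20)]

threshold = 0.5  # illustrative cut-off
for name, embs in [("topical", topical), ("functional", functional)]:
    s = self_similarity(embs)
    keep = "keep as candidate topic word" if s > threshold else "filter out"
    print(f"{name}: self-similarity={s:.2f} -> {keep}")
```

Only the gap between the two scores matters here: tokens whose contextual embeddings stay consistent across contexts survive as candidate topic words, while context-dependent functional tokens are filtered out.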
2024
Predicting Rewards Alongside Tokens: Non-disruptive Parameter Insertion for Efficient Inference Intervention in Large Language Model
Chenhan Yuan | Fei Huang | Ru Peng | Keming Lu | Bowen Yu | Chang Zhou | Jingren Zhou
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Transformer-based large language models (LLMs) exhibit limitations such as generating unsafe responses, unreliable reasoning, etc. Existing inference intervention approaches attempt to mitigate these issues by finetuning additional models to produce calibration signals (such as rewards) that guide the LLM’s decoding process. However, this solution introduces substantial time and space overhead due to the separate models required. This work proposes Non-disruptive parameter insertion (Otter), which inserts extra parameters into the transformer architecture to predict calibration signals along with the original LLM output. Otter offers state-of-the-art performance on multiple demanding tasks while saving up to 86.5% extra space and 98.5% extra time. Furthermore, Otter seamlessly integrates with existing inference engines, requiring only a one-line code change, and the original model response remains accessible after the parameter insertion.
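The idea of predicting calibration signals alongside the original output can be sketched as follows; this simplified version attaches a separate reward head to a toy transformer’s hidden states rather than inserting parameters inside the transformer blocks as Otter does, so it only illustrates the non-disruption property.

```python
import torch
import torch.nn as nn

class ToyCausalLM(nn.Module):
    """Stand-in for a frozen pretrained transformer LM."""
    def __init__(self, vocab=100, dim=32):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, dropout=0.0, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, ids):
        h = self.encoder(self.emb(ids))
        return self.lm_head(h), h

class WithRewardHead(nn.Module):
    """Adds trainable parameters that read the hidden states and emit a
    per-position calibration signal, leaving the original logits untouched."""
    def __init__(self, base):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # the base model stays frozen
        self.reward_head = nn.Linear(32, 1)

    def forward(self, ids):
        logits, h = self.base(ids)
        return logits, self.reward_head(h).squeeze(-1)

base = ToyCausalLM()
model = WithRewardHead(base)
ids = torch.randint(0, 100, (2, 8))
logits, rewards = model(ids)
ref_logits, _ = base(ids)
print(torch.allclose(logits, ref_logits))  # True: original response preserved
print(rewards.shape)                       # torch.Size([2, 8]) calibration signals
```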
FinNLP-AgentScen-2024 Shared Task: Financial Challenges in Large Language Models - FinLLMs
Qianqian Xie | Jimin Huang | Dong Li | Zhengyu Chen | Ruoyu Xiang | Mengxi Xiao | Yangyang Yu | Vijayasai Somasundaram | Kailai Yang | Chenhan Yuan | Zheheng Luo | Zhiwei Liu | Yueru He | Yuechen Jiang | Haohang Li | Duanyu Feng | Xiao-Yang Liu | Benyou Wang | Hao Wang | Yanzhao Lai | Jordan Suchow | Alejandro Lopez-Lira | Min Peng | Sophia Ananiadou
Proceedings of the Eighth Financial Technology and Natural Language Processing and the 1st Agent AI for Scenario Planning
2023
Zero-shot Temporal Relation Extraction with ChatGPT
Chenhan Yuan | Qianqian Xie | Sophia Ananiadou
The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks
The goal of temporal relation extraction is to infer the temporal relation between two events in a document. Supervised models are dominant in this task. In this work, we investigate ChatGPT’s ability at zero-shot temporal relation extraction. We designed three different prompt techniques to break down the task and evaluate ChatGPT. Our experiments show that ChatGPT’s performance has a large gap from that of supervised methods and relies heavily on the design of prompts. We further demonstrate that ChatGPT can correctly infer more instances of small relation classes than supervised methods. The current shortcomings of ChatGPT on temporal relation extraction are also discussed in this paper. We found that ChatGPT cannot maintain consistency during temporal inference and fails at long-dependency temporal inference.
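A minimal sketch of zero-shot prompting for this task is shown below; the label set, the prompt template, and the query_llm placeholder are illustrative assumptions rather than the prompts evaluated in the paper.

```python
# Zero-shot temporal relation extraction via prompting, in schematic form.
LABELS = ["BEFORE", "AFTER", "SIMULTANEOUS", "VAGUE"]

def build_prompt(document, event_a, event_b):
    return (
        "Read the document and answer with exactly one label "
        f"from {LABELS}.\n\n"
        f"Document: {document}\n"
        f"Question: What is the temporal relation between the event "
        f"'{event_a}' and the event '{event_b}'?\n"
        "Answer:"
    )

def parse_label(response):
    # Take the first known label mentioned in the reply; fall back to VAGUE.
    upper = response.upper()
    for label in LABELS:
        if label in upper:
            return label
    return "VAGUE"

def query_llm(prompt):
    # Placeholder: wire this to an actual chat-model API in practice.
    return "BEFORE"

doc = "The company announced layoffs. Its stock fell the next morning."
prompt = build_prompt(doc, "announced", "fell")
print(parse_label(query_llm(prompt)))  # -> BEFORE
```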
2021
Unsupervised Relation Extraction: A Variational Autoencoder Approach
Chenhan Yuan | Hoda Eldardiry
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Unsupervised relation extraction works by clustering entity pairs that have the same relations in the text. Some existing variational autoencoder (VAE)-based approaches train the relation extraction model as an encoder that generates relation classifications. A decoder is trained along with the encoder to reconstruct the encoder input based on the encoder-generated relation classifications. Because these classifications are treated as a latent variable, they are required to follow a pre-defined prior distribution, which results in unstable training. We propose a VAE-based unsupervised relation extraction technique that overcomes this limitation by using the classifications as an intermediate variable instead of a latent variable. Specifically, classifications are conditioned on the sentence input, while the latent variable is conditioned on both the classifications and the sentence input. This allows our model to connect the decoder with the encoder without putting restrictions on the classification distribution, which improves training stability. Our approach is evaluated on the NYT dataset and outperforms state-of-the-art methods.
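A schematic PyTorch sketch of the described factorization, with the relation classification as an intermediate variable and a Gaussian latent conditioned on it; the dimensions, losses, and toy inputs are assumptions, not the paper’s architecture details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IntermediateClassVAE(nn.Module):
    """Schematic: the relation classification is an intermediate variable conditioned
    on the sentence; the Gaussian latent is conditioned on both, so no prior is
    imposed on the classification distribution itself."""
    def __init__(self, sent_dim=64, n_relations=10, latent_dim=16):
        super().__init__()
        self.classifier = nn.Linear(sent_dim, n_relations)          # encoder
        self.to_mu = nn.Linear(sent_dim + n_relations, latent_dim)
        self.to_logvar = nn.Linear(sent_dim + n_relations, latent_dim)
        self.decoder = nn.Linear(latent_dim + n_relations, sent_dim)

    def forward(self, sent):
        rel_probs = F.softmax(self.classifier(sent), dim=-1)  # intermediate variable
        joint = torch.cat([sent, rel_probs], dim=-1)
        mu, logvar = self.to_mu(joint), self.to_logvar(joint)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        recon = self.decoder(torch.cat([z, rel_probs], dim=-1))
        # Only the Gaussian latent carries a KL term; rel_probs stays unconstrained.
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return F.mse_loss(recon, sent) + kl, rel_probs

model = IntermediateClassVAE()
sent_repr = torch.randn(8, 64)  # toy sentence representations for entity pairs
loss, rel_probs = model(sent_repr)
loss.backward()
print(rel_probs.argmax(dim=-1))  # predicted relation cluster per input
```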
2019
Efficient text generation of user-defined topic using generative adversarial networks
Chenhan Yuan | Yi-Chin Huang | Cheng-Hung Tsai
Proceedings of the 4th Workshop on Computational Creativity in Language Generation