Chia-Hsuan Chang
2026
CoreELM: An Open-Source Framework for Aligning Large Language Models to Embedding Spaces
Brian Ondov | Chia-Hsuan Chang | Yujia Zhou | Mauro Giuffrè | Hua Xu
BioNLP 2026
Text embeddings have become an essential part of a variety of language applications. However, methods for interpreting, exploring and reversing embedding spaces are limited, reducing transparency and precluding potentially valuable generative use cases. In this work, we develop an open-source, domain-agnostic framework for aligning Large Language Models to embedding spaces using the recently reported Embedding Language Model (ELM) method. We demonstrate our framework by training models to recover, summarize, and compare clinical trial abstracts from embeddings alone. In addition to inverting embeddings back to text more reliably than existing methods, our models can decode novel, interpolated embeddings into new clinical trial abstracts that human experts cannot distinguish from real ones. We further show that these generated abstracts are responsive to moving embeddings along concept vectors for age and sex of study subjects. Our public ELM implementation and experimental results will aid the alignment of Large Language Models to embedding spaces in the biomedical domain and beyond.
2025
Refining Dimensions for Improving Clustering-based Cross-lingual Topic Models
Chia-Hsuan Chang | Tien Yuan Huang | Yi-Hang Tsai | Chia-Ming Chang | San-Yih Hwang
Proceedings of the 18th Workshop on Building and Using Comparable Corpora (BUCC)
Recent work on clustering-based topic models performs well at monolingual topic identification by introducing a pipeline that clusters contextualized representations. However, the pipeline is suboptimal for identifying topics across languages because of the language-dependent dimensions (LDDs) generated by multilingual language models. To address this issue, we introduce a novel, SVD-based dimension refinement component into the pipeline of the clustering-based topic model. This component effectively neutralizes the negative impact of LDDs, enabling the model to accurately identify topics across languages. Our experiments on three datasets demonstrate that the updated pipeline with the dimension refinement component generally outperforms other state-of-the-art cross-lingual topic models.