Zhouxiang Fang
2025
ICL CIPHERS: Quantifying "Learning" in In-Context Learning via Substitution Ciphers
Zhouxiang Fang | Aayush Mishra | Muhan Gao | Anqi Liu | Daniel Khashabi
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Recent works have suggested that In-Context Learning (ICL) operates in dual modes, i.e., task retrieval (remembering learned patterns from pre-training) and task learning (inference-time "learning" from demonstrations). However, disentangling these two modes remains a challenging goal. We introduce ICL CIPHERS, a class of task reformulations based on substitution ciphers borrowed from classic cryptography. In this approach, a subset of tokens in the in-context inputs is substituted with other (irrelevant) tokens, rendering English sentences less comprehensible to the human eye. However, by design, there is a latent, fixed pattern to this substitution, making it reversible. This bijective (reversible) cipher ensures that the task remains well defined in some abstract sense, despite the transformations. It is a curious question whether LLMs can solve tasks reformulated by ICL CIPHERS with a BIJECTIVE mapping, which requires "deciphering" the latent cipher. We show that LLMs are better at solving tasks reformulated by ICL CIPHERS with BIJECTIVE mappings than with the NON-BIJECTIVE (irreversible) baseline, providing a novel approach to quantify "learning" in ICL. While this gap is small, it is consistent across the board on four datasets and six models. Finally, our interpretability analysis shows evidence that LLMs can internally decode ciphered inputs.
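To make the cipher setup concrete, here is a minimal Python sketch, assuming a simple word-level vocabulary and a 50% substitution rate; the function names and vocabulary are illustrative assumptions, not the paper's actual implementation. A bijective cipher is a fixed permutation of the vocabulary (hence reversible), while the non-bijective baseline maps tokens independently at random, so collisions make it irreversible.

```python
import random

def make_bijective_cipher(vocab, seed=0):
    """Bijective (reversible) cipher: a fixed permutation of the vocabulary."""
    rng = random.Random(seed)
    shuffled = list(vocab)
    rng.shuffle(shuffled)
    return dict(zip(vocab, shuffled))

def make_non_bijective_cipher(vocab, seed=0):
    """Non-bijective baseline: each token is mapped independently at random,
    so distinct tokens can collide and the mapping cannot be inverted."""
    rng = random.Random(seed)
    return {tok: rng.choice(vocab) for tok in vocab}

def apply_cipher(tokens, cipher, rate=0.5, seed=0):
    """Substitute a fraction `rate` of in-vocabulary tokens according to the cipher."""
    rng = random.Random(seed)
    return [cipher[t] if t in cipher and rng.random() < rate else t
            for t in tokens]

# Toy usage: cipher one in-context demonstration
vocab = ["good", "bad", "movie", "plot", "boring", "great"]
bijective = make_bijective_cipher(vocab)
print(apply_cipher("the movie had a great plot".split(), bijective))
```

Because the bijective map is a permutation, an ideal learner could in principle recover the original text from the demonstrations; under the non-bijective baseline that information is destroyed, which is what makes the comparison a probe of inference-time "learning".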
Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions
Hanjie Chen | Zhouxiang Fang | Yash Singla | Mark Dredze
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
LLMs have demonstrated impressive performance in answering medical questions, such as achieving passing scores on medical licensing examinations. However, medical board exams or general clinical questions do not capture the complexity of realistic clinical cases. Moreover, the lack of reference explanations means we cannot easily evaluate the reasoning behind model decisions, a crucial component of supporting doctors in making complex medical decisions. To address these challenges, we construct two new datasets: JAMA Clinical Challenge and Medbullets. JAMA Clinical Challenge consists of questions based on challenging clinical cases, while Medbullets comprises simulated clinical questions. Both datasets are structured as multiple-choice question-answering tasks, accompanied by expert-written explanations. We evaluate seven LLMs on the two datasets using various prompts. Experiments demonstrate that our datasets are harder than previous benchmarks. In-depth automatic and human evaluations of model-generated explanations provide insights into the promise and deficiencies of LLMs for explainable medical QA.
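As a rough illustration of how such items might be represented and prompted, here is a minimal Python sketch; the field names and prompt wording are hypothetical assumptions, not the released datasets' actual schema.

```python
# Hypothetical record layout for one explainable multiple-choice medical QA item.
# Field names are illustrative placeholders, not the datasets' schema.
item = {
    "question": "A 54-year-old presents with ...",  # clinical case stem
    "options": {"A": "...", "B": "...", "C": "...", "D": "..."},
    "answer": "C",
    "explanation": "Expert-written rationale for the correct option ...",
}

def build_prompt(item):
    """Format one item as a prompt asking for an answer letter plus an explanation."""
    options = "\n".join(f"{key}. {text}" for key, text in item["options"].items())
    return (f"{item['question']}\n{options}\n"
            "Answer with the option letter, then explain your reasoning.")

print(build_prompt(item))
```

The expert-written explanation field is what allows model-generated rationales to be compared against a reference, both automatically and by human judges.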