Yu Huang
Also published as: Yun Huang
Other people with similar names: Yu Huang
Unverified author pages with similar names: Yu Huang
2026
Beyond Single View: A Comprehensive Benchmark for Medical Multimodal Large Language Models on Multi-Image Understanding
Dexuan Xu | Yuan Jiayin | Jianing Wang | Yanyuan Chen | Hanpin Wang | Yu Huang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in interpreting single medical images. However, real-world clinical diagnosis is intrinsically a multi-view process, requiring the synthesis of information across volumetric slices, temporal sequences, and comparative modalities. Existing benchmarks fail to capture this complexity, limiting the assessment of models in realistic clinical workflows. To bridge this gap, we introduce MedMultiBench, the first large-scale benchmark specifically designed for medical multi-image understanding. Comprising 11,392 expert-curated samples, MedMultiBench evaluates MLLMs across four distinct dimensions: Joint Reasoning, Comparative Analysis, Comprehensive Perception, and In-Context Learning. We benchmark 13 state-of-the-art MLLMs, revealing that while current models excel in single-view tasks, they struggle significantly with multi-image contexts. Our experiments identify a performance degradation in open-source models when processing increased visual loads, whereas closed-source models demonstrate better scalability. MedMultiBench provides a robust framework to facilitate the development of MLLMs capable of holistic clinical reasoning.
2025
DatawiseAgent: A Notebook-Centric LLM Agent Framework for Adaptive and Robust Data Science Automation
Ziming You | Yumiao Zhang | Dexuan Xu | Yiwei Lou | Yandong Yan | Wei Wang | Huamin Zhang | Yu Huang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Existing large language model (LLM) agents for automating data science show promise, but they remain constrained by narrow task scopes, limited generalization across tasks and models, and over-reliance on state-of-the-art (SOTA) LLMs. We introduce DatawiseAgent, a notebook-centric LLM agent framework for adaptive and robust data science automation. Inspired by how human data scientists work in computational notebooks, DatawiseAgent introduces a unified interaction representation and a multi-stage architecture based on finite-state transducers (FSTs). This design enables flexible long-horizon planning, progressive solution development, and robust recovery from execution failures. Extensive experiments across diverse data science scenarios and models show that DatawiseAgent consistently achieves SOTA performance by surpassing strong baselines such as AutoGen and TaskWeaver, demonstrating superior effectiveness and adaptability. Further evaluations reveal graceful performance degradation under weaker or smaller models, underscoring the robustness and scalability.
ValueCompass: A Framework for Measuring Contextual Value Alignment Between Human and LLMs
Hua Shen | Tiffany Knearem | Reshmi Ghosh | Yu-Ju Yang | Nicholas Clark | Tanu Mitra | Yun Huang
Proceedings of the 9th Widening NLP Workshop
As AI advances, aligning it with diverse human and societal values grows critical. But how do we define these values and measure AI’s adherence to them? We present ValueCompass, a framework grounded in psychological theories, to assess human-AI alignment. Applying it to five diverse LLMs and 112 humans from seven countries across four scenarios—collaborative writing, education, public sectors, and healthcare—we uncover key misalignments. For example, humans prioritize national security, while LLMs often reject it. Values also shift across contexts, demanding scenario-specific alignment strategies. This work advances AI design by mapping how systems can better reflect societal ethics.
2024
Detection, Diagnosis, and Explanation: A Benchmark for Chinese Medical Hallucination Evaluation
Chengfeng Dou | Ying Zhang | Yanyuan Chen | Zhi Jin | Wenpin Jiao | Haiyan Zhao | Yongqiang Zhao | Zhenwei Tao | Yun Huang
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Large Language Models (LLMs) have made significant progress recently. However, their practical use in healthcare is hindered by their tendency to generate hallucinations. One specific type, called snowballing hallucination, occurs when LLMs encounter misleading information, and poses a security threat to LLMs. To understand how well LLMs can resist these hallucinations, we create the Chinese Medical Hallucination Evaluation benchmark (CMHE). This benchmark can be used to evaluate LLMs’ ability to detect medical hallucinations, make accurate diagnoses in noisy conditions, and provide plausible explanations. The creation of this benchmark involves a combination of manual and model-based approaches. In addition, we use ICD-10 as well as MeSH, two specialized glossaries, to aid in the evaluation. Our experiments show that the LLM struggles to identify fake medical terms and makes poor diagnoses in distracting environments. However, improving the model’s understanding of medical concepts can help it resist interference to some extent. Our dataset is available at https://drive.google.com/drive/folders/1DrdovKwZIh6AX_JjL8BVpUmI9djiIwn_?usp=drive_link.
Reference-based Metrics Disprove Themselves in Question Generation
Bang Nguyen | Mengxia Yu | Yun Huang | Meng Jiang
Findings of the Association for Computational Linguistics: EMNLP 2024
Reference-based metrics such as BLEU and BERTScore are widely used to evaluate question generation (QG). In this study, on QG benchmarks such as SQuAD and HotpotQA, we find that using human-written references cannot guarantee the effectiveness of the reference-based metrics. Most QG benchmarks have only one reference; we replicate the annotation process and collect another reference. A good metric is expected to grade a human-validated question no worse than generated questions. However, the results of reference-based metrics on our newly collected reference disproved the metrics themselves. We propose a reference-free metric consisting of multi-dimensional criteria such as naturalness, answerability, and complexity, utilizing large language models. These criteria are not constrained to the syntax or semantics of a single reference question, and the metric does not require a diverse set of references. Experiments reveal that our metric accurately distinguishes between high-quality questions and flawed ones, and achieves state-of-the-art alignment with human judgment.
2012
Improved Constituent Context Model with Features
Yun Huang | Min Zhang | Chew Lim Tan
Proceedings of the 26th Pacific Asia Conference on Language, Information, and Computation
Improved Combinatory Categorial Grammar Induction with Boundary Words and Bayesian Inference
Yun Huang | Min Zhang | Chew-Lim Tan
Proceedings of COLING 2012
2011
Nonparametric Bayesian Machine Transliteration with Synchronous Adaptor Grammars
Yun Huang | Min Zhang | Chew Lim Tan
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies
2009
Improving Statistical Machine Translation Using Domain Bilingual Multiword Expressions
Zhixiang Ren | Yajuan Lü | Jie Cao | Qun Liu | Yun Huang
Proceedings of the Workshop on Multiword Expressions: Identification, Interpretation, Disambiguation and Applications (MWE 2009)
2008
The ICT system description for IWSLT 2008.
Yang Liu | Zhongjun He | Haitao Mi | Yun Huang | Yang Feng | Wenbin Jiang | Yajuan Lu | Qun Liu
Proceedings of the 5th International Workshop on Spoken Language Translation: Evaluation Campaign
This paper presents a description for the ICT systems involved in the IWSLT 2008 evaluation campaign. This year, we participated in Chinese-English and English-Chinese translation directions. Four statistical machine translation systems were used: one linguistically syntax-based, two formally syntax-based, and one phrase-based. The outputs of the four SMT systems were fed to a sentence-level system combiner, which was expected to produce better translations than single systems. We will report the results of the four single systems and the combiner on both the development and test sets.
2007
The ICT statistical machine translation systems for IWSLT 2007
Zhongjun He | Haitao Mi | Yang Liu | Deyi Xiong | Weihua Luo | Yun Huang | Zhixiang Ren | Yajuan Lu | Qun Liu
Proceedings of the Fourth International Workshop on Spoken Language Translation
In this paper, we give an overview of the ICT statistical machine translation systems for the evaluation campaign of the International Workshop on Spoken Language Translation (IWSLT) 2007. In this year’s evaluation, we participated in the Chinese-English transcript translation task, and developed three systems based on different techniques: a formally syntax-based system Bruin, an extended phrase-based system Confucius and a linguistically syntax-based system Lynx. We will describe the models of these three systems, and compare their performance in detail. We set Bruin as our primary system, which ranks second among the 15 primary results according to the official evaluation results.
Co-authors
- Qun Liu 4
- Yang Liu (刘洋) 3
- Yajuan Lü 3
- Chew Lim Tan 3
- Min Zhang 3
- Yanyuan Chen 2
- Zhongjun He 2
- Haitao Mi 2
- Zhixiang Ren 2
- Dexuan Xu 2
- Jie Cao 1
- Nicholas Clark 1
- Chengfeng Dou 1
- Yang Feng 1
- Reshmi Ghosh 1
- Wenbin Jiang 1
- Meng Jiang 1
- Wenpin Jiao 1
- Yuan Jiayin 1
- Zhi Jin 1
- Tiffany Knearem 1
- Shouxun Lin 1
- Yiwei Lou 1
- Weihua Luo 1
- Tanu Mitra 1
- Bang Nguyen 1
- Hua Shen 1
- Zhenwei Tao 1
- Wei Wang 1
- Jianing Wang 1
- Hanpin Wang 1
- Deyi Xiong (德意 熊) 1
- Yandong Yan 1
- Yu-Ju Yang 1
- Ziming You 1
- Mengxia Yu 1
- Yumiao Zhang 1
- Huamin Zhang 1
- Ying Zhang 1
- Haiyan Zhao 1
- Yongqiang Zhao 1