Sicong Huang


2025

UCSC at SemEval-2025 Task 3: Context, Models and Prompt Optimization for Automated Hallucination Detection in LLM Output
Sicong Huang | Jincheng He | Shiyuan Huang | Karthik Raja Anandan | Arkajyoti Chakraborty | Ian Lane
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

Hallucinations pose a significant challenge for large language models when answering knowledge-intensive queries. As LLMs become more widely adopted, it is crucial not only to detect whether hallucinations occur but also to pinpoint where they arise. SemEval 2025 Task 3, Mu-SHROOM: Multilingual Shared-task on Hallucinations and Related Observable Overgeneration Mistakes, is a recent effort in this direction. This paper describes our solution to the shared task. We propose a framework that first retrieves relevant context, then identifies false content in the answer, and finally maps it back to spans in the model output. The process is further enhanced by automatically optimizing prompts. Our system achieves the highest overall performance, ranking #1 in average position across all languages.
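
Below is a minimal sketch of the retrieve-verify-map idea described above, assuming hypothetical helper functions retrieve_context and call_llm; the prompt wording and the span-matching step are illustrative stand-ins, not the exact components used in the system.

from typing import List, Tuple

def retrieve_context(question: str) -> str:
    """Hypothetical retriever: return passages relevant to the question."""
    raise NotImplementedError

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call returning the model's raw text response."""
    raise NotImplementedError

def detect_hallucinated_spans(question: str, answer: str) -> List[Tuple[int, int]]:
    """Return character-level (start, end) spans of the answer judged unsupported."""
    context = retrieve_context(question)
    prompt = (
        "Context:\n" + context + "\n\n"
        "Question: " + question + "\n"
        "Answer: " + answer + "\n\n"
        "List, one per line, any statements in the answer that the context does not support."
    )
    flagged = [line.strip() for line in call_llm(prompt).splitlines() if line.strip()]
    spans = []
    for claim in flagged:
        start = answer.find(claim)  # map flagged content back to spans in the answer
        if start != -1:
            spans.append((start, start + len(claim)))
    return spans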

UCSC at SemEval-2025 Task 8: Question Answering over Tabular Data
Neng Wan | Sicong Huang | Esha Ubale | Ian Lane
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

Table question answering (Table QA) remains challenging due to the varied structures of tables and the complexity of queries, which often require specialized reasoning. We introduce a system that leverages large language models (LLMs) to generate executable code as an intermediate step for answering questions over tabular data. The method uniformly represents tables as dataframes and prompts an LLM to translate natural-language questions into code that can be executed on these tables. This approach addresses key challenges by handling diverse table formats and enhancing interpretability through code execution. Experimental results on the DataBench benchmarks demonstrate that the proposed code-then-execute approach achieves high accuracy. Moreover, by offloading computation to code execution, the system requires fewer LLM invocations, thereby improving efficiency. These findings highlight the effectiveness of an LLM-based coding approach for reliable, scalable, and interpretable Table QA.
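
A minimal sketch of the code-then-execute idea: represent the table as a pandas DataFrame, prompt an LLM to write Python against it, and run the generated code. The call_llm function and the prompt format are hypothetical placeholders rather than the system's actual interface.

import pandas as pd

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call returning generated Python code as a string."""
    raise NotImplementedError

def answer_table_question(df: pd.DataFrame, question: str):
    prompt = (
        "You are given a pandas DataFrame named `df` with columns "
        f"{list(df.columns)}.\n"
        "Write Python code that stores the answer to the question below "
        "in a variable named `result`.\n"
        f"Question: {question}"
    )
    code = call_llm(prompt)
    namespace = {"df": df}
    exec(code, namespace)  # execute the generated code against the dataframe
    return namespace.get("result")

# Example usage with toy data:
# df = pd.DataFrame({"country": ["FR", "DE"], "population": [68, 84]})
# answer_table_question(df, "Which country has the larger population?")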

2024

Toward Faithful Dialogs: Evaluating and Improving the Faithfulness of Dialog Systems
Sicong Huang
Proceedings of the 20th Workshop of Young Researchers' Roundtable on Spoken Dialogue Systems

My primary research interests lie in evaluating and improving the faithfulness of language model-based text generation systems. Recent advances in large language models (LLMs) such as GPT-4 and Llama have enabled the wide adoption of LLMs across many areas of natural language processing (NLP). Despite their widespread use, LLMs still suffer from hallucination, which limits their practicality in use cases where being factual and faithful is of critical importance. My research specifically aims to evaluate and improve the faithfulness, i.e., the factual alignment between the generated text and a given context, of text generation systems. By developing techniques to reliably evaluate, label, and improve generation faithfulness, we can enable wider adoption of dialog systems that need to converse with human users using accurate information.
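
One common way to operationalize faithfulness as factual alignment, shown here only as an illustrative sketch and not as the method of this work, is to score whether the source context entails the generated response with an off-the-shelf NLI model; the checkpoint name below is an example choice.

from transformers import pipeline

# Any NLI checkpoint with an entailment label would work here.
nli = pipeline("text-classification", model="roberta-large-mnli")

def faithfulness_score(context: str, response: str) -> float:
    """Return the entailment probability of the response given the context."""
    scores = nli({"text": context, "text_pair": response}, top_k=None)
    return next(s["score"] for s in scores if s["label"].lower() == "entailment")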

2019

WTMED at MEDIQA 2019: A Hybrid Approach to Biomedical Natural Language Inference
Zhaofeng Wu | Yan Song | Sicong Huang | Yuanhe Tian | Fei Xia
Proceedings of the 18th BioNLP Workshop and Shared Task

Natural language inference (NLI) is challenging, especially when it is applied to technical domains such as biomedical settings. In this paper, we propose a hybrid approach to biomedical NLI where different types of information are exploited for this task. Our base model includes a pre-trained text encoder as the core component, together with a syntax encoder and a feature encoder to capture syntactic and domain-specific information. We then combine the outputs of different base models to form more powerful ensemble models. Finally, we design two conflict resolution strategies for when the test data contain multiple (premise, hypothesis) pairs with the same premise. We train our models on the MedNLI dataset, yielding the best performance on the test set of MEDIQA 2019 Task 1.
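
As an illustrative sketch of the hybrid idea, the snippet below concatenates representations from a pre-trained text encoder, a syntax encoder, and a feature encoder before classification; the dimensions, projection layers, and module choices are assumptions for illustration, not the paper's exact architecture.

import torch
import torch.nn as nn

class HybridNLIModel(nn.Module):
    def __init__(self, text_dim=768, syntax_dim=128, feat_dim=32, num_labels=3):
        super().__init__()
        # In practice the text representation would come from a pre-trained
        # transformer and the syntax/feature representations from task-specific
        # encoders; here they are assumed to be precomputed vectors.
        self.syntax_proj = nn.Linear(syntax_dim, syntax_dim)
        self.feat_proj = nn.Linear(feat_dim, feat_dim)
        self.classifier = nn.Linear(text_dim + syntax_dim + feat_dim, num_labels)

    def forward(self, text_repr, syntax_repr, feat_repr):
        combined = torch.cat(
            [text_repr, self.syntax_proj(syntax_repr), self.feat_proj(feat_repr)],
            dim=-1,
        )
        return self.classifier(combined)  # logits over entailment / neutral / contradiction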