Alan Huang


2025

MemeQA: Holistic Evaluation for Meme Understanding
Khoi P. N. Nguyen | Terrence Li | Derek Lou Zhou | Gabriel Xiong | Pranav Balu | Nandhan Alahari | Alan Huang | Tanush Chauhan | Harshavardhan Bala | Emre Guzelordu | Affan Kashfi | Aaron Xu | Suyesh Shrestha | Megan Vu | Jerry Wang | Vincent Ng
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Automated meme understanding requires systems to demonstrate fine-grained visual recognition, commonsense reasoning, and extensive cultural knowledge. However, existing benchmarks for meme understanding only concern narrow aspects of meme semantics. To fill this gap, we present MemeQA, a dataset of over 9,000 multiple-choice questions designed to holistically evaluate meme comprehension across seven cognitive aspects. Experiments show that state-of-the-art Large Multimodal Models perform much worse than humans on MemeQA. While fine-tuning improves their performance, they still make many errors on memes wherein proper understanding requires going beyond surface-level sentiment. Moreover, injecting “None of the above” into the available options makes the questions more challenging for the models. Our dataset is publicly available at https://github.com/npnkhoi/memeqa.

2024

LawBench: Benchmarking Legal Knowledge of Large Language Models
Zhiwei Fei | Xiaoyu Shen | Dawei Zhu | Fengzhe Zhou | Zhuo Han | Alan Huang | Songyang Zhang | Kai Chen | Zhixin Yin | Zongwen Shen | Jidong Ge | Vincent Ng
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

We present LawBench, the first evaluation benchmark composed of 20 tasks designed to assess the ability of Large Language Models (LLMs) to perform Chinese legal tasks. LawBench is meticulously crafted to enable precise assessment of LLMs’ legal capabilities across three cognitive levels corresponding to the widely accepted Bloom’s cognitive taxonomy. Using LawBench, we present a comprehensive evaluation of 21 popular LLMs and the first comparative analysis of the empirical results to reveal their relative strengths and weaknesses. All data, model predictions, and evaluation code are accessible at https://github.com/open-compass/LawBench.