Sheikh Shafayat


2025

pdf bib
The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models
Seungone Kim | Juyoung Suk | Ji Yong Cho | Shayne Longpre | Chaeeun Kim | Dongkeun Yoon | Guijin Son | Yejin Cho | Sheikh Shafayat | Jinheon Baek | Sue Hyun Park | Hyeonbin Hwang | Jinkyung Jo | Hyowon Cho | Haebin Shin | Seongyun Lee | Hanseok Oh | Noah Lee | Namgyu Ho | Se June Joo | Miyoung Ko | Yoonjoo Lee | Hyungjoo Chae | Jamin Shin | Joel Jang | Seonghyeon Ye | Bill Yuchen Lin | Sean Welleck | Graham Neubig | Moontae Lee | Kyungjae Lee | Minjoon Seo
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

As language models (LMs) become capable of handling a wide range of tasks, their evaluation is becoming as challenging as their development. Most generation benchmarks currently assess LMs using abstract evaluation criteria-like helpfulness and harmlessness-which often lack the flexibility and granularity of human assessment. Additionally, these benchmarks tend to focus disproportionately on specific capabilities such as instruction following, leading to coverage bias. To overcome these limitations, we introduce the BiGGen Bench, a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks. A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation. We apply this benchmark to assess 100 frontier LMs using five evaluator LMs. Our code, data, and evaluation results are all publicly available at https://github.com/prometheus-eval/prometheus-eval.

2024

pdf bib
LangBridge: Multilingual Reasoning Without Multilingual Supervision
Dongkeun Yoon | Joel Jang | Sungdong Kim | Seungone Kim | Sheikh Shafayat | Minjoon Seo
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We introduce LangBridge, a zero-shot approach to adapt language models for multilingual reasoning tasks without multilingual supervision. LangBridge operates by bridging two models, each specialized in different aspects: (1) one specialized in understanding multiple languages (e.g., mT5 encoder) and (2) one specialized in reasoning (e.g., MetaMath). LangBridge connects the two models by introducing minimal trainable parameters between them. Despite utilizing only English data for training, LangBridge considerably enhances the performance of language models on low-resource languages across mathematical reasoning, code completion, logical reasoning, and commonsense reasoning. Our analysis suggests that the efficacy of LangBridge stems from the language-agnostic characteristics of multilingual representations. We publicly release our code and models.

pdf bib
BEnQA: A Question Answering Benchmark for Bengali and English
Sheikh Shafayat | H Hasan | Minhajur Mahim | Rifki Putri | James Thorne | Alice Oh
Findings of the Association for Computational Linguistics: ACL 2024

In this study, we introduce BEnQA, a dataset comprising parallel Bengali and English exam questions for middle and high school levels in Bangladesh. Our dataset consists of approximately 5K questions covering several subjects in science with different types of questions, including factual, application, and reasoning-based questions. We benchmark several Large Language Models (LLMs) with our parallel dataset and observe a notable performance disparity between the models in Bengali and English. We also investigate some prompting methods, and find that Chain-of-Thought prompting is beneficial mostly on reasoning questions, but not so much on factual ones. We also find that appending English translation helps to answer questions in Bengali. Our findings point to promising future research directions for improving the performance of LLMs in Bengali and more generally in low-resource languages.