2025
Cross-Lingual Auto Evaluation for Assessing Multilingual LLMs
Sumanth Doddapaneni | Mohammed Safi Ur Rahman Khan | Dilip Venkatesh | Raj Dabre | Anoop Kunchukuttan | Mitesh M Khapra
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Evaluating machine-generated text remains a significant challenge in NLP, especially for non-English languages. Current methodologies, including automated metrics, human assessments, and LLM-based evaluations, predominantly focus on English, revealing a significant gap in multilingual evaluation frameworks. We introduce the Cross Lingual Auto Evaluation (CIA) Suite, an extensible framework that includes evaluator LLMs (Hercule) and a novel test set (Recon) specifically designed for multilingual evaluation. Our test set features 500 human-annotated instructions spanning various task capabilities along with human judgment scores across six languages. This enables benchmarking of general-purpose multilingual LLMs and facilitates meta-evaluation of evaluator LLMs. The proposed model, Hercule, is a cross-lingual evaluation model that addresses the scarcity of reference answers in the target language by learning to assign scores to responses based on easily available reference answers in English. Our experiments show that Hercule aligns more closely with human judgments than proprietary models do, demonstrating the effectiveness of such cross-lingual evaluation in low-resource scenarios. Further, it is also effective in zero-shot evaluation on unseen languages. This study is the first comprehensive examination of cross-lingual evaluation using LLMs, presenting a scalable and effective approach for multilingual assessment. All code, datasets, and models will be publicly available to enable further research in this important area.
2024
BITS Pilani at SemEval-2024 Task 10: Fine-tuning BERT and Llama 2 for Emotion Recognition in Conversation
Dilip Venkatesh | Pasunti Prasanjith | Yashvardhan Sharma
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)
Emotion Recognition in Conversation (ERC) aims to assign an emotion to a dialogue in a conversation between people. The first subtask of the EDiReF shared task aims to assign an emotion to a Hindi-English code-mixed conversation. For this, our team proposes a system to identify the emotion based on fine-tuning large language models on the MaSaC dataset. For our study, we fine-tuned two LLMs, BERT and Llama 2, to perform sequence classification to identify the emotion of the text.
BITS Pilani at SemEval-2024 Task 9: Prompt Engineering with GPT-4 for Solving Brainteasers
Dilip Venkatesh | Yashvardhan Sharma
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)
Solving brainteasers is a task that requires complex reasoning prowess. The increase of research in natural language processing has led to the development of massive large language models with billions (or trillions) of parameters that are able to solve difficult questions due to their advanced reasoning capabilities. The SemEval BRAINTEASER shared task consists of sentence and word puzzles along with options containing the answer for the puzzle. Our team uses OpenAI’s GPT-4 model along with prompt engineering to solve these brainteasers.
BITS Pilani at SemEval-2024 Task 1: Using text-embedding-3-large and LaBSE embeddings for Semantic Textual Relatedness
Dilip Venkatesh | Sundaresan Raman
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)
Semantic relatedness of a pair of texts (sentences or words) is the degree to which their meanings are close. Track A of the Semantic Textual Relatedness shared task aims to find the semantic relatedness for the English language along with multiple other low-resource languages with the use of pretrained language models. We propose a system to find the Spearman coefficient of a textual pair using pretrained embedding models like text-embedding-3-large and LaBSE.