Konstantin Chernyshev
2025
U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in Large Language Models
Konstantin Chernyshev
|
Vitaliy Polshkov
|
Vlad Stepanov
|
Alex Myasnikov
|
Ekaterina Artemova
|
Alexei Miasnikov
|
Sergei Tilga
Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²)
Current evaluations of mathematical skills in Large Language Models are constrained by benchmarks lacking scope, particularly for multi-modal problems — frequently relying on school-level, niche Olympiad-style, simple quiz-format, or relatively small datasets. To address this, we introduce **U-MATH**, a novel benchmark comprising **1,100** unpublished open-ended university-level problems sourced from current US curricula, with **20%** incorporating visual elements. Given the free-form nature of U-MATH problems, we employ LLM judges for solution evaluation and release **𝜇-MATH**, a meta-evaluation benchmark composed of **1,084** U-MATH-derived tasks enabling precise assessment of these judges. Benchmarking leading LLMs reveals marked limitations in multi-modal reasoning, with maximum accuracy reaching 93.1% on textual tasks but only 58.5% on visual ones. Furthermore, solution judgment proves challenging: only the most advanced models achieve meaningfully high performance, and even they peak at an imperfect F1-score of 90.1%.
2023
LCT-1 at SemEval-2023 Task 10: Pre-training and Multi-task Learning for Sexism Detection and Classification
Konstantin Chernyshev
|
Ekaterina Garanina
|
Duygu Bayram
|
Qiankun Zheng
|
Lukas Edman
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)
Misogyny and sexism are growing problems in social media. Advances have been made in online sexism detection, but the resulting systems are often uninterpretable. SemEval-2023 Task 10 on Explainable Detection of Online Sexism aims at increasing the explainability of sexism detection, and our team participated in all the proposed subtasks. Our system is based on further domain-adaptive pre-training. Building on Transformer-based models with domain adaptation, we compare fine-tuning with multi-task learning and show that each subtask requires a different system configuration. In our experiments, multi-task learning performs on par with standard fine-tuning for sexism detection and noticeably better for coarse-grained sexism classification, while fine-tuning is preferable for fine-grained classification.