Yang Zhang


2026

Large Language Models (LLMs) are commonly trained on multilingual corpora that include Greek, yet reliable evaluation benchmarks for Greek—particularly those based on authentic, native-sourced content—remain limited. Existing datasets are often machine-translated from English, failing to capture Greek linguistic and cultural characteristics. We introduce GreekMMLU, a native-sourced benchmark for massive multitask language understanding in Greek, comprising 21,805 multiple-choice questions across 45 subject areas, organized under a newly defined subject taxonomy and annotated with educational difficulty levels spanning primary to professional examinations. All questions are sourced or authored in Greek from academic, professional, and governmental exams. We publicly release 16,857 samples and reserve 4,948 samples for a private leaderboard to enable robust and contamination-resistant evaluation. Evaluations of over 80 open- and closed-source LLMs reveal substantial performance gaps between frontier and open-weight models, as well as between Greek-adapted models and general multilingual ones. Finally, we provide a systematic analysis of factors influencing performance—including model scale, adaptation, and prompting—and derive insights for improving LLM capabilities in Greek.


Fast-Decoding Diffusion Language Models via Progress-Aware Confidence Schedules
Amr Mohamed | Yang Zhang | Michalis Vazirgiannis | Guokan Shang
Findings of the Association for Computational Linguistics: ACL 2026

Diffusion large language models (dLLMs) offer a promising alternative to autoregressive models, but their practical utility is severely hampered by slow, iterative sampling. We present *SchED*, a training-free, model-agnostic early-exit algorithm that terminates diffusion decoding using a progress-aware confidence threshold. We evaluate *SchED* across multiple diffusion model families and a diverse set of benchmarks spanning multiple-choice, math, long-form QA, and translation. *SchED* delivers substantial acceleration: on instruction-tuned models, it achieves approximately 4× speedups while retaining baseline performance on average. On base models, *SchED* yields consistent speedups with 99.1–100% performance retention, reaching up to 2.34× under more aggressive settings. Under a conservative quality-penalized speed metric, *SchED* consistently outperforms prior confidence-based early-exit methods, including on long-form generation where existing approaches tend to break down. An entropy analysis of the model's token predictions reveals that instruction tuning speeds up the decay of predictive entropy. By leveraging inherent confidence stabilization as a signal for computational efficiency, *SchED* provides a robust framework for efficient dLLM inference.
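The abstract does not spell out the schedule, so the following is only a minimal sketch of what a progress-aware confidence early exit could look like. The linear schedule, the parameters `tau_high` and `tau_low`, the use of mean top-1 probability as the confidence signal, and all function names are assumptions made for illustration, not the paper's actual method.

```python
def threshold(progress, tau_high=0.95, tau_low=0.6):
    """Assumed schedule: relax the required confidence linearly with progress."""
    progress = min(max(progress, 0.0), 1.0)
    return tau_high - (tau_high - tau_low) * progress


def mean_confidence(token_probs):
    """Average top-1 probability over all positions (assumed confidence signal)."""
    return sum(max(p) for p in token_probs) / len(token_probs)


def decode_with_early_exit(step_fn, max_steps, tau_high=0.95, tau_low=0.6):
    """Run up to max_steps denoising steps and stop once the mean confidence
    reaches the threshold for the current progress.

    step_fn(step) is a hypothetical callback returning one probability
    distribution per output position after the given step.
    """
    probs = None
    for step in range(1, max_steps + 1):
        probs = step_fn(step)
        progress = step / max_steps
        if mean_confidence(probs) >= threshold(progress, tau_high, tau_low):
            return probs, step
    return probs, max_steps


def toy_step(step):
    """Toy stand-in for a diffusion model: confidence grows with each step."""
    c = min(0.5 + 0.05 * step, 0.99)
    return [[c, 1.0 - c] for _ in range(4)]


if __name__ == "__main__":
    _, steps_used = decode_with_early_exit(toy_step, max_steps=20)
    print("steps used:", steps_used)
```

In this toy setup the loop stops well before the 20-step budget, because the rising confidence meets the falling threshold partway through decoding. A stricter `tau_low` trades speed for safety.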


Beyond Random Sampling: Efficient Language Model Pretraining via Curriculum Learning
Yang Zhang | Amr Mohamed | Hadi Abdine | Guokan Shang | Michalis Vazirgiannis
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Curriculum learning, which organizes training data from easy to hard, has improved efficiency across machine learning domains, yet remains underexplored for language model pretraining. We present the first systematic investigation of curriculum learning in LLM pretraining, with over 200 models trained on up to 100B tokens across three strategies: vanilla curriculum learning, pacing-based sampling, and interleaved curricula, guided by six difficulty metrics spanning linguistic and information-theoretic properties. We evaluate performance on eight benchmarks under three realistic scenarios: limited data, unlimited data, and continual training. Our experiments show that curriculum learning consistently accelerates convergence in early and mid-training phases, reducing the training steps needed to reach baseline performance by 18–45%. When applied as a warmup strategy before standard random sampling, curriculum learning yields sustained improvements of up to 3.5%. We identify compression ratio, lexical diversity (MTLD), and readability (Flesch Reading Ease) as the most effective difficulty signals. Our findings demonstrate that data ordering, which is orthogonal to existing data selection methods, provides a practical mechanism for more efficient LLM pretraining.
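As a rough illustration of ordering data by one of the difficulty signals named above, the sketch below scores documents by compression ratio and sorts them. Several things here are assumptions rather than details from the abstract: the use of `zlib`, the definition of the ratio as compressed size over raw size, the reading that a lower ratio means "easier", and the hypothetical `warmup_then_random` helper, which is only one possible interpretation of a curriculum warmup followed by random sampling.

```python
import random
import zlib


def compression_ratio(text):
    """Compressed size divided by raw size (assumed definition; lower = more redundant)."""
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(zlib.compress(raw)) / len(raw)


def curriculum_order(docs):
    """Sort documents from assumed-easy (low ratio) to assumed-hard (high ratio)."""
    return sorted(docs, key=compression_ratio)


def warmup_then_random(docs, warmup_fraction=0.2, seed=0):
    """Hypothetical warmup: serve the easiest fraction in order, then shuffle the rest."""
    ordered = curriculum_order(docs)
    cut = int(len(ordered) * warmup_fraction)
    rest = ordered[cut:]
    random.Random(seed).shuffle(rest)
    return ordered[:cut] + rest


if __name__ == "__main__":
    corpus = [
        "the cat sat on the mat. " * 40,
        " ".join(str(i * i * 7919 % 100003) for i in range(150)),
        "a b c d e f g " * 60,
    ]
    for doc in curriculum_order(corpus):
        print(round(compression_ratio(doc), 3), doc[:30])
```

Other signals mentioned in the abstract, such as MTLD or Flesch Reading Ease, could be swapped in by replacing the `key` function.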


Lost in the Mix: Evaluating LLM Understanding of Code-Switched Text
Amr Mohamed | Yang Zhang | Michalis Vazirgiannis | Guokan Shang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Code-switching (CSW) is the act of alternating between two or more languages within a single discourse. This phenomenon is widespread in multilingual communities and increasingly prevalent online, exposing large language models (LLMs) to mixed-language inputs. We present a systematic evaluation of LLM *comprehension* under code-switching by generating linguistically grounded CSW variants of established benchmarks (Belebele, MMLU, XNLI) across five typologically diverse languages. Our contributions are: (i) a controlled pipeline for producing CSW test sets that respect linguistic constraints on code-switching; (ii) a multi-model, multi-language analysis showing that inserting non-English tokens into English consistently reduces accuracy on comprehension and reasoning benchmarks, whereas embedding English into non-English contexts often improves it; and (iii) a mitigation study contrasting in-context learning (ICL) with fine-tuning. Across model families, ICL cues yield inconsistent, and sometimes negative, effects, while fine-tuning on CSW data provides modest but reliable gains, partially recovering accuracy under CSW.

Co-authors

Mersin Konomi 1

Giannis Nikolentzos 1

Konstantinos Skianis 1

Giorgos Stamou 1

Christos Xypolopoulos 1

