Shubh Sehgal


2026

This paper presents team SAAKTH’s system for the BEA 2026 Shared Task on Vocabulary Difficulty Prediction (Closed Track). We address a key challenge: the difficulty of an English word is not fixed but varies with the learner’s native language (L1). Our approach combines a fine-tuned XLM-RoBERTa-large encoder with handcrafted psycholinguistic features engineered for each L1 group. These features are integrated via a shallow multilayer perceptron, and a separate model is optimized for each L1, with five-seed ensembling and XGBoost-based blending for stability. On the development set, our system achieves RMSEs of 0.997 (es), 1.002 (de), and 0.932 (cn), a 20–25% improvement over the baseline. The results highlight the effectiveness of L1-aware modeling under limited data.
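To make the evaluation terms above concrete, the following is a minimal sketch of how RMSE, a seed ensemble, and a relative improvement over a baseline could be computed. It assumes the seed ensemble is a plain item-wise average of per-seed predictions, which the abstract does not specify, and it does not reproduce the XGBoost-based blending. The helper names (`rmse`, `seed_ensemble`, `relative_improvement`) and all numbers are hypothetical toy values, not results from the system.

```python
import math
from statistics import fmean


def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return math.sqrt(fmean((t - p) ** 2 for t, p in zip(y_true, y_pred)))


def seed_ensemble(per_seed_preds):
    """Average predictions item-wise across seeds.

    per_seed_preds holds one list of predictions per random seed.
    Plain averaging is an assumption made for illustration only.
    """
    return [fmean(column) for column in zip(*per_seed_preds)]


def relative_improvement(baseline_rmse, system_rmse):
    """Fractional RMSE reduction relative to a baseline (0.2 means 20%)."""
    return (baseline_rmse - system_rmse) / baseline_rmse


# Toy data (hypothetical): three items, two seeds.
gold = [0.0, 1.0, 2.0]
preds_by_seed = [
    [0.5, 1.0, 1.5],
    [-0.5, 1.0, 2.5],
]

ensembled = seed_ensemble(preds_by_seed)
single_seed_error = rmse(gold, preds_by_seed[0])
ensemble_error = rmse(gold, ensembled)
```

In this toy case the two seeds err in opposite directions, so averaging cancels their errors; real seeds are correlated and the gain from ensembling is typically smaller.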