Kamruzzaman Khan Alve


2026

We present a unified, language-agnostic system for the BEA 2026 Shared Task on vocabulary difficulty prediction. The system uses a single training pipeline across Spanish, German, and Mandarin Chinese without any language-specific adaptation. Input features include serialized text fields and four scalar length-based features, processed using an XLM-RoBERTa encoder with attention-mask-weighted mean pooling. Hyperparameters are tuned with Optuna under reduced cross-validation, followed by full 5-fold training and checkpoint-based ensembling. Our approach improves over the official closed-track baseline in all three L1 conditions, demonstrating that a shared architecture and training strategy can yield consistent gains without language-specific engineering. Error analysis shows higher prediction error at the difficulty extremes, suggesting a regression-to-the-mean tendency.
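As an illustration of the attention-mask-weighted mean pooling mentioned above, the following is a minimal, framework-free sketch and not the system's actual implementation. It assumes the encoder output is available as nested lists of shape (batch, tokens, hidden) together with a 0/1 attention mask; the function name `masked_mean_pool` and the small epsilon guard are hypothetical choices made for this example.

```python
def masked_mean_pool(hidden_states, attention_mask, eps=1e-9):
    """Average token vectors over real (non-padding) positions only.

    hidden_states: list of sequences, each a list of token vectors (lists of floats).
    attention_mask: list of sequences of 0/1 flags, 1 marking a real token.
    eps: assumed guard value that avoids division by zero for all-padding rows.
    """
    pooled = []
    for token_vectors, mask in zip(hidden_states, attention_mask):
        dim = len(token_vectors[0])
        total = [0.0] * dim
        # Sum each token vector weighted by its mask value (padding contributes 0).
        for vector, flag in zip(token_vectors, mask):
            for i in range(dim):
                total[i] += flag * vector[i]
        # Divide by the number of real tokens rather than the padded length.
        count = max(sum(mask), eps)
        pooled.append([value / count for value in total])
    return pooled


# Toy example: two real tokens followed by one padding token.
hidden = [[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]]
mask = [[1, 1, 0]]
print(masked_mean_pool(hidden, mask))  # prints [[2.0, 3.0]]
```

Dividing by the count of unmasked tokens keeps padding from pulling the sentence representation toward the padding vectors, which matters when sequence lengths in a batch vary widely.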