Tayyab Latif


2026

Vocabulary difficulty prediction aims to estimate how difficult a word is for a learner. This is an important problem because word difficulty is shaped not only by the word itself, but also by the learner’s background and the context in which the word appears. In this work, we predict continuous difficulty scores for English target words using learnerspecific information. Our approach combines a fine-tuned DeBERTa v3 Large model with a CatBoost regressor trained on transformer-based embeddings. The final score is produced through weighted ensembling, where DeBERTa provides the main prediction and CatBoost adds a smaller complementary signal. Our final system achieved RMSE scores of 1.040 for Spanish, 0.992 for German, and 0.882 for Chinese. The results were also stable across multiple runs, showing that the model behaved consistently under small changes in ensemble weight. These findings show that a simple hybrid system can provide reliable performance for vocabulary difficulty prediction. They also suggest that combining strong contextual representations with a lightweight regression model is an effective way to model learner-sensitive word difficulty.