Prasanth
2026
Translation-Augmented Multilingual Summarization for Low-Resource Languages
Prasanth
Proceedings of the Sixth Workshop on Language Technology for Equality, Diversity, Inclusion
While automatic text summarization has achieved remarkable success in English, extending these capabilities to low-resource languages remains a significant challenge due to the scarcity of labeled training data. We propose a translation-augmented approach to multilingual summarization: we systematically translate high-quality English summarization corpora into low-resource target languages using NLLB-200, and use the resulting parallel data to train and evaluate sequence-to-sequence models. We experiment across three typologically diverse languages (Swahili, Hausa, and Afrikaans), comparing monolingual fine-tuning (MONO), cross-lingual transfer (XLT), and joint multilingual training (TAMT) on mBART-large-50. Monolingual fine-tuning achieves the best performance for Swahili (ROUGE-L 13.9) and Afrikaans (ROUGE-L 15.7), surpassing the Lead-3 baseline in both cases, while cross-lingual transfer remains strongest for Hausa (ROUGE-L 14.5). We show that native language token availability in mBART-50 is a critical determinant of fine-tuning performance, and characterize the conditions under which the theoretically expected TAMT > MONO > XLT ordering breaks down. We release our dataset, code, and evaluation infrastructure to support future research on low-resource multilingual summarization.
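The abstract compares systems against a Lead-3 baseline using ROUGE-L. As a rough illustration of what those two terms mean, the sketch below implements a Lead-3 extractor and a sentence-level ROUGE-L F1 score from the longest common subsequence. It is not the authors' evaluation code: the function names, the regex sentence splitter, and the lowercase whitespace tokenization are all assumptions made for this example, and standard ROUGE tooling typically applies its own tokenization and optional stemming.

```python
import re


def lead_3(document: str) -> str:
    """Lead-3 baseline: return the first three sentences of the document."""
    # Naive splitter on sentence-final punctuation (an assumption; a real
    # pipeline would use a language-aware sentence segmenter).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", document.strip()) if s]
    return " ".join(sentences[:3])


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 in [0, 1], using lowercase whitespace tokens."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Scores from this function lie between 0 and 1; figures such as "ROUGE-L 13.9" in the abstract are presumably on a 0 to 100 scale, so a value would be multiplied by 100 before comparison.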
Efficient Visual Grounding in VQA via Question-Guided Sparse Attention
Prasanth
Proceedings of the 4th Workshop on Advances in Language and Vision Research (ALVR)
Visual Question Answering (VQA) models process all image patches uniformly despite questions typically requiring only a small subset of visual information. This inefficiency leads to unnecessary computation and can result in attention dilution across irrelevant image regions. We propose Question-Guided Sparse Attention (QGSA), a plug-and-play mechanism that dynamically selects relevant image patches conditioned on question semantics. Our approach introduces three components: (1) a differentiable patch selector based on Gumbel-Softmax reparameterisation that enables end-to-end training with hard patch selection at inference; (2) a self-supervised grounding loss that encourages spatial selectivity without bounding-box annotations, combining contrastive patch selection with patch–word alignment via a frozen CLIP encoder; and (3) an adaptive sparsity mechanism that adjusts the number of selected patches according to estimated question complexity. Experiments on SmolVLM-256M-Instruct and SmolVLM-500M-Instruct across three VQA benchmarks (VQA-RAD, A-OKVQA, RefCOCO) demonstrate that QGSA reduces cross-attention FLOPs by 91–99% across input resolutions, achieving up to 76× theoretical speedup at 576px resolution, while maintaining exact accuracy parity with the dense baseline (Δ = 0.0 pp on all datasets). Wall-clock parity with the dense baseline is reached at 336px; realised end-to-end speedup requires larger models where cross-attention dominates total compute. QGSA consistently selects an average of k ≈ 17 patches out of 576 (256M model), up to k ≈ 18 (500M model), yielding up to a 34× reduction in the visual token sequence. These small-scale results validate the feasibility of question-conditioned sparse attention and provide a foundation for scaling to larger VLMs.
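To make the first component more concrete, the following is a minimal, dependency-free sketch of the general Gumbel-Softmax idea: a soft, noise-perturbed distribution over patches during training and a hard top-k choice at inference. It is a generic illustration under stated assumptions, not the QGSA implementation. The relevance scores are random stand-ins for question-conditioned scores, k is fixed at 17 rather than adapted to question complexity, and all function names and the temperature value are hypothetical.

```python
import math
import random


def gumbel_noise(rng: random.Random) -> float:
    """Sample from Gumbel(0, 1) via the inverse CDF."""
    # Clamp away from 0 and 1 so both logarithms stay finite.
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def soft_selection(scores, temperature, rng):
    """Training-time relaxation: softmax over Gumbel-perturbed scores."""
    perturbed = [(s + gumbel_noise(rng)) / temperature for s in scores]
    peak = max(perturbed)  # subtract the max for numerical stability
    exps = [math.exp(p - peak) for p in perturbed]
    total = sum(exps)
    return [e / total for e in exps]


def hard_top_k(scores, k):
    """Inference-time selection: indices of the k highest-scoring patches."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


rng = random.Random(0)
num_patches, k = 576, 17  # patch count and average k quoted in the abstract

# Stand-in relevance scores; a real selector would compute these from the
# question and patch embeddings.
scores = [rng.gauss(0.0, 1.0) for _ in range(num_patches)]

weights = soft_selection(scores, temperature=0.5, rng=rng)  # sums to 1
selected = hard_top_k(scores, k)                            # 17 patch indices
reduction = num_patches / k                                 # about 33.9
```

The last line reproduces the arithmetic behind the reported token reduction: keeping 17 of 576 patches shortens the visual sequence by a factor of roughly 33.9, consistent with the "up to 34×" figure.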