Prasanth


2026

While automatic text summarization has achieved remarkable success in English, extending these capabilities to low-resource languages remains a significant challenge due to the scarcity of labeled training data. We propose a translation-augmented approach to multilingual summarization: we systematically translate high-quality English summarization corpora into low-resource target languages using NLLB-200, and use the resulting parallel data to train and evaluate sequence-to-sequence models. We experiment across three typologically diverse languages (Swahili, Hausa, and Afrikaans), comparing monolingual fine-tuning (MONO), cross-lingual transfer (XLT), and joint multilingual training (TAMT) on mBART-large-50. Monolingual fine-tuning achieves the best performance for Swahili (ROUGE-L 13.9) and Afrikaans (ROUGE-L 15.7), surpassing the Lead-3 baseline in both cases, while cross-lingual transfer remains strongest for Hausa (ROUGE-L 14.5). We show that native language token availability in mBART-50 is a critical determinant of fine-tuning performance, and characterize the conditions under which the theoretically expected TAMT > MONO > XLT ordering breaks down. We release our dataset, code, and evaluation infrastructure to support future research on low-resource multilingual summarization.
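The abstract above compares models against a Lead-3 baseline using ROUGE-L. As a rough illustration of what those two terms mean, the following is a minimal sketch, not the authors' evaluation code. It assumes naive punctuation-based sentence splitting and lowercase whitespace tokenization; the function names `lead_3`, `lcs_length`, and `rouge_l_f1` are hypothetical, and the exact ROUGE variant and tokenizer used in the paper are not stated here.

```python
import re


def lead_3(document: str) -> str:
    """Lead-3 baseline: return the first three sentences of the document.

    Assumption: sentences end in '.', '!' or '?' followed by whitespace.
    """
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    return " ".join(sentences[:3])


def lcs_length(a, b) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Sentence-level ROUGE-L F1 in [0, 1], based on LCS precision and recall."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Scoring `lead_3(article)` against a reference summary with `rouge_l_f1` gives a baseline number for that article. The function returns a value between 0 and 1; the scores quoted in the abstract are presumably the same kind of quantity on a 0 to 100 scale.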
Visual Question Answering (VQA) models process all image patches uniformly despite questions typically requiring only a small subset of visual information. This inefficiency leads to unnecessary computation and can result in attention dilution across irrelevant image regions. We propose Question-Guided Sparse Attention (QGSA), a plug-and-play mechanism that dynamically selects relevant image patches conditioned on question semantics. Our approach introduces three components: (1) a differentiable patch selector based on Gumbel-Softmax reparameterisation that enables end-to-end training with hard patch selection at inference; (2) a self-supervised grounding loss that encourages spatial selectivity without bounding-box annotations, combining contrastive patch selection with patch–word alignment via a frozen CLIP encoder; and (3) an adaptive sparsity mechanism that adjusts the number of selected patches according to estimated question complexity. Experiments on SmolVLM-256M-Instruct and SmolVLM-500M-Instruct across three VQA benchmarks (VQA-RAD, A-OKVQA, RefCOCO) demonstrate that QGSA reduces cross-attention FLOPs by 91–99% across input resolutions, achieving up to 76× theoretical speedup at 576px resolution, while maintaining exact accuracy parity with the dense baseline (Δ = 0.0 pp on all datasets). Wall-clock parity with the dense baseline is reached at 336px; realised end-to-end speedup requires larger models where cross-attention dominates total compute. QGSA consistently selects an average of k≈17 patches out of 576 (256M model), up to k≈18 (500M model), yielding up to a 34× reduction in the visual token sequence. These small-scale results validate the feasibility of question-conditioned sparse attention and provide a foundation for scaling to larger VLMs.
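To make the selection idea in the abstract above more concrete, here is a minimal pure-Python sketch of a Gumbel-Softmax relaxation over per-patch relevance scores, with a hard top-k pick at inference. It is an illustration under stated assumptions, not the QGSA implementation: the abstract does not say how scores are computed or how k patches are chosen, so the top-k rule, the temperature `tau`, and the names `gumbel_softmax`, `select_top_k`, and `token_reduction` are all hypothetical.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Relaxed (soft) sample: softmax((logits + Gumbel noise) / tau).

    In a real model this would run on tensors so gradients can flow
    through the soft weights during training.
    """
    noisy = []
    for logit in logits:
        # Clamp the uniform draw away from 0 and 1 so both logs are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / tau)
    peak = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def select_top_k(logits, k):
    """Hard, noise-free selection (assumed inference rule): the indices of
    the k highest-scoring patches, returned in ascending index order."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return sorted(order[:k])


def token_reduction(num_patches, k):
    """Factor by which the visual token sequence shrinks when keeping k patches."""
    return num_patches / k
```

With the figures quoted in the abstract, `token_reduction(576, 17)` is about 33.9, which matches the stated reduction of up to 34×, and `token_reduction(576, 18)` is 32.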