Arshia Kermani
2026
Temporal-Linguistic Adaptive Streaming for Continuous Sign Language Translation
Arshia Kermani | Habib Irani | Deautaun Ross | Vangelis Metsis
Proceedings of the 4th Workshop on Advances in Language and Vision Research (ALVR)
Arshia Kermani | Habib Irani | Deautaun Ross | Vangelis Metsis
Proceedings of the 4th Workshop on Advances in Language and Vision Research (ALVR)
Real-time sign language translation must generate text incrementally as signs arrive, yet existing streaming policies treat glosses as a flat token sequence and discard the temporal rhythm of signing. Inter-gloss pauses reliably mark sentence boundaries in continuous discourse, but policies such as Wait-k cause arbitrary cross-boundary fragmentation. We propose Temporal-Linguistic Adaptive Streaming (TLAS), which fuses a Temporal Pause Detector (TPD, tracking inter-gloss interval statistics via an exponential moving average) and a Linguistic Readiness Estimator (LRE, a trained neural head on a frozen T5 encoder) through an Adaptive Fusion Gate (AFG). A proactive timeout fires before the next gloss arrives when the inter-gloss gap exceeds a threshold, producing clean sentence segmentation without oracle boundary information. We also contribute a synthetic discourse dataset of 1,400 ASL discourse groups with LLM-generated per-gloss timestamps and introduce a continuous-stream evaluation paradigm requiring autonomous boundary detection from an unbroken gloss stream. Under such conditions, TLAS significantly outperforms current heuristic baselines, such as Wait-k, and methods relying solely on linguistic content.
2025
Finetuning Pre-trained Language Models for Bidirectional Sign Language Gloss to Text Translation
Arshia Kermani | Habib Irani | Vangelis Metsis
Proceedings of the Workshop on Sign Language Processing (WSLP)
Arshia Kermani | Habib Irani | Vangelis Metsis
Proceedings of the Workshop on Sign Language Processing (WSLP)
Sign Language Translation (SLT) is a crucial technology for fostering communication accessibility for the Deaf and Hard-of-Hearing (DHH) community. A dominant approach in SLT involves a two-stage pipeline: first, transcribing video to sign language glosses, and then translating these glosses into natural text. This second stage, gloss-to-text translation, is a challenging, low-resource machine translation task due to data scarcity and significant syntactic divergence. While prior work has often relied on training translation models from scratch, we show that fine-tuning large, pre-trained language models (PLMs) offers a more effective and data-efficient paradigm. In this work, we conduct a comprehensive bidirectional evaluation of several PLMs (T5, Flan-T5, mBART, and Llama) on this task. We use a collection of popular SLT datasets (RWTH-PHOENIX-14T, SIGNUM, and ASLG-PC12) and evaluate performance using standard machine translation metrics. Our results show that fine-tuned PLMs consistently and significantly outperform Transformer models trained from scratch, establishing new state-of-the-art results. Crucially, our bidirectional analysis reveals a significant performance gap, with Text-to-Gloss translation posing a greater challenge than Gloss-to-Text. We conclude that leveraging the linguistic knowledge of pre-trained models is a superior strategy for gloss translation and provides a more practical foundation for building robust, real-world SLT systems.
A Systematic Evaluation of LLM Strategies for Mental Health Text Analysis: Fine-tuning vs. Prompt Engineering vs. RAG
Arshia Kermani | Veronica Perez-Rosas | Vangelis Metsis
Proceedings of the 10th Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2025)
Arshia Kermani | Veronica Perez-Rosas | Vangelis Metsis
Proceedings of the 10th Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2025)
This study presents a systematic comparison of three approaches for the analysis of mental health text using large language models (LLMs): prompt engineering, retrieval augmented generation (RAG), and fine-tuning. Using LLaMA 3, we evaluate these approaches on emotion classification and mental health condition detection tasks across two datasets. Fine-tuning achieves the highest accuracy (91% for emotion classification, 80% for mental health conditions) but requires substantial computational resources and large training sets, while prompt engineering and RAG offer more flexible deployment with moderate performance (40-68% accuracy). Our findings provide practical insights for implementing LLM-based solutions in mental health applications, highlighting the trade-offs between accuracy, computational requirements, and deployment flexibility.