Md Mushfiqur Rahman
There has been little systematic study on how dialectal differences affect toxicity detection by modern LLMs. Furthermore, although using LLMs as evaluators (“LLM-as-a-judge”) is a growing research area, their sensitivity to dialectal nuances is still underexplored and requires more focused attention. In this paper, we address these gaps through a comprehensive toxicity evaluation of LLMs across diverse dialects. We create a multi-dialect dataset through synthetic transformations and human-assisted translations, covering 10 language clusters and 60 varieties. We then evaluate five LLMs on their ability to assess toxicity, measuring multilingual, dialectal, and LLM-human consistency. Our findings show that LLMs are sensitive to both dialectal shifts and low-resource multilingual variation, though the most persistent challenge remains aligning their predictions with human judgments.
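The abstract describes two measurements: whether an LLM gives the same toxicity label to parallel texts across dialects, and how well its labels align with human judgments. Below is a minimal sketch of both metrics, not the authors' released code; query_llm_toxicity is a hypothetical stand-in for whatever model API is used, and the data fields are assumptions.

```python
# Minimal sketch of dialectal-consistency and LLM-human agreement metrics.
# `query_llm_toxicity` is a hypothetical stand-in for the judged LLM.
from typing import Callable, Dict, List


def dialectal_consistency(
    parallel_texts: List[Dict[str, str]],        # each item: {"standard": ..., "dialect": ...} (assumed schema)
    query_llm_toxicity: Callable[[str], bool],   # hypothetical: returns True if the LLM judges the text toxic
) -> float:
    """Fraction of parallel pairs where the LLM assigns the same toxicity
    label to the standard-variety text and its dialectal counterpart."""
    matches = 0
    for pair in parallel_texts:
        same = query_llm_toxicity(pair["standard"]) == query_llm_toxicity(pair["dialect"])
        matches += int(same)
    return matches / len(parallel_texts)


def llm_human_agreement(
    texts: List[str],
    human_labels: List[bool],
    query_llm_toxicity: Callable[[str], bool],
) -> float:
    """Simple accuracy of LLM toxicity labels against human judgments."""
    preds = [query_llm_toxicity(t) for t in texts]
    return sum(p == h for p, h in zip(preds, human_labels)) / len(texts)
```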
Patients often ask semantically similar medical questions in linguistically diverse ways that vary in readability, tone, and background knowledge. A robust question answering (QA) system should both provide semantically consistent answers across stylistic differences and adapt its response style to match the user's input; however, existing QA evaluations rarely test this capability, creating critical gaps that undermine accessibility and health literacy. We introduce SPQA, an evaluation framework and benchmark that applies controlled stylistic perturbations to consumer health questions while preserving semantic intent, then measures how model answers change across correctness, completeness, coherence, fluency, and linguistic adaptability using a human-validated LLM-based judge. The style axes include reading level, formality, and patient background knowledge; all perturbations are grounded in human annotations to ensure fidelity and alignment with human judgments. Our contributions include a readability-aware evaluation methodology, a style-diverse benchmark with human-grounded perturbations, and an automated evaluation pipeline validated against expert judgments. Evaluation results across multiple health QA models indicate that stylistic perturbations lead to measurable performance degradation even when semantic intent is preserved. The largest drops occur in answer correctness and completeness, and models also show limited ability to adapt their style to match the input. These findings underscore the risk of inequitable information delivery and highlight the need for accessibility-aware QA evaluation.
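The perturb-then-judge loop the abstract describes can be sketched as follows. This is an illustration under stated assumptions, not the SPQA release: rewrite_question, answer_model, and judge_answer are hypothetical stand-ins for the perturbation step, the health-QA system under test, and the human-validated LLM judge; the axis names follow the abstract.

```python
# Sketch of a style-robustness evaluation: perturb each question along one style
# axis, answer both versions, judge both answers, and report per-axis score drops.
from statistics import mean
from typing import Callable, Dict, List

AXES = ["correctness", "completeness", "coherence", "fluency", "adaptability"]


def evaluate_style_robustness(
    questions: List[str],
    style: str,                                            # e.g. "grade-5 reading level" (assumed axis value)
    rewrite_question: Callable[[str, str], str],           # hypothetical: style-perturb while preserving intent
    answer_model: Callable[[str], str],                    # hypothetical: QA system under evaluation
    judge_answer: Callable[[str, str], Dict[str, float]],  # hypothetical: scores an answer along AXES
) -> Dict[str, float]:
    """Average per-axis score drop between answers to original and perturbed questions."""
    drops: Dict[str, List[float]] = {axis: [] for axis in AXES}
    for q in questions:
        q_perturbed = rewrite_question(q, style)
        base = judge_answer(q, answer_model(q))
        pert = judge_answer(q_perturbed, answer_model(q_perturbed))
        for axis in AXES:
            drops[axis].append(base[axis] - pert[axis])
    return {axis: mean(vals) for axis, vals in drops.items()}
```

A positive drop on an axis means the model's answers degrade on that dimension when only the style of the question changes.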
As voice assistants cement their place in our technologically advanced society, there remains a need to cater to the diverse linguistic landscape, including colloquial forms of low-resource languages. Our study introduces the first comprehensive dataset for intent detection and slot filling in formal Bangla, colloquial Bangla, and Sylheti, totaling 984 samples across 10 unique intents. Our analysis reveals the robustness of large language models in tackling downstream tasks with scarce data. The GPT-3.5 model achieves an impressive F1 score of 0.94 in intent detection and 0.51 in slot filling for colloquial Bangla.
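A minimal sketch of how prompt-based intent detection and macro-F1 scoring could look for this setting; this is an assumption-laden illustration, not the paper's pipeline. call_gpt is a hypothetical wrapper around whichever chat-completion API is used, and the intent names and prompt wording are illustrative.

```python
# Sketch of zero-shot intent classification with an LLM, scored by macro F1.
# `call_gpt` is a hypothetical function that sends a prompt and returns the model's text reply.
from typing import Callable, List


def classify_intent(utterance: str, intents: List[str], call_gpt: Callable[[str], str]) -> str:
    """Ask the model to pick one intent label for a (possibly colloquial) utterance."""
    prompt = (
        "Choose the single best intent for the following utterance.\n"
        f"Intents: {', '.join(intents)}\n"
        f"Utterance: {utterance}\nIntent:"
    )
    return call_gpt(prompt).strip()


def macro_f1(gold: List[str], pred: List[str], intents: List[str]) -> float:
    """Unweighted mean of per-intent F1 scores."""
    f1s = []
    for intent in intents:
        tp = sum(g == intent and p == intent for g, p in zip(gold, pred))
        fp = sum(g != intent and p == intent for g, p in zip(gold, pred))
        fn = sum(g == intent and p != intent for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```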