Proceedings of the Second Workshop on Customizable NLP: Progress and Challenges in Customizing NLP for a Domain, Application, Group, or Individual (CustomNLP4U)

Sheshera Mysore, Sachin Kumar, Vidhisha Balachandran, Shirley Anugrah Hayati, Faeze Brahman, Hanane Nour Moussa, Alireza Salemi (Editors)

Anthology ID: 2026.customnlp4u-1
Month: July
Year: 2026
Address: San Diego, California, USA
Venues: CustomNLP4U | WS
Events: Annual Meeting of the Association for Computational Linguistics (2026) | The 2nd Workshop on Customizable NLP: Progress and Challenges in Customizing NLP for a Domain, Application, Group, or Individual (CustomNLP4U) (2026) | Other Workshops and Events (2026)
Publisher: Association for Computational Linguistics
URL: https://preview.aclanthology.org/ingest-acl-workshops/2026.customnlp4u-1/
ISBN: 979-8-89176-396-8
PDF: https://preview.aclanthology.org/ingest-acl-workshops/2026.customnlp4u-1.pdf

BAID: A Benchmark for Bias Assessment of AI Detectors
Priyam Basu | Yunfeng Zhang | Vipul Raheja

As AI-generated text detectors gain adoption in educational and professional contexts, their fairness remains underexamined. While prior research has uncovered isolated cases of bias, particularly against English Language Learners (ELLs), there is a lack of systematic evaluation of such systems across broader sociolinguistic factors. In this work, we propose a comprehensive evaluation framework for AI detectors across various types of biases. As part of this framework, we introduce a suite of targeted datasets spanning 7 major categories: demographics, age, educational grade level, dialect, formality, political leaning, and topic. Using this suite, we evaluate four open-source state-of-the-art AI text detectors and find consistent disparities in detection performance, particularly low recall rates for texts from underrepresented groups. Our contributions provide a scalable, transparent approach for auditing AI detectors and emphasize the need for bias-aware evaluation before these tools are deployed for public use.
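
The disparity described above can be made concrete by computing detector recall separately for each subgroup. The sketch below is illustrative only: the record layout, the group names, the toy data, and the `recall_by_group` helper are assumptions made for this example and are not part of BAID. It also assumes that AI-generated text is the positive class.

```python
from collections import defaultdict

def recall_by_group(records):
    """Return recall on AI-generated texts for each group.

    `records` is an iterable of (group, is_ai_generated, flagged_as_ai)
    tuples; this layout is a hypothetical one chosen for the example.
    """
    hits = defaultdict(int)    # AI-generated texts the detector flagged
    totals = defaultdict(int)  # all AI-generated texts seen per group
    for group, is_ai, flagged in records:
        if is_ai:
            totals[group] += 1
            hits[group] += int(flagged)
    return {g: hits[g] / totals[g] for g in totals}

# Toy data, invented for illustration only.
records = [
    ("dialect_a", True, True), ("dialect_a", True, True),
    ("dialect_a", True, False), ("dialect_a", False, False),
    ("dialect_b", True, True), ("dialect_b", True, False),
    ("dialect_b", True, False), ("dialect_b", True, False),
]
recalls = recall_by_group(records)
# Gap between the best-served and worst-served group.
recall_gap = max(recalls.values()) - min(recalls.values())
print(recalls, recall_gap)
```

Reporting the per-group values alongside the gap, rather than a single pooled recall, is what lets an audit of this kind surface groups for which a detector performs poorly.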

Small Language Models for the Democratization of Financial Literacy: Challenges and Opportunities
Tagore Rao Kosireddy | Jeffrey David Wall | Evan Lucas

This study seeks to test whether low-cost inference and efficient Small Language Models (SLMs) fine-tuned on existing open-source question answering datasets are capable of creating financial literacy chatbots that can answer financial questions for those with limited financial knowledge. The use of SLMs is growing in popularity across many domains, but they have not been thoroughly explored in the finance sector. This study offers an exploration of challenges and opportunities that exist in the finance sector to utilize SLMs for open-source financial question answering applications. In particular, this study examines the outputs of several open-source SLMs fine-tuned on the open-source FinGPT FiQA_QA financial question answering dataset. We fine-tuned two versions of each model, one with an instruction prompt and one without, and compared the model outputs with ground truth human responses from the dataset. Further qualitative rating and analysis are provided for model outputs and the dataset. The exploration highlighted challenges with available open data and the fine-tuned SLMs. Existing open datasets in the financial AI research community are not sufficient to produce high-quality outputs with SLMs. Successful fine-tuning of SLMs has occurred in other domains with high-quality datasets. We thus issue a call for new and better open financial question answering datasets that could result in higher-quality small language models.

From Understanding to Engagement: Personalized pharmacy Video Clips via Vision Language Models (VLMs)
Suyash Mishra | Qiang Li | Anubhav Girdhar | Srikanth Patil

Vision Language Models (VLMs) are poised to revolutionize the digital transformation of the pharmaceutical industry by enabling intelligent, scalable, and automated multi-modality content processing. Traditional manual annotation of heterogeneous data modalities (text, images, video, audio, and web links) is prone to inconsistencies, quality degradation, and inefficiencies in content utilization. The sheer volume of long video and audio data (e.g., long clinical trial interviews and educational seminars) further exacerbates these challenges. Here, we introduce a domain-adapted Video-to-Video Clip Generation framework that integrates Audio-Language Models (ALMs) and Vision Language Models (VLMs) to produce highlight clips. Our contributions are threefold: (i) a reproducible Cut Merge algorithm with fade-in/out and timestamp normalization, ensuring smooth transitions and audio/visual alignment; (ii) a personalization mechanism based on role definition and prompt injection for tailored outputs (marketing, training, regulatory); (iii) a cost-efficient e2e pipeline strategy balancing ALM/VLM-enhanced processing. Evaluations on the Video-MME benchmark (900) and our proprietary dataset of 16,159 pharmacy videos across 14 disease areas demonstrate 3–4× speedup, 4× cost reduction, and competitive clip quality. Beyond efficiency gains, we also report that our method improves clip coherence scores (0.348) and informativeness scores (0.721) over state-of-the-art VLM baselines (e.g., Gemini 2.5 Pro), highlighting the potential of transparent, custom extractive, and compliance-supporting video summarization for life sciences. Demo: https://video-clips-highlight-generator-338849523617.us-west1.run.app/.

Evaluating Customized vs. Generalist Transformer-based Models for Legal Contract Classification
Amrita Singh | H. Suhan Karaca | Aditya Joshi | Hye-young Paik | Jiaojiao Jiang

Despite advances in legal NLP, no comprehensive evaluation of Transformer-based models customized for legal tasks (referred to as ’legal-specific’ models in this paper) exists for contract classification tasks. To address this gap, we present an evaluation of 13 legal-specific Transformer-based models on 3 English-language contract classification tasks and compare them with 9 generalist models. The results show that legal-specific models consistently outperform generalist models, especially on tasks requiring nuanced legal understanding. They also help reduce misclassification of rare classes in imbalanced datasets. Legal-BERT and Contracts-BERT establish new SOTAs on two of the three tasks, despite having 69% fewer parameters than the best-performing generalist models. We also identify CaseLaw-BERT and LexLM as strong additional baselines for contract classification. Our results highlight the shortcomings of generalist models, emphasizing the need for domain-specific customization, particularly in the context of legal applications.

Personalizing News Headlines with Retrieval-Augmented Generation
Jiajing Wan | Samia Touileb | Lubos Steskal | Lilja Øvrelid

We focus on personalized news headline generation, where we aim to improve headline generation by extending the generation context to incorporate the news reading history of users. In particular, we study a RAG-LLM-based system that customizes news headlines with user histories to improve news headline personalization. Our experiments show that our approach not only produces better headlines for specific users, but also makes the generated headlines closer to the original headlines. We experiment with different retrievers and analyze the generated outputs through systematic comparisons with both original and rewritten headlines. These analyses provide insights into the role of retrieval and personalization in headline generation, highlighting how the user history contributes to meaningful improvement while remaining aligned with original headlines.

Building Multi-turn Intent Classification with LLM-based Labeling
Biancen Xie | Kaiqi Bian | Jai Ranjan Singh Gusain | Manikandarajan Ramanathan | Raj Maragoud

Intent classification is essential for customer service routing, connecting customers to the appropriate agents and reducing handling time and operational cost. Developing a real-world multi-turn intent classification system is challenging due to complex intent taxonomies, dynamic intent switching within conversations, and limited labeled training data. To address these challenges, we propose a scalable multi-turn intent classification framework for ecommerce customer service that models intent along multiple dimensions. We introduce LLM-based labeling strategies to annotate real customer transcripts at scale and augment training with LLM-simulated multi-turn dialogues that expand coverage of topic and intent switches, which are rare in existing transcripts. Through extensive experiments, we find that explanation-guided labeling with a self-critique step produces the most accurate training labels. Fine-tuned models built on a RoBERTa backbone outperform zero-shot LLM prompting while achieving substantially lower inference latency. Finally, we show that a hybrid approach that combines the fine-tuned classifier with LLM prompting further improves accuracy over either component alone. Overall, our results provide practical guidance for building and deploying high-accuracy, low-latency, large-scale multi-turn intent classification systems.

Cross-Tokenizer LLM Distillation through a Byte-Level Interface
Avyav Kumar Singh | Yen-Chen Wu | Alexandru Cioba | Alberto Bernacchia | Davide Buffelli

Cross-tokenizer distillation (CTD), the transfer of knowledge from a teacher to a student language model when the two use different tokenizers, remains a largely unsolved problem. Existing approaches rely on heuristic strategies to align mismatched vocabularies, introducing considerable complexity. In this paper, we propose a simple but effective baseline called Byte-Level Distillation (BLD) which enables CTD by operating at a common interface across tokenizers: the byte level. In more detail, we convert the teacher’s output distribution to byte-level probabilities, attach a lightweight byte-level decoder head to the student, and distill through this shared byte-level interface. Despite its simplicity, BLD performs competitively with, and on several benchmarks surpasses, significantly more sophisticated CTD methods, across a range of distillation tasks with models from 1B to 8B parameters. Our results suggest that the byte level is a natural common ground for cross-tokenizer knowledge transfer, while also highlighting that consistent improvements across all tasks and benchmarks remain elusive, underscoring that CTD is still an open problem.
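
One way to picture a shared byte-level interface is to collapse a next-token distribution into a distribution over the next byte. The sketch below is a simplified illustration under stated assumptions, not the authors' BLD procedure: it only marginalizes over the first UTF-8 byte of each token, and the `next_byte_distribution` helper, the toy vocabulary, and the probabilities are all invented for the example.

```python
from collections import defaultdict

def next_byte_distribution(token_probs):
    """Marginalize a next-token distribution onto the first byte.

    `token_probs` maps token strings to probabilities. Each token's
    mass is assigned to the first byte of its UTF-8 encoding, and the
    result is renormalized so it sums to 1.
    """
    byte_probs = defaultdict(float)
    for token, p in token_probs.items():
        encoded = token.encode("utf-8")
        if not encoded:  # skip empty tokens, which carry no byte
            continue
        byte_probs[encoded[0]] += p
    total = sum(byte_probs.values())
    return {b: p / total for b, p in byte_probs.items()}

# Hypothetical teacher distribution over four tokens.
teacher = {"the": 0.5, "then": 0.2, "a": 0.2, "an": 0.1}
byte_dist = next_byte_distribution(teacher)
print({chr(b): round(p, 3) for b, p in byte_dist.items()})
```

Because the keys of the result are byte values rather than vocabulary entries, two models with entirely different tokenizers produce distributions over the same 256-symbol space, which is what makes them directly comparable.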

Fine-grained Readability Controlled Summarization of Scientific Documents via Control Vectors
Isabel Cachola | Kuleen Sasse | Mark Dredze

Plain Language Summarization (PLS) generates summaries of technical documents accessible to non-expert audiences. Readability, commonly used to evaluate PLS, has often been treated coarsely (expert vs. lay), although it exists on a spectrum with different levels for different readers. We propose a lightweight control vector method for fine-grained readability control in scientific summarization along with a requirements-based framework for data selection. Our framework enforces two requirements: (1) readability levels differ substantially, and (2) paired examples share comparable content. Under this framework, control vectors enable more precise readability control than other popular methods.

Building a Custom Taxonomy of AI Skills and Tasks from the Ground Up with Job Postings
Stephen Meisenbacher | Peter Norlander

Utilizing LLMs for automated taxonomy construction presents a clear opportunity for the comprehensive, yet efficient mapping of potentially complex domains. When contending with high volumes of rapidly growing corpora, however, it becomes unclear how to best leverage such data for optimal taxonomy construction. Taking the case of systematizing *AI skills in the workplace*, we use two large-scale job postings corpora to investigate key design decisions for the inclusion (or exclusion) of data points for taxonomy construction. We propose **TaxonomyBuilder** as a blueprint for our systematic study, with which we evaluate various configurations of custom, data-informed, and hierarchical taxonomies. We demonstrate that *less* data can provide more clarity: filtering inputs to **TaxonomyBuilder** provides better domain-specific coverage than offering unfiltered inputs to clustering and LLM-enhanced hierarchical taxonomy labeling tools.

Large language models are known to be vulnerable to adversarial perturbations such as synonym-based word substitutions. However, previous analyses of adversarial influence focus only on output behavior and provide limited insight into the propagation of substitution-based input perturbations through internal representations. In this work, we introduce a topological data analysis (TDA) framework to study the structural effects of adversarial attacks on attention maps across model layers. We evaluate small encoder-based architectures (BERT, RoBERTa, DistilBERT) fine-tuned to solve binary classification on the IMDb review dataset, which were attacked using TextFooler. We convert attention maps into distance matrices and apply TDA to extract topological features, which we then compare using Wasserstein distances between original and perturbed features. In parallel, we compute a non-TDA baseline on attention maps using per-head L₁ distances between original and perturbed attentions. In addition, we analyze these models on a layer-by-layer basis. We find that adversarial perturbations induce systematic and statistically significant topological changes across layers, with the largest deviations occurring in late layers and smaller but notable effects in early layers. These patterns are consistent across models and are validated using both non-parametric (Kruskal–Wallis, Dunn) and parametric (one-way ANOVA, Tukey) tests on log-transformed Wasserstein distances. Compared to our non-TDA baseline, our results show more distinct layer-wise separation and provide a robust and interpretable framework for evaluating how adversarial perturbations alter internal model structure. Our code is publicly available at: https://github.com/angelinatsai04/mitll_clinic/tree/adam_spring.
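
Two of the steps named above, converting an attention map into a distance matrix and computing the L₁ baseline for one head, can be sketched in a few lines. This is a hedged illustration rather than the authors' code: the mapping `d[i][j] = 1 - max(a[i][j], a[j][i])` is one plausible choice assumed here, the helper names and toy matrices are invented, both maps are assumed to have the same shape, and the persistent-homology and Wasserstein steps are omitted.

```python
def attention_to_distance(attn):
    """Turn a square attention map into a symmetric distance matrix.

    Uses d[i][j] = 1 - max(a[i][j], a[j][i]) with a zero diagonal.
    This particular mapping is an assumption made for the example.
    """
    n = len(attn)
    return [
        [0.0 if i == j else 1.0 - max(attn[i][j], attn[j][i])
         for j in range(n)]
        for i in range(n)
    ]

def l1_distance(attn_a, attn_b):
    """Sum of absolute entrywise differences between two attention maps."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(attn_a, attn_b)
        for a, b in zip(row_a, row_b)
    )

# Toy 3-token attention maps for a single head (rows sum to 1).
original = [[0.7, 0.2, 0.1],
            [0.3, 0.6, 0.1],
            [0.2, 0.2, 0.6]]
perturbed = [[0.5, 0.4, 0.1],
             [0.3, 0.6, 0.1],
             [0.1, 0.3, 0.6]]

distance_matrix = attention_to_distance(original)
baseline = l1_distance(original, perturbed)
print(distance_matrix, baseline)
```

Strongly attended token pairs end up close together in the distance matrix, so a TDA library can treat each attention head as a small point cloud, while the L₁ value summarizes the same perturbation without any topological structure.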

Customizing ASR for Language Documentation and Resource Prioritization
Alexandra Fort | Shobhana Lakshmi Chelliah

Research in language documentation has the potential to benefit from integration of ASR models, especially through the assisted transcription of recordings with audio. Recent advancements in ASR for low-resource languages demonstrate the ability to adapt general, multilingual models for unseen languages with limited fine-tuning data, supporting the creation of custom ASR models. However, resources are still required to collect and prepare the fine-tuning data, which makes it necessary to explore how best to allocate resources during data collection and preparation. This paper outlines important considerations for the collection and preparation of data for customizing an ASR model for use in language documentation projects. With the development of a Lamkang ASR model as an example, prioritization of tasks within a language documentation project is outlined by analyzing the relative impact of time spent on transcription correction versus time spent on manual alignment on ASR model performance. Results from this research suggest prioritizing transcription correction over manual alignment of data and suggest that fine-tuning multilingual ASR systems produces results superior to those of zero-shot ASR models, despite recent advancements in the technology.

Improving Medical Hallucination Detection with System Combination and Rule-based Customization
Jonathan Lasko | Damianos Karakos | Francis Keith

The presence of factuality errors (hallucinations) in the outputs of patient-facing medical chatbots is a serious problem: they can lead to patient harm and erode people’s trust in the medical profession. For this reason, it is crucial to detect hallucinations in chatbot outputs and forward them to clinicians for review. In this paper, we present the system we built for detecting such errors: it consists of multiple LLM-powered detectors which are combined with a novel alignment procedure. We ran our system on the MedExpert-Benchmark dataset (Yarmohammadi et al., 2025) and our results on two use cases, Mental Health and Prenatal Care, show that the combined system yields gains over the individual systems. Additionally, we show that further customization of the system to each one of the use cases leads to further gains, but at the cost of reduced generalizability. Our code and dataset are available here: https://github.com/BBN-E/medic-customnlp4u.

Large language models are widely used by everyday users, and can be asked to perform tasks that require specialized expertise, such as interpreting contractual terms and conditions, filing personal taxes, or diagnosing medical symptoms. Although these tools should not be used in place of professional advice, they can be useful starting points for users seeking professional help, improving users’ access to and interactions with professionals. In this vein, this paper introduces a legal question reformulation task to assist non-experts in their interactions with lawyers. This has the potential to streamline discussions between lawyers and clients, who may not know the correct legal language to communicate their needs. Using a novel evaluation framework informed by legal expertise, we investigate the quality of model-generated legal question reformulations on in-the-wild data from non-experts seeking legal advice. Our findings indicate that LLMs have significant potential in legal reasoning, but some unexpected safety concerns may emerge. Further, adding linguistically aligned in-domain text samples can improve performance for smaller models, even when the samples are not aligned factually with the given question.

When Valid Signals Fail: Regime Boundaries Between LLM Features and RL Trading Policies
Zhengzhe Yang

Can large language models (LLMs) generate continuous numerical features that improve reinforcement learning (RL) trading agents? We build a modular pipeline where a frozen LLM serves as a stateless feature extractor, transforming unstructured daily news and filings into a fixed-dimensional vector consumed by a downstream PPO agent. We introduce an automated prompt-optimization loop that treats the extraction prompt as a discrete hyperparameter and tunes it directly against the Information Coefficient—the Spearman rank correlation between predicted and realized returns—rather than NLP losses. The optimized prompt discovers genuinely predictive features (IC above ∼0.15 on held-out data). However, these valid intermediate representations do not automatically translate into downstream task performance: during a distribution shift caused by a macroeconomic shock, LLM-derived features add noise, and the augmented agent under-performs a price-only baseline. In a calmer test regime the agent recovers, yet macroeconomic state variables remain the most robust driver of policy improvement. Our findings highlight a gap between feature-level validity and policy-level robustness that parallels known challenges in transfer learning under distribution shift.
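
The Information Coefficient named above is the Spearman rank correlation between predicted and realized returns, which can be computed as the Pearson correlation of the two rank vectors. The following is a minimal dependency-free sketch: the helper names and the five toy values are assumptions made for illustration, and it does not handle the degenerate case where either input is constant.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def information_coefficient(predicted, realized):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(predicted), average_ranks(realized))

# Toy scores for five assets on one day, invented for illustration.
predicted = [0.8, 0.1, 0.5, 0.3, 0.9]
realized = [0.02, -0.01, 0.03, 0.00, 0.05]
ic = information_coefficient(predicted, realized)
print(round(ic, 3))
```

Because only the ordering of the values matters, the IC rewards a feature for ranking assets correctly even when its numeric scale is arbitrary, which suits scores elicited from an LLM prompt.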

Unintended Effects of Geographic Conditioning in Large Language Models
Naz Col | David M. Chan

Modern conversational AI systems frequently rely on user metadata to localize responses, yet the unintended regional biases introduced by this hidden context remain poorly understood. In this work, we evaluate _location leakage_: the phenomenon where a model generates geographic references despite receiving a geographically neutral user prompt. Across both creative writing and open-ended Q&A prompts, even state-of-the-art LLMs systematically favor region-specific outputs when exposed to location metadata, with leakage spiking by up to 793 times above baseline (e.g., from 0.04% to 31.7% for Llama 3.1-8B, and 21.3% and 8.8% for Qwen3-8B and Claude Sonnet 4.6, respectively). Our analysis further shows a novel structural conditioning effect: replacing the injected location with the placeholder "Unknown" still elevates leakage by up to 72 times above baseline, demonstrating that the user profile frame itself, independent of any geographic content, acts as a generative conditioning signal.

Efficiency vs. Verifiability in Evidence-Aware RAG: Does Prompt Compression Preserve Citation Grounding?
Aiyu Li | Qian Peng | Bin Chen

Retrieval-augmented generation (RAG) is widely used in domain-specific and knowledge-intensive applications, where long prompts increase inference cost and may exceed context limits. Prompt compression is therefore appealing, but existing evaluations focus primarily on answer quality, overlooking whether compressed systems remain faithful to the retrieved evidence. In this paper, we ask: does compression that preserves answers also preserve grounding? Using Self-RAG and LLMLingua-2 in a controlled setting, we evaluate compressed RAG on ASQA in terms of both answer correctness and citation grounding. Under increasing compression, answer correctness drops by only 2-4%, whereas grounding drops by 40-50%. This stark divergence shows that answer-only evaluation can substantially overestimate the reliability of compressed RAG in evidence-aware scenarios. We further propose a lightweight hierarchical compression strategy that prioritizes evidence-bearing spans. It recovers nearly all grounding loss while maintaining comparable answer quality. Our results reveal a clear trade-off between efficiency and verifiability, and suggest that compression in RAG should be customized to downstream verification needs rather than treated as a one-size-fits-all efficiency intervention.

When Gradients Collide: Failure Modes of Multi-Objective Prompt Optimization for LLM Judges
Parth Darshan | Abhishek Divekar

Customizing an LLM judge to a specific task or domain often involves optimizing its prompt across multiple evaluation criteria simultaneously. Textual gradient methods automate this for a single judge criterion; however, they produce natural-language critiques, not numerical vectors. Thus, the conflict-resolution toolkit of multi-task learning (PCGrad, MGDA) does not apply to the multi-objective textual gradient setting. We test five decomposition modes of textual gradient optimizers by varying how much cross-task information the loss, gradient, and optimizer LLMs share. In 6 of 10 configurations on SummEval, we observe that optimization never improves over the initial prompt. Gradient specificity drops by 59% (from 9.0 to 3.7) when the gradient LLM processes multiple criteria jointly. Separately, we observe that naively combining per-task instructions into a single prompt degrades Spearman’s ρ by 5.3%. These results identify two separable failure modes: optimization-time gradient dilution and inference-time instruction interference, which together constrain the design space for multi-objective judge customization using textual feedback.