Jonathan Richard Schwarz
2025
Automatic Expert Discovery in LLM Upcycling via Sparse Interpolated Mixture-of-Experts
Shengzhuang Chen | Ying Wei | Jonathan Richard Schwarz
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
We present Sparse Interpolated Mixture-of-Experts (SIMoE) instruction-tuning, an end-to-end algorithm for fine-tuning a dense pre-trained Large Language Model (LLM) into an MoE-style model with capabilities in multiple specialized domains. During instruction-tuning, SIMoE automatically identifies multiple specialized experts under a specified sparsity constraint, with each expert being a structurally sparse subset of the seed LLM’s parameters that corresponds to domain-specific knowledge within the data. SIMoE simultaneously learns an input-dependent expert merging strategy via a router network, leveraging rich cross-expert knowledge for superior downstream generalization. Empirically, SIMoE consistently achieves state-of-the-art performance on common instruction-tuning benchmarks while maintaining an optimal performance-compute trade-off relative to all baselines.
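As a reading aid, the following is a minimal sketch of the mechanism the abstract describes: each expert is a structurally sparse parameter offset on a frozen seed weight, and a router produces input-dependent merging coefficients. This is not the paper's implementation; the class name, the row-level masking granularity, the straight-through mask estimator, and the softmax router are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseInterpolatedLinear(nn.Module):
    """Hypothetical single-layer sketch of sparse interpolated expert merging."""

    def __init__(self, d_in, d_out, n_experts=4):
        super().__init__()
        # Frozen seed weight standing in for the pre-trained LLM's parameters.
        self.seed_weight = nn.Parameter(torch.randn(d_out, d_in) * 0.02,
                                        requires_grad=False)
        # Per-expert offsets; structural sparsity is imposed via row masks below.
        self.deltas = nn.Parameter(torch.zeros(n_experts, d_out, d_in))
        # Learnable logits for per-output-row masks, one set per expert.
        self.mask_logits = nn.Parameter(torch.zeros(n_experts, d_out))
        # Router mapping a pooled input representation to merging weights.
        self.router = nn.Linear(d_in, n_experts)

    def forward(self, x):
        # x: (batch, seq, d_in). Input-dependent merging coefficients (soft here).
        gate = F.softmax(self.router(x.mean(dim=1)), dim=-1)        # (B, E)
        # Hard row masks with a straight-through estimator for gradients.
        soft = torch.sigmoid(self.mask_logits)                       # (E, d_out)
        hard = (soft > 0.5).float()
        mask = hard + soft - soft.detach()
        sparse_deltas = self.deltas * mask.unsqueeze(-1)             # (E, d_out, d_in)
        # Interpolate expert offsets per example, then add to the seed weights.
        merged = torch.einsum("be,eoi->boi", gate, sparse_deltas) + self.seed_weight
        return torch.einsum("boi,bsi->bso", merged, x)               # (B, S, d_out)

x = torch.randn(2, 5, 16)
layer = SparseInterpolatedLinear(16, 32)
print(layer(x).shape)  # torch.Size([2, 5, 32])
```

In this sketch sparsity is structured at the granularity of output rows; the paper's actual sparsity constraint and merging strategy may differ.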
Beyond Pointwise Scores: Decomposed Criteria-Based Evaluation of LLM Responses
Fangyi Yu | Nabeel Seedat | Drahomira Herrmannova | Frank Schilder | Jonathan Richard Schwarz
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
Evaluating long-form answers in high-stakes domains such as law or medicine remains a fundamental challenge. Standard metrics like BLEU and ROUGE fail to capture semantic correctness, and current LLM-based evaluators often reduce nuanced aspects of answer quality into a single undifferentiated score. We introduce DeCE, a decomposed LLM evaluation framework that separates precision (factual accuracy and relevance) and recall (coverage of required concepts), using instance-specific criteria automatically extracted from gold answer requirements. DeCE is model-agnostic and domain-general, requiring no predefined taxonomies or handcrafted rubrics. We instantiate DeCE to evaluate different LLMs on a real-world legal QA task involving multi-jurisdictional reasoning and citation grounding. DeCE achieves substantially stronger correlation with expert judgments (r=0.78), compared to traditional metrics (r=0.12) and pointwise LLM scoring (r=0.35). It also reveals interpretable trade-offs: generalist models favor recall, while specialized models favor precision. Importantly, only 11.95% of LLM-generated criteria required expert revision, underscoring DeCE’s scalability. DeCE offers an interpretable and actionable LLM evaluation framework in expert domains.
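For concreteness, here is a minimal sketch of the decomposed precision/recall scoring the abstract describes. The helpers extract_criteria and judge are hypothetical stand-ins for LLM calls; the paper's actual prompts and criteria-extraction procedure are not reproduced here.

```python
from dataclasses import dataclass

@dataclass
class DecomposedScore:
    precision: float  # share of answer claims judged accurate and relevant
    recall: float     # share of gold-derived criteria covered by the answer

def extract_criteria(gold_answer: str) -> list[str]:
    # Placeholder: in the framework above, instance-specific criteria are
    # extracted automatically from the gold answer by an LLM.
    return [s.strip() for s in gold_answer.split(".") if s.strip()]

def judge(claim: str, context: str) -> bool:
    # Placeholder for an LLM judgment ("does `context` support `claim`?").
    return claim.lower() in context.lower()

def dece_score(answer: str, gold_answer: str) -> DecomposedScore:
    answer_claims = [s.strip() for s in answer.split(".") if s.strip()]
    criteria = extract_criteria(gold_answer)
    # Precision: each claim in the answer must be supported by the gold answer.
    supported = sum(judge(c, gold_answer) for c in answer_claims)
    precision = supported / max(len(answer_claims), 1)
    # Recall: each required criterion must be covered by the answer.
    covered = sum(judge(c, answer) for c in criteria)
    recall = covered / max(len(criteria), 1)
    return DecomposedScore(precision, recall)

print(dece_score("Courts apply strict scrutiny. The statute is federal.",
                 "Courts apply strict scrutiny. Remedies include damages."))
# DecomposedScore(precision=0.5, recall=0.5)
```

Separating the two scores is what makes the reported trade-off visible: a verbose generalist answer can score high recall with low precision, and vice versa for a terse specialist answer.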
Evaluating the Role of Verifiers in Test-Time Scaling for Legal Reasoning Tasks
Davide Romano | Jonathan Richard Schwarz | Daniele Giofrè
Proceedings of the Natural Legal Language Processing Workshop 2025
Test-time scaling (TTS) techniques can improve the performance of large language models (LLMs) at the expense of additional computation and latency. While TTS has proven effective in formal domains such as mathematics and programming (Snell et al., 2024; Chen et al., 2024), its value in argumentative domains such as law remains underexplored. We present an empirical study of verifier-based TTS methods for legal multiple-choice QA (MCQA) across five benchmarks. Using a family of seven reward models, we evaluate both outcome-level (Best-of-N) and process-level (tree search) verification under realistic low-N budgets. Our analysis systematically investigates how verifier utility is affected by key properties such as domain specialization, model size, and supervision type (process-supervised PRMs vs. outcome-only ORMs), even when verifiers are applied across different roles.
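As a reading aid, a minimal sketch of outcome-level Best-of-N verification, one of the two TTS strategies evaluated above: sample N candidate answers, score each with a reward model, and return the highest-scoring one. generate and reward_model are hypothetical placeholders for an LLM sampler and a trained outcome verifier (ORM); a process-level method would instead score intermediate reasoning steps during a tree search.

```python
import random

def generate(question: str) -> str:
    # Placeholder for sampling one candidate answer from an LLM.
    return random.choice(["A", "B", "C", "D"])

def reward_model(question: str, answer: str) -> float:
    # Placeholder for a verifier score; an ORM scores the final outcome only,
    # whereas a PRM would score each reasoning step and aggregate.
    return random.random()

def best_of_n(question: str, n: int = 8) -> str:
    # Sample N candidates, keep the one the verifier scores highest.
    candidates = [generate(question) for _ in range(n)]
    return max(candidates, key=lambda a: reward_model(question, a))

print(best_of_n("Which clause governs...?", n=8))
```

The "realistic low-N budgets" studied above correspond to keeping n small, since each extra candidate costs a full generation plus a verifier call.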