Too Helpful, Too Harmless, Too Honest or Just Right?

Gautam Siddharth Kashyap, Mark Dras, Usman Naseem


Abstract
Large Language Models (LLMs) exhibit strong performance across a wide range of NLP tasks, yet aligning their outputs with the principles of Helpfulness, Harmlessness, and Honesty (HHH) remains a persistent challenge. Existing methods often optimize for individual alignment dimensions in isolation, leading to trade-offs and inconsistent behavior. While Mixture-of-Experts (MoE) architectures offer modularity, they suffer from poorly calibrated routing, limiting their effectiveness in alignment tasks. We propose TrinityX, a modular alignment framework that incorporates a Mixture of Calibrated Experts (MoCaE) within the Transformer architecture. TrinityX leverages separately trained experts for each HHH dimension, integrating their outputs through a calibrated, task-adaptive routing mechanism that combines expert signals into a unified, alignment-aware representation. Extensive experiments on three standard alignment benchmarks, Alpaca (Helpfulness), BeaverTails (Harmlessness), and TruthfulQA (Honesty), demonstrate that TrinityX outperforms strong baselines, achieving relative improvements of 32.5% in win rate, 33.9% in safety score, and 28.4% in truthfulness. In addition, TrinityX reduces memory usage and inference latency by over 40% compared to prior MoE-based approaches. Ablation studies highlight the importance of calibrated routing, and cross-model evaluations confirm TrinityX’s generalization across diverse LLM backbones. Our code is available at: https://github.com/gskgautam/TrinityX
Anthology ID:
2025.emnlp-main.1510
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
29711–29722
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1510/
Cite (ACL):
Gautam Siddharth Kashyap, Mark Dras, and Usman Naseem. 2025. Too Helpful, Too Harmless, Too Honest or Just Right?. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 29711–29722, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Too Helpful, Too Harmless, Too Honest or Just Right? (Kashyap et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1510.pdf
Checklist:
 2025.emnlp-main.1510.checklist.pdf