Junrui Wei


2026

Maintaining a stable persona is central to sustained spoken role-playing, yet when an agent breaks character, current evaluations often do not isolate which component caused the failure, making fixes slow and ad hoc. We propose PED (Persona-Emotion Decoupling), a diagnostic evaluation framework that decomposes persona expression into two observable routes: what the agent says (text) and how it sounds (speech). PED operationalizes the affective slice of persona expression by projecting transcripts and audio into a shared affective measurement space for route-comparable, reference-based analyses of separability, drift, failures, and coupling. We demonstrate PED via two worked instantiations spanning an end-to-end Speech LLM and a cascaded LLM+TTS pipeline under a fixed dialogue protocol. Within this setting, PED surfaces four recurring diagnostic signatures: (i) route-level separability is bounded by reference overlap and can differ sharply across architectures, (ii) text-route drift is stress-linked and tends toward a neutral-heavy region, (iii) text-audio consistency is weakly coupled, yielding route-asymmetric failures, and (iv) audio-route structure can be materially shaped by an explicit intermediate style cue in cascaded pipelines. Overall, PED reframes holistic "voice+character" grading as turn-level, fault-localizing signals for faster debugging and iteration.
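To make the idea of turn-level, route-comparable fault localization concrete, the following is a minimal illustrative sketch, not the PED implementation. It assumes that each route (text and audio) has already been projected into the same discrete affect-label space as a probability distribution, and that a persona reference distribution is available. The label set, the choice of Jensen-Shannon divergence as the drift measure, the threshold value, and the function names `js_divergence` and `diagnose_turn` are all assumptions introduced here for illustration.

```python
import math

# Hypothetical shared affect label set; the actual measurement space may differ.
LABELS = ["neutral", "happy", "sad", "angry"]


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability in x contribute nothing.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def diagnose_turn(text_dist, audio_dist, reference, threshold=0.2):
    """Compare each route against the persona reference for a single turn.

    The threshold is an assumed, illustrative value, not a calibrated one.
    """
    text_drift = js_divergence(text_dist, reference)
    audio_drift = js_divergence(audio_dist, reference)
    text_fail = text_drift > threshold
    audio_fail = audio_drift > threshold

    if text_fail and audio_fail:
        fault = "both"
    elif text_fail:
        fault = "text"
    elif audio_fail:
        fault = "audio"
    else:
        fault = "none"

    return {
        "text_drift": text_drift,
        "audio_drift": audio_drift,
        # Low values mean the two routes agree with each other on this turn.
        "route_gap": js_divergence(text_dist, audio_dist),
        "fault": fault,
    }


# Example turn: the persona reference is mostly "happy"; the transcript has
# drifted toward "neutral" while the audio still matches the reference.
reference = [0.1, 0.7, 0.1, 0.1]
text_dist = [0.8, 0.1, 0.05, 0.05]
audio_dist = [0.1, 0.7, 0.1, 0.1]
result = diagnose_turn(text_dist, audio_dist, reference)
print(result["fault"])
```

In this example only the text route exceeds the drift threshold, so the failure is attributed to the text route rather than to the agent as a whole. This is the kind of route-asymmetric signal that a single holistic score would hide.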