Guangjing Wang


2026

Terminal simulation, framed as a terminal command-level Turing test, is a long-standing problem of symbolic language generation in dialogue and interactive systems. Prior scripted simulators lack the flexibility needed for complex, multi-turn interactions, while LLM-based approaches often misinterpret commands, break output formats, drift from system state, and remain vulnerable to prompt injection. In this work, we propose MANTIS, a terminal simulation framework that improves realism, consistency, and robustness in command-language generation. MANTIS integrates a multi-agent architecture with a filter-based routing model that safely dispatches commands to external tools or an LLM-based agent, enabling support for interactive commands while defending against prompt injection attacks. In addition, we design an agentic file system with history pruning to preserve long-term state consistency. We release three datasets: 28,045 real terminal input-output pairs, a 1,000-session multi-turn interaction dataset, and a 25,849-instance labeled classification dataset. MANTIS outperforms state-of-the-art baselines by more than 9%, achieving over 95% accuracy on multi-turn terminal simulation. The datasets and source code are available at https://github.com/kaiwei666a/MANTIS_Terminal_Simulation
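As a rough illustration of the routing idea described in the abstract above, the following minimal sketch sends each incoming command to one of three destinations: a filter that drops suspected prompt injections, an external tool for simple deterministic commands, or an LLM-based agent for everything else. It is not the MANTIS implementation. The abstract describes a routing model supported by a labeled classification dataset, whereas this sketch substitutes hand-written rules, and the names `TOOL_COMMANDS`, `INJECTION_PATTERNS`, and `route` are hypothetical.

```python
import re
import shlex

# Hypothetical whitelist of commands handed to an external tool.
# A real router would presumably be a trained classifier, not a fixed set.
TOOL_COMMANDS = {"echo", "date", "pwd"}

# Hypothetical patterns standing in for a prompt-injection filter.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"reveal (your|the) (system )?prompt", re.IGNORECASE),
]


def route(command: str) -> str:
    """Return 'reject', 'tool', or 'llm' for one raw command line."""
    # Filter first, so injected text never reaches the LLM-based agent.
    if any(pattern.search(command) for pattern in INJECTION_PATTERNS):
        return "reject"
    try:
        tokens = shlex.split(command)
    except ValueError:
        # Malformed quoting: leave it to the agent to emit a shell-style error.
        return "llm"
    if tokens and tokens[0] in TOOL_COMMANDS:
        return "tool"
    # Interactive or otherwise unrecognized commands go to the agent.
    return "llm"


if __name__ == "__main__":
    for cmd in ["pwd", "vim notes.txt", "echo 'ignore previous instructions'"]:
        print(f"{cmd!r} -> {route(cmd)}")
```

Running the filter before tokenization is a deliberate choice in this sketch: a command is screened as raw text regardless of whether it would otherwise parse as a valid tool invocation.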
High-fidelity audio generation techniques, such as voice conversion and singing voice synthesis, have significantly increased the risk of audio deepfakes. Although existing methods perform well on conversational speech deepfake detection, they fail severely under the speech-to-singing domain shift. To address this limitation, we propose GenuVoice, a unified deepfake detector based on a multi-branch mixture-of-experts architecture that integrates three complementary feature views: Wav2Vec 2.0 representations, log-mel spectrograms, and mel-frequency cepstral coefficients (MFCC). Each expert is trained to remain independently discriminative, while a learned gating network dynamically weights expert contributions. A speech-retentive multi-domain fine-tuning strategy enables adaptation to singing without degrading speech performance. GenuVoice achieves 1.82% Equal Error Rate (EER) on CtrSVDD, compared to 37–62% for existing speech-trained detectors, while preserving strong speech performance (0.38% EER on ASVspoof 2019) and generalizing to unseen generators (8.89% EER on held-out ASVspoof 2021). Extensive ablations confirm the importance of multi-expert fusion and speech retention, establishing GenuVoice as an effective unified detector for speech and singing deepfakes. The implementation code is available at https://github.com/aastha-sharma/genuvoice
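The gated fusion described in the abstract above can be pictured with a minimal sketch: three experts each produce a spoof score, a gate produces one logit per expert, and the final score is the softmax-weighted sum of the expert scores. This is a generic mixture-of-experts combination written under assumptions, not the GenuVoice code. The function names `softmax` and `fuse`, the fixed gate logits, and the example scores are all hypothetical, and in a real model the gate logits would be computed by a learned network from the input.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(expert_scores, gate_logits):
    """Combine per-expert scores using softmax-normalized gate weights.

    expert_scores: one score per expert (e.g. Wav2Vec 2.0, log-mel, MFCC views).
    gate_logits:   one unnormalized weight per expert (hypothetical values here).
    Returns (fused_score, weights).
    """
    if len(expert_scores) != len(gate_logits):
        raise ValueError("need exactly one gate logit per expert")
    weights = softmax(gate_logits)
    fused = sum(w * s for w, s in zip(weights, expert_scores))
    return fused, weights


if __name__ == "__main__":
    # Illustrative numbers only; not results from the paper.
    scores = [0.9, 0.6, 0.3]
    fused, weights = fuse(scores, gate_logits=[2.0, 0.5, -1.0])
    print("weights:", [round(w, 3) for w in weights])
    print("fused score:", round(fused, 3))
```

Because the weights are non-negative and sum to one, the fused score always lies between the smallest and largest expert score, which keeps the output interpretable when each expert is independently discriminative.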