Hyun-woo Cho


2026

Real-time speech translation with large language models (LLMs) has become feasible in controlled wideband settings (mobile apps, web browsers, and end-to-end full-duplex systems that push latency below 200 ms), where developers can assume client-side echo cancellation. However, deploying such systems over the Public Switched Telephone Network (PSTN) remains challenging due to narrowband G.711 audio, unpredictable round-trip delays, and the absence of client-side signal processing. We present **WIGVO** (WIGTN Voice-Only), a server-side relay system that enables bidirectional LLM-based speech translation over ordinary telephone calls without requiring app installation or carrier integration. A central contribution is addressing what we term *echo-induced self-reinforcing translation loops*: synthesized speech that echoes back through the PSTN is re-ingested and translated again, repeatedly. WIGVO solves this through a dual-session architecture with deterministic silence injection and energy-based voice activity detection (VAD) gating. We evaluate WIGVO on 155 Korean–English PSTN calls (148 instrumented, 147 completed) across three communication modes: voice-to-voice (V2V), text-to-voice (T2V), and full-agent. We observe a median caller-to-callee latency of 555 ms, a median callee-to-caller latency of 2,684 ms, zero echo-induced translation loops, COMET semantic adequacy of 0.71 (en→ko) and 0.62 (ko→en) against offline LLM references, and a cost of USD 0.28 per minute. The system is deployed at https://wigvo.wigtn.com, with a video walkthrough at https://youtu.be/4Uf6zMPOInY. Evaluation scripts and anonymized call logs are available in the open-source repository.
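
To make the echo-suppression idea concrete, the following is a minimal sketch of energy-based VAD gating combined with silence injection. It is an illustration under stated assumptions, not WIGVO's implementation: the class name `EchoGate`, the -35 dBFS threshold, the 500 ms guard window, and the 20 ms frame size are all hypothetical, and the sketch assumes G.711 audio has already been decoded to 16-bit linear PCM at 8 kHz. While synthesized speech is playing toward the PSTN, and for a short guard window afterwards, inbound frames are replaced with zeros so that echoed speech never reaches the translation session; outside that window, only frames whose RMS energy exceeds the threshold are forwarded.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence

FRAME_SAMPLES = 160  # 20 ms at 8 kHz, the G.711 sampling rate (frame size is an assumption)


def rms_dbfs(frame: Sequence[int]) -> float:
    """RMS energy of a 16-bit PCM frame in dB relative to full scale."""
    if not frame:
        return -math.inf
    mean_sq = sum(s * s for s in frame) / len(frame)
    if mean_sq == 0:
        return -math.inf
    return 10.0 * math.log10(mean_sq / (32768.0 ** 2))


@dataclass
class EchoGate:
    """Hypothetical gate: silences inbound audio during and shortly after playback."""

    threshold_dbfs: float = -35.0  # assumed speech/non-speech energy threshold
    guard_frames: int = 25         # assumed guard window: 25 frames * 20 ms = 500 ms
    _guard_left: int = 0

    def process(self, frame: List[int], tts_playing: bool) -> List[int]:
        """Return the frame to forward to the translation session."""
        silence = [0] * len(frame)
        if tts_playing:
            # Synthesized speech is going out: re-arm the guard and inject silence.
            self._guard_left = self.guard_frames
            return silence
        if self._guard_left > 0:
            # Playback just ended: the echo may still be in flight on the PSTN.
            self._guard_left -= 1
            return silence
        if rms_dbfs(frame) < self.threshold_dbfs:
            # Below the energy threshold: treat as non-speech.
            return silence
        return frame
```

A caller would feed each inbound frame through `gate.process(frame, tts_playing)` before handing it to the translation session. Replacing gated frames with zeros rather than dropping them keeps the audio timeline continuous, which is one plausible reading of "silence injection"; a fixed guard window is the simplest choice and would likely need tuning against measured round-trip delays.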