Axel G. Ekstrom


2026

We present a pipeline for deep neural network assisted modeling and analysis of the behavior of an acoustic tube. The vocal tract is represented as a series of cylindrical tube segments, each characterized by fixed length and variable cross-sectional area. A large synthetic dataset of such tube configurations is generated, and a circuit theory–based algorithm predicts corresponding formant frequencies. To explore mapping between vocal tract shapes and formant values, the pipeline integrates both linear regression and nonlinear machine learning models - including multilayer perceptrons. Model interpretability is measured using Shapley Additive Explanations (SHAP), which quantifies the contribution of each segment to predicted formant frequencies. The proposed framework enables detailed exploration of the articulatory-acoustic relationships inherent to an acoustic tube and vocal tract simulacrum. We present and describe the pipeline in the context of modeling effects of perturbations on the first three formants for a 16-cm tube, divided into 1 cm segments. Our pipeline can be applied to any method that models predictions of behavior of an acoustic tube, where the tube is conceived as a series of segmented units.
This study investigates the stability and convergence of vowel formants (F1, F2, F3) in read speech through an extensive corpus of audiobook recordings. While most formant studies rely on brief, isolated utterances recorded in laboratory settings, this analysis draws on 3,384 chapters (about 942 hours) of continuous, stylistically varied speech from publicly available audiobooks. The data was processed using an automated pipeline that comprised transcription, phoneme alignment, and formant extraction. Several statistical techniques – First Token Within (FTW), Cumulative Sum (CUSUM), Two-Sample t-Test, Confidence Interval (CI) Shrinkage, Piecewise Linear Fitting (PWLF), and Binary Segmentation (BinSeg) – were compared for their effectiveness in identifying stabilization points. Findings indicate that formant means generally stabilize within 60 to 230 vowel tokens per phoneme, dependent on vowel type and speaker gender. Of the methods that were evaluated, CUSUM yielded the most consistent and informative results. The results provide practical guidelines for determining the quantity of non-laboratory speech required to obtain reliable vowel formant averages.