Jeih-weih Hung

Also published as: Jeih-Weih Hung


2022

pdf
A Preliminary Study of the Application of Discrete Wavelet Transform Features in Conv-TasNet Speech Enhancement Model
Yan-Tong Chen | Zong-Tai Wu | Jeih-Weih Hung
Proceedings of the 34th Conference on Computational Linguistics and Speech Processing (ROCLING 2022)

Nowadays, time-domain features are widely used alongside frequency-domain features in speech enhancement (SE) networks and achieve excellent performance in eliminating noise from input utterances. This study investigates how to extract information from time-domain utterances to create more effective features for speech enhancement. We propose exploiting sub-signals residing in multiple acoustic frequency bands of the time-domain signal and integrating them into a unified feature set. Specifically, the discrete wavelet transform (DWT) is used to decompose each input frame signal into sub-band signals, and a projection fusion process is then applied to these signals to create the final features. The corresponding fusion strategy is bi-projection fusion (BPF); in short, BPF exploits the sigmoid function to create ratio masks for two feature sources. The concatenation of the fused DWT features and the time features serves as the encoder output of a celebrated SE framework, the fully convolutional time-domain audio separation network (Conv-TasNet), to estimate the mask and then produce the enhanced time-domain utterances. Evaluation experiments are conducted on the VoiceBank-DEMAND and VoiceBank-QUT tasks. The results reveal that the proposed method achieves higher speech quality and intelligibility than the original Conv-TasNet, which uses time features only, indicating that fusing DWT features created from the input utterances with the time features helps learn a superior Conv-TasNet for speech enhancement.
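
A minimal numpy sketch of the bi-projection fusion idea described in this abstract is given below; the projection parameters W_t, W_d, and b are hypothetical placeholders for weights a real model would learn, and the exact formulation in the paper may differ.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bi_projection_fusion(time_feat, dwt_feat, W_t, W_d, b):
    """Fuse two feature sources with a sigmoid ratio mask (bi-projection fusion sketch).

    time_feat, dwt_feat: (frames, dim) arrays from the two feature streams.
    W_t, W_d, b: placeholder projection parameters (learned in the real model).
    """
    mask = sigmoid(time_feat @ W_t + dwt_feat @ W_d + b)   # ratio mask in (0, 1)
    return mask * time_feat + (1.0 - mask) * dwt_feat      # complementary weighting of the sources
```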

pdf
Exploiting the compressed spectral loss for the learning of the DEMUCS speech enhancement network
Chi-En Dai | Qi-Wei Hong | Jeih-Weih Hung
Proceedings of the 34th Conference on Computational Linguistics and Speech Processing (ROCLING 2022)

This study aims to improve a highly effective speech enhancement technique, DEMUCS, by revising its loss function for learning. DEMUCS, developed by the Facebook team, is built on the Wave-U-Net and consists of convolutional encoding and decoding blocks with an LSTM layer in between. Although DEMUCS processes the input speech utterance purely in the time (wave) domain, its loss function combines a wave-domain L1 distance with a multi-scale short-time Fourier transform (STFT) loss; that is, both time- and frequency-domain features are taken into account in the learning of DEMUCS. In this study, we propose revising the STFT loss in DEMUCS by employing a compressed magnitude spectrogram, where the compression is done either by a power-law operation with a positive exponent less than one or by a logarithmic operation. We evaluate the presented framework on the VoiceBank-DEMAND database and task. The preliminary experimental results suggest that DEMUCS with the power-law compressed magnitude spectral loss outperforms the original DEMUCS, yielding test utterances with higher objective quality and intelligibility scores (PESQ and STOI). In contrast, the logarithmically compressed magnitude spectral loss does not benefit DEMUCS. Therefore, we reveal that DEMUCS can be further improved by properly revising the STFT terms of its loss function.
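
The numpy sketch below illustrates the power-law compressed magnitude spectral loss described above, reduced to a single scale for clarity; the frame length, hop size, and compression exponent are illustrative assumptions, not the settings used in the paper or in DEMUCS itself.

```python
import numpy as np

def compressed_mag_loss(clean, enhanced, n_fft=512, hop=128, power=0.3):
    """L1 distance between power-law compressed magnitude spectrograms (single scale)."""
    def mag_spec(x):
        # Framed, windowed FFT magnitude spectrogram.
        frames = np.lib.stride_tricks.sliding_window_view(x, n_fft)[::hop]
        return np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=-1))
    # Compress magnitudes with a positive exponent < 1 before taking the L1 distance.
    return np.mean(np.abs(mag_spec(clean) ** power - mag_spec(enhanced) ** power))
```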

2021

pdf
Employing low-pass filtered temporal speech features for the training of ideal ratio mask in speech enhancement
Yan-Tong Chen | Zi-Qiang Lin | Jeih-Weih Hung
Proceedings of the 33rd Conference on Computational Linguistics and Speech Processing (ROCLING 2021)

Masking-based speech enhancement pursues a multiplicative mask that is applied to the spectrogram of the noise-corrupted input utterance, and a deep neural network (DNN) is often used to learn the mask. In particular, features commonly used for automatic speech recognition can serve as the DNN input to learn a well-behaved mask that significantly reduces the noise distortion of processed utterances. This study proposes to preprocess the input speech features for the ideal ratio mask (IRM)-based DNN by low-pass filtering in order to alleviate the noise components. In particular, we employ the discrete wavelet transform (DWT) to decompose the temporal speech feature sequence and scale down the detail coefficients, which correspond to the high-pass portion of the sequence. Preliminary experiments conducted on a subset of the TIMIT corpus reveal that the proposed method makes the resulting IRM achieve higher speech quality and intelligibility for babble noise-corrupted signals than the original IRM, indicating that the low-pass filtered temporal feature sequence can learn a superior IRM network for speech enhancement.
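
As a rough sketch of the low-pass filtering described above, the code below uses PyWavelets to shrink the one-level DWT detail coefficients of each feature dimension's temporal sequence; the wavelet choice and scaling factor are illustrative assumptions rather than the paper's exact settings.

```python
import numpy as np
import pywt

def lowpass_feature_sequence(feat, wavelet="db4", detail_scale=0.2):
    """Attenuate the high-pass (detail) DWT coefficients of each feature dimension over time.

    feat: (frames, dims) speech feature matrix; returns an array of the same shape.
    """
    out = np.empty_like(feat)
    for d in range(feat.shape[1]):
        approx, detail = pywt.dwt(feat[:, d], wavelet)            # one-level DWT along time
        rec = pywt.idwt(approx, detail * detail_scale, wavelet)   # reconstruct with shrunken details
        out[:, d] = rec[: feat.shape[0]]                          # idwt may return one extra sample
    return out
```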

pdf
使用低通時序列語音特徵訓練理想比率遮罩法之語音強化 (Employing Low-Pass Filtered Temporal Speech Features for the Training of Ideal Ratio Mask in Speech Enhancement)
Yan-Tong Chen | Jeih-weih Hung
International Journal of Computational Linguistics & Chinese Language Processing, Volume 26, Number 2, December 2021

2020

pdf
基於深度聲學模型其狀態精確度最大化之強健語音特徵擷取的初步研究 (The Preliminary Study of Robust Speech Feature Extraction based on Maximizing the Accuracy of States in Deep Acoustic Models)
Li-Chia Chang | Jeih-weih Hung
International Journal of Computational Linguistics & Chinese Language Processing, Volume 25, Number 2, December 2020

pdf
The preliminary study of robust speech feature extraction based on maximizing the accuracy of states in deep acoustic models
Li-chia Chang | Jeih-weih Hung
Proceedings of the 32nd Conference on Computational Linguistics and Speech Processing (ROCLING 2020)

pdf
Multi-view Attention-based Speech Enhancement Model for Noise-robust Automatic Speech Recognition
Fu-An Chao | Jeih-weih Hung | Berlin Chen
Proceedings of the 32nd Conference on Computational Linguistics and Speech Processing (ROCLING 2020)

2019

pdf
Speech enhancement based on the integration of fully convolutional network, temporal lowpass filtering and spectrogram masking
Kuan-Yi Liu | Syu-Siang Wang | Yu Tsao | Jeih-weih Hung
Proceedings of the 31st Conference on Computational Linguistics and Speech Processing (ROCLING 2019)

2017

pdf
多樣訊雜比之訓練語料於降噪自動編碼器其語音強化功能之初步研究 (A Preliminary Study of Various SNR-level Training Data in the Denoising Auto-encoder (DAE) Technique for Speech Enhancement) [In Chinese]
Shih-Kuang Lee | Syu-Siang Wang | Yu Tsao | Jeih-weih Hung
Proceedings of the 29th Conference on Computational Linguistics and Speech Processing (ROCLING 2017)

2013

pdf bib
分頻式調變頻譜分解於強健性語音辨識 (Sub-band modulation spectrum factorization in robust speech recognition) [In Chinese]
Hao-teng Fan | Yi-zhang Cai | Jeih-weih Hung
Proceedings of the 25th Conference on Computational Linguistics and Speech Processing (ROCLING 2013)

pdf
雜訊環境下應用線性估測編碼於特徵時序列之強健性語音辨識 (Employing linear prediction coding in feature time sequences for robust speech recognition in noisy environments) [In Chinese]
Hao-teng Fan | Wen-yu Tseng | Jeih-weih Hung
Proceedings of the 25th Conference on Computational Linguistics and Speech Processing (ROCLING 2013)

pdf
雜訊環境下應用線性估測編碼於特徵時序列之強健性語音辨識 (Employing Linear Prediction Coding in Feature Time Sequences for Robust Speech Recognition in Noisy Environments) [In Chinese]
Hao-teng Fan | Wen-yu Tseng | Jeih-weih Hung
International Journal of Computational Linguistics & Chinese Language Processing, Volume 18, Number 4, December 2013-Special Issue on Selected Papers from ROCLING XXV

2012

pdf bib
改良式統計圖等化法強健性語音辨識之研究 (Improved Histogram Equalization Methods for Robust Speech Recognition) [In Chinese]
Hsin-Ju Hsieh | Jeih-weih Hung | Berlin Chen
Proceedings of the 24th Conference on Computational Linguistics and Speech Processing (ROCLING 2012)

pdf
語音辨識使用統計圖等化方法 (Speech Recognition Leveraging Histogram Equalization Methods) [In Chinese]
Hsin-Ju Hsieh | Jeih-weih Hung | Berlin Chen
International Journal of Computational Linguistics & Chinese Language Processing, Volume 17, Number 4, December 2012-Special Issue on Selected Papers from ROCLING XXIV

2011

pdf bib
Compensating the Speech Features via Discrete Cosine Transform for Robust Speech Recognition (基於離散餘弦轉換之語音特徵的強健性補償法)
Hsin-Ju Hsieh | Wen-hsiang Tu | Jeih-weih Hung
Proceedings of the 23rd Conference on Computational Linguistics and Speech Processing (ROCLING 2011)

pdf bib
機率式調變頻譜分解於強健性語音辨識 (Probabilistic Modulation Spectrum Factorization for Robust Speech Recognition) [In Chinese]
Wen-Yi Chu | Yu-Chen Kao | Berlin Chen | Jeih-Weih Hung
ROCLING 2011 Poster Papers

2010

bib
Proceedings of the 22nd Conference on Computational Linguistics and Speech Processing (ROCLING 2010)
Shih-Hung Wu | Jeih-weih Hung
Proceedings of the 22nd Conference on Computational Linguistics and Speech Processing (ROCLING 2010)

pdf
最小變異數調變頻譜濾波器於強健性語音辨識之研究 (A Study of Minimum Variance Modulation Filter for Robust Speech Recognition) [In Chinese]
Ren-hau Hsieh | Hao-teng Fan | Jeih-weih Hung
Proceedings of the 22nd Conference on Computational Linguistics and Speech Processing (ROCLING 2010)

pdf
進階式調變頻譜補償法於強健性語音辨識之研究 (Advanced Modulation Spectrum Compensation Techniques for Robust Speech Recognition) [In Chinese]
Wei-Jeih Yeh | Wen-hsiang Tu | Jeih-weih Hung
Proceedings of the 22nd Conference on Computational Linguistics and Speech Processing (ROCLING 2010)

2009

bib
Proceedings of the 21st Conference on Computational Linguistics and Speech Processing
June-Jei Kuo | Jeih-Weih Hung
Proceedings of the 21st Conference on Computational Linguistics and Speech Processing

pdf
強健性語音辨識中分頻段調變頻譜補償之研究 (A Study of Sub-band Modulation Spectrum Compensation for Robust Speech Recognition) [In Chinese]
Sheng-yuan Huang | Wen-hsiang Tu | Jeih-weih Hung
Proceedings of the 21st Conference on Computational Linguistics and Speech Processing

pdf
強健性語音辨識中基於小波轉換之分頻統計補償技術的研究 (A Study of Sub-band Feature Statistics Compensation Techniques Based on a Discrete Wavelet Transform for Robust Speech Recognition) [In Chinese]
Hao-teng Fan | Wen-Hsiang Tu | Jeih-weih Hung
Proceedings of the 21st Conference on Computational Linguistics and Speech Processing

pdf
併合式倒頻譜統計正規化技術於強健性語音辨識之研究 (A Study of Hybrid-based Cepstral Statistics Normalization Techniques for Robust Speech Recognition) [In Chinese]
Guan-min He | Wen-Hsiang Tu | Jeih-weih Hung
Proceedings of the 21st Conference on Computational Linguistics and Speech Processing

pdf
Study of Associative Cepstral Statistics Normalization Techniques for Robust Speech Recognition in Additive Noise Environments
Wen-Hsiang Tu | Jeih-weih Hung
International Journal of Computational Linguistics & Chinese Language Processing, Volume 14, Number 1, March 2009

2008

pdf
調變頻譜正規化法使用於強健語音辨識之研究 (Study of Modulation Spectrum Normalization Techniques for Robust Speech Recognition) [In Chinese]
Chih-Cheng Wang | Wen-hsiang Tu | Jeih-weih Hung
Proceedings of the 20th Conference on Computational Linguistics and Speech Processing

pdf
強健性語音辨識中能量相關特徵之改良式正規化技術的研究 (Study of the Improved Normalization Techniques of Energy-Related Features for Robust Speech Recognition) [In Chinese]
Chi-an Pan | Wen-hsiang Tu | Jeih-weih Hung
Proceedings of the 20th Conference on Computational Linguistics and Speech Processing

pdf
組合式倒頻譜統計正規化法於強健性語音辨識之研究 (Associative Cepstral Statistics Normalization Techniques for Robust Speech Recognition) [In Chinese]
Wen-hsiang Tu | Kuang-chieh Wu | Jeih-weih Hung
Proceedings of the 20th Conference on Computational Linguistics and Speech Processing

2007

pdf
加成性雜訊環境下運用特徵參數統計補償法於強健性語音辨識 (Feature Statistics Compensation for Robust Speech Recognition in Additive Noise Environments) [In Chinese]
Tsung-hsueh Hsieh | Jeih-weih Hung
Proceedings of the 19th Conference on Computational Linguistics and Speech Processing

pdf
端點偵測技術在強健語音參數擷取之研究 (Study of the Voice Activity Detection Techniques for Robust Speech Feature Extraction) [In Chinese]
Wen-Hsiang Tu | Jeih-weih Hung
Proceedings of the 19th Conference on Computational Linguistics and Speech Processing