Regularized inverse filtering and machine learning methods for speech enhancement - the Helsinki Speech Challenge 2024

  • *Corresponding author: Kim Knudsen

KK was supported by The Villum Foundation (Grant No. 25893).

  • Speech enhancement can be seen as an ill-posed inverse problem modeled by convolution with an impulse response, where the goal is to recover the clean speech signal from a corrupted one. In this work, we propose several methods for solving this problem for low-pass filtered and reverberated speech signals. We first estimate the impulse responses from energy time curves and energy decay curves; the estimates are then used to construct inverse filters for deconvolution. The proposed methods combine, in different ways, spectral subtraction, deconvolution by inverse filtering, regularized inverse filtering, and a machine learning method based on convolutional neural networks. We systematically collect results on the performance of the methods across different cases and levels of complexity. The results show that no single method is superior across all tasks and levels, but that a successful approach should generally be based on both spectral subtraction and a mathematical impulse response model, possibly together with a neural network. The work was done in the context of the Helsinki Speech Challenge 2024.

    Mathematics Subject Classification: Primary: 15A29, 94A12; Secondary: 68T10.
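
As a rough illustration of the regularized inverse filtering mentioned in the abstract, the following sketch applies a Tikhonov-type inverse filter in the frequency domain. It is a toy under stated assumptions, not the authors' implementation: the function name `regularized_inverse_filter`, the parameter `alpha`, the synthetic impulse response, and the use of circular convolution are all illustrative choices.

```python
import numpy as np


def regularized_inverse_filter(y, h, alpha=1e-3):
    """Estimate x from y = h * x (circular convolution) in the frequency domain.

    Uses X_hat = conj(H) * Y / (|H|^2 + alpha).
    With alpha = 0 this reduces to the plain inverse filter Y / H.
    """
    n = len(y)
    H = np.fft.rfft(h, n)  # h is zero-padded to length n
    Y = np.fft.rfft(y, n)
    X_hat = np.conj(H) * Y / (np.abs(H) ** 2 + alpha)
    return np.fft.irfft(X_hat, n)


# Synthetic data (assumed for illustration only).
rng = np.random.default_rng(0)
n = 1024
x = rng.standard_normal(n)                 # stand-in for a clean signal
h = np.exp(-np.arange(64) / 8.0)           # hypothetical decaying impulse response
y_clean = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(h, n), n)
y_noisy = y_clean + 0.01 * rng.standard_normal(n)

x_plain = regularized_inverse_filter(y_noisy, h, alpha=0.0)   # no regularization
x_reg = regularized_inverse_filter(y_noisy, h, alpha=1e-2)    # Tikhonov-type damping
```

The regularization term `alpha` damps frequencies where `|H|` is small, which is where a plain inverse filter amplifies noise the most. Choosing `alpha` is a trade-off between noise suppression and bias, and the value used here is arbitrary.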

  • Figure 1.  Spectrogram comparison between a clean signal (left) and a recorded signal (right) for T1L3 in the HSC dataset. The figure is from [17]

    Figure 2.  Spectrogram comparison between a clean signal (left) and a recorded signal (right) for T2L2 in the HSC dataset. The figure is from [17]

    Figure 3.  Spectrogram comparison between a clean signal (left) and a recorded signal (right) for T3L2 in the HSC dataset. The figure is from [17]

    Figure 4.  Energy time curve (ETC) computed from the impulse response (T2L1) derived via deconvolution of a swept sine signal. The curve shows an initial decay followed by a flattening trend, indicating the presence of background noise

    Figure 5.  Energy decay curve (EDC) computed from the same impulse response (T2L1) using backward integration. The curve demonstrates exponential decay until it flattens

    Figure 6.  T1L2 results showing clean, recorded, and restored audio with IR-method in both spectrograms and waveforms

    Figure 7.  T2L2 results showing clean, recorded, and restored audio in both spectrograms and waveforms recovered using a regularized inverse filter

    Figure 8.  Spectrogram comparisons for T2L2 showing clean, recorded, and reconstructed signals using the IR method without regularization and with the regularization in (15)

    Figure 9.  T3L1 results showing clean, recorded, and restored audio in both spectrograms and waveforms recovered using a combination of two different impulse responses (one for a low-pass filter and one for reverb) with one of them regularized and VoiceFixer

    Figure 10.  Spectrogram showing the clean, recorded, and restored audio (T2L2) using the IR method and regularization with max-norm, Tikhonov, and Lasso as described by equations (16) and (17). For Tikhonov and Lasso regularization, the parameters are $ \alpha_T = \alpha_L = 0.1 $
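
The energy time curve and energy decay curve referred to in the captions of Figures 4 and 5 can be sketched as follows. This is a minimal sketch under assumptions: the ETC is simplified here to the squared impulse response in dB, the EDC uses Schroeder-style backward integration, and the function names and the synthetic impulse response are hypothetical, not taken from the paper.

```python
import numpy as np


def energy_time_curve_db(h, eps=1e-12):
    """ETC, simplified to the squared impulse response in dB relative to its peak."""
    energy = np.asarray(h, dtype=float) ** 2
    return 10.0 * np.log10(energy / energy.max() + eps)


def energy_decay_curve_db(h, eps=1e-12):
    """EDC by backward integration, in dB relative to the total energy."""
    energy = np.asarray(h, dtype=float) ** 2
    # remaining[k] = sum of energy from sample k to the end of the response
    remaining = np.cumsum(energy[::-1])[::-1]
    return 10.0 * np.log10(remaining / remaining[0] + eps)


# Synthetic impulse response (assumed): decaying noise plus a small noise floor.
fs = 16000
t = np.arange(fs) / fs
rng = np.random.default_rng(1)
h = np.exp(-t / 0.05) * rng.standard_normal(fs) + 1e-3 * rng.standard_normal(fs)

etc = energy_time_curve_db(h)
edc = energy_decay_curve_db(h)
```

Because the backward integral accumulates energy from the end of the response, the EDC is monotonically non-increasing and much smoother than the ETC, which makes the decay rate easier to read off. The additive noise term in the synthetic response is only there to mimic background noise in a measurement.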

    Table 1.  Mean character error rates (CER) for various methods on training data. We consider spectral subtraction (Spec. sub.), the impulse response (IR)-based recovery as described in Section 4.2, VoiceFixer (VF), and the combination IR + VF. The lowest rates for each level are shown in bold, while underlined rates correspond to the submitted winning solution methods. *Note that the Task 3 IR solutions are combinations of level-specific Task 1 and 2 filters, see Section 6.3

    Mean CER on Training data (%)
    Task       ID    Recorded  Spec. sub.  IR     Reg. IR  VoiceFixer  IR + VoiceFixer
    Filtering  T1L1  4.0       4.0         1.5    3.3      2.8         3.7
               T1L2  7.0       7.2         1.9    4.6      4.2         1.8
               T1L3  31.9      28.1        7.8    15.9     16.0        8.4
               T1L4  66.7      68.4        35.4   44.2     41.3        31.0
               T1L5  82.7      83.4        46.9   54.9     51.1        44.7
               T1L6  90.6      87.8        61.3   64.6     62.0        59.4
               T1L7  85.7      82.1        68.1   72.5     67.3        67.5
    Reverb     T2L1  12.7      12.3        36.4   17.5     13.1        34.0
               T2L2  47.5      47.9        44.4   23.4     40.2        46.3
               T2L3  55.8      56.9        21.3   11.3     47.4        24.0
    Combined   T3L1  90.8      92.2        65.6*  -        73.5        63.1*
               T3L2  99.9      99.9        77.8*  -        80.0        67.8*

    Table 2.  Mean character error rates (CER) for various methods on test data. We consider spectral subtraction (Spec. sub.), the impulse response (IR)-based recovery as described in Section 4.2, VoiceFixer (VF), and the combination IR + VF. The lowest rates for each level are shown in bold, while underlined rates correspond to the submitted winning solution methods. *Note that the Task 3 IR solutions are level-specific combinations of existing Task 1 and 2 filters, see Section 6.3

    Mean CER on Test data (%)
    Task       ID    Recorded  Spec. sub.  IR     Reg. IR  VoiceFixer  IR + VoiceFixer
    Filtering  T1L1  3.4       2.8         0.9    2.0      1.0         2.2
               T1L2  5.7       5.7         1.5    4.2      2.8         1.2
               T1L3  29.1      26.7        8.4    19.1     18.3        9.0
               T1L4  65.5      67.2        41.4   49.0     45.0        34.4
               T1L5  82.5      82.4        48.4   57.1     52.3        45.1
               T1L6  89.5      86.0        63.4   68.8     63.2        61.7
               T1L7  85.8      81.9        66.0   71.2     65.4        66.4
    Reverb     T2L1  12.2      12.8        39.4   18.7     12.5        36.5
               T2L2  47.3      48.1        46.3   27.8     41.6        45.4
               T2L3  56.0      57.4        23.1   11.3     49.7        25.5
    Combined   T3L1  91.5      93.2        72.6*  -        73.5        64.2*
               T3L2  99.7      99.8        78.5*  -        79.6        68.7*

    Table 3.  Mean character error rates (CER, in %) for various methods on a training data example from Task 2 Level 2

    Clean  Recorded  IR    Max-norm  Tikhonov  Lasso
    0.0    33.3      62.5  25.0      50.0      37.5
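
For readers unfamiliar with the metric reported in the tables, a character error rate can be computed as the character-level edit distance between a reference text and a transcription, divided by the reference length. The sketch below is a generic illustration; the function names are hypothetical, and the text normalization and speech recognition model used for the challenge scores are not reproduced here.

```python
def levenshtein(ref, hyp):
    """Minimum number of character insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, c in enumerate(hyp, start=1):
            cost = 0 if r == c else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer_percent(ref, hyp):
    """Character error rate in percent, relative to the reference length."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return 100.0 * levenshtein(ref, hyp) / len(ref)


example = cer_percent("abcd", "abxd")  # one substitution in four characters
```

Since insertions count as errors, the rate defined this way can exceed 100% when the transcription is much longer than the reference.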
  • [1] M. A. Abd-El-Fattah and M. I. Dessouky, Speech deconvolution as an inverse problem, International Journal of Speech Technology, 14 (2011), 273-284. 
    [2] J. B. Allen and L. R. Rabiner, A unified approach to short-time Fourier analysis and synthesis, Proceedings of the IEEE, 65 (1977), 1558-1564.  doi: 10.1109/PROC.1977.10770.
    [3] S. Arridge, P. Maass, O. Öktem and C. B. Schönlieb, Solving inverse problems using data-driven models, Acta Numerica, 28 (2019), 1-174.  doi: 10.1017/S0962492919000059.
    [4] M. Benning and M. Burger, Modern regularization methods for inverse problems, Acta Numerica, 27 (2018), 1-111.
    [5] S. F. Boll, Suppression of acoustic noise in speech using spectral subtraction, IEEE Trans Acoust Speech Signal Process, ASSP-27 (1979), 113-120.  doi: 10.1109/TASSP.1979.1163209.
    [6] D. de Oliveira, T. Peer and T. Gerkmann, Efficient transformer-based speech enhancement using long frames and STFT magnitudes, Interspeech 2022, ISCA, (2022), 2948-2952. doi: 10.21437/Interspeech.2022-10781.
    [7] A. Farina, Simultaneous Measurement of Impulse Response and Distortion with a Swept-Sine Technique, Audio engineering society convention 108. Audio Engineering Society, 2000.
    [8] P. Gonzalez, Z.-H. Tan, J. Østergaard, J. Jensen, T. S. Alstrøm and T. May, Investigating the design space of diffusion models for speech enhancement, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 32 (2024), 4486-4500.  doi: 10.1109/TASLP.2024.3473319.
    [9] P. J. Goulart and Y. Chen, Clarabel: An interior-point solver for conic programs with quadratic objectives, (2024), https://arXiv.org/abs/2405.12762.
    [10] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates and A. Y. Ng, Deep speech: Scaling up end-to-end speech recognition, (2014), https://arXiv.org/abs/1412.5567.
    [11] R. C. Heyser, Acoustical measurements by time delay spectrometry, Journal of the Audio Engineering Society, 15 (1967), 370-382. 
    [12] Y. Hu, Y. Liu, S. Lv, M. Xing, S. Zhang, Y. Fu, J. Wu, B. Zhang and L. Xie, DCCRN: Deep complex convolution recurrent network for phase-aware speech enhancement, Proceedings of the Annual Conference of the International Speech Communication Association, Interspeech, 2020 (2020), 2472-2476. 
    [13] F. Jacobsen and P. M. Juhl, Fundamentals of General Linear Acoustics, John Wiley & Sons, 2013.
    [14] W. J. Klippel, Active reduction of nonlinear loudspeaker distortion, Proceedings of Active 99: the International Symposium on Active Control of Sound and Vibration, 1, 2, 1135-1146.
    [15] J. S. Lim and A. V. Oppenheim, Enhancement and bandwidth compression of noisy speech, Proceedings of the IEEE.
    [16] H. Liu, X. Liu, Q. Kong, Q. Tian, Y. Zhao, D. L. Wang, C. Huang and Y. Wang, VoiceFixer: A unified framework for high-fidelity speech restoration, Proceedings of the Annual Conference of the International Speech Communication Association, Interspeech, 2022 (2022), 4232-4236. 
    [17] M. Ludvigsen, E. Karvonen, M. Juvonen and S. Siltanen, Helsinki speech challenge 2024 – competition and open dataset, Applied Mathematics for Modern Challenges, 6 (2025), 24-44. 
    [18] S. Müller, Measuring transfer-functions and impulse responses, Handbook of Signal Processing in Acoustics, (2008), 65-85. doi: 10.1007/978-0-387-30441-0_5.
    [19] K. Prawda, S. J. Schlecht and V. Välimäki, Time variance in measured room impulse responses, Proceedings of the 10th Convention of the European Acoustics Association Forum Acusticum, (2023), 1-8.
    [20] J. G. Proakis, Digital Signal Processing: Principles Algorithms and Applications, Pearson Education India, 2001.
    [21] M. Rajan, Convergence analysis of a regularized approximation for solving Fredholm integral equations of the first kind, Journal of Mathematical Analysis and Applications, 279 (2003), 522-530, https://www.sciencedirect.com/science/article/pii/S0022247X03000271. doi: 10.1016/S0022-247X(03)00027-1.
    [22] J. Richter, S. Welker, J. M. Lemercier, B. Lay and T. Gerkmann, Speech enhancement and dereverberation with diffusion-based generative models, IEEE/ACM Transactions on Audio Speech and Language Processing, 31 (2023), 2351-2364.  doi: 10.1109/TASLP.2023.3285241.
    [23] M. R. Schroeder, New method of measuring reverberation time, Journal of the Acoustical Society of America, 37 (1965), 409-412.  doi: 10.1121/1.1909343.
    [24] Silero Team, Silero VAD: Pre-trained enterprise-grade voice activity detector, https://github.com/snakers4/silero-vad/tree/v4.0, (2024).
    [25] N. Upadhyay and A. Karmakar, Speech enhancement using spectral subtraction-type algorithms: A comparison and simulation study, Procedia Computer Science, 54 (2015), 574-584. 