ICASSP 2019 results

Speech denoising by parametric resynthesis [pdf]

Soumi Maiti and Michael Mandel

This work proposes the use of clean speech vocoder parameters as the target for a neural network performing speech enhancement. These parameters were designed for text-to-speech synthesis so that they both produce high-quality resyntheses and are straightforward to model with neural networks, but they have not been utilized in speech enhancement until now. In comparison to a matched text-to-speech system that is given the ground truth transcripts of the noisy speech, our model produces more natural speech because it has access to the true prosody in the noisy speech. In comparison to two denoising systems, the oracle Wiener mask and a DNN-based mask predictor, our model equals the oracle Wiener mask in subjective quality and intelligibility and surpasses the realistic system. A vocoder-based upper bound shows that there is still room for improvement with this approach beyond the oracle Wiener mask. We test speaker-dependence with two speakers and show that a single model can be used for multiple speakers.
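The idea above can be sketched in a few lines: a network maps noisy spectral features, frame by frame, to the vocoder parameters of the clean speech, and a TTS-style vocoder then resynthesizes speech from the predicted parameters. The sketch below is a minimal, hypothetical illustration in NumPy; the feature dimensions, network shape, and random weights are all assumptions standing in for a trained model and a real vocoder (e.g. WORLD), not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: noisy magnitude-spectrogram frames in,
# concatenated vocoder parameters (F0, spectral envelope, aperiodicity) out.
n_frames, noisy_dim, vocoder_dim, hidden = 100, 257, 64, 128

# Randomly initialized weights stand in for a trained enhancement network.
W1 = rng.standard_normal((noisy_dim, hidden)) * 0.01
b1 = np.zeros(hidden)
W2 = rng.standard_normal((hidden, vocoder_dim)) * 0.01
b2 = np.zeros(vocoder_dim)

def predict_vocoder_params(noisy_feats):
    """Map noisy features to clean-speech vocoder parameters, frame by frame."""
    h = np.tanh(noisy_feats @ W1 + b1)
    return h @ W2 + b2

noisy = rng.standard_normal((n_frames, noisy_dim))        # stand-in noisy input
clean_params = rng.standard_normal((n_frames, vocoder_dim))  # training target

pred = predict_vocoder_params(noisy)
mse = np.mean((pred - clean_params) ** 2)  # regression objective on parameters
print(pred.shape)
```

In this scheme the vocoder, not the network, is responsible for waveform quality: the network only has to model the relatively smooth, low-dimensional parameter trajectories, while prosody (F0, timing) is carried over from the noisy observation rather than generated from text.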

Audio files

WAV files from the CMU Arctic corpus mixed with noise from the CHiME-3 noise recordings.

Real systems

File | Noisy speech | Parametric resynthesis | DNN-predicted IRM | Text-to-speech | Original clean speech
b0481
b0487
b0489
b0490
b0496
b0498
b0499
b0505
b0508
b0509
b0511
b0517

Oracle systems

File | Original clean speech | Vocoder encode-decode | PR from clean | Oracle Wiener mask
b0481
b0487
b0489
b0490
b0496
b0498
b0499
b0505
b0508
b0509
b0511
b0517