Abstract: This paper introduces a new approach to dictionary-based source separation employing a learned non-linear metric. Unlike existing parametric source separation systems, this model can draw on a rich dictionary of speech signals; unlike previous dictionary-based systems, it can exploit perceptually relevant non-linear features of the noisy and clean audio. The approach uses a deep neural network (DNN) to predict whether a noisy chunk of audio contains a given clean chunk. Speaker-dependent experiments on the CHiME2-GRID corpus show that the model can accurately resynthesize clean speech from noisy observations. Preliminary listening tests show that the system's output has much higher audio quality than existing parametric systems trained on the same data, achieving noise suppression levels close to those of the original clean speech.
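The core idea above can be sketched in a few lines: score each noisy chunk against every clean chunk in the dictionary, keep the best match, and concatenate the winners. This is only an illustrative sketch, not the paper's implementation; the `score` function here is a hypothetical stand-in (a simple normalized cross-correlation) for the DNN match predictor described in the abstract.

```python
import numpy as np

def score(noisy_chunk, clean_chunk):
    """Placeholder for the DNN that predicts whether the noisy chunk
    contains the given clean chunk; higher score = better match."""
    n = noisy_chunk / (np.linalg.norm(noisy_chunk) + 1e-8)
    c = clean_chunk / (np.linalg.norm(clean_chunk) + 1e-8)
    return float(np.dot(n, c))

def resynthesize(noisy, dictionary, chunk_len):
    """Replace each noisy chunk with its best-matching clean dictionary
    chunk and concatenate the results (the 'Concat' strategy)."""
    out = []
    for start in range(0, len(noisy) - chunk_len + 1, chunk_len):
        chunk = noisy[start:start + chunk_len]
        best = max(dictionary, key=lambda c: score(chunk, c))
        out.append(best)
    return np.concatenate(out)

# Toy demo: a two-entry clean dictionary; the noisy signal is the clean
# signal plus additive Gaussian noise.
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 100)
dictionary = [np.sin(2 * np.pi * 5 * t), np.sin(2 * np.pi * 11 * t)]
noisy = np.concatenate(dictionary) + 0.3 * rng.standard_normal(200)
clean_est = resynthesize(noisy, dictionary, chunk_len=100)
```

In this toy setting the selected chunks recover the clean signal exactly, since the true chunks are in the dictionary; in the real system the dictionary and the learned metric make this selection robust to chunks never seen at training time.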
Wav files from the CHiME2-GRID development corpus were evaluated in intelligibility and MUltiple Stimuli with Hidden Reference and Anchor (MUSHRA) listening tests.
The tests compare four systems (Concat, Concat No-trans, Ideal ratio mask NN, and Noisy-to-clean NN) alongside the clean reference and noisy input:
| File | SNR | Clean | Noisy | Concat | Concat No-trans | Ideal ratio mask NN | Noisy-to-clean NN |
|------|-----|-------|-------|--------|-----------------|---------------------|-------------------|