A dilemma of ground truth in noisy speech separation and an approach to lessen the impact of imperfect training data

Matthew Maciejewski*, Jing Shi, Shinji Watanabe, Sanjeev Khudanpur

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

As the performance of single-channel speech separation systems has improved, the research community has shifted towards tackling more challenging conditions that better represent many real-world applications, including the addition of noise and reverberation. The need for ground truth when training state-of-the-art separation systems requires training on artificial mixtures, in which single-speaker recordings are summed digitally. This gives rise to two separate approaches for creating noisy mixtures: one in which noise is added artificially, maintaining perfect ground truth information, and one in which the noise is already present in the single-speaker recordings, allowing for in-domain training. In this work, we document a severe negative impact on both training and evaluation of models in the latter paradigm. We provide an explanation for this, namely the implicit task of separating noise, and propose an improved training objective that allows errors resulting from failing to separate noise to be minimized.
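
To make the dilemma concrete, the sketch below shows the conventional training setup the abstract refers to: an utterance-level permutation-invariant SI-SNR loss for two-speaker mixtures, written in PyTorch. This is not the paper's implementation, and the paper's proposed modified objective is not reproduced here; all function and variable names are illustrative. The point of the sketch is that when the single-speaker reference signals already contain noise, this standard objective penalizes any residual noise the separator fails to reproduce in each output, which is the implicit noise-separation task the abstract identifies.

# Minimal sketch (assumed conventional baseline, not the paper's method):
# permutation-invariant SI-SNR training for two-speaker separation.
import torch

def si_snr(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant SNR in dB for (batch, time) tensors."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # Project the estimate onto the reference to isolate the target component.
    scale = (est * ref).sum(dim=-1, keepdim=True) / (ref.pow(2).sum(dim=-1, keepdim=True) + eps)
    target = scale * ref
    residual = est - target
    # With noisy references, "residual" also includes any noise the model fails
    # to reproduce, so the loss implicitly demands that noise be separated too.
    return 10 * torch.log10(target.pow(2).sum(dim=-1) / (residual.pow(2).sum(dim=-1) + eps) + eps)

def pit_si_snr_loss(est: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
    """Negative SI-SNR with permutation-invariant training for two sources.

    est, ref: (batch, 2, time). Returns a scalar loss.
    """
    # Evaluate both speaker assignments and keep the better one per utterance.
    perm_a = si_snr(est[:, 0], ref[:, 0]) + si_snr(est[:, 1], ref[:, 1])
    perm_b = si_snr(est[:, 0], ref[:, 1]) + si_snr(est[:, 1], ref[:, 0])
    best = torch.maximum(perm_a, perm_b) / 2
    return -best.mean()

if __name__ == "__main__":
    batch, time = 4, 16000
    est = torch.randn(batch, 2, time)   # separator outputs (placeholder)
    ref = torch.randn(batch, 2, time)   # possibly noisy references (placeholder)
    print(pit_si_snr_loss(est, ref))

The paper's improved objective aims to lessen the impact of the errors this baseline incurs when noise cannot be attributed cleanly to either reference; its exact form is given in the article itself.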

Original language: English
Article number: 101410
Journal: Computer Speech and Language
Volume: 77
Publication status: Published - January 2023
Externally published: Yes

Keywords

  • Deep learning
  • Noisy speech
  • Speech separation

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Software
  • Human-Computer Interaction
