Train from scratch: Single-stage joint training of speech separation and recognition

Jing Shi, Xuankai Chang, Shinji Watanabe*, Bo Xu

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

5 Citations (Scopus)


Multi-speaker speech separation and recognition has recently gained much attention in the speech community. Previously, most studies trained the front-end separation module and the back-end recognition module individually. After training, the two modules are combined either in a hybrid structure or by fine-tuning the resulting model. In this work, we present a unified and flexible multi-speaker end-to-end ASR model. In contrast to previous studies, our proposed model is trained from scratch in a single stage, rather than in multiple stages of pre-training followed by fine-tuning. Our model can handle either single-channel or multi-channel speech input. Moreover, the proposed model can be trained with or without clean source speech references. We evaluate the proposed model on the WSJ0-2mix dataset in both single-channel and spatialized multi-channel conditions. The experiments demonstrate that the proposed methods improve the performance of the end-to-end model in recognizing the separated streams without much degradation in speech separation, achieving a new state of the art on the WSJ0-2mix dataset. Moreover, we systematically assess the impact of various features on the success of the joint-training model and will release all our code, which may provide new guidance for integrating the front-end and back-end in complex auditory scenes.
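The abstract does not state the training objective, so the following is only an illustrative sketch of what a single-stage joint loss could look like. It assumes (none of this is confirmed by the source) that the objective is a weighted sum of a recognition loss and an optional separation loss, and that output streams are matched to speakers by trying every permutation. The function name `joint_loss`, the `sep_weight` parameter, and the toy loss values are all hypothetical.

```python
from itertools import permutations
from typing import Optional, Sequence


def joint_loss(
    asr: Sequence[Sequence[float]],
    sep: Optional[Sequence[Sequence[float]]] = None,
    sep_weight: float = 1.0,
) -> float:
    """Hypothetical single-stage objective for a multi-speaker model.

    asr[i][j] is the recognition loss of output stream i scored against
    the transcript of speaker j. sep[i][j] is the separation loss of
    stream i against the clean source of speaker j; pass None when no
    clean source references are available (assumption: the separation
    term is simply dropped in that case).
    """
    n = len(asr)
    best = None
    # Try every stream-to-speaker assignment and keep the cheapest one.
    for perm in permutations(range(n)):
        total = sum(asr[i][perm[i]] for i in range(n))
        if sep is not None:
            total += sep_weight * sum(sep[i][perm[i]] for i in range(n))
        if best is None or total < best:
            best = total
    return best


# Toy pairwise losses for two speakers (made-up numbers).
asr_losses = [[1.0, 5.0], [4.0, 2.0]]
sep_losses = [[3.0, 0.5], [0.5, 3.0]]

print(joint_loss(asr_losses))              # recognition loss only
print(joint_loss(asr_losses, sep_losses))  # recognition + separation
```

In this sketch both terms are computed in the same pass, so no pre-training or fine-tuning stage is involved; setting `sep=None` corresponds to training without clean source references.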

Original language: English
Article number: 101387
Journal: Computer Speech and Language
Publication status: Published - 2022 Nov
Externally published: Yes


Keywords

  • Cocktail party problem
  • End-to-end
  • Joint-training
  • Multi-speaker speech recognition
  • Speech separation

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Software
  • Human-Computer Interaction

