End-to-end SpeakerBeam for single channel target speech recognition

Marc Delcroix, Shinji Watanabe, Tsubasa Ochiai, Keisuke Kinoshita, Shigeki Karita, Atsunori Ogawa, Tomohiro Nakatani

Research output: Contribution to journal, Conference article, peer-review

12 Citations (Scopus)


End-to-end (E2E) automatic speech recognition (ASR), which directly maps a sequence of speech features into a sequence of characters using a single neural network, has received a lot of attention because it greatly simplifies the training and decoding pipelines and enables optimizing the whole system E2E. Recently, such systems have been extended to recognize speech mixtures by inserting a speech separation mechanism into the neural network, allowing them to output recognition results for each speaker in the mixture. However, speech separation suffers from a global permutation ambiguity issue, i.e., an arbitrary mapping between source speakers and outputs. We argue that this ambiguity would seriously limit the practical use of E2E separation systems. SpeakerBeam has been proposed as an alternative to speech separation to mitigate the global permutation ambiguity. SpeakerBeam aims at extracting only a target speaker from a mixture based on that speaker's speech characteristics, thus avoiding the global permutation problem. In this paper, we combine SpeakerBeam and an E2E ASR system to allow E2E training of a target speech recognition system. We show promising target speech recognition results in mixtures of two speakers, and discuss interesting properties of the proposed system in terms of speech enhancement and diarization ability.
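The global permutation ambiguity mentioned in the abstract can be illustrated with a toy sketch. This is not the authors' code or the SpeakerBeam model: the signals are made-up numbers, and the helper names `mse` and `permutation_invariant_loss` are hypothetical. It only shows that a separation system with several outputs has to be scored by searching over output-to-speaker assignments, whereas a single output tied to a known target speaker needs no such search.

```python
from itertools import permutations


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def permutation_invariant_loss(outputs, references):
    """Return (loss, perm) minimizing the mean MSE over all assignments.

    perm[i] is the index of the reference speaker assigned to outputs[i].
    """
    best = None
    for perm in permutations(range(len(references))):
        loss = sum(mse(outputs[i], references[p]) for i, p in enumerate(perm))
        loss /= len(outputs)
        if best is None or loss < best[0]:
            best = (loss, perm)
    return best


# Toy "signals" for two speakers (made-up values, for illustration only).
speaker_a = [1.0, 0.0, 1.0, 0.0]
speaker_b = [0.0, 2.0, 0.0, 2.0]
references = [speaker_a, speaker_b]

# A separation system may emit the speakers in either order, so the
# correct assignment has to be searched for.
swapped_loss, swapped_perm = permutation_invariant_loss(
    [speaker_b, speaker_a], references
)
ordered_loss, ordered_perm = permutation_invariant_loss(
    [speaker_a, speaker_b], references
)

# A target-extraction system has one output tied to the chosen speaker,
# so it is compared directly against that speaker's reference.
extracted = speaker_a  # stand-in for a perfect extraction of speaker A
target_loss = mse(extracted, speaker_a)
```

In this toy case both output orderings reach zero loss, but only under different assignments, which is the ambiguity: nothing in the outputs themselves says which stream belongs to which speaker.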

Original language: English
Pages (from-to): 451-455
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Publication status: Published - 2019
Externally published: Yes
Event: 20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language, INTERSPEECH 2019 - Graz, Austria
Duration: 2019 Sept 15 to 2019 Sept 19


Keywords

  • End-to-end speech recognition
  • SpeakerBeam
  • Target speech extraction

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modelling and Simulation

