End-to-end multilingual multi-speaker speech recognition

Hiroshi Seki, Takaaki Hori, Shinji Watanabe, Jonathan Le Roux, John R. Hershey

Research output: Contribution to journal › Conference article › peer-review

4 Citations (Scopus)


The expressive power of end-to-end automatic speech recognition (ASR) systems enables direct estimation of a character or word label sequence from a sequence of acoustic features. Direct optimization of the whole system is advantageous because it not only eliminates the internal linkage necessary for hybrid systems, but also extends the scope of potential applications by training the model for various objectives. In this paper, we tackle the challenging task of multilingual multi-speaker ASR using such an all-in-one end-to-end system. Several multilingual ASR systems were recently proposed based on a monolithic neural network architecture without language-dependent modules, showing that modeling of multiple languages is well within the capabilities of an end-to-end framework. There has also been growing interest in multi-speaker speech recognition, which enables generation of multiple label sequences from single-channel mixed speech. In particular, a multi-speaker end-to-end ASR system that can directly model one-to-many mappings without additional auxiliary clues was recently proposed. The proposed model, which integrates the capabilities of these two systems, is evaluated using mixtures of two speakers generated by using 10 languages, including code-switching utterances.

Original language: English
Pages (from-to): 3755-3759
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Publication status: Published - 2019
Externally published: Yes
Event: 20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language, INTERSPEECH 2019 - Graz, Austria
Duration: 2019 Sept 15 to 2019 Sept 19

Keywords


  • CTC
  • Code-switching
  • Encoder-decoder
  • End-to-end ASR
  • Multi-speaker ASR
  • Multilingual ASR

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modelling and Simulation

