End-to-End Dereverberation, Beamforming, and Speech Recognition in a Cocktail Party

Wangyou Zhang, Xuankai Chang, Christoph Boeddeker, Tomohiro Nakatani, Shinji Watanabe, Yanmin Qian*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

8 Citations (Scopus)


Far-field multi-speaker automatic speech recognition (ASR) has drawn increasing attention in recent years. Most existing methods feature a signal processing frontend and an ASR backend. In realistic scenarios, these modules are usually trained separately or progressively, which results in either inter-module mismatch or a complicated training process. In this paper, we propose an end-to-end multi-channel model that jointly optimizes the speech enhancement frontend (covering speech dereverberation, denoising, and separation) and the ASR backend as a single system. To the best of our knowledge, this is the first work to optimize dereverberation, beamforming, and multi-speaker ASR in a fully end-to-end manner. The frontend module consists of a weighted prediction error (WPE) based submodule for dereverberation and a neural beamformer for denoising and speech separation. For the backend, we adopt a widely used end-to-end (E2E) ASR architecture. It is worth noting that the entire model is differentiable and can be optimized in a fully end-to-end manner using only the ASR criterion, without the need for parallel signal-level labels. We evaluate the proposed model on several multi-speaker benchmark datasets, and experimental results show that the fully E2E ASR model achieves competitive performance in both noisy and reverberant conditions, with over 30% relative word error rate (WER) reduction over the single-channel baseline systems.
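The "relative WER reduction" quoted in the abstract is conventionally computed as the drop in WER divided by the baseline WER. The short Python sketch below illustrates that arithmetic only. The function name and the two WER values are hypothetical and chosen for illustration; they are not results reported in the paper.

```python
def relative_wer_reduction(baseline_wer: float, system_wer: float) -> float:
    """Return the relative WER reduction of a system over a baseline, in percent."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    # Relative reduction = (baseline - system) / baseline, expressed in percent.
    return 100.0 * (baseline_wer - system_wer) / baseline_wer


# Hypothetical values for illustration only; NOT results from the paper.
baseline = 20.0  # assumed single-channel baseline WER, in percent
proposed = 13.0  # assumed multi-channel system WER, in percent
print(f"{relative_wer_reduction(baseline, proposed):.1f}% relative WER reduction")
```

Under these assumed numbers, an absolute drop of 7 WER points corresponds to a much larger relative figure, which is why the two kinds of reduction should not be confused when comparing systems.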

Original language: English
Pages (from-to): 3173-3188
Number of pages: 16
Journal: IEEE/ACM Transactions on Audio Speech and Language Processing
Publication status: Published - 2022
Externally published: Yes


Keywords

  • End-to-end
  • beamforming
  • dereverberation
  • multi-talker speech recognition
  • speech separation

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • Acoustics and Ultrasonics
  • Computational Mathematics
  • Electrical and Electronic Engineering

