MIMO-Speech: End-to-End Multi-Channel Multi-Speaker Speech Recognition

Xuankai Chang, Wangyou Zhang, Yanmin Qian*, Jonathan Le Roux, Shinji Watanabe

*Corresponding author for this work

Research output: Conference contribution

79 Citations (Scopus)

Abstract

Recently, the end-to-end approach has proven its efficacy in monaural multi-speaker speech recognition. However, high word error rates (WERs) still prevent these systems from being used in practical applications. On the other hand, the spatial information in multi-channel signals has proven helpful in far-field speech recognition tasks. In this work, we propose a novel neural sequence-to-sequence (seq2seq) architecture, MIMO-Speech, which extends the original seq2seq to deal with multi-channel input and multi-channel output so that it can fully model multi-channel multi-speaker speech separation and recognition. MIMO-Speech is a fully neural end-to-end framework, which is optimized only via an ASR criterion. It comprises three components: 1) a monaural masking network, 2) a multi-source neural beamformer, and 3) a multi-output speech recognition model. With this processing, the input overlapped speech is directly mapped to text sequences. We further adopt a curriculum learning strategy, making the best use of the training set to improve the performance. The experiments on the spatialized wsj1-2mix corpus show that our model achieves a WER reduction of more than 60% compared to the single-channel system, while the separation function above yields high-quality enhanced signals (SI-SDR = 23.1 dB).
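
The abstract reports enhancement quality as SI-SDR (scale-invariant signal-to-distortion ratio). As a rough illustration of what that figure measures, the sketch below computes SI-SDR for one estimated signal against one reference using the commonly used definition: project the estimate onto the reference, then compare the energy of the projection with the energy of the residual. The function name `si_sdr` and the toy signals are assumptions made for illustration; this is not the authors' evaluation code.

```python
import math


def si_sdr(estimate, reference):
    """Scale-invariant SDR in dB between two equal-length sample sequences.

    Illustrative helper (hypothetical name), following the usual definition:
    target = (<estimate, reference> / ||reference||^2) * reference
    SI-SDR = 10 * log10(||target||^2 / ||estimate - target||^2)
    """
    if len(estimate) != len(reference):
        raise ValueError("estimate and reference must have the same length")

    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    if ref_energy == 0.0:
        raise ValueError("reference signal has zero energy")

    # Optimal scaling of the reference that best explains the estimate.
    scale = dot / ref_energy
    target = [scale * r for r in reference]
    residual = [e - t for e, t in zip(estimate, target)]

    target_energy = sum(t * t for t in target)
    residual_energy = sum(n * n for n in residual)
    if residual_energy == 0.0:
        return math.inf  # estimate is an exact scaled copy of the reference
    if target_energy == 0.0:
        return -math.inf  # estimate is orthogonal to the reference
    return 10.0 * math.log10(target_energy / residual_energy)


# Toy example: a reference and a slightly distorted estimate.
reference = [1.0, 0.0, 1.0, 0.0]
estimate = [1.0, 0.1, 1.0, -0.1]
print(si_sdr(estimate, reference))
```

For these toy signals the result is about 20 dB, and multiplying `estimate` by any nonzero constant leaves it unchanged, which is the scale invariance the metric is named for.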

Original language: English
Title of host publication: 2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 237-244
Number of pages: 8
ISBN (Electronic): 9781728103068
Publication status: Published - Dec 2019
Externally published: Yes
Event: 2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Singapore, Singapore
Duration: 15 Dec 2019 to 18 Dec 2019

Publication series

Name: 2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings

Conference

Conference: 2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019
Country/Territory: Singapore
City: Singapore
Period: 19/12/15 to 19/12/18

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Signal Processing
  • Linguistics and Language
  • Communication
