Abstract
End-to-end (E2E) automatic speech recognition (ASR), which directly maps a sequence of speech features into a sequence of characters using a single neural network, has received a lot of attention because it greatly simplifies the training and decoding pipelines and enables E2E optimization of the whole system. Recently, such systems have been extended to recognize speech mixtures by inserting a speech separation mechanism into the neural network, allowing them to output recognition results for each speaker in the mixture. However, speech separation suffers from a global permutation ambiguity issue, i.e., an arbitrary mapping between source speakers and outputs. We argue that this ambiguity would seriously limit the practical use of E2E separation systems. SpeakerBeam has been proposed as an alternative to speech separation to mitigate the global permutation ambiguity. SpeakerBeam aims at extracting only a target speaker from a mixture based on that speaker's speech characteristics, thus avoiding the global permutation problem. In this paper, we combine SpeakerBeam and an E2E ASR system to allow E2E training of a target speech recognition system. We show promising target speech recognition results on mixtures of two speakers, and discuss interesting properties of the proposed system in terms of speech enhancement and diarization ability.
Original language | English |
---|---|
Pages (from-to) | 451-455 |
Number of pages | 5 |
Journal | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH |
Volume | 2019-September |
Publication status | Published - 2019 |
Externally published | Yes |
Event | 20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language, INTERSPEECH 2019 - Graz, Austria (Duration: 2019 Sept 15 → 2019 Sept 19) |
Keywords
- End-to-end speech recognition
- SpeakerBeam
- Target speech extraction
ASJC Scopus subject areas
- Language and Linguistics
- Human-Computer Interaction
- Signal Processing
- Software
- Modelling and Simulation