Abstract
We propose a domain adaptation method that enables connectionist temporal classification (CTC)-based end-to-end (E2E) automatic speech recognition (ASR) models to adapt to a target domain using unpaired text data. The performance of ASR models deteriorates for words and topics not present in the training data, such as the latest news. Although it is difficult to collect paired speech and text data for such subjects, unpaired text data is relatively easy to obtain. We therefore propose a domain adaptation method that uses unpaired text data for an E2E ASR model based on intermediate CTC. The model introduces an adaptation branch that embeds acoustic and linguistic information in the same latent space, allowing domain adaptation with unpaired text data from the target domain. Experimental comparisons in multiple out-of-domain settings demonstrate that the proposed text-only domain adaptation achieves comparable or better performance than existing shallow-fusion-based domain adaptation, and that integrating it with shallow fusion improves performance further.
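For readers unfamiliar with the shallow-fusion baseline mentioned above, the following is a minimal sketch of the general idea: the ASR log-probability of a hypothesis is combined with a weighted log-probability from an external language model trained on target-domain text. This illustrates the baseline technique only, not the proposed adaptation branch. The function name, the `lm_weight` value, and the toy scores are assumptions made for illustration, and the sketch rescores an n-best list, whereas shallow fusion is typically applied inside beam search.

```python
def shallow_fusion_rescore(hypotheses, lm_weight=0.3):
    """Return the hypothesis text with the highest fused score.

    hypotheses: list of (text, asr_log_prob, lm_log_prob) tuples.
    Fused score = asr_log_prob + lm_weight * lm_log_prob.
    The name and default weight are hypothetical, chosen for illustration.
    """
    def fused_score(hyp):
        _, asr_log_prob, lm_log_prob = hyp
        return asr_log_prob + lm_weight * lm_log_prob

    return max(hypotheses, key=fused_score)[0]


# Toy n-best list with made-up log-probabilities (not real model outputs).
nbest = [
    ("the whether today", -4.0, -12.0),  # slightly preferred by the ASR model
    ("the weather today", -4.2, -6.0),   # strongly preferred by the LM
]

best = shallow_fusion_rescore(nbest, lm_weight=0.3)
print(best)
```

With `lm_weight=0.0` the language model is ignored and the ASR-preferred hypothesis is returned, which shows how the weight trades off acoustic evidence against target-domain text statistics.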
Original language | English |
---|---|
Pages (from-to) | 2208-2212 |
Number of pages | 5 |
Journal | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH |
Volume | 2022-September |
Publication status | Published - 2022 |
Event | 23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022 - Incheon, Republic of Korea |
Duration | 2022 Sept 18 → 2022 Sept 22 |
Keywords
- domain adaptation
- end-to-end speech recognition
- non-autoregressive
- unpaired text
ASJC Scopus subject areas
- Language and Linguistics
- Human-Computer Interaction
- Signal Processing
- Software
- Modelling and Simulation