Abstract
Transformer-based encoder-decoder models have so far been widely used for end-to-end automatic speech recognition. However, it has been found that the self-attention weight matrix could be too peaky and biased toward the diagonal component. Such attention weight matrix contains little useful context information, which may result in poor speech recognition performance. Therefore, we propose the following two attention weight smoothing methods based on the hypothesis that an attention weight matrix whose diagonal components are not peaky can capture more context information. One is a method to linearly interpolate the attention weight using a learnable truncated prior distribution. The other uses the attention weight from a previous layer as a prior distribution given that lower-layer weights tend to be less peaky and diagonal. Experiments on LibriSpeech and Wall Street Journal show that the proposed approach achieves 2.9% and 7.9% relative improvement, respectively, over a vanilla Transformer model.
Original language | English |
---|---|
Pages (from-to) | 1071-1075 |
Number of pages | 5 |
Journal | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH |
Volume | 2022-September |
DOIs | |
Publication status | Published - 2022 |
Externally published | Yes |
Event | 23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022 - Incheon, Korea, Republic of Duration: 2022 Sept 18 → 2022 Sept 22 |
Keywords
- end-to-end speech recognition
- transformer
ASJC Scopus subject areas
- Language and Linguistics
- Human-Computer Interaction
- Signal Processing
- Software
- Modelling and Simulation