Attention Weight Smoothing Using Prior Distributions for Transformer-Based End-to-End ASR

Takashi Maekaku, Yuya Fujita, Yifan Peng, Shinji Watanabe

Research output: Contribution to journal › Conference article › peer-review

2 Citations (Scopus)

Abstract

Transformer-based encoder-decoder models have so far been widely used for end-to-end automatic speech recognition. However, it has been found that the self-attention weight matrix can be too peaky and biased toward the diagonal component. Such an attention weight matrix carries little useful context information, which may result in poor speech recognition performance. We therefore propose two attention weight smoothing methods based on the hypothesis that an attention weight matrix whose diagonal components are not peaky can capture more context information. One method linearly interpolates the attention weight with a learnable truncated prior distribution. The other uses the attention weight from a previous layer as the prior distribution, given that lower-layer weights tend to be less peaky and diagonal. Experiments on LibriSpeech and Wall Street Journal show that the proposed approach achieves 2.9% and 7.9% relative improvement, respectively, over a vanilla Transformer model.
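The following is a minimal sketch, in PyTorch, of the smoothing idea described in the abstract: the raw attention weights of a layer are linearly interpolated with a prior distribution, which could be the previous layer's attention weights (second method) or a learnable truncated prior (first method). The function name `smooth_attention`, the fixed mixing coefficient `lam`, and the tensor shapes are illustrative assumptions, not the authors' implementation.

```python
import torch

def smooth_attention(attn: torch.Tensor,
                     prior: torch.Tensor,
                     lam: float = 0.5) -> torch.Tensor:
    """Linearly interpolate an attention weight matrix with a prior.

    attn  : (batch, heads, tgt_len, src_len) softmax-normalized weights
            from the current layer.
    prior : tensor of the same shape, e.g. the previous layer's attention
            weights or a learnable truncated prior distribution.
    lam   : interpolation coefficient (assumed fixed here; how the prior
            is mixed in is a design choice of the method).
    """
    # Convex combination of two row-stochastic matrices stays row-stochastic,
    # so the result is still a valid attention distribution.
    return (1.0 - lam) * attn + lam * prior
```

Because both terms sum to one over the source dimension, no re-normalization is needed; in this sketch the smoothed weights would simply replace the raw softmax output before being applied to the value vectors.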

Original language: English
Pages (from-to): 1071-1075
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2022-September
DOIs
Publication status: Published - 2022
Externally published: Yes
Event: 23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022 - Incheon, Korea, Republic of
Duration: 2022 Sept 18 - 2022 Sept 22

Keywords

  • end-to-end speech recognition
  • transformer

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modelling and Simulation
