TY - JOUR
T1 - Structural classification methods based on weighted finite-state transducers for automatic speech recognition
AU - Kubo, Yotaro
AU - Watanabe, Shinji
AU - Hori, Takaaki
AU - Nakamura, Atsushi
N1 - Funding Information: Manuscript received January 17, 2012; revised April 19, 2012; accepted April 24, 2012. Date of publication May 11, 2012; date of current version August 09, 2012. This work was supported in part by the Japan Society for the Promotion of Science under Grant-in-Aid Scientific Research No. 22300064. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Brian Kingsbury.
PY - 2012
Y1 - 2012
AB - Structural classification methods have attracted interest in the automatic speech recognition (ASR) community because they enable unified modeling of the acoustic and linguistic aspects of recognizers. However, these approaches involve a well-known tradeoff between the richness of features and the computational efficiency of decoders. For example, if a frame-synchronous one-pass decoding technique is employed, the features used to calculate the likelihood of each hypothesis must be restricted to the same form as conventional acoustic and language models. This paper tackles this limitation directly by exploiting the structure of the weighted finite-state transducers (WFSTs) used for decoding. Although WFST arcs provide rich contextual information, close integration with a computationally efficient decoding technique is still possible, since most decoding techniques only require that their likelihood functions be factorizable over decoder arcs and time frames. We compare two methods for structural classification with WFST-based features: the structured perceptron and the conditional random field (CRF). To analyze the advantages of these two classifiers, we present experimental results for the TIMIT continuous phoneme recognition task, the WSJ transcription task, and the MIT lecture transcription task. The results confirm that the proposed approach improves ASR performance without sacrificing the computational efficiency of the decoders, even though the baseline systems were already trained with discriminative training techniques (e.g., MPE).
KW - Automatic speech recognition (ASR)
KW - structural classification
KW - weighted finite-state transducers (WFST)
UR - http://www.scopus.com/inward/record.url?scp=84865227975&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84865227975&partnerID=8YFLogxK
U2 - 10.1109/TASL.2012.2199112
DO - 10.1109/TASL.2012.2199112
M3 - Article
AN - SCOPUS:84865227975
SN - 1558-7916
VL - 20
SP - 2240
EP - 2251
JO - IEEE Transactions on Audio, Speech and Language Processing
JF - IEEE Transactions on Audio, Speech and Language Processing
IS - 8
M1 - 6198870
ER -