TY - JOUR
T1 - Stream selection and integration in multistream ASR using GMM-based performance monitoring
AU - Ogawa, Tetsuji
AU - Li, Feipeng
AU - Hermansky, Hynek
PY - 2013/1/1
Y1 - 2013/1/1
N2 - A moderately deep and rather wide artificial neural net is applied in phoneme recognition of noisy speech. The net is formed by first estimating posterior probabilities of phonemes in 21 band-limited streams covering the whole speech spectrum. These 21 band-limited streams are subdivided into three subsets of seven band-limited streams each, by sub-sampling the original 21 band-limited streams differently. In the second processing stage, all non-empty combinations of the seven band-limited streams in each subset are formed as inputs to 127 artificial neural nets that are again trained to yield phoneme posteriors. In this way, 127 × 3 = 381 processing streams are formed. A novel technique for finding the best combination of the resulting 381 parallel processing streams, which uses the likelihood of a single-state Gaussian mixture model of the final classifier output, is applied to select the most efficient streams. The technique is efficient in phoneme recognition of speech that is corrupted by realistic additive noise.
AB - A moderately deep and rather wide artificial neural net is applied in phoneme recognition of noisy speech. The net is formed by first estimating posterior probabilities of phonemes in 21 band-limited streams covering the whole speech spectrum. These 21 band-limited streams are subdivided into three subsets of seven band-limited streams each, by sub-sampling the original 21 band-limited streams differently. In the second processing stage, all non-empty combinations of the seven band-limited streams in each subset are formed as inputs to 127 artificial neural nets that are again trained to yield phoneme posteriors. In this way, 127 × 3 = 381 processing streams are formed. A novel technique for finding the best combination of the resulting 381 parallel processing streams, which uses the likelihood of a single-state Gaussian mixture model of the final classifier output, is applied to select the most efficient streams. The technique is efficient in phoneme recognition of speech that is corrupted by realistic additive noise.
KW - Gaussian mixture model
KW - Multilayer perceptron
KW - Multistream speech recognition
KW - Performance monitoring
UR - http://www.scopus.com/inward/record.url?scp=84906283768&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84906283768&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:84906283768
SN - 2308-457X
SP - 3332
EP - 3336
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
T2 - 14th Annual Conference of the International Speech Communication Association, INTERSPEECH 2013
Y2 - 25 August 2013 through 29 August 2013
ER -