TY - GEN
T1 - Enhanced robot speech recognition based on microphone array source separation and missing feature theory
AU - Yamamoto, Shun'ichi
AU - Valin, Jean-Marc
AU - Nakadai, Kazuhiro
AU - Rouat, Jean
AU - Michaud, François
AU - Ogata, Tetsuya
AU - Okuno, Hiroshi G.
PY - 2005
Y1 - 2005
N2 - A humanoid robot in real-world environments usually hears mixtures of sounds, and thus three capabilities are essential for robot audition: sound source localization, separation, and recognition of separated sounds. While the first two are frequently addressed, the last has received less attention. We present a system that gives a humanoid robot the ability to localize, separate, and recognize simultaneous sound sources. A microphone array is used along with a dedicated real-time implementation of Geometric Source Separation (GSS) and a multi-channel post-filter that further reduces interference from other sources. An automatic speech recognizer (ASR) based on Missing Feature Theory (MFT) recognizes separated sounds in real time, using missing feature masks generated automatically from the post-filtering step. The main advantage of this approach for humanoid robots is that the ASR, with a clean acoustic model, can adapt to the distortion of separated sound by consulting the post-filter feature masks. Recognition rates are presented for three simultaneous speakers located 2 m from the robot. Using both the post-filter and the missing feature mask yields an average relative reduction in error rate of 42%.
AB - A humanoid robot in real-world environments usually hears mixtures of sounds, and thus three capabilities are essential for robot audition: sound source localization, separation, and recognition of separated sounds. While the first two are frequently addressed, the last has received less attention. We present a system that gives a humanoid robot the ability to localize, separate, and recognize simultaneous sound sources. A microphone array is used along with a dedicated real-time implementation of Geometric Source Separation (GSS) and a multi-channel post-filter that further reduces interference from other sources. An automatic speech recognizer (ASR) based on Missing Feature Theory (MFT) recognizes separated sounds in real time, using missing feature masks generated automatically from the post-filtering step. The main advantage of this approach for humanoid robots is that the ASR, with a clean acoustic model, can adapt to the distortion of separated sound by consulting the post-filter feature masks. Recognition rates are presented for three simultaneous speakers located 2 m from the robot. Using both the post-filter and the missing feature mask yields an average relative reduction in error rate of 42%.
UR - http://www.scopus.com/inward/record.url?scp=33846170539&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=33846170539&partnerID=8YFLogxK
U2 - 10.1109/ROBOT.2005.1570323
DO - 10.1109/ROBOT.2005.1570323
M3 - Conference contribution
AN - SCOPUS:33846170539
SN - 078038914X
SN - 9780780389144
T3 - Proceedings - IEEE International Conference on Robotics and Automation
SP - 1477
EP - 1482
BT - Proceedings of the 2005 IEEE International Conference on Robotics and Automation
T2 - 2005 IEEE International Conference on Robotics and Automation
Y2 - 18 April 2005 through 22 April 2005
ER -