TY - GEN
T1 - Target speech detection and separation for humanoid robots in sparse dialogue with noisy home environments
AU - Kim, Hyun Don
AU - Kim, Jinsung
AU - Komatani, Kazunori
AU - Ogata, Tetsuya
AU - Okuno, Hiroshi G.
PY - 2008/12/1
Y1 - 2008/12/1
N2 - In normal human communication, people face the speaker when listening and usually pay attention to the speaker's face. Therefore, in robot audition, the recognition of the front talker is critical for smooth interactions. This paper presents an enhanced speech detection method for a humanoid robot that can separate and recognize speech signals originating from the front even in noisy home environments. The robot audition system consists of a new type of voice activity detection (VAD) based on the complex spectrum circle centroid (CSCC) method and a maximum signal-to-noise (Max-SNR) beamformer. This VAD based on CSCC can classify speech signals that are retrieved at the frontal region of two microphones embedded on the robot. The system works in real time, without needing filter coefficients trained in advance, even in a noisy environment (SNR > 0 dB). It can cope with speech noise generated from televisions and audio devices that does not originate from the center. Experiments using a humanoid robot, SIG2, with two microphones showed that our system enhanced the extracted target speech signals by more than 12 dB (SNR) and increased the success rate of automatic speech recognition for Japanese words by about 17 points.
AB - In normal human communication, people face the speaker when listening and usually pay attention to the speaker's face. Therefore, in robot audition, the recognition of the front talker is critical for smooth interactions. This paper presents an enhanced speech detection method for a humanoid robot that can separate and recognize speech signals originating from the front even in noisy home environments. The robot audition system consists of a new type of voice activity detection (VAD) based on the complex spectrum circle centroid (CSCC) method and a maximum signal-to-noise (Max-SNR) beamformer. This VAD based on CSCC can classify speech signals that are retrieved at the frontal region of two microphones embedded on the robot. The system works in real time, without needing filter coefficients trained in advance, even in a noisy environment (SNR > 0 dB). It can cope with speech noise generated from televisions and audio devices that does not originate from the center. Experiments using a humanoid robot, SIG2, with two microphones showed that our system enhanced the extracted target speech signals by more than 12 dB (SNR) and increased the success rate of automatic speech recognition for Japanese words by about 17 points.
UR - http://www.scopus.com/inward/record.url?scp=69549083168&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=69549083168&partnerID=8YFLogxK
U2 - 10.1109/IROS.2008.4650977
DO - 10.1109/IROS.2008.4650977
M3 - Conference contribution
AN - SCOPUS:69549083168
SN - 9781424420582
T3 - 2008 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS
SP - 1705
EP - 1711
BT - 2008 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS
T2 - 2008 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS
Y2 - 22 September 2008 through 26 September 2008
ER -