TY  - CONF
T1 - Analysis of robustness of deep single-channel speech separation using corpora constructed from multiple domains
AU - Maciejewski, Matthew
AU - Sell, Gregory
AU - Fujita, Yusuke
AU - Garcia-Perera, Leibny Paola
AU - Watanabe, Shinji
AU - Khudanpur, Sanjeev
N1 - Publisher Copyright:
© 2019 IEEE.
PY - 2019/10
Y1 - 2019/10
N2 - Deep-learning-based single-channel speech separation has been studied with great success, though evaluations have typically been limited to relatively controlled environments based on clean, near-field, and read speech. This work investigates the robustness of such representative techniques in more realistic environments with multiple and diverse conditions. To this end, we first construct datasets from the Mixer 6 and CHiME-5 corpora, featuring studio interviews and dinner parties, respectively, using a procedure carefully designed to generate desirable synthetic overlap data sufficient for evaluation as well as for training deep learning models. Using these new datasets, we demonstrate the substantial shortcomings of these separation techniques in mismatched conditions. Though multi-condition training greatly mitigated the performance degradation in near-field conditions, one important finding is that both matched and multi-condition training exhibit significant gaps from the oracle performance in far-field conditions, which indicates a need to extend existing separation techniques to deal with far-field, highly reverberant speech mixtures.
AB - Deep-learning-based single-channel speech separation has been studied with great success, though evaluations have typically been limited to relatively controlled environments based on clean, near-field, and read speech. This work investigates the robustness of such representative techniques in more realistic environments with multiple and diverse conditions. To this end, we first construct datasets from the Mixer 6 and CHiME-5 corpora, featuring studio interviews and dinner parties, respectively, using a procedure carefully designed to generate desirable synthetic overlap data sufficient for evaluation as well as for training deep learning models. Using these new datasets, we demonstrate the substantial shortcomings of these separation techniques in mismatched conditions. Though multi-condition training greatly mitigated the performance degradation in near-field conditions, one important finding is that both matched and multi-condition training exhibit significant gaps from the oracle performance in far-field conditions, which indicates a need to extend existing separation techniques to deal with far-field, highly reverberant speech mixtures.
KW - deep learning
KW - far-field speech
KW - single-channel speech separation
UR - http://www.scopus.com/inward/record.url?scp=85078563730&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85078563730&partnerID=8YFLogxK
U2 - 10.1109/WASPAA.2019.8937153
DO - 10.1109/WASPAA.2019.8937153
M3 - Conference contribution
AN - SCOPUS:85078563730
T3 - IEEE Workshop on Applications of Signal Processing to Audio and Acoustics
SP - 165
EP - 169
BT - 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, WASPAA 2019
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, WASPAA 2019
Y2 - 20 October 2019 through 23 October 2019
ER -