TY - GEN
T1 - ESPnet-SE
T2 - 2021 IEEE Spoken Language Technology Workshop, SLT 2021
AU - Li, Chenda
AU - Shi, Jing
AU - Zhang, Wangyou
AU - Subramanian, Aswin Shanmugam
AU - Chang, Xuankai
AU - Kamo, Naoyuki
AU - Hira, Moto
AU - Hayashi, Tomoki
AU - Boeddeker, Christoph
AU - Chen, Zhuo
AU - Watanabe, Shinji
N1 - Funding Information:
Part of this work was carried out during JSALT 2020 at JHU, with support from Microsoft, Amazon, and Google. Chenda Li and Wangyou Zhang are also supported by the China NSFC projects (No. 62071288 and No. U1736202).
Publisher Copyright:
© 2021 IEEE.
PY - 2021/1/19
Y1 - 2021/1/19
AB - We present ESPnet-SE, a toolkit designed for the quick development of speech enhancement and speech separation systems in a single framework, along with an optional downstream speech recognition module. ESPnet-SE is a new project that integrates rich automatic speech recognition-related models, resources, and systems to support and validate the proposed front-end implementation (i.e., speech enhancement and separation). It can process both single-channel and multi-channel data, with functionalities including dereverberation, denoising, and source separation. We provide all-in-one recipes covering data pre-processing, feature extraction, training, and evaluation pipelines for a wide range of benchmark datasets. This paper describes the design of the toolkit; several important functionalities, especially the speech recognition integration, which differentiates ESPnet-SE from other open-source toolkits; and experimental results on major benchmark datasets.
KW - Open-source
KW - end-to-end
KW - source separation
KW - speech enhancement
KW - speech recognition
UR - http://www.scopus.com/inward/record.url?scp=85102362997&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85102362997&partnerID=8YFLogxK
U2 - 10.1109/SLT48900.2021.9383615
DO - 10.1109/SLT48900.2021.9383615
M3 - Conference contribution
AN - SCOPUS:85102362997
T3 - 2021 IEEE Spoken Language Technology Workshop, SLT 2021 - Proceedings
SP - 785
EP - 792
BT - 2021 IEEE Spoken Language Technology Workshop, SLT 2021 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 19 January 2021 through 22 January 2021
ER -