TY - GEN
T1 - The 2020 ESPnet update
T2 - 2021 IEEE Data Science and Learning Workshop, DSLW 2021
AU - Watanabe, Shinji
AU - Boyer, Florian
AU - Chang, Xuankai
AU - Guo, Pengcheng
AU - Hayashi, Tomoki
AU - Higuchi, Yosuke
AU - Hori, Takaaki
AU - Huang, Wen Chin
AU - Inaguma, Hirofumi
AU - Kamo, Naoyuki
AU - Karita, Shigeki
AU - Li, Chenda
AU - Shi, Jing
AU - Subramanian, Aswin Shanmugam
AU - Zhang, Wangyou
N1 - Publisher Copyright: © 2021 IEEE.
PY - 2021/6/5
Y1 - 2021/6/5
N2 - This paper describes the recent development of ESPnet (https://github.com/espnet/espnet), an end-to-end speech processing toolkit. This project was initiated in December 2017 to mainly deal with end-to-end speech recognition experiments based on sequence-to-sequence modeling. The project has grown rapidly and now covers a wide range of speech processing applications. Now ESPnet also includes text-to-speech (TTS), voice conversion (VC), speech translation (ST), and speech enhancement (SE) with support for beamforming, speech separation, denoising, and dereverberation. All applications are trained in an end-to-end manner, thanks to the generic sequence-to-sequence modeling properties, and they can be further integrated and jointly optimized. Also, ESPnet provides reproducible all-in-one recipes for these applications with state-of-the-art performance in various benchmarks by incorporating transformer, advanced data augmentation, and conformer. This project aims to provide up-to-date speech processing experience to the community so that researchers in academia and various industry scales can develop their technologies collaboratively.
AB - This paper describes the recent development of ESPnet (https://github.com/espnet/espnet), an end-to-end speech processing toolkit. This project was initiated in December 2017 to mainly deal with end-to-end speech recognition experiments based on sequence-to-sequence modeling. The project has grown rapidly and now covers a wide range of speech processing applications. Now ESPnet also includes text-to-speech (TTS), voice conversion (VC), speech translation (ST), and speech enhancement (SE) with support for beamforming, speech separation, denoising, and dereverberation. All applications are trained in an end-to-end manner, thanks to the generic sequence-to-sequence modeling properties, and they can be further integrated and jointly optimized. Also, ESPnet provides reproducible all-in-one recipes for these applications with state-of-the-art performance in various benchmarks by incorporating transformer, advanced data augmentation, and conformer. This project aims to provide up-to-date speech processing experience to the community so that researchers in academia and various industry scales can develop their technologies collaboratively.
KW - End-to-end neural network
KW - Speech enhancement
KW - Speech recognition
KW - Speech translation
KW - Text-to-speech
UR - http://www.scopus.com/inward/record.url?scp=85115413774&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85115413774&partnerID=8YFLogxK
U2 - 10.1109/DSLW51110.2021.9523402
DO - 10.1109/DSLW51110.2021.9523402
M3 - Conference contribution
AN - SCOPUS:85115413774
T3 - 2021 IEEE Data Science and Learning Workshop, DSLW 2021
BT - 2021 IEEE Data Science and Learning Workshop, DSLW 2021
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 5 June 2021 through 6 June 2021
ER -