TY - GEN
T1 - Efficient and stable adversarial learning using unpaired data for unsupervised multichannel speech separation
AU - Nakagome, Yu
AU - Togami, Masahito
AU - Ogawa, Tetsuji
AU - Kobayashi, Tetsunori
N1 - Funding Information:
The research was supported by NII CRIS collaborative research program operated by NII CRIS and LINE Corporation.
Publisher Copyright:
Copyright © 2021 ISCA.
PY - 2021
Y1 - 2021
N2 - This study presents a framework for efficient and stable adversarial learning of unsupervised multichannel source separation models. When paired data, i.e., mixtures and the corresponding clean speech, are not available for training, it is promising to exploit generative adversarial networks (GANs), in which a source separation system is treated as a generator and trained to bring the distribution of the separated (fake) speech closer to that of the clean (real) speech. The separated speech, however, contains many errors, especially when the system is trained in an unsupervised manner, and can easily be distinguished from the clean speech. A real/fake binary discriminator will therefore stop the adversarial learning process unreasonably early. This study aims to balance the convergence of the generator and the discriminator to achieve efficient and stable learning. For that purpose, the autoencoder-based discriminator and the more stable adversarial loss designed in the boundary equilibrium GAN (BEGAN) are introduced. In addition, generator-specific distortions are added to the real examples so that the models are trained to focus only on source separation. Experimental comparisons demonstrated that the proposed stabilizing techniques improved the performance of multiple unsupervised source separation systems.
AB - This study presents a framework for efficient and stable adversarial learning of unsupervised multichannel source separation models. When paired data, i.e., mixtures and the corresponding clean speech, are not available for training, it is promising to exploit generative adversarial networks (GANs), in which a source separation system is treated as a generator and trained to bring the distribution of the separated (fake) speech closer to that of the clean (real) speech. The separated speech, however, contains many errors, especially when the system is trained in an unsupervised manner, and can easily be distinguished from the clean speech. A real/fake binary discriminator will therefore stop the adversarial learning process unreasonably early. This study aims to balance the convergence of the generator and the discriminator to achieve efficient and stable learning. For that purpose, the autoencoder-based discriminator and the more stable adversarial loss designed in the boundary equilibrium GAN (BEGAN) are introduced. In addition, generator-specific distortions are added to the real examples so that the models are trained to focus only on source separation. Experimental comparisons demonstrated that the proposed stabilizing techniques improved the performance of multiple unsupervised source separation systems.
KW - Boundary equilibrium generative adversarial network
KW - Multichannel speech separation
KW - Unsupervised training
KW - Unpaired data
UR - http://www.scopus.com/inward/record.url?scp=85119198182&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85119198182&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2021-523
DO - 10.21437/Interspeech.2021-523
M3 - Conference contribution
AN - SCOPUS:85119198182
T3 - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SP - 2323
EP - 2327
BT - 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
PB - International Speech Communication Association
T2 - 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
Y2 - 30 August 2021 through 3 September 2021
ER -