The Sound of Bounding-Boxes

Takashi Oya*, Shohei Iwase*, Shigeo Morishima

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution


In audio-visual sound source separation, which leverages visual information to separate sound sources, identifying objects in an image is a crucial step prior to separation. However, existing methods that assign sounds to detected bounding boxes rely heavily on pre-trained object detectors. Specifically, these methods require predetermining every category of object that can produce sound and using an object detector that covers all such categories. To tackle this problem, we propose a fully unsupervised method that learns to detect objects in an image and separate sound sources simultaneously. Because our method does not rely on any pre-trained detector, it is applicable to arbitrary categories without any additional annotation. Furthermore, despite being fully unsupervised, we found that our method achieves comparable separation accuracy.

Original language: English
Title of host publication: 2022 26th International Conference on Pattern Recognition, ICPR 2022
Publisher: Institute of Electrical and Electronics Engineers Inc.
Number of pages: 7
ISBN (Electronic): 9781665490627
Publication status: Published - 2022
Event: 26th International Conference on Pattern Recognition, ICPR 2022 - Montreal, Canada
Duration: 2022 Aug 21 to 2022 Aug 25

Publication series

Name: Proceedings - International Conference on Pattern Recognition
ISSN (Print): 1051-4651


Conference: 26th International Conference on Pattern Recognition, ICPR 2022

ASJC Scopus subject areas

  • Computer Vision and Pattern Recognition

