Duration-Controlled LSTM for Polyphonic Sound Event Detection

Tomoki Hayashi*, Shinji Watanabe, Tomoki Toda, Takaaki Hori, Jonathan Le Roux, Kazuya Takeda

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

64 Citations (Scopus)


This paper presents a new hybrid approach, called duration-controlled long short-term memory (LSTM), for polyphonic sound event detection (SED). It builds upon a state-of-the-art SED method that performs frame-by-frame detection using a bidirectional LSTM recurrent neural network (BLSTM), and incorporates a duration-controlled modeling technique based on a hidden semi-Markov model. The proposed approach makes it possible to model the duration of each sound event precisely and to perform sequence-by-sequence detection without resorting to the thresholding that conventional frame-by-frame methods require. Furthermore, to effectively reduce sound event insertion errors, which often occur under noisy conditions, we also introduce a binary-mask-based postprocessing step that relies on a sound activity detection network to identify segments with any sound event activity, an approach inspired by the well-known benefits of voice activity detection in speech recognition systems. We conduct an experiment using the DCASE2016 task 2 dataset to compare our proposed method with typical conventional methods, such as nonnegative matrix factorization and standard BLSTM. Our proposed method outperforms the conventional methods in both an event-based evaluation, achieving a 75.3% F1 score and a 44.2% error rate, and a segment-based evaluation, achieving an 81.1% F1 score and a 32.9% error rate; these results also surpass the best results reported in the DCASE2016 task 2 Challenge.
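The binary-mask postprocessing mentioned in the abstract can be illustrated with a minimal sketch. It assumes that both the event detector and the sound activity detection (SAD) network have already been reduced to per-frame binary decisions; the abstract does not give the paper's exact procedure. The function name `apply_sad_mask` and the toy data are hypothetical and are not taken from the paper.

```python
def apply_sad_mask(event_activity, sad_mask):
    """Zero out event activations in frames the SAD mask marks as inactive.

    event_activity: list of frames, each a list of 0/1 flags (one per event class).
    sad_mask: list of 0/1 flags, one per frame (1 = some sound event is active).
    Both representations are assumptions made for this illustration.
    """
    if len(event_activity) != len(sad_mask):
        raise ValueError("event_activity and sad_mask must cover the same frames")
    return [
        [flag if keep else 0 for flag in frame]
        for frame, keep in zip(event_activity, sad_mask)
    ]


# Hypothetical toy data: 5 frames, 2 event classes.
detected = [
    [1, 0],
    [1, 1],
    [0, 1],  # spurious activation in a frame the SAD mask marks as silent
    [0, 0],
    [1, 0],
]
sad = [1, 1, 0, 0, 1]

masked = apply_sad_mask(detected, sad)
```

In this toy case the activation in the third frame is suppressed because the mask marks that frame as inactive, which is the kind of insertion error the postprocessing is meant to reduce.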

Original language: English
Pages (from-to): 2059-2070
Number of pages: 12
Journal: IEEE/ACM Transactions on Audio Speech and Language Processing
Issue number: 11
Publication status: Published - 2017 Nov
Externally published: Yes


Keywords

  • Duration control
  • hidden semi-Markov model (HSMM)
  • hybrid model
  • long short-term memory (LSTM)
  • polyphonic sound event detection (SED)
  • recurrent neural network

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • Acoustics and Ultrasonics
  • Computational Mathematics
  • Electrical and Electronic Engineering

