TY - JOUR
T1 - Gamma Boltzmann Machine for Audio Modeling
AU - Nakashika, Toru
AU - Yatabe, Kohei
N1 - Funding Information:
Manuscript received September 8, 2020; revised April 11, 2021 and June 17, 2021; accepted July 4, 2021. Date of publication July 8, 2021; date of current version August 13, 2021. This work was supported by JSPS KAKENHI under Grant 21K11957. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Yu Tsao. (Corresponding author: Toru Nakashika).
Publisher Copyright:
© 2014 IEEE.
PY - 2021
Y1 - 2021
N2 - This paper presents an energy-based probabilistic model that handles nonnegative data in consideration of both linear and logarithmic scales. In audio applications, the magnitude of a time-frequency representation, such as the spectrogram, is regarded as one of the most important features. Such magnitude-based features have been extensively utilized in learning-based audio processing. Since a logarithmic scale is important in terms of auditory perception, the features are usually computed with a logarithmic function. That is, a logarithmic function is applied within the computation of the features so that a learning machine does not have to explicitly model the logarithmic scale. We take a different approach and propose a restricted Boltzmann machine (RBM) that simultaneously models linear- and log-magnitude spectra. An RBM is a stochastic neural network that can discover data representations without supervision. To manage both linear and logarithmic scales, we define an energy function based on both scales. This energy function results in a conditional distribution (of the observable data, given the hidden units) that takes the form of a gamma distribution, and hence the proposed RBM is termed the gamma-Bernoulli RBM. The proposed gamma-Bernoulli RBM was compared to the ordinary Gaussian-Bernoulli RBM in speech representation experiments. Both objective and subjective evaluations illustrated the advantage of the proposed model.
AB - This paper presents an energy-based probabilistic model that handles nonnegative data in consideration of both linear and logarithmic scales. In audio applications, the magnitude of a time-frequency representation, such as the spectrogram, is regarded as one of the most important features. Such magnitude-based features have been extensively utilized in learning-based audio processing. Since a logarithmic scale is important in terms of auditory perception, the features are usually computed with a logarithmic function. That is, a logarithmic function is applied within the computation of the features so that a learning machine does not have to explicitly model the logarithmic scale. We take a different approach and propose a restricted Boltzmann machine (RBM) that simultaneously models linear- and log-magnitude spectra. An RBM is a stochastic neural network that can discover data representations without supervision. To manage both linear and logarithmic scales, we define an energy function based on both scales. This energy function results in a conditional distribution (of the observable data, given the hidden units) that takes the form of a gamma distribution, and hence the proposed RBM is termed the gamma-Bernoulli RBM. The proposed gamma-Bernoulli RBM was compared to the ordinary Gaussian-Bernoulli RBM in speech representation experiments. Both objective and subjective evaluations illustrated the advantage of the proposed model.
KW - Boltzmann machine
KW - gamma distribution
KW - nonnegative data modeling
KW - speech parameterization
KW - speech synthesis
UR - http://www.scopus.com/inward/record.url?scp=85112868842&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85112868842&partnerID=8YFLogxK
U2 - 10.1109/TASLP.2021.3095656
DO - 10.1109/TASLP.2021.3095656
M3 - Article
AN - SCOPUS:85112868842
SN - 2329-9290
VL - 29
SP - 2591
EP - 2605
JO - IEEE/ACM Transactions on Audio, Speech, and Language Processing
JF - IEEE/ACM Transactions on Audio, Speech, and Language Processing
M1 - 9478208
ER -