This paper describes an unsupervised disentangled representation learning method for musical instrument sounds with pitched and unpitched spectra. Since conventional methods have commonly attempted to disentangle timbral features (e.g., instruments) and pitch features (e.g., MIDI note numbers and F0s), they can be applied only to pitched sounds. In addition, global timbres unique to instruments and local variations (e.g., expressions and playing styles) are treated without distinction. Instead, we represent the spectrogram of a musical instrument sound with a variational autoencoder (VAE) that has timbral, pitch, and variation features as latent variables. The pitch clarity or percussiveness, brightness, and F0s (if they exist) are assumed to be represented in the abstract pitch features. The unsupervised disentanglement is achieved by extracting time-invariant and time-varying features as global timbres and local variations, respectively, from randomly pitch-shifted input sounds, and time-varying features as local pitch features from randomly timbre-distorted input sounds. To enhance the disentanglement of the timbral and variation features from the pitch features, the input sounds are separated into spectral envelopes and fine structures with cepstrum analysis. Experiments showed that the proposed method provides effective timbral and pitch features that improve musical instrument classification and pitch estimation.
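A minimal sketch of the envelope/fine-structure separation via real-cepstrum liftering (the frame length, sampling rate, and cutoff quefrency below are illustrative assumptions, not the paper's settings):

```python
import numpy as np

def split_spectrum(frame, sr=16000, lifter_ms=2.0):
    """Split one audio frame into a smooth spectral envelope and a
    harmonic fine structure via real-cepstrum liftering (a standard
    technique; parameters here are assumed, not from the paper)."""
    n_fft = len(frame)
    # Log-magnitude spectrum; small floor avoids log(0).
    log_mag = np.log(np.abs(np.fft.rfft(frame)) + 1e-10)
    # Real cepstrum: inverse FFT of the log-magnitude spectrum.
    cep = np.fft.irfft(log_mag, n=n_fft)
    # Low-quefrency coefficients describe the smooth envelope;
    # the remaining coefficients carry the fine structure.
    cutoff = int(sr * lifter_ms / 1000)
    low = np.zeros_like(cep)
    low[:cutoff] = cep[:cutoff]
    low[-cutoff + 1:] = cep[-cutoff + 1:]  # keep the symmetric part
    envelope = np.fft.rfft(low).real
    fine_structure = log_mag - envelope
    return envelope, fine_structure
```

By construction the two parts sum back to the log-magnitude spectrum, so the separation is lossless and each part can be fed to its own encoder branch.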