Log-Mel Spectrogram

#atom

Perceptually motivated time-frequency representation for audio processing

Core Idea: A log-mel spectrogram is a time-frequency representation that warps the frequency axis onto the mel scale (roughly linear below about 1 kHz and logarithmic above it) and applies logarithmic compression to the amplitude values. Both steps bring the representation closer to human auditory perception, and the result is the standard input for many speech and audio processing models.

Key Elements

Mel Scale Fundamentals
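
As a rough illustration of the mel scale, the sketch below converts between hertz and mels. It assumes the widely used formula mel = 2595 * log10(1 + f / 700); other variants exist (librosa, for instance, defaults to a Slaney-style piecewise formula), so the function names and the choice of formula here are illustrative assumptions rather than a definitive definition.

```python
import math


def hz_to_mel(freq_hz: float) -> float:
    """Convert hertz to mels (assumed formula: 2595 * log10(1 + f / 700))."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel under the same assumed formula."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


if __name__ == "__main__":
    # Equal steps in hertz cover fewer and fewer mels as frequency rises.
    for f in (100, 200, 1000, 2000, 4000, 8000):
        print(f"{f:>5} Hz -> {hz_to_mel(f):7.1f} mel")
```

Under this formula 1000 Hz maps to roughly 1000 mel, and the step from 7 kHz to 8 kHz spans far fewer mels than the step from 1 kHz to 2 kHz, which is the compression of high frequencies the scale is meant to capture.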

Creation Pipeline

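  1. Compute Spectrogram: Generate a power spectrogram from the signal using the short-time Fourier transform (STFT)
  2. Apply Mel Filterbank: Map the linear frequency bins onto mel-spaced bands using overlapping triangular filters
  3. Logarithmic Compression: Take the logarithm of the mel-spectrogram values (often expressed in decibels)
  4. Normalization: Optionally normalize the values for machine learning applications (for example, to zero mean and unit variance)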
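
The four steps above can be sketched with NumPy alone. This is a minimal illustration, not a reference implementation: the helper names are hypothetical, the mel formula (2595 * log10(1 + f / 700)) is one common variant, frames are taken without padding, the filters are unnormalized triangles, and global mean/variance standardization stands in for whatever normalization a given model expects. Its output will not match librosa or any particular model's front end numerically.

```python
import numpy as np


def hz_to_mel(f):
    # Assumed mel formula; other variants exist.
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)


def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters with peaks equally spaced on the mel scale."""
    # n_mels + 2 edge points between 0 Hz and the Nyquist frequency.
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, ctr, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        rising = (fft_freqs - lo) / (ctr - lo)
        falling = (hi - fft_freqs) / (hi - ctr)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb


def log_mel_spectrogram(y, sr=16000, n_fft=400, hop_length=160, n_mels=80):
    y = np.asarray(y, dtype=float)
    # 1. Power spectrogram via a Hann-windowed STFT (no padding).
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop_length
    frames = np.stack(
        [y[t * hop_length : t * hop_length + n_fft] * window for t in range(n_frames)]
    )
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # (n_frames, n_fft // 2 + 1)
    # 2. Project the linear frequency bins onto mel bands.
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T  # (n_frames, n_mels)
    # 3. Log compression; the small offset avoids log(0).
    log_mel = np.log(mel + 1e-10)
    # 4. Normalization (one simple choice: zero mean, unit variance overall).
    log_mel = (log_mel - log_mel.mean()) / (log_mel.std() + 1e-8)
    return log_mel.T  # (n_mels, n_frames)


if __name__ == "__main__":
    t = np.arange(16000) / 16000.0           # one second at 16 kHz
    tone = np.sin(2 * np.pi * 440.0 * t)     # synthetic 440 Hz test tone
    print(log_mel_spectrogram(tone).shape)
```

In practice a library routine is usually preferable; the point of the sketch is only to make each stage of the pipeline visible.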

Standard Parameters (Whisper Example)

Advantages for Audio ML

Applications in Machine Learning

Technical Implementation

python

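# Typical librosa implementation
import librosa
import numpy as np

# Placeholder path; replace with a real audio file
audio_file = "audio.wav"

# Load as mono and resample to 16 kHz
y, sr = librosa.load(audio_file, sr=16000)

# Mel-scaled power spectrogram, shape (n_mels, n_frames)
mel_spec = librosa.feature.melspectrogram(
    y=y,
    sr=sr,
    n_mels=80,       # number of mel bands
    n_fft=400,       # 25 ms window at 16 kHz
    hop_length=160,  # 10 ms hop at 16 kHz
)

# Convert power to decibels relative to the peak (maximum becomes 0 dB)
log_mel_spec = librosa.power_to_db(mel_spec, ref=np.max)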
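
To make the parameters in the snippet above more concrete, the following sketch converts them into window and hop durations and estimates the output shape for a clip of a given length. The helper `expected_shape` is hypothetical, and the frame count assumes librosa's default centered framing (`center=True`), where the number of frames is 1 + n_samples // hop_length.

```python
# Parameters taken from the librosa example above.
sr = 16000
n_fft = 400
hop_length = 160
n_mels = 80

# Durations implied by the sample rate.
window_ms = 1000.0 * n_fft / sr      # analysis window length in milliseconds
hop_ms = 1000.0 * hop_length / sr    # time step between frames in milliseconds


def expected_shape(n_samples, hop_length=160, n_mels=80):
    """Estimated (n_mels, n_frames) shape, assuming centered framing."""
    return (n_mels, 1 + n_samples // hop_length)


if __name__ == "__main__":
    print(f"window: {window_ms} ms, hop: {hop_ms} ms")
    print("30 s clip:", expected_shape(30 * sr))
```

With these settings each frame covers 25 ms of audio and frames advance by 10 ms, so a 30-second clip yields about 3000 frames of 80 mel bands each.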

Comparison with Alternatives

Additional Connections

References

  1. Stevens, S. S., Volkmann, J., & Newman, E. B. (1937). "A scale for the measurement of the psychological magnitude pitch"
  2. Davis, S., & Mermelstein, P. (1980). "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences"
  3. McFee, B., et al. (2015). "librosa: Audio and music signal analysis in Python"

#mel-spectrogram #audio-features #speech-processing #machine-learning #signal-processing