Log-Mel Spectrogram

Perceptually motivated time-frequency representation for audio processing

Core Idea: A log-mel spectrogram is a time-frequency representation that warps the frequency axis onto the mel scale (roughly linear below about 1 kHz, logarithmic above) and applies logarithmic compression to the amplitude values. Both steps approximate human auditory perception, and the result is the standard input for many speech and audio processing models.

Key Elements

Mel Scale Fundamentals

Perceptual Motivation: Based on human pitch perception, where equal perceived pitch steps correspond to progressively wider frequency steps as frequency rises
Mel Formula: mel(f) = 2595 × log₁₀(1 + f/700)
Frequency Mapping: Maps linear frequency in Hz to the perceptual mel scale; with the formula above, 1000 Hz maps to about 1000 mel
Critical Bands: Approximates the frequency resolution of human hearing
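
The mel formula above can be sketched directly in NumPy. This is a minimal illustration, not a library API: the function names `hz_to_mel` and `mel_to_hz` are chosen here for the example, and the inverse is derived algebraically from the formula.

```python
import numpy as np

def hz_to_mel(f_hz):
    """Mel formula from the text: mel(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(m):
    """Algebraic inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

# The same 100 Hz step covers fewer mels as frequency rises.
low_step = hz_to_mel(200.0) - hz_to_mel(100.0)
high_step = hz_to_mel(4100.0) - hz_to_mel(4000.0)
```

Note that toolkits differ in which mel variant they use by default; librosa, for example, defaults to a Slaney-style scale unless `htk=True` is passed, so values from this formula may not match library output exactly.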

Creation Pipeline

Compute Spectrogram: Compute a magnitude or power spectrogram with the short-time Fourier transform (STFT)
Apply Mel Filterbank: Pool the linear frequency bins into mel bands using overlapping triangular filters
Logarithmic Compression: Take the logarithm of the mel-spectrogram values, with a small floor to avoid log(0)
Normalization: Often rescale or standardize the values for machine learning applications
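
The first three pipeline steps can be sketched with NumPy alone. This is a simplified illustration under stated assumptions: a symmetric Hann window, no padding at the signal edges, unnormalized triangular filters, and a floor of 1e-10 before the log. Library implementations differ in these details, and all function names here are chosen for the example.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def power_spectrogram(y, n_fft=400, hop=160):
    """Step 1: Hann-windowed STFT power, shape (1 + n_fft // 2, n_frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack([y[i * hop : i * hop + n_fft] * window for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T ** 2

def mel_filterbank(sr=16000, n_fft=400, n_mels=80, fmin=0.0, fmax=8000.0):
    """Step 2: triangular filters whose edges are equally spaced in mel."""
    fft_freqs = np.linspace(0.0, sr / 2, 1 + n_fft // 2)
    edges = mel_to_hz(np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2))
    fb = np.zeros((n_mels, len(fft_freqs)))
    for m in range(n_mels):
        lo, center, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (fft_freqs - lo) / (center - lo)
        falling = (hi - fft_freqs) / (hi - center)
        fb[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def log_mel(y, sr=16000, n_fft=400, hop=160, n_mels=80, eps=1e-10):
    """Steps 1-3: power spectrogram -> mel filterbank -> log10."""
    fb = mel_filterbank(sr, n_fft, n_mels, 0.0, sr / 2)
    mel_power = fb @ power_spectrogram(y, n_fft, hop)
    return np.log10(np.maximum(mel_power, eps))  # floor avoids log(0)

# One second of a 440 Hz tone as a stand-in signal
sr = 16000
t = np.arange(sr) / sr
features = log_mel(np.sin(2 * np.pi * 440.0 * t), sr=sr)
```

Here `features` has one row per mel band and one column per frame. The normalization step is left out because it depends on the downstream model.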

Standard Parameters (Whisper Example)

Mel Bins: 80 mel-frequency bands
Window Size: 25ms (400 samples at 16kHz)
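Hop Size: 10ms stride between windows (160 samples at 16kHz)
Frequency Range: Typically 0-8000 Hz for speech (8000 Hz is the Nyquist limit at a 16kHz sample rate)
Log Compression: Natural logarithm or log₁₀, depending on the toolkit; Whisper uses log₁₀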
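
The sample counts behind these parameters follow from simple arithmetic. The sketch below works them out; the 30-second clip length is an assumption added for illustration, and the frame count assumes one frame per hop (padding conventions vary between toolkits).

```python
sr = 16000                    # sample rate in Hz
window_ms, hop_ms = 25, 10    # window and hop durations from the list above

n_fft = sr * window_ms // 1000       # 400 samples per window
hop_length = sr * hop_ms // 1000     # 160 samples between window starts
nyquist_hz = sr // 2                 # 8000 Hz, the top of the representable range

clip_seconds = 30                             # assumed clip length, illustration only
n_frames = clip_seconds * sr // hop_length    # 3000 frames at one frame per hop
```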

Advantages for Audio ML

Dimensionality Reduction: Fewer frequency bins than raw spectrogram
Perceptual Relevance: Better matches human hearing characteristics
Amplitude Robustness: Log compression narrows the dynamic range and turns overall gain changes into additive offsets, reducing sensitivity to recording level
Feature Efficiency: Captures essential speech/audio information compactly
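
Two of these advantages can be checked numerically. The short sketch below uses made-up power values (not real audio) to show the reduction in bins per frame and the way a log turns a gain change into a constant offset.

```python
import numpy as np

n_fft, n_mels = 400, 80
n_linear_bins = 1 + n_fft // 2          # 201 STFT bins per frame
reduction = n_linear_bins / n_mels      # about 2.5x fewer values per frame

# Three illustrative power values spanning six orders of magnitude
power = np.array([1e-6, 1e-3, 1.0])
log_power_db = 10.0 * np.log10(power)   # -60, -30, 0 dB

# Scaling the whole signal's power by a gain adds the same offset to every bin
gain = 4.0
offset_db = 10.0 * np.log10(gain * power) - log_power_db
```

Every entry of `offset_db` is the same number, which is why a simple mean subtraction after the log can remove differences in recording level.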

Applications in Machine Learning

Whisper: Uses 80-bin log-mel spectrograms as input features (the later large-v3 model uses 128 bins)
Speech Recognition: Standard feature representation for ASR systems
Speaker Recognition: Captures speaker-specific characteristics
Music Classification: Genre, instrument, and mood detection
Audio Event Detection: Environmental sound classification

Technical Implementation

python

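# Typical librosa implementation
import librosa
import numpy as np

# Hypothetical input path; replace with a real audio file
audio_file = "speech.wav"

# Load as mono and resample to 16 kHz
y, sr = librosa.load(audio_file, sr=16000)

# Mel-scaled power spectrogram, shape (n_mels, n_frames)
mel_spec = librosa.feature.melspectrogram(
    y=y,
    sr=sr,
    n_mels=80,
    n_fft=400,       # 25 ms window at 16 kHz
    hop_length=160   # 10 ms hop at 16 kHz
)

# Convert power to decibels relative to the peak value
log_mel_spec = librosa.power_to_db(mel_spec, ref=np.max)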
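
To make the final dB conversion less of a black box, here is a NumPy sketch of what a peak-referenced power-to-dB step does. It is an approximation written for this note, not librosa's source; the name `power_to_db_sketch` is hypothetical, and the floor of 1e-10 and the 80 dB range are assumed to mirror librosa's documented defaults (`amin`, `top_db`).

```python
import numpy as np

def power_to_db_sketch(S, amin=1e-10, top_db=80.0):
    """Convert power to dB relative to the array's peak, limited to top_db of range."""
    S = np.asarray(S, dtype=float)
    ref = max(S.max(), amin)
    # 10 * log10(S / ref), with a floor on S to avoid log(0)
    log_spec = 10.0 * np.log10(np.maximum(S, amin)) - 10.0 * np.log10(ref)
    # Clip everything more than top_db below the peak
    return np.maximum(log_spec, log_spec.max() - top_db)

# Made-up mel power values: the peak maps to 0 dB, near-silence is clipped to -80 dB
mel_power = np.array([[1.0, 0.1], [1e-3, 1e-12]])
db = power_to_db_sketch(mel_power)   # [[0, -10], [-30, -80]]
```

Because the reference is the peak of each input, the output always lies between -80 and 0 dB, which is a convenient fixed range for plotting or for rescaling before a model.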

Comparison with Alternatives

Raw Spectrogram: More frequency detail but higher dimensionality
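MFCC: Discrete cosine transform of the log-mel energies; more compact and decorrelated, but discards fine spectral detail
CQT: Constant-Q transform with logarithmically spaced frequency bins, better suited to music analysis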
Wavelet Transform: Better time resolution at high frequencies
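
The relationship between log-mel features and MFCCs can be sketched as a DCT along the mel axis. This is a minimal illustration assuming an orthonormal DCT-II and 13 retained coefficients (a common but not universal choice); the function names are chosen for the example, and the input is random data standing in for a real log-mel spectrogram.

```python
import numpy as np

def dct2_orthonormal(n_out, n_in):
    """First n_out rows of the orthonormal DCT-II matrix for length-n_in inputs."""
    k = np.arange(n_out)[:, None]
    n = np.arange(n_in)[None, :]
    basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_in))
    basis *= np.sqrt(2.0 / n_in)
    basis[0] /= np.sqrt(2.0)   # extra scaling for the constant (k = 0) row
    return basis

def mfcc_from_log_mel(log_mel, n_mfcc=13):
    """Apply the DCT along the mel axis of a (n_mels, n_frames) array."""
    return dct2_orthonormal(n_mfcc, log_mel.shape[0]) @ log_mel

# Stand-in input shaped like 80 mel bands x 100 frames
rng = np.random.default_rng(0)
log_mel = rng.normal(size=(80, 100))
mfcc = mfcc_from_log_mel(log_mel)   # shape (13, 100)
```

Keeping only the first few coefficients retains the smooth spectral envelope and drops fine structure, which is the sense in which MFCCs are more compact than the log-mel input.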

Additional Connections

Broader Context: Psychoacoustics (human hearing perception), Audio Feature Extraction (feature engineering)
Applications: Automatic Speech Recognition, Music Information Retrieval, Audio Classification
See Also: Mel Scale (frequency scale), MFCC (derived features), Filterbank (signal processing component)

References

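Stevens, S. S., Volkmann, J., & Newman, E. B. (1937). "A scale for the measurement of the psychological magnitude pitch". The Journal of the Acoustical Society of America, 8(3), 185-190.
Davis, S., & Mermelstein, P. (1980). "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences". IEEE Transactions on Acoustics, Speech, and Signal Processing, 28(4), 357-366.
McFee, B., et al. (2015). "librosa: Audio and music signal analysis in Python". Proceedings of the 14th Python in Science Conference.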

#mel-spectrogram #audio-features #speech-processing #machine-learning #signal-processing