Log-Mel Spectrogram
Perceptually-motivated time-frequency representation for audio processing
Core Idea: A log-mel spectrogram is a time-frequency representation that warps the frequency axis onto the mel scale (roughly linear below about 1 kHz, logarithmic above) and applies logarithmic compression to the amplitude values. Both steps approximate human auditory perception, and the result is the standard input for many speech and audio processing models.
Key Elements
Mel Scale Fundamentals
- Perceptual Motivation: Human listeners perceive pitch differences approximately logarithmically, so a fixed step in Hz sounds smaller at high frequencies than at low ones
- Mel Formula: mel(f) = 2595 × log₁₀(1 + f/700)
- Frequency Mapping: Maps linear frequency (Hz) to a perceptual pitch scale (mels); with the formula above, 1000 Hz maps to approximately 1000 mel
- Critical Bands: Approximates the frequency resolution of human hearing
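The mel formula above can be sketched directly in code. The function names `hz_to_mel` and `mel_to_hz` are illustrative and not from any particular library; the inverse is obtained by solving the formula for f.

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Convert Hz to mels using mel(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m: float) -> float:
    """Inverse mapping, solved from the formula: f = 700 * (10**(m / 2595) - 1)."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Equal steps in Hz cover fewer mels as frequency rises.
for f in (100, 1000, 4000, 8000):
    print(f"{f:>5} Hz -> {hz_to_mel(f):7.1f} mel")
```

The printed table shows the compression at high frequencies: doubling from 4000 Hz to 8000 Hz adds far fewer mels than the 8000 mel a linear mapping would suggest.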
Creation Pipeline
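- Compute Spectrogram: Take the STFT of the signal and square its magnitude to get a power spectrogram
- Apply Mel Filterbank: Map the linear frequency bins to mel bands using overlapping triangular filters
- Logarithmic Compression: Take the logarithm of the mel-spectrogram values, flooring them at a small constant to avoid log(0)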
- Normalization: Often normalize values for machine learning applications
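The four steps above can be sketched with NumPy alone. This is a minimal illustration, not a reference implementation: it assumes the HTK-style mel formula given earlier, a symmetric Hann window, no padding at the signal edges, and global mean/variance normalization as one of several possible choices. All function names are hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels, fmin=0.0, fmax=None):
    """Triangular filters whose edges are equally spaced on the mel axis."""
    fmax = sr / 2 if fmax is None else fmax
    # n_mels + 2 points: filter i spans (edge[i], edge[i+1], edge[i+2]).
    edges_hz = mel_to_hz(np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2))
    fft_freqs = np.linspace(0.0, sr / 2, n_fft // 2 + 1)
    fb = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        left, center, right = edges_hz[i:i + 3]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def log_mel_spectrogram(y, sr=16000, n_fft=400, hop=160, n_mels=80, eps=1e-10):
    # 1. Power spectrogram: Hann-windowed frames -> |rFFT|^2
    n_frames = 1 + (len(y) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = y[idx] * np.hanning(n_fft)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2       # (n_frames, n_fft // 2 + 1)
    # 2. Mel filterbank: weighted sums of FFT bins per mel band
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T      # (n_frames, n_mels)
    # 3. Log compression, floored at eps to avoid log(0)
    log_mel = np.log10(np.maximum(mel, eps))
    # 4. Normalization (assumed choice): zero mean, unit variance over the clip
    log_mel = (log_mel - log_mel.mean()) / (log_mel.std() + 1e-8)
    return log_mel.T                                        # (n_mels, n_frames)

# Usage: one second of a 440 Hz tone sampled at 16 kHz.
t = np.arange(16000) / 16000
features = log_mel_spectrogram(np.sin(2 * np.pi * 440 * t))
print(features.shape)  # (n_mels, n_frames)
```

Library implementations differ in details such as window type, edge padding, filter normalization, and the mel-scale variant, so outputs will not match this sketch exactly.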
Standard Parameters (Whisper Example)
- Mel Bins: 80 mel-frequency bands
- Window Size: 25ms (400 samples at 16kHz)
- Hop Size: 10ms stride between windows (160 samples at 16kHz)
- Frequency Range: Typically 0-8000 Hz for speech (the Nyquist limit at 16kHz sampling)
- Log Compression: Usually natural logarithm or log₁₀ (Whisper uses log₁₀)
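The sample counts follow from simple arithmetic on the sample rate. The sketch below derives them; the 30-second clip length is an assumption added for illustration (Whisper processes audio in 30-second chunks) and does not appear in the parameter list above.

```python
SAMPLE_RATE = 16_000   # Hz
WINDOW_MS = 25         # analysis window length
HOP_MS = 10            # stride between successive windows
CLIP_SECONDS = 30      # assumption: Whisper-style fixed-length input chunk

n_fft = SAMPLE_RATE * WINDOW_MS // 1000               # samples per window
hop_length = SAMPLE_RATE * HOP_MS // 1000             # samples per hop
n_freq_bins = n_fft // 2 + 1                          # one-sided FFT bins
nyquist_hz = SAMPLE_RATE // 2                         # highest representable frequency
n_frames = SAMPLE_RATE * CLIP_SECONDS // hop_length   # frames per clip, ignoring edge padding

print(n_fft, hop_length, n_freq_bins, nyquist_hz, n_frames)
```

The exact frame count for a clip depends on the padding convention of the STFT implementation, so it can differ by one from the value computed here.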
Advantages for Audio ML
- Dimensionality Reduction: Fewer frequency bins than the raw spectrogram (e.g., 80 mel bands instead of the 201 bins of a 400-point FFT)
- Perceptual Relevance: Better matches human hearing characteristics
- Dynamic Range Compression: Log compression turns multiplicative gain changes into additive offsets, reducing the impact of amplitude variations such as recording level
- Feature Efficiency: Captures essential speech/audio information compactly
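The effect of log compression on amplitude variations can be checked numerically. In this sketch the random array is only a stand-in for a real mel power spectrogram, and the gain value is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
mel_power = rng.uniform(1e-3, 1.0, size=(80, 50))   # stand-in for a mel power spectrogram

gain = 4.0                                          # amplitude gain; power scales by gain**2
quiet = np.log10(mel_power)
loud = np.log10(mel_power * gain**2)

# In the log domain, the louder recording is the quiet one plus a constant.
offset = loud - quiet
print(offset.min(), offset.max(), 2 * np.log10(gain))
```

Because the whole difference collapses to one constant, per-clip mean subtraction removes it, which is one reason log-mel features are usually normalized before training.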
Applications in Machine Learning
- Whisper: Uses 80-bin log-mel spectrograms as input features
- Speech Recognition: Standard feature representation for ASR systems
- Speaker Recognition: Captures speaker-specific characteristics
- Music Classification: Genre, instrument, and mood detection
- Audio Event Detection: Environmental sound classification
Technical Implementation
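```python
# Typical librosa implementation
import librosa
import numpy as np

audio_file = "speech.wav"  # placeholder path; replace with your own audio file

# Load as mono and resample to 16 kHz
y, sr = librosa.load(audio_file, sr=16000)

# Mel-scaled power spectrogram: 25 ms windows, 10 ms hop, 80 mel bands
mel_spec = librosa.feature.melspectrogram(
    y=y,
    sr=sr,
    n_mels=80,
    n_fft=400,
    hop_length=160,
)

# Convert power to decibels relative to the peak value
log_mel_spec = librosa.power_to_db(mel_spec, ref=np.max)  # shape: (80, n_frames)
```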
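For comparison with the decibel conversion above, here is a sketch of a Whisper-style log compression and normalization step. The constants (a 1e-10 floor, a clamp at 8 log₁₀ units below the peak, and the (x + 4) / 4 rescaling) reflect my understanding of the openai/whisper reference code and should be treated as assumptions to verify against the version in use. The function name is hypothetical.

```python
import numpy as np

def whisper_style_normalize(mel_power: np.ndarray) -> np.ndarray:
    """Log10-compress a mel power spectrogram and rescale it (assumed constants)."""
    log_spec = np.log10(np.maximum(mel_power, 1e-10))        # floor avoids log(0)
    log_spec = np.maximum(log_spec, log_spec.max() - 8.0)    # limit dynamic range below the peak
    return (log_spec + 4.0) / 4.0                            # shift and scale to a small range

# Usage with a tiny hand-made "spectrogram".
example = np.array([[1.0, 1e-12],
                    [1e-2, 1e-4]])
print(whisper_style_normalize(example))
```

A clamp of 8 log₁₀ units on power corresponds to 80 dB, the same dynamic range that `librosa.power_to_db` applies through its default `top_db=80.0`.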
Comparison with Alternatives
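- Raw Spectrogram: Keeps full linear frequency detail but has higher dimensionality
- MFCC: Applies a discrete cosine transform to log-mel features; more compact and decorrelated, but discards fine spectral detail
- CQT: Constant-Q transform with logarithmically spaced bins that align with musical pitch; better for music analysis
- Wavelet Transform: Better time resolution at high frequencies and better frequency resolution at low frequencies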
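The relationship between log-mel features and MFCCs can be sketched as a DCT over the mel axis. This is a minimal illustration assuming an orthonormal DCT-II and 13 retained coefficients (a common but not universal choice); the random input stands in for a real log-mel spectrogram, and the function name is hypothetical.

```python
import numpy as np

def dct_ii_ortho(x: np.ndarray, n_out: int) -> np.ndarray:
    """Orthonormal DCT-II along axis 0, keeping the first n_out coefficients."""
    n = x.shape[0]
    k = np.arange(n_out)[:, None]          # output coefficient index
    m = np.arange(n)[None, :]              # input (mel band) index
    basis = np.cos(np.pi * k * (2 * m + 1) / (2 * n)) * np.sqrt(2.0 / n)
    basis[0] /= np.sqrt(2.0)               # extra scaling for the k = 0 row
    return basis @ x

rng = np.random.default_rng(0)
log_mel = rng.standard_normal((80, 100))   # stand-in for an (n_mels, n_frames) log-mel spectrogram
mfcc = dct_ii_ortho(log_mel, n_out=13)
print(log_mel.shape, "->", mfcc.shape)
```

Keeping only the low-order coefficients retains the smooth spectral envelope and drops fine structure across mel bands, which is the sense in which MFCCs are more compact than log-mel features.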
Additional Connections
- Broader Context: Psychoacoustics (human hearing perception), Audio Feature Extraction (feature engineering)
- Applications: Automatic Speech Recognition, Music Information Retrieval, Audio Classification
- See Also: Mel Scale (frequency scale), MFCC (derived features), Filterbank (signal processing component)
References
- Stevens, S. S., Volkmann, J., & Newman, E. B. (1937). "A scale for the measurement of the psychological magnitude pitch"
- Davis, S., & Mermelstein, P. (1980). "Comparison of parametric representations for monosyllabic word recognition"
- McFee, B., et al. (2015). "librosa: Audio and music signal analysis in Python"
#mel-spectrogram #audio-features #speech-processing #machine-learning #signal-processing