A learned approach to reducing sequence length by selectively removing redundant tokens
Core Idea: Dynamic token merging lets a language model learn which tokens can be safely removed after a few initial layers of processing, compressing sequences automatically while preserving the important information they carry.
Key Elements
Core Mechanism
- Information Diffusion Phase: initial model layers process the full sequence
  - The attention mechanism distributes information across token representations
  - Each token gradually incorporates context from surrounding tokens
- Learned Selection: the model determines which tokens now carry redundant information
  - No pre-defined rules for token selection
  - Selection is based on the contextual importance of each token's information
- Implementation Methods (see the sketch after this list):
  - Attention Masking: during training, selected tokens are masked out of attention
  - Hard Deletion: during inference, selected tokens are physically removed from the sequence
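A minimal PyTorch sketch of these two modes, assuming per-token keep logits have already been produced by a learned gate; the function names, the sigmoid parameterization, and the 0.5 threshold are illustrative assumptions, not the published implementation.

```python
import torch

def soft_delete_attention_mask(keep_logits: torch.Tensor) -> torch.Tensor:
    """Training-time soft deletion: convert per-token keep logits into an
    additive attention mask, so dropped tokens are ignored by attention
    while tensor shapes (and gradients) stay intact.

    keep_logits: (batch, seq_len), higher = more likely to keep.
    Returns: (batch, 1, 1, seq_len) mask to add to attention scores.
    """
    keep_prob = torch.sigmoid(keep_logits)          # in (0, 1)
    additive_mask = torch.log(keep_prob + 1e-9)     # ~0 if kept, very negative if dropped
    return additive_mask[:, None, None, :]

def hard_delete(hidden: torch.Tensor, keep_logits: torch.Tensor, threshold: float = 0.5):
    """Inference-time hard deletion: physically remove low-score tokens.

    hidden: (batch, seq_len, d_model). Returns a list of shorter tensors,
    since each example may keep a different number of tokens.
    """
    keep = torch.sigmoid(keep_logits) > threshold
    return [hidden[b, keep[b]] for b in range(hidden.size(0))]
```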
Technical Implementation
- Gating Function: a learned function assigns each token an importance score (see the sketch after this list)
  - Each token receives a probability of being kept or dropped
  - A controller adjusts the deletion rate to meet a target compression rate
- Training Objective: balances compression rate against task performance
  - Optimized jointly with the primary task objective
  - The compression rate can be explicitly controlled as a hyperparameter
- Architecture Integration: typically added after the initial Transformer layers
  - Early layers (1-2) process the full sequence
  - Middle and later layers operate on the compressed sequence
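The gate, the controller, and the joint objective can be sketched as follows (PyTorch). This is an illustrative reading of the description above, not the authors' code: `DeleteGate`, the simple proportional controller, and the specific regularizer term are assumptions.

```python
import torch
import torch.nn as nn

class DeleteGate(nn.Module):
    """Learned gating function: one keep/drop logit per token."""
    def __init__(self, d_model: int):
        super().__init__()
        self.score = nn.Linear(d_model, 1)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, d_model) -> keep logits: (batch, seq_len)
        return self.score(hidden).squeeze(-1)

class CompressionController:
    """Nudges the regularizer weight so the realized deletion rate tracks a
    target compression rate (a simple proportional controller)."""
    def __init__(self, target_delete_rate: float, step: float = 1e-3):
        self.target = target_delete_rate
        self.weight = 0.0
        self.step = step

    def update(self, observed_delete_rate: float) -> float:
        # Too little deletion -> raise the pressure; too much -> relax it.
        self.weight = max(0.0, self.weight + self.step * (self.target - observed_delete_rate))
        return self.weight

def joint_loss(task_loss: torch.Tensor, keep_logits: torch.Tensor,
               controller: CompressionController) -> torch.Tensor:
    """Task loss plus a deletion regularizer, weighted by the controller so the
    model settles near the target compression instead of keeping everything."""
    keep_prob = torch.sigmoid(keep_logits)
    delete_rate = 1.0 - keep_prob.mean().item()
    weight = controller.update(delete_rate)
    return task_loss + weight * keep_prob.mean()

# Typical placement: the gate sits after the first 1-2 encoder layers; later
# layers then attend over the (soft- or hard-) compressed sequence.
```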
Advantages Over Static Approaches
- Content-Aware Compression: adapts to the specific content of each sequence
  - Highly predictable segments receive higher compression
  - Complex or important segments are preserved with less compression
- Language-Specific Adaptation: automatically learns different compression rates per language
  - Chinese is compressed less because its characters are information-dense (byte-count example after this list)
  - Latin-script languages are compressed more aggressively
- No Vocabulary Constraints: works with any tokenization scheme, including byte-level
  - Compatible with fixed vocabularies
  - No need for language-specific tokenizers
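A small, self-contained illustration of why byte-level sequences differ in information density across scripts (the sentences are arbitrary examples, not from the paper): each Chinese character costs three UTF-8 bytes but carries roughly a word's worth of meaning, so a learned gate finds fewer redundant bytes to delete.

```python
# Byte-level tokenization: every UTF-8 byte is a token.
english = "Dynamic token merging shortens sequences."
chinese = "动态词元合并缩短序列。"

print(len(english), len(english.encode("utf-8")))  # 41 characters -> 41 bytes
print(len(chinese), len(chinese.encode("utf-8")))  # 11 characters -> 33 bytes
```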
Practical Applications
- Inference Optimization: significantly reduces computation for deployed models (rough cost arithmetic after this list)
  - Particularly effective for encoder-heavy architectures
  - Reported 45-50% sequence length reduction with minimal performance impact
- Multilingual Fairness: helps address tokenization disparities across languages
  - Automatically adjusts compression based on information density
  - Reduces the efficiency gap between high- and low-resource languages
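Back-of-the-envelope arithmetic (not a benchmark) for the inference saving: using the rough per-layer cost of L²·d for self-attention plus 8·L·d² for the feed-forward block, a 12-layer encoder with 2 full-length layers and 50% of tokens kept afterwards costs about 55% of the baseline. All concrete numbers here are illustrative assumptions.

```python
def relative_cost(seq_len: int, d_model: int, n_layers: int,
                  full_layers: int, keep_fraction: float) -> float:
    """Approximate encoder cost with compression vs. the full-length baseline."""
    def layer_cost(L: int) -> int:
        return L * L * d_model + 8 * L * d_model * d_model  # attention + FFN

    baseline = n_layers * layer_cost(seq_len)
    short = int(seq_len * keep_fraction)
    compressed = (full_layers * layer_cost(seq_len)
                  + (n_layers - full_layers) * layer_cost(short))
    return compressed / baseline

# Example: 1024-byte input, d_model=768, 12 layers, gate after layer 2, keep 50%.
print(round(relative_cost(1024, 768, 12, 2, 0.5), 2))  # ~0.55
```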
Additional Connections
- Broader Context: Sequence Length Optimization (broader category)
- Applications: Inference Cost Reduction (practical benefit)
- See Also: Attention Mechanism Efficiency (related optimization approach)
References
- Kallini, J. (2024). MrT5: Dynamic token merging for efficient byte-level language models. TWIML AI Podcast interview.
- Kallini, J., et al. (2024). MrT5: Dynamic token merging for efficient byte-level language models. Research paper.
#nlp #efficiency #language-models #optimization #sequence-compression