The pre-processing step that breaks text into discrete units for model processing
Core Idea: Tokenization is the pre-processing step in language models that converts raw text into discrete tokens (words or subwords), effectively compressing the input sequence so the model can process it efficiently.
Key Elements
Functionality
- Acts as a compression mechanism to reduce sequence length before processing
- Converts variable-length text into tokens drawn from a fixed vocabulary
- Enables efficient processing in Transformer architectures
- Mitigates the quadratic cost of self-attention by shortening the input sequence (see the sketch below)
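To make the compression benefit concrete, here is a minimal Python sketch comparing the quadratic self-attention cost at character-level versus subword-level sequence lengths; the example sentence and the ~4 characters-per-token ratio are illustrative assumptions, not measurements.

```python
# Minimal sketch: shorter tokenized sequences shrink the quadratic cost of
# self-attention. The ~4 chars/token ratio is an illustrative assumption.

text = "Tokenization compresses raw text before the model ever sees it."

char_len = len(text)               # sequence length at character level
token_len = max(1, char_len // 4)  # assumed ~4 characters per subword token

def attention_cost(seq_len: int) -> int:
    """Self-attention compares every position with every other: O(n^2)."""
    return seq_len * seq_len

print(f"character-level: {char_len} positions -> cost ~{attention_cost(char_len)}")
print(f"subword-level:   {token_len} positions -> cost ~{attention_cost(token_len)}")
```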
Common Approaches
- Subword Tokenization: Most widely used, e.g. in GPT, BERT, and T5
  - Balances vocabulary size and sequence length
  - Creates tokens representing partial words based on frequency (a toy BPE sketch follows this list)
  - Example: "tokenization" → ["token", "ization"]
- Character-level: Processes text character by character
  - Very long sequences but a small vocabulary
- Byte-level: Operates on raw byte sequences
  - Universal representation across all languages
  - Fixed vocabulary size (256 possible byte values)
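As a rough illustration of how subword vocabularies emerge from frequency, the following toy byte-pair-encoding (BPE) sketch starts from characters and repeatedly merges the most frequent adjacent pair. The tiny corpus and the number of merges are made-up assumptions; real tokenizers learn tens of thousands of merges from large corpora.

```python
from collections import Counter

# Toy BPE sketch: start from character symbols and repeatedly merge the most
# frequent adjacent pair. Corpus and merge count are illustrative only.

corpus = ["tokenization", "tokenizer", "tokens", "organization"]

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the most common."""
    pairs = Counter()
    for w in words:
        for a, b in zip(w, w[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = []
    for w in words:
        out, i = [], 0
        while i < len(w):
            if i + 1 < len(w) and (w[i], w[i + 1]) == pair:
                out.append(w[i] + w[i + 1])
                i += 2
            else:
                out.append(w[i])
                i += 1
        merged.append(out)
    return merged

words = [list(w) for w in corpus]   # character-level starting point
for _ in range(8):                  # apply a handful of merges
    pair = most_frequent_pair(words)
    if pair is None:
        break
    words = merge_pair(words, pair)

# Frequent fragments such as "token" and "iz" end up as single symbols.
for original, split in zip(corpus, words):
    print(f"{original:14} -> {split}")
```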
Critical Limitations
- Character Sensitivity: Spelling errors or capitalization changes can produce an entirely different token sequence (see the tokenizer sketch after this list)
- Uneven Compression Rates: Different languages achieve vastly different compression efficiencies
  - High-resource languages (e.g. English): ~4-5 characters per token
  - Low-resource languages: Often tokenized at near-character level
- Language Fairness Issues: Users of low-resource languages are "overcharged" under token-based API pricing
- Morphological Challenges: Languages with non-concatenative morphology (like Arabic) tokenize inefficiently
  - Infixes and non-adjacent meaningful units break tokenizer assumptions
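A quick way to observe the character-sensitivity and uneven-compression points is to run a real subword tokenizer over a few strings. The sketch below assumes the tiktoken package is installed and uses its cl100k_base encoding; the sample sentences are illustrative, and exact token counts will vary by tokenizer.

```python
import tiktoken  # assumes `pip install tiktoken`

enc = tiktoken.get_encoding("cl100k_base")  # one common GPT-style encoding

# 1) Character sensitivity: a typo or case change yields different token ids.
for word in ["tokenization", "Tokenization", "tokenizaton"]:
    print(f"{word!r:16} -> {enc.encode(word)}")

# 2) Uneven compression: characters per token differ sharply across languages.
samples = {
    "English": "The quick brown fox jumps over the lazy dog.",
    "Greek":   "Η γρήγορη καφέ αλεπού πηδάει πάνω από το τεμπέλικο σκυλί.",
}
for lang, text in samples.items():
    n_tokens = len(enc.encode(text))
    print(f"{lang}: {len(text)} chars / {n_tokens} tokens "
          f"= {len(text) / n_tokens:.1f} chars per token")
```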
Practical Implications
- API costs vary significantly by language due to uneven tokenization efficiency (see the cost sketch after this list)
- Model performance on character-level tasks depends heavily on tokenization approach
- Spelling correction and word search tasks particularly challenging for subword models
- Important consideration for multilingual applications
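To illustrate the pricing point, the following back-of-the-envelope sketch converts characters-per-token ratios into a per-document cost; the price and both ratios are hypothetical numbers chosen only to show the mechanics.

```python
# Back-of-the-envelope sketch: how characters-per-token translates into cost
# under token-based API pricing. Price and ratios are hypothetical assumptions.

price_per_1k_tokens = 0.01  # hypothetical USD price per 1,000 tokens

chars_per_token = {
    "high-resource language (e.g. English)": 4.5,          # assumed ratio
    "low-resource language (near character-level)": 1.1,   # assumed ratio
}

doc_chars = 10_000  # a ~10,000-character document in each language

for label, ratio in chars_per_token.items():
    tokens = doc_chars / ratio
    cost = tokens / 1_000 * price_per_1k_tokens
    print(f"{label}: ~{tokens:,.0f} tokens -> ~${cost:.3f}")
```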
Additional Connections
- Broader Context: Transformer Architecture (designed to work with tokenized input)
- Applications: API Pricing Models (token-based charging)
- See Also: Byte-level Language Models (alternative approach avoiding tokenization problems)
References
- Kallini, J. (2024). MrT5: Dynamic token merging for efficient byte-level language models. TWIML AI Podcast interview.
- Tay, Y., et al. (2021). Charformer: Fast character transformers via gradient-based subword tokenization.
#nlp #language-models #tokenization #preprocessing