The pre-processing step that breaks text into discrete units for model processing
Core Idea: Tokenization is the pre-processing step in language models that converts raw text into discrete tokens (words or subwords), effectively compressing the input sequence so the model can process it efficiently.
Key Elements
Functionality
- Acts as a compression mechanism to reduce sequence length before processing
- Converts variable-length text into tokens drawn from a fixed vocabulary
- Enables efficient processing in Transformer architectures
- Mitigates the quadratic cost of self-attention by shortening the input sequence (see the sketch below)
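To make the compression benefit concrete, here is a minimal Python sketch comparing the quadratic self-attention cost at character-level versus subword-level sequence lengths; the example sentence and the ~4 characters-per-token ratio are illustrative assumptions, not measurements.

```python
# Minimal sketch: shorter tokenized sequences shrink the quadratic cost of
# self-attention. The ~4 chars/token ratio is an illustrative assumption.

text = "Tokenization compresses raw text before the model ever sees it."

char_len = len(text)               # sequence length at character level
token_len = max(1, char_len // 4)  # assumed ~4 characters per subword token

def attention_cost(seq_len: int) -> int:
    """Self-attention compares every position with every other: O(n^2)."""
    return seq_len * seq_len

print(f"character-level: {char_len} positions -> cost ~{attention_cost(char_len)}")
print(f"subword-level:   {token_len} positions -> cost ~{attention_cost(token_len)}")
```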
Common Approaches
- Subword Tokenization: Most widely used, e.g. in GPT, BERT, and T5
  - Balances vocabulary size and sequence length
  - Creates tokens representing partial words based on frequency (a toy BPE sketch follows this list)
  - Example: "tokenization" → ["token", "ization"]
- Character-level: Processes text character by character
  - Very long sequences but a small vocabulary
- Byte-level: Operates on raw byte sequences
  - Universal representation across all languages
  - Fixed vocabulary size (256 possible byte values)
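As a rough illustration of how subword vocabularies emerge from frequency, the following toy byte-pair-encoding (BPE) sketch starts from characters and repeatedly merges the most frequent adjacent pair. The tiny corpus and the number of merges are made-up assumptions; real tokenizers learn tens of thousands of merges from large corpora.

```python
from collections import Counter

# Toy BPE sketch: start from character symbols and repeatedly merge the most
# frequent adjacent pair. Corpus and merge count are illustrative only.

corpus = ["tokenization", "tokenizer", "tokens", "organization"]

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the most common."""
    pairs = Counter()
    for w in words:
        for a, b in zip(w, w[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = []
    for w in words:
        out, i = [], 0
        while i < len(w):
            if i + 1 < len(w) and (w[i], w[i + 1]) == pair:
                out.append(w[i] + w[i + 1])
                i += 2
            else:
                out.append(w[i])
                i += 1
        merged.append(out)
    return merged

words = [list(w) for w in corpus]   # character-level starting point
for _ in range(8):                  # apply a handful of merges
    pair = most_frequent_pair(words)
    if pair is None:
        break
    words = merge_pair(words, pair)

# Frequent fragments such as "token" and "iz" end up as single symbols.
for original, split in zip(corpus, words):
    print(f"{original:14} -> {split}")
```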
Critical Limitations
- Character Sensitivity: Spelling errors or capitalization changes can produce an entirely different token sequence (see the tokenizer sketch after this list)
- Uneven Compression Rates: Different languages achieve vastly different compression efficiencies
  - High-resource languages (e.g. English): ~4-5 characters per token
  - Low-resource languages: Often tokenized at near-character level
- Language Fairness Issues: Users of low-resource languages are "overcharged" under token-based API pricing
- Morphological Challenges: Languages with non-concatenative morphology (like Arabic) tokenize inefficiently
  - Infixes and non-adjacent meaningful units break tokenizer assumptions
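A quick way to observe the character-sensitivity and uneven-compression points is to run a real subword tokenizer over a few strings. The sketch below assumes the tiktoken package is installed and uses its cl100k_base encoding; the sample sentences are illustrative, and exact token counts will vary by tokenizer.

```python
import tiktoken  # assumes `pip install tiktoken`

enc = tiktoken.get_encoding("cl100k_base")  # one common GPT-style encoding

# 1) Character sensitivity: a typo or case change yields different token ids.
for word in ["tokenization", "Tokenization", "tokenizaton"]:
    print(f"{word!r:16} -> {enc.encode(word)}")

# 2) Uneven compression: characters per token differ sharply across languages.
samples = {
    "English": "The quick brown fox jumps over the lazy dog.",
    "Greek":   "Η γρήγορη καφέ αλεπού πηδάει πάνω από το τεμπέλικο σκυλί.",
}
for lang, text in samples.items():
    n_tokens = len(enc.encode(text))
    print(f"{lang}: {len(text)} chars / {n_tokens} tokens "
          f"= {len(text) / n_tokens:.1f} chars per token")
```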
Practical Implications
- API costs vary significantly by language due to uneven tokenization efficiency (see the cost sketch after this list)
- Model performance on character-level tasks depends heavily on tokenization approach
- Spelling correction and word search tasks particularly challenging for subword models
- Important consideration for multilingual applications
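To illustrate the pricing point, the following back-of-the-envelope sketch converts characters-per-token ratios into a per-document cost; the price and both ratios are hypothetical numbers chosen only to show the mechanics.

```python
# Back-of-the-envelope sketch: how characters-per-token translates into cost
# under token-based API pricing. Price and ratios are hypothetical assumptions.

price_per_1k_tokens = 0.01  # hypothetical USD price per 1,000 tokens

chars_per_token = {
    "high-resource language (e.g. English)": 4.5,          # assumed ratio
    "low-resource language (near character-level)": 1.1,   # assumed ratio
}

doc_chars = 10_000  # a ~10,000-character document in each language

for label, ratio in chars_per_token.items():
    tokens = doc_chars / ratio
    cost = tokens / 1_000 * price_per_1k_tokens
    print(f"{label}: ~{tokens:,.0f} tokens -> ~${cost:.3f}")
```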
Additional Connections
- Broader Context: Transformer Architecture (designed to work with tokenized input)
- Applications: API Pricing Models (token-based charging)
- See Also: Byte-level Language Models (alternative approach avoiding tokenization problems)
References
- Kallini, J. (2024). MrT5: Dynamic token merging for efficient byte-level language models. TWIML AI Podcast interview.
- Tay, Y., et al. (2021). Charformer: Fast character transformers via gradient-based subword tokenization.
#nlp #language-models #tokenization #preprocessing