The discrete units of text that language models process
Core Idea: Tokens are the fundamental units that language models operate on: each token represents a word, part of a word, or an individual character, and is mapped to a numerical ID before the model processes it.
Key Elements
Tokenization Process
- Text is broken down into tokens according to a defined vocabulary
- A vocabulary typically contains 50,000-200,000 possible tokens
- Common words may be single tokens, while rare words are split into multiple tokens (see the sketch after this list)
- Tokenizers are trained on specific corpora, so they are language-dependent and typically compress some languages (often English) more efficiently than others
- Special tokens mark the beginning/end of messages or indicate specific functions
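The sketch below walks through this process with OpenAI's tiktoken library (listed in the references). The sample sentence and the choice of the `cl100k_base` encoding (a roughly 100,000-token vocabulary) are illustrative assumptions, not a statement about any particular model.

```python
# Minimal tokenization sketch using OpenAI's tiktoken library (pip install tiktoken).
# The "cl100k_base" encoding and the sample sentence are illustrative choices.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Tokenization splits text into subword units."
token_ids = enc.encode(text)                        # text -> list of integer token IDs
pieces = [enc.decode([tid]) for tid in token_ids]   # decode each ID back to its text piece

print(token_ids)                      # a short list of integers
print(pieces)                         # common words stay whole; rarer words split into pieces
assert enc.decode(token_ids) == text  # decoding the IDs recovers the original text
```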
Token Characteristics
- Each token is assigned a unique ID number for model processing
- One token corresponds to roughly 4 characters of English text on average (about three-quarters of a word); the ratio varies by language
- Spaces, punctuation, and formatting characters also consume tokens
- The same word may be tokenized differently depending on capitalization, spacing, or surrounding context (illustrated in the sketch after this list)
- Special characters, emojis, and non-Latin scripts often require more tokens per character
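A short sketch of these effects, again assuming tiktoken's `cl100k_base` encoding; exact token counts differ between tokenizers, so treat the output as indicative only.

```python
# Sketch: how case, leading spaces, emoji, and non-Latin scripts affect token counts.
# Assumes tiktoken's "cl100k_base" encoding; exact counts vary by tokenizer.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = ["hello", "Hello", " hello", "HELLO", "👋🌍", "こんにちは"]
for s in samples:
    ids = enc.encode(s)
    print(f"{s!r:>14} -> {len(ids)} token(s): {ids}")
# The same letters can map to different IDs depending on case and leading whitespace,
# and emoji or non-Latin text typically costs more tokens per visible character.
```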
Technical Implementation
- Tokenization happens before the text reaches the neural network
- Tokens are converted to vector embeddings for model processing (see the sketch after this list)
- The scale of a model's training data is often described by the number of tokens it was trained on; model size itself is usually measured in parameters
- Token limits are a key constraint in model design and usage
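A sketch of the ID-to-embedding lookup, using PyTorch's nn.Embedding as a stand-in for a model's input layer; the vocabulary size, embedding dimension, and token IDs are made-up values for illustration.

```python
# Sketch: token IDs are turned into vectors by an embedding lookup before the
# transformer layers see them. Vocabulary size (100_000), embedding dimension (768),
# and the token IDs below are made-up illustrative values.
import torch
import torch.nn as nn

vocab_size, embed_dim = 100_000, 768
embedding = nn.Embedding(vocab_size, embed_dim)  # learnable lookup table: ID -> vector

token_ids = torch.tensor([[15496, 2159, 13]])    # one hypothetical sequence of 3 token IDs
vectors = embedding(token_ids)                   # shape: (1, 3, 768)
print(vectors.shape)                             # these vectors are what the network processes
```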
Practical Considerations
- Token usage directly impacts computational cost and processing time: more tokens mean more expensive, slower computation
- Context windows have fixed token limits (e.g., 8K, 32K, 128K tokens)
- Effective prompt engineering involves efficient token usage
- Both input and output consume tokens from the available context window (see the budgeting sketch after this list)
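The sketch below shows one way to budget a prompt against a fixed context window by counting tokens up front; the 8,192-token window and the 1,024 tokens reserved for the reply are assumed values, not the limits of any specific model.

```python
# Sketch: verify a prompt leaves room for the reply inside a fixed context window.
# The 8,192-token window and 1,024-token output reservation are assumed values.
import tiktoken

CONTEXT_WINDOW = 8_192    # total token limit of a hypothetical model
RESERVED_OUTPUT = 1_024   # tokens set aside for the model's reply

enc = tiktoken.get_encoding("cl100k_base")

def fits_in_context(prompt: str) -> bool:
    """Return True if the prompt plus the reserved output budget fits the window."""
    return len(enc.encode(prompt)) + RESERVED_OUTPUT <= CONTEXT_WINDOW

print(fits_in_context("Summarize the following report: ..."))  # True for short prompts
```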
Connections
- Related Concepts: LLM Context Window (the space where tokens exist and are processed), Embeddings (vector representations of tokens)
- Broader Context: NLP Preprocessing (tokenization as part of the text processing pipeline)
- Applications: Token-Efficient Prompting (techniques to minimize token usage), Prompt Engineering (crafting inputs with token constraints in mind)
- Components: BPE Tokenization (a common tokenization algorithm), Vocabulary Size (impact on model capabilities)
References
- Tiktoken documentation (OpenAI's tokenizer)
- "Attention Is All You Need" paper (introducing transformer models that process tokens)
- OpenAI API documentation on token counting and limits
#LLM #tokens #tokenization #NLP #vocabulary