The foundational neural network design powering modern language models
Core Idea: The Transformer is a neural network architecture built on self-attention, which lets it process entire sequences in parallel and capture complex long-range dependencies in language without recurrence or convolution.
Key Elements
Core Components
- Self-Attention Mechanism: Lets each token "attend" to every other token (see the sketch after this list)
- Computes pairwise relationships between all tokens in a sequence
- Creates rich contextual representations
- Projects tokens into queries, keys, and values and produces attention-weighted aggregations of the values
- Multi-Head Attention: Parallel attention computations with different projections
- Captures different relationship types simultaneously
- Attends to information from different representation subspaces
- Feed-Forward Networks: Position-wise fully connected layers (sketched, together with the add-and-norm step, after this list)
- Processes each token's representation independently
- Typically implemented as two linear transformations with a ReLU activation (later models often use GELU)
- Residual Connections: "Add & Norm" (residual addition followed by layer normalization) around every sub-layer
- Facilitates training of deep networks
- Preserves information flow through the network
- Positional Encoding: Injects sequence order information (a sinusoidal example follows this list)
- Necessary because attention has no inherent notion of position
- Typically uses sinusoidal functions or learned embeddings
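To make the self-attention and multi-head bullets concrete, here is a minimal NumPy sketch of scaled dot-product and multi-head attention; the toy sizes (4 tokens, d_model = 8, 2 heads) and all variable names are illustrative choices, not values from the original paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)   # (..., n, n) token-to-token scores
    if mask is not None:
        scores = np.where(mask, scores, -1e9)        # masked positions get ~zero weight
    weights = softmax(scores)                        # each row sums to 1
    return weights @ V                               # attention-weighted sum of values

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads):
    """Split d_model into n_heads subspaces, attend in each, then recombine."""
    n, d_model = x.shape
    d_head = d_model // n_heads
    def split(t):   # (n, d_model) -> (n_heads, n, d_head)
        return t.reshape(n, n_heads, d_head).transpose(1, 0, 2)
    Q, K, V = split(x @ Wq), split(x @ Wk), split(x @ Wv)
    heads = scaled_dot_product_attention(Q, K, V)    # (n_heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ Wo

# Toy example: 4 tokens, model dimension 8, 2 heads (illustrative sizes).
rng = np.random.default_rng(0)
n, d_model, n_heads = 4, 8, 2
x = rng.normal(size=(n, d_model))                    # token representations
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) for _ in range(4))
print(multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads).shape)   # (4, 8)
```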
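The feed-forward sub-layer and the surrounding "Add & Norm" step can be sketched the same way; the post-norm ordering follows the original paper, while the dimensions (d_ff = 32) are illustrative assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def position_wise_ffn(x, W1, b1, W2, b2):
    """Two linear maps with a ReLU in between, applied to every position independently."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

# "Add & Norm": residual connection around the sub-layer, then layer normalization.
rng = np.random.default_rng(1)
n, d_model, d_ff = 4, 8, 32                       # illustrative sizes
x = rng.normal(size=(n, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
y = layer_norm(x + position_wise_ffn(x, W1, b1, W2, b2))
print(y.shape)                                    # (4, 8): same shape in, same shape out
```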
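A compact version of the sinusoidal positional encoding from the original paper, with illustrative sequence length and dimension:

```python
import numpy as np

def sinusoidal_positional_encoding(n_positions, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same angle)."""
    positions = np.arange(n_positions)[:, None]                  # (n, 1)
    dims = np.arange(0, d_model, 2)[None, :]                     # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Added to the token embeddings so attention can distinguish positions.
pe = sinusoidal_positional_encoding(n_positions=16, d_model=8)   # illustrative sizes
print(pe.shape)   # (16, 8)
```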
Architectural Variants
- Encoder-Only: Used for understanding tasks (BERT, RoBERTa)
- Bidirectional attention (tokens attend to all positions)
- Optimized for classification and representation learning
- Decoder-Only: Used for generative tasks (GPT family)
- Autoregressive (causal) attention: tokens attend only to previous positions (masking sketched after this list)
- Optimized for text generation and completion
- Encoder-Decoder: Used for sequence-to-sequence tasks (T5, BART)
- Encoder creates representations, decoder generates output
- Optimized for translation, summarization, and other transformation tasks
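The practical difference between encoder-style and decoder-style attention is the mask applied to the score matrix. A minimal sketch, using the convention that True marks positions a token is allowed to attend to:

```python
import numpy as np

n = 5                                                # illustrative sequence length

# Encoder-only (bidirectional): every token may attend to every position.
bidirectional_mask = np.ones((n, n), dtype=bool)

# Decoder-only (autoregressive): token i may attend only to positions <= i.
causal_mask = np.tril(np.ones((n, n), dtype=bool))

print(causal_mask.astype(int))
# [[1 0 0 0 0]
#  [1 1 0 0 0]
#  [1 1 1 0 0]
#  [1 1 1 1 0]
#  [1 1 1 1 1]]
```

In an encoder-decoder model, the decoder additionally cross-attends to the encoder's output without a causal restriction.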
Computational Characteristics
- Parallelization: Processes entire sequences simultaneously
- Major advantage over RNNs, which must process tokens sequentially
- Enables efficient training on modern hardware
- Quadratic Complexity: Attention computation scales with sequence length squared
- Each token attends to all other tokens: an O(n²) operation in time and memory
- Major bottleneck for very long documents (a back-of-envelope estimate follows this list)
- Token-Based Processing: Designed to work with discrete tokens
- Relies on efficient tokenization for practical performance
- Sequence length directly impacts computational requirements
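A back-of-envelope calculation makes the quadratic cost tangible: the attention score matrix alone holds n × n entries per head per layer. The sequence lengths and fp16 storage (2 bytes per score) below are illustrative assumptions.

```python
# Memory for a single head's attention score matrix at various sequence lengths.
for n in (1_000, 10_000, 100_000):
    entries = n * n                       # one score per (query, key) pair
    mib = entries * 2 / 2**20             # fp16: 2 bytes per entry
    print(f"n = {n:>7,}: {entries:>15,} scores  ~ {mib:>10,.1f} MiB")
```

Ten times the sequence length costs roughly a hundred times the attention memory.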
Evolution and Optimizations
- Efficient Attention Variants: Address quadratic scaling issue
- Sparse attention patterns (Longformer, BigBird); a local-window mask is sketched after this list
- Linear attention approximations (Performer, Linear Transformer)
- Segment-level recurrence over cached context (Transformer-XL)
- Parameter Sharing: Techniques to reduce model size
- ALBERT: Cross-layer parameter sharing
- Universal Transformers: Recurrent application of same layers
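As one concrete instance of the sparse-attention idea, a banded (local-window) mask limits each token to a fixed neighborhood, cutting the cost from O(n²) to O(n·w); the window size below is an illustrative choice, not taken from any particular model.

```python
import numpy as np

def local_window_mask(n, window):
    """True where |i - j| <= window, i.e. token i may attend only to nearby tokens."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= window

# Each row has at most 2 * window + 1 allowed positions (illustrative sizes).
print(local_window_mask(n=8, window=2).astype(int))
```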
Additional Connections
- Broader Context: Deep Learning Architectures (evolution of neural networks)
- Applications: Large Language Models (primary application area)
- See Also: Tokenization in Language Models (preprocessing for Transformers), API Pricing Models (cost implications of architecture choices)
References
- Vaswani, A., et al. (2017). Attention Is All You Need. NeurIPS.
- Devlin, J., et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL.
#transformer #deep-learning #neural-networks #attention-mechanism #language-models