Models, most commonly neural networks, that transform data into numerical vectors
Core Idea: Embedding models convert complex data (text, images, etc.) into numerical vector representations that capture semantic meaning, enabling machines to process, compare, and understand relationships between different pieces of information.
Key Elements
Definition and Purpose
- Transforms input data into fixed-size numerical vectors (embeddings)
- Preserves semantic relationships between items in the vector space
- Enables similarity calculations, clustering, and other mathematical operations (see the sketch after this list)
- Creates the foundation for search, recommendations, and information retrieval
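As a minimal illustration of the kind of similarity calculation embeddings enable, the sketch below compares hypothetical word vectors with cosine similarity. The vectors and their values are invented for demonstration; real embeddings come from a trained model.

```python
import torch
import torch.nn.functional as F

# Hypothetical 4-dimensional embeddings; real models use hundreds of dimensions
cat = torch.tensor([[0.8, 0.1, 0.3, 0.5]])
dog = torch.tensor([[0.7, 0.2, 0.4, 0.4]])
car = torch.tensor([[0.1, 0.9, 0.8, 0.0]])

# Cosine similarity ranges from -1 to 1; semantically related items score higher
print(F.cosine_similarity(cat, dog))  # relatively high (~0.98 for these made-up vectors)
print(F.cosine_similarity(cat, car))  # relatively low (~0.34 for these made-up vectors)
```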
Implementation Approaches
- Neural Network-Based (most common)
  - Trained through supervised or self-supervised learning
  - Learns representations through context prediction tasks
  - Can capture complex relationships and hierarchical structures
  - Examples: Word2Vec, BERT, Sentence Transformers
- Traditional Methods
  - Term Frequency-Inverse Document Frequency (TF-IDF): Creates document vectors whose dimensions weight term importance
  - Latent Semantic Analysis (LSA): Applies matrix decomposition for dimensionality reduction
  - Random projections: Multiplies by a random matrix (justified by the Johnson-Lindenstrauss lemma); see the sketch after this list
  - Count-based methods: Approaches built on co-occurrence statistics, such as GloVe (Global Vectors for Word Representation)
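A rough sketch of the random-projection idea: multiplying high-dimensional vectors by a suitably scaled random matrix approximately preserves pairwise distances, which is what the Johnson-Lindenstrauss lemma guarantees. The dimensions and data below are invented for illustration.

```python
import torch

torch.manual_seed(0)

# 1,000 hypothetical documents represented as 10,000-dimensional vectors
high_dim = torch.randn(1000, 10000)

# Random projection matrix, scaled by 1/sqrt(target_dim) so distances are roughly preserved
target_dim = 256
projection = torch.randn(10000, target_dim) / target_dim ** 0.5

# Project down to 256 dimensions
low_dim = high_dim @ projection

# Pairwise distances before and after projection should be of similar magnitude
print(torch.dist(high_dim[0], high_dim[1]))
print(torch.dist(low_dim[0], low_dim[1]))
```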
Technical Implementation (Neural Network Example)
```python
import torch
import torch.nn as nn

class SimpleEmbeddingModel(nn.Module):
    # vocab_size is the number of distinct words; embedding_dim is the length of each embedding vector
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        # The embedding layer is a lookup table mapping each word index to a learnable vector
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        # The linear layer turns an embedding into a score for every word in the vocabulary,
        # letting the model predict an associated word
        self.linear = nn.Linear(embedding_dim, vocab_size)

    def forward(self, x):
        # x holds word indices; look up the corresponding embedding vectors
        embedded = self.embeddings(x)
        return self.linear(embedded)
```
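A quick usage sketch, continuing from the class above; the vocabulary size and embedding dimension are arbitrary choices for illustration:

```python
# Hypothetical sizes: 9 words in the vocabulary, 16-dimensional embeddings
model = SimpleEmbeddingModel(vocab_size=9, embedding_dim=16)

# A batch of two word indices -> scores over the whole vocabulary, shape (2, 9)
scores = model(torch.tensor([0, 3]))
print(scores.shape)
```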
Training Process
```python
# This training loop predicts the attribute from the subject in "X is Y" style sentences.
# It assumes `sentences` is a list of (subject, verb, attribute) tuples and that a
# word_to_idx dictionary mapping words to integer indices is already defined.
def train_model(sentences, model, epochs=100):
    # CrossEntropyLoss because predicting the attribute word is a classification problem
    criterion = nn.CrossEntropyLoss()
    # The Adam optimizer is what actually updates the weights
    optimizer = torch.optim.Adam(model.parameters())
    for epoch in range(epochs):
        for subj, _, attr in sentences:
            # Look up the index for each word
            subj_idx = word_to_idx[subj]
            attr_idx = word_to_idx[attr]
            # Create a tensor from the subject index
            subj_tensor = torch.tensor([subj_idx])
            # Run it through the model to get a prediction over the vocabulary
            output = model(subj_tensor)
            # Calculate the loss (how wrong the model was)
            loss = criterion(output, torch.tensor([attr_idx]))
            # Reset gradients (PyTorch accumulates gradients by default)
            optimizer.zero_grad()
            # Backpropagate to compute the gradient of the loss with respect to the parameters
            loss.backward()
            # Update the model's weights based on the gradients
            optimizer.step()
```
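A minimal sketch of how the pieces above might be wired together; the vocabulary, sentences, and hyperparameters here are invented for illustration:

```python
# Toy "X is Y" data: (subject, verb, attribute) triples
sentences = [
    ("cat", "is", "furry"),
    ("dog", "is", "furry"),
    ("sky", "is", "blue"),
    ("ocean", "is", "blue"),
]

# Build index mappings over every word that appears
vocab = sorted({w for s in sentences for w in s})
word_to_idx = {w: i for i, w in enumerate(vocab)}
idx_to_word = {i: w for w, i in word_to_idx.items()}

model = SimpleEmbeddingModel(vocab_size=len(vocab), embedding_dim=16)
train_model(sentences, model, epochs=100)
```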
Practical Applications
- Similarity Search: Finding related items through vector comparisons
- Semantic Search: Enhancing retrieval with meaning, not just keywords (see the sketch after this list)
- Knowledge Management: Creating connections between notes based on meaning
- Recommendation Systems: Suggesting related content through embedding proximity
- Information Retrieval: Improving search quality with semantic understanding
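For semantic search over whole sentences, a pretrained model from the Sentence Transformers library (mentioned above) is a common starting point. A minimal sketch, assuming the sentence-transformers package is installed and the "all-MiniLM-L6-v2" checkpoint is used; the documents and query are invented examples:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# A tiny document collection and a query, both embedded into the same vector space
docs = [
    "How to reset a forgotten password",
    "Best practices for backing up your data",
    "Troubleshooting a slow internet connection",
]
query = "I can't remember my login credentials"

doc_embeddings = model.encode(docs, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Rank documents by cosine similarity to the query, not by keyword overlap
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
best = scores.argmax().item()
print(docs[best], scores[best].item())
```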
Working with Embeddings (Code Example)
```python
# Get the embedding for a specific word
def get_word_embedding(model, word, word_to_idx):
    # Convert the word to its index
    word_idx = word_to_idx[word]
    # Get the embedding (detach removes it from the computation graph)
    embedding = model.embeddings(torch.tensor([word_idx])).detach()
    return embedding

# Find the words with the most similar embeddings
def find_similar_words(model, word, word_to_idx, idx_to_word, top_k=5):
    # Get the target word's embedding
    target_embedding = get_word_embedding(model, word, word_to_idx)
    # Get all embeddings from the lookup table
    all_embeddings = model.embeddings.weight.detach()
    # Calculate cosine similarity between the target and every word's embedding
    similarities = torch.nn.functional.cosine_similarity(
        target_embedding,
        all_embeddings
    )
    # Get the top k most similar words
    top_indices = similarities.argsort(descending=True)[:top_k]
    return [(idx_to_word[idx.item()], similarities[idx].item())
            for idx in top_indices]
```
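Continuing the toy setup sketched above; the results only become meaningful once the model has been trained, and will vary from run to run:

```python
# After training, words that appear in similar contexts should end up with similar vectors
for neighbor, score in find_similar_words(model, "cat", word_to_idx, idx_to_word, top_k=3):
    print(f"{neighbor}: {score:.3f}")
```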
Additional Connections
- Broader Context: Vector Space Models (theoretical foundation)
- Applications: Semantic Search (practical implementation)
- See Also: Neural Networks (underlying technology), Cosine Similarity (comparison method)
References
- Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space.
- Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.
- Code examples adapted from PyTorch documentation
#embedding #vector-space #neural-networks #information-retrieval #knowledge-representation
Terms used in code: Backpropagation, Optimizer, Neural Network Loss, Neural Network Gradient