Neural network systems that transform data into numerical vectors

Core Idea: Embedding models convert complex data (text, images, etc.) into numerical vector representations that capture semantic meaning, enabling machines to process, compare, and understand relationships between different pieces of information.
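
For intuition, here is a minimal sketch of this idea using the off-the-shelf sentence-transformers library (the model name "all-MiniLM-L6-v2" is an illustrative choice, not prescribed by this note):

# Sketch: embed two sentences and compare them (assumes `pip install sentence-transformers`)
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # any small sentence encoder works here
vectors = model.encode(["The cat sat on the mat", "A feline rested on the rug"])
# Semantically similar sentences map to nearby vectors
print(util.cos_sim(vectors[0], vectors[1]))  # high similarity despite little word overlap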

Key Elements

Definition and Purpose

An embedding model is a learned function that maps discrete or high-dimensional inputs (words, sentences, images) to dense vectors in a continuous space, arranged so that semantically related inputs sit close together. Similarity then becomes a geometric operation, such as cosine similarity, rather than a symbolic comparison.

Implementation Approaches

  1. Neural Network-Based (most common)

    • Trained through supervised or self-supervised learning
    • Learns representations through context prediction tasks
    • Can capture complex relationships and hierarchical structures
    • Examples: Word2Vec, BERT, Sentence Transformers
  2. Traditional Methods

    • Count-based or matrix-factorization techniques (e.g., bag-of-words, TF-IDF, LSA/SVD)
    • Cheaper to compute, but capture less context than learned representations
    • See the sketch after this list
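
As a contrast to the neural approach, here is a minimal sketch of a traditional count-based embedding using scikit-learn's TfidfVectorizer (the three-sentence corpus is invented for illustration):

# Sketch: TF-IDF vectors, a classic non-neural embedding (assumes scikit-learn is installed)
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the sky is blue", "the grass is green", "the sun is bright"]
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(corpus)   # sparse matrix: one row per document
print(doc_vectors.shape)                         # (3, vocabulary size)
print(vectorizer.get_feature_names_out())        # which column corresponds to which word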

Technical Implementation (Neural Network Example)

import torch
import torch.nn as nn

class SimpleEmbeddingModel(nn.Module):
    # vocab_size is the number of different words, embedding_dim is the length of the output vector
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        # nn.Embedding is a lookup table mapping each word index to a learnable vector
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        # The linear layer maps an embedding to one score per vocabulary word, used to predict an associated word
        self.linear = nn.Linear(embedding_dim, vocab_size)
    
    def forward(self, x):
        # x holds word indices; look up the embedding vector for each one
        embedded = self.embeddings(x)
        return self.linear(embedded)
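
A quick usage sketch (the toy vocabulary and the word_to_idx / idx_to_word mappings are illustrative assumptions; the training loop below relies on mappings like these):

# Sketch: build a toy vocabulary and run one forward pass
vocab = ["sky", "grass", "sun", "is", "blue", "green", "bright"]
word_to_idx = {word: i for i, word in enumerate(vocab)}
idx_to_word = {i: word for word, i in word_to_idx.items()}

model = SimpleEmbeddingModel(vocab_size=len(vocab), embedding_dim=16)
scores = model(torch.tensor([word_to_idx["sky"]]))
print(scores.shape)  # torch.Size([1, 7]): one score per vocabulary word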

Training Process

# This training loop predicts the attribute from the subject in 'X is Y' sentences
def train_model(sentences, model, epochs=100):
    # CrossEntropyLoss because it's a classification problem
    criterion = nn.CrossEntropyLoss()
    # The Adam optimizer updates the weights using the computed gradients
    optimizer = torch.optim.Adam(model.parameters())
    
    for epoch in range(epochs):
        for subj, _, attr in sentences:
            # Look up the vocabulary index for each word
            subj_idx = word_to_idx[subj]
            attr_idx = word_to_idx[attr]
            
            # Create a tensor from the index
            subj_tensor = torch.tensor([subj_idx])
            # Run it through the model and get a prediction
            output = model(subj_tensor)
            # Calculate the loss (how wrong the prediction was)
            loss = criterion(output, torch.tensor([attr_idx]))
            # Reset gradients (PyTorch accumulates them by default, so clear before each backward pass)
            optimizer.zero_grad()
            # Backpropagate to compute the gradient of the loss (fills each parameter's .grad)
            loss.backward()
            # Update the model based on the gradients
            optimizer.step()
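
A sketch of how this loop might be driven (the 'X is Y' triples are made-up training data; word_to_idx and idx_to_word come from the vocabulary sketch above):

# Sketch: train on a few "subject is attribute" triples
sentences = [
    ("sky", "is", "blue"),
    ("grass", "is", "green"),
    ("sun", "is", "bright"),
]
train_model(sentences, model, epochs=100)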

Practical Applications

Embedding vectors power semantic search, clustering, recommendation, deduplication, and retrieval-augmented generation; in each case the task reduces to nearest-neighbour lookups in the vector space.

Working with Embeddings (Code Example)

# Get embeddings for specific words
def get_word_embedding(model, word, word_to_idx):
    # Convert word to index
    word_idx = word_to_idx[word]
    # Get embedding (detach removes it from computation graph)
    embedding = model.embeddings(torch.tensor([word_idx])).detach()
    return embedding

# Find similar words
def find_similar_words(model, word, word_to_idx, idx_to_word, top_k=5):
    # Get target word embedding
    target_embedding = get_word_embedding(model, word, word_to_idx)
    
    # Get all embeddings
    all_embeddings = model.embeddings.weight.detach()
    
    # Cosine similarity (a·b / (|a||b|)) of the target against every embedding;
    # broadcasting compares the [1, dim] target with each of the [vocab_size, dim] rows
    similarities = torch.nn.functional.cosine_similarity(
        target_embedding, 
        all_embeddings
    )
    
    # Get top k similar words
    top_indices = similarities.argsort(descending=True)[:top_k]
    return [(idx_to_word[idx.item()], similarities[idx].item()) 
            for idx in top_indices]
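
After training, a similarity query might look like this (the scores depend on random initialization and the toy data, so treat the output as illustrative):

# Sketch: nearest neighbours of "sky" in the learned embedding space
for word, score in find_similar_words(model, "sky", word_to_idx, idx_to_word, top_k=3):
    print(f"{word}: {score:.3f}")
# The query word itself usually ranks first with similarity 1.0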

Additional Connections

  • 🪆 Introduction to Matryoshka Embedding Models
  • Binary vector embeddings are so cool

References

  1. Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space.
  2. Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.
  3. Code examples adapted from PyTorch documentation

#embedding #vector-space #neural-networks #information-retrieval #knowledge-representation

Terms used in code: Backpropagation, Optimizer, Neural Network Loss, Neural Network Gradient

