SmolDocling

A small-parameter document understanding model from Hugging Face and IBM

Core Idea: SmolDocling is a 256 million parameter vision-language model designed for document understanding, OCR, and conversion tasks that can run on GPUs with limited VRAM.

Key Elements

Architecture: Built on the SmolVLM architecture, combining a SigLIP vision encoder (93M parameters) and a Smol LM model (135M parameters) with projection layers
Size Advantage: At only 256M parameters, it requires significantly less computational resources than larger document processing models
Capabilities:
- Text recognition (OCR)
- Document structure recognition
- Code block identification
- Formula recognition
- Table extraction
- Chart identification
- Location tagging of document elements
Output Format: Produces structured output in DocTags format, providing both text content and positional information about document elements
Performance Claims: According to developers, outperforms competing models of similar size by up to 27× (though not compared against larger state-of-the-art models like olmOCR or Mistral OCR)

Use Cases

Processing documents on hardware with limited GPU resources
Document conversion pipelines
Specialized document processing after fine-tuning
Extracting structured information from documents for further processing

Implementation

Available on Hugging Face
Can be run using the Transformers library or vLLM for faster batch inference
Most valuable when fine-tuned for specific document processing tasks

Limitations

Not competitive with larger state-of-the-art OCR models for general OCR tasks
May require fine-tuning for optimal performance on specific document types
Still requires a GPU for practical usage despite small size
Demo examples show inconsistent handling of complex document structures

Connections

Related Concepts: Document Understanding Models (SmolDocling is an implementation), Hugging Face Smol Models (part of this family), DocTags Format (output structure used)
Broader Context: Vision-Language Models (a specialized application), OCR Technology (a key capability)
Applications: Document Conversion Pipelines (practical implementation context)
Similar Models: olmOCR (larger alternative), Mistral OCR (larger alternative)

References

Hugging Face SmolDocling repository
SmolDocling research paper
Hugging Face blog post on Smol VLMs

#DocumentAI #OCR #SmolModels #HuggingFace #VisionLanguageModels

Connections:

Sources:

From: Sam Witteveen - SmolDocling ¿la solución SmolOCR