DocTags Format

A structured format for representing extracted document elements with positional information

Core Idea: DocTags is a markup format used by document understanding models like SmolDocling to represent both the content and the structural information of processed documents, including element types and spatial positioning.

Key Elements

Format Structure:
- HTML-like hierarchical structure
- Element type identification
- Positional coordinates for elements
- Nested representation of document hierarchy
- Content extraction within tags

Element Types:
- Text blocks
- Headings
- Lists and list items
- Tables
- Images
- Code blocks
- Formulas
- Charts
- Other specialized document elements

Positional Information:
- Bounding box coordinates
- Page numbers (for multi-page documents)
- Relative positioning data
- Layout flow indicators

Example Structure:

<doc>
  <text loc="[x1, y1, x2, y2]">Extracted text content</text>
  <list>
    <item loc="[x1, y1, x2, y2]">List item 1</item>
    <item loc="[x1, y1, x2, y2]">List item 2</item>
  </list>
  <code loc="[x1, y1, x2, y2]">
def example():
    return "code block"
  </code>
</doc>
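
A minimal sketch of how such markup could be read back into typed elements with bounding boxes. It assumes the simplified loc="[x1, y1, x2, y2]" attribute syntax of the example above and well-formed XML; real model output may encode locations differently. The Element class, the parse_doctags helper, and the coordinate values are hypothetical names and data introduced only for illustration.

```python
import json
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Hypothetical sample in the simplified attribute syntax used in this note;
# the coordinate values are made up for illustration.
SAMPLE = """
<doc>
  <text loc="[10, 20, 200, 40]">Extracted text content</text>
  <list>
    <item loc="[10, 50, 200, 60]">List item 1</item>
    <item loc="[10, 65, 200, 75]">List item 2</item>
  </list>
</doc>
"""


@dataclass
class Element:
    kind: str     # tag name, e.g. "text" or "item"
    bbox: tuple   # (x1, y1, x2, y2)
    content: str  # text found directly inside the tag


def parse_doctags(markup: str) -> list[Element]:
    """Collect every element that carries a loc attribute, in document order."""
    root = ET.fromstring(markup)
    elements = []
    for node in root.iter():
        loc = node.get("loc")
        if loc is None:
            continue  # containers such as <doc> and <list> carry no box here
        x1, y1, x2, y2 = json.loads(loc)  # the attribute value is a JSON-style list
        elements.append(Element(node.tag, (x1, y1, x2, y2), (node.text or "").strip()))
    return elements


elements = parse_doctags(SAMPLE)
for el in elements:
    print(el.kind, el.bbox, el.content)
```

Container tags are skipped here because, in this simplified syntax, only leaf elements carry a box. Content containing a raw "<" (for example inside a code element) would need escaping before an XML parser accepts it.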

Applications

- Intermediate representation in document processing pipelines
- Input format for downstream document analysis tasks
- Preserving document structure for accurate recreation
- Enabling spatial reasoning about document elements
- Supporting document element retrieval by type or location
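
Retrieval by type or location can be sketched with two small filters over already-parsed elements. This is an illustration only: the (kind, bbox, content) tuple layout, the by_type and in_region helpers, and the sample values are all assumptions, as is the convention that the origin sits at the top-left with y increasing downward.

```python
# Each element is a (kind, bbox, content) tuple; bbox is (x1, y1, x2, y2).
# The values below are hypothetical.
ELEMENTS = [
    ("heading", (10, 10, 300, 30), "Introduction"),
    ("text", (10, 40, 300, 120), "First paragraph"),
    ("table", (10, 400, 300, 600), "Results table"),
]


def by_type(elements, kind):
    """Return the elements whose type matches kind."""
    return [e for e in elements if e[0] == kind]


def in_region(elements, region):
    """Return the elements whose box lies entirely inside region."""
    rx1, ry1, rx2, ry2 = region
    return [
        e for e in elements
        if e[1][0] >= rx1 and e[1][1] >= ry1 and e[1][2] <= rx2 and e[1][3] <= ry2
    ]


tables = by_type(ELEMENTS, "table")
top_of_page = in_region(ELEMENTS, (0, 0, 320, 200))
print(tables)
print(top_of_page)
```

Full containment is the strictest choice; a pipeline that wants partially overlapping elements as well would test for box intersection instead.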

Advantages

- Combines content and structure in a single representation
- Preserves spatial relationships between elements
- Enables selective processing of specific element types
- Supports reconstruction of visual layout
- Facilitates document navigation and search

Further Processing

- Conversion to other formats (Markdown, HTML, etc.)
- Extraction of specific element types for specialized processing
- Integration with LLMs for content analysis
- Database storage of structured document information
- Visualization of document structure
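
As one example of conversion to another format, the sketch below turns the simplified markup into Markdown. It assumes the attribute-style syntax from the example structure and well-formed XML. The heading tag name and the to_markdown helper are hypothetical, and only headings, text, and flat lists are handled.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; "heading" is an assumed tag name and the
# coordinates are made up for illustration.
SAMPLE = """
<doc>
  <heading loc="[10, 10, 300, 30]">Overview</heading>
  <text loc="[10, 40, 300, 60]">Extracted text content</text>
  <list>
    <item loc="[10, 70, 300, 80]">List item 1</item>
    <item loc="[10, 85, 300, 95]">List item 2</item>
  </list>
</doc>
"""


def to_markdown(markup: str) -> str:
    """Render top-level elements as Markdown blocks, dropping positional data."""
    root = ET.fromstring(markup)
    blocks = []
    for node in root:
        text = (node.text or "").strip()
        if node.tag == "heading":
            blocks.append(f"# {text}")
        elif node.tag == "list":
            items = [f"- {(item.text or '').strip()}" for item in node.findall("item")]
            blocks.append("\n".join(items))
        else:
            blocks.append(text)  # fall back to plain paragraph text
    return "\n\n".join(blocks)


markdown = to_markdown(SAMPLE)
print(markdown)
```

The conversion is lossy by design: bounding boxes are discarded, so the result suits text-oriented consumers rather than layout reconstruction.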

Connections

Related Concepts: Document Understanding Models, Structured Document Representation
Implementation Context: SmolDocling (uses this format), Document Conversion Pipelines
Similar Formats: ALTO XML, hOCR (other document markup formats)
Applications: Document Layout Analysis, Content Extraction Systems

References

- SmolDocling paper and documentation
- DocTags format specification
- Document understanding literature

#DocumentMarkup #StructuredData #DocumentAI #InformationExtraction #DocumentRepresentation

Sources:

From: Sam Witteveen - SmolDocling: The SmolOCR Solution?