Behind the Stack, Ep 5 - Making RAG Work for Multimodal Documents

June 24, 2025
Jamie Dborin

Introduction

Most retrieval-augmented generation (RAG) systems assume that documents are clean, structured, and text-based. But in enterprise environments, the reality is different. Documents often contain:

  • Tables with nested headers, merged cells, or embedded footnotes
  • Charts and images that convey critical insights
  • Layout-heavy formats like invoices, reports, or scanned documents

When such content passes through standard RAG pipelines, the results are often poor - irrelevant retrieval and hallucinated outputs during generation.

This post explores practical strategies to enable accurate retrieval and grounded generation from messy, multimodal documents. We focus on two key stages:

  1. Retrieval – How to index and surface relevant content that isn’t just plain text
  2. Generation – How to present structured or visual content to an LLM for high-quality answers

We’ll cover proven architectures, model recommendations, and implementation details used in real-world production systems.

Why Tables and Images Break Traditional RAG

A typical RAG pipeline looks like this:

  1. Split the document into text chunks (paragraphs, sections)
  2. Serialize as plaintext or Markdown
  3. Embed with a text model (e.g., text-embedding-3, BGE, Cohere)
  4. Store embeddings in a vector database (e.g., FAISS, Qdrant)
  5. Retrieve top-k relevant chunks
  6. Inject them into an LLM for answer generation
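
For reference, here is a minimal sketch of that baseline, text-only pipeline, assuming sentence-transformers with a BGE model and FAISS as the vector store; the chunks, the query, and the final LLM call are placeholders, and any of the embedding models or vector databases above would slot in the same way:

```python
# Minimal text-only RAG baseline: chunk -> embed -> index -> retrieve -> prompt.
# Assumes `pip install sentence-transformers faiss-cpu`; the LLM call is left as a placeholder.
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

chunks = [
    "Q1 revenue grew 11% year over year.",
    "APAC expansion drove most of the growth.",
    "See Table 3 for the regional breakdown.",  # table content is lost if it was flattened upstream
]

embedder = SentenceTransformer("BAAI/bge-large-en-v1.5")
embeddings = embedder.encode(chunks, normalize_embeddings=True)

index = faiss.IndexFlatIP(embeddings.shape[1])  # inner product == cosine on normalized vectors
index.add(np.asarray(embeddings, dtype="float32"))

query = "How did APAC perform in Q1?"
query_vec = embedder.encode([query], normalize_embeddings=True)
_, top_k = index.search(np.asarray(query_vec, dtype="float32"), 2)

context = "\n".join(chunks[i] for i in top_k[0])
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
# `prompt` would now be sent to your LLM of choice for answer generation.
```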

This pipeline breaks down when documents include:

  • Tables: Markdown flattens structure. Semantics like merged headers or totals are lost.
  • Images: Often dropped or processed via low-quality OCR, losing charts and diagrams.
  • Scanned Layouts: Visual hierarchy and multi-column formats disappear during text flattening.

The result: Important context is invisible to retrieval and misrepresented during generation.

Strategy 1: Full-Document Multimodal Embedding

Skip text serialization entirely. Instead, render the entire document as images and embed using a multimodal model that understands layout, tables, and visual context.

Pipeline:

  1. Render each document page to an image (e.g., with Poppler or PDFPlumber)

  2. Pass each page image to a multimodal model:

    • Open source: CoLLaMA, OpenCLIP, SigLIP, BLIP-2
    • Commercial: Jina, XDoc, LayoutXLM (token-based)

  3. Store embeddings in your vector index
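
As a rough sketch of this strategy, the following assumes pdf2image (a Poppler wrapper) for page rendering and SigLIP via Hugging Face transformers for page-level embeddings; the PDF path is a placeholder, and a purpose-built document retrieval model would follow the same shape:

```python
# Strategy 1 sketch: render each PDF page to an image, embed it with a multimodal model.
# Assumes `pip install pdf2image transformers torch pillow` and Poppler installed locally.
import torch
from pdf2image import convert_from_path
from transformers import AutoModel, AutoProcessor

model_id = "google/siglip-base-patch16-224"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

pages = convert_from_path("quarterly_report.pdf", dpi=150)  # one PIL image per page

page_vectors = []
with torch.no_grad():
    for page in pages:
        inputs = processor(images=page, return_tensors="pt")
        features = model.get_image_features(**inputs)  # embedding of the full page render
        page_vectors.append(torch.nn.functional.normalize(features, dim=-1))

# page_vectors can now be written to the same vector index used for text chunks,
# tagged with (document_id, page_number) for retrieval.
```

At query time, the text query is embedded with the same model's text tower (get_text_features) so queries and page images share one embedding space.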

Why It Works:

  • Multimodal models trained on image+text pairs preserve layout, structure, and non-textual info
  • No need for lossy flattening or serialization heuristics

Caveats:

  • Fewer production-grade models are available
  • Retrofitting may require reindexing
  • Higher compute and memory costs

Strategy 2: Visual Summarization for Text-Based Stacks

If you’re sticking with text-only embeddings, you can still integrate visual content using descriptive captions.

Pipeline:

  1. Segment the document into layout blocks using:

    • AWS Textract
    • Azure Document Intelligence
    • LayoutParser + Tesseract (open source)

  2. For each table or image block, run OCR and generate a text summary using a multimodal LLM:

    • Hosted: GPT-4 Vision, Claude 3 Opus, Gemini 1.5 Pro
    • Open source: LLaVA 1.5, MiniGPT-4, Kosmos-2

  3. Prompt example:
    "Summarize this table, including headers, key values, and trends."

  4. Append the summaries to the surrounding text and embed the result using your existing model (e.g., text-embedding-3, bge-large-en)

Example chunk:

"Table Summary: Q1 revenue by region shows APAC growth of 23%, EMEA flat, and a YoY increase in total revenue of 11%. See Figure 4."

This lets visual information participate in search relevance scoring - without changing your pipeline.
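
A minimal sketch of steps 2-4, assuming an OpenAI-style hosted vision model; the model name and image path are placeholders, and any of the hosted or open-source summarizers listed above could stand in:

```python
# Strategy 2 sketch: caption a table/figure crop with a hosted vision LLM, then
# append the caption to the surrounding text chunk before embedding.
# Assumes `pip install openai` and OPENAI_API_KEY set; model name and file paths
# are placeholders for whichever vision-capable model you use.
import base64
from openai import OpenAI

client = OpenAI()

def summarize_block(image_path: str) -> str:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable chat model works here
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Summarize this table, including headers, key values, and trends."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

surrounding_text = "Revenue performance for Q1 is discussed below."
table_summary = summarize_block("blocks/page3_table1.png")
chunk = f"{surrounding_text}\nTable Summary: {table_summary}"
# `chunk` is then embedded with your existing text embedding model, unchanged.
```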

Pros:

  • Easy integration
  • Compatible with any retriever or vector DB
  • Enables semantic search over image/table content

Cons:

  • Summaries may lose granularity
  • Requires preprocessing for all visual elements

Generation Over Structured and Visual Content

Once you retrieve relevant chunks, you need models that can reason over visual and structured inputs.

Text-only LLMs often fail due to:

  • Flattened or malformed formatting (especially tables)
  • Missing image content
  • Lost spatial or layout context

Solution:

Use vision-capable LLMs during generation. These models accept image inputs and can reason over full document renders.

Supported Models:

  • GPT-4 Turbo (Vision)
  • Claude 3.5 Sonnet
  • Gemini 1.5 Pro
  • LLaMA 3.2 Vision (11B/90B, self-hosted)
  • Yi-VL, Qwen-VL

Benefits:

  • No need for text serialization
  • Direct reasoning over charts, tables, and layout
  • Accurate answers to layout-grounded queries like “What does this figure show?”

Integration Tip:

Pass full document context (e.g., a PDF page render or HTML snapshot) into your generation model. Cross-modal attention models (like LLaMA 3.2 Vision) handle this better than concatenation-based ones.
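
As an illustration, here is a sketch of this generation step, assuming the vision model is served behind an OpenAI-compatible endpoint; the base URL, model name, and page-render paths are placeholders:

```python
# Generation sketch: answer a question directly over retrieved page images with a
# vision-capable LLM. Assumes an OpenAI-compatible chat endpoint; base_url, model
# name, and image paths are placeholders.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def as_data_url(path: str) -> str:
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()

question = "What does Figure 4 show about APAC revenue?"
retrieved_pages = ["renders/doc12_page3.png", "renders/doc12_page4.png"]

# Build one multimodal message: the question plus the retrieved page renders.
content = [{"type": "text", "text": question}]
content += [{"type": "image_url", "image_url": {"url": as_data_url(p)}} for p in retrieved_pages]

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-90B-Vision-Instruct",  # placeholder for any vision LLM
    messages=[{"role": "user", "content": content}],
)
print(response.choices[0].message.content)
```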

Toolchain Summary

  • Document Segmentation - AWS Textract, Azure Document Intelligence, LayoutParser
  • OCR - Tesseract, EasyOCR
  • Multimodal Summarization - GPT-4V, Claude 3, Gemini, LLaVA, MiniGPT-4
  • Text Embedding - text-embedding-3, bge-large-en, Cohere Embed
  • Multimodal Embedding - CoLLaMA, SigLIP, XDoc, OpenCLIP
  • Vision Generation - GPT-4V, Claude 3.5, LLaMA 3.2 Vision, Yi-VL

Conclusion

Traditional RAG fails when applied to structured and visual content. Tables, figures, and layout matter, especially in high-stakes domains like finance, law, and enterprise analytics.

To close the gap:

  • Use multimodal embedding for full-fidelity search
  • Or adopt visual summarization to extend existing pipelines
  • Use vision-capable models at generation time for grounded, reliable outputs

Multimodal RAG isn’t theoretical - it’s operational. With the right tools and strategies, you can bring structure, charts, and layout into your RAG workflows - and generate responses grounded in the full reality of enterprise documents.
