OCR using multi-modal LLMs

Karthik

Jul 12, 2025, 8:22:07 PM
to sanskrit-programmers
My experience has been that multi-modal LLMs like Gemini produce far more accurate output from scanned PDFs of (old) books than "dumb" OCR engines (Tesseract/Google Vision).

So, I asked Gemini a pointed question: "How do multi-modal LLMs like Gemini handle PDFs of books? Do they send the file to an OCR engine like Google Vision and then operate on its output? Or do they use tools to extract individual images, tokenize the images, and then produce output from the images on their own?"

The follow-up question: "Would you say that processing PDFs using multi-modal LLMs is *always* guaranteed to produce *more* accurate output compared to using plain OCR engines (Tesseract/Google Vision)?"

While it hedges its bets, as it is wont to do because I asked for confirmation in 100% of cases, the answer must be read as a definite YES. One of the biggest reasons: LLMs have access to the entire context, while "dumb" OCR engines do not.

Final question: "If I gave an LLM a PDF of an English translation of the Ramayana and it encounters a word that reads like 'Alexander,' what would it do? Simply add it to the output stream, or try to figure out if the word makes sense in the context?"

The answer is a definite NO to "simply add it to the output stream." It WILL try to figure out whether the word makes sense in the context of the rest of the document, comparing it against the visual evidence.

You could always bind the LLM with extremely targeted prompting that tells it what to do if it encounters weird content (like having it wrap such phrases in curly brackets like so: `{{DOUBTFUL PHRASE}}`); a rough sketch of such a prompt follows below.
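
As an illustration, here is a minimal sketch of that kind of targeted prompt in Python, using the google-generativeai SDK; the model name, file name, and marker convention are just assumptions for the example, not a tested recipe:

```
# Sketch: ask the model to transcribe faithfully and to mark anything
# doubtful with {{...}} instead of silently "correcting" it.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# Upload the scanned PDF via the File API so the model can "see" the pages.
book = genai.upload_file(path="ramayana_translation.pdf")  # hypothetical file

PROMPT = """Transcribe the text of this scanned book exactly as printed.
Do NOT correct, modernise, or substitute any word.
If a word or phrase is visually unclear, or looks out of place in context,
give your best reading but wrap it in double curly brackets, e.g. {{DOUBTFUL PHRASE}}.
Never silently replace a doubtful word."""

model = genai.GenerativeModel("gemini-1.5-pro")
response = model.generate_content([book, PROMPT])
print(response.text)
```

The `{{...}}` markers can then be grepped out of the result and reviewed by hand.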

In summary, this only confirms my own experience.

The three responses follow for those who are interested.

RESPONSE 1:
```
Multi-modal LLMs like Gemini don't just pick one method; they employ a sophisticated combination of techniques to handle PDFs of books, leveraging both dedicated tools and their inherent visual understanding capabilities.

Here's a breakdown of how they likely operate:

1.  **PDF Pre-processing and Parsing:**
    *   **Initial Extraction:** The very first step is to parse the PDF document's structure. PDFs are complex containers, not just raw text files. This involves:
        *   **Extracting native text:** Many PDFs (especially digitally created ones) contain text that can be directly extracted. This is the most accurate and efficient form of text extraction.
        *   **Identifying embedded images:** Figures, diagrams, photos, and scanned pages embedded as images are identified.
        *   **Understanding page structure:** Identifying page boundaries, potential page numbers, headers, footers, and other structural elements.

2.  **OCR for Non-Native Text and Layout:**
    *   **Yes, they absolutely send the file (or individual pages/regions) to an OCR engine like Google Vision AI.** This is crucial for:
        *   **Scanned books:** If a book is a scanned image of a physical copy, the entire text content will be image-based. OCR is indispensable here to convert those images of text into machine-readable text.
        *   **Text embedded as images:** Sometimes, even in digitally created PDFs, text might be "rasterized" or embedded as an image for specific design reasons. OCR handles this.
        *   **Capturing Layout Information:** Modern OCR engines don't just extract text; they also provide bounding box coordinates for each word, line, and paragraph. This spatial information is vital for understanding the document's layout (e.g., distinguishing between a main paragraph, a sidebar, a caption, or a header). This layout information is then converted into a structured format or special tokens that the LLM can interpret.

3.  **Visual Tokenization and Understanding (The Multi-modal Part):**
    *   **Yes, they also use tools to render individual pages as images and then tokenize those images.** This is where the "multi-modal" aspect truly shines.
    *   **Page Rendering:** Each page of the PDF is rendered into a high-resolution image (similar to taking a screenshot of the entire page).
    *   **Visual Encoding:** These rendered page images are then fed into the multi-modal LLM's visual encoder. The encoder breaks down the image into smaller patches (similar to how text is broken into words/subwords) and generates visual "tokens" or embeddings for each patch.
    *   **Understanding Visual Elements:** This allows the LLM to "see" and interpret:
        *   **Overall layout:** How text flows, column structure, where images are placed relative to text.
        *   **Diagrams, charts, graphs:** It can understand the *visual information* within these elements, even if their text labels were extracted by OCR. For example, it can see the trends in a line graph or the relationships in a flowchart.
        *   **Formulas and equations:** While some can be extracted as text, complex ones are better understood visually.
        *   **Handwritten notes or annotations:** If present in a scanned document.
        *   **Font styles, bolding, italics:** Subtle visual cues that convey meaning.

4.  **Information Fusion and Contextualization:**
    *   **The core capability of a multi-modal LLM is to *fuse* these different modalities.** It combines:
        *   The extracted text (from native PDF text and OCR, along with its layout information).
        *   The visual tokens/embeddings derived from the rendered page images.
    *   This fusion allows the model to understand the **interplay** between text and visuals. For example, it can understand that "Figure 3.1" refers to the diagram directly below or next to the text, and that the text explains the elements within that diagram.

**In summary, it's a synergistic process:**

*   **OCR and Native Text Extraction** provide the precise textual content and its initial spatial coordinates.
*   **Visual Encoding** provides a holistic understanding of the page's layout, non-textual elements (like diagrams), and the overall visual presentation, including the relationship between text and images.

By combining these, a multi-modal LLM like Gemini can not only extract information from a PDF but also *understand* the context, layout, and visual elements of a complex document like a book, leading to much more accurate and insightful answers.
```
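
For anyone who wants to reproduce the "page rendering" step described in RESPONSE 1, a minimal Python sketch using PyMuPDF (fitz) looks like this; the library choice, the 300 DPI figure, and the file names are assumptions for illustration only:

```
# Render each PDF page to an image and pull any native text layer,
# mirroring the pre-processing and rendering steps described above.
import fitz  # PyMuPDF

doc = fitz.open("scanned_book.pdf")
zoom = 300 / 72                      # PDF points are 1/72 inch; scale to ~300 DPI
matrix = fitz.Matrix(zoom, zoom)

for i, page in enumerate(doc):
    native_text = page.get_text()    # empty for purely scanned pages
    pix = page.get_pixmap(matrix=matrix)
    pix.save(f"page_{i:04d}.png")    # the image a multi-modal model would "see"
```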

RESPONSE 2:
```
No, it is **not always guaranteed** that processing PDFs using multi-modal LLMs will produce *more* accurate content compared to using plain OCR engines like Tesseract or Google Vision AI.

The distinction lies in what kind of "accuracy" we are talking about, and the purpose of the processing.

Let's break down the nuances:

1.  **Text Extraction Accuracy (Character-Level):**
    *   **Plain OCR (Google Vision AI, advanced Tesseract):** For clean, high-quality, digitally-generated PDFs where text is embedded as images (or scanned documents with good resolution), dedicated OCR engines are highly optimized for **character-level accuracy**. Google Vision AI, for instance, is state-of-the-art and incredibly precise at turning pixels into text. It's their primary, focused function.
    *   **Multi-modal LLMs:** While they often *leverage* an internal or integrated OCR module, their core strength isn't necessarily a marginal improvement in raw character recognition. If the underlying OCR component of the multi-modal LLM is less sophisticated than a standalone, highly optimized OCR service, it *could* theoretically produce slightly less accurate text for very challenging or noisy input. However, in practice, leading multi-modal LLMs likely use very good OCR components.

2.  **Layout and Structure Accuracy:**
    *   **Plain OCR:** Modern OCR engines go beyond just text; they provide bounding box information, reading order, and can often identify paragraphs, lines, and sometimes even tables. They are good at preserving the spatial relationships of text.
    *   **Multi-modal LLMs:** This is where multi-modal LLMs excel significantly. By combining visual understanding with textual output, they can:
        *   **Infer more robust reading order:** They "see" the flow of text across columns, around images, and can understand complex document layouts (e.g., sidebars, footnotes) more accurately than OCR purely relying on spatial heuristics.
        *   **Preserve hierarchical structure:** They can better distinguish headings, subheadings, bullet points, and main body text based on visual cues (font size, bolding, spacing) combined with content.
        *   **Understand relationships:** They grasp that a caption belongs to a specific image or that a paragraph refers to a particular diagram.

3.  **Semantic and Non-Textual Content Understanding:**
    *   **Plain OCR:** They have **no understanding** of the content's meaning or non-textual elements. They just extract text. They can't tell you what a graph represents, what a diagram illustrates, or summarize a passage.
    *   **Multi-modal LLMs:** This is their **biggest advantage**. They can:
        *   **Interpret images, charts, and diagrams:** They can "read" the data from a bar chart, describe the flow in a flowchart, or identify objects in a photograph within the document.
        *   **Summarize content:** Understand the context and meaning across text and visuals to provide summaries.
        *   **Answer complex questions:** Respond to queries that require synthesizing information from both text and visual elements.
        *   **Reason about the content:** Make inferences or draw conclusions based on the combined information.

**When Multi-modal LLMs are Superior:**

*   **When holistic understanding is needed:** If you need to understand the *meaning* of the book, including its diagrams, charts, and the relationships between text and visuals.
*   **For complex layouts:** If the book has multi-column layouts, interspersed images, footnotes, or other challenging visual structures.
*   **For answering questions or summarizing:** When you want to interact with the content semantically, not just extract it.
*   **For visually rich documents:** Books with many figures, graphs, or tables that convey significant information visually.

**When Plain OCR is Sufficient or Preferred:**

*   **Pure text extraction:** If your only goal is to extract the raw, unformatted text from a high-quality scanned book or a "flattened" PDF.
*   **Cost and speed:** OCR engines are typically faster and less computationally intensive for just text extraction compared to running a full multi-modal LLM.
*   **High-volume, specific text processing:** If you're doing something like indexing every word for search, or simply converting millions of scanned pages into searchable text archives, a highly optimized OCR pipeline might be more efficient.
*   **Avoiding hallucination:** An OCR engine won't invent text that isn't there. A multi-modal LLM, while powerful, still has a risk of hallucinating or misinterpreting very ambiguous visual information, leading to "accurate-looking" but incorrect content.

**Conclusion:**

Multi-modal LLMs are designed for a **broader and deeper understanding** of a document's content, including its visual elements and the relationships between them. This generally leads to **more accurate and comprehensive *interpretations* of the content as a whole.**

However, for the specific task of **raw character-level text extraction from high-quality input**, a highly optimized, dedicated OCR engine might still be on par or even marginally superior in very specific benchmarks, purely because that's its singular, honed focus.

So, it's not "always guaranteed" to produce *more* accurate content if "accuracy" is narrowly defined as perfect character recognition, but it is almost always guaranteed to produce a **richer, more contextualized, and semantically accurate *understanding* of the document's content.**
```
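
For comparison with the "plain OCR" path weighed above, the baseline is only a few lines of Python with pytesseract; the language codes ("eng" plus "san" for Sanskrit) and the file name are assumptions, and the corresponding Tesseract traineddata files have to be installed separately:

```
# Raw character-level OCR on one rendered page: no context, no correction.
from PIL import Image
import pytesseract

page = Image.open("page_0001.png")
text = pytesseract.image_to_string(page, lang="eng+san")

# Word-level boxes and confidences, useful for layout-aware post-processing.
data = pytesseract.image_to_data(page, lang="eng+san",
                                 output_type=pytesseract.Output.DICT)
print(text[:500])
```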

RESPONSE 3:
```
This is a fascinating and insightful question that highlights the core capabilities and potential limitations of multi-modal LLMs.

Here's a breakdown of what a multi-modal LLM like Gemini would likely do if it encountered "Alexander" in an English translation of the Ramayana PDF:

1.  **Initial Text Extraction and OCR (if needed):**
    *   If the PDF contains native, selectable text, the LLM would receive "Alexander" directly.
    *   If the PDF page is a scanned image (e.g., an old translation), an internal OCR engine (like Google Vision AI) would first attempt to convert the image of the word into text. This is where the first potential point of error or ambiguity can arise:
        *   Could "Alexander" be an OCR misrecognition of a different, more contextually appropriate Sanskrit name (e.g., "Alakshana," "Lakshmana," or another name that visually resembles "Alexander")? This is a very common issue with OCR on historical or complex scripts.

2.  **Contextual Analysis by the LLM:**
    *   Regardless of whether the word came from native text or OCR, the LLM's powerful language model component would then process "Alexander" within its surrounding textual context.
    *   **It absolutely would try to figure out if the word makes sense in the context.** This is a fundamental strength of LLMs. They are trained on vast amounts of text and code, allowing them to understand semantic relationships, historical periods, and the typical characters and themes of epic literature.
    *   The LLM has a "world model" or knowledge base that includes historical figures. It knows that Alexander the Great lived in a different historical period and geographical location than the events of the Ramayana.

3.  **Multi-modal Verification (The "Seeing" Part):**
    *   This is where the "multi-modal" aspect becomes crucial. The LLM isn't just seeing the text string "Alexander"; it's also "seeing" the *image* of the word on the page.
    *   **It would compare the textual interpretation with the visual evidence.**
    *   **Scenario A: The visual evidence strongly supports "Alexander."**
        *   If the word is clearly printed as "Alexander" (e.g., a typo in the original translation, a modern commentary within the PDF that mentions Alexander for comparison, or a genuine, albeit highly unusual, translation choice), the LLM would likely accept it as is.
        *   It might internally flag it as highly unusual or out-of-context. If asked to summarize the characters or plot, it might still omit "Alexander" or mention its unusual appearance.
        *   If asked "Who is Alexander in the Ramayana?", it might respond: "Alexander the Great is a historical figure from ancient Macedonia and is not a character in the traditional narrative of the Ramayana. If the text mentions 'Alexander,' it might be a modern annotation, a comparative analysis, or an unusual translation choice. Could you provide more context?" This shows it's trying to reconcile the conflicting information.
    *   **Scenario B: The visual evidence is ambiguous, or suggests an OCR error.**
        *   If the image of the word is blurry, smudged, or uses an ornate font, the LLM, leveraging its visual encoder, might determine that the visual pattern for "Alexander" is weak or that it strongly resembles another, more contextually appropriate word (e.g., "Lakshmana," "Bharata," "Dasharatha," or some other Sanskrit name).
        *   In this case, the LLM might internally **correct** the OCR output based on both the visual ambiguity *and* the strong contextual probability. It might infer that the original word was *meant* to be "Lakshmana" (or a similar name) and proceed with that understanding in its output.
        *   It might then generate output that refers to "Lakshmana" without even explicitly mentioning the "Alexander" misrecognition, or it might briefly note: "The text appears to refer to Lakshmana, although the word was initially unclear."

**In summary:**

A multi-modal LLM like Gemini would **not simply add "Alexander" to the output stream blindly.** It would:

1.  **Receive the word (either natively or via OCR).**
2.  **Analyze its semantic context:** Does "Alexander" make sense in the Ramayana? (No, not typically).
3.  **Visually inspect the word (multi-modal capability):** Does the image of the word strongly support "Alexander," or could it be an OCR error of a more plausible name?
4.  **Make a probabilistic decision:**
    *   If the word is clearly "Alexander" in the image *and* the context slightly allows for it (e.g., a commentary), it will process it as such, but with awareness of its unusual nature.
    *   If the word is visually ambiguous *and* the context strongly contradicts it, the LLM is likely to infer an OCR error and attempt to reconstruct the original, contextually relevant word.

This intelligent reasoning and cross-modal verification is precisely one of the key advantages multi-modal LLMs offer over plain OCR or purely text-based models. They aim for **coherence and contextual accuracy** in their understanding, not just raw character transcription.
```
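
A practical safeguard against the silent correction described in Scenario B is to diff the LLM's transcription against a raw OCR pass and review every place where the two disagree. A rough sketch with Python's standard difflib (the file names are assumptions):

```
# Flag words where the raw OCR output and the LLM transcription disagree,
# e.g. "Alexander" vs "Lakshmana", so a human can check the page image.
import difflib

ocr_words = open("tesseract_output.txt", encoding="utf-8").read().split()
llm_words = open("llm_transcription.txt", encoding="utf-8").read().split()

matcher = difflib.SequenceMatcher(a=ocr_words, b=llm_words, autojunk=False)
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag != "equal":
        print(f"OCR: {' '.join(ocr_words[i1:i2])!r}  ->  LLM: {' '.join(llm_words[j1:j2])!r}")
```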

vishal jaiswal

Jul 14, 2025, 12:50:26 AM
to sanskrit-p...@googlegroups.com
I am not sure whether it is possible to completely suppress the creative abilities of Gemini. I have tried explicit commands, but it can't resist ignoring them at times. If this is possible, please share such a prompt.

Karthik

Jul 14, 2025, 10:27:10 PM
to sanskrit-programmers
The temperature and top_p parameters control creativity.

The prompt, however, has the largest impact.

Split the workload into smaller chunks with specific output expectations.
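
In code, with the google-generativeai Python SDK, those two suggestions might look roughly like this; the model name, parameter values, and page-by-page chunking are illustrative assumptions rather than a recommendation:

```
# Clamp temperature/top_p and process one small chunk (here, one page image)
# at a time, each with a narrow, explicit output expectation.
import glob
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel(
    "gemini-1.5-pro",
    generation_config={"temperature": 0.0, "top_p": 0.1},
)

PROMPT = ("Transcribe this single page exactly as printed. Output plain text only. "
          "Wrap anything you are unsure of in {{...}}. Do not add commentary.")

for path in sorted(glob.glob("page_*.png")):
    page_image = genai.upload_file(path=path)
    response = model.generate_content([page_image, PROMPT])
    print(f"--- {path} ---\n{response.text}")
```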