Re: OCR mixed script files?


विश्वासो वासुकिजः (Vishvas Vasuki)

Feb 23, 2026, 7:09:55 AM
to sanskrit-programmers
Gemini Flash (latest) does a good job when given an image together with this prompt:

give me the mixed script text here, exactly as it appears

How can I run this on thousands of pages?


On Thu, 17 Oct 2024 at 10:14, विश्वासो वासुकिजः (Vishvas Vasuki) <vishvas...@gmail.com> wrote:

maNipravALa (Sanskrit + Tamil) works are really hard to OCR properly. Google often yields junk for the minor-script strings. Would you have a good solution? Example files (devanAgarI + Tamil in this case): https://sendgb.com/1Tnbfjq2NIs


Another problem is grantha-script texts (there is a treasure trove of those): no OCR works well on them. Would you be interested in solving that as well?

Avinash L Varna

Feb 27, 2026, 12:20:20 PM
to sanskrit-p...@googlegroups.com
Are you asking from a cost perspective? How many tokens does it use per page on average? If it is on the order of 1k-10k tokens per page, then 1M tokens covers about 100-1k pages, and 1M tokens costs roughly $3.5 combining input and output. So a thousand pages would cost somewhere between about $3.5 and $35. Perhaps it'll be possible to find someone to sponsor the funds needed for such a project?
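
A rough sketch of the arithmetic above, taking the ~$3.5 per 1M tokens figure as given (it is the estimate from this message, not a quoted price). The function name `ocr_cost_usd` is made up for illustration.

```python
# Back-of-the-envelope OCR cost, using the figures estimated in the message above.
PRICE_PER_MILLION_TOKENS_USD = 3.5  # assumed blended input+output price


def ocr_cost_usd(pages: int, tokens_per_page: int,
                 price_per_million: float = PRICE_PER_MILLION_TOKENS_USD) -> float:
    """Return the estimated cost in USD of OCR-ing `pages` pages."""
    total_tokens = pages * tokens_per_page
    return total_tokens / 1_000_000 * price_per_million


if __name__ == "__main__":
    # Lower and upper ends of the 1k-10k tokens/page range.
    for tokens_per_page in (1_000, 10_000):
        cost = ocr_cost_usd(1_000, tokens_per_page)
        print(f"{tokens_per_page} tokens/page -> ${cost:.2f} per 1,000 pages")
```

Measuring the real token count on a handful of sample pages and plugging it in as `tokens_per_page` would narrow the range considerably.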

Avinash

--
You received this message because you are subscribed to the Google Groups "sanskrit-programmers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sanskrit-program...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/sanskrit-programmers/CAFY6qgEp%2BH-ubHjf7yTy7m4AvsMWoSqCERY5o%3D7_GouZj5EuEA%40mail.gmail.com.

विश्वासो वासुकिजः (Vishvas Vasuki)

Feb 27, 2026, 11:05:40 PM
to sanskrit-p...@googlegroups.com

Thanks. I was actually wondering whether someone has written a script (say, in Python) that I could use to extract text with this (or any other) prompt from thousands of pages.
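
A minimal sketch of what such a script could look like, assuming the pages are already exported as one image per page, the google-genai SDK is installed, and an API key is available in the GEMINI_API_KEY environment variable. The function names (`gemini_ocr`, `ocr_directory`), the directory layout, and the model alias `gemini-flash-latest` are assumptions for illustration, not something confirmed in this thread.

```python
import time
from pathlib import Path
from typing import Callable

PROMPT = "give me the mixed script text here, exactly as it appears"
IMAGE_SUFFIXES = {".png": "image/png", ".jpg": "image/jpeg", ".jpeg": "image/jpeg"}


def gemini_ocr(image_path: Path, model: str = "gemini-flash-latest") -> str:
    """OCR one image with the google-genai SDK (model alias is an assumption)."""
    from google import genai  # imported lazily so the batch loop has no hard dependency
    from google.genai import types

    client = genai.Client()  # reads the API key from the environment
    part = types.Part.from_bytes(
        data=image_path.read_bytes(),
        mime_type=IMAGE_SUFFIXES[image_path.suffix.lower()],
    )
    response = client.models.generate_content(model=model, contents=[part, PROMPT])
    return response.text or ""


def ocr_directory(in_dir: Path, out_dir: Path,
                  ocr: Callable[[Path], str] = gemini_ocr,
                  retries: int = 3, pause_s: float = 1.0) -> list[Path]:
    """OCR every image in in_dir, writing one .txt per page; return files written."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    pages = sorted(p for p in in_dir.iterdir() if p.suffix.lower() in IMAGE_SUFFIXES)
    for page in pages:
        target = out_dir / (page.stem + ".txt")
        if target.exists():  # already done, which makes an interrupted run resumable
            continue
        for attempt in range(1, retries + 1):
            try:
                target.write_text(ocr(page), encoding="utf-8")
                written.append(target)
                break
            except Exception as err:  # e.g. rate limits or transient network errors
                print(f"{page.name}: attempt {attempt} failed: {err}")
                time.sleep(pause_s * attempt)
    return written
```

Calling `ocr_directory(Path("pages"), Path("text"))` would then walk a hypothetical `pages` folder. Because finished pages are skipped, the script can simply be rerun after a crash or a quota error until every page has a text file.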

 


Andrew Ollett

Feb 27, 2026, 11:16:18 PM
to sanskrit-p...@googlegroups.com
I have a Python script that I use to OCR text with gemini-3-pro-preview; Google AI Studio can easily generate one for you, but I am pasting mine below for reference. It only works on chunks of about 15-20 pages, so one option would be to split the PDF and feed the chunks to the script on a timer (or on completion of the previous chunk).

--------
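import argparse

from google import genai
from google.genai import types


def ocr_pdf(project_id, location, gcs_uri, model_name="gemini-3-pro-preview"):
    # Vertex AI client; the PDF must already be uploaded to a GCS bucket.
    client = genai.Client(vertexai=True, project=project_id, location=location)
    file_part = types.Part.from_uri(file_uri=gcs_uri, mime_type="application/pdf")
    config = types.GenerateContentConfig(
        max_output_tokens=65536,
        temperature=0.0,
        top_p=0.95,
    )
    response = client.models.generate_content_stream(
        model=model_name,
        contents=[file_part, "prompt"],  # "prompt" is a placeholder for the OCR prompt
        config=config,
    )
    # NOTE: the pasted script ends at the call above; everything below is an
    # assumed completion that prints the streamed text and adds a small CLI.
    for chunk in response:
        if chunk.text:  # some streamed chunks carry no text
            print(chunk.text, end="", flush=True)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="OCR a PDF stored in GCS with Gemini.")
    for name in ("project_id", "location", "gcs_uri"):
        parser.add_argument(name)
    args = parser.parse_args()
    ocr_pdf(args.project_id, args.location, args.gcs_uri)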
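
The splitting option mentioned above could be sketched as follows, assuming the third-party pypdf package is installed. `chunk_ranges` and `split_pdf` are hypothetical helper names, and the default of 15 pages simply follows the 15-20 page limit reported above.

```python
from pathlib import Path


def chunk_ranges(num_pages: int, chunk_size: int = 15) -> list[tuple[int, int]]:
    """Return half-open (start, end) page ranges that cover 0..num_pages."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [(start, min(start + chunk_size, num_pages))
            for start in range(0, num_pages, chunk_size)]


def split_pdf(pdf_path: Path, out_dir: Path, chunk_size: int = 15) -> list[Path]:
    """Write consecutive chunk PDFs of at most chunk_size pages; return their paths."""
    from pypdf import PdfReader, PdfWriter  # third-party: pip install pypdf

    reader = PdfReader(str(pdf_path))
    out_dir.mkdir(parents=True, exist_ok=True)
    chunks = []
    for start, end in chunk_ranges(len(reader.pages), chunk_size):
        writer = PdfWriter()
        for index in range(start, end):
            writer.add_page(reader.pages[index])
        # File names use 1-based page numbers, e.g. book_00001-00015.pdf
        chunk_path = out_dir / f"{pdf_path.stem}_{start + 1:05d}-{end:05d}.pdf"
        with chunk_path.open("wb") as handle:
            writer.write(handle)
        chunks.append(chunk_path)
    return chunks
```

Each chunk would still need to be uploaded to a GCS bucket before a script that takes a `gcs_uri` can read it. Looping over the returned paths and calling the OCR function on one chunk after another corresponds to the "on completion of the previous chunk" option.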

Kishore Chitrapu

Mar 2, 2026, 10:16:17 AM
to sanskrit-p...@googlegroups.com
Hi Vishvas,

I would like to try processing these files with a workflow I recently developed. I can't access the link you provided: https://sendgb.com/1Tnbfjq2NIs. Do you have another link where I can download the files? I will test them and share the results with you.

Kishore

विश्वासो वासुकिजः (Vishvas Vasuki)

Mar 2, 2026, 10:55:58 AM
to sanskrit-p...@googlegroups.com
On Mon, 2 Mar 2026 at 20:46, Kishore Chitrapu <chit...@gmail.com> wrote:
Hi Vishvas,

I would like to try processing these files with a workflow I recently developed. I can't access the link you provided: https://sendgb.com/1Tnbfjq2NIs. Do you have another link where I can download the files? I will test them and share the results with you.
