Re: OCR mixed script files?

विश्वासो वासुकिजः (Vishvas Vasuki)

Feb 23, 2026, 7:09:55 AM

to sanskrit-programmers

Gemini Flash (latest) does a good job when an image is accompanied by this prompt:

give me the mixed script text here, exactly as it appears

How can I run this for thousands of pages?

On Thu, 17 Oct 2024 at 10:14, विश्वासो वासुकिजः (Vishvas Vasuki) <vishvas...@gmail.com> wrote:

maNipravALa (sanskrit + tamil) works are really hard to OCR properly. Google oft yields junk for minor-script strings. Would you have a good solution? Example files - (devanAgarI + tamil in this case) - https://sendgb.com/1Tnbfjq2NIs

Another problem is with grantha script texts (there is a treasure trove of those) - no OCR works well. Interested in solving that as well?

Avinash L Varna

Feb 27, 2026, 12:20:20 PM

to sanskrit-p...@googlegroups.com

Are you asking from a cost perspective? How many tokens does it use per page on average? If it is on the order of 1k-10k tokens per page, then 1M tokens would cover about 100-1,000 pages, and 1M tokens costs roughly $3.5 with input and output combined. So a thousand pages would cost at most about $35. Perhaps it would be possible to find someone to sponsor the funds needed for such a project?
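To make the arithmetic above easy to redo with other numbers, here is a minimal sketch. It assumes the blended price of about $3.5 per 1M tokens quoted above; the function name and parameters are made up for illustration, and the actual price depends on the model and on the split between input and output tokens.

```python
def estimate_cost_usd(pages, tokens_per_page, usd_per_million_tokens=3.5):
    """Rough cost estimate: total tokens times a blended per-token price.

    The default price is the approximate figure from the message above,
    not an official rate.
    """
    total_tokens = pages * tokens_per_page
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Low and high ends of the 1k-10k tokens-per-page range, for 1000 pages.
low = estimate_cost_usd(1000, 1_000)
high = estimate_cost_usd(1000, 10_000)
print(f"1000 pages: ${low:.2f} to ${high:.2f}")
```

Measuring the real token count on a handful of sample pages first would narrow the range considerably.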

Avinash

--
You received this message because you are subscribed to the Google Groups "sanskrit-programmers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sanskrit-program...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/sanskrit-programmers/CAFY6qgEp%2BH-ubHjf7yTy7m4AvsMWoSqCERY5o%3D7_GouZj5EuEA%40mail.gmail.com.

विश्वासो वासुकिजः (Vishvas Vasuki)

Feb 27, 2026, 11:05:40 PM

to sanskrit-p...@googlegroups.com

Thanks. I was actually wondering whether someone has written a script (say, in Python) that I could use to extract text with this or any other prompt from thousands of pages.
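For reference, the kind of script being asked for might look roughly like the sketch below. It is only an outline under stated assumptions: the pages are assumed to exist as one image file per page in a directory, and `ocr_page` is a hypothetical callable (not part of any library) that would send one image plus the prompt to the model and return the text. Everything else uses only the Python standard library.

```python
from pathlib import Path
from typing import Callable

# The prompt quoted earlier in this thread.
PROMPT = "give me the mixed script text here, exactly as it appears"


def ocr_directory(image_dir, out_dir, ocr_page: Callable[[Path, str], str],
                  prompt=PROMPT, suffixes=(".png", ".jpg", ".jpeg")):
    """Run ocr_page on every page image and write one .txt file per page.

    Pages whose output file already exists are skipped, so an interrupted
    run can simply be started again. Returns the files written in this run.
    """
    image_dir, out_dir = Path(image_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for image in sorted(image_dir.iterdir()):
        if image.suffix.lower() not in suffixes:
            continue  # not a page image
        target = out_dir / (image.stem + ".txt")
        if target.exists():
            continue  # already processed in an earlier run
        text = ocr_page(image, prompt)  # hypothetical model call
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written
```

Keeping the model call behind a plain callable means the same loop works with any model or prompt, and writing one file per page makes it cheap to resume after a rate limit or a crash.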


--
Vishvas /विश्वासः

Andrew Ollett

Feb 27, 2026, 11:16:18 PM

to sanskrit-p...@googlegroups.com

I have a Python script that I run to OCR text using gemini-3-pro-preview (Google AI Studio can easily generate this for you but I am pasting mine below for reference), but it only works on chunks of about 15-20 pages. Obviously one option would be to split the PDF and feed the chunks to the script on a timer (or on completion of the previous chunk).

--------

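from google import genai
from google.genai import types
import argparse

def ocr_pdf(project_id, location, gcs_uri, model_name="gemini-3-pro-preview"):
    # Vertex AI client; the PDF is referenced by its Cloud Storage URI.
    client = genai.Client(vertexai=True, project=project_id, location=location)
    file_part = types.Part.from_uri(file_uri=gcs_uri, mime_type="application/pdf")
    config = types.GenerateContentConfig(
        max_output_tokens=65536,
        temperature=0.0,
        top_p=0.95
    )
    response = client.models.generate_content_stream(
        model=model_name,
        # "prompt" is a placeholder for the actual OCR instruction.
        contents=[file_part, "prompt"],
        config=config
    )
    # The posted script breaks off here. Assumed completion: collect the
    # streamed text chunks into one string.
    return "".join(chunk.text or "" for chunk in response)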
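The chunking idea mentioned above (splitting the PDF into runs of about 15-20 pages and handling them one after another) could be sketched as follows. This is not part of the posted script: `page_chunks` and `process_in_chunks` are illustrative names, and `process_chunk` is a hypothetical callable that would extract the given page range and pass it to the OCR function.

```python
def page_chunks(total_pages, chunk_size=20):
    """Return 1-based, inclusive (first, last) page ranges covering the document."""
    if total_pages < 0 or chunk_size < 1:
        raise ValueError("total_pages must be >= 0 and chunk_size >= 1")
    return [(start, min(start + chunk_size - 1, total_pages))
            for start in range(1, total_pages + 1, chunk_size)]


def process_in_chunks(total_pages, process_chunk, chunk_size=20):
    """Call process_chunk(first, last) for each range, strictly in sequence.

    Each chunk starts only after the previous one has returned, which
    corresponds to the "on completion of the previous chunk" option.
    """
    results = []
    for first, last in page_chunks(total_pages, chunk_size):
        results.append(process_chunk(first, last))
    return results
```

For a 45-page PDF with the default chunk size, `page_chunks(45)` yields the ranges 1-20, 21-40 and 41-45.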

Kishore Chitrapu

Mar 2, 2026, 10:16:17 AM

to sanskrit-p...@googlegroups.com

Hi Vishvas,

I would like to try processing these files with a workflow I recently developed. I can't access the link you provided: https://sendgb.com/1Tnbfjq2NIs. Do you have another link where I can download the files? I will test them and share the results with you.

Kishore

विश्वासो वासुकिजः (Vishvas Vasuki)

Mar 2, 2026, 10:55:58 AM

to sanskrit-p...@googlegroups.com

Would be splendid -

ra-tra-sA_5-commentaries_1.pdf

ra-tra-sA_5-commentaries_2.pdf

244.2M

ra-tra-sA_5-commentaries_3.pdf

200.8M

ra-tra-sA_5-commentaries_4.pdf

विश्वासो वासुकिजः (Vishvas Vasuki)

Mar 19, 2026, 7:37:18 AM

to sanskrit-p...@googlegroups.com

Could someone develop a two-script (devanAgarI + tamiL) OCR, or just publish organized training data for one, using this data:

https://archive.org/search?query=creator:%22Shri%20Nrisimgha%20Priya%20Trust%22

Clean text (which needs to be aligned with the page-section images above, semi-manually with an LLM) is at: mUla parts, commentary parts.

It would be quite valuable for the maNipravALa texts; currently neither conventional OCR nor LLMs give satisfactory output.
