pdf text extractor leads?


विश्वासो वासुकिजः (Vishvas Vasuki)

Nov 21, 2025, 6:55:59 AM
to sanskrit-programmers
Could you do better than xpdf (output below) at extracting text from https://drive.google.com/file/d/1fg0Oeh9DkK3CZgATiT5pi9D2nNCJg9kR/view ?

---------- Forwarded message ---------
From: Bhasha IME
Date: Fri, 21 Nov 2025 at 16:00
Subject: Re: text extraction request
To: विश्वासो वासुकिजः 


Done. A lot of manual cleanup is required. This is the best I could get with xpdf. If you are aware of any other PDF text extractor that can extract text along with font info, let me know.

v_noDict_U.rtf

Bhasha IME

Nov 21, 2025, 10:11:36 AM
to sanskrit-p...@googlegroups.com
I know of no extractor that is better than xpdf; at most, the others are equal to it. Its raw mode preserves the order of above/below marks; the other modes don't. The same goes for poppler and every other extractor: they just cannot maintain the order of the text. If you are aware of any that can, let me know.
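For anyone who wants to compare the modes mentioned above, here is a minimal sketch of calling pdftotext from Python. It assumes a `pdftotext` binary (from xpdf or poppler) is on the PATH; the helper names `pdftotext_command` and `extract_text` are made up for illustration, and only the `-raw`, `-layout` and `-enc` options are used.

```python
import shutil
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, txt_path, raw=True):
    """Build a pdftotext command line (hypothetical helper, not part of xpdf)."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    # -raw keeps text in content-stream order; -layout tries to keep the
    # physical page layout instead.
    cmd.append("-raw" if raw else "-layout")
    cmd += [str(pdf_path), str(txt_path)]
    return cmd


def extract_text(pdf_path, txt_path, raw=True):
    """Run pdftotext and return the extracted text as a string."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    subprocess.run(pdftotext_command(pdf_path, txt_path, raw), check=True)
    return Path(txt_path).read_text(encoding="utf-8")
```

Running `extract_text` once with `raw=True` and once with `raw=False` on the same file and diffing the two outputs is a quick way to see where the mark order differs.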



Shreevatsa R

Nov 21, 2025, 12:22:25 PM
to sanskrit-p...@googlegroups.com
It may be possible to do better than xpdf, but in general, OCR is the best way to extract text from PDFs, sad to say.

For OCR, some options:


विश्वासो वासुकिजः (Vishvas Vasuki)

Nov 22, 2025, 12:12:45 AM
to sanskrit-p...@googlegroups.com
(FWIW, I tried the Google Drive API [with the PDF itself as input, NOT converted to images] to see whether it yields something more recoverable; it didn't in this case.)



--
Vishvas /विश्वासः

veda-nityatA.pdf.txt