pdf font conversion issue.


विश्वासो वासुकिजः (Vishvas Vasuki)

Jul 24, 2021, 9:51:20 AM
to sanskrit-programmers, Suhas M सुहासो महेशसूनुः कविः बहुभाषाज्ञः भूतशास्त्रज्ञः
namaste!

suhAs (cc-ed) and I would be very grateful if someone could extract the text from https://ia601804.us.archive.org/5/items/unabridged-mahabharata-6-volumes-set-in-hindi-by-veda-vyasa-compressed/Unabridged%20Mahabharata%206%20Volumes%20Set%20in%20Hindi%20by%20Veda%20Vyasa.pdf without errors and send us the plain text. (Copy-pasting does not work well, and OCR might introduce errors, so it is a last resort.)

In case the below helps -

image.png

--
--
Vishvas /विश्वासः

Prasanna Venkatesh

Jul 24, 2021, 1:08:59 PM
to sanskrit-programmers
Namaste,

I tried to convert page 5 ("नम्र निवेदन") using pdf2txt and pdftotext. There are some missing characters as shown in the image below. The same characters are skipped when I try to select the text in my PDF viewer.

I think this is the same problem as discussed here.

Prasanna Venkatesh

ksnip_20210724-210305.png

Prasanna Venkatesh

Jul 24, 2021, 1:59:25 PM
to sanskrit-programmers
Sir, the output I got from the `pdffonts` command on the file differs from yours:

```
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
APZKLW+NotoSansDevanagari-Bold       CID TrueType      Identity-H       yes yes yes  15454  0
ATMSNB+NotoSansDevanagari            CID TrueType      Identity-H       yes yes yes  15461  0
ASZHUB+Times-Roman                   CID TrueType      Identity-H       yes yes yes  15466  0
ASLUDF+Times-Bold                    CID TrueType      Identity-H       yes yes yes  15467  0
```

Shreevatsa R

Jul 25, 2021, 7:08:18 AM
to sanskrit-programmers, Suhas Mahesh
Extracting text from PDF can be a hard problem, and OCR is not necessarily a bad idea.

I imagine one could try to reverse-engineer the mapping by carefully examining the object streams and text operators in the PDF (use iText RUPS or something… some old notes of mine here). But I haven't tried it (may be harder than I think), and OCR may not be so bad considering it's not a scan.

...

I was about to send the above message but I took a closer look at the PDF file, and it indeed seems doable.

The basic idea is that the text in the PDF is done with the "show" operator "Tj", so (in a suitably cleaned up version of the PDF file), you can just grep for lines ending with "Tj". (Or could use a proper PDF parser.) 

So, taking page 200 of Volume 1.pdf from archive.org (=page 201 from the PDF linked above), you'll get lines like:

<003A0050002C001900200044002C004B0003> Tj
<0019003D0042030B> Tj
<003C00460028002C00CA> Tj
<002A002C02DE0003> Tj

The <...> means hex strings, and it looks like (in this file at least) groups of 4 hex digits (two bytes) are characters, so the above should be read as:

 ['003A', '0050', '002C', '0019', '0020', '0044', '002C', '004B', '0003'],
 ['0019', '003D', '0042', '030B'],
 ['003C', '0046', '0028', '002C', '00CA'],
 ['002A', '002C', '02DE', '0003'],

etc., where these are glyph ids (I think). The mapping to Unicode equivalents is in a CID map in the PDF: for instance, 003A maps to 0936, which is DEVANAGARI LETTER SHA, and 002C maps to 0928, which is DEVANAGARI LETTER NA. The problem is that this mapping is incomplete: some vowel signs are missing, each different length of "vowel sign i" is a separate glyph, and so on. So some tedious work is required to fill it in (add the missing glyph ids, and correct the ones that are mapped to "0000"). Also, the text in the PDF is in glyph order rather than Unicode order (consider "vowel sign i", which is drawn before its consonant, and the consonant "ra" at the start of a conjunct, which is drawn as a repha after it), but those can be dealt with too; it's just a matter of patience.
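As a rough illustration of the idea above, here is a minimal Python sketch that pulls the hex operand out of a `Tj` line, splits it into 4-digit glyph ids, and applies a partial glyph-to-Unicode table. Only the two mappings quoted above (003A and 002C) come from the PDF; the function names and the `[XXXX]` placeholder for unmapped glyphs are made up for this example.

```python
import re

# Matches lines such as "<003A0050> Tj" and captures the hex operand.
TJ_LINE = re.compile(r"^<([0-9A-Fa-f]+)>\s*Tj\s*$")

# Partial glyph-id -> Unicode table. Only these two entries are taken from
# the message above; a real table would have to be completed by hand.
PARTIAL_MAP = {
    "003A": "\u0936",  # DEVANAGARI LETTER SHA
    "002C": "\u0928",  # DEVANAGARI LETTER NA
}

def glyph_ids(line):
    """Return the 4-hex-digit glyph ids of one Tj line, or [] for other lines."""
    m = TJ_LINE.match(line.strip())
    if not m:
        return []
    operand = m.group(1).upper()
    return [operand[i:i + 4] for i in range(0, len(operand), 4)]

def to_text(ids, table=PARTIAL_MAP):
    """Map glyph ids to Unicode, leaving unmapped ids visible as [XXXX]."""
    return "".join(table.get(g, f"[{g}]") for g in ids)

ids = glyph_ids("<003A0050002C001900200044002C004B0003> Tj")
print(ids)
print(to_text(ids))
```

Keeping unmapped ids visible in the output makes it easy to see which glyphs still need a manual entry.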

Here's the sample from the top of the same page (with the "vowel sign i" in the wrong place etc: haven't bothered fixing them):

शौनकजीने कहा—सूतनन्दन! पुरातन ऋिष एवं यशस्वी ब्राह्मणआस्तीककी इस मनोरम कथाको मैं पूणर्पसे सुनना चाहता हूँ  । ।  ५  । ।सौितरुवाचइितहासिममं िवप्राः पुराणं पिरचक्षते  । ।  ६  । ।कृष्णद्वैपायनप्रोक्तं नैिमषारण्यवािसषु  ।पूव प्रचोिदतः सूतः िपता मे लोमहषर्णः  । ।  ७  । 

etc. (The missing "रू" in "पूर्णरूपसे" is just because I was too lazy to figure out the proper mapping for this case. The missing spaces in "ब्राह्मणआस्तीककी" and "सौतिरुवाचइतिहासमिमं" are more concerning, but those text operations come from entirely different lines, so this too can be handled by paying just the slightest attention to the text position; I think standard tools like pdftotext already do that.)

I don't think I'll be able to spend more time on this by myself, but if someone else is going to do the work I'm happy to explain more / answer questions if I can.

To get started: this is (I think) everything that was needed to extract the above sample: https://gist.github.com/shreevatsa/346a86ba6b616fab3c2464089aa13324

Hope this helps,
Shreevatsa


--
You received this message because you are subscribed to the Google Groups "sanskrit-programmers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sanskrit-program...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/sanskrit-programmers/1c9c6d93-6d79-4380-8d02-52649cb7a888n%40googlegroups.com.

Anunad Singh

Jul 26, 2021, 12:29:33 AM
to sanskrit-programmers
A third option should also be tried: requesting Geeta Press to kindly make the pre-PDF form of the text available.

-- anunAda

Shreevatsa R

Jul 26, 2021, 3:54:09 AM
to sanskrit-programmers
+1 sure, this is like the last answer to the barometer question :-) (The story is found online in many places e.g. here; there's even a Wikipedia article.)

Meanwhile, I did a search this morning for some terms like [pdf CMap] and [pdf CMap GPOS] and [pdf CMap indic] and found some useful results like Understanding the PDF file format – Embedded CMAP tables and in particular these slides: PDF and OpenType technology (see especially slides 36 and 40 to see what the problem is). The solution it proposes is /ActualText. For producing PDFs where text can be copied correctly, the most reliable method I know currently is to use lualatex (which invokes LuaHBTeX) with the HarfBuzz renderer, and /ActualText is precisely what it does too.[1]

This means that, in general, the following approach should work for (all?) such PDFs containing Indic text:
  1. Build a mapping from glyph-id to what-that-glyph-means. This requires looking at the PDF (the text operators in it, and corresponding visual output) for possibly a few pages, but may not be so hard, because
    • this only needs to be done once,
    • there are only a finite number of glyphs in any Devanagari font (a few hundred?),
    • for many of them the mapping to Unicode is already given in the PDF itself, and
    • it's likely the mapping is the same for a fixed font (e.g. always the same for Noto Sans Devanagari), across PDFs.
  2. Write a tool for converting from glyph sequences to Unicode.
    • My understanding is that there are already such tools, for dealing with legacy pre-Unicode fonts.
  3. (Optional) Process the PDF to add back the /Span<</ActualText … around the text operators, so that copying from the PDF, or standard tools like pdftotext, will work correctly.
None of these steps is particularly hard, and once done it would be useful for all such PDFs in future, it appears.  Anyone interested?
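To make step 3 a little more concrete, here is a small Python sketch of what wrapping a single text-showing operation in a `/Span` marked-content sequence with `/ActualText` could look like. It assumes the Unicode text is written as a UTF-16BE hex string with a leading FEFF byte-order mark; the helper name is invented, and a real tool would have to rewrite the content streams of the PDF rather than just build strings.

```python
def wrap_with_actual_text(tj_operation, unicode_text):
    """Wrap one text-showing operation in a /Span marked-content sequence
    whose /ActualText carries the intended Unicode text."""
    # UTF-16BE text strings in a PDF start with the byte-order mark FEFF.
    payload = "FEFF" + unicode_text.encode("utf-16-be").hex().upper()
    return f"/Span <</ActualText <{payload}> >> BDC\n{tj_operation}\nEMC"

# Glyph ids 0039 0027 01CF with the intended text "वर्णों" (written as escapes).
print(wrap_with_actual_text("<0039002701CF> Tj",
                            "\u0935\u0930\u094D\u0923\u094B\u0902"))
```

With the wrapper in place, a PDF reader that honours `/ActualText` should copy the Unicode string rather than whatever the glyph ids map to.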

–Shreevatsa

[1]: About the LuaHBTeX: look at the attached PDF in a text editor (look at "5 0 obj"). It was generated from:

% !TEX TS-program = lualatex
\documentclass[border=3mm]{standalone}
\usepackage{fontspec}
\setmainfont{Noto Sans Devanagari}[Renderer=Harfbuzz,Script=Devanagari]
\begin{document}
वर्णों
\end{document}


simple-indic-harfbuzz-qdf.pdf

Shreevatsa R

Jul 26, 2021, 1:12:47 PM
to sanskrit-programmers
"the mapping is the same for a fixed font"  seems to be the case: after running "ttx NotoSansDevanagari-Regular.ttf" and looking in NotoSansDevanagari-Regular.ttx, the three glyphs in "वर्णों", namely 0039 0027 01cf (in decimal: 57, 39, 463) seem to be present in the XML as:

    <GlyphID id="57" name="vadeva"/>
    <GlyphID id="39" name="nnadeva"/>
    <GlyphID id="463" name="ovowelsignrephanusvaradeva"/>

There are 967 glyphs in all, but not all will be encountered even in the original large PDF.
And there are Unicode mappings for 270 of them already:

      <map code="0x923" name="nnadeva"/><!-- DEVANAGARI LETTER NNA -->
      <map code="0x935" name="vadeva"/><!-- DEVANAGARI LETTER VA -->

The same with the six MBh PDFs from archive.org (which seem to be generated with "PDF-XChange Editor 8.0.340").
Also, the same with NotoSansDevanagari-Bold.ttx, but unfortunately (and unsurprisingly) some other font like say Chandas will use a different glyph sequence, so will need to be handled separately.

So to summarize, what needs to be done is to 
(1) map the other 697 glyphs (or whichever of them are of interest, probably a lot fewer) to meanings, 
in order to
(2) convert from glyph order to Unicode order.
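The lookup described in this message (glyph id to glyph name to Unicode) can be sketched in Python roughly as follows. The XML string is a tiny hand-made stand-in for the `.ttx` dump: the `GlyphID` and `map` lines are the ones quoted above, while the surrounding wrapper elements are included only to make the sample well-formed and should be treated as an assumption about the file layout.

```python
import xml.etree.ElementTree as ET

# Hand-made miniature of a ttx dump; only the GlyphID/map lines are from the thread.
TTX_SAMPLE = """<ttFont>
  <GlyphOrder>
    <GlyphID id="39" name="nnadeva"/>
    <GlyphID id="57" name="vadeva"/>
    <GlyphID id="463" name="ovowelsignrephanusvaradeva"/>
  </GlyphOrder>
  <cmap>
    <cmap_format_4 platformID="3" platEncID="1" language="0">
      <map code="0x923" name="nnadeva"/><!-- DEVANAGARI LETTER NNA -->
      <map code="0x935" name="vadeva"/><!-- DEVANAGARI LETTER VA -->
    </cmap_format_4>
  </cmap>
</ttFont>"""

root = ET.fromstring(TTX_SAMPLE)

# glyph id -> glyph name
id_to_name = {int(g.get("id")): g.get("name") for g in root.iter("GlyphID")}
# glyph name -> Unicode character, for glyphs that have a direct mapping
name_to_char = {m.get("name"): chr(int(m.get("code"), 16)) for m in root.iter("map")}

def glyph_to_unicode(glyph_id):
    """Return the mapped character, or None if the glyph has no direct mapping."""
    return name_to_char.get(id_to_name.get(glyph_id))

for gid in (57, 39, 463):
    print(gid, id_to_name[gid], glyph_to_unicode(gid))
```

Glyphs that come back as `None` (here 463) are the ones that still need a meaning assigned by hand or derived from the font's ligature rules.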



Suhas Mahesh

Jul 26, 2021, 4:53:12 PM
to sanskrit-programmers
Thanks for analysing this in such great detail, Shreevatsa. I do not have the skills necessary to execute your suggestions in an efficient fashion. But if anyone else wants to do it, and wants help with any grunt work, I'm happy to pitch in.

Suhas



--
Dr. Suhas Mahesh
Dept. of Electrical & Computer Engineering
University of Toronto

उज्ज्वल राजपूत

Jul 27, 2021, 3:02:56 AM
to sanskrit-p...@googlegroups.com

This was done by studying the conjunct (ligature) rules in the font file. The next effort should go into working out the correct ordering.

It is used like this:

ttx NotoSansDevanagari-Regular.ttf

Then, in a Python shell:

import devnagri_pdf_text
f = devnagri_pdf_text.Font('NotoSansDevanagari-Regular.ttx')
print(f.id_unicode([57, 39, 463], prkriya = True))

This is the output:

vadeva
nnadeva
धात॑वः ovowelsignrephanusvaradeva > ovowelsigndeva + rephanusvaradeva
ovowelsigndeva
धात॑वः rephanusvaradeva > rephdeva + anusvaradeva
धात॑वः rephdeva > radeva + viramadeva
radeva
viramadeva
anusvaradeva
वणोर्ं
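The decomposition shown in this output can be imitated with a few lines of Python. This is only a sketch of the idea, not the actual `devnagri_pdf_text` code: the rule table is copied from the three decomposition lines above, and the function names and the small base table are invented for illustration.

```python
# Ligature rules read off the output above: composite glyph -> its components.
RULES = {
    "ovowelsignrephanusvaradeva": ["ovowelsigndeva", "rephanusvaradeva"],
    "rephanusvaradeva": ["rephdeva", "anusvaradeva"],
    "rephdeva": ["radeva", "viramadeva"],
}

# Glyph names that correspond directly to one Unicode character.
BASE = {
    "vadeva": "\u0935",          # DEVANAGARI LETTER VA
    "nnadeva": "\u0923",         # DEVANAGARI LETTER NNA
    "radeva": "\u0930",          # DEVANAGARI LETTER RA
    "viramadeva": "\u094D",      # DEVANAGARI SIGN VIRAMA
    "anusvaradeva": "\u0902",    # DEVANAGARI SIGN ANUSVARA
    "ovowelsigndeva": "\u094B",  # DEVANAGARI VOWEL SIGN O
}

def expand(name):
    """Recursively expand a glyph name into Unicode characters."""
    if name in BASE:
        return BASE[name]
    return "".join(expand(part) for part in RULES[name])

def glyphs_to_text(names):
    """Concatenate the expansions in glyph order (no reordering is done here)."""
    return "".join(expand(n) for n in names)

print(glyphs_to_text(["vadeva", "nnadeva", "ovowelsignrephanusvaradeva"]))
```

The result is still in glyph order, which is why the repha ends up after the vowel sign and a separate reordering pass is needed.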

Anunad Singh

Jul 27, 2021, 3:31:18 AM
to sanskrit-programmers

Even if we get वणोर्ं, it can easily be changed to वर्णों by find-and-replace using a regular expression.
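A find-and-replace of this kind might look like the Python sketch below. It is a heuristic, not a general solution: it assumes that "र्" directly after a dependent vowel sign is a misplaced repha, and moves it in front of the preceding consonant cluster. A word that legitimately ends in a vowel sign followed by "र्" would be changed incorrectly, and a repha that follows a bare consonant is not caught at all.

```python
import re

CONS = "\u0915-\u0939\u0958-\u095F"            # Devanagari consonants
MATRA = "\u093E-\u094C"                        # dependent vowel signs
CLUSTER = f"[{CONS}](?:\u094D[{CONS}])*"       # consonant, optionally with conjuncts

# consonant cluster + vowel sign + "र्"  ->  "र्" + consonant cluster + vowel sign
MISPLACED_REPH = re.compile(f"({CLUSTER})([{MATRA}])\u0930\u094D")

def fix_reph(text):
    """Move a repha that trails a vowel sign to the front of its cluster."""
    return MISPLACED_REPH.sub("\u0930\u094D\\1\\2", text)

glyph_order = "\u0935\u0923\u094B\u0930\u094D\u0902"   # the glyph-order string above
print(fix_reph(glyph_order))
```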



Shreevatsa R

Jul 28, 2021, 12:01:50 AM
to sanskrit-programmers
Just using the opportunity to learn a little more about the PDF format and about fonts, and sharing what I'm learning while it's still new. :-) Even if it doesn't seem directly useful so far...

An additional complication is that, looking more closely, this PDF seems to be using a different (older/newer?) version of Noto Sans Devanagari compared to what I have on my system, with some glyph positions being different. (Either that, or the software that produced the PDF changed some ids when embedding the font.) For example, taking the page-200 that I've been using as test case, the section corresponding to (from the first line) "पुरातन ऋषि एवं यशस्वी ब्राह्मण…" gives, using the glyph names from NotoSansDevanagari-Regular.ttf from my system:

(id from PDF) (name from ttx)
0x002E 46 padeva
0x0045 69 uvowelsigndeva
0x0034 52 radeva
0x0042 66 aavowelsigndeva
0x0028 40 tadeva
0x002C 44 nadeva
0x0003 3 space

0x000F 15 rvocalicdeva
0x0231 561 shanadeva 
0x003B 59 ssadeva
0x0003 3 space

0x0013 19 edeva
0x0039 57 vadeva
0x0006 6 anusvaradeva
0x0003 3 space

0x0033 51 yadeva
0x003A 58 shadeva
0x00D7 215 saprehalfdeva
0x0039 57 vadeva
0x0044 68 iivowelsigndeva
0x0003 3 space

0x0114 276 baradeva
0x0042 66 aavowelsigndeva
0x0212 530 davayadeva
0x0027 39 nnadeva

Note the second glyph in ऋषि which is actually one of the "i vowel sign" glyphs, and the third glyph in "ब्राह्मण" which is ह्म rather than द्व्य. 

So we may need to extract the font from the PDF itself, rather than using the font of the same name from elsewhere. This is what one would have to do for general PDFs anyway.

I tried running "mutool extract" and opening the resulting ttf with a variety of tools (FontForge crashes, opentype.js and many tools based on it complain about 'glyphindexMap', …), but I think I've found something that works (rusttype / ab_glyph), to dump bitmaps of the glyphs. Will take a look at this again in a few days.


Bhasha IME

Jul 28, 2021, 4:27:51 AM
to sanskrit-programmers

उज्ज्वल राजपूत

Jul 28, 2021, 7:19:45 AM
to sanskrit-programmers
Namaste, Shri Shreevatsa!

I am proceeding with these steps:
  • Using the version of the font that I have to automatically get a tentative map and reverse ligature rules
  • Manually writing a small "correction map" (this is the only font- and book-specific manual work that is required; see here what I have done so far for the Mahabharata book in question)
  • Using regex to correct ि and र् marks as suggested by Anunad ji

All was going well (up to the text in black colour below) when I reached an ambiguous point. The id number 744 seems to be used for "(", as in the very beginning, but also for "-". To me at least it appears so. Might be a mistake with using only the data before "Tj". Can you please check this little issue so that I can make further progress?

(आस्तीकपर्व)त्रयोदशोऽध्यायःजरत्कारुका अपने पितरोंके अनुरोधसे विवाहकेलिये उद्‍यत होनाशौनक उवाचकिमर्थं राजशार्दूलः स राजा जनमेजयः  ।सर्पसत्रेण सर्पाणां गतोऽन्तं तद् वदस्व मे  । ।  १  । ।निखिलेन यथातत्त्वं सौते सर्वमशेषतः  ।आस्तीकश्च दि्‍वजश्रेष्ठः किमर्थं जपतां वरः  । ।  २  । ।मोक्षयामास भुजगान् प्रदीप्ताद् वसुरेतसः  ।कस्य पुत्रः स राजासीत् सर्पसत्रं य आहरत्  । ।  ३  । ।स च दि्‍वजातिप्रवरः कस्य पुत्रोऽभिधत्स्व मे  ।शौनकजीने पूछा—सूतजी! राजाओंमें श्रेष्ठ जनमेजयने किसलियेसर्पसत्रद्‍वारा सर्पोंका अन्त किया? यह प्रसंग मुझसे कहिये। सूतनन्दन! इसविषयकी सब बातोंका यथार्थरूपसे वर्णन कीजिये। जप(यज्ञ करनेवाले पुह्रुषोंमेंश्रेष्ठ विप्रवर आस्तीकने किसलिये सर्पोंको प्रज्वलित अग्निमें जलनेसे बचायाऔर वे राजा जनमेजयॏँ जिन्होंने सर्पसत्रका आयोजन किया थाॏँ किसके पुत्रथे? तथा दि्‍वजवंशशिरोमणि आस्तीक भी किसके पुत्र थे? यह मुझे बताइये  । ।  १ि३  । ।सौतिह्रुवाचमहदाख्यानमास्तीकं यथैतत् प्रोच्यते दि्‍वज  । ।  ४  । ।सर्वमेतदशेषेण शृणु मे वदतां वर  ।उग्रश्रवाजीने कहा—ब्रष्ठन्! आस्तीकका उपाख्यान बहुत बड़ा है।वॵंाओंमें श्रेष्ठ! यह प्रसंग जैसे कहा जाता हैॏँ वह सब पूरा(पूरा सुनो  । ।  ४  । ।शौनक उवाचश्रोतुमिच्छाम्यशेषेण कथामेतां मनोरमाम्  । ।  ५  । ।आस्तीकस्य पुराणषॅंर्ब्राद्‍द्‍यणस्य यशस्फस्वनः  ।

Shreevatsa R

Jul 28, 2021, 11:21:40 AM
to sanskrit-programmers
Great! Thank you उज्ज्वल-जी, that's exactly what is needed!

I think the issue is that there are multiple fonts: 
NotoSansDevanagari
NotoSansDevanagari-Bold
Times-Roman
Times-Bold

But mainly the first two, and some ids are not the same between the two of them. So I'm afraid we'll need to do things a bit more properly: dump separate lists of "Tj"s by font. I'll also try to do that and upload the separate lists. (I filed an issue on qpdf just in case, got back an explanation that I'm yet to understand properly.)
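Dumping the "Tj"s by font could be sketched in Python as below, assuming the content stream has already been uncompressed and has one operator per line. The sketch tracks the most recent `Tf` (set font) operator and files each `Tj` operand under it. The resource names `/F1` and `/F2` in the sample are hypothetical, and a real tool would also have to resolve them through each page's resources and respect graphics-state save/restore, which this ignores.

```python
import re
from collections import defaultdict

TF = re.compile(r"^/(\S+)\s+[\d.]+\s+Tf$")     # e.g. "/F1 12 Tf" selects a font
TJ = re.compile(r"^<([0-9A-Fa-f]+)>\s*Tj$")    # e.g. "<003A0050> Tj" shows text

def tj_by_font(lines):
    """Group Tj hex operands by the font resource name that was current."""
    runs = defaultdict(list)
    current = None
    for raw in lines:
        line = raw.strip()
        m = TF.match(line)
        if m:
            current = m.group(1)
            continue
        m = TJ.match(line)
        if m and current is not None:
            runs[current].append(m.group(1).upper())
    return dict(runs)

sample = [
    "BT",
    "/F1 12 Tf",                 # hypothetical resource name
    "<003A0050002C> Tj",
    "/F2 12 Tf",                 # hypothetical resource name
    "<0019003D0042030B> Tj",
    "/F1 12 Tf",
    "<002A002C02DE0003> Tj",
    "ET",
]
print(tj_by_font(sample))
```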

Regards,


Shreevatsa R

Jul 29, 2021, 7:52:59 AM
to sanskrit-programmers
Here you go; the effort can begin :-)
I extracted the glyphs from the fonts in the PDF as bitmaps, and also the operands (glyph ids) to the text-showing operations (Tj) in the PDF (but into separate files by font this time), and also hacked up some crude html to help assign meanings to them. This is the result from the giant PDF file:
— in each case, each glyph from the font is shown (in descending order of frequency) along with (up to) 20 sample text runs containing it in context, which should help.
So the manual (AFAICT) task to be done is, for each glyph:
  • if it already has a Unicode mapping in the above HTML files, then verify that it is correct,
  • if it doesn't, assign one to it. This may be a single Unicode character, or sometimes a combination of a few (like Consonant+virama, or Consonant1+virama+Consonant2, etc), and there are also a few like the various short-i vowel signs and repha (and combinations thereof) which need some special annotation/handling to indicate that they should be placed *after* the following consonant(s) in Unicode order.
If someone does this, then it should be easy to convert the raw Tj files—they look like this and you can figure out the other three URLs; be warned that the two large ones are 40 MB and 65 MB—into text, or even put the text back into the PDF so that it can be copied from properly, and standard tools for PDF text extraction should work correctly (let's see).
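For anyone curious how a listing like the one described above (glyphs in descending order of frequency, each with up to 20 sample runs) could be produced, here is a small Python sketch. It is not the Rust program mentioned below; the function name is invented, and the input is assumed to be a list of `Tj` hex operands for one font.

```python
from collections import Counter, defaultdict

def glyph_report(runs, max_samples=20):
    """Return (glyph id, count, sample runs) tuples, most frequent glyph first."""
    counts = Counter()
    samples = defaultdict(list)
    for run in runs:
        ids = [run[i:i + 4] for i in range(0, len(run), 4)]
        counts.update(ids)
        # Remember this run as context for each distinct glyph it contains.
        for g in set(ids):
            if len(samples[g]) < max_samples:
                samples[g].append(run)
    return [(g, n, samples[g]) for g, n in counts.most_common()]

report = glyph_report(["003A002C", "002C0003", "002C"])
for glyph, count, contexts in report:
    print(glyph, count, contexts)
```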

(The programs that produced the above output are here: I had to write them in Rust because the Python/Javascript parsers I found for TTF/PDF weren't working properly for this use case… I'm new to Rust and probably not using it idiomatically, but it was ok with IDE help.)


Shreevatsa R

Aug 15, 2021, 9:20:52 PM
to sanskrit-programmers, Suhas Mahesh
An update here, if people are still interested. :-)

I implemented the "fix the original PDF by surrounding text operators inside /ActualText" feature, and it sort of works now; anyone feeling sufficiently adventurous can try getting the converted text out of the original PDF.
The workflow is a bit hairy internally, but mainly it involves running a command, and the manual grunt work needed is to write down, using a couple of HTML files, the Unicode sequence for 329 glyphs (the ones marked "Not mapped in the PDF"). (This 329 = 170+159 and I wouldn't be surprised if most/all of the common glyph ids are actually the same.) Thanks to a very useful contribution by उज्ज्वल राजपूत (thanks for the interest!), a significant amount of this manual work can be reduced (one can just copy the sequence from the "helper fonts"). After this, a new PDF will be generated. There are some complications involving र् and  ि that I've implemented hackily but that can probably be made simpler. If you miss some (or all) glyphs it's ok; the next step will prompt you when it encounters those glyphs.

I did this for one page, and it needed giving the Unicode for 75 glyphs that occurred on the page (so I guess there are not 329 but only about 250 glyphs that still remain to be translated), and it took only a few minutes (less than an hour; less time than it took to write this email and certainly less than the several tens of hours I've spent on this project so far :D), so I imagine the rest should be possible too. Attached is the result as an example (also here if the attachment doesn't come through):
- "unabridged-page-201.pdf" is extracted from the original PDF,
- "unabridged-page-201.txt" is the result of pdftotext on that PDF, 
- "my-manual-page-200.txt" is the manual work (also written out to the two .toml files at the end),
- "unabridged.fixed-page-201.pdf" is Page 201 of the "fixed" PDF (Chrome seems to have some trouble selecting text in it—all the text region is scrunched up near the middle of the page—but it works fine in Adobe Acrobat Reader. I have some idea of what may be the reason, but not pursuing this right now as for text extraction this doesn't seem to be a problem),
- "unabridged.fixed-page-201-with-re-fix.txt" is the result of pdftotext on this fixed PDF, followed by a couple of regular-expression fixes (also contributed by उज्ज्वल राजपूत based on a suggestion by Anunad Singh above).

Please take a look at the two .txt files and see if the latter seems sufficiently correct. The first few lines:

Before:

शौनकजीने कहा—सूतन दन! पुरातन ऋ ष एवं यश वी ा ण
आ तीकक इस मनोरम कथाको म पूण पसे सुनना चाहता ँ ।। ५ ।।

सौ त वाच
इ तहास ममं व ाः पुराणं प रच ते ।। ६ ।।
कृ ण ै पायन ो ं नै मषार यवा सषु ।

After:

शौनकजीने  कहा—सूतनन्दन!  पुरातन  ऋषि  एवं  यशस्वी  ब्राह्मण
आस्तीककी इस मनोरम कथाको मैं पूर्णरूपसे सुनना चाहता हूँ  । ।  ५   । ।

सौतिरुवाच
इतिहासमिमं विप्राः पुराणं परिचक्षते  । ।  ६  । ।
कृष्णद्वै पायनप्रोक्तं नैमिषारण्यवासिषु  ।

I'm aware of one error: for glyph 01C2, which occurs 5 times on this page and actually represents ों (094B DEVANAGARI VOWEL SIGN O followed by ‎0902 DEVANAGARI SIGN ANUSVARA), I misread it when entering the meanings manually, and entered र्<CCprec>े (as for 01BB, visually similar), so in place of ऋषियोंके ब्राह्मणोंके पापोंका उन्होंने यायावरोंमें you'll see ऋषिर्येके ब्राह्मर्णेके पार्पेका उर्न्हेने यायावर्रेमें respectively — this is a good example of the kind of human error that's possible, and why we probably want some independent verification of the glyph meanings. Other than this error—which, auspiciously or inauspiciously, happens to be with "om" :-)—please see if there are other errors in the text.
unabridged-page-201.txt
unabridged-page-201.pdf
unabridged.fixed-page-201.pdf
unabridged.fixed-page-201-with-re-fix.txt

Suhas Mahesh

Aug 15, 2021, 11:12:13 PM
to Shreevatsa R, sanskrit-programmers
Hi Shreevatsa,

Wow! This is great. Thanks for providing such a clear workflow. I read through the text and could only see two errors, one of which is the om issue that you've already noted. The other is that कृष्णद्वैपायन, in both instances, appears as कृष्णद्वै पायन.

Best,
Suhas


विश्वासो वासुकिजः (Vishvas Vasuki)

Aug 15, 2021, 11:37:55 PM
to sanskrit-programmers, Suhas Mahesh
On Mon, Aug 16, 2021 at 6:50 AM Shreevatsa R <shree...@gmail.com> wrote:
An update here, if people are still interested. :-)

Super! Very grateful for your intelligent efforts. Order of importance from my perspective: getting the Sanskrit right >> getting the Hindi right, though the latter is desirable as well.

 
I implemented the "fix the original PDF by surrounding text operators inside /ActualText" feature, and it sort of works now; anyone feeling sufficiently adventurous can try getting the converted text out of the original PDF.
The workflow is a bit hairy internally, but mainly it involves running a command, and the manual grunt work needed is to write down, using a couple of HTML files, the Unicode sequence for 329 glyphs (the ones marked "Not mapped in the PDF"). (This 329 = 170+159 and I wouldn't be surprised if most/all of the common glyph ids are actually the same.) Thanks to a very useful contribution by उज्ज्वल राजपूत (thanks for the interest!), a significant amount of this manual work can be reduced (one can just copy the sequence from the "helper fonts"). After this, a new PDF will be generated. There are some complications involving र् and  ि that I've implemented hackily but that can probably be made simpler. If you miss some (or all) glyphs it's ok; the next step will prompt you when it encounters those glyphs.

I did this for one page, and it needed giving the Unicode for 75 glyphs that occurred on the page (so I guess there are not 329 but only about 250 glyphs that still remain to be translated),

Could you generate a toml file listing the missing glyphs, which can be edited directly while looking at a couple of HTML files, without having to build your project etc.?
The workflow would be: look up a missing glyph ID in the toml file, search the HTML (make it a single file) for that glyph ID, fill in the Unicode value with any additional comment in the toml file, and send a pull request.

Shreevatsa R

Aug 15, 2021, 11:39:57 PM
to Suhas Mahesh, sanskrit-programmers
Revised workflow:
Actually, on further thought, I think I overengineered a bit with the toml files etc — instead, to crowdsource it better; here's something simpler that anyone with a web browser can contribute to, without having to download or install anything:
Go to the following sheet, add two columns for yourself, and fill in the highlighted rows (rows 80 to 304 I think): https://docs.google.com/spreadsheets/d/1SbLjlgpSa-H8z47dNgcwCo18izPswLjYOiNXOb5WPhg/edit#gid=1707967081 


उज्ज्वल राजपूत

Aug 16, 2021, 12:30:09 AM
to sanskrit-programmers
Revised workflow:
Actually, on further thought, I think I overengineered a bit with the toml files etc — instead, to crowdsource it better; here's something simpler that anyone with a web browser can contribute to, without having to download or install anything:
Go to the following sheet, add two columns for yourself, and fill in the highlighted rows (rows 80 to 304 I think): https://docs.google.com/spreadsheets/d/1SbLjlgpSa-H8z47dNgcwCo18izPswLjYOiNXOb5WPhg/edit#gid=1707967081 

Great, thanks. Added the images of the glyphs to the sheet itself. So no need to search (or switch tabs on your browser). All required info is in the respective row of the ID.

उज्ज्वल राजपूत

Aug 16, 2021, 1:05:36 AM
to sanskrit-programmers
Just finished doing my share of the grunt work as well: verified (and keyed in where required) all the glyphs for both the fonts. Now quickly give us the Mahabharata PDF with the fixed text :-)

Shreevatsa R

Aug 16, 2021, 10:38:38 PM
to sanskrit-programmers
Thank you, that was very quick. Now all the glyphs are mapped, so there's no manual work remaining to be done. But the program seems to be buggy now as it's producing non-working PDFs; I'll take a look but it may be a few days.

On Sun, 15 Aug 2021 at 22:05, उज्ज्वल राजपूत <ujjwal....@gmail.com> wrote:
Just finished doing my share of the grunt work as well: verified (and keyed in where required) all the glyphs for both the fonts. Now quickly give us the Mahabharata PDF with the fixed text :-)


Anunad Singh

Aug 17, 2021, 1:43:49 AM
to sanskrit-programmers
Though I have not been able to follow technically what Shreevatsa ji, श्री  उज्ज्वल राजपूत and others are doing, I feel it is a tough job. We all are keenly waiting for its results and are now more than 95% confident of its success. It need not be repeated that the effort will be highly useful for all Indian languages.

-- अनुनाद

Shreevatsa R

Aug 17, 2021, 6:09:48 AM
to sanskrit-programmers
Sorry it's not clear; I'll look into adding a better explanation of what's going on :-)

While I still don't know exactly what the bug is, running "mutool clean" and "qpdf" on the PDF file before processing it seems to work for now. The PDF files seem to be visually cut off at the top, but the text seems to be copyable.

— see the "replaced-*.txt" files for the final text (and "ran.sh" for what was run).

The text still has some issues: There are still a couple thousand occurrences of "CCsucc" and a few dozen of "CCprec" in the txt files, so either the regexes or some of the i-vowel glyphs may need another look. And I don't know whether there are even more other issues with the text. Please take a look.
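For readers who have not seen the marker scheme, here is a guess at how such a regex pass could work, written as a Python sketch. The actual regexes used in the project are not shown in this thread, so everything here is an assumption: that "ि<CCsucc>" means the vowel sign belongs after the succeeding consonant cluster, and that "र्<CCprec>" means the repha belongs before the preceding consonant cluster (with any vowel signs already attached to it).

```python
import re

CONS = "\u0915-\u0939\u0958-\u095F"            # Devanagari consonants
MATRA = "\u093E-\u094C"                        # dependent vowel signs
CLUSTER = f"[{CONS}](?:\u094D[{CONS}])*"       # consonant, optionally with conjuncts

# Assumed marker layout; the real project may place the markers differently.
I_SUCC = re.compile(f"\u093F<CCsucc>({CLUSTER})")
REPH_PREC = re.compile(f"({CLUSTER}[{MATRA}]*)\u0930\u094D<CCprec>")
LEFTOVER = re.compile(r"<CC(?:prec|succ)>")

def resolve_markers(text):
    """Move marked vowel signs and rephas into Unicode order."""
    text = I_SUCC.sub("\\1\u093F", text)           # ि + cluster  ->  cluster + ि
    text = REPH_PREC.sub("\u0930\u094D\\1", text)  # cluster + र्  ->  र् + cluster
    return text

def leftover_markers(text):
    """Count markers that the regexes failed to resolve."""
    return len(LEFTOVER.findall(text))

raw = "\u090B\u093F<CCsucc>\u0937"                 # glyph order for "ऋषि"
print(resolve_markers(raw), leftover_markers(resolve_markers(raw)))
```

Counting the leftover markers after the pass gives a quick check of how many cases the patterns still miss, for example a marker that is not adjacent to any consonant.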



उज्ज्वल राजपूत

Aug 17, 2021, 6:59:18 AM
to sanskrit-p...@googlegroups.com
The text still has some issues: There are still a couple thousand occurrences of "CCsucc" and a few dozen of "CCprec" in the txt files, so either the regexes or some of the i-vowel glyphs may need another look. And I don't know whether there are even more other issues with the text. Please take a look.


With this, I don't have any remaining <CC*>'s other than some within garbage present at the end of quite a few pages. And I corrected some errors in the manual map I wrote. Sincerely request others to rewrite the maps, too. Shouldn't take more than fifteen minutes. Or at least just check mine.

Here is a sample (pg. 10000):

(Of special concern is the misplacement of the virama, highlighted below, which could actually be copied correctly from the original PDF.)

          भरतश्रेष्ठ! भगवान् पशुपतिने उन्हें अरुण और सूर्यके समान प्रकाशमान एक पताका
और अपने सम्पूर्ण भूतगणोंकी विशाल सेना भी प्रदान की  । ।  ४६  । ।
          उग्रां नानाप्रहरणां तपोवीर्यबलान्विताम्  ।
          अजेयां स्वगणैर्युक्तां नाम्ना सेनां धनंजयाम्  । ।  ४७  । ।
          रुद्रतुल्यबलैर्युक्तां योधानामयुतैस्त्रिभिः  ।
          न सा विजानाति रणात्  कदाचिद विनिवर्ति तुम्  
                                                                      ्                             । ।  ४८  । ।
          वह भयंकर सेना धनंजय नामसे विख्यात थी। उसमें सभी सैनिक नाना प्रकारके अस्त्र,
शस्त्र,  तपस्या,  बल  और  पराक्रमसे  सम्पन्न  थे।  रुद्रके समान  बलशाली  तीस  हजार
रुद्रगणोंसे युक्त वह सेना शत्रुओंके लिये अजेय थी। वह कभी भी युद्धसे पीछे हटना जानती
ही नहीं थी  । ।
          विष्णुर्ददौ वैजयन्तीं मालां बलविवर्धि नीम्  ।
          उमा ददौ विरजसी वाससी रविसप्रभे  । ।  ४९  । ।
          भगवान्  विष्णुने  कुमारको  बल  बढ़ानेवाली  वैजयन्ती  माला  दी  और  उमाने  सूर्यके
समान चमकीले दो निर्मल वस्त्र प्रदान किये  । ।  ४९  । ।
          गङ्गा कमण्डलुं दिव्यममृतोद्भवमुत्तमम्  ।
          ददौ प्रीत्या कुमाराय दण्डं चैव बृहस्पतिः  । ।  ५०  । ।
          गंगाने कुमारको प्रसन्नतापूर्वक एक दिव्य  और उत्तम कमण्डलु दिया, जो अमृत प्रकट
करनेवाला था। बृहस्पतिजीने दण्ड प्रदान किया  । ।  ५०  । ।
          गरुडो दयितं पुत्रं मयूरं चित्रबर्हि णम्  ।
          अरुणस्ताम्रचूडं च प्रददौ चरणायुधम्  । ।  ५१  । ।
          गरुडने  विचित्र  पंखोंसे  सुशोभित  अपना  प्रिय  पुत्र  मयूर  भेंट  किया।  अरुणने  लाल
शिखावाले अपने पुत्र ताम्रचूड (मुर्ग)-को समर्पि त किया, जिसका पैर ही आयुध था  । ।
          नागं तु वरुणो राजा बलवीर्यसमन्वितम्  ।
          कृष्णाजिनं ततो ब्रह्मा ब्रह्मण्याय ददौ प्रभुः  । ।  ५२  । ।
          समरेषु जयं चैव प्रददौ लोकभावनः  ।
          राजा  वरुणने  बल  और  वीर्यसे  सम्पन्न  एक  नाग  भेंट  किया  और  लोकस्रष्टा भगवान्
ब्रह्माने  ब्राह्मणहितैषी  कुमारको  काला  मृगचर्म  तथा  युद्धमें  विजयका  आशीर्वाद  प्रदान
किया  । ।  ५२   । ।
          सैनापत्यमनुप्राप्य स्कन्दो दे वगणस्य ह  । ।  ५३  । ।
          शुशुभे ज्वलितोऽर्चि ष्मान् द्वितीय इव पावकः  ।
          दे वताओंका  सेनापतित्व  पाकर  तेजस्वी  स्कन्द  अपने  तेजसे  प्रज्वलित  हो  दूसरे
अग्निदे वके समान सुशोभित होने लगे  । ।  ५३   । ।
          ततः पारिषदै श्चैव मातृभिश्च समन्वितः  । ।  ५४  । ।

उज्ज्वल राजपूत

Aug 17, 2021, 7:48:53 AM
to sanskrit-programmers
Also, I've dumped text files (one per page) here:


So, for example the text in the last message can be seen at:


Will keep updating this whenever improvements are made in the code.

shreevatsa

Aug 17, 2021, 10:48:58 AM
to sanskrit-programmers

That's great, thanks. 

I've taken down my Google Drive link, as those files were full of errors anyway and these HTML files subsume the useful output.

About the misplacement of the virama, that may be from pdftotext being too smart and deciding that (because it is so low) it belongs to another line... it may be possible to fix, not sure.
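
A quick way to see how often this happens is to scan the extracted text for combining marks (virama, vowel signs) that have no base letter directly before them. The following is a minimal sketch, not part of any tool discussed here; the function name `find_orphan_marks` is hypothetical, and it assumes a misplaced sign shows up after a space or at the start of a line.

```python
import unicodedata


def find_orphan_marks(text):
    """Return (index, Unicode name) for each combining mark lacking a base letter.

    A mark counts as orphaned when the preceding character is neither a
    letter (category L*) nor another mark (category M*), for example when
    it follows a space or starts the string.
    """
    orphans = []
    prev = ""
    for i, ch in enumerate(text):
        is_mark = unicodedata.category(ch) in ("Mn", "Mc")
        prev_ok = prev != "" and unicodedata.category(prev)[0] in ("L", "M")
        if is_mark and not prev_ok:
            orphans.append((i, unicodedata.name(ch, "UNKNOWN")))
        prev = ch
    return orphans


if __name__ == "__main__":
    # VA + AA + KA, then a space, then a stray virama, then a space and CA.
    sample = "\u0935\u093e\u0915 \u094d \u091a"
    for index, name in find_orphan_marks(sample):
        print(index, name)
```

This only flags signs that ended up detached; a sign attached to the wrong consonant would need a dictionary check to detect.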

As Suhas mentioned, there are also some unwanted spaces here and there ("पारिषदै श्चैव" on the last line of https://031323.github.io/gp-mbh/mbh/10000.html). This is another thing to investigate in the PDF.

विश्वासो वासुकिजः (Vishvas Vasuki)

Aug 17, 2021, 12:40:58 PM
to sanskrit-programmers
Please also generate and publish 1 markdown file per volume. (If doing one per adhyāya would be too difficult!)

 

--
You received this message because you are subscribed to the Google Groups "sanskrit-programmers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sanskrit-program...@googlegroups.com.

उज्ज्वल राजपूत

Aug 19, 2021, 3:37:05 AM
to sanskrit-programmers
Please also generate and publish 1 markdown file per volume. (If doing one per adhyāya would be too difficult!)

It has been divided by parva and by adhyāya. For example, this is the first adhyāya of the eighteenth parva, Svargārohaṇa:


We will also do Markdown and so on. First, though, Shreevatsa has been asked how to determine the weight (boldness) and slant of the characters.

विश्वासो वासुकिजः (Vishvas Vasuki)

Oct 1, 2021, 11:02:17 PM
to sanskrit-programmers, Shreevatsa R श्रीवत्सो गणितज्ञः
Dear Shreevatsa, this thread has not been forgotten, has it? Some of us here are waiting as eagerly as cātaka birds wait for rain.


Shreevatsa R

Oct 4, 2021, 10:08:31 PM
to विश्वासो वासुकिजः (Vishvas Vasuki), sanskrit-programmers
Sorry, busy time for me in general. I didn't quite get time to look into this matter this weekend (and unlikely I'll have time next weekend either), but I just pushed a change that may help, by surrounding the italic text with "[sl]". And did something similar for the font name as well. I haven't tested this end-to-end but hopefully उज्ज्वलji or someone can make use of it the way he had done earlier (running this on the PDF file, and post-processing the output of pdftotext I guess).
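
One possible post-processing step, sketched under a loud assumption: that the same literal marker `[sl]` appears immediately before and immediately after each italic run in the pdftotext output. The exact marker layout has not been checked against the real output, and the function name `italic_runs_to_markdown` is hypothetical.

```python
import re

# ASSUMPTION: an italic run looks like "[sl]some text[sl]" in the extracted text.
SL_RUN = re.compile(r"\[sl\](.+?)\[sl\]", re.DOTALL)


def italic_runs_to_markdown(text):
    """Rewrite each assumed [sl]...[sl] run as Markdown emphasis (*...*)."""
    return SL_RUN.sub(lambda m: "*" + m.group(1).strip() + "*", text)


if __name__ == "__main__":
    print(italic_runs_to_markdown("[sl]uvAca[sl] rest of the line"))
```

If the markers turn out to be paired differently (for example an opening and a distinct closing marker), only the regular expression needs to change.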

उज्ज्वल राजपूत

Oct 6, 2021, 9:02:54 AM
to sanskrit-p...@googlegroups.com, विश्वासो वासुकिजः (Vishvas Vasuki)
but I just pushed a change that may help, by surrounding the italic text with "[sl]". And did something similar for the font name as well.

We are obliged, Shreevatsa! Please take a look at how it appears now:


Only the transposition of signs written below the line, such as the virāma and u-kāra, now remains to be investigated.

Shreevatsa R

Oct 6, 2021, 10:54:54 AM
to sanskrit-programmers, विश्वासो वासुकिजः (Vishvas Vasuki)
1. This is great! Nice to see that so much is recoverable. Please also share the scripts you used to go from the PDF to these HTML files; they may be useful.

2. About the misplaced virāma and u-mātrās, this is unfortunate; I guess pdftotext (assuming that's what you used) is trying to be "clever". I noticed that it has some options like:
-r <fp>              : resolution, in DPI (default is 72)
-layout              : maintain original physical layout
-raw                 : keep strings in content stream order
-nodiag              : discard diagonal text

could you see if any combination of them helps? (Maybe increasing resolution a lot, and also using -raw, may help...)

Also, I noticed that there is a "pdftohtml" (also from poppler/xpdf, like pdftotext), which has a
  -wbt <fp>             : word break threshold (default 10 percent)
option and tuning that looks like it could help, but pdftotext doesn't have that option.
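
Trying the combinations by hand gets tedious, so here is a minimal sketch that builds one pdftotext command line per combination of resolution and mode. The `-r`, `-layout` and `-raw` options are the ones listed above; `-f` and `-l` (first and last page) are further pdftotext options used here to restrict the run to a single page. The function names and output file names are hypothetical.

```python
import itertools
import subprocess


def build_commands(pdf_path, resolutions=(72, 150, 300),
                   modes=("", "-layout", "-raw"), page=None):
    """Return one pdftotext argv list per (resolution, mode) combination."""
    commands = []
    for res, mode in itertools.product(resolutions, modes):
        # Hypothetical naming scheme so the outputs can be diffed afterwards.
        out = f"out_r{res}{mode or '-default'}.txt"
        argv = ["pdftotext", "-r", str(res)]
        if mode:
            argv.append(mode)
        if page is not None:
            argv += ["-f", str(page), "-l", str(page)]
        argv += [pdf_path, out]
        commands.append(argv)
    return commands


def run_all(commands):
    """Run each command; requires pdftotext to be installed and on PATH."""
    for argv in commands:
        subprocess.run(argv, check=True)


if __name__ == "__main__":
    for argv in build_commands("mbh.pdf", page=5):
        print(" ".join(argv))
```

Comparing the resulting text files with a diff tool should show which settings move the virāma and u-mātrā signs around.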

 


विश्वासो वासुकिजः (Vishvas Vasuki)

Oct 22, 2021, 1:42:30 AM
to Shreevatsa R, sanskrit-programmers
shrI venkaTesh writes - 

Text Extraction : Foxit Reader (free, Windows/Linux) is by far the best. It somewhat preserves the original layout (font size, indentation, color etc). It also preserves font info. If the PDF uses multiple fonts, this is essential for converting the text font-wise. It puts plain text as well as RTF on the clipboard.
While extracting Devanagari, other extractors mangle up the sequence of above-base and below-base chars e.g. u-kaara, uu-kaara, e-kaara, ai-kaara, anusvara etc. Foxit mostly works. This should be your first choice. 

Dear Ujjwal, please see whether this removes the remaining defect …

 
 


उज्ज्वल राजपूत

Oct 24, 2021, 6:21:28 AM
to sanskrit-programmers
  1. pdftotext with increased resolution (-r) recognizes spaces between words more accurately but also misplaces virāma and ukāra signs more enthusiastically.
  2. -raw seems to simply give the list of Tjs (?) as is, without any respect for word boundaries or line breaks.
  3. pdftohtml and Foxit Reader are not providing any of the mapped characters, nor <CCprec>, <CCsucc> etc.
Please try with other text extractors with various parameters:
https://github.com/031323/gp-mbh#readme

Much of the process can also be automated by maximizing the total length of the substrings of the extracted text that are found in a Sanskrit dictionary.
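
As a rough illustration of the dictionary idea, here is a much simpler greedy stand-in for that maximization: it joins two neighbouring tokens when the joined form is in the lexicon and the two pieces are not both words themselves. The function name `join_spurious_spaces` and the tiny lexicon are hypothetical; a real run would need an actual Sanskrit and Hindi word list.

```python
def join_spurious_spaces(line, lexicon):
    """Greedily merge adjacent tokens whose concatenation is a known word.

    Tokens are merged only if the joined form is in the lexicon and the two
    pieces are not both valid words on their own. Runs of whitespace are
    collapsed to single spaces as a side effect of split().
    """
    tokens = line.split()
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            pair = tokens[i] + tokens[i + 1]
            both_known = tokens[i] in lexicon and tokens[i + 1] in lexicon
            if pair in lexicon and not both_known:
                merged.append(pair)
                i += 2
                continue
        merged.append(tokens[i])
        i += 1
    return " ".join(merged)


if __name__ == "__main__":
    # Toy lexicon for illustration only.
    lexicon = {"स्कन्दो", "दे" + "वगणस्य", "ह"}
    print(join_spurious_spaces("स्कन्दो दे वगणस्य ह", lexicon))
```

A proper version would score whole lines rather than token pairs, since sandhi and compounds mean many valid forms will be missing from any dictionary.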

On Friday, 22 October 2021 at 11:12:30 am UTC+5:30, Vishvas Vasuki wrote:

उज्ज्वल राजपूत

Oct 25, 2021, 12:27:40 AM
to sanskrit-p...@googlegroups.com
Here is the link to the Mahabharata PDF generated by Shreevatsa ji's system (available for seven days from now):
https://we.tl/t-D4NsRNBB3L

Extract text from this using a tool of your choice, clone the repo https://github.com/031323/gp-mbh, cd into the folder rust, and run:

cargo run --release -- <PATH TO EXTRACTED TEXT FILE>

The Mahabharata html files will be generated in the folder pages/mbh/.

विश्वासो वासुकिजः (Vishvas Vasuki)

Oct 25, 2021, 12:37:22 AM
to sanskrit-programmers, venkaTeshaH BhashaIME वेङ्कटेशः पराङ्कुशसूनुरामानुज-सहकर्ता
Dear Ujjwal, you wrote that "Foxit Reader are not providing any of the mapped characters, nor <CCprec>, <CCsucc> etc." Should we understand from this that using Foxit Reader was, again, unsuccessful? Was any benefit seen there?


उज्ज्वल राजपूत

Oct 25, 2021, 1:15:17 AM
to विश्वासो वासुकिजः (Vishvas Vasuki), sanskrit-p...@googlegroups.com, bhas...@gmail.com
It has to be said that no benefit was seen. Even in the text it produces, I see spaces in the wrong places.

On Mon, 25 Oct 2021 at 10:07 am, विश्वासो वासुकिजः (Vishvas Vasuki) <vishvas...@gmail.com> wrote:

Shreevatsa R

Oct 26, 2021, 10:58:46 AM
to sanskrit-programmers, विश्वासो वासुकिजः (Vishvas Vasuki), Bhasha IME
I'll try looking into doing the text extraction in the pdf-glyph-mapping library itself, when I get some time. In general text extraction from PDF can be a hard problem because of layout (consider tables, two-column layout, etc) and the weird ways in which PDF can be produced, but for certain PDFs like this one, it may be feasible. We'll find out when we try.


Bhasha IME

Oct 26, 2021, 2:10:18 PM
to Shreevatsa R, sanskrit-programmers, विश्वासो वासुकिजः (Vishvas Vasuki)
Acrobat Reader produces decent results with some PDFs (latest on Windows is 21.007.20099.61763).
It produces UTF-16 surrogate pairs for unmapped glyphs.

विश्वासो वासुकिजः (Vishvas Vasuki)

Nov 12, 2021, 8:49:56 AM
to Shreevatsa R, sanskrit-programmers, Bhasha IME
On Tue, 26 Oct 2021 at 20:26, Shreevatsa R <shree...@gmail.com> wrote:
I'll try looking into doing the text extraction in the pdf-glyph-mapping library itself, when I get some time. In general text extraction from PDF can be a hard problem because of layout (consider tables, two-column layout, etc) and the weird ways in which PDF can be produced, but for certain PDFs like this one, it may be feasible. We'll find out when we try.

Excellent, excellent. May this not be forgotten, though.

 


विश्वासो वासुकिजः (Vishvas Vasuki)

Apr 27, 2022, 10:23:10 AM
to sanskrit-programmers, Suhas M सुहासो महेशसूनुः कविः बहुभाषाज्ञः भूतशास्त्रज्ञः, उज्ज्वल ujjvalo rAjaputraH susaMskRtaH राजपूत
Greetings to all!

Files have been received from a kind person at https://we.tl/t-sLE01lPJLx . Using them, would someone be willing to separate the Sanskrit from the Hindi and send separate markdown files for each adhyāya, in this manner? The desired presentation is as follows:


<details open><summary>मूलम्</summary>

*वैशम्पायन उवाच*

तत्रैव न्यवसन् राजन् निहत्य बकराक्षसम्।
अधीयानाः परं ब्रह्म ब्राह्मणस्य निवेशने॥ २ ॥
</details>

<details><summary>हिन्दी</summary>

वैशम्पायनजीने कहाराजन्! बकासुरका वध करनेके पश्चात् पाण्डवलोग ब्रह्मतत्त्वका प्रतिपादन करनेवाले उपनिषदोंका स्वाध्याय करते हुए वहीं ब्राह्मणके घरमें रहने लगे।
</details>

A conversion that keeps this kind of distinguishing markup is not possible with the Calibre software. So one would have to extract the epub file and put in some effort.
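
For whoever takes this up, a minimal sketch of emitting one verse in the presentation requested above. It assumes the Sanskrit lines and the Hindi paragraph have already been separated; the function name `render_verse` and its parameters are hypothetical.

```python
def render_verse(mula_lines, hindi, speaker=None):
    """Render one verse as two <details> blocks: Sanskrit original, then Hindi.

    mula_lines: list of Sanskrit lines for the verse.
    hindi: the Hindi translation as a single paragraph.
    speaker: optional speaker line (for example an "uvAca" line), set in italics.
    """
    parts = ["<details open><summary>मूलम्</summary>", ""]
    if speaker:
        parts += [f"*{speaker}*", ""]
    parts += list(mula_lines)
    parts += [
        "</details>",
        "",
        "<details><summary>हिन्दी</summary>",
        "",
        hindi,
        "</details>",
    ]
    return "\n".join(parts) + "\n"


if __name__ == "__main__":
    print(render_verse(["line one", "line two"], "translation", speaker="speaker"))
```

Writing one such block per verse and concatenating them per adhyāya would give the per-adhyāya markdown files.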


विश्वासो वासुकिजः (Vishvas Vasuki)

May 11, 2022, 2:25:40 AM
to sanskrit-programmers, Suhas M सुहासो महेशसूनुः कविः बहुभाषाज्ञः भूतशास्त्रज्ञः, उज्ज्वल ujjvalo rAjaputraH susaMskRtaH राजपूत
Available (~1k) voice recordings are now presented together with the original and the hindI translation (~350 adhyAyas across 3 parvas) - see for example https://vishvasa.github.io/purANam/mahAbhAratam/goraxapura-pAThaH/01_Adiparva/09_hiDimbavadhaparva/151_hiDimbA-saMvAdaH . I look forward to using this to consume and annotate the mahAbhArata with delight.

