--
You received this message because you are subscribed to the Google Groups "sanskrit-programmers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sanskrit-program...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/sanskrit-programmers/1c9c6d93-6d79-4380-8d02-52649cb7a888n%40googlegroups.com.
% !TEX TS-program = lualatex
\documentclass[border=3mm]{standalone}
\usepackage{fontspec}
\setmainfont{Noto Sans Devanagari}[Renderer=Harfbuzz,Script=Devanagari]
\begin{document}
वर्णों
\end{document}
ttx NotoSansDevanagari-Regular.ttf
Then, in the Python shell:
import devnagri_pdf_text
f = devnagri_pdf_text.Font('NotoSansDevanagari-Regular.ttx')
print(f.id_unicode([57, 39, 463], prkriya=True))
This is obtained:
vadeva
nnadeva
धात॑वः ovowelsignrephanusvaradeva > ovowelsigndeva + rephanusvaradeva
ovowelsigndeva
धात॑वः rephanusvaradeva > rephdeva + anusvaradeva
धात॑वः rephdeva > radeva + viramadeva
radeva
viramadeva
anusvaradeva
वणोर्ं
Even if we get वणोर्ं , it can easily be changed to वर्णों by find-replace using regular expression.
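One way such a find-replace could look, as a rough sketch only: it assumes the misplaced reph (र्) always lands right after the consonant cluster and optional vowel sign it belongs to, and that a reph followed by another consonant is a genuine conjunct that must be left alone. The names `MISPLACED_REPH` and `fix_reph` are made up for this illustration; none of this is taken from the scripts discussed in the thread.

```python
import re

CONSONANT = "[\u0915-\u0939]"    # Devanagari consonants, ka .. ha
VIRAMA = "\u094D"
VOWEL_SIGN = "[\u093E-\u094C]"   # dependent vowel signs, aa .. au
REPH = "\u0930" + VIRAMA         # ra + virama

# group 1: consonant cluster, group 2: optional vowel sign, group 3: reph
# The negative lookahead skips ra + virama that starts a real conjunct.
MISPLACED_REPH = re.compile(
    "((?:" + CONSONANT + VIRAMA + ")*" + CONSONANT + ")"
    "(" + VOWEL_SIGN + "?)"
    "(" + REPH + ")"
    "(?!" + CONSONANT + ")"
)


def fix_reph(text: str) -> str:
    """Move a trailing reph in front of the cluster it visually sits on."""
    return MISPLACED_REPH.sub(r"\3\1\2", text)


if __name__ == "__main__":
    garbled = "\u0935\u0923\u094B\u0930\u094D\u0902"  # the garbled word above
    print(fix_reph(garbled))
```

Words whose र् is already followed by a consonant are not touched, so running this over text that is already correct should be harmless, though real data may well contain cases this simple pattern misses.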
(id from PDF: hex, decimal) (name from ttx)
0x002E 46 padeva
0x0045 69 uvowelsigndeva
0x0034 52 radeva
0x0042 66 aavowelsigndeva
0x0028 40 tadeva
0x002C 44 nadeva
0x0003 3 space
0x000F 15 rvocalicdeva
0x0231 561 shanadeva
0x003B 59 ssadeva
0x0003 3 space
0x0013 19 edeva
0x0039 57 vadeva
0x0006 6 anusvaradeva
0x0003 3 space
0x0033 51 yadeva
0x003A 58 shadeva
0x00D7 215 saprehalfdeva
0x0039 57 vadeva
0x0044 68 iivowelsigndeva
0x0003 3 space
0x0114 276 baradeva
0x0042 66 aavowelsigndeva
0x0212 530 davayadeva
0x0027 39 nnadeva
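For anyone who wants to reproduce the id-to-name lookup, here is a small sketch. It assumes the ids in the PDF content stream are plain glyph ids and that the ttx dump contains the usual GlyphOrder table made of GlyphID elements. The function name `glyph_names_by_id` and the tiny `SAMPLE_TTX` excerpt are invented for the example (the excerpt reuses three id/name pairs from the table above); this is not the code used in the thread.

```python
import xml.etree.ElementTree as ET


def glyph_names_by_id(ttx_xml: str) -> dict:
    """Return {glyph id: glyph name} read from the GlyphOrder table of a ttx dump."""
    root = ET.fromstring(ttx_xml)
    return {int(g.get("id")): g.get("name") for g in root.iter("GlyphID")}


# Hypothetical, heavily truncated excerpt of a ttx file, for illustration only.
SAMPLE_TTX = """<ttFont>
  <GlyphOrder>
    <GlyphID id="3" name="space"/>
    <GlyphID id="39" name="nnadeva"/>
    <GlyphID id="57" name="vadeva"/>
  </GlyphOrder>
</ttFont>"""

if __name__ == "__main__":
    names = glyph_names_by_id(SAMPLE_TTX)
    for code in ("0x0039", "0x0027", "0x0003"):
        glyph_id = int(code, 16)
        print(code, glyph_id, names[glyph_id])
```

For a real font, the whole ttx file would be read into a string and passed in instead of `SAMPLE_TTX`.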
शौनकजीने कहा—सूतन दन! पुरातन ऋ ष एवं यश वी ा णआ तीकक इस मनोरम कथाको म पूण पसे सुनना चाहता ँ ।। ५ ।।
सौ त वाचइ तहास ममं व ाः पुराणं प रच ते ।। ६ ।।कृ ण ै पायन ो ं नै मषार यवा सषु ।
शौनकजीने कहा—सूतनन्दन! पुरातन ऋषि एवं यशस्वी ब्राह्मणआस्तीककी इस मनोरम कथाको मैं पूर्णरूपसे सुनना चाहता हूँ । । ५ । ।
सौतिरुवाचइतिहासमिमं विप्राः पुराणं परिचक्षते । । ६ । ।कृष्णद्वै पायनप्रोक्तं नैमिषारण्यवासिषु ।
(आस्तीकपर्व)त्रयोदशोऽध्यायःजरत्कारुका अपने पितरोंके अनुरोधसे विवाहकेलिये उद्यत होनाशौनक उवाचकिमर्थं राजशार्दूलः स राजा जनमेजयः ।सर्पसत्रेण सर्पाणां गतोऽन्तं तद् वदस्व मे । । १ । ।निखिलेन यथातत्त्वं सौते सर्वमशेषतः ।आस्तीकश्च दि्वजश्रेष्ठः किमर्थं जपतां वरः । । २ । ।मोक्षयामास भुजगान् प्रदीप्ताद् वसुरेतसः ।कस्य पुत्रः स राजासीत् सर्पसत्रं य आहरत् । । ३ । ।स च दि्वजातिप्रवरः कस्य पुत्रोऽभिधत्स्व मे ।शौनकजीने पूछा—सूतजी! राजाओंमें श्रेष्ठ जनमेजयने किसलियेसर्पसत्रद्वारा सर्पोंका अन्त किया? यह प्रसंग मुझसे कहिये। सूतनन्दन! इसविषयकी सब बातोंका यथार्थरूपसे वर्णन कीजिये। जप(यज्ञ करनेवाले पुह्रुषोंमेंश्रेष्ठ विप्रवर आस्तीकने किसलिये सर्पोंको प्रज्वलित अग्निमें जलनेसे बचायाऔर वे राजा जनमेजयॏँ जिन्होंने सर्पसत्रका आयोजन किया थाॏँ किसके पुत्रथे? तथा दि्वजवंशशिरोमणि आस्तीक भी किसके पुत्र थे? यह मुझे बताइये । । १ि३ । ।सौतिह्रुवाचमहदाख्यानमास्तीकं यथैतत् प्रोच्यते दि्वज । । ४ । ।सर्वमेतदशेषेण शृणु मे वदतां वर ।उग्रश्रवाजीने कहा—ब्रष्ठन्! आस्तीकका उपाख्यान बहुत बड़ा है।वॵंाओंमें श्रेष्ठ! यह प्रसंग जैसे कहा जाता हैॏँ वह सब पूरा(पूरा सुनो । । ४ । ।शौनक उवाचश्रोतुमिच्छाम्यशेषेण कथामेतां मनोरमाम् । । ५ । ।आस्तीकस्य पुराणषॅंर्ब्राद्द्यणस्य यशस्फस्वनः ।
An update here, if people are still interested. :-)
I implemented the "fix the original PDF by wrapping its text operators in /ActualText" feature, and it sort of works now; anyone feeling sufficiently adventurous can try getting the converted text out of the original PDF.

The workflow is a bit hairy internally, but mainly it involves running a command. The manual grunt work needed is to write down, using a couple of HTML files, the Unicode sequence for 329 glyphs (the ones marked "Not mapped in the PDF"). (This 329 = 170 + 159, and I wouldn't be surprised if most or all of the common glyph ids are actually the same.) Thanks to a very useful contribution by उज्ज्वल राजपूत (thanks for the interest!), much of this manual work can be avoided: one can just copy the sequence from the "helper fonts". After this, a new PDF will be generated. There are some complications involving र् and ि that I've handled hackily; this could probably be made simpler. If you miss some (or all) glyphs, that's OK; the next step will prompt you when it encounters those glyphs.
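To make the /ActualText idea concrete, here is a minimal sketch of what wrapping a run of text operators could look like at the content-stream level. It uses the standard marked-content form (a /Span property list with /ActualText, opened by BDC and closed by EMC) and encodes the replacement text as a UTF-16BE hex string with a byte-order mark. The function names and the sample operator bytes are hypothetical, and this is not how the tool described above is implemented internally.

```python
def actual_text_hex(text: str) -> bytes:
    """Encode text as a PDF hex string: UTF-16BE preceded by the FEFF byte-order mark."""
    data = b"\xfe\xff" + text.encode("utf-16-be")
    return b"<" + data.hex().upper().encode("ascii") + b">"


def wrap_in_actual_text(operators: bytes, text: str) -> bytes:
    """Surround existing text-showing operators with a marked-content span
    whose /ActualText says what the enclosed glyphs mean in Unicode."""
    return (
        b"/Span << /ActualText " + actual_text_hex(text) + b" >> BDC\n"
        + operators
        + b"\nEMC"
    )


if __name__ == "__main__":
    # b"<0039> Tj" stands in for whatever operators the page really contains.
    print(wrap_in_actual_text(b"<0039> Tj", "\u0935").decode("ascii"))
```

A PDF reader that honours /ActualText should then return the given Unicode text on copy or extraction instead of guessing from the glyph codes.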
I did this for one page, and it needed giving the Unicode for 75 glyphs that occurred on the page (so I guess there are not 329 but only about 250 glyphs that still remain to be translated).
Revised workflow:

Actually, on further thought, I think I overengineered a bit with the toml files etc. To crowdsource it better, here's something simpler that anyone with a web browser can contribute to, without having to download or install anything:

Go to the following sheet, add two columns for yourself, and fill in the highlighted rows (rows 80 to 304, I think): https://docs.google.com/spreadsheets/d/1SbLjlgpSa-H8z47dNgcwCo18izPswLjYOiNXOb5WPhg/edit#gid=1707967081
Just finished doing my share of the grunt work as well: verified (and keyed in where required) all the glyphs for both the fonts. Now quickly give us the Mahabharata PDF with the fixed text :-)
The text still has some issues: there are still a couple thousand occurrences of "CCsucc" and a few dozen of "CCprec" in the txt files, so either the regexes or some of the i-vowel glyphs may need another look. And I don't know whether the text has other issues besides these. Please take a look.
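A quick way to keep track of how many of these leftover markers remain is sketched below. It only counts the literal strings "CCsucc" and "CCprec" and assumes nothing about what they mean. The helper name `count_markers` is made up, and the input is simply any iterable of strings (for example, the contents of each txt file).

```python
from collections import Counter

MARKERS = ("CCsucc", "CCprec")


def count_markers(texts) -> Counter:
    """Count leftover marker strings across an iterable of text chunks."""
    totals = Counter()
    for text in texts:
        for marker in MARKERS:
            totals[marker] += text.count(marker)
    return totals


if __name__ == "__main__":
    # Stand-in strings; in practice these would be the contents of the txt files.
    sample = ["a<CCsucc>b<CCsucc>c", "d<CCprec>e", "clean line"]
    print(count_markers(sample))
```

Re-running the count after each change to the regexes or glyph mappings gives a rough measure of whether the change helped.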
Please also generate and publish one markdown file per volume (in case producing one per chapter would be too difficult!).
but I just pushed a change that may help, by surrounding the italic text with "[sl]". And did something similar for the font name as well.
On Wed, 6 Oct 2021 at 06:02, उज्ज्वल राजपूत <ujjwal....@gmail.com> wrote:

but I just pushed a change that may help, by surrounding the italic text with "[sl]". And did something similar for the font name as well.

We are grateful, Śrīvatsa! Please review how it looks now. What remains to be examined now is only the transposition of the marks written below the line, such as the virāma and the u-mātrā.

1. This is great! Nice to see that so much is recoverable. Please also share the scripts you used to go from the PDF to these HTML files; they may be useful.

2. About the misplaced virāma and u-mātrās, this is unfortunate; I guess pdftotext (assuming that's what you used) is trying to be "clever". I noticed that it has some options like:

-r <fp> : resolution, in DPI (default is 72)
-layout : maintain original physical layout
-raw : keep strings in content stream order
-nodiag : discard diagonal text

Could you see if any combination of them helps? (Maybe increasing the resolution a lot, and also using -raw, may help...)

Also, I noticed that there is a "pdftohtml" (also from poppler/xpdf, like pdftotext), which has a

-wbt <fp> : word break threshold (default 10 percent)

option, and tuning that looks like it could help, but pdftotext doesn't have that option.
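If it helps, trying the combinations by hand can be avoided with a small script along these lines. This is only a sketch: it builds the command lines for the options quoted above and leaves running them to the caller. The function name, the output file naming, the choice of 72 and 300 DPI, and treating -layout and -raw as alternatives are all assumptions made for the example.

```python
from itertools import product


def pdftotext_commands(pdf_path, resolutions=(72, 300),
                       modes=("", "-layout", "-raw"), nodiag=(False, True)):
    """Build one pdftotext command line per combination of the options above."""
    commands = []
    for dpi, mode, drop_diagonal in product(resolutions, modes, nodiag):
        cmd = ["pdftotext", "-r", str(dpi)]
        if mode:
            cmd.append(mode)
        if drop_diagonal:
            cmd.append("-nodiag")
        # Encode the options in the output name so the results can be compared.
        out_name = "out-r{}{}{}.txt".format(
            dpi, mode, "-nodiag" if drop_diagonal else "")
        commands.append(cmd + [pdf_path, out_name])
    return commands


if __name__ == "__main__":
    for cmd in pdftotext_commands("input.pdf"):
        print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` on a machine where pdftotext is installed, and the resulting txt files compared for misplaced virāmas and u-mātrās.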
I'll try looking into doing the text extraction in the pdf-glyph-mapping library itself, when I get some time. In general text extraction from PDF can be a hard problem because of layout (consider tables, two-column layout, etc) and the weird ways in which PDF can be produced, but for certain PDFs like this one, it may be feasible. We'll find out when we try.
On Sun, 24 Oct 2021 at 22:15, उज्ज्वल राजपूत <ujjwal....@gmail.com> wrote:

I can only say that no benefit was seen. In the text it gives as well, I see spaces in the wrong places.

On Mon, 25 Oct 2021 at 10:07 am, विश्वासो वासुकिजः (Vishvas Vasuki) <vishvas...@gmail.com> wrote:

Ārya Ujjwal, you said "Foxit Reader are not providing any of the mapped characters, nor <CCprec>, <CCsucc> etc." Should we take that to mean that the Foxit Reader attempt also failed? Was any benefit seen there?
<details open><summary>मूलम्</summary>
*वैशम्पायन उवाच*
तत्रैव न्यवसन् राजन् निहत्य बकराक्षसम्।
अधीयानाः परं ब्रह्म ब्राह्मणस्य निवेशने॥ २ ॥
</details>
<details><summary>हिन्दी</summary>
वैशम्पायनजीने कहा—राजन्! बकासुरका वध करनेके पश्चात् पाण्डवलोग ब्रह्मतत्त्वका प्रतिपादन करनेवाले उपनिषदोंका स्वाध्याय करते हुए वहीं ब्राह्मणके घरमें रहने लगे।
</details>
namaste!
suhAs (cc-ed) and I would be very grateful if someone could extract the text of https://ia601804.us.archive.org/5/items/unabridged-mahabharata-6-volumes-set-in-hindi-by-veda-vyasa-compressed/Unabridged%20Mahabharata%206%20Volumes%20Set%20in%20Hindi%20by%20Veda%20Vyasa.pdf without errors and send us the plain text. (Copy-pasting does not work well, and OCR might introduce errors, so it is a last resort.)

In case the below helps -
----
Vishvas /विश्वासः