How to use gImageReader with Tesseract 4.0.0 alpha for Hindi

ShreeDevi Kumar

Dec 27, 2016, 8:19:10 AM

to technic...@googlegroups.com

1. Install gImageReader from one of the following links (64-bit and 32-bit Windows installers, respectively):

https://smani.fedorapeople.org/tmp/gImageReader_3.2.0_qt5_x86_64_tesseract-25fed52.exe

https://smani.fedorapeople.org/tmp/gImageReader_3.2.0_qt5_i686_tesseract-25fed52.exe

2. Download the 4.0.0 alpha Hindi traineddata from

https://github.com/tesseract-ocr/tessdata/blob/master/hin.traineddata

3. Save hin.traineddata in the folder opened by Start→All Programs→gImageReader→Tesseract language definitions.

4. Start the gImageReader program.

5. Click the Settings/Tools icon in the top right corner and choose "Redetect languages".

6. Click the icon under Sources - Files in the top left corner to add images.

7. Choose the file to OCR. The image will be displayed in the main window in the center.

8. Click the down arrow next to 'Recognize all' at the top center and choose Hindi as the language.

9. To OCR part of the image, highlight that section and click the 'Recognize all' button. To OCR the whole image, click the 'Recognize all' button without making a selection.

10. The bottom left shows the status ("Recognizing page 1 of 1"); the bottom right shows OCR progress in %.

11. The OCRed text is displayed in the window on the right.
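
A note on step 2 for anyone scripting the download: the link is a GitHub file page, so fetching it directly with a command-line tool saves an HTML page rather than the traineddata itself. A minimal sketch, assuming GitHub's usual convention that replacing /blob/ with /raw/ in the URL yields the file contents; the function name raw_url and the use of curl are illustrative choices, not part of the steps above.

```shell
#!/bin/sh
# Hypothetical helper: turn a GitHub "blob" page URL into its "raw" file URL.
raw_url() {
    printf '%s\n' "$1" | sed 's#/blob/#/raw/#'
}

# Example (needs network access, so it is left commented out):
# curl -L -o hin.traineddata \
#   "$(raw_url https://github.com/tesseract-ocr/tessdata/blob/master/hin.traineddata)"
```

The downloaded file should be several megabytes in size; a file of only a few kilobytes is most likely the HTML page.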

See attached image for reference.

gimagereader.png

ShreeDevi Kumar

Dec 27, 2016, 8:22:08 AM

to technic...@googlegroups.com

Cropped image - easier to view - for reference

ShreeDevi
____________________________________________________________
Bhajan - Kirtan - Aarti @ http://bhajans.ramparivar.com

gimagereader.png

Suyash

Dec 27, 2016, 8:50:23 AM

to Scientific and Technical Hindi (वैज्ञानिक तथा तकनीकी हिन्दी)

Shree ji,
Thank you for explaining this so well and in such detail.
Now everyone will know the procedure for Hindi OCR.

Regards.

ShreeDevi Kumar

Dec 27, 2016, 9:55:24 AM

to technic...@googlegroups.com

Actually, this can be used for any of the Indian languages by downloading the appropriate language traineddata. It is easier to test using the gImageReader GUI.

Ray, the lead developer for tesseract-ocr at Google, will be updating the traineddata files again in January. In the meantime, those who want to try the current files can test them and provide feedback.

Vineetji, it will be good if you could ask people to test for other Indian languages too.

Thanks!

ShreeDevi

Anunad Singh

Dec 28, 2016, 1:31:23 AM

to Scientific and Technical Hindi (वैज्ञानिक तथा तकनीकी हिन्दी)

The real benefit of OCR lies not so much in converting small documents to Unicode as in converting large ones (such as books printed fifty or hundreds of years ago that are not available in machine-readable form).

I want to commend ShreeDevi Kumar ji for informing us all about the latest version of this open-source OCR and for describing the complete procedure for using it.

Along with this, my question is: suppose I have a book in scanned form. To convert it to Unicode text, will I have to convert it page by page, or, after pointing to the scan file (and doing a few small tasks), does the user have nothing more to do while hundreds of pages are converted?

If that is possible, many books could be converted to Unicode and placed on Wikisource, Wikibooks and similar sites. Since the text would then be machine-readable, it would certainly be more useful in the modern age (for searching, translating, copy-pasting and so on).


ShreeDevi Kumar

Dec 28, 2016, 2:00:53 AM

to technic...@googlegroups.com

>> Along with this, my question is: suppose I have a book in scanned form. To convert it to Unicode text, will I have to convert it page by page, or, after pointing to the scan file (and doing a few small tasks), does the user have nothing more to do while hundreds of pages are converted?

In command-line mode, an entire book can be OCRed in one go by creating a batch file.

VietOCR also has a batch mode, but it has not yet been updated for Tesseract 4.0.

I have not explored this in gImageReader; I am testing in batch mode only.

ShreeDevi

Suyash

Dec 28, 2016, 3:08:50 AM

to Scientific and Technical Hindi (वैज्ञानिक तथा तकनीकी हिन्दी)

Anunad ji,
Your suggestion about modernizing texts through Unicode is commendable.
I too feel that much of our literary heritage has yet to be digitized (computerized).
Hundreds of Purana texts, books on the shastras and works by renowned authors are on the way to being lost, because they remain far from the world of Unicode and the internet.
If we can do this, India's heritage of knowledge will be preserved for many generations to come.
Through media such as Wikibooks, our books can be made available to the whole world today.
The task is not at all impossible. All it needs is sincere dedication and commitment.

Thank you.

Anunad Singh

Dec 28, 2016, 5:30:22 AM

to Scientific and Technical Hindi (वैज्ञानिक तथा तकनीकी हिन्दी)

Batch file in command-line mode --> will the batch file contain one command for each page, or something else?

ShreeDevi Kumar

Dec 28, 2016, 6:11:35 AM

to technic...@googlegroups.com

Tesseract can handle multi-page TIFFs, so one solution is to convert the PDF to a multi-page TIFF and then run tesseract on that; in that case it is a single command line.

Or, if each PDF page is a separate image file, one can set up a FOR loop that goes through all the files and processes them one after another. If needed, all output files can be concatenated to create one big output file.
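
The last step, joining the per-page outputs into one file, can be sketched as follows. This assumes the text files have zero-padded names (as a %03d pattern in the image file names would give), so that the shell's sorted glob order matches page order; concat_pages is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: join per-page OCR text files into one file.
# Usage: concat_pages <directory-with-page-txt-files> <combined-output-file>
# Keep the combined file outside the page directory, or it would be
# picked up by the *.txt glob itself.
concat_pages() {
    dir=$1
    out=$2
    : > "$out"                      # start with an empty output file
    for page in "$dir"/*.txt; do    # the glob expands in sorted order
        [ -e "$page" ] || continue  # nothing matched: skip the literal pattern
        cat "$page" >> "$out"
        printf '\n' >> "$out"       # blank line between pages
    done
}
```

For example, `concat_pages out book.txt` would collect out/WMH010.txt, out/WMH011.txt and so on into book.txt.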

If there is any particular file you want me to test with, please email it to me and I can figure out the method and give more detailed instructions or send you the OCRed output from the testing.

ShreeDevi

चंदन कुमार मिश्र

Dec 28, 2016, 6:42:45 AM

to technic...@googlegroups.com

This worked out very well! I have used several OCR programs, and this one is good among them. We saw that a page which the Indsenz OCR had not handled well came out properly with this one! Many thanks!

--
चंदन कुमार मिश्र

hindibhojpuri.blogspot.com
bhojpurihindi.blogspot.com

Anunad Singh

Dec 28, 2016, 9:25:51 AM

to Scientific and Technical Hindi (वैज्ञानिक तथा तकनीकी हिन्दी)

The book can be downloaded from here:

http://mdudde.net/pdf/study_material_DDE/ba/BA1/World%20Military%20History.pdf

2016-12-28 19:38 GMT+05:30 Anunad Singh <anu...@gmail.com>:

'World Military History' (विश्व का सैन्य इतिहास) is attached. Please OCR it.

ShreeDevi Kumar

Dec 28, 2016, 10:34:50 AM

to technic...@googlegroups.com

Anunadji,

This book is from 2007, so it must be under copyright. I will process a sample page range and let you know the process I followed.

ShreeDevi

ShreeDevi Kumar

Dec 28, 2016, 10:53:45 AM

to technic...@googlegroups.com

Anunadji,

Commands I followed in the bash environment on Windows 10:

1. Convert the PDF to images using Ghostscript

gs -q -dNOPAUSE -dBATCH -r300x300 -sDEVICE=tiffg4 -dFirstPage=10 -dLastPage=20 -sOutputFile=WMH%03d.tif WMH.pdf

2. Use Scan Tailor on Windows to automatically crop and deskew the images.

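3. Run the tesseract batch process on the image files

#!/bin/bash
# Run in the anunad/out directory, which holds the cropped .tif images.
export TESSDATA_PREFIX=/mnt/c/Users/User/shree

# Loop over the images directly; the glob avoids parsing ls output.
for img_file in *.tif; do
    echo "${img_file}"
    # The output base name is the image name without its extension.
    time tesseract "${img_file}" "${img_file%.*}" --psm 6 --oem 1 -l hin+eng
done

----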

You can follow a similar procedure under Windows as well.

The sample OCRed text is attached. There are extra | characters at the beginning of lines because the outline of the page appears in the images. These can be avoided by cropping more tightly or by using an appropriate selection in gImageReader.
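
If recropping is not convenient, the stray bars can also be removed from the text output afterwards. A minimal sketch, assuming the artifact is always one or more | characters (optionally followed by spaces) at the very start of a line; strip_leading_bars is a hypothetical name, and any line that legitimately starts with | would be altered too.

```shell
#!/bin/sh
# Hypothetical helper: drop leading "|" characters (and the spaces after
# them) from every line. Reads the named files, or stdin if none are given.
strip_leading_bars() {
    sed 's/^|\{1,\}[[:space:]]*//' "$@"
}
```

For example, `strip_leading_bars WMH010.txt > WMH010.clean.txt` writes a cleaned copy and leaves the original file untouched. Bars in the middle of a line are not changed.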

ShreeDevi

WMH010.txt

WMH019.txt

WMH020.txt

WMH011.txt

WMH012.txt

WMH013.txt

WMH014.txt

WMH015.txt

WMH016.txt

WMH017.txt

WMH018.txt

ShreeDevi Kumar

Dec 29, 2016, 12:47:18 AM

to technic...@googlegroups.com

Using tesseract in batch mode on Windows

1. Use Ghostscript or a similar program to split the PDF into TIFFs.

When using the MS Windows console (command.com or cmd.exe), in particular from a batch file, you have to double the % character, since that shell uses % to prefix variables for substitution, e.g.,

gswin32c -sOutputFile=ABC%%03d.xyz

2. The batch command will be similar to the following (as written, with %%F, it is meant to be saved in a .bat file):

for %%F in (*.tif) do tesseract "%%F" "%%~nF" --psm 6 -l hin

3. The tesseract command syntax can be seen by typing

tesseract --help

Basically it is

tesseract <imagefilename> <output filename> --psm 6 -l hin

where tesseract adds the .txt extension to the output filename itself.

--psm 6: page segmentation mode 6, which treats all the text as a single uniform block

-l hin: use the Hindi traineddata

Anunad Singh

Dec 29, 2016, 4:18:49 AM

to Scientific and Technical Hindi (वैज्ञानिक तथा तकनीकी हिन्दी)

Many thanks.
I will first try to do it myself. If any problem comes up, I will ask you again.

Anunad Singh

Jan 1, 2017, 6:19:09 AM

to Scientific and Technical Hindi (वैज्ञानिक तथा तकनीकी हिन्दी)

ShreeDevi Kumar ji,

I have not installed Tesseract yet, so please OCR these 5 pages and send me the results. I have 520 files of this kind. I want to see how 'accurate' the conversion to Unicode is.

20.jpg

21.jpg

22.jpg

23.jpg

24.jpg

ShreeDevi Kumar

Jan 1, 2017, 7:42:18 AM

to technic...@googlegroups.com

OCRed using tesseract 4.0alpha under bash on Windows. I processed the files both as is and after converting them to 300 dpi, using -l hin+fra so that both the Hindi and French traineddata are applied.

Files attached.

ShreeDevi

20.txt

20-300.txt

20.txt

24-300.txt

24.txt

23-300.txt

23.txt

22-300.txt

22.txt

21-300.txt

21.txt

Anunad Singh

Jan 1, 2017, 9:25:24 AM

to Scientific and Technical Hindi (वैज्ञानिक तथा तकनीकी हिन्दी)

See 22.jpg, 23.jpg and 24.jpg. In these the first column is in Roman script (French). But the OCR went wrong there and read it as numbers (1, 2, 3, etc.). Is there any remedy for this?

ShreeDevi Kumar

Jan 1, 2017, 9:39:43 AM

to technic...@googlegroups.com

If you use the gImageReader GUI, you can select sections and apply the appropriate language to each for OCR, rather than relying on automatic detection, which is not perfect.

You can then copy the text from gImageReader (the saved text does not seem to be in UTF-8 encoding).

See attached sample.

ShreeDevi

24-gimagereader.txt

ShreeDevi Kumar

Jan 1, 2017, 10:46:05 AM

to technic...@googlegroups.com

Re gImageReader,

In the preferences (top right corner), change the encoding to UTF-8 instead of the system encoding, and Hindi text will be saved correctly in the OCRed output file.
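
Whether a saved file really is UTF-8 can be checked from a shell before trusting it. A small sketch, assuming the iconv utility is available (as it is on most Linux systems); is_utf8 is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the file decodes cleanly as UTF-8.
# iconv exits with a non-zero status when it meets an invalid byte sequence.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Example:
# is_utf8 24-gimagereader.txt && echo "valid UTF-8" || echo "not UTF-8"
```

A plain ASCII file also passes this check, so it only rules out files saved in a legacy system encoding; it does not prove that the Hindi text itself is correct.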


ShreeDevi Kumar

Jan 2, 2017, 4:49:36 AM

to technic...@googlegroups.com

gImageReader can also take a PDF file as input and process multiple pages at a time, so there is no need to split it into separate images.

ShreeDevi

Anunad Singh

Jan 2, 2017, 4:58:48 AM

to Scientific and Technical Hindi (वैज्ञानिक तथा तकनीकी हिन्दी)

How do I OCR the 520 JPG files I have in one go? What would be the best method?

ShreeDevi Kumar

Jan 2, 2017, 5:04:08 AM

to technic...@googlegroups.com

Use gImageReader. You can select multiple files in the file-open dialog and OCR them. It will also allow you to tweak the OCR to get accurate recognition for French.

Plus, being a GUI, it is more user friendly.
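
For those who would rather stay on the command line for a batch of this size, the earlier bash loop can be adapted to JPG files and made restartable. This is a sketch under assumptions, not something tested in this thread: the function name ocr_jpgs and the OCR_CMD override are hypothetical, and -l hin+fra follows the language choice used for the sample pages.

```shell
#!/bin/sh
# Hypothetical helper: OCR every .jpg in a directory, skipping pages that
# already have a .txt next to them, so an interrupted run can be resumed.
# Set OCR_CMD (e.g. OCR_CMD=echo) to preview the commands without running OCR.
ocr_jpgs() {
    dir=$1
    for img in "$dir"/*.jpg; do
        [ -e "$img" ] || continue      # nothing matched: skip the literal pattern
        base=${img%.*}                 # path without the extension
        if [ -e "$base.txt" ]; then
            continue                   # already processed in an earlier run
        fi
        ${OCR_CMD:-tesseract} "$img" "$base" -l hin+fra
    done
}
```

Running `OCR_CMD=echo ocr_jpgs .` first prints the commands that would be run, which is a cheap way to confirm that the file names are picked up as expected before starting the real batch.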

ShreeDevi

ken

Jan 2, 2017, 7:55:53 PM

to Scientific and Technical Hindi (वैज्ञानिक तथा तकनीकी हिन्दी)

Anunad Ji,

An easy way to convert the pronunciation of French words to IPA and then to Devanagari is via the Fox Replacement tool.

one two three four five six seven eight nine ten

un deux trois quatre cinq six sept huit neuf dix........French

ɛ̃ dø tʁwa katʁə sɛ̃k sis sɛt ɥit nœf dis........IPA

en dar trwaa kaatre senk sis set yuit narf dis..........Replace IPA to traditional to Dev

https://en.wikipedia.org/wiki/Help:IPA_for_French

http://easypronunciation.com/en/french-phonetic-transcription-converter

Suyash

Jan 3, 2017, 12:49:01 AM

to Scientific and Technical Hindi (वैज्ञानिक तथा तकनीकी हिन्दी)

Google Group for Tesseract OCR experts

For more information about Tesseract OCR, log in at the Google Group link given below.

Link: https://groups.google.com/forum/#!forum/tesseract-ocr

"Knowledge grows by sharing."
