Tesseract Dictionary (finally) works for Indic

74yrs old

Jan 17, 2010, 11:43:00 AM

to indi...@googlegroups.com

Dear Debayan Banerjee,

I followed every step mentioned in your blog post on the above subject, to test it for Kannada.

An extract from the Ubuntu terminal is reproduced below for your information.
sriranga@ubuntu:~$ cd tesseractindic-0.2/
sriranga@ubuntu:~/tesseractindic-0.2$ wordlist2dawg test.txt dawg
Building DAWG from word list in file, 'test.txt'
Compacting the DAWG
Compacting node from 9990280 to 1000234 (2)
100 nodes reduced
Writing squished DAWG file, 'dawg'
118 nodes in DAWG
118 edges in DAWG
sriranga@ubuntu:~/tesseractindic-0.2$ sudo cp dawg /usr/local/share/tessdata/utf.
[sudo] password for sriranga:
sriranga@ubuntu:~/tesseractindic-0.2$ sudo cp dawg /usr/local/share/tessdata/utf.freq-dawg
sriranga@ubuntu:~/tesseractindic-0.2$ sudo cp dawg /usr/local/share/tessdata/utf.word-dawg
sriranga@ubuntu:~/tesseractindic-0.2$ tesseract sampletif.tif test1 -l utf
Tesseract Open Source OCR Engine
Image has 8 * 3 bits per pixel, and size (800,600)
Resolution=300
sriranga@ubuntu:~/tesseractindic-0.2$ cat test1.txt
ಕೆಫ}॥ಡೆಹಘಃಕೆಲಿಯರಿ

sriranga@ubuntu:~/tesseractindic-0.2$ echo 'ಕ ನ್ನ ಡ ವ ನ್ನು ಕ ಲಿ ಯಿ ರಿ'>list
sriranga@ubuntu:~/tesseractindic-0.2$ cat list
ಕ ನ್ನ ಡ ವ ನ್ನು ಕ ಲಿ ಯಿ ರಿ
sriranga@ubuntu:~/tesseractindic-0.2$ wordlist2dawg list dawg
Building DAWG from word list in file, 'list'
Compacting the DAWG
Compacting node from 9990280 to 1000124 (2)
Writing squished DAWG file, 'dawg'
63 nodes in DAWG
63 edges in DAWG
sriranga@ubuntu:~/tesseractindic-0.2$ sudo cp dawg /usr/local/share/tessdata/utf.freq-dawg
sriranga@ubuntu:~/tesseractindic-0.2$ sudo cp dawg /usr/local/share/tessdata/utf.word-dawg
sriranga@ubuntu:~/tesseractindic-0.2$ tesseract sampletif.tif -l utf>temp
Could not open file, utf
sriranga@ubuntu:~/tesseractindic-0.2$ tesseract sampletif.tif -lutf>temp
Tesseract Open Source OCR Engine
Image has 8 * 3 bits per pixel, and size (800,600)
Resolution=300
sriranga@ubuntu:~/tesseractindic-0.2$ cat sampletif.txt
cat: sampletif.txt: No such file or directory
sriranga@ubuntu:~/tesseractindic-0.2$ cat temp.txt
cat: temp.txt: No such file or directory
sriranga@ubuntu:~/tesseractindic-0.2$ tesseract sampletif.tif sample.txt -l utf
Tesseract Open Source OCR Engine
Image has 8 * 3 bits per pixel, and size (800,600)
Resolution=300
sriranga@ubuntu:~/tesseractindic-0.2$ cat sample.txt
cat: sample.txt: No such file or directory

From the above it can be seen that, of the output ಕೆಫ}॥ಡೆಹಘಃ ಕೆಲಿಯರಿ, the first word ಕೆಫ}॥ಡೆಹಘಃ is wrong and the second word ಕೆಲಿಯರಿ is OK, except that ಕೆ should be ಕ.
Based on the above, I feel that your dictionary logic will work for Indic (about 50%) if the relevant Tesseract code is improved by conducting similar experiments on different Indic languages. In any case, I appreciate your wonderful logic and idea.
I await your post on further research, as you wrote: "I intend to analyse the output and pinpoint the problem in the next post. In this post, lets concentrate on the results."
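
The rough 50% estimate can be made a little more concrete with a character-level comparison. What follows is only a sketch in Python: the reference words are an assumption, obtained by joining the syllables written to the `list` file, and may not match the text in sampletif.tif exactly. The function name `char_similarity` is made up for this illustration.

```python
import difflib


def char_similarity(ocr_text, expected):
    """Return a similarity ratio between 0 and 1, comparing code point by code point."""
    return difflib.SequenceMatcher(None, ocr_text, expected).ratio()


# The two words as they appear in test1.txt (split as in the message).
ocr_words = ["ಕೆಫ}॥ಡೆಹಘಃ", "ಕೆಲಿಯರಿ"]

# ASSUMPTION: reference words built by joining the syllables from the 'list' file.
expected_words = ["ಕನ್ನಡವನ್ನು", "ಕಲಿಯಿರಿ"]

for ocr, expected in zip(ocr_words, expected_words):
    score = char_similarity(ocr, expected)
    print(f"{ocr!r} vs {expected!r}: {score:.2f}")
```

The comparison works on Unicode code points, so a wrong vowel sign counts as one mismatch rather than a whole wrong syllable.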

With Regards,
-sriranga(77yrsold)

sample.txt.txt

sampletif.tif

test.txt

test1.txt

list

-lutf.txt
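
A note on the last three tesseract commands in the extract: the attachment names sample.txt.txt and -lutf.txt suggest that tesseract treats its second argument as an output base and appends .txt to it. That would explain why `cat sample.txt` and `cat temp.txt` found nothing. The Python sketch below only illustrates this naming rule; the helper names `tesseract_command` and `text_output_path` are hypothetical, and the code does not run tesseract itself.

```python
def tesseract_command(image, output_base, lang):
    """Build the argument list: image first, then the output base, then -l <lang>."""
    return ["tesseract", image, output_base, "-l", lang]


def text_output_path(output_base):
    """Name of the text file written for a given output base (.txt is appended)."""
    return output_base + ".txt"


# Output bases used in the terminal extract, and the file each one leads to.
for base in ["test1", "sample.txt", "-lutf"]:
    print(base, "->", text_output_path(base))

# A command line that should leave its result in sample.txt.
print(" ".join(tesseract_command("sampletif.tif", "sample", "utf")))
```

On this reading, passing `sample` rather than `sample.txt` as the second argument is what makes `cat sample.txt` work afterwards.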

sriranga(77yrsold) location: Bangalore

Jan 29, 2010, 8:51:03 AM

to indic-ocr

No solution is forthcoming?
-sriranga(77yrsold)


74yrs old

Jan 31, 2010, 10:45:29 AM

to indi...@googlegroups.com

Indu,

I have attached a test sample TIF so that you can verify the correctness of the output at your end. I get the following output:

ಬ್ಲಾರ್ಗೆಳೆಯರಿಗೆಬ್ಲಾಹೊಸವಷ೪ದಶಎಭಾಶಯಗಳು (spelling mistakes). ಬೆಳೆಸಾಲ ತೀರಿಸಲಾರದ ರೈತ ತನ್ನ ಮಾನಕ್ಕೆ ಅಂಜಿ (no spelling mistakes).

Please generate freq-dawg and word-dawg separately, based on jttest.txt (now attached). After updating your tessdata folder and testing at your end, please forward your output to me for my perusal. At my end, I generated freq-dawg and word-dawg separately, copied them to my tessdata folder, and ran tesseract as usual.

Please note that the earlier (old) freq-dawg and word-dawg should be renamed to Mal.freq-dawg -1 and mal.word.dawg-1, so that you can restore the old files after testing by deleting the " -1". I hope you have caught my point.
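
The rename-and-restore step can be sketched in Python as follows. This is a minimal illustration only: the `-1` suffix follows the description in this message, while the file name `demo.freq-dawg`, the temporary directory, and the helper names `backup` and `restore` are hypothetical stand-ins for the real tessdata files.

```python
import os
import tempfile

BACKUP_SUFFIX = "-1"  # suffix for the old files, as described in the message


def backup(path):
    """Rename an existing file to <path>-1 and return the new name."""
    backup_path = path + BACKUP_SUFFIX
    os.rename(path, backup_path)
    return backup_path


def restore(path):
    """Put <path>-1 back at <path>, replacing any file generated for the test."""
    os.replace(path + BACKUP_SUFFIX, path)
    return path


# Demonstration in a throwaway directory instead of the real tessdata folder.
with tempfile.TemporaryDirectory() as tessdata:
    dawg = os.path.join(tessdata, "demo.freq-dawg")  # hypothetical file name
    with open(dawg, "w") as handle:
        handle.write("old dictionary")

    backup(dawg)                      # keep the old file as demo.freq-dawg-1
    with open(dawg, "w") as handle:   # the freshly generated test file
        handle.write("test dictionary")

    restore(dawg)                     # testing done, bring the old file back
    with open(dawg) as handle:
        print(handle.read())
```

On the real tessdata folder the same renames would normally need the `sudo` rights used for the `cp` commands earlier in the thread.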

With best of Luck,
-sriranga(77yrsold)

On Fri, Jan 29, 2010 at 7:42 PM, 74yrs old <withbl...@gmail.com> wrote:

Dear Nishad,
Sorry for disturbing you. I am forwarding this for your information about the research on the dictionary done by Debayan. It may be useful in your professional work on OCR.
With regards,
-sriranga(77yrsold)

jttest.txt

74yrs old

Jan 31, 2010, 10:47:12 AM

to indi...@googlegroups.com

Indu,
I forgot to attach the image file (PNG) for testing at your end.
-sriranga