Is a tool available to convert a horizontal line into a vertical line in a text file?


77yrsold

Dec 8, 2009, 3:05:55 AM
to indic-ocr
In a text file, how can I automatically convert a horizontal line of characters
into a vertical column of characters? Is there any tool
available for this purpose? For example:

অংশগ্রহণকারী = horizontal line
vertical line =
অং
শ
গ্র
হণ
কা
রী
Santhosh Thottingal

Dec 8, 2009, 3:24:25 AM
to indic-ocr
If you want to split the text into syllables, have a look at this code:
http://smc.org.in/silpa.git/?p=silpa.git;a=blob;f=modules/syllabalizer/syllabalizer.py

Thanks
Santhosh

Debayan Banerjee

Dec 8, 2009, 3:30:19 AM
to indi...@googlegroups.com
2009/12/8 77yrsold <withbl...@gmail.com>:

> In the textfile how to automate the horizontal lines of characters
> into vertical characters? Is there any tool
> available for this purpose. for example.
>
> অংশগ্রহণকারী  = horizontal line


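The following Python script does it for you:

#!/usr/bin/python
#-*- coding:utf8 -*-
# Python 2 script: prints each Unicode code point of a word on its own line.

import sys

# Decode the UTF-8 command-line argument into a unicode string.
word = unicode(sys.argv[1], 'utf-8')

for letter in word:
    print letter.encode('utf-8')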


Here is how to execute it on the command line:

./split.py অংশগ্রহণকারী

--
Regards,
Debayan Banerjee

Debayan Banerjee

Dec 8, 2009, 3:30:57 AM
to indi...@googlegroups.com
2009/12/8 Debayan Banerjee <deba...@gmail.com>:

> 2009/12/8 77yrsold <withbl...@gmail.com>:
>> In the textfile how to automate the horizontal lines of characters
>> into vertical characters? Is there any tool
>> available for this purpose. for example.
>>
>> অংশগ্রহণকারী  = horizontal line
>

I am assuming you are not interested in a syllable-wise breakup.

--
Regards,
Debayan Banerjee

74yrs old

Dec 8, 2009, 4:11:10 AM
to indi...@googlegroups.com
Yes, I am not interested in a syllable-wise breakup. For example: ಕಾ (Kannada), కా (Telugu), কা (Bengali), का (Hindi), கா (Tamil), കാ (Malayalam) = kA
Regards,
-sriranga(77yrs old)

74yrs old

Dec 8, 2009, 5:19:57 AM
to indi...@googlegroups.com
It did not work in my case. Sample Kannada = ಕ ನ್ನ ಡ ದ ಲ್ಲಿ ಮಾ ತಾ ಡಿ
-sriranga

Debayan Banerjee

Dec 8, 2009, 5:23:34 AM
to indi...@googlegroups.com
2009/12/8 74yrs old <withbl...@gmail.com>:

> did not work in my case. sample Kannada =  ಕ ನ್ನ ಡ ದ ಲ್ಲಿ ಮಾ ತಾ ಡಿ

I thought your input would be a word, not individual letters.
What exactly are you trying to do?

--
Regards,
Debayan Banerjee

74yrs old

Dec 8, 2009, 5:28:48 AM
to indi...@googlegroups.com
Debayan,
Here is an example word:
word = ಕನ್ನಡದಲ್ಲಿ ಮಾತಾಡಿ
It should be split as below:
ಕ
ನ್ನ
ಡ
ದ
ಲ್ಲಿ
ಮಾ
ತಾ
ಡಿ
I trust you have caught my idea.
-sriranga

Debayan Banerjee

Dec 8, 2009, 5:31:06 AM
to indi...@googlegroups.com
2009/12/8 74yrs old <withbl...@gmail.com>:

> Debayan,
> I am giving example of word below:
> word = ಕನ್ನಡದಲ್ಲಿ ಮಾತಾಡಿ

Ah! The script runs on this word too, but the output is not quite what you
want. You need each consonant kept together with its attached vowel signs. I
have to think about it a little.


--
Regards,
Debayan Banerjee
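
One way to get the split requested above (each consonant kept together with its conjuncts and vowel signs) is to group code points into orthographic clusters. The following is a minimal Python 3 sketch, not the script from this thread. The function name split_aksharas is hypothetical, and the sketch assumes that a cluster continues across combining marks, the zero-width joiners, and any letter that directly follows a virama. It uses only the standard unicodedata module. Scripts that do not normally form conjuncts after a virama, such as Tamil, would need a different rule.

```python
import unicodedata

ZERO_WIDTH = {"\u200c", "\u200d"}  # ZWNJ and ZWJ


def is_virama(ch):
    # Indic viramas (halant signs) have canonical combining class 9.
    return unicodedata.combining(ch) == 9


def split_aksharas(text):
    """Split text into clusters: base letter + conjuncts + dependent signs."""
    clusters = []
    for word in text.split():
        current = ""
        for ch in word:
            is_mark = unicodedata.category(ch).startswith("M")
            after_virama = bool(current) and is_virama(current[-1])
            if current and (is_mark or ch in ZERO_WIDTH or after_virama):
                # Same cluster: vowel sign, modifier, joiner, or conjunct part.
                current += ch
            else:
                if current:
                    clusters.append(current)
                current = ch
        if current:
            clusters.append(current)
    return clusters


if __name__ == "__main__":
    for cluster in split_aksharas("ಕನ್ನಡದಲ್ಲಿ ಮಾತಾಡಿ"):
        print(cluster)
```

For the Kannada example in this thread, the sketch groups the text as ಕ, ನ್ನ, ಡ, ದ, ಲ್ಲಿ, ಮಾ, ತಾ, ಡಿ, one cluster per line.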

74yrs old

Dec 8, 2009, 5:56:18 AM
to indi...@googlegroups.com
Dear Banerjee,
I forgot to mention that the Python program needs to read an input file, i.e. an open() call for text.txt or text.doc has to be added.
I don't know how to add this to the script since I am not familiar with Python. Just now I was able to run the script in cmd.exe on MS Windows; with the English string "testing" it works well. But when I paste Kannada or any other Indic script, it appears as ????? in cmd.exe and also in output.txt.
With regards,
-sriranga(77yrsold)

74yrs old

Dec 8, 2009, 8:03:46 AM
to indi...@googlegroups.com
Dear Banerjee,
Just now I tested with Python 3 and also Python 2.6, and I was able to run your program in cmd.exe on WinXP as well as on Ubuntu 9.04. For your information, the Python program works on Windows as well as on Linux. How do I read a text file in split.py?
With regards,
-sriranga(77yrs old)

Pranava Swaroop Madhyastha

Dec 8, 2009, 9:35:26 AM
to indi...@googlegroups.com
Dear Sriranga,

>> I forgot to inform you that the in py program  Input file is required i.e.
>> open (text.txt /text.doc)to be added.
>> I don't know how to add in the script since I am not familiar with python.
>> Just now I was able to run py in CMD.exe MSwindows using string as "
>> testing" (english) works well. But when I tried to paste kannada or
>> anyindic  script it appears as ????? in cmd.exe and also in output.txt.

I think this problem has more to do with Unicode support in the WinXP
command line than with the script.
I am not an expert in WinXP, but I think you can use
"\U" as an option for Unicode.

I am sure someone here will be able to help you with the Unicode input.

--
Sincere regards,
Pranava Swaroop

74yrs old

Jan 11, 2010, 11:39:04 AM
to indi...@googlegroups.com
No help has been forthcoming so far.
How do I open a text file in split.py? A script is requested since I am a newbie to Python.
With regards,
-sriranga(77yrsold)

Debayan Banerjee

Jan 11, 2010, 1:54:52 PM
to indi...@googlegroups.com
On 11/01/2010, 74yrs old <withbl...@gmail.com> wrote:
> No help is forthcoming till today.
> how to call or open the text file in split.py. script is requested since I
> am newbie to python.

#!/usr/bin/python
#-*- coding:utf8 -*-

import sys, codecs

f = open(sys.argv[1])
lines = f.readlines()

for words in lines:
    for letters in words:
        print letters.encode('utf-8')


Save the above script with split.py as file name.
Then do chmod +x split.py .
Then run like this: ./split.py data.txt where data.txt is your text file.

--
Regards,
Debayan Banerjee

74yrs old

Jan 12, 2010, 4:52:34 AM
to indi...@googlegroups.com
Debayan,
Thanks for the revised Python program. I ran it and made the following observations, for your information and
further instructions.

When run in IDLE, an error message is displayed (reproduced below):
>>> ================================ RESTART ================================
>>>

Traceback (most recent call last):
  File "C:\Python26\dep_split.py", line 6, in <module>

    f = open(sys.argv[1])
IndexError: list index out of range
>>>
========================================================================
When run in cmd.exe, an error message is displayed (reproduced below):

C:\Python26>python dep_split.py splittest.txt > out.txt
Traceback (most recent call last):
  File "dep_split.py", line 11, in <module>
    print letters.encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal
not in range(128)
C:\Python26>
=======================================================================
Even when run on Linux (Fedora 11), the same message is displayed.

Awaiting further guidance.
With regards,
-sriranga(77yrsold)

74yrs old

Jan 12, 2010, 5:33:18 AM
to indi...@googlegroups.com
I downloaded the revised version and renamed it dep_split.py.
Tested.
When run on Ubuntu 9.04, an error message is displayed (reproduced below):
sriranga@ubuntu:~/Desktop$ chmod +x dep.split.py
sriranga@ubuntu:~/Desktop$ ./dep.split.py test.txt

Traceback (most recent call last):
  File "./dep.split.py", line 11, in <module>

    print letters.encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)
sriranga@ubuntu:~/Desktop$

Whereas the earlier version, ./split.py, runs without any error message.
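
The traceback above is consistent with Python 2 behaviour: open() returns byte strings, and calling .encode('utf-8') on a byte string first triggers an implicit ASCII decode, which fails on the first non-ASCII byte. Byte 0xef at position 0 is most likely the first byte of a UTF-8 byte-order mark written by the editor, although that is an inference from the traceback alone. A minimal Python 3 sketch that avoids both problems follows. The function name split_file and the file names are hypothetical, and the utf-8-sig codec is chosen on the assumption that the input may start with a BOM. Like the original script, it splits per code point, not per consonant-vowel cluster.

```python
import sys


def split_file(in_path, out_path):
    """Write every non-whitespace character of in_path on its own line."""
    # utf-8-sig decodes UTF-8 and silently drops a leading BOM if present.
    with open(in_path, encoding="utf-8-sig") as src:
        text = src.read()
    # Write the result directly as UTF-8 instead of printing to the console.
    with open(out_path, "w", encoding="utf-8", newline="\n") as dst:
        for ch in text:
            if not ch.isspace():
                dst.write(ch + "\n")


if __name__ == "__main__" and len(sys.argv) == 3:
    split_file(sys.argv[1], sys.argv[2])
```

Saved under a name such as split_file.py (hypothetical), it could be run as: python3 split_file.py data.txt out.txt. Writing the output file from Python, rather than redirecting console output with >, keeps the result independent of the console's code page.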

74yrs old

Jan 12, 2010, 11:25:26 AM
to indi...@googlegroups.com
Dear Pranava Swaroop,
Yes, what you said is correct: the problem lies more with Unicode support in the WinXP command line.
Whatever is typed in cmd.exe in a language other than English is displayed as ???? instead of in the script of that
language. Even if the output is redirected to output.txt using > on the command line, the file opened in Notepad also contains
????, even though it is saved in UTF-8 format; it immediately reverts to ANSI format! The "\U" option you suggested was of no use.
At least on Linux the file can be saved in UTF-8 format.
So far no one has come forward to solve the problem.
Wishing you the best of luck,
-sriranga(77yrsold)

74yrs old

Jan 14, 2010, 12:48:27 AM
to indi...@googlegroups.com
Debayan,
Any progress? Kindly keep me updated.
With regards,
-sriranga(77yrsold)

Nishad TR

Jan 17, 2010, 9:11:22 PM
to indi...@googlegroups.com
Dear Sriranga,

Here is a tool, a lightly modified version of one of our
Tesseract post-processors.

http://lime-ocr.googlecode.com/files/h2v-splitter.zip

I just uploaded it to see whether it is of any use for your work. If you need
any modifications for your use, please let me know.

The source and a simple help text are included in the zip.

Regards
Nishad

74yrs old

Jan 18, 2010, 3:41:41 AM
to indi...@googlegroups.com
Dear Nishad,
Congratulations on successfully developing this for the Windows platform.
Thank you very much for your wonderful tool. I tested it and it works fine. However, when I used the text generated by your tool in the trainer GUI, training failed with the error message "Error: Illegal feature parameter spec!". An extract of the terminal output is reproduced below:
"in train
tesseract Kannada.images/bigimage.tif junk nobatch box.train
Error: Illegal feature parameter spec!
Fatal error: No error trap defined!
Signal_termination_handler called with signal 1000
Reading Kannada.images/bigimage.tr ...
Error: Illegal feature parameter spec!
Fatal error: No error trap defined!
Signal_termination_handler called with signal 1000
Reading Kannada.images/bigimage.tr ...
unicharset_extractor Kannada.images/bigimage.box
Extracting unicharset from Kannada.images/bigimage.box
Wrote unicharset file ./unicharset.
unicharset renamed and moved
sriranga@ubuntu:~/Desktop/TesseractIndic-Trainer-GUI-0.1.3$ "

However, kannada.unicharset was generated, while the rest of the data files were not.
I shall test on WinXP and give you feedback.

2) A small request: would you be in a position to develop a tool for WinXP that generates a dictionary-type (UTF-8) word list for use with the wordlist2dawg program of tesseract-ocr? For example:

example=Fatal error: No error trap defined!
Fatal
error:
No
error
trap
defined!
With Choicest Best of Luck,
-sriranga(77yrsold)
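
The dictionary-style layout requested in point 2 (one word per line, UTF-8) can be sketched in a few lines of Python 3. This is an illustration only, not the tool discussed in this thread. The function name words_one_per_line and the strip_punctuation switch are hypothetical, and punctuation is identified by Unicode general category so that Indic vowel signs are left untouched.

```python
import unicodedata


def _is_punct(ch):
    # Unicode general categories starting with "P" are punctuation.
    return unicodedata.category(ch).startswith("P")


def words_one_per_line(text, strip_punctuation=False):
    """Return the whitespace-separated words of text, in order."""
    words = []
    for word in text.split():
        if strip_punctuation:
            # Trim punctuation from both ends only; inner characters stay.
            while word and _is_punct(word[0]):
                word = word[1:]
            while word and _is_punct(word[-1]):
                word = word[:-1]
        if word:
            words.append(word)
    return words


if __name__ == "__main__":
    print("\n".join(words_one_per_line("Fatal error: No error trap defined!")))
```

With strip_punctuation=True the same input gives Fatal, error, No, error, trap, defined. Removing duplicates or sorting is left out here, since the thread does not say whether the word list needs either.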

Nishad TR

Jan 19, 2010, 8:06:30 PM
to indi...@googlegroups.com
Dear Sriranga,

I couldn't respond in time, since I'm traveling. Here's an updated
version.

http://lime-ocr.googlecode.com/files/h2v-split-2.zip

Please go through the provided readme: the command-line parameters have
changed, and a single tool can now split into words or
characters. The cleaner has more options, such as removing punctuation.

I'm sorry to say that I couldn't spend enough time testing it properly.
I'll update and fix it (if there are any errors) once I'm back.

Regarding the Tesseract trainer GUI and other Tesseract issues, please seek
help from the ninjas of the indic-ocr community. I myself have never tried this GUI.

If your development is heading in a hopeful direction, then I humbly request
you to use tesseract-ocr version 3. It gives better results for
Malayalam (according to one of our clients, for typewritten documents
it's near 100%, and in v2 it's hardly 85%). Structurally, all Dravidian
languages are the same, so I think for Kannada you can move ahead with
tesseract-ocr 3. (I'm not so sure about Hindi and other Devanagari
scripts.)

Please feel free to get back to me if you need anything.

Regards
Nishad

74yrs old

Jan 19, 2010, 10:18:45 PM
to indi...@googlegroups.com
Dear Nishad,
Thanks for the updated information. In fact, I spent the whole of yesterday testing your tool as well as Lime-OCR
(which appears to use Tesseract 3.0), but failed. I could not understand why it gives trouble.
Anyhow, I will download your latest version and test it on WinXP using Lime-OCR. The older version 2.03 gave about 80% correct
output (2-3 years back); I forget how I did it!
Regarding "If your development ... tesseract-ocr 3" (your last paragraph): I agree with your point. Compared to
Hindi/Devanagari, the Kannada, Tamil, Telugu and Malayalam scripts are not complex. These languages should
succeed. I had tested Tesseract 2.03 using samples of Kannada, Tamil, Telugu and Malayalam in one file and found
the outputs almost identical, which I also posted in the tesseract-ocr forum long back (2-3 years ago).

Could you test the samples attached herewith using your tool as well as Lime-OCR (r319) and give me feedback,
so that I can follow the same steps you have taken? I shall report back after going through your output.

I am ready to perform beta testing in Kannada, Tamil, Telugu and Malayalam (even though I don't know them, with the help of Baraha tools).
With Choicest Blessings of Supreme Lord,
-sriranga(77yrsold)


I could not understand "ninjas" of the indic-ocr community. I presume the ninjas are members of the forum?
I did use the trainer GUI, but I have still not achieved accurate output, even after using your tool.
abarahanews5.txt
A-kan-utf8.txt

Nishad TR

Jan 19, 2010, 11:37:20 PM
to indi...@googlegroups.com
Dear Sriranga,

I don't have any language files for Kannada generated for tesseract-3.
I'm sorry about Lime-OCR. I guess you used the Python script called
lime-ocr.py from trunk. Please don't worry about the errors. That's
only the source, not the distribution. I kept the source in trunk to
ensure it stays strictly in the open domain.

I have already mailed my guys in the office to create a package. They will be
sending it soon. Some of our clients are using Lime-OCR in a production
environment and it gives significant performance. I never get enough
time to create proper documentation for end users or a proper
setup package; that is the main reason it is not available as a single
package. Lime OCR is nothing but a GUI. In all of our projects we keep
Tesseract untouched and use some pre-processing and
post-processing tools to produce the expected output.

Sorry about 'ninja'; it's just jargon. I mean the most experienced and
adventurous people in the indic-ocr group. Here is a hint about the word:
http://en.wikipedia.org/wiki/Ninja

Text Split is not exactly the tool you require. You need some
trainable, pattern-based tools for character splitting. But for word
splitting you can use it. For example, it can't split out a letter together with
its vowel combination. I guess if we create some patterns (with some
pain, of course) we can use them for many Indian languages. Transliteration
can easily be implemented for these training files, which would come in handy.

I'll try to experiment with Kannada once I'm back.

Please keep me posted.
Regards
Nishad

Nishad TR

Jan 20, 2010, 12:55:04 AM
to indi...@googlegroups.com
Dear Sriranga,

Here comes your solution.

For split_word_20012010102036.txt
h2v-split.exe abarahanews5.txt word 1
(Just word split with new-lines and tabs removed)

For split_word_20012010102221.txt
h2v-split.exe abarahanews5.txt word 2
(Word split with new-lines, tabs and punctuations removed)

For split_word_20012010102336.txt
h2v-split.exe A-kan-utf8.txt word 1
(Only word split with new-lines and tabs removed)

I'm attaching a Kannada word list, which should be useful for you. It has
around 60000 words.

All the best and please keep me posted.

I'll be busy up to 10 pm. But please don't hesitate to shoot.

Wishes & Regards
Nishad

split_word_20012010102036.txt
split_word_20012010102221.txt
split_word_20012010102336.txt
Kannada_word_List.zip

74yrs old

Jan 20, 2010, 2:15:58 AM
to indi...@googlegroups.com
Dear Nishad,
Thanks for the revised tool, which serves two purposes: (1) generating a dictionary-type word list and (2) generating vertical lines for Debayan's trainer GUI.
I am really thankful to you for the Kannada word list, for use with wordlist2dawg. If more word lists are available, they are welcome. I searched for word lists on Google but failed. As such, I am thankful to you for the same.
 In fact, I had downloaded "tesseract-bin-win-r319.7z" into the folder "lime-ocr-r319". I had already downloaded the original tesseract-r319 from the Tesseract website and compiled it in VC++ 2008; the combine.exe generated there also works fine. It appears that tesseract-bin-win-r319 does not include combine.exe.

I have downloaded limeocr.py and experimented with it on WinXP, where I have Python 2.6 installed, but it failed to work.
It appears some modification has to be made in the .py file.

Tesseract-Trainer-1.5.0.1.exe: unfortunately it does not open the editor box even though image/box files are available; as such it failed on WinXP.


With Choicest Blessings of Supreme Lord,
-sriranga(77yrsold)

Nishad TR

Jan 20, 2010, 2:26:35 AM
to indi...@googlegroups.com
Dear Sriranga,

Lime OCR is a single installer; it doesn't require Python. I'll upload
whatever version is available (which I'm expecting by lunch break) soon. Please
don't waste your time on that Python source. Give me time until the lunch
break.

Regards
Nishad

74yrs old

Jan 20, 2010, 6:28:55 AM
to indi...@googlegroups.com
Dear Nishad,
I tested tesseract-r319 and also lime-ocr-r319 (i.e. tesseract-bin-win-r319.7z): a Windows "encountered a problem" message was displayed during training (.tr).
I copied the generated combine.exe into your lime-ocr folder, where it successfully generated the data files. When run, both tesseract r319 and lime-ocr r319 failed with the same Windows error message. I feel there is a bug in the tesseract-r319 source itself. On receipt of your confirmation, I would like to file an issue.

 I also tested the Kannada word list in Tesseract 2.04 by generating the word-dawg (freq/word) files, but I did not find any improvement in the output; in other words, it had no effect on the generated output.
The tested files are attached for your research purposes; this sample text has 97 mistakes.

I have also attached a spellchecker concept for your information. I hope you will be able to develop either a pre/post-processor or a small program for tesseract-ocr.

With Best of Luck,
-sriranga(77yrs old)
Output-mistakes.txt
output.doc
2kan-utf8-txt.JPG
test.txt
spellchecker.JPG

74yrs old

Jan 20, 2010, 8:54:16 AM
to indi...@googlegroups.com
Dear Nishad,
Are you well versed in Python?
-sriranga

Nishad TR

Jan 20, 2010, 9:28:39 AM
to indi...@googlegroups.com
Dear Sriranga,

First of all, Lime OCR is only a GUI for Tesseract.
tesseract-bin-win-r319.7z contains the Tesseract 3 binaries for Windows. I
uploaded our production build in case somebody needs it. It's
built from pure Tesseract code without any modification.

It seems that there is some confusion regarding Lime OCR and Tesseract. We
use the Tesseract engine extensively for some of our projects related to
digital archiving. For some of our clients we optimized Tesseract-GUI,
and we were forced to switch to Windows since some scanner drivers are
missing on Linux. Finally, when it was functioning perfectly, we decided to
give it back to the community.

The name Lime-OCR has been around since the beginning of 2008; at that time we were
using a Qt application, and now it's partially replaced by this open
source Python script.

I couldn't find enough time to clean up Lime-OCR for general use,
because there were many complicated scripts which are only useful for
certain purposes, such as post-processors and image pre-enhancers for some
typewriters and stuff like that. More than that, we need to provide a
useful help resource in order to make it digestible for ordinary users.

We haven't even finished testing it on different Windows versions.
Whatever the case, your attempt to use the script was an inspiration. So here
we deliver Lime-OCR to the general public for the first time.

There are many known issues, such as some GTK bugs and certain logical errors.
There are no help files, and the setup is not properly prepared to handle updates.
After all, we don't know whether it'll work or not.

If you find any issues, please post them in the issues section of the project
site rather than by mail. Once I'm back in the office, I'll assign somebody to
maintain the project.

Please remember, it's not yet complete software. I am just posting it as a
pre-alpha.

http://lime-ocr.googlecode.com/files/LimeOCR-PreAlpha-241-Installer.exe

About the trainer GUI: I don't think it is working as it is supposed to. Use
http://code.google.com/p/jtesseract/ and create a small training data
set with a maximum of 10 words. OCR that same image. Repeat this step
with the training data generated by the trainer GUI, and compare the results.

combine.exe has some bugs. Try to train Tesseract 3 with simple sample
data and see if there is any improvement.

Regards
Nishad TR

NB: Lime-OCR only supports Tesseract 3, but if you wish you can replace
tesseract.exe and tessdata with version 2 and replace langlist.exe with
the one provided in the v2 folder of the installed location.

I'll be online up to 9 PM, so please get back to me if needed.
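
The comparison suggested above (OCR the same image with two sets of training data and compare the results) can be made less subjective by scoring each output against a hand-typed reference text. The sketch below is one possible way to do that with Python 3's standard difflib module. The function names are hypothetical, and the similarity ratio is only a rough indicator, not an official OCR accuracy metric.

```python
import difflib


def similarity(reference, ocr_output):
    """Return a 0..1 character-level similarity between two texts."""
    matcher = difflib.SequenceMatcher(None, reference, ocr_output, autojunk=False)
    return matcher.ratio()


def mismatched_words(reference, ocr_output):
    """Return (expected_words, got_words) pairs where the texts differ."""
    ref_words = reference.split()
    out_words = ocr_output.split()
    matcher = difflib.SequenceMatcher(None, ref_words, out_words, autojunk=False)
    return [
        (ref_words[i1:i2], out_words[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


if __name__ == "__main__":
    # Placeholder strings; in practice read the reference and OCR text files.
    print(similarity("abcd", "abxd"))
    print(mismatched_words("a b c", "a x c"))
```

Running both functions once per training set, against the same reference, gives two numbers and two mismatch lists that can be compared side by side.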



74yrs old

Jan 20, 2010, 11:09:08 AM
to indi...@googlegroups.com
Dear Nishad,
I downloaded Lime OCR and installed it. After installation I observed that there is no language selection dialog. Then I added tessdata to your folder and clicked on the Lime OCR shortcut, but it failed; only the Lime OCR splash screen flashed. I have now uninstalled it and will try again. Please tell me the project website where issues have to be filed.
With reference to the NB: I could not work out where the v2 folder of the installed location is.
-sriranga

Nishad TR

Jan 20, 2010, 11:13:13 AM
to indi...@googlegroups.com
Dear Sriranga,

Please send me a screenshot of the GUI without the language selection
option. The installed directory is where you installed Lime-OCR.

It may be C:\Program Files\Lime-OCR

You can find a replacement program in the v2 folder. (But I never tried that.)

Regards
Nishad

74yrs old

Jan 20, 2010, 11:26:45 AM
to indi...@googlegroups.com
Dear Nishad,
The GUI did not open even after re-installing, whether I click the shortcut or the exe file, so the question of a screenshot does not arise. Since I have added tessdata to the tessdata folder, I hope it will open and work once the problem of the GUI not opening is solved. Under C:\Program Files\Lime OCR, I observed that there is a tessdata folder, to which I have already added the data files, and the langlist has been updated. I also noticed that there is a tesseract.exe. So all those doubts are cleared. Now the question is how to make Lime OCR run.
sriranga

Nishad TR

Jan 20, 2010, 11:31:52 AM
to indi...@googlegroups.com
Dear Sriranga,

First uninstall Lime-OCR, then delete your install directory, i.e.
"C:\Program Files\LIME OCR".

After that, simply open CMD (it will start at a prompt in your profile folder),
then enter this command:

del .lime-ocrrc


Install a fresh Lime OCR, then try to start it.

If it still fails, please get back to me.

Regards
Nishad

74yrs old

Jan 20, 2010, 11:40:51 AM
to indi...@googlegroups.com
Dear Nishad,
Before receiving your message, I had already deleted the shortcut and the Lime OCR folder under C:\Program Files and tried to re-install, but the same thing still happens: when I click on the shortcut, the splash screen shows for one or two seconds and disappears, and no GUI opens.
Even running it from cmd.exe failed.
-sriranga

Indu s

Jan 21, 2010, 2:13:45 AM
to indi...@googlegroups.com
Is there any tool for combining the data files generated for Tesseract 2.04 for use in Tesseract 3.0 (Linux)?
--
Thanks & Regards

Indu

Nishad TR

Jan 21, 2010, 2:24:11 AM
to indi...@googlegroups.com
Dear Indu,

There is a combine executable, probably with a lot of bugs, and the
.traineddata it generated was not useful.

But you can try it; I only have the Windows executable.

You can download the source and build a combine binary on Linux.

Regards
Nishad

74yrs old

Jan 21, 2010, 4:26:05 AM
to indi...@googlegroups.com
Dear Indu,

Please visit the tesseract-ocr forum and see the following subject**, where I have posted all my experiments with data files from different versions of tesseract-ocr. All were performed on WinXP only; I have not tried Ubuntu or Fedora so far, being a newbie to Linux. There is no separate tool for combining data files generated for Tesseract 2.04.
On WinXP, what I did was copy all the kan.* data files of Tesseract 2.04 into the Tesseract 3.0 tessdata folder and run combine.exe in cmd.exe
as "combine tessdata\kan.", which automatically generated a single file, kan.traineddata. Please note that the argument "tessdata\kan." must end with a full stop (.). I don't know whether combine can be built on Ubuntu, and if so, how.
With Best of Luck,
-sriranga(77yrsold)

**A vcproj file for building the traineddata files for 3.0
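
The trailing full stop matters because, as described above, the combine tool is given a file-name prefix such as tessdata\kan. and works on the files that start with it. The Python 3 sketch below only builds that prefix and lists the matching component files so they can be checked before running the tool. It does not call Tesseract, and the helper names are hypothetical.

```python
import glob
import os


def combine_prefix(tessdata_dir, lang):
    """Return the prefix argument, e.g. 'tessdata/kan.' (note the dot)."""
    return os.path.join(tessdata_dir, lang + ".")


def component_files(tessdata_dir, lang):
    """List existing files sharing the prefix, excluding any combined output."""
    prefix = combine_prefix(tessdata_dir, lang)
    matches = glob.glob(glob.escape(prefix) + "*")
    return sorted(p for p in matches if not p.endswith(".traineddata"))


if __name__ == "__main__":
    print(combine_prefix("tessdata", "kan"))
    for path in component_files("tessdata", "kan"):
        print(path)
```

An empty list from component_files would suggest that the data files were copied to a different folder or carry a different language prefix.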

Indu s

Jan 21, 2010, 5:54:49 AM
to indi...@googlegroups.com

Thank you, Nishad and Sriranga sir, for the suggestions.

74yrs old

Jan 21, 2010, 6:02:42 AM
to indi...@googlegroups.com
Indu,
If you succeed in building combine on Linux, please forward an extract of the terminal output for my reference.
-With Best of Luck,
- sriranga(77yrsold)

Indu s

Jan 25, 2010, 12:09:48 AM
to indi...@googlegroups.com
I tried combining the data files using combine_tessdata in the training folder of the newer Tesseract version. The combined data file gets generated, but I'm getting a segmentation fault when running Tesseract with that traineddata.



indus@RD20:~/Desktop/svncheckout/tesseract-ocr-read-only/training$ combine_tessdata /home/indus/Desktop/OCR/MOCR/tessdata/mal.
TessdataManager combined tesseract data files.
Offset for type 0 is -1
Offset for type 1 is 84
Offset for type 2 is -1
Offset for type 3 is 6698
Offset for type 4 is 2383282
Offset for type 5 is 2388564
Offset for type 6 is -1
Offset for type 7 is 2442214
Offset for type 8 is -1
Offset for type 9 is 2442706


./tesseract Mat9.tif outMat.txt -l mal
Segmentation fault

Were you able to run Tesseract successfully with combined data files for Kannada?

74yrs old

unread,
Jan 25, 2010, 12:38:47 AM1/25/10
to indi...@googlegroups.com
Dear Indu,
Please clarify whether you used datafiles of the previous version or datafiles generated in the new version 3.
With choicest Blessings,
-sriranga(77yrsold)

Indu s

unread,
Jan 25, 2010, 12:40:37 AM1/25/10
to indi...@googlegroups.com
I used the newly generated mal.traineddata file.

74yrs old

unread,
Jan 25, 2010, 12:51:51 AM1/25/10
to indi...@googlegroups.com
On my part, the result for tesseract 3.0 was the same as the result of the previous tesseract version. Still, I have to check once again, for which I am generating datafiles now.

Indu s

unread,
Jan 25, 2010, 3:46:34 AM1/25/10
to indi...@googlegroups.com
I used the data files of the previous version of tesseract for getting the combined data file. It seems I have to generate new data files with tesseract 3.0 and then apply combine_tessdata.

74yrs old

unread,
Jan 25, 2010, 3:58:19 AM1/25/10
to indi...@googlegroups.com
Yes. You are correct. It appears you are testing with the same Malayalam .tif. I hope the misspellings in the output will be reduced in version 3.0, or at least be the same as in the previous version. Please send me your feedback.

Indu s

unread,
Jan 25, 2010, 6:49:52 AM1/25/10
to indi...@googlegroups.com
Sorry, I didn't understand... misspelling in the output? I'm trying with the same box files and image file generated for the previous tesseract version.

74yrs old

unread,
Jan 25, 2010, 8:04:43 AM1/25/10
to indi...@googlegroups.com
Then are the output text of tesseract 3.0 and the output text of tesseract 2.04 or 2.03 the same, even when using the same datafiles generated in the previous versions, viz. 2.04 or 2.03? If so, then chances are the misspellings in the output texts are identical?
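One way to check this, sketched in Python with the standard difflib module: compare the two output texts word by word and list only the places where they differ. The function name compare_outputs is made up for illustration, and reading the two output files from disk is left to the user.

```python
import difflib


def compare_outputs(text_a, text_b):
    """Compare two OCR output texts word by word.

    Returns a similarity ratio between 0 and 1, plus a list of
    (operation, words_from_a, words_from_b) for every differing span.
    """
    words_a, words_b = text_a.split(), text_b.split()
    matcher = difflib.SequenceMatcher(None, words_a, words_b, autojunk=False)
    differences = []
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        if tag != "equal":
            differences.append((tag, words_a[a1:a2], words_b[b1:b2]))
    return matcher.ratio(), differences
```

An empty list of differences would mean both versions produced the same words, including the same misspellings.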

74yrs old

unread,
Jan 29, 2010, 10:41:03 AM1/29/10
to indi...@googlegroups.com
Indu,
Will you please give me feedback on my question, since I am curious to know?
With best wishes,
-sriranga(77yrsold)

74yrs old

unread,
Jan 29, 2010, 11:44:02 AM1/29/10
to indi...@googlegroups.com
Dear Nishad,
Thanks for the latest version. It works excellently, according to my expectation. Tested. I found that each character in the single vertical line is not split but retains the original form (i.e. the combination of consonant plus dependent vowel), which is what I wanted. It is also observed that each word of the horizontal line is placed on its own line in the vertical output, similar to a dictionary. This is an excellent tool for turning text files into dictionary-type files for tesseract-ocr, which I had been searching for for a long time.
In a nutshell, your wonderful h2v-split-2.exe serves the purpose required by me and is also helpful to the users of IndicOCR software.
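For readers without the tool, the behaviour described above can be approximated in Python 3. This is only a sketch and not how h2v-split-2 itself is implemented; the function names are made up. It assumes that dependent vowel signs and other marks carry the Unicode categories Mn or Mc, and that the virama (halant) has canonical combining class 9, so that a consonant following a virama stays in the same cluster. ZWJ and ZWNJ are not handled.

```python
import unicodedata

VIRAMA_CLASS = 9  # canonical combining class of Indic virama/halant signs


def split_clusters(word):
    """Split a word into clusters, keeping dependent signs with their base."""
    clusters = []
    join_next = False  # True right after a virama, so the next consonant joins
    for ch in word:
        is_mark = unicodedata.category(ch) in ("Mn", "Mc")
        if clusters and (is_mark or join_next):
            clusters[-1] += ch
        else:
            clusters.append(ch)
        join_next = unicodedata.combining(ch) == VIRAMA_CLASS
    return clusters


def horizontal_to_vertical(text, by_cluster=False):
    """Put each word (or each cluster, if by_cluster is True) on its own line."""
    lines = []
    for word in text.split():
        lines.extend(split_clusters(word) if by_cluster else [word])
    return "\n".join(lines)
```

With by_cluster left as False the result is one word per line, which is the dictionary-type layout; with by_cluster=True a cluster such as কা stays together instead of being split into consonant and vowel sign.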

I am very thankful and grateful to you for the wonderful split tool, which reduces the strain and stress of the users.
With Choicest Blessings of Supreme Lord,
-sriranga(77yrsold)
 

On Wed, Jan 20, 2010 at 6:36 AM, Nishad TR <nishad.tr@gmail.com> wrote:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


Dear Sriranga,

I couldn't respond on time, since I'm traveling. Here's an updated
version.

http://lime-ocr.googlecode.com/files/h2v-split-2.zip

Please go through provided readme, command line parameters were
changed and using a single tool you can split into words or
characters. cleaner got more options like removing punctuations.

I'm sorry to say, but I couldn't pay enough time to test it properly.
I'll update and fix it (if there is any error) once I'm back.

Related tesseract trainer GUI and other tesseract issues, please seek
help from ninjas from indic-ocr community. I myself, never tried this GUI.

If your development is in a hopeful direction, then i humbly request
you to use tesseract-ocr version 3. It gives better results for
Malayalam (According to one of our clients, for typewritten documents
its near 100%, and in v2 it's hardly 85%). Structurally all Dravidian
languages are same, so I think for Kannada you can move along with
tesseract-ocr 3. (I'm not so sure about Hindi and other Devanagari
Scripts)

Please feel free to get back, with any thing in need.

Regards
Nishad



-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iQEcBAEBAgAGBQJLVlcVAAoJEDng0hpUv7qbXBsIALTn7Ch1GPfsXDm4WDTWjan2
QoX2SuuLpSGTIES2y9jcNezsYV27jOuEiQ/5wydt7XQwbsXZENymunqh1tT77LpP
OSs4MZ3vMyOGXSCLT+MpTwb79UcPQV4a9RZR8+jsMctvsnAXD1xmVM799lOHLfl3
W1SwTNZ5kI23gf8IqXg6oOyYLLzMJBwxWR/hXqSnWMM7nn9/2UUBg/gZE5IsCkX7
h4ma6ND6AkhAIaE5mAeItiAEfuRdn4jCWz1dJjnzxhEHQHjopkhPRHDOROPodkTP
zYWI9fdOMmrMNjEM/D1AILK5HpWwCqaCD8E3/s4FW3h8t/bSy7zRL5jDKTXOcbA=
=ZB5B
-----END PGP SIGNATURE-----

