Re: problem of generating for Kannada data files.


74yrs old

Dec 13, 2009, 8:53:19 AM
to indi...@googlegroups.com
Posted again because the first copy went to the wrong address.

2009/12/13 74yrs old <withbl...@gmail.com>
I downloaded version 1.1.3 again and tried once more.

I have attached screenshots, which are self-explanatory. I have also attached the Kan.alphabets files (post vowels and rest) so that you can experiment at your end.

I could not work out where I went wrong. An error message about the folder cropped up. I shall check with the Tamil alphabet and report back to you soon.

Awaiting your valuable guidance.
With warm regards,
-sriranga.



consonants_conjuncts.txt
Screenshot-1.png
Screenshot-1a.png
Screenshot-3.png
extract of terminal
post_vowels.txt
rest.txt
Kannada.unicharset

Debayan Banerjee

Dec 17, 2009, 5:00:56 AM
to indi...@googlegroups.com


2009/12/13 74yrs old <withbl...@gmail.com>

> Posted again because the first copy went to the wrong address.
>
> 2009/12/13 74yrs old <withbl...@gmail.com>
> I downloaded version 1.1.3 again and tried once more.
>
> I have attached screenshots, which are self-explanatory. I have also attached the Kan.alphabets files (post vowels and rest) so that you can experiment at your end.
>
> I could not work out where I went wrong. An error message about the folder cropped up. I shall check with the Tamil alphabet and report back to you soon.

You added .txt to the file names of consonants_conjuncts and rest. Don't do that. Simply name them consonants_conjuncts and rest.
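
A minimal sketch of that renaming in Python 3, assuming the alphabets folder is an ordinary directory. The helper name `strip_txt_extension` is hypothetical and not part of the trainer; it only drops a trailing ".txt" and never overwrites an existing file.

```python
import os

def strip_txt_extension(folder):
    """Rename every '<name>.txt' in folder to '<name>'; return the new names."""
    renamed = []
    for name in sorted(os.listdir(folder)):
        base, ext = os.path.splitext(name)
        src = os.path.join(folder, name)
        if ext.lower() == ".txt" and os.path.isfile(src):
            dst = os.path.join(folder, base)
            if not os.path.exists(dst):  # never overwrite an existing file
                os.rename(src, dst)
                renamed.append(base)
    return renamed
```

Calling it on a folder that holds rest.txt and consonants_conjuncts.txt would leave files named rest and consonants_conjuncts.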

--
Regards,
Debayan Banerjee


74yrs old

Dec 17, 2009, 7:23:09 AM
to indi...@googlegroups.com
As per your valuable guidance, I deleted the ".txt" extension and tried again, but it met the same fate. I also tried Tamil without the ".txt" suffix, and that failed the same way. On hearing from you, I shall forward the Tamil alphabets folder to you.
I request you kindly to test the kan.alphabets folder already forwarded to you; you will be able to locate where I have made the mistake. On receipt of your feedback, I shall test Kannada as well as Tamil again and report back to you. If the data files are at least generated without any problem this time, I shall be very happy.
With Regards,
-sriranga(77yrsold)

74yrs old

Dec 18, 2009, 12:09:26 AM
to indi...@googlegroups.com
Dear Banerjee,
I was wondering why your tool failed to generate data files for Kannada or Tamil. As a test, I have just run the training on hin.alphabets and, to my surprise, generating the Hindi data files failed in the same way as for Kannada and Tamil. The same pop-up message, "wrong folder selection", appeared for all three languages, viz. Hindi, Kannada and Tamil.
This is brought to your kind notice for the needful.
Regards,
-sriranga(77yrsold)
Screenshot-1.png

74yrs old

Dec 18, 2009, 7:43:33 AM
to indi...@googlegroups.com
Dear Sarkar,

Sorry, I made a mistake in selecting the output folder and the <lang>.alphabets folder. Just now I looked at Indu's screenshot in the issue and noticed that the output folder is selected first and the alphabets folder second.
So I experimented again and successfully generated the image file and box file, as well as the data files, for Hindi and Kannada. For Tamil, however, only the box file was generated; neither the image file nor the Tamil tessdata file was produced. According to tesseract.log there are many "apply box" failures.

In a nutshell, your tool is in order and there is no problem. The only thing is that one has to select the output and .alphabets folders carefully.
With regards,
-sriranga(77yrsold)

Indu s

Dec 21, 2009, 5:59:13 AM
to indi...@googlegroups.com
Sir,

I tried with the Kannada data files you provided in your earlier post, but I am getting this error:

  File "trainer_gui.py", line 143, in train
    tesseract_trainer.generate.draw(font_string,15,self.language,tesseract_trainer.file.read_file(self.DirectoryIn),self.DirectoryOut)
  File "/home/ilcg/TesseractIndic-Trainer-GUI-0.1.3/tesseract_trainer/generate.py", line 86, in draw
    deltax=bbox[2]-bbox[0]
TypeError: 'NoneType' object is unsubscriptable
--
Thanks & Regards

Indu

74yrs old

Dec 21, 2009, 6:30:34 AM
to indi...@googlegroups.com
Indu,
Yes, what you said is correct. I then checked the rest file and noticed that there were blank (white space) lines between some characters in the vertical list. As a trial, I removed those blank lines with the backspace key, and only then was everything generated successfully. For example:

test
(white space)
test2

was changed to:

test
test2

That is, the blank line between test and test2 was removed by pressing backspace, moving test2 upward.
With Best Wishes,
-sriranga(77yrs old)

74yrs old

Dec 21, 2009, 8:01:51 AM
to indi...@googlegroups.com
Indu,
I have another, similar problem, and have attached the file "rest" in this connection. Because the tool failed to generate output, I presumed that it does not accept more than the one default file, viz. rest, in the "Kan.alphabet" folder, so I copied and pasted all the contents of the rest1, rest2, rest3 and rest4 files into "rest".
I seek your valuable guidance.
With regards,
-sriranga(77yrsold)
Screenshot-3.png
rest
rest-1
rest2
rest3
rest4

74yrs old

Dec 22, 2009, 11:10:46 AM
to indi...@googlegroups.com
Indu-mss,
In continuation of my previous email, I request further clarification as follows:

Is the tool restricted to one file only, e.g. "rest", and is the length of the vertical line in "rest" also restricted? What is the solution if the content exceeds the limitation of vertical lines in "rest"?
I tried using rest, rest1, rest2, rest3 and rest4 in the kan.alphabet folder.

Traceback (most recent call last):
  File "./trainer_gui.py", line 136, in train
    tesseract_trainer.generate.draw(font_string,15,self.language,tesseract_trainer.file.read_file(self.DirectoryIn),self.DirectoryOut)
  File "/home/sangeetha/Desktop/TesseractIndic-Trainer-GUI-0.1.3/tesseract_trainer/generate.py", line 83, in draw
    deltax=bbox[2]-bbox[0]
TypeError: 'NoneType' object is unsubscriptable

I tested again after deleting half the contents of "rest", but the same error was generated as above. I could not understand the 'NoneType' error.
Immediate clarification is requested.
with Best Wishes,
-sriranga(77yrsold)

Debayan Banerjee

Dec 23, 2009, 12:32:32 AM
to indi...@googlegroups.com


2009/12/22 74yrs old <withbl...@gmail.com>

> Indu-mss,
> In continuation of my previous email, I request further clarification as follows:
>
> Is the tool restricted to one file only, e.g. "rest", and is the length of the vertical line in "rest" also restricted? What is the solution if the content exceeds the limitation of vertical lines in "rest"?

The tool only supports one 'rest' file. You are not allowed to work with rest1, rest2, etc.
Why do you need more than one 'rest' file? What do you mean by "exceeds the limitation of vertical lines"?



--
Regards,
Debayan Banerjee


Debayan Banerjee

Dec 23, 2009, 12:36:32 AM
to indi...@googlegroups.com


2009/12/23 Debayan Banerjee <deba...@gmail.com>

You get a 'NoneType' error if you leave empty lines anywhere in the list in 'rest'. Make sure there are no empty lines. This check should really be built into the application itself; I will add it in a subsequent release.
Here is a way to remove empty lines from a file quickly: http://soft.zoneo.net/Linux/remove_empty_lines.php .
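
For anyone who prefers to do the clean-up in Python, here is a minimal sketch (Python 3). The function name `remove_blank_lines` is hypothetical and not part of the trainer. It works on raw bytes so it makes no assumption about the text encoding, and it treats whitespace-only lines as blank as well. Note that it also normalises line endings to "\n".

```python
def remove_blank_lines(path):
    """Rewrite the file at path without empty or whitespace-only lines.

    Returns the number of lines dropped. Line endings become "\n".
    """
    with open(path, "rb") as f:
        lines = f.read().splitlines()  # splits on \n, \r\n and \r
    kept = [line for line in lines if line.strip()]
    with open(path, "wb") as f:
        for line in kept:
            f.write(line + b"\n")
    return len(lines) - len(kept)
```

Running `remove_blank_lines("rest")` inside the alphabets folder rewrites the file in place, so keep a backup copy first.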
 



--
Regards,
Debayan Banerjee







74yrs old

Dec 23, 2009, 9:30:50 AM
to indi...@googlegroups.com
Banerjee,
Thanks for the valuable guidance. I copied the sed commands as follows:

sed '/^$/d' rest > tt
mv tt rest


I then ran "./trainer.py" in the terminal, but it generated the same error as reported earlier, reproduced below. It appears the sed commands had no effect on the rest file.
Traceback (most recent call last):
  File "./trainer_gui.py", line 136, in train
    tesseract_trainer.generate.draw(font_string,15,self.language,tesseract_trainer.file.read_file(self.DirectoryIn),self.DirectoryOut)
  File "/home/sangeetha/Desktop/TesseractIndic-Trainer-GUI-0.1.3/tesseract_trainer/generate.py", line 83, in draw
    deltax=bbox[2]-bbox[0]
TypeError: 'NoneType' object is unsubscriptable

Only bigimage.box was generated; the image file was not. I have attached the "rest" file so that you can test it.
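
The thread does not establish why the sed command had no visible effect. One possibility, offered as an assumption rather than a confirmed diagnosis, is that some of the "blank" lines contain spaces or a Windows carriage return, which the pattern ^$ does not match. A small Python 3 sketch to list such lines (the function name `find_blank_like_lines` is hypothetical):

```python
def find_blank_like_lines(path):
    """Return (line_number, reason) pairs for lines that hold no visible text."""
    problems = []
    with open(path, "rb") as f:
        for number, raw in enumerate(f, start=1):
            body = raw.rstrip(b"\n")
            if body == b"":
                problems.append((number, "empty"))
            elif body == b"\r":
                problems.append((number, "carriage return only"))
            elif body.strip() == b"":
                problems.append((number, "whitespace only"))
    return problems
```

Only the lines reported as "empty" would be removed by a pattern that matches strictly empty lines; the other two kinds would survive it.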
With Regards,
-sriranga(77yrsold)
rest

74yrs old

Dec 23, 2009, 9:43:28 AM
to indi...@googlegroups.com
Dear Debayan,
My replies are noted below under your query.


On Wed, Dec 23, 2009 at 11:02 AM, Debayan Banerjee <deba...@gmail.com> wrote:

> 2009/12/22 74yrs old <withbl...@gmail.com>
>
>> Indu-mss,
>> In continuation of my previous email, I request further clarification as follows:
>>
>> Is the tool restricted to one file only, e.g. "rest", and is the length of the vertical line in "rest" also restricted? What is the solution if the content exceeds the limitation of vertical lines in "rest"?

Reply: Since the tool failed with an error, I split the contents of "rest" into four files, rest1, rest2, rest3 and rest4, presuming that the tool does not support long files and that there must be some limit on the length of the text files. Now that you have clarified that the tool supports only one file, I have copied all the contents of rest1, rest2, rest3 and rest4 into "rest" and tried to run it, but it failed. For details, please see my second email.

> The tool only supports one 'rest' file. You are not allowed to work with rest1, rest2, etc.
> Why do you need more than one 'rest' file? What do you mean by "exceeds the limitation of vertical lines"?

Reply: By "exceeds the limitation of vertical lines" I mean the length of the text from top to bottom in one file.

> --
> Regards,
> Debayan Banerjee



Indu s

Dec 23, 2009, 11:39:33 AM
to indi...@googlegroups.com
Sorry, I couldn't test the tool as I have come home for the holidays.




Thanks & Regards

Indu

sriranga(77yrsold) location: Bangalore

Dec 23, 2009, 9:35:24 PM
to indic-ocr
Banerjee,
I tested the sed program in WinXP also. It had no effect on the Kannada scripts.
This is for your information.
-sriranga


Debayan Banerjee

Dec 24, 2009, 1:36:12 AM
to indi...@googlegroups.com


2009/12/24 sriranga(77yrsold) location: Bangalore <withbl...@gmail.com>

> Banerjee,
> I tested the sed program in WinXP also. It had no effect on the Kannada scripts.
> This is for your information.
> -sriranga

sed works on GNU/Linux systems.


--
Regards,
Debayan Banerjee


74yrs old

Dec 24, 2009, 5:32:43 AM
to indi...@googlegroups.com
Banerjee,
I agree with your point that sed works on Linux. But I am not able to run the tool on the rest file attached to my previous email. Please note that the sed program was applied to rest, but the 'NoneType' error is still displayed before finishing. Kindly test the rest file attached to the previous email. I would like to know where I made a mistake.
With regards,
-sriranga(77yrsold)

Debayan Banerjee

Dec 24, 2009, 9:55:20 AM
to indi...@googlegroups.com


2009/12/24 74yrs old <withbl...@gmail.com>

> Banerjee,
> I agree with your point that sed works on Linux. But I am not able to run the tool on the rest file attached to my previous email. Please note that the sed program was applied to rest, but the 'NoneType' error is still displayed before finishing. Kindly test the rest file attached to the previous email. I would like to know where I made a mistake.
> With regards,

Just attach the rest file, and only the rest file, and send it to the list.
 



--
Regards,
Debayan Banerjee


74yrs old

Dec 24, 2009, 10:54:04 AM
to indi...@googlegroups.com
Dear Debayan Banerjee,
Thanks for taking an interest. As you desired, I have attached the rest file.
Awaiting your valuable guidance/comments.
With regards,
-sriranga(77yrsold)
rest

Debayan Banerjee

Dec 24, 2009, 11:24:56 AM
to indi...@googlegroups.com


2009/12/24 74yrs old <withbl...@gmail.com>

> Dear Debayan Banerjee,
> Thanks for taking an interest. As you desired, I have attached the rest file.
> Awaiting your valuable guidance/comments.
> With regards,
> -sriranga(77yrsold)

There were a few blank lines at the end of the file. I have removed them and re-attached it.
I have no Indic rendering on this machine running Windows XP (in a cyber cafe) and hence could not look at the contents of the file in detail.
On a slightly different note, the thousands of character classes in Indic scripts are not suitably handled by the nearest-neighbour character classifier in Tesseract. Nearest-neighbour classifiers are the simplest form of classifier and are good for a small character set, like English.
Recently some breakthroughs have been made in this field, and IIIT Hyderabad in particular has shown in publications that Support Vector Machines <http://en.wikipedia.org/wiki/Support_vector_machine> can be used reliably in classifiers with 1000+ classes. Jinesh, who is also on this list, can explain this better.
For simpler Indic scripts like Bengali, Tesseract can still work if we employ techniques such as tilting characters by 90 degrees. That would still help in reducing the number of character classes.




--
Regards,
Debayan Banerjee


rest

Debayan Banerjee

Dec 24, 2009, 11:47:05 AM
to indi...@googlegroups.com


2009/12/24 Debayan Banerjee <deba...@gmail.com>

Go ahead and read this paper http://cvit.iiit.ac.in/papers/tejo07Support.pdf .



--
Regards,
Debayan Banerjee


74yrs old

Dec 24, 2009, 11:49:54 AM
to indi...@googlegroups.com

74yrs old

Dec 24, 2009, 11:52:39 AM
to indi...@googlegroups.com
Dear Banerjee,
I performed a beta test after downloading the "rest" file. There is no change from the position reported earlier. The attached screenshot is self-explanatory. No image was generated, only the box file.
With regards,
-sriranga(77yrsold)
Screenshot.png

Debayan Banerjee

Dec 24, 2009, 12:02:30 PM
to indi...@googlegroups.com

Try with this attached file. I missed the last blank line.


--
Regards,
Debayan Banerjee


rest

74yrs old

Dec 24, 2009, 12:13:03 PM
to indi...@googlegroups.com
Dear Banerjee,
I forgot to inform you that I perform the beta testing only in Fedora-11, and sometimes also in Ubuntu if the test has succeeded there, and not in WinXP, which does not support the tool.
With warmest regards,
-sriranga

74yrs old

Dec 24, 2009, 12:37:20 PM
to indi...@googlegroups.com
Dear Banerjee,
I downloaded the revised rest and performed a beta test, but the position is the same, with no improvement and no change in the error message. Screenshots attached.
Screenshot-5.png
Screenshot-4.png

74yrs old

Dec 24, 2009, 8:04:14 PM
to indi...@googlegroups.com
Dear Banerjee,
Good morning. I am interested to know the result of your analysis of the issue. I would like to know where the mistake happens, to improve my knowledge.
Awaiting your further valuable guidance on generating the data files.
With regards,
-sriranga

74yrs old

Dec 25, 2009, 4:46:49 AM
to indi...@googlegroups.com
Dear Debayan Banerjee,
Thanks to the valuable guidance you have given from time to time, I succeeded in generating the image file as well as the tessdata for Kannada (all maximum combinations).
I used the program "grep .. <input> <output>" from the site you quoted, viz. http://soft.zoneo.net/Linux/remove_empty_lines.php .

First I tested it in WinXP, where it ran successfully on a test sample (it works well even with two spaces). Then I checked rest1, rest2, rest3 and rest4 with the grep program, copied all their contents into the "rest" file, and checked that with grep again.
Everything was OK. Then I copied the verified "rest" into Fedora-11 and tested it just now.

One advantage of WinXP is that one can read the local language clearly and easily locate spaces between two characters, whereas in Linux it is difficult to read the local language properly. Anyhow, the major problems are solved.

I have attached rest for your evaluation, in case you wish to include the grep program in the tool.

Now the question is how to copy the newly generated tessdata into the Tesseract tessdata folder, since I am a newbie to Linux (Fedora-11).
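
One way to do the copy, sketched in Python 3 under stated assumptions: the generated files are assumed to be named "<lang>.<something>" (for example kan.unicharset), and both folder paths are placeholders, since the location of the tessdata folder depends on how Tesseract was installed. The function name `copy_traineddata` is hypothetical.

```python
import glob
import os
import shutil

def copy_traineddata(output_dir, tessdata_dir, lang):
    """Copy every '<lang>.*' file from output_dir into tessdata_dir."""
    copied = []
    for src in sorted(glob.glob(os.path.join(output_dir, lang + ".*"))):
        if os.path.isfile(src):
            shutil.copy(src, tessdata_dir)
            copied.append(os.path.basename(src))
    return copied

# Example call with placeholder paths (adjust both to your system):
# copy_traineddata("/path/to/trainer/output", "/path/to/tessdata", "kan")
```

If the tessdata folder belongs to root, the script has to be run with sufficient permissions to write there.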
With Warmest regards,
-sriranga(77yrs Old)
rest
bigimage.box
bigimage.png

74yrs old

Dec 25, 2009, 6:04:09 AM
to indi...@googlegroups.com
Debayan Banerjee,

I could not attach the typescript: it is more than 5000 KB and bounced back to me.
I observed that tesseract.log is generated in Fedora-11 but not in Ubuntu 9.04. The max.classes errors are identical.
With regards,
-sriranga

---------- Forwarded message ----------
From: 74yrs old <withbl...@gmail.com>
Date: Fri, Dec 25, 2009 at 4:09 PM
Subject: Re: problem of generating for Kannada data files.
To: indi...@googlegroups.com


In continuation of my previous email, I would like to inform you that I tried Ubuntu 9.04, where no tesseract.log was generated. An error message about exceeding max.classes was displayed. I have attached a screenshot as well as the typescript. I ran the training twice: after the first training I moved the tess data files to the trash and trained again. The tif file was not completed. The rest file is the same one used for Fedora-11. Given the apply-box failures and the max.classes overflow, will Tesseract run at all?
With regards,
-sriranga(77yrsold)
Screenshot-2.png
tesseract.log

74yrs old

Dec 25, 2009, 9:17:45 AM
to indi...@googlegroups.com
Banerjee,
How can I fix the error "Size of unicharset of boxes is greater than MAX_NUM_CLASSES (8192)" in the relevant Tesseract source code?
-sriranga

74yrs old

Dec 25, 2009, 9:19:34 AM
to indi...@googlegroups.com
Please read my previous message as: how can I increase MAX_NUM_CLASSES in the relevant Tesseract source code?

Debayan Banerjee

Dec 29, 2009, 4:52:38 AM
to indi...@googlegroups.com


2009/12/25 74yrs old <withbl...@gmail.com>

> Banerjee,
> How can I fix the error "Size of unicharset of boxes is greater than MAX_NUM_CLASSES (8192)" in the relevant Tesseract source code?

I am aware of this particular technique. I even used it while training Hindi. The unfortunate fact is that increasing MAX_NUM_CLASSES is useless, since Tesseract's classifier cannot handle that many classes in any case. The number 8192 itself is far higher than what it can handle.
You may increase the number to any value and recompile, but with little or no effect.
The solution is either to reduce the number of character classes drastically by some method, or to use a modern classifier back end.
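
To see how far a training set is from that limit before running the trainer, a rough count can be made in Python 3. This is a sketch only: it assumes every non-blank line in the list files is one character class and that the files are UTF-8, and the count Tesseract itself derives from the box file may differ. The function name `count_classes` is hypothetical.

```python
MAX_NUM_CLASSES = 8192  # the limit quoted in the error message above

def count_classes(paths):
    """Count distinct non-blank entries across the given list files."""
    classes = set()
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                entry = line.strip()
                if entry:
                    classes.add(entry)
    return len(classes)
```

Comparing `count_classes([...])` against `MAX_NUM_CLASSES` gives a quick indication of whether the lists need trimming.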

--
Regards,
Debayan Banerjee


74yrs old

Dec 29, 2009, 5:19:30 AM
to indi...@googlegroups.com
Dear Banerjee,
What you said is perfectly correct, and I agree with your point. I had tried increasing the number from 8192 to 17000 and recompiling in WinXP, but it failed for version 2.03 and even for version 3.0.

Instead of training on the complete set of alphabets (i.e. consonants plus vowels etc.),

(1) it is better to pick characters such as consonants_conjuncts from real text files. But I find it difficult to pick out such characters manually, so an automated tool is required; OR
(2) a tool or py/sed program should be designed so that all the text lines in a text file are converted to a vertical line for TrainerGUI; OR
(3) TrainerGUI should have an option button to accept a text file containing either horizontal lines or vertical lines.
For example: horizontal line = testing for accuracy
Vertical line =
t
e
s
t
i
n
g

f
o
r

a
c
c
u
r
a
c
y
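
Suggestion (2) could be sketched in Python 3 roughly as follows. This is a simplification, not a feature of the trainer: it keeps combining vowel signs with their base character and keeps conjuncts whole by joining the character that follows a virama, but it does not handle joiner characters such as ZWJ/ZWNJ. Spaces are dropped rather than turned into blank lines. The function name `to_vertical` is hypothetical.

```python
import unicodedata

def to_vertical(text):
    """Split text into one cluster per line, joining marks to their base."""
    clusters = []
    join_next = False  # True right after a virama, so conjuncts stay whole
    for ch in text:
        if ch.isspace():
            join_next = False
            continue
        is_mark = unicodedata.category(ch) in ("Mn", "Mc", "Me")
        if clusters and (is_mark or join_next):
            clusters[-1] += ch
        else:
            clusters.append(ch)
        # Viramas carry canonical combining class 9 in Unicode.
        join_next = unicodedata.combining(ch) == 9
    return "\n".join(clusters)
```

For plain Latin text this yields one letter per line; for Indic text each line holds a consonant with its vowel sign, or a whole conjunct.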
        

With Regards,
-sriranga(77yrsold)

74yrs old

Dec 30, 2009, 4:26:43 AM
to indi...@googlegroups.com

Dear Banerjee,
I am interested to know your comments on, or reaction to, my humble suggestions.
With regards,
-sriranga

sriranga(77yrsold) location: Bangalore

Jan 30, 2010, 2:20:16 AM
to indic-ocr, Nishad TR
Dear Banerjee,
I downloaded the "rest" file that you rectified for the missing last blank line and tested it with your trainer-gui 1.1.3, also in WinXP, but it failed to generate the relevant files; see the attached screenshot, which is self-explanatory. Awaiting your further guidance.
With regards,
-sriranga(77yrsold)

On Dec 24 2009, 10:02 pm, Debayan Banerjee <debaya...@gmail.com> wrote:

>  rest

74yrs old

Jan 30, 2010, 2:30:52 AM
to indic-ocr
The screenshot could not be attached earlier; it is attached now. A copy of rest is also attached for ready reference and research at your end.
With regards,
-sriranga(77yrsold)
rest
Screenshot.png