--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To post to this group, send email to tesser...@googlegroups.com.
To unsubscribe from this group, send email to tesseract-oc...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en.
---------- Forwarded message ----------
From: Sriranga(78yrsold) <withbl...@gmail.com>
Date: Tue, Jan 18, 2011 at 5:43 PM
Subject: Re: Tesseract Training
To: KHEM Sochenda <khemso...@gmail.com>
Sochenda,
I tried to generate a data file from your tif and box files. Please see the attached unicharset, in which your IDs do not appear, only the default ID. When I checked, the Unicode values do not tally with your IDs. If you want to assign an ID, I suggest using
the Unicode number, which will then appear in the unicharset. I tested this with my Kannada data and it works; see the sample attached herewith.
with regards,
-sriranga(78yrsold)
Sochenda,
Please see my inline replies below.
On Wed, Jan 19, 2011 at 12:58 PM, KHEM Sochenda <khemso...@gmail.com> wrote:
Dear Dmitry and Sriranga,
Thank you very much for your help. The reason my output file is empty is that I put my personal ID on the glyphs, isn't it?
Dear Dmitry,
Please see the attached image: shall the glyph in the red box be assigned to a single Unicode character, or separated as in the image? This glyph is composed of two other glyphs: one can be represented by a Unicode character, and the other is part of a vowel.
Dear Sriranga,
Do the first several lines in your unicharset file represent characters, or are they just Unicode values that do not represent any character?
These lines, viz. 0ccb 8, 0cd5 8, 20c88 , 30ce0, are Unicode numbers given in place of Kannada characters, to show you. Usually I use the characters (script) themselves rather than Unicode numbers for training. I am using tesseract 3.01 alpha (r-529).
Khmer font is also attached.
Thanks, but I am unable to type with it. However, it appears in Character Map. On receipt of your alphabet list I shall generate the data files and forward them to you.
Chenda,
By guesswork, I have edited the box file using another tool, owler.exe (which is for English only), attached herewith. The advantage of the attached owler.exe is that you can type a character or its hexadecimal code by pressing Tab. A consonant or an independent vowel may have a single box, and a consonant or independent vowel plus a dependent vowel must also have a single box. (The said owler tool is not suitable for Kannada, and as such I am not using it.)
If you recognize the same tif file that was used for training, the output should naturally be displayed correctly. If you use a tif other than the one used for training, the output will naturally have misspellings, which can be corrected by post-processing software. The same problem occurred for Kannada too. I hope you will succeed in generating the trained data file, since it is no more complex than Kannada script.
On receipt of the corrected box file, I shall generate the trained data file.
With Best Wishes,
-sriranga(78yrs)
On Fri, Jan 21, 2011 at 7:49 AM, KHEM Sochenda <khemso...@gmail.com> wrote:
Dear Dmitry and Sriranga,
Here are my training results. When I ran recognition on the same image that was used for training, the result was perfect. When I tried the attached test image, there seemed to be problems recognizing the characters.
Please tell me your thoughts about this.
Best Regards,
Sochenda
On Thu, Jan 20, 2011 at 11:47 PM, KHEM Sochenda <khemso...@gmail.com> wrote:
Dear Sriranga,
Here is my training box file. Editing a box file is really tedious. I just found some glyphs that I haven't put codes for yet, but it is difficult to find them in the box editor you gave me, and also with pytesseracttrainer.py, as it is too slow.
Best Regards,
Sochenda
On Thu, Jan 20, 2011 at 4:49 PM, Sriranga(78yrsold) <withbl...@gmail.com> wrote:
box file for editing
On Thu, Jan 20, 2011 at 2:46 PM, KHEM Sochenda <khemso...@gmail.com> wrote:
Dear Dmitry and Sriranga,
But, Sriranga, I guess your computer cannot render KH language well. I will send you an image instead ok?
Best Regards,
Sochenda
On Thu, Jan 20, 2011 at 4:08 PM, Sriranga(78yrsold) <withbl...@gmail.com> wrote:
Attached is a zip file containing the owler exe file. Before unzipping, please delete the word {"OM"} first and then unzip.
With the help of owler, you can edit the box file according to your requirements. After editing the box file, please forward it to me
for generating the traineddata file, or if you are able to generate the traineddata file, you can do it yourself - no problem.
With best of Luck,
-sriranga(78yrs)
Dear Dmitry,
Sorry, I could not post in the forum due to the attached files. Hence I am endorsing a copy to you.
On Thu, Jan 20, 2011 at 2:22 PM, Sriranga(78yrsold) <withbl...@gmail.com> wrote:
Sochenda,
Please find attached the box file with its khtext.png for editing. I am sending khtext.tif and the owler tool to you separately for editing the box file, since I do not know the Khmer language and am unable to type it on the keyboard. After editing the box file, please return it to me for further processing.
With best of Luck,
-sriranga(78yrs)
2011/1/20 KHEM Sochenda <khemso...@gmail.com>
Warm regards,
Dmitry Silaev
Dmitry,
It appears that Khem has not endorsed a copy to you, so I am forwarding it for your valuable guidance/comments, which may also help me in my Kannada project.
with regards,
-sriranga(78yrs)
---------- Forwarded message ----------
From: KHEM Sochenda <khemso...@gmail.com>
Date: Wed, Feb 16, 2011 at 7:45 AM
Subject: Re: Tesseract Training
To: "Sriranga(78yrsold)" <withbl...@gmail.com>
Dear Sriranga,
Below are the steps I followed for the training:
--------------------------------------------------------------------------------------------------------
- I created 3 pages of training images, as you can see in the attachments (khm.limons1.1 is page 1, khm.limons1.2 is page 2, and khm.limons1.3 is page 3).
- I created a box file for every page (khm.limons1.1.box and so on) with the command lines:
"tesseract khm.limons1.1.tif khm.limons1.1 batch.nochop makebox" for page 1, "tesseract khm.limons1.2.tif khm.limons1.2 batch.nochop makebox" for page 2, and the same for page 3.
- Then I edited the box files; the final result is in the attachments.
- I merged the images together into a single file (khm.limons1.0.tif).
- I merged the three box files into a single box file with page numbers assigned (khm.limons1.0.box).
- I ran the command to train the single file: "tesseract khm.limons1.1.tif khm.limons1.0.tif khm.limons1.0 nobatch box.train". The result looked okay at this step. (My purpose in merging into one file is that I want a single font to be in just one .tr file.)
- I then ran the command "unicharset_extractor khm.limons1.0.box" to extract every single glyph from the box file. The result looked okay.
- Then I tried running "mftraining –U unicharset –O khm.unicharset khm.limons1.0.tr" and "cntraining khm.limons1.0.tr" to extract the features. I failed at this step.
Since I had no clue how to get the above idea to work, I omitted steps 4 and 5 and skipped to points 6, 7, and 8 using the separate box files, and I got the traineddata as in the attached file. Having three separate .tr files is not what I want.
Currently I use the obtained trained data for my temporary OCR system. What I wish to do is add other fonts, but the number of .tr files is limited to 32 only... This is my concern.
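The box-merging step described above (three per-page box files combined into one, with a page number on each line) can be sketched in Python. This is only an illustration, not the method used in this thread: the function name merge_box_files is hypothetical, and it assumes each box line has the form "<glyph> <left> <bottom> <right> <top> [<page>]" with pages numbered from 0 in the order the files are given.

```python
from pathlib import Path


def merge_box_files(page_box_paths, merged_path):
    """Merge per-page box files into one, rewriting the page field.

    Assumption: each line is "<glyph> <left> <bottom> <right> <top> [<page>]"
    and pages are numbered from 0 in the order the files are listed.
    Returns the number of box lines written.
    """
    merged_lines = []
    for page_index, box_path in enumerate(page_box_paths):
        text = Path(box_path).read_text(encoding="utf-8")
        for line in text.splitlines():
            fields = line.split()
            if len(fields) < 5:
                continue  # skip blank or malformed lines
            glyph, coords = fields[0], fields[1:5]
            # Drop any existing page field and write the new page index.
            merged_lines.append(" ".join([glyph, *coords, str(page_index)]))
    Path(merged_path).write_text("\n".join(merged_lines) + "\n", encoding="utf-8")
    return len(merged_lines)


# Example call with the file names mentioned in this message:
# merge_box_files(
#     ["khm.limons1.1.box", "khm.limons1.2.box", "khm.limons1.3.box"],
#     "khm.limons1.0.box",
# )
```

The page order passed to the function has to match the page order of the merged tif, otherwise the boxes will point at the wrong page.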
Best Regards,
Sochenda
> It is
> presumed that commandline for (WinXP) should be as follows:
> eg= " c:\tess\copy 001.tr + 002.tr + 003.tr + oo4.tr > 1234.tr or
> Multiimage.tr" which may kindly be confirmed. OR the correct commandline for
> concatenating using the command "copy" may kindly be intimated.
This command won't do what you want. First, you don't need to indicate
a path before "copy" as it is a built-in command of the MS-DOS command
processor, while prepended with a path it is treated as a name of an
executable within the "c:\tess\" directory and it doesn't exist.
Second, you don't need the ">" as it will direct all informational
output of the "copy" command (not files' contents) to "1234.tr". A
destination file should be specified at the end of the command after a
space. Therefore your command line should be
copy 001.tr + 002.tr + 003.tr + oo4.tr 1234.tr
Warm regards,
Dmitry Silaev
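For anyone who prefers not to depend on the shell, the same concatenation can be sketched in Python. This is a minimal illustration under stated assumptions: the function name concatenate_tr_files is hypothetical, the .tr files are assumed small enough to read into memory, and a newline is added between parts only when a file does not already end with one, so that the last line of one file cannot run into the first line of the next.

```python
def concatenate_tr_files(tr_paths, combined_path):
    """Write the contents of each .tr file, in order, to combined_path.

    Files are handled as bytes so nothing is re-encoded. Returns the
    number of bytes written.
    """
    written = 0
    with open(combined_path, "wb") as combined:
        for tr_path in tr_paths:
            with open(tr_path, "rb") as part:
                data = part.read()
            combined.write(data)
            written += len(data)
            # Keep the parts separated if a file lacks a trailing newline.
            if data and not data.endswith(b"\n"):
                combined.write(b"\n")
                written += 1
    return written


# Example call with the file names from the question:
# concatenate_tr_files(["001.tr", "002.tr", "003.tr", "oo4.tr"], "1234.tr")
```

The destination should not be one of the input files, since it is truncated when opened for writing.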
On Thu, Feb 17, 2011 at 9:44 AM, Sriranga(78yrsold)
<withbl...@gmail.com> wrote:
> Dmitry,
> Thanks for the valuable guidance. However, I could not understand how to
> concatenate (simply "copy") all the resulting .tr files together. It is
> presumed that the commandline (for WinXP) should be as follows:
> eg= " c:\tess\copy 001.tr + 002.tr + 003.tr + oo4.tr > 1234.tr or
> Multiimage.tr" which may kindly be confirmed. OR the correct commandline for
> concatenating using the command "copy" may kindly be intimated.
> With Warmest Regards,
> -sriranga(78yrs)
> > On Wed, Feb 16, 2011 at 11:58 AM, Dmitry Silaev <daemo...@gmail.com> wrote:
> >>
> >> Guys,
> >>
> >> If you have more than one box/tiff pair, you can train (i.e. generate a
> >> .tr file) for each of these pairs separately.
> >>
> >> Then you can concatenate (simply "cat" or "copy") all resulted .tr files
> >> together and then run all training tools on the single final .tr file. This
> >> relieves you from the 32 file limit.
> >>
> >> For your convenience you can craft a batch file or shell script which
> >> would train, concatenate, cluster, etc. in one run. You should analyze all
> >> errors carefully though.
> >>
> >> Warm regards,
> >> Dmitry Silaev