Tesseract Training


KHEM Sochenda

Jan 14, 2011, 2:25:10 AM
to tesser...@googlegroups.com
Dear Tesseract Team,

In the step of training a new language, we have to assign a Unicode value to each box.
I would like to know what to do if a shape is composed of several Unicode characters.
Is there any way to assign only an ID to each box in Tesseract?

Thank you very much in advance for your response.

Best Regards,
Chenda

Dmitry Silaev

Jan 14, 2011, 3:05:54 PM
to tesser...@googlegroups.com
Chenda,

In fact Tesseract doesn't care whether you train on a real language's letters, or which language a letter belongs to. Put simply, Tess only saves the mapping from feature sets obtained in training to Unicode ids. This implies that during training you can assign virtually any character code to virtually any glyph (to be exact, to a connected component or to a set of connected components).

If your language's script comprises a reasonable number of joint character combinations, then during training you can assign every such combination a predefined Unicode id (some restrictions apply). Later, when running recognition, you should do some post-processing to decode your predefined ids into the real language's character sequences.

For good results all this requires you to develop a training file pre-processor (mapping: language char combinations -> provisional ids) and a recognition result post-processor (mapping: provisional ids -> language char sequences). I'm not sure, but this may also require correcting the character property bit masks in the unicharset file (I don't know exactly how this information is used by Tess, as I don't need it in my project).
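A minimal sketch of the two mappings in Python. The ids ("k001" and so on) and the Khmer sequences chosen for them are purely illustrative assumptions, not an actual coding system; a real table would be built from the glyph inventory of the training images.

```python
# Hypothetical code table: provisional id -> real Khmer character sequence.
# Both the ids and the sequences are made up for illustration.
CODE_TO_TEXT = {
    "k001": "\u1780",        # KHMER LETTER KA
    "k002": "\u1780\u17b6",  # KA followed by KHMER VOWEL SIGN AA
    "k003": "\u17c1",        # KHMER VOWEL SIGN E
}

# Reverse table, used by the training pre-processor.
TEXT_TO_CODE = {text: code for code, text in CODE_TO_TEXT.items()}


def encode_label(text):
    """Pre-processing: language char combination -> provisional id."""
    return TEXT_TO_CODE[text]


def decode_label(code):
    """Post-processing: provisional id -> language char sequence."""
    return CODE_TO_TEXT[code]
```

Keeping one table and deriving the reverse from it means the pre-processor and the post-processor cannot drift apart.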

Warm regards,
Dmitry Silaev




--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To post to this group, send email to tesser...@googlegroups.com.
To unsubscribe from this group, send email to tesseract-oc...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en.

KHEM Sochenda

Jan 16, 2011, 2:42:52 AM
to tesser...@googlegroups.com

Dear Dmitry,

Thank you very much for the comprehensive explanation.
To go straight to the point: is it OK to assign a code like 'k001' or 'k002' to a glyph obtained from Tesseract's segmentation?

For post-processing that touches the Tesseract code, could you please point out which files I should modify? Please also advise me whether the latest version of Tesseract will do.

Thank you very much in advance for your time and response back.

Best Regards,

Sochenda

Dmitry Silaev

Jan 16, 2011, 3:48:01 AM
to tesser...@googlegroups.com
Dear Sochenda,

I'm not sure what the ultimate goal of your code assignment is, but the formal answer to your question is "Yes". You can assign "k001" or "k002" to a bounding box in a .box file. Moreover, you can assign any UTF-8 encoded character sequence. In Tess version 3.0x (current) the only restriction is a 24-byte limit on the length of the entire character sequence. This allows you to use not only an abstract code like "k001" but also a meaningful character sequence from your real language (e.g. the well-known "fi" ligature in some Latin fonts), which relieves you of the pre- and post-processing.
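The byte limit matters for Khmer because each Khmer code point takes 3 bytes in UTF-8. A small Python sketch of a label check follows; the function name is hypothetical, and treating the 24-byte limit as inclusive is an assumption that should be verified against the Tesseract version in use.

```python
MAX_LABEL_BYTES = 24  # limit quoted above for Tess 3.0x


def label_fits(label, limit=MAX_LABEL_BYTES):
    """Return True if the UTF-8 encoding of a box label is within the limit.

    Assumption: the limit is inclusive (a label of exactly `limit` bytes fits).
    """
    return len(label.encode("utf-8")) <= limit
```

For example, `label_fits("k001")` is true (4 bytes), while a label of nine Khmer letters (27 bytes) is rejected.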

If you still prefer using abstract codes then pre-/post-processing can be done without tinkering with Tess's code. Since training as well as recognition result in generation of output files, you can develop a couple of file processing command-line utilities which then can be used along with calls to the Tesseract executable within shell scripts (or .bat files in Windows).

For further details you should definitely study the "TrainingTesseract3" and "ReadMe" (section "Installation Notes - Tesseract 3.00") documents thoroughly (http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 and http://code.google.com/p/tesseract-ocr/wiki/ReadMe). These documents are not easy to search, but they contain all the info you might need.

Warm regards,
Dmitry Silaev

KHEM Sochenda

Jan 16, 2011, 11:01:47 PM
to tesser...@googlegroups.com
Dear Dmitry,

Thank you again for a very quick response.

I am going to train Tesseract for the Khmer language, in which many ligatures are in the same situation as "fi" in some Latin fonts.
The attachment shows an example of a one-line Khmer sentence; please count the boxes from left to right. You can see that some glyphs sit above others. The first glyph is formed of two Unicode characters, while the third glyph and the fifth glyph together form one Unicode character. This is the reason why I wish to give each glyph its own ID and then do post-processing afterwards.

Regarding two glyphs that overlap each other, as in the case of the 7th and 8th glyphs: how will Tesseract segment them? How should I give the positions of the boxes?


Thank you very much in advance for your response.


Best Regards,

Sochenda
example of Khmer sentence.TIF

Sriranga(78yrsold)

Jan 17, 2011, 12:16:27 AM
to tesser...@googlegroups.com
Which tool did you use to create the boxes? Please also upload the box file you generated.

KHEM Sochenda

Jan 17, 2011, 1:25:29 AM
to tesser...@googlegroups.com
In the image, I drew the boxes manually.

Sriranga(78yrsold)

Jan 17, 2011, 1:41:20 AM
to tesser...@googlegroups.com
As per the wiki instructions, the box file has to be generated from the command line as follows:
tesseract <lang.fontname.number.tif> <lang.fontname.number> batch.nochop makebox

KHEM Sochenda

Jan 17, 2011, 1:43:31 AM
to tesser...@googlegroups.com
I know how to do it in Tesseract; the image is just to show you how the glyphs should be boxed.

I can send you the box file generated by Tesseract anyway.

Regards,

Sochenda

Sriranga(78yrsold)

Jan 17, 2011, 1:54:56 AM
to tesser...@googlegroups.com
Are there dependent vowels in your Khmer language? If you have a Unicode chart, it would be better to upload it.

Sriranga(78yrsold)

Jan 17, 2011, 2:06:47 AM
to tesser...@googlegroups.com
I viewed the Khmer Unicode chart (PDF); there are dependent vowels. It is better to use bbtool to generate the box file. Please see the wiki section on tools.

KHEM Sochenda

Jan 17, 2011, 2:10:56 AM
to tesser...@googlegroups.com
Dear Sriranga,

Here are the corresponding box file and image file.

Regards,

Sochenda

On Mon, Jan 17, 2011 at 1:41 PM, Sriranga(78yrsold) <withbl...@gmail.com> wrote:
khm.limon.1.tif.box
khm.limon.1.TIF

KHEM Sochenda

Jan 17, 2011, 2:11:40 AM
to tesser...@googlegroups.com
this link will lead you to Khmer Unicode chart

KHEM Sochenda

Jan 17, 2011, 2:12:08 AM
to tesser...@googlegroups.com
this link will lead you to Khmer Unicode page http://unicode.org/charts/PDF/U1780.pdf

On Mon, Jan 17, 2011 at 2:06 PM, Sriranga(78yrsold) <withbl...@gmail.com> wrote:

Sriranga(78yrsold)

Jan 17, 2011, 2:16:24 AM
to tesser...@googlegroups.com
From the PDF I can see that there are a number of dependent vowels. The case is similar to Indic languages.
Let me know which OS you are using.

KHEM Sochenda

Jan 17, 2011, 2:23:26 AM
to tesser...@googlegroups.com
I am using windows XP; occasionally CentOS

Dmitry Silaev

Jan 17, 2011, 7:21:13 AM
to tesser...@googlegroups.com
Dear Sochenda,

I've checked the Unicode table range you sent and now I see what the problem is. I'd agree that in such an "algorithmic" writing system (in contrast to simpler "positional" systems like, say, Roman or Cyrillic) the stages of pre-/post-processing are inevitable.

I'd suggest making special hand-crafted or generated training images. In these images you would properly space out all the joint character combinations as well as character components that can make up Khmer characters. Then you would edit the resulting box files to assign codes according to your coding system. The noted process should be repeated as many times as required to achieve the sample count of 15-20 for every glyph.

At the recognition stage, if Tess is trained properly, overlapping bounding boxes are not a problem for it. My experience shows that it is very inventive in character segmentation even in cases of BB overlap. So I expect you will have no severe difficulties with partially overhanging or underlying glyphs.

Your post-processor should be able to "decode" recognition output using an algorithmic approach to form good Unicode characters. You can also use some Khmer bigram or trigram statistics to do error correction. Probably you'd want to play around with Tess's dictionary facility but I doubt it would be helpful in your case.

Dmitry

KHEM Sochenda

Jan 18, 2011, 4:02:13 AM
to tesser...@googlegroups.com
Dear Dmitry,

Thank you again for your suggestions.

I have trained a set of characters on the assumption that glyphs lying on the same axis are one glyph.
Unfortunately, when I try to recognize an image, the output file is empty. In the training process I created neither a dictionary file nor the ambig file.

What I expected from the recognition step is a file listing the code IDs that I assigned to the glyphs during training. I will then do the post-processing, taking the output file of Tess recognition as the input. I would much appreciate it if you could point out the causes.

The attached image shows that Tesseract's boxing includes only two separate glyphs but not the third one, which is stacked with them. I would like to know whether there is a way to include the third glyph (below the box in the image) or to adjust the line height in the given image.

Best Regards,

Sochenda




tesseracttraining--issues.TIF
khm.limon.2.box
khm.limon.2.box.bak
khm.limon.2.TIF

Sriranga(78yrsold)

Jan 18, 2011, 7:45:24 AM
to tesser...@googlegroups.com


On Tue, Jan 18, 2011 at 6:12 PM, Sriranga(78yrsold) <withbl...@gmail.com> wrote:


---------- Forwarded message ----------
From: Sriranga(78yrsold) <withbl...@gmail.com>
Date: Tue, Jan 18, 2011 at 5:43 PM
Subject: Re: Tesseract Training
To: KHEM Sochenda <khemso...@gmail.com>


Sochenda,
I tried to generate a data file based on the tif and box files. Please see the attached unicharset, in which your ids do not appear, only the default id. When I checked, the Unicode values did not tally with your ids. If you want to assign ids, I suggest using Unicode numbers, which will appear in the unicharset. I tested this with my Kannada data and it works; see the sample attached herewith.
with regards,
-sriranga(78yrsold)


Dmitry Silaev

Jan 18, 2011, 8:27:50 AM
to tesser...@googlegroups.com
Dear Sochenda,

In addition to what Sriranga said, I'd remind you that you should do a lot of manual work:

In pyTesseractTrainer, check that no bounding box intersects a glyph; if one does, correct its BB coordinates manually.

In cases of BB overlap you should space out participating glyphs in the training image (see the attached picture for examples).

You should use manual spacing if participating glyphs are dependent characters (in your language - vowels) and the number of possible combinations is practically uncountable. Then you would assign every glyph its own code. Tess would consider these glyphs as separate characters and you should post-process the resulting code sequence to obtain a well-formed dependent Unicode pair (or triplet).

If there can be only few such combinations - you can merge these BBs into one to encompass all the required glyphs and assign a single code to the entire glyph combination. Then during the post-processing you'll need to replace this single code with a predefined dependent Unicode pair.
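Merging BBs amounts to taking the union of the rectangles. A rough Python sketch follows, assuming box file lines of the form "symbol left bottom right top page" with the origin at the bottom-left of the image (the layout used by Tesseract 3.0x box files, as far as I know); the function names and the code "k010" are hypothetical.

```python
def parse_box_line(line):
    """Split one box file line into (symbol, left, bottom, right, top, page)."""
    symbol, left, bottom, right, top, page = line.split()
    return symbol, int(left), int(bottom), int(right), int(top), int(page)


def merge_boxes(lines, label):
    """Return one box line that encloses all given boxes and carries `label`."""
    boxes = [parse_box_line(line) for line in lines]
    pages = {box[5] for box in boxes}
    if len(pages) != 1:
        raise ValueError("boxes to merge must be on the same page")
    left = min(box[1] for box in boxes)
    bottom = min(box[2] for box in boxes)
    right = max(box[3] for box in boxes)
    top = max(box[4] for box in boxes)
    return f"{label} {left} {bottom} {right} {top} {pages.pop()}"
```

Merging a consonant box "k001 10 20 30 40 0" with a subscript box "k002 12 5 28 18 0" under the combined code "k010" gives "k010 10 5 30 40 0".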

Hope I've managed to express myself clearly.

Warm regards,
Dmitry Silaev

figure01.GIF

KHEM Sochenda

Jan 19, 2011, 2:28:49 AM
to tesser...@googlegroups.com
Dear Dmitry and Sriranga,

Thank you very much for your help. The reason why my output file is empty is that I put my personal ID on the glyphs, isn't it?

Dear Dmitry,
Please see the attached image: shall the glyph in the red box be assigned to a single Unicode character, or separated as in the image? This glyph is composed of two other glyphs: one can be represented by a Unicode character, and the other is part of a vowel.

Dear Sriranga,

Do the first several lines in your unicharset file represent characters, or are they just Unicode values that represent no character?

Khmer font is also attached.

Best Regards,
Sochenda


figure01.GIF
unicharset(2)
KHMERKEP.ttf

Sriranga(78yrsold)

Jan 19, 2011, 4:24:20 AM
to tesser...@googlegroups.com
Sochenda,
Attached is a text file of Khmer alphabets that I prepared from the character map and the Unicode chart, since I am unable to type in your language even though I have installed the font you supplied.
Please prepare a text (saved as UTF-8) following the attached sample txt file. I shall then try to generate trained data.
KHMER ALPHABETS.txt

Sriranga(78yrsold)

Jan 19, 2011, 4:25:53 AM
to tesser...@googlegroups.com
Please make sure the alphabets are typed as text and not supplied as an image file.

2011/1/19 Sriranga(78yrsold) <withbl...@gmail.com>

Dmitry Silaev

Jan 19, 2011, 4:28:39 AM
to tesser...@googlegroups.com
Dear Sochenda,

> Thank you very much for your help. The reason why my output file is empty is that I put my personal ID on the glyphs, isn't it?

The reason (it might not be the only one) is that your unicharset file should contain all characters that can appear in your recognition output. Your output consists of characters like "k", "0", "1", ..., "9", as you plan to encode glyphs by ids like "k001", "k099", etc.
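One way to check this is to compare the labels used in the box file with the entries of the generated unicharset. The Python sketch below assumes that the unicharset file starts with a count line and that the first whitespace-separated field of every following line is the character (or character sequence); that layout is an assumption about the 3.0x format and should be checked against your own file. The function names are hypothetical.

```python
def box_labels(box_lines):
    """Set of distinct labels (first field) used in a box file."""
    return {line.split()[0] for line in box_lines if line.strip()}


def unicharset_entries(unicharset_lines):
    """Set of entries in a unicharset file, skipping the leading count line."""
    lines = [line for line in unicharset_lines if line.strip()]
    return {line.split()[0] for line in lines[1:]}


def missing_labels(box_lines, unicharset_lines):
    """Labels that occur in the box file but not in the unicharset."""
    return sorted(box_labels(box_lines) - unicharset_entries(unicharset_lines))
```

An empty result from `missing_labels` means every code assigned in the box file is known to the unicharset.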
 
Please see the image attached, shall the image in the red box assigned to a Unicode character or seperated as in the image? This glyph is composed of two other glyphs-- one can be represented by a Unicode character, and the other is a part of a vowel.
I'm not sure how Tess will behave in case of "stacked" glyphs: will it be able to segment them as separate characters? However in case it will, there can be a problem having stable output order. I'll explain. If say the top glyph is encoded as "k001" and the bottom one - as "k002", you might get an output "k001k002" as well as "k002k001" because due to binarization artefacts the two glyphs may have slightly different vertical alignment and Tess is only able to spit out "linear" sequences of characters. You may conduct a series of experiments to find out how good Tess is for stacked segmentation. Later you can address the problem of unstable recognition order by means of post-processing.
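Such an order-fixing post-processing step could look roughly like the Python sketch below. It assumes every code has the hypothetical form "k" plus three digits, and that the pairs of stacked glyphs with their canonical (first, second) order are listed by hand; both the code format and the pair table are illustrative.

```python
import re

CODE = re.compile(r"k\d{3}")

# Hypothetical table of stacked pairs in their canonical output order.
CANONICAL_PAIRS = {("k001", "k002")}


def normalize_order(text):
    """Swap adjacent codes that appear in the reverse of a canonical pair.

    Note: anything in `text` that is not a code is dropped by the tokenizer.
    """
    codes = CODE.findall(text)
    out = []
    i = 0
    while i < len(codes):
        if i + 1 < len(codes) and (codes[i + 1], codes[i]) in CANONICAL_PAIRS:
            out.extend([codes[i + 1], codes[i]])  # reversed pair: swap it
            i += 2
        else:
            out.append(codes[i])
            i += 1
    return "".join(out)
```

With this table both "k001k002" and "k002k001" come out as "k001k002".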

But what Tess can do for sure is take any group of stacked glyphs as a single prototype and successfully recognize it later. Examples are: letters with diacritics, semicolon, Greek capital letter Xi, equal sign, etc. When choosing this approach you should assign an individual code to every combination of stacked glyphs. I've reviewed Khmer Unicode range and I think the number of such combinations is not very big, at least not overwhelming.

And I must say one more important thing. Implementing all this pre- and post-processing is quite a job and would put you in a somewhat awkward situation. Refer to the last paragraph in the TesseractProjects document (http://code.google.com/p/tesseract-ocr/wiki/TesseractProjects). Google and the project owner are a bit obscure about the roadmap for new language support, so it might turn out that your work overlaps with the Tess team's. On the other hand, your work can be a valuable contribution to the project. It would be good to ask Ray Smith or Jimmy O'Regan about this, but unfortunately they are probably too busy to appear in this group ((

Warm regards,
Dmitry Silaev

Sriranga(78yrsold)

Jan 19, 2011, 5:55:47 AM
to tesser...@googlegroups.com, KHEM Sochenda, Dmitry Silaev
Sochenda,
The lines viz. 0ccb 8, 0cd5 8, 20c88 appear in the output vowel1.txt. So we have to convert the Unicode numbers to Kannada characters (script) with the help of a post-processor.
-Regards,
-sriranga(78yrs)

On Wed, Jan 19, 2011 at 4:04 PM, Sriranga(78yrsold) <withbl...@gmail.com> wrote:
Sochenda,
Please see the inline reply below.

On Wed, Jan 19, 2011 at 12:58 PM, KHEM Sochenda <khemso...@gmail.com> wrote:
> Dear Dmitry and Sriranga,
>
> Thank you very much for your help. The reason why my output file is empty is that I put my personal ID on the glyphs, isn't it?
>
> Dear Dmitry,
> Please see the attached image: shall the glyph in the red box be assigned to a single Unicode character, or separated as in the image? This glyph is composed of two other glyphs: one can be represented by a Unicode character, and the other is part of a vowel.
>
> Dear Sriranga,
>
> Do the first several lines in your unicharset file represent characters, or are they just Unicode values that represent no character?

These lines, viz. 0ccb 8, 0cd5 8, 20c88, 30ce0, are Unicode numbers rather than Kannada characters, given to show you. Usually I use characters (script) instead of Unicode numbers for training. I am using Tesseract 3.01 alpha (r-529).

> Khmer font is also attached.

Thanks, but I am unable to type with it. However, it appears in CharacterMap.
On receipt of your alphabet list I shall generate the data files and forward them to you.
vowel1.txt
kh.unicharset

KHEM Sochenda

Jan 20, 2011, 2:33:41 AM
to Sriranga(78yrsold), tesser...@googlegroups.com, Dmitry Silaev

Dear Dmitry and Sriranga,

I am so confused now. :(

Maybe I should apply for internship with tesseract, but I am so engaged with my project here.

Please find attached KHtext in Unicode as a training sample.

Best Regards,

Sochenda

2011/1/19 Sriranga(78yrsold) <withbl...@gmail.com>
khtext.txt

Dmitry Silaev

Jan 20, 2011, 3:42:53 AM
to tesser...@googlegroups.com, Sriranga(78yrsold)
Dear Sochenda,

Please provide us with every file you use to make up your traineddata.
We also need all the command lines with which you run Tess and the Tess tools.
Please be sure to be as detailed as possible.

Internship is a good opportunity for everyone here; I'd probably try to apply also but I'm not much of a recent graduate already ((

Warm regards,
Dmitry Silaev

Sriranga(78yrsold)

Jan 22, 2011, 2:48:42 AM
to tesser...@googlegroups.com
---------- Forwarded message ----------
From: Sriranga(78yrsold) <withbl...@gmail.com>
Date: Fri, Jan 21, 2011 at 12:33 PM
Subject: Re: Tesseract Training
To: KHEM Sochenda <khemso...@gmail.com>


Chenda,
It is better to type the character (in your language's script) than a code in the box file, because your characters will then be found in the unicharset file. I don't know whether your keyboard is able to type your language; if so, it is better to type.


On Fri, Jan 21, 2011 at 11:41 AM, Sriranga(78yrsold) <withbl...@gmail.com> wrote:
Chenda,
By guesswork I have edited the box file using another tool, owler.exe (which is for English only), attached herewith. The advantage of owler.exe is that you can type a character or hexadecimal code by pressing Tab. A consonant or an independent vowel may have a single box, and a consonant/independent vowel plus dependent vowel must also have a single box. (Owler is not suitable for Kannada, so I do not use it myself.)
If you recognize the same tif file that was used for training, the output should naturally be correct. If you use a tif other than the one used for training, there will naturally be misspellings, which can be corrected by post-processor software. The same problem occurred for Kannada. I hope you will succeed in generating the trained data file, since Khmer is no more complex than Kannada script.
After receiving the corrected box file, I shall generate the trained data file.

With Best Wishes,
-sriranga(78yrs)



On Fri, Jan 21, 2011 at 7:49 AM, KHEM Sochenda <khemso...@gmail.com> wrote:
Dear Dmitry and Sriranga,

Here are my training results. When I ran recognition on the same image that was used for training, the result was perfect. When I tried the attached test image, there seemed to be problems recognizing the characters.

Please tell me what your thoughts about this.

Best Regards,

Sochenda


On Thu, Jan 20, 2011 at 11:47 PM, KHEM Sochenda <khemso...@gmail.com> wrote:

Dear Sriranga,

Here is my training box file. Editing a box file is really tedious. I just found some glyphs that I haven't assigned codes to yet, but it is difficult to find them either in the box editor you gave me or with pytesseracttrainer.py, as it is too slow.

Best Regards,

Sochenda

On Thu, Jan 20, 2011 at 4:49 PM, Sriranga(78yrsold) <withbl...@gmail.com> wrote:
box file for editing



On Thu, Jan 20, 2011 at 2:46 PM, KHEM Sochenda <khemso...@gmail.com> wrote:
Dear Dmitry and Sriranga,

But, Sriranga, I guess your computer cannot render KH language well. I will send you an image instead ok?

Best Regards,
Sochenda


On Thu, Jan 20, 2011 at 4:08 PM, Sriranga(78yrsold) <withbl...@gmail.com> wrote:
Attached is a zip file containing the owler exe file. Before unzipping, please first delete the word {"OM"}, and then unzip.
With the help of owler, edit the box file according to your requirements. After editing the box file, please forward it to me
for generating the traineddata file; or, if you are able to generate the traineddata file yourself, please do so. No problem.
With best of luck,
-sriranga(78yrs)
Dear Dmitry,
Sorry, I could not post to the forum because of the attached files. Hence I am sending a copy to you.

On Thu, Jan 20, 2011 at 2:22 PM, Sriranga(78yrsold) <withbl...@gmail.com> wrote:
Sochenda
Please find attached the box file with its khtext.png for editing. I am sending khtext.tif and the owler tool to you separately for editing, since I neither know the Khmer language nor am able to type it on the keyboard. After editing the box file, please return it to me for further processing.

With best of Luck,
-sriranga(78yrs)

2011/1/20 KHEM Sochenda <khemso...@gmail.com>

KHEM Sochenda

Jan 23, 2011, 9:22:53 PM
to tesser...@googlegroups.com
thanks Sriranga,

Here is my box file after editing. I am going to test the recognition and improve the classification according to the error.

Best Regards,
Sochenda
limon-train.box

Sriranga(78yrsold)

Jan 23, 2011, 9:39:24 PM
to tesser...@googlegroups.com
Sochenda,
I am really happy that at least it works for you now. I could not understand your point "improve the classification according to the error". Will you please explain a little? Anyway, please send feedback with the percentage of accuracy in the output text. We must thank Dmitry for his valuable guidance.
Wish you Good Luck,
-sriranga(78yrs)

KHEM Sochenda

Jan 23, 2011, 10:03:51 PM
to tesser...@googlegroups.com
Dear Sriranga,

I mean I will test whether it works well with what I classify now, or whether I have to adapt something more.

I have one more question. I put some entries in the unicharambigs file; however, Tess seems to ignore what I have put there, and the output is the same as with no entries in unicharambigs. Please see the attached test file and unicharambigs.

Of course, I thank you and Dmitry so much for the fruitful comments on this issue.

Best Regards,

Sochenda
khm.unicharambigs
kexeKe1.tif

Sriranga(78yrsold)

Jan 24, 2011, 12:42:16 AM
to tesser...@googlegroups.com
Dear  Sochenda,
Thanks for the update.
1) I am curious to know whether you edited the box file in the owler tool or manually.
2) unicharambigs file: I am not able to create unicharambigs for Kannada even when following the latest instructions in the wiki.
 In this connection, I have posted the problem I faced under issue No. 433, which is still pending a solution. A copy of om.unicharambigs is attached for your information. I could not understand where I made a mistake.
I tested េ ក: it does not merge with the consonant, so it appears that េ is an independent vowel and not a dependent vowel. As such your file appears to be in order and I feel it should work; however, please change v1 to v12 as per the wiki instructions and try again.
With Best of Luck,
-sriranga(78yrsold)
om.unicharambigs

KHEM Sochenda

Jan 24, 2011, 12:57:15 AM
to tesser...@googlegroups.com
Dear Sriranga,

Yes, I use the owler tool to edit the box file.
េ is a dependent vowel. In the unicharambigs, I am just trying to swap the order when េ and ក meet.

Best Regards,
Sochenda

Sriranga(78yrsold)

Jan 24, 2011, 1:52:26 AM
to tesser...@googlegroups.com
Sochenda,
I have re-edited the attached box file; some of the boxes were not formed correctly. Please see the attached tesseract.log.
Please re-check the box file and make corrections wherever necessary, since I could not type the characters. Let me have your feedback, with tesseract.log if any.
With Best Wishes,
-sriranga
limon-train.box
tesseract.log

KHEM Sochenda

Jan 24, 2011, 3:14:20 AM
to tesser...@googlegroups.com
Dear Sriranga,

I wonder why you needed to re-edit the box file. Here is the box file updated according to your changes, and the log file for the training image.
You can try it to see if it works for you.

I have changed the version line of the unicharambigs according to your comment, but the output file of the recognition step is still the same (no position change applied).

If you have any further comments please let me know.

Best Regards,

Sochenda
limon-train.box
tesseract.log

Sriranga(78yrsold)

Jan 24, 2011, 3:37:05 AM
to tesser...@googlegroups.com
Attached is a screenshot in which I have circled the relevant places in red. A single box should be drawn for vowel plus consonant,
so I edited wherever I noticed this. After re-editing the box file, tesseract.log does not show any errors. I shall now try to generate trained data. Regarding the re-edit: please carefully compare your original box file with the revised one I forwarded to you; you will notice many differences between the two. Please make sure one of the box files is renamed to avoid overwriting.
khem.JPG

KHEM Sochenda

Jan 24, 2011, 3:38:55 AM
to tesser...@googlegroups.com
Dear Sriranga,

Sorry, I attached the wrong log file.

Attached here is the right one. I think the problem is that the glyphs appear in an inconsistent order; I mean the order of the glyphs is messed up.

Best Regards,

Sochenda
limon-train.box
tesseract.log

KHEM Sochenda

Jan 24, 2011, 3:45:03 AM
to tesser...@googlegroups.com
Send me the one that does not show error.

Dmitry Silaev

Jan 24, 2011, 3:52:53 AM
to tesser...@googlegroups.com
Dear Sochenda,

Glad you have succeeded in training for Khmer and thanks for your kind words.

Could you please share with us the images and .box files you used for training? Also some sample input images and respective recognition results would be of much use.

Sriranga, I see your *training* process is doing pretty well. Most of your problems are in the dictionary facility. However, I do not feel proficient in this field. I mean I know how it works (to be exact, how it *should* work) and I understand the theoretical basis behind it, but I have avoided using it as much as I could. When I was getting ready to start using Tess in my project, I read much of the tesseract-XXX groups and understood that the dictionary facility is far from perfect; at least, it's not ready to use yet. Fortunately my project involves much image processing, and the specifics of my task imply block/line/letter segmentation, so I managed to keep clear of most of Tess's dubious parts and used it solely as a raw classifier. And I think classification is what Tess does quite well.

Unfortunately I think you will struggle a lot with various inconsistencies and cryptic errors, but I think it's worth it anyway. You should report every error to the team and wait until it's fixed, while trying to find your way around it. Or you can leave the dictionary facility and rely completely on some home-brewed post-processing. If you choose this, your problem turns into a small R&D project, so you need to find appropriate people to do the job.

Warm regards,
Dmitry Silaev

Sriranga(78yrsold)

Jan 24, 2011, 4:26:13 AM
to tesser...@googlegroups.com, Dmitry Silaev, KHEM Sochenda
Dear Dmitry,
Thanks for the valuable guidance and encouragement. In fact I am neither a programmer nor a developer. Since the Khmer language has independent as well as dependent vowels, which are similar to those of the Kannada language,
I took an interest in how Tesseract would work for Khmer, and also wanted to gain experience. Anyway, I have succeeded in generating lim.traineddata without any problem. I am interested to know the percentage of accuracy in the output text, viz. testlim.txt; since I don't know Khmer, only Sochenda can tell. I don't know how to create a post-processor program, which would be better than charambigs.
With warmest Regards,
-sriranga(78yrs old)



Warm regards,
Dmitry Silaev

testlim.txt
New000-r527.7z

KHEM Sochenda

Jan 24, 2011, 5:06:54 AM
to Sriranga(78yrsold), tesser...@googlegroups.com, Dmitry Silaev
Dear Dmitry and Sriranga,

It is nice discussing with you both. I am now trying to find a way forward with unicharambigs to see if it can help. So far I feel that unicharambigs has not helped at all: I have edited some entries and the recognition output is unchanged. I wonder why that is.

Regarding the recognition rate, do you know of any tools for getting recognition-rate statistics? What is the formula?
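One common measure, not one proposed in this thread, is character accuracy: 1 minus the edit distance between the OCR output and the ground-truth text, divided by the length of the ground truth. A self-contained Python sketch, with hypothetical function names:

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def character_accuracy(truth, ocr):
    """Character accuracy of `ocr` against the ground-truth text `truth`."""
    if not truth:
        raise ValueError("ground truth must not be empty")
    return 1.0 - edit_distance(truth, ocr) / len(truth)
```

The value can go below zero when the output is much longer than the ground truth. For Khmer the comparison could be made either on the provisional codes or on the decoded Unicode text, which may give different figures.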

Sriranga, you can use khm. as the prefix for the Khmer language. I also succeeded in getting recognition output even with the warnings during training. The issue is caused by the disorder of the glyphs. I am now trying to fix the order of the glyphs in the box file manually. The output file looks okay; only rendering is a problem.

I am glad that your language and my language has similar structure. This is helpful!!

Will get back to you when I finish updating the training file with complete set of glyphs.

Thank you and Best Regards,

Sochenda

2011/1/24 Sriranga(78yrsold) <withbl...@gmail.com>

Dmitry Silaev

Feb 16, 2011, 1:28:16 AM
to Sriranga(78yrsold), tesser...@googlegroups.com, KHEM Sochenda
Guys,

If you have more than one box/tiff pair, you can train (i.e. generate a .tr file) for each of these pairs separately.

Then you can concatenate (simply "cat" or "copy") all the resulting .tr files together and run all the training tools on the single final .tr file. This relieves you of the 32-file limit.
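For those who prefer a script that behaves the same on Windows XP and CentOS, the concatenation can be sketched in Python as below. The function names are hypothetical; the only thing the sketch adds to a plain "cat" is a guard that makes sure each file ends with a newline, so that the last line of one .tr file cannot run into the first line of the next.

```python
def join_tr_contents(contents):
    """Join the byte contents of several .tr files, newline-terminating each."""
    parts = []
    for data in contents:
        if data and not data.endswith(b"\n"):
            data += b"\n"
        parts.append(data)
    return b"".join(parts)


def concatenate_tr_files(paths, out_path):
    """Read every file in `paths` and write the joined result to `out_path`."""
    contents = []
    for path in paths:
        with open(path, "rb") as handle:
            contents.append(handle.read())
    with open(out_path, "wb") as out:
        out.write(join_tr_contents(contents))
```

Files are handled as bytes, so the content is copied through without any encoding or line-ending conversion.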

For your convenience you can craft a batch file or shell script which would train, concatenate, cluster, etc. in one run. You should analyze all errors carefully though.

Warm regards,
Dmitry Silaev




On Wed, Feb 16, 2011 at 5:56 AM, Sriranga(78yrsold) <withbl...@gmail.com> wrote:
Dimitry,
It appears that Khem has not copied you on this, so I am forwarding it for your valuable guidance and comments, which may also help me in my Kannada project.
with regards,
-sriranga(78yrs)

---------- Forwarded message ----------
From: KHEM Sochenda <khemso...@gmail.com>
Date: Wed, Feb 16, 2011 at 7:45 AM
Subject: Re: Tesseract Training
To: "Sriranga(78yrsold)" <withbl...@gmail.com>


Dear Sriranga,

These are the steps I followed for training:
  1. I created 3 pages of training images, as you can see in the attachments (khm.limons1.1 is page 1, khm.limons1.2 is page 2, and khm.limons1.3 is page 3).
  2. I created a box file for every page (khm.limons1.1.box and so on) with the command line:

    "tesseract khm.limons1.1.tif khm.limons1.1 batch.nochop makebox" for page 1, "tesseract khm.limons1.2.tif khm.limons1.2 batch.nochop makebox" for page 2, and the same for page 3.

  3. Then I edited the box files; the final result is in the attachments.
  4. I merged the images into a single file (khm.limons1.0.tif).
  5. I merged the three box files into a single box file with page numbers assigned (khm.limons1.0.box).
  6. I ran the command to train on the single file: "tesseract khm.limons1.1.tif khm.limons1.0.tif khm.limons1.0 nobatch box.train". The result looked okay at this step. (My purpose in merging into one file is that I want a single font to be in just one .tr file.)

  7. I then ran the command "unicharset_extractor khm.limons1.0.box" to extract every single glyph from the box file. The result looked okay.
  8. Then I tried to extract the features with "mftraining -U unicharset -O khm.unicharset khm.limons1.0.tr" and "cntraining khm.limons1.0.tr". I failed at this step.
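
The per-page variant of steps 6 to 8 can be sketched as a small Python function that only builds the command lines and does not run them. The function name training_commands and its arguments are illustrative, and passing several .box and .tr files to the tools in one call is an assumption about how they are invoked, not something shown in the steps above.

```python
def training_commands(lang, font_pages):
    """Build the command lines for steps 6 to 8 without running them.

    lang and font_pages are illustrative inputs; each entry in font_pages
    is a base name such as "khm.limons1.1" with a matching .tif and .box.
    """
    commands = []
    for base in font_pages:
        # Step 6: one .tr file per box/tiff pair.
        commands.append(["tesseract", base + ".tif", base, "nobatch", "box.train"])
    # Step 7: collect the character set from all box files.
    commands.append(["unicharset_extractor"] + [base + ".box" for base in font_pages])
    tr_files = [base + ".tr" for base in font_pages]
    # Step 8: clustering, with plain ASCII hyphens in the flags.
    commands.append(["mftraining", "-U", "unicharset", "-O", lang + ".unicharset"] + tr_files)
    commands.append(["cntraining"] + tr_files)
    return commands
```

Printing the returned lists before running anything makes it easier to spot a stray file name or a mistyped flag.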

--------------------------------------------------------------------------------------------------------
Since I could not get the approach above to work, I omitted steps 4 and 5 and went straight to steps 6, 7, and 8 using the separate box files, which gave me the traineddata in the attached file. Having three separate .tr files is not what I want, though.

Currently I use the resulting trained data for my temporary OCR system. I would like to add other fonts, but the number of .tr files is limited to 32. This is what concerns me.

Best Regards,

Sochenda






Dmitry Silaev

Feb 17, 2011, 3:26:50 AM
to Sriranga(78yrsold), tesser...@googlegroups.com
Sriranga,

> It is
> presumed that commandline for (WinXP) should be as follows:
> eg= " c:\tess\copy 001.tr + 002.tr + 003.tr + oo4.tr > 1234.tr or
> Multiimage.tr" which may kindly be confirmed. OR correct commandline for
> cancatenate using command "copy" to be used may kindly be intimated.

This command won't do what you want. First, don't put a path before
"copy": it is a built-in command of the MS-DOS command processor, so
when it is prefixed with a path it is treated as the name of an
executable in the "c:\tess\" directory, which doesn't exist. Second,
you don't need the ">", because it redirects the informational output
of the "copy" command (not the files' contents) to "1234.tr". The
destination file should be given at the end of the command, after a
space. Therefore your command line should be

copy 001.tr + 002.tr + 003.tr + 004.tr 1234.tr

Warm regards,
Dmitry Silaev

On Thu, Feb 17, 2011 at 9:44 AM, Sriranga(78yrsold)
<withbl...@gmail.com> wrote:
> Dmitry,
> Thanks for the valuable guidance  However I could not understand how to
> cancatenate (simply "copy" all the resulted .tr files together? It is
> presumed that commandline for (WinXP) should be as follows:
> eg= "  c:\tess\copy  001.tr + 002.tr + 003.tr + oo4.tr > 1234.tr or
> Multiimage.tr"  which may kindly be confirmed.  OR correct commandline for
> cancatenate using command "copy" to be used may kindly be intimated.
> With Warmest Regards,
> -sriranga(78yrs)

Sriranga(78yrsold)

Feb 17, 2011, 3:42:32 AM
to Dmitry Silaev, tesser...@googlegroups.com
Dmitry,
I am extremely thankful for your valuable guidance. It works for me. I have much to learn from you.
With warmest Regards,
-sriranga(78yrs)

Ray Smith

Feb 19, 2011, 10:12:21 PM
to tesser...@googlegroups.com, Sriranga(78yrsold), Dmitry Silaev
Sorry to be late on this very long thread, but you guys are making life difficult for yourselves by getting hold of the wrong end of the stick. There is no need to give tesseract a convoluted re-encoding of the recognizable units that you want it to recognize and then translate it back on output.
Maybe I misunderstand what you were trying to do to start with, but you can give tesseract any utf-8 string for each recognizable unit that you train it with, including multiple unicodes if you want. If your original shapes/recognizable units/aksharas/syllables (call them what you like) represent multiple unicodes, then give tesseract all the utf8 for those, and it will be happy. (It currently supports up to 24 bytes of utf-8 for each shape.) It will make life easier when you want to give it a dictionary to use with the shapes, as it assumes that the words you give it can be made up of sequences of the codes for the basic shapes.
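
To make the 24-byte figure concrete, here is a minimal Python sketch that measures the UTF-8 length of a multi-code-point label. The helper names are made up for illustration, and treating the limit as inclusive (exactly 24 bytes is accepted) is an assumption.

```python
MAX_LABEL_BYTES = 24  # limit quoted in the message above


def utf8_length(label):
    """Number of bytes the label occupies when encoded as UTF-8."""
    return len(label.encode("utf-8"))


def oversized_labels(labels, limit=MAX_LABEL_BYTES):
    """Return the labels whose UTF-8 encoding exceeds the limit."""
    return [label for label in labels if utf8_length(label) > limit]


# Khmer KA + COENG + KA: three code points forming one stacked shape.
stacked = "\u1780\u17d2\u1780"
print(utf8_length(stacked))  # 9
```

Each Khmer code point takes 3 bytes in UTF-8, so a shape made of up to 8 such code points would fit within the limit.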

Eugene Reimer

Feb 19, 2011, 11:11:19 PM
to tesser...@googlegroups.com
Would a "basic shape" be the same as a "shape", or as a "utf8"? Hmm,
perhaps it is a "call them what you like"?

Longyi

Feb 17, 2011, 3:33:46 PM
to tesser...@googlegroups.com
Hey all,
I am trying to use tesseract to extract small images of the segmented words from scanned images. I have been looking into the code for several days, but I am still unable to find a way to do that.
Does anyone have a suggestion? Thank you so much.
Longyi Li