Tesseract Guide for newbies (first draft)


Kristóf Horváth

Feb 7, 2019, 3:36:25 AM
to tesseract-ocr
Hi, I set out to make a newbie-friendly guide, and I already have some material that might help people, but it's not complete yet. I would like people to read it and help out with comments where they can. I left some places empty or left notes of my own; please feel free to figure out what should be there. I really hope I didn't make big mistakes, but in case I did write something stupid, please share it in the form of constructive criticism.
The following things are very unclear to me (in terms of what they exactly represent):
  • radical-stroke.txt
  • learning_rate
  • noextract_font_properties
  • 2 percent improvement
  • time=
  • best error was 100 @0
  • iteration 31/100/100
  • rms=
  • delta=
  • char train=
  • word train=
  • skip ratio=
  • best char error=
And finally, here is the link. (The Google Doc should be in English. I'm writing a wiki, so the formatting is based on wiki syntax; with the link you should be able to make comments.)
In case you are really enthusiastic about it, you can contact me for write access.

Best Regards
Kristof Horvath

Lorenzo Bolzani

Feb 7, 2019, 7:26:49 AM
to tesser...@googlegroups.com
Hi Kristof,
good work, I have thought about doing this a few times myself. I had a quick look; just a couple of quick notes. I'll try to read it more carefully when I get time.

This thread about font size is where I got the 30-40px indication:

https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/Wdh_JJwnw94/xk2ErJnFBQAJ

For my training runs (fine-tuning) I used 48px (with a 2px white border, so the text itself was about 44px). Maybe size does not matter much if you are fine-tuning, but I never did a precise comparison; maybe 48px is even better. The white border was probably not important.

One thing to keep in mind: IMO there is no single correct way to train, because different fonts or different types of images (contrast, noise, etc.) may work best with different parameters. So you need to experiment a little with these if you want optimal results.

This leads to the most important part: am I done training? Without answering this you are just wasting time.

What I describe in this post is not completely correct due to the way ocrd works (I should discuss this on GitHub to see if it should be fixed or not).


The basic idea of any machine learning training is this: split the data in two parts, use one part for training and the other to check the result. If you train too much on only a few things, you get exceptionally good at those, but you over-specialize and get worse at everything else (this is called overfitting). So you get 99.999% accuracy on the training set and 74% on the eval set and on real-world data, which is what really matters (real-world results are usually a little worse than eval).
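The split described above can be sketched in a few lines; the file names and the 90/10 ratio here are illustrative, not tied to any particular ocrd layout:

```python
import random

# Hypothetical list of .lstmf sample files (names are made up).
all_files = [f"sample_{i:03d}.lstmf" for i in range(100)]

random.seed(42)  # fixed seed so the split is reproducible
random.shuffle(all_files)

split = int(len(all_files) * 0.90)  # 90/10 train/eval split
train_files, eval_files = all_files[:split], all_files[split:]

# The two sets must not overlap, otherwise the eval score lies.
assert not set(train_files) & set(eval_files)
print(len(train_files), len(eval_files))  # 90 10
```

The key point is that the split is made once and the two sets never mix; re-shuffling later would leak training samples into the eval set.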

The problem I found is that ocrd recreates the files list.train and list.eval every time you run it (it was not designed for incremental training, I think). So, if you follow my instructions, you'll mix the train and eval files, and this is bad.

So I modified the ocrd Makefile to create these two files explicitly at the beginning of the training (and only once).

This is the edit (about line 80):

# Create lists of lstmf filenames for training and eval
#lists: $(ALL_LSTMF) data/list.train data/list.eval
lists: $(ALL_LSTMF)

train-lists: data/list.train data/list.eval


Now you need to call "make train-lists" only once, when you start a new training session with new data (not after each "iteration step").

Ocrd by default does a 90/10 split (RATIO_TRAIN := 0.90). If you have a moderate amount of data (1,000-10,000 samples), do an 80/20 split. If you have a ton of data (100k+ samples), 90/10 or even 95/5 may be fine.
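Assuming the Makefile variable quoted above, the ratio can be overridden on the command line without editing the file (a sketch; check your own Makefile for the actual target names):

```shell
# RATIO_TRAIN and the "lists" target are taken from the Makefile
# snippet quoted earlier in this thread. 80/20 split for a modest
# data set:
make lists RATIO_TRAIN=0.80
```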

About PSM: I did my training with PSM 6, but for one model (the most complex one, out of 8) I found that using PSM 13 when doing the recognition gives better results for punctuation and other special characters.
Again, I do not know how much difference the PSM parameter makes during training. From what I understand, PSM 6 does some custom cleanup/preprocessing on the images, while PSM 13 leaves them untouched (completely?).

About the parameters you listed in your post: I know the meaning of a few of them, but I think that in general they are quite useless (or you need to understand much more before messing with them). What I mostly refer to is the output from lstmeval. char train and word train are the recognition errors; these are probably the only ones to look at as a reference (but they refer to the training data, not the eval data). best char error is the best so far; the training is noisy and goes up and down. delta is probably the variation from the previous output, and rms is the root mean square of something. In other words, you do not really need to understand all of them to do the training.

One iteration means one image, so max_iterations should be at least equal to your number of images. If you have a ton of images you may find that you do not need to process all of them to reach the "saturation" point where extra training is useless, but normally you want to process all of them, even a few times (until the eval score stabilizes or gets worse for a few iterations).
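One way to decide when the eval score has "stabilized or gotten worse" is a simple patience rule; this is a generic sketch of the idea, not something lstmtraining does for you (the error values below are made up):

```python
def should_stop(eval_errors, patience=3):
    """True if the last `patience` eval errors were all worse than
    (or equal to) the best error seen before them."""
    if len(eval_errors) <= patience:
        return False
    best = min(eval_errors[:-patience])
    return all(e >= best for e in eval_errors[-patience:])

# Made-up eval char-error history (%): improving, then flattening out.
history = [12.0, 9.5, 8.1, 7.9, 8.0, 8.2, 8.4]
print(should_stop(history))  # True: no improvement in the last 3 evals
```

Run lstmeval periodically on the eval set, append the char error to such a history, and stop (or keep the best checkpoint) once the rule fires.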

One note: if you repeat the whole training multiple times (for example trying different image sizes), you need to keep the list.train/list.eval files aside; otherwise you compare against a different set of eval images (and with a small data set this can make a big difference).

Another note: while you fine-tune (specialize) on a new font (or fonts), you get a little worse on all the others. If you care about other fonts too, you should check them with lstmeval as well.


Bye

Lorenzo

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/65c42f6a-3463-4290-905c-9dcc2d9caada%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Kristóf Horváth

Feb 7, 2019, 9:13:57 AM
to tesseract-ocr
Dear Lorenzo,

thank you for your input; it is very much appreciated. I will go through your suggestions, because I have some questions and clarifications.

> This thread about font size is where I got the 30-40px indication:
>
> https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/Wdh_JJwnw94/xk2ErJnFBQAJ
>
> For my training runs (fine-tuning) I used 48px (with a 2px white border, so the text itself was about 44px). Maybe size does not matter much if you are fine-tuning, but I never did a precise comparison; maybe 48px is even better. The white border was probably not important.
>
> One thing to keep in mind: IMO there is no single correct way to train, because different fonts or different types of images (contrast, noise, etc.) may work best with different parameters. So you need to experiment a little with these if you want optimal results.
>
> This leads to the most important part: am I done training? Without answering this you are just wasting time.

I don't exactly get what you wanted to point out, but the link to the source of the picture specification helps, and I will try to digest it too.


> What I describe in this post is not completely correct due to the way ocrd works (I should discuss this on GitHub to see if it should be fixed or not).
>
> The basic idea of any machine learning training is this: split the data in two parts, use one part for training and the other to check the result. If you train too much on only a few things, you get exceptionally good at those, but you over-specialize and get worse at everything else (this is called overfitting). So you get 99.999% accuracy on the training set and 74% on the eval set and on real-world data, which is what really matters (real-world results are usually a little worse than eval).
>
> The problem I found is that ocrd recreates the files list.train and list.eval every time you run it (it was not designed for incremental training, I think). So, if you follow my instructions, you'll mix the train and eval files, and this is bad.
>
> So I modified the ocrd Makefile to create these two files explicitly at the beginning of the training (and only once).
>
> This is the edit (about line 80):
>
> # Create lists of lstmf filenames for training and eval
> #lists: $(ALL_LSTMF) data/list.train data/list.eval
> lists: $(ALL_LSTMF)
>
> train-lists: data/list.train data/list.eval
>
> Now you need to call "make train-lists" only once, when you start a new training session with new data (not after each "iteration step").

Thanks for writing the train/eval split down; I had the concept, I just couldn't put it into proper words.
Thank you for fixing the Makefile. I will include this in my documentation for sure.


> Ocrd by default does a 90/10 split (RATIO_TRAIN := 0.90). If you have a moderate amount of data (1,000-10,000 samples), do an 80/20 split. If you have a ton of data (100k+ samples), 90/10 or even 95/5 may be fine.
 
This is super useful info. 


> About PSM: I did my training with PSM 6, but for one model (the most complex one, out of 8) I found that using PSM 13 when doing the recognition gives better results for punctuation and other special characters.
> Again, I do not know how much difference the PSM parameter makes during training. From what I understand, PSM 6 does some custom cleanup/preprocessing on the images, while PSM 13 leaves them untouched (completely?).

I read the same thing, that 13 (PSM.RAW_LINE) is the most efficient one for training, and I am somewhat sure that 13 leaves the images untouched (it wasn't me who researched the segmentation modes, but he says it just takes the "rawest" form of the line).

 
> About the parameters you listed in your post: I know the meaning of a few of them, but I think that in general they are quite useless (or you need to understand much more before messing with them). What I mostly refer to is the output from lstmeval. char train and word train are the recognition errors; these are probably the only ones to look at as a reference (but they refer to the training data, not the eval data). best char error is the best so far; the training is noisy and goes up and down. delta is probably the variation from the previous output, and rms is the root mean square of something. In other words, you do not really need to understand all of them to do the training.

Yes, they are mostly useless, but I'm writing documentation, and if I say "include this flag or that variable" then I would like to include a definition for that flag or parameter. I am mostly interested in three questions for the variables and flags I pointed out:
  • What does this file look like?
  • What does it do?
  • How can I create it?
My problem with lstmeval is mostly small confusion I just want to clarify. For example: char train and word train, if they are high, that means there are a lot of errors, right? (Same goes for best char error.)
Oh, and those outputs you said I don't need for training (like rms): I would still like to know what they are, even if I only get one confusing sentence, because there should be a definition for each.


> One iteration means one image, so max_iterations should be at least equal to your number of images. If you have a ton of images you may find that you do not need to process all of them to reach the "saturation" point where extra training is useless, but normally you want to process all of them, even a few times (until the eval score stabilizes or gets worse for a few iterations).

Thank you for writing this down, because I came to the same conclusion, and it's just nice to hear it from you. But my question was actually referring to the lstmeval output.
It prints an iteration number like this: iteration 31/100/100. Can you tell me what the three numbers represent?


> One note: if you repeat the whole training multiple times (for example trying different image sizes), you need to keep the list.train/list.eval files aside; otherwise you compare against a different set of eval images (and with a small data set this can make a big difference).

Good note. This warning definitely belongs in the newbie guide.


> Another note: while you fine-tune (specialize) on a new font (or fonts), you get a little worse on all the others. If you care about other fonts too, you should check them with lstmeval as well.

Very good note. I am planning to make the training overview longer by adding a section that just talks about the mechanics of training (things like what the train/eval ratio should be, and how many iterations to run).
I know there is no exact answer like "this is best for this case." But as I was doing research I found a lot of advice that was very much tied to specific setups, and I will try to collect a few of these just to give a nice example of how you should think about your training.
----
So my further plans are simple:
  • rework most things in the wiki (this is a general goal)
  • add more flavour text in certain places (this will require testing the guide on actual people; I have monkeys for testing my guide, but I wouldn't mind if somebody on the forum tried it and gave feedback like you did, Lorenzo)
  • collect general errors and common mistakes
Once again, thank you for your input; I am eagerly awaiting your reply, Lorenzo.

Shree Devi Kumar

Feb 7, 2019, 10:43:11 AM
to tesser...@googlegroups.com

Kristóf Horváth

Feb 7, 2019, 11:08:06 AM
to tesseract-ocr
Thanks, Shree. I will check it out tomorrow, but can you please give me personal feedback?
Also, I left out training from scratch because it requires a serious amount of sample data and a newbie won't have that, but I will definitely dig into this guide.

Shree Devi Kumar

Feb 7, 2019, 11:31:24 AM
to tesser...@googlegroups.com
>> iteration 31/100/100


// Appends <intro_str> iteration learning_iteration()/training_iteration()/
// sample_iteration() to the log_msg.
void LSTMTrainer::LogIterations(const char* intro_str, STRING* log_msg) const {
  *log_msg += intro_str;
  log_msg->add_str_int(" iteration ", learning_iteration());
  log_msg->add_str_int("/", training_iteration());
  log_msg->add_str_int("/", sample_iteration());
}

>> radical-stroke.txt


// If pass_through is true, then the recoder will be a no-op, passing the
// unicharset codes through unchanged. Otherwise, the recoder will "compress"
// the unicharset by encoding Hangul in Jamos, decomposing multi-unicode
// symbols into sequences of unicodes, and encoding Han using the data in the
// radical_table_data, which must be the content of the file:
// langdata/radical-stroke.txt.

Even though it is only used for training Han scripts, Tesseract gives an error if the file is not found for other languages too.




--

____________________________________________________________
Bhajan - Kirtan - Aarti @ http://bhajans.ramparivar.com

Kristóf Horváth

Feb 8, 2019, 2:21:31 AM
to tesseract-ocr
Thank you Shree, that helps.