Hi Kristof,
good work, I've thought about doing something like this a few times myself. I only gave it a quick look, so here are just a couple of quick notes; I'll try to read it more carefully when I get time.
This thread about font size is where I got the 30-40px indication:
https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/Wdh_JJwnw94/xk2ErJnFBQAJ
For my training runs (fine-tuning) I used 48px line images (with a 2px white border, so the text itself was about 44px). Maybe size does not matter much if you do fine-tuning, but I never made a precise comparison; maybe 48 is even better. The white border probably was not important.
One thing to keep in mind is that, IMO, there is no single "correct" way to train, because different fonts or different types of images (contrast, noise, etc.) may work best with different parameters. So you need to experiment a little with these if you want optimal results.
This leads to the most important part, the question "Am I done training?": without a way to answer it you are just wasting time.
What I describe in this post is not completely correct due to the way ocrd works (I should discuss this on GitHub to see whether it should be fixed or not).
The basic idea behind any machine learning training is this: split the data in two parts, use one for training and the other to check the results. If you train too long on a small set of things you get exceptionally good at those, but you over-specialize and get worse at everything else (this is called overfitting). You end up with 99.999% accuracy on the training set and 74% on the eval set and on real-world data, which is what really matters (real-world results are usually a little worse than the eval set).
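The split itself is simple. Here is a sketch in plain shell of what a one-time split looks like (the file names `all-lstmf`, `list.train` and `list.eval` mirror the ocrd layout; the demo data and the 90/100 arithmetic are just for illustration):

```shell
#!/bin/sh
# Sketch: one-time 90/10 train/eval split of a list of .lstmf files.
set -e

# Demo data: in a real run this list would come from the Makefile
# ($(ALL_LSTMF)); here we fabricate ten dummy entries.
seq 1 10 | sed 's/$/.lstmf/' > all-lstmf

total=$(wc -l < all-lstmf)
train=$((total * 90 / 100))      # 90/10 split, like RATIO_TRAIN := 0.90

shuf all-lstmf > all-shuffled    # shuffle once, then split
head -n "$train" all-shuffled > list.train
tail -n +"$((train + 1))" all-shuffled > list.eval

wc -l list.train list.eval
```

The key point is that the shuffle and split happen once; every later training step reads the same two lists.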
The problem I found is that ocrd recreates the files list.train and list.eval every time you run it (I think it was not designed for incremental training). So, if you follow my instructions, you'll mix the train and eval files, and this is bad.
So I modified the ocrd Makefile to create these two files explicitly at the beginning of the training (and only once).
This is the edit (about line 80):
# Create lists of lstmf filenames for training and eval
#lists: $(ALL_LSTMF) data/list.train data/list.eval
lists: $(ALL_LSTMF)
train-lists: data/list.train data/list.eval
Now you need to call "make train-lists" only once, when you start a new training session with new data (not after each "iteration step").
Ocrd by default does a 90/10 split (RATIO_TRAIN := 0.90). If you have some data (1000-10000 samples) do an 80/20 split; if you have a ton of data (100k+ samples) 90/10 or even 95/5 may be fine.
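Putting the two notes together, the start of a session might look like this (a sketch against my modified Makefile: "train-lists" is the target added above, RATIO_TRAIN is the variable ocrd already defines, overridden on the command line):

```shell
# Run once per training session, after the .lstmf files exist:
make train-lists RATIO_TRAIN=0.80   # 80/20 split for a smallish data set

# From here on, repeat your usual training invocations as often as you
# want; list.train and list.eval stay fixed for the whole session.
```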
About PSM. I did my training with PSM 6, but for one model (the most complex one, out of 8) I found that using PSM 13 when doing the recognition gives better results for punctuation and other special characters.
Again, I do not know how much difference the PSM parameter makes during training. From what I understand, PSM 6 applies some custom cleanup/preprocessing to the images, while PSM 13 leaves them (completely?) untouched.
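For recognition, at least, the page segmentation mode is just a command-line switch, so it is cheap to compare the two on your own images (file and model names here are placeholders):

```shell
# Compare PSM 6 (assume a single uniform block of text) with
# PSM 13 (raw line: treat the image as a single text line).
tesseract line.png out-psm6  --psm 6  -l mymodel
tesseract line.png out-psm13 --psm 13 -l mymodel
diff out-psm6.txt out-psm13.txt
```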
About the parameters you listed in your post: I know the meaning of a few of them, but I think that in general they are not very useful (or you need to understand a lot more before messing with them). What I mostly refer to is the output from lstmeval. "char train" and "word train" are the recognition error rates; these are probably the only ones to look at as a reference (but note they refer to the training data, not the eval data). "best char error" is the best so far; the training is noisy and goes up and down. "delta" is probably the variation from the previous output, and "rms" is the root mean square of something. In other words, you do not really need to understand all of them to do the training.
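For reference, this is roughly how an lstmeval run against the held-out list looks (paths and the model name are placeholders; the flags are the standard lstmeval ones):

```shell
# Evaluate the current checkpoint on the eval set, not the training set.
lstmeval \
  --model data/mymodel_checkpoint \
  --traineddata data/mymodel/mymodel.traineddata \
  --eval_listfile data/list.eval
```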
One iteration means one image, so max_iterations should be at least equal to the number of images you have. If you have a ton of images you may find that you do not need to process all of them to reach the "saturation" point where extra training is useless, but normally you want to process all of them, even a few times over (until the eval score stabilizes or gets worse for a few iterations).
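As a sketch, max_iterations is just a flag to lstmtraining (all paths here are placeholders, and I am assuming a fine-tuning run continued from an existing base model):

```shell
# Fine-tune and stop after (at least) one pass over ~20000 line images.
lstmtraining \
  --continue_from data/eng.lstm \
  --traineddata data/mymodel/mymodel.traineddata \
  --model_output data/mymodel \
  --train_listfile data/list.train \
  --eval_listfile data/list.eval \
  --max_iterations 20000
```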
One note: if you repeat the whole training multiple times (for example trying different image sizes) you need to keep the list.train/list.eval files aside, otherwise you end up comparing against a different set of eval images (and with a small data set this can make a big difference).
Another note: while you fine-tune (specialize) on a new font or fonts, you get a little worse on all the others. If you care about other fonts too, you should check them with lstmeval as well.
Bye
Lorenzo