Training with a large number of LSTMF files


ProgressNotPerfection

Sep 11, 2018, 8:57:34 AM
to tesseract-ocr
Hi Tesseract Group
I am trying to train Tesseract to recognize handwritten characters and have prepared several thousand lstmf files (from tif/box sets) so that I can fine-tune the best trained eng.traineddata. I read elsewhere on this forum that a low number of iterations (say 300-400) is recommended when fine-tuning, to avoid overfitting. In my case, though, it appears that if I choose a low number of iterations, only (approximately) that number of lstmf files gets loaded by the training process. I assumed that each iteration would be a training pass over all the lstmf files. Below is my script (which assumes my lstmf files are ready in trained_output_dir). How should I amend it so that it loads all my lstmf files? Should the number of iterations be greater than the number of lstmf files? Or is there a maximum number of lstmf files that can be used for training at once?

Any help would be much appreciated
Thanks

#! /bin/bash
#####################################################
# Script to finetune a language traineddata file for a set of
# pre built lstmf files and a starter traineddata
# for tesseract4.0.0-beta
# Modify directory paths and filenames as required for your setup.
#####################################################

Lang=eng
bestdata_dir=~/tesseract-ocr/tessdata_best
tesstrain_dir=~/tesseract-ocr/src/training
trained_output_dir=~/tesseract-ocr/src/training/eng-finetune-impact

echo "###### EXTRACT BEST LSTM MODEL ######"
combine_tessdata -e $bestdata_dir/$Lang.traineddata $bestdata_dir/$Lang.lstm

echo "###### LSTM TRAINING ######"
echo "#### running lstmtraining for finetuning from $bestdata_dir/$Lang.traineddata #####"

lstmtraining \
--continue_from  $bestdata_dir/$Lang.lstm \
--net_spec '[1,49,0,1 Ct3,3,16 Mp3,3 Lfys64 Lfx96 Lrx96 Lfx512 O1c78]' \
--old_traineddata  $bestdata_dir/$Lang.traineddata \
--traineddata    $trained_output_dir/$Lang/$Lang.traineddata \
--max_iterations 400 \
--debug_interval 0 \
--train_listfile $trained_output_dir/$Lang.training_files.txt \
--model_output  $trained_output_dir/finetune

echo "###### BUILD FINETUNED MODEL ######"
echo "#### Building final trained file $Lang-finetune-$Lang.traineddata  ####"
lstmtraining \
--stop_training \
--continue_from $trained_output_dir/finetune_checkpoint \
--old_traineddata  $bestdata_dir/$Lang.traineddata \
--traineddata    $trained_output_dir/$Lang/$Lang.traineddata \
--model_output "$trained_output_dir/$Lang-finetune-$Lang.traineddata"



Shree Devi Kumar

Sep 11, 2018, 10:18:39 AM
to tesser...@googlegroups.com
>  I assumed that each iteration would be a training pass over all the lstmf files

No. Each iteration is just one line of text in one font.

Change --debug_interval to -1 to see the details of each iteration:

--debug_interval -1 \
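
The per-iteration details are written to the console (stderr, I believe); if you want to study them later, you can capture them by appending a redirect like this to the end of the lstmtraining command (the log file name is arbitrary):

2>&1 | tee train.log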

Fine-tuning with 300-400 iterations may not be enough for handwriting.





ProgressNotPerfection

Sep 11, 2018, 11:13:29 AM
to tesseract-ocr
Thank you, Shree.
I ran with --debug_interval -1 as you suggested, and I can see 1 iteration showing 1 text line from a given font (lstmf file), then the next iteration showing 1 text line from the next font. This suggests I would need a number of iterations equal to [number of training_text lines] * [number of lstmf files] in order to use all my training data? E.g. if my training text is 100 lines and I have 2000 lstmf files, I need 200,000 iterations. Is that right?

Apologies if I am asking silly questions - I am new to tesseract training.

Lorenzo Bolzani

Sep 11, 2018, 11:28:21 AM
to tesser...@googlegroups.com

Hi, I trained with about 50k very short samples with no problems, going up to 50k iterations in several steps.

My suggestion is to train for a few iterations (like 1000), check the accuracy on the validation set (not on the training set), then set the next target to 2000 (so it trains for 1000 more), and so on, stopping when accuracy peaks.
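
An untested sketch of that loop, reusing the variables from your script (the eval list file is a placeholder you would have to create):

start=$bestdata_dir/$Lang.lstm
for target in 1000 2000 3000 4000; do
    lstmtraining \
      --continue_from $start \
      --old_traineddata $bestdata_dir/$Lang.traineddata \
      --traineddata $trained_output_dir/$Lang/$Lang.traineddata \
      --train_listfile $trained_output_dir/$Lang.training_files.txt \
      --model_output $trained_output_dir/finetune \
      --max_iterations $target
    # resume from the checkpoint on the next pass
    start=$trained_output_dir/finetune_checkpoint
    # measure on held-out data, not the training set
    lstmeval \
      --model $trained_output_dir/finetune_checkpoint \
      --traineddata $trained_output_dir/$Lang/$Lang.traineddata \
      --eval_listfile $trained_output_dir/$Lang.eval_files.txt
done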

I suppose, but I'm not sure about this, that the subset of files is randomized, so it picks a different set on each run. I hope so, or I'll have to do it all over again... Please let me know if you find out.

See here for more details on the train/check loop:


About the number of iterations: I don't think you can compute it, and it is not so important to visit all the samples. Each sample contains a lot of letters, with different frequencies. Even if you do not use all the samples, each letter is seen many times.
If your samples are generated from a normal static font, all the characters are identical, and the extra samples just add more cases of letter sequences and splits between words; even these, after a while, will start to repeat. This is not much different from training on the very same samples many times, and it leads to overfitting.
Rather than trying to guess or calculate the iterations, I think it's better to just measure the result.


Bye

Lorenzo


Shree Devi Kumar

Sep 11, 2018, 11:29:10 AM
to tesser...@googlegroups.com
Please see https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00 for details about training Tesseract 4.0.0.

A number of people have tried training for handwriting; hopefully they will chime in to provide guidance. You can also search the forum archives for the same.


ProgressNotPerfection

Sep 12, 2018, 7:15:09 AM
to tesseract-ocr
Hi Lorenzo
Thanks for the suggestion. I began stepping up the iterations and measuring the results, but my box crashed (it looks like it ran out of memory) at 6K iterations, so I will need to prepare a larger server to continue. I take your point about the number of iterations and characters repeating within the training text, but to ensure that each character of each font is trained at least once, the number of iterations must be at least ('lstmf count' * 'minimum training text lines that cover the entire charset'). In your case, unless your short samples contained only 1 line of training text each, I don't see how 50k iterations could see every character (at least once) across 50k samples...

Re: the subset of files, I don't think these are randomized, because if I train 2 models on the same lstmf files for the same number of iterations, I get exactly the same test results for each on real-world data.

Not sure if this is relevant, but under Tesseract 3 there apparently used to be a training limit of 64 fonts at a time. I wonder whether such a limit still applies to Tesseract 4 lstmf files, or whether there is some reasonable relationship we can apply (between, say, 'training text lines', 'rarest character frequency', 'number of fonts' and 'number of iterations').

Until I can source a larger server to train until it peaks as you suggest, I think I'll try fine-tuning on, say, 64 fonts at a time, setting --old_traineddata to the output of the last run each time.



Lorenzo Bolzani

Sep 12, 2018, 10:46:31 AM
to tesser...@googlegroups.com
On Wed, 12 Sep 2018 at 13:15, ProgressNotPerfection <jimqui...@gmail.com> wrote:
> Hi Lorenzo
> Thanks for the suggestion. I began stepping up the iterations and measuring the results, but my box crashed (it looks like it ran out of memory) at 6K iterations, so I will need to prepare a larger server to continue. I take your point about the number of iterations and characters repeating within the training text, but to ensure that each character of each font is trained at least once, the number of iterations must be at least ('lstmf count' * 'minimum training text lines that cover the entire charset'). In your case, unless your short samples contained only 1 line of training text each, I don't see how 50k iterations could see every character (at least once) across 50k samples...

Hi Jim,
I used a cloud server with 30GB of RAM, but I think I was able to run it locally too with 16GB (my images are quite small). IIRC the 50k training took around 20 hours (doing the multi-step training loads the samples over and over, which slows it down a lot).
Strictly speaking, for one single font, you can visit the whole alphabet using just one sample containing exactly the whole alphabet on one line. I suppose your sample images are single lines about 50-100 characters long with some spaces in between, so each image can potentially contain the whole alphabet, even more than once.
But you are not interested in characters only, but also in character splits and word splits, so you want many different combinations of characters. And, if you are using real-world text, character frequency also matters.

I don't get what you mean by 'lstmf count'. I think it is the number of samples, but I do not understand why you multiply it by the alphabet length. Maybe we are using the word 'sample' with different meanings: by sample I mean a single line, not a page or a single character.

My 50k samples contained approximately 50k * 15 individual characters (one short line each). My samples were from real-world "scanned" documents with several different fonts (hundreds), so each character was practically unique. At the same time, when you take almost a million characters there are not so many variations, and the value of adding more samples diminishes quickly, as you are adding just a small percentage of new data. In practice the difference between iteration 40k and 50k was minimal.

 
> Re: the subset of files, I don't think these are randomized, because if I train 2 models on the same lstmf files for the same number of iterations, I get exactly the same test results for each on real-world data.
 
Thanks for answering this. I did one more test to be sure before retraining everything, and there is no need to do it.
As you noticed, the training file list is not shuffled, but when you resume from a certain iteration the already-processed samples are skipped (even though they are reloaded each time). So it does visit the whole training set.

> Not sure if this is relevant, but under Tesseract 3 there apparently used to be a training limit of 64 fonts at a time. I wonder whether such a limit still applies to Tesseract 4 lstmf files, or whether there is some reasonable relationship we can apply (between, say, 'training text lines', 'rarest character frequency', 'number of fonts' and 'number of iterations').

I never used Tesseract 3, but I think there are no strict or soft limits on fonts in Tesseract 4. It would depend on too many factors, for example how much the fonts differ.

Tesseract 4 training is a standard neural network training, and the standard way to train is to keep going for as long as the validation result improves.

> Until I can source a larger server to train until it peaks as you suggest, I think I'll try fine-tuning on, say, 64 fonts at a time, setting --old_traineddata to the output of the last run each time.

Yes, there should be no difference in the final result. I would just suggest shuffling ALL your samples, otherwise the last fonts you use for training will shift the model towards those fonts, as if you were doing a fine-tuning on those specific fonts after training on the others.

This is what I would do: pick 20% of all the samples and put them aside as a validation set. Then split the remaining 80% into smaller batches according to the available memory. Train on all the batches (for a few iterations each) and then check the validation result. Repeat.
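
A rough sketch of that split with standard tools, assuming all your lstmf paths are collected in a hypothetical all_files.txt:

shuf all_files.txt > shuffled.txt                 # randomize sample order
total=$(wc -l < shuffled.txt)
nval=$((total / 5))                               # hold out 20% for validation
head -n $nval shuffled.txt > eval_files.txt
tail -n +$((nval + 1)) shuffled.txt > train_files.txt
split -l 1000 train_files.txt batch_              # batch size: whatever fits in memory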


Bye

Lorenzo


 
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.

To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.

ProgressNotPerfection

Sep 12, 2018, 1:44:53 PM
to tesseract-ocr
Hi Lorenzo
To clarify, my training text is 73 lines of words (with some numbers/punctuation etc.), each about 70 characters long including spaces. From this text I generated a tif/box set for each handwritten font (i.e. 1 font = handwritten characters from 1 author). I then used tesstrain.sh to generate lstmf files from these. So by 'lstmf count' I just mean the number of handwritten-font tif pages, and given that each of my tif pages contains 73 lines of text, by your measure that's 73 samples per page (each containing a different subset of my charset).

I am running a script now which fine-tunes a batch of 64 fonts (i.e. pages with 73 lines) to 4000 iterations, then uses the resulting model as old_traineddata for the next batch. This will take several days to finish, but it should allow me to use all my training lines without running out of memory. I hadn't considered the font-shift issue that you mentioned, though. Presumably by this you mean that the accuracy on later trained fonts will be better than that of earlier trained fonts? If so, this would explain why printed-text accuracy gets worse as I train on handwritten fonts.
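
Roughly, the idea is a loop like this (paths simplified; the batch list files batch_aa, batch_ab, ... are placeholders prepared beforehand, one lstmf path per line):

prev=$bestdata_dir/eng.traineddata
for batch in batch_*; do
    combine_tessdata -e $prev $batch.lstm        # extract the LSTM from the previous model
    lstmtraining \
      --continue_from $batch.lstm \
      --old_traineddata $prev \
      --traineddata $trained_output_dir/eng/eng.traineddata \
      --train_listfile $batch \
      --model_output $trained_output_dir/$batch \
      --max_iterations 4000
    lstmtraining \
      --stop_training \
      --continue_from $trained_output_dir/${batch}_checkpoint \
      --old_traineddata $prev \
      --traineddata $trained_output_dir/eng/eng.traineddata \
      --model_output $trained_output_dir/$batch.traineddata
    prev=$trained_output_dir/$batch.traineddata  # chain into the next batch
done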

Thinking about what you said about resuming from a certain iteration, though: I wonder if instead I could train my first 64 fonts to 4000 iterations, leave my checkpoints in place, but set my training files list to the next 64 fonts (at 8000 iterations) and resume? If this still skips the previous 4000 iterations, and I don't use '--stop_training' until all my font batches are trained, would this prevent the shift towards the later fonts?

Thanks
Jim


Lorenzo Bolzani

Sep 12, 2018, 3:19:39 PM
to tesser...@googlegroups.com
On Wed, 12 Sep 2018 at 19:44, ProgressNotPerfection <jimqui...@gmail.com> wrote:
> Hi Lorenzo
> To clarify, my training text is 73 lines of words (with some numbers/punctuation etc.), each about 70 characters long including spaces. From this text I generated a tif/box set for each handwritten font (i.e. 1 font = handwritten characters from 1 author). I then used tesstrain.sh to generate lstmf files from these. So by 'lstmf count' I just mean the number of handwritten-font tif pages, and given that each of my tif pages contains 73 lines of text, by your measure that's 73 samples per page (each containing a different subset of my charset).

Maybe I got it: an lstmf file can contain multiple lines/samples. I always used single-line images; I pre-cut the full pages into lines before moving to the Tesseract training.

So each author/font wrote one page with 73 lines of text. I suppose this text is similar to the text you'll want to recognize, like full words/sentences. For each page you have one lstmf.
 
> I am running a script now which fine-tunes a batch of 64 fonts (i.e. pages with 73 lines) to 4000 iterations, then uses the resulting model as old_traineddata for the next batch. This will take several days to finish, but it should allow me to use all my training lines without running out of memory. I hadn't considered the font-shift issue that you mentioned, though. Presumably by this you mean that the accuracy on later trained fonts will be better than that of earlier trained fonts? If so, this would explain why printed-text accuracy gets worse as I train on handwritten fonts.

Yes. When you fine-tune on new data you progressively lose the previous training; it gets slowly overwritten.

Ideally, the best thing you can do is to cut the pages into individual images, one for each line, generate an lstmf for each of them, and shuffle them all. In this way all fonts get a comparable amount of "priority".
If the handwritten "fonts" are reasonably similar, this line-level shuffle might not make a significant difference in the final result.

If you really need to preserve both printed and handwritten recognition, you should mix both kinds of text in the training data.
Obviously the more fonts you try to learn at the same time, the lower the accuracy will be, especially if the fonts differ a lot.
 
> Thinking about what you said about resuming from a certain iteration, though: I wonder if instead I could train my first 64 fonts to 4000 iterations, leave my checkpoints in place, but set my training files list to the next 64 fonts (at 8000 iterations) and resume? If this still skips the previous 4000 iterations, and I don't use '--stop_training' until all my font batches are trained, would this prevent the shift towards the later fonts?

No, it won't prevent the shift; the model is updated immediately. --stop_training simply converts a checkpoint into a traineddata file (a simple format conversion).

There is also another problem: from the test I did, when you resume from iteration 100, the first 100 lines of training_files.txt are skipped to avoid going over the same samples again (I think the training reads the current iteration count from the checkpoint file).
So if you did 4000 iterations with font A and use that checkpoint, it will start at line 4001 with font B (and never use the previous lines).

BTW, I noticed the flag --max_image_MB; I would give it a try.
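
For example, adding a line like this to the lstmtraining call (2000 is just a guess; tune it to your available RAM):

  --max_image_MB 2000 \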



While we are here: in the training output, the actual training is marked like this:

Iteration 7: ALIGNED TRUTH : MY WORD
Iteration 7: BEST OCR TEXT : MV WDRR


while

Loaded 1/1 pages (1-1) of document aaa.lstmf

is a message from the background file-loading thread, which gets randomly intermixed with the actual training output.

So you may see that the file aaa.lstmf has been loaded even though it is never used for training (you have to refer to the printed line content to see which files are actually used...).
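
For example, if you save the output to a log file, something like this keeps only the real training lines (the log name is arbitrary):

grep -E 'Iteration [0-9]+: (ALIGNED TRUTH|BEST OCR TEXT)' train.log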


Bye

Lorenzo

 

ProgressNotPerfection

Sep 13, 2018, 3:47:12 AM
to tesseract-ocr
Hi Lorenzo
If the previously trained data slowly gets overwritten, then I suppose there is a maximum number of font variations that can reasonably be contained within one model (I had wondered why the traineddata always stays the same size).

In your case, where you have one-line images, presumably each one had its own training text supplied to tesstrain.sh when you created its lstmf file. Did your box files for each sample also map the individual characters?

I currently have a preprocess that combines individual handwritten character images into words and lines on the final tif. I only have one instance of each character per author, though, so I can't get character variation within a given line unless I start mixing authors. It looks like I'll need to rework this preprocess to create small, unique, varying samples like yours.

Do you mind me asking what level of accuracy this gives you for previously unseen handwriting?

BTW, that script I started yesterday is only 5% complete, so at this rate it will take 2 weeks to finish. Maybe quicker to build a new server after all, I think :-)

Thanks
Jim



Lorenzo Bolzani

Sep 13, 2018, 7:18:07 AM
to tesser...@googlegroups.com
On Thu, 13 Sep 2018 at 09:47, ProgressNotPerfection <jimqui...@gmail.com> wrote:
> Hi Lorenzo
> If the previously trained data slowly gets overwritten, then I suppose there is a maximum number of font variations that can reasonably be contained within one model (I had wondered why the traineddata always stays the same size).

You are not going to get a hard number, not even a soft one. Consider these examples:

10 variations of the serif font (serif, serif sans, etc.)
10 classic latin fonts (arial, times, serif, verdana, courier, etc.)
6 mixed latin fonts (serif, comic, impact, gothic, etc.)
6 fonts: chinese, latin, thai, japanese, indian, arabic
2 handwritten fonts: here each individual character differs, as does each split between characters and between words
2 fonts: one printed and one handwritten

So it is not possible to come up with any kind of general metric.

Also, running the very same trained model on different validation texts is going to give you different results depending on the scan quality, image size, content, etc.

So even if someone says '5 fonts give 95% accuracy, 10 fonts 85% and 100 fonts 60%', it does not mean much, because you should ask: OK, what fonts are used in the validation text? The initial 5? A mix of all 100? In the latter case the scores would probably be the opposite of the ones I wrote here. And how much do those 100 fonts differ?

The standard Tesseract models are trained with many fonts, a few hundred I think.

And here I'm assuming a fixed model size: a bigger model can learn more fonts, but requires more data and more training time (and execution time will be slower).

About the traineddata size: characters are not actually stored inside the model. The model learns some "masks" that it later uses to match the individual characters. For digits the "masks" may look like the image below: blue (approximately) means "black pixels", red means "white pixels", black means "I do not care":

[attached image: learned weight masks for the ten digits, taken from an external article (link not preserved)]

The model will later pick the mask that most closely matches the given character. So every time the network is trained on a new sample, these blobs move just a little to better cover the new sample (and, for all the non-matching blobs, to not cover it). The next sample may even move them back. In the end they stabilize on some kind of "average" font.
This is from a very simple neural network; Tesseract is more complex, but the general idea is the same. The size and number of the blobs is fixed (it's in the network specification) and does not change with training. The blobs start with random values and move towards useful patterns. In the case of fine-tuning you start with good "blobs" and shift them just a little.


> In your case, where you have one-line images, presumably each one had its own training text supplied to tesstrain.sh when you created its lstmf file. Did your box files for each sample also map the individual characters?
 
With Tesseract 4 you do not need to map individual characters, just lines. I used pairs of files: one image containing one line, and one .gt.txt file with the text for that line. I used ocrd-train to generate the box files and the lstmf files. My images are tightly cropped around the text, and ocrd generates dumb box files covering the whole image.
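
So the training data is just pairs of files like this (the names are made up for illustration):

line_0001.tif      # tightly cropped image of a single text line
line_0001.gt.txt   # plain-text ground truth for that line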

> I currently have a preprocess that combines individual handwritten character images into words and lines on the final tif. I only have one instance of each character per author, though, so I can't get character variation within a given line unless I start mixing authors. It looks like I'll need to rework this preprocess to create small, unique, varying samples like yours.
 
I'm not sure if I got it right, but each line should come entirely from the same author/font.

For handwritten text I think it's not ideal to start from individual characters and join them. The training is not only about characters but also about learning where one character ends and the next begins, and how they connect to each other in all the possible variations, especially for cursive text. If you are using block letters it's less important, even less so if the characters come from a pre-printed form where the characters were already split.

I expect you want to recognize full words, not individual pre-split characters.

You might try to randomize a little the way the lines are generated, if you are not already doing this, for example by using random spacing between the letters/words and a slightly random vertical placement of the characters. You may also randomly alter each character a little: size, rotation, shear, contrast; make it bolder, blurry, noisy, etc. (see "data augmentation"). You want it to look like real-world data, with as few fixed/repeating patterns as possible. It's better to have some ugly samples than super clean ones.
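
As an untested sketch, such augmentation could be done with ImageMagick (the parameter values here are arbitrary examples, not tuned):

for f in line_*.tif; do
    # random rotation in roughly [-2, 2] degrees
    rot=$(awk -v s=$RANDOM 'BEGIN { srand(s); printf "%.2f", (rand() - 0.5) * 4 }')
    convert "$f" -background white -rotate "$rot" \
        -blur 0x0.5 -attenuate 0.2 +noise Gaussian "aug_$f"
done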

One more thing: pre-process the training data in the same way you'll process the real-world data (size, denoise, sharpen, etc.).

To be clear: I'm suggesting a few things that you could do to improve the results, but they are not mandatory. Maybe all together they will give you a 1% or 2% gain (to throw out a random number; it depends on your actual data). In other words, they may not be worth the effort.


> Do you mind me asking what level of accuracy this gives you for previously unseen handwriting?

I'm attaching the lstmeval log from the training on the 50k dataset (made in multiple steps). The best result I got was:

train3_57500: Eval Char error rate=1.2611253, Word error rate=3.6073043

Result with the eng_best model was:

Eval Char error rate=14.102503, Word error rate=22.452569

You can see that the very first iterations are the ones that really matter.
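
(For reference, numbers like these come from lstmeval runs along these lines; the checkpoint and list file names below are placeholders:)

lstmeval \
  --model train3_checkpoint \
  --traineddata eng/eng.traineddata \
  --eval_listfile eng.eval_files.txt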


> BTW, that script I started yesterday is only 5% complete, so at this rate it will take 2 weeks to finish. Maybe quicker to build a new server after all, I think :-)

You may consider cloud training; with $10 you may complete it in one day :)



Bye

Lorenzo

tess_log.txt

ProgressNotPerfection

Sep 14, 2018, 6:26:52 AM
to tesseract-ocr
So with, say, 2000 fonts (i.e. handwriting samples by 2000 authors), I suppose there's far more variation than the standard-sized Tesseract model is intended for. I did read that the network spec cannot be changed by fine-tuning, so maybe I should try training from scratch to create a bigger model.

Very interesting about having a single box around the whole image though; I didn't know you could do that. I started off following the Tesseract 3 training process after reading the https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00 page (which said the process is similar), so I assumed boxes had to be per character (I didn't think this mattered though, as I was intending to train just for block handwriting rather than cursive). It will take a bit of work for me to amend all my training data this way. I also read that mixing fonts within the same tif SHOULD be done in Tesseract 4, but I'm not sure whether that means on the same line or just within the same page.

If that eval text is handwriting previously unseen by your model, then the 'Char error rate=14.102503' figure is very good. The best I have so far is a char error rate of 30.294!

BTW, I spot-checked that long-running model I was training, and it seemed severely overfitted, so I have stopped the process for now.

Thanks
Jim



Shree Devi Kumar

Sep 14, 2018, 6:51:45 AM
to tesser...@googlegroups.com
> Very interesting about having a single box around the whole image though

That only works when the whole image is a single line of text. 

Example of a box file created by ocrd for a single-line image with the ground truth "Athāto Gobhiloktānām anyeshāṁ caiva karmaṇām". Note that it ends with a line containing a TAB character to mark the end of the line.

A 0 0 1905 114 0
t 0 0 1905 114 0
h 0 0 1905 114 0
ā 0 0 1905 114 0
t 0 0 1905 114 0
o 0 0 1905 114 0
  0 0 1905 114 0
G 0 0 1905 114 0
o 0 0 1905 114 0
b 0 0 1905 114 0
h 0 0 1905 114 0
i 0 0 1905 114 0
l 0 0 1905 114 0
o 0 0 1905 114 0
k 0 0 1905 114 0
t 0 0 1905 114 0
ā 0 0 1905 114 0
n 0 0 1905 114 0
ā 0 0 1905 114 0
m 0 0 1905 114 0
  0 0 1905 114 0
a 0 0 1905 114 0
n 0 0 1905 114 0
y 0 0 1905 114 0
e 0 0 1905 114 0
s 0 0 1905 114 0
h 0 0 1905 114 0
ā 0 0 1905 114 0
ṁ 0 0 1905 114 0
  0 0 1905 114 0
c 0 0 1905 114 0
a 0 0 1905 114 0
i 0 0 1905 114 0
v 0 0 1905 114 0
a 0 0 1905 114 0
  0 0 1905 114 0
k 0 0 1905 114 0
a 0 0 1905 114 0
r 0 0 1905 114 0
m 0 0 1905 114 0
a 0 0 1905 114 0
ṇ 0 0 1905 114 0
ā 0 0 1905 114 0
m 0 0 1905 114 0
1905 114 1906 115 0
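
If you need to create this style of box file yourself, a sketch like the following should work in a UTF-8 locale (it uses ImageMagick's identify to read the image size; the file names are placeholders):

W=$(identify -format '%w' line.tif)
H=$(identify -format '%h' line.tif)
# one box entry per character, each spanning the whole image
grep -o . line.gt.txt | while IFS= read -r c; do
    printf '%s 0 0 %d %d 0\n' "$c" "$W" "$H"
done > line.box
# final line starting with a TAB character to mark end of line
printf '\t %d %d %d %d 0\n' "$W" "$H" "$((W + 1))" "$((H + 1))" >> line.box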

Shree Devi Kumar

Sep 14, 2018, 7:05:37 AM
to tesser...@googlegroups.com
> So with, say, 2000 fonts (i.e. handwriting samples by 2000 authors), I suppose there's far more variation than the standard-sized Tesseract model is intended for. I did read that the network spec cannot be changed by fine-tuning, so maybe I should try training from scratch to create a bigger model.

>> Neural networks require significantly more training data and train a lot slower than base Tesseract. For Latin-based languages, the existing model data provided has been trained on about 400000 textlines spanning about 4500 fonts. For other scripts, not so many fonts are available, but they have still been trained on a similar number of textlines. Instead of taking a few minutes to a couple of hours to train, Tesseract 4.00 takes a few days to a couple of weeks. 

You can try the plus-minus fine-tune training, especially if your unicharset is fully covered by, or you need only a partial subset of, the eng or script/Latin traineddata. That way it should keep the printed-text support, though it will probably be biased in favor of handwriting. Try with a training set of 2000 lines, one from each author, and train for 4000 iterations.

You can try something like 

combine_tessdata -u ../tessdata_best/eng.traineddata   ../tessdata_best/eng.

/usr/bin/time -v  ~/tesseract/src/training/lstmtraining \
  --model_output ./plus_from_eng/plus \
  --continue_from ../tessdata_best/eng.lstm \
  --old_traineddata ../tessdata_best/eng.traineddata \
  --traineddata ./engsample/eng/eng.traineddata \
  --train_listfile ./engsample/eng.training_files.txt \
  --eval_listfile ./engtest/eng.training_files.txt \
  --debug_interval -1 \
  --max_image_MB 7000 \
  --max_iterations 4000

It is a quick experiment, worth a try.

vikram sareen

May 18, 2019, 1:29:27 PM
to tesseract-ocr
Hi Shree,
Did you manage to crack this? We are also trying to get handwritten text working for English, but with no luck so far. Any help and guidance would be truly appreciated.
Thanks in advance.
Regards,
Vikram

Shree Devi Kumar

May 18, 2019, 2:33:48 PM
to tesser...@googlegroups.com
No, I have not done handwriting training. Others who have tried can share whether they had success.


Timothy Snyder

May 19, 2019, 7:25:55 PM
to tesser...@googlegroups.com
I had moderate-to-good success fine-tuning the Tesseract 4 English model with handwriting samples from the IAM Handwriting Database.
