--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/2ccbe310-2cc1-4ee9-b724-e1551d0e7daf%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
--
Hi Lorenzo,
Thanks for the suggestion. I began stepping up the iterations and measuring the results, but my box crashed (it looks like it ran out of memory) at 6K iterations, so I will need to prepare a larger server to continue this. I take your point about 'number of iterations' and characters repeating within the training text, but to ensure each character of each font is trained at least once, the 'number of iterations' must be at least ('lstmf count' * 'minimum training text lines that cover entire charset'). In your case, unless your short samples contained only 1 line of training text, I don't see how 50k iterations could see every character (at least once) for 50k samples...
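To make that lower bound concrete, here is a minimal sketch of the arithmetic in Python. It only restates the rule of thumb above and is not something taken from Tesseract itself. The function name is made up for illustration, and the example numbers (64 fonts, 73 lines) are the figures mentioned elsewhere in this thread, with the worst-case assumption that all 73 lines are needed to cover the charset.

```python
def min_iterations(lstmf_count: int, lines_covering_charset: int) -> int:
    """Lower bound on iterations under the rule of thumb in this post:
    every font page must be seen for every line needed to cover the charset.
    """
    if lstmf_count < 0 or lines_covering_charset < 0:
        raise ValueError("counts must be non-negative")
    return lstmf_count * lines_covering_charset


if __name__ == "__main__":
    # Assumed example values: 64 font pages, 73 training text lines.
    print(min_iterations(64, 73))  # 4672
```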
Re: the subset of files: I don't think these are randomized, because if I train 2 models on the same lstmf files for the same number of iterations, I get exactly the same test results for each on real-world data.
Not sure if this is relevant, but under tesseract 3 there apparently used to be a training limit of 64 fonts at a time. I wonder whether such a limit still applies to tesseract 4 lstmfs, or whether there is some reasonable relationship we can apply (between, say, 'training text lines', 'rarest character frequency', 'number of fonts' and 'number of iterations').
Until I can source a larger server to train until it peaks as you suggest, I think I'll try fine-tuning on, say, 64 fonts at a time, setting --old_traineddata to be the output of the last run each time.
--
Hi Lorenzo,
To clarify, my training text is 73 lines of words (with some numbers, punctuation etc.), each about 70 chars long including spaces. From this text I generated a tif/box set for each handwritten font (i.e. 1 font = handwritten characters from 1 author). I then used tesstrain.sh to generate lstmf files from these. So by 'lstmf count' I just mean the number of handwritten font tif pages, and given that each of my tif pages contains 73 lines of text, by your measure that's 73 samples per page (each containing a different subset of my charset).
I am running a script now which fine-tunes a batch of 64 fonts (i.e. pages with 73 lines) to 4000 iterations, then uses the resulting model as old_traineddata for the next batch. This will take several days to finish but should allow me to use all my training lines without running out of memory. I hadn't considered the font shift issue that you mentioned though. Presumably by this you mean that the accuracy on later-trained fonts will be better than that on earlier-trained fonts? If so, this would explain why printed text accuracy gets worse as I train on handwritten fonts.
Thinking about what you said about resuming from a certain iteration, though: I wonder if instead I could, say, train my first 64 fonts to 4000 iterations, leave my checkpoints in place, but set my training files list to the next 64 fonts (with the target raised to 8000 iterations) and resume? If this still skips the previous 4000 iterations and I don't use '--stop_training' until all my font batches are trained, would this prevent the shift towards the later fonts?
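For what it's worth, the bookkeeping for that resume-per-batch idea could be sketched roughly as below. This is only an illustration of the schedule (batches of 64 fonts, cumulative targets of 4000, 8000, ...), not a tested training script, and it says nothing about whether resuming this way actually avoids the shift towards later fonts. The helper names, file paths and font names are hypothetical; the lstmtraining flags used (--continue_from, --traineddata, --train_listfile, --model_output, --max_iterations) are the standard ones, but how they are combined here is an assumption.

```python
from typing import Iterator, List, Tuple


def batch_schedule(
    fonts: List[str], batch_size: int = 64, iters_per_batch: int = 4000
) -> Iterator[Tuple[List[str], int]]:
    """Yield (font batch, cumulative iteration target) pairs."""
    for start in range(0, len(fonts), batch_size):
        batch = fonts[start:start + batch_size]
        # Targets accumulate so each resume trains a further iters_per_batch.
        target = (start // batch_size + 1) * iters_per_batch
        yield batch, target


def lstmtraining_args(
    checkpoint: str, traineddata: str, listfile: str,
    model_output: str, max_iterations: int,
) -> List[str]:
    """Build an argv list for one resumed run (all paths are hypothetical)."""
    return [
        "lstmtraining",
        "--continue_from", checkpoint,
        "--traineddata", traineddata,
        "--train_listfile", listfile,
        "--model_output", model_output,
        "--max_iterations", str(max_iterations),
    ]


if __name__ == "__main__":
    fonts = [f"author{i:03d}" for i in range(130)]  # made-up font names
    for n, (batch, target) in enumerate(batch_schedule(fonts)):
        args = lstmtraining_args(
            "out/hand_checkpoint", "eng/eng.traineddata",
            f"lists/batch{n}.txt", "out/hand", target,
        )
        print(len(batch), "fonts ->", " ".join(args))
```

The commands are only printed here; each printed line would still need its list file written with the lstmf paths for that batch before being run.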
--
Hi Lorenzo,
If the previously trained data slowly gets overwritten, then I suppose there is a maximum number of font variations that can reasonably be contained within one model (I wondered why the traineddata always stays the same size).

In your case, where you have one-line images, presumably each one had its own unique training text supplied to tesstrain.sh when you created its lstmf file. Did your box files for each sample also map the individual characters?
I currently have a preprocess that combines individual handwritten character images into words and lines on the final tif. I only have one instance of each character per author though so can't get the character variation within a given line unless I start mixing authors. Looks like I'll need to rework this preprocess to create small unique varying samples like yours.
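One way the "mixing authors" variant of that preprocess might look is sketched below. It is not my actual script, just the selection step: for each character in a line, pick at random one of the authors who has a glyph for it. The function name, the layout of the `glyphs` mapping (author -> character -> image path) and the file paths are all hypothetical, and pasting the chosen glyph images onto a tif and writing the box file are left out.

```python
import random
from typing import Dict, List, Tuple


def mixed_author_line(
    text: str, glyphs: Dict[str, Dict[str, str]], rng: random.Random
) -> List[Tuple[str, str, str]]:
    """Return (char, author, glyph path) for each non-space character,
    choosing a random author who has that character available."""
    line = []
    for ch in text:
        if ch == " ":
            continue  # spaces are layout only, no glyph needed
        candidates = [a for a, chars in glyphs.items() if ch in chars]
        if not candidates:
            raise KeyError(f"no author has a glyph for {ch!r}")
        author = rng.choice(candidates)
        line.append((ch, author, glyphs[author][ch]))
    return line


if __name__ == "__main__":
    # Made-up glyph inventory for two authors.
    glyphs = {
        "author1": {"a": "author1/a.png", "b": "author1/b.png"},
        "author2": {"a": "author2/a.png"},
    }
    rng = random.Random(0)  # fixed seed so the sample set is reproducible
    for item in mixed_author_line("ab a", glyphs, rng):
        print(item)
```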
Do you mind me asking what level of accuracy this gives you for previously unseen handwriting?
BTW, that script I started yesterday is only 5% complete, so at this rate it will take 2 weeks to finish - maybe quicker to build a new server after all, I think :-)