Armenian.traineddata hye language tesseract

René JM Clais

Oct 8, 2023, 2:38:57 PM

to tesseract-ocr

I have found that the official hye.traineddata does not include the letter և.
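One way to confirm this is to unpack the model (for example with `combine_tessdata -u hye.traineddata hye.`) and look for the character in the resulting `hye.lstm-unicharset` file. The sketch below is a minimal check, assuming the usual unicharset layout: a first line with the entry count, then one entry per line starting with the symbol. The helper names are hypothetical, and the check only looks for an exact match, so it will not detect a character that the model stores in a decomposed or normalized form.

```python
from pathlib import Path


def unicharset_symbols(path):
    """Return the set of symbols listed in a Tesseract unicharset file.

    Assumes the first line is the entry count and every following line
    starts with the symbol, followed by space-separated properties.
    """
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return {line.split(" ", 1)[0] for line in lines[1:] if line.strip()}


def missing_characters(path, wanted):
    """Return the characters from `wanted` that are not in the unicharset."""
    symbols = unicharset_symbols(path)
    return [ch for ch in wanted if ch not in symbols]


# Example call (the file name assumes the unpacking step described above):
# missing_characters("hye.lstm-unicharset", "և")
```

An empty list means every requested character has its own entry in the unicharset.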

Has anyone else run into the same problem? If so, what is the workaround?

Thanks in advance for any answer.

Des Bw

Oct 15, 2023, 2:39:53 AM

to tesseract-ocr

Check the conversation in this forum where Schree trained the Norwegian data to include the missing letter Æ. I used this method to train for Amharic, and it worked for me.

Basically, the method is to cut off the top layer of the network and train from there.

Fine-tuning doesn't work for adding missing letters.

René JM Clais

Oct 20, 2023, 6:44:40 AM

to tesser...@googlegroups.com

I have no idea what you mean by 'cut off the top layer'.

Is there documentation about this process somewhere?

I am a Tesseract user, not (yet) a Tesseract specialist.

Des Bw

Oct 20, 2023, 8:43:04 AM

to tesseract-ocr

The training documentation describes three options:

- Fine tune. Starting with an existing trained language, train on your specific additional data. This may work for problems that are close to the existing training data, but different in some subtle way, like a particularly unusual font. May work with even a small amount of training data.
- Cut off the top layer (or some arbitrary number of layers) from the network and retrain a new top layer using the new data. If fine tuning doesn't work, this is most likely the next best option. Cutting off the top layer could still work for training a completely new language or script, if you start with the most similar looking script.
- Retrain from scratch. This is a daunting task, unless you have a very representative and sufficiently large training set for your problem. If not, you are likely to end up with an over-fitted network that does really well on the training data, but not on the actual data.

https://tesseract-ocr.github.io/tessdoc/tess5/TrainingTesseract-5.html
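In practice, "cutting off the top layer" is done with the `--append_index` and `--net_spec` options of `lstmtraining`: the existing network is cut at the given layer index and the layers described by the net spec are appended and trained in its place. The sketch below only assembles such a command line so the pieces are visible. All file paths, the index 5, the `Lfx256` layer size and the class count are illustrative assumptions, not values taken from the hye model; the right index depends on the network spec of the model you start from, and the traineddata file must be one built with a unicharset that already contains the missing letter.

```python
import shlex


def build_layer_retrain_command(
    base_lstm,
    traineddata,
    train_listfile,
    model_output,
    append_index=5,        # assumption: layer index at which to cut
    num_classes=1,         # assumption: placeholder, see the training docs
    max_iterations=10000,  # assumption: arbitrary stopping point
):
    """Build an lstmtraining argv that retrains the layers above append_index."""
    # New top of the network: one LSTM layer plus a softmax output layer.
    net_spec = f"[Lfx256 O1c{num_classes}]"
    return [
        "lstmtraining",
        "--continue_from", base_lstm,      # LSTM extracted from the old model
        "--traineddata", traineddata,      # starter data with the new unicharset
        "--append_index", str(append_index),
        "--net_spec", net_spec,
        "--train_listfile", train_listfile,
        "--model_output", model_output,
        "--max_iterations", str(max_iterations),
    ]


# Hypothetical paths, for illustration only.
cmd = build_layer_retrain_command(
    base_lstm="extracted/hye.lstm",
    traineddata="train/hye/hye.traineddata",
    train_listfile="train/hye.training_files.txt",
    model_output="output/hye_layer",
)
print(shlex.join(cmd))
```

The base LSTM file would come from the existing model, for example via `combine_tessdata -e hye.traineddata extracted/hye.lstm`. Printing the command instead of running it makes it easy to review before starting a long training job.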

Des Bw

Oct 20, 2023, 8:49:05 AM

to tesseract-ocr

I have exactly the same problem as you, and I am not a Tesseract specialist either. I have been experimenting with various setups.

Training from a layer seems to be the best option for introducing a missing character. But I am still struggling, because I am not getting the same accuracy as the default Best model.

- I have been training with 400,000 text lines. The model gives good accuracy on the synthetic data but terrible output on scanned documents.
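To put a number on that gap, it helps to score the same model separately on a synthetic set and on a small hand-transcribed set of scanned lines. The sketch below computes a character error rate from (ground truth, OCR output) pairs using plain Levenshtein distance. It is a generic measure, not the evaluation that Tesseract's own training tools report, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def character_error_rate(pairs):
    """Total edit distance divided by total reference length.

    `pairs` is an iterable of (ground_truth, ocr_output) strings.
    """
    pairs = list(pairs)
    errors = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    total = sum(len(ref) for ref, _ in pairs)
    return errors / total if total else 0.0
```

Running `character_error_rate` once on synthetic lines and once on scanned lines gives two comparable figures, which makes it easier to see whether a change in the training setup actually helps on real documents.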

Training Tesseract is a very daunting task. I spent many weeks on it and did not get satisfactory results. You need to experiment with various setups and see the outcomes.
