Tesseract to recognize images or shapes


achille sadjang

Apr 17, 2024, 12:47:52 AM
to tesseract-ocr

Hello everyone,

I have a question: is it possible to train Tesseract to recognize images or shapes? If so, could someone guide me on how to proceed?

Yaofu Zhou

May 21, 2024, 2:20:48 PM
to tesseract-ocr
Absolutely.
1. I would first design a mapping between the shapes and a set of Unicode characters, so that each shape maps to a single character.
2. I would procedurally generate at least a few thousand images of each shape, with variations, and label them with those Unicode characters.
3. Please take a look at Tesstrain, and particularly its Makefile, so that you know what the training process involves. I would go over the official Tesstrain documentation and run "make help" to see the inputs it needs.
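
Steps 1 and 2 could be sketched roughly as follows. This is only an illustration under several assumptions: the three shapes, the Unicode characters chosen for them, the image size, and the function names are all made up for the example, and the image-plus-`.gt.txt` file pairing follows the ground-truth layout Tesstrain documents, which should be checked against the current README. It uses only the Python standard library and writes a minimal grayscale PNG by hand.

```python
import random
import struct
import zlib
from pathlib import Path

# Step 1 (hypothetical mapping): one Unicode character per shape.
SHAPE_TO_CHAR = {"circle": "\u25CB", "square": "\u25A1", "diamond": "\u25C7"}

# Each outline is the set of pixels at "distance" r from the centre,
# where the distance function decides which shape you get.
NORMS = {
    "circle": lambda dx, dy: (dx * dx + dy * dy) ** 0.5,
    "square": lambda dx, dy: max(abs(dx), abs(dy)),
    "diamond": lambda dx, dy: abs(dx) + abs(dy),
}


def render(shape, rng, size=48):
    """Draw a dark outline on a white background with random size, offset and stroke."""
    r = rng.uniform(size * 0.25, size * 0.35)
    cx = size / 2 + rng.uniform(-3, 3)
    cy = size / 2 + rng.uniform(-3, 3)
    stroke = rng.uniform(1.0, 2.5)
    norm = NORMS[shape]
    return [
        [0 if abs(norm(x - cx, y - cy) - r) <= stroke else 255 for x in range(size)]
        for y in range(size)
    ]


def png_bytes(pixels):
    """Encode rows of 0-255 grayscale values as a minimal 8-bit PNG."""
    height, width = len(pixels), len(pixels[0])

    def chunk(tag, data):
        body = tag + data
        return struct.pack(">I", len(data)) + body + struct.pack(">I", zlib.crc32(body))

    raw = b"".join(b"\x00" + bytes(row) for row in pixels)  # filter type 0 on every row
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 0, 0, 0, 0)  # 8-bit grayscale
    return (
        b"\x89PNG\r\n\x1a\n"
        + chunk(b"IHDR", ihdr)
        + chunk(b"IDAT", zlib.compress(raw))
        + chunk(b"IEND", b"")
    )


def generate(out_dir, per_shape=3, seed=0):
    """Step 2: write <name>.png and <name>.gt.txt pairs; return the base names."""
    rng = random.Random(seed)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    names = []
    for shape, char in SHAPE_TO_CHAR.items():
        for i in range(per_shape):
            name = f"{shape}_{i:05d}"
            (out / f"{name}.png").write_bytes(png_bytes(render(shape, rng)))
            (out / f"{name}.gt.txt").write_text(char + "\n", encoding="utf-8")
            names.append(name)
    return names
```

For a real run one would call something like `generate("data/shapes-ground-truth", per_shape=2000)`, where `shapes` is a placeholder model name, and add far more variation (rotation, noise, blur, surrounding text) than this sketch does.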

Kassim Papa

May 26, 2024, 2:49:23 PM
to tesseract-ocr
I tried to do it. It led to multiple bugs. 

For example, it started recognizing the images fine but no longer recognized the usual letters.

Yaofu Zhou

May 26, 2024, 3:35:39 PM
to tesser...@googlegroups.com
Did you fine-tune an existing model or train a new model from scratch?

Fine-tuning without sufficient training material will degrade the performance of the base model. Also, you have to think carefully about how to disambiguate among, say, a circle, a zero, and the letter O. Sufficient context in the training set may help. For example, the letter o always appears within a word, while a circle usually stands alone. This is something an LSTM can learn, but you need a large, high-quality training set, which can be procedurally generated if you design the rules well.
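
As a rough illustration of what "designing the rules" might look like, the sketch below generates ground-truth text lines in which the letter o only occurs inside words, zeros only occur inside multi-digit numbers, and the circle always stands alone as its own token. The circle character, the word list, and the function name are assumptions made for the example; each line would still have to be rendered to an image, with the shape drawn wherever the circle character appears.

```python
import random

CIRCLE = "\u25CB"  # hypothetical label character for the circle shape
WORDS = ["color", "option", "photo", "room", "Oslo", "Ohio"]  # illustrative only


def make_line(rng, n_tokens=5):
    """Build one training line that follows the context rules described above."""
    tokens = []
    for _ in range(n_tokens):
        kind = rng.choice(("word", "number", "shape"))
        if kind == "word":
            tokens.append(rng.choice(WORDS))  # the letter o appears inside a word
        elif kind == "number":
            tokens.append(str(rng.randrange(100, 10000) * 10))  # always contains a zero
        else:
            tokens.append(CIRCLE)  # the circle stands alone
    return " ".join(tokens)


rng = random.Random(0)
for _ in range(3):
    print(make_line(rng))
```

A real rule set would need a much larger vocabulary and realistic layouts, but the idea is the same: the generator encodes the context you want the LSTM to pick up.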

If you train a new model dedicated to shapes from scratch, you can use it alongside other models for normal languages at the same time. However, you might not have control over how Tesseract OCR assigns priority when it sees a circle among letter O's and zeros.
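
Using a separate shapes model together with a language model could look roughly like this. The sketch assumes the shapes model was saved as `shapes.traineddata` (a made-up name) and that the `tesseract` executable is on the PATH; it relies on the command-line `-l` option, which accepts several models joined with `+`, and the optional `--tessdata-dir` option.

```python
import shutil
import subprocess


def build_command(image_path, models=("eng", "shapes"), tessdata_dir=None):
    """Build a tesseract command line that loads several models at once."""
    cmd = ["tesseract", str(image_path), "stdout", "-l", "+".join(models)]
    if tessdata_dir is not None:
        # Folder that holds the .traineddata files, including the hypothetical shapes model.
        cmd += ["--tessdata-dir", str(tessdata_dir)]
    return cmd


def run_ocr(image_path, **kwargs):
    """Run tesseract and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_command(image_path, **kwargs),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

For example, `run_ocr("page.png", tessdata_dir="./tessdata")` would run `eng+shapes` on `page.png`. This does not address the priority question raised above; it only shows how the two models can be loaded together.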

