Question about training data and psm

Neil Du Toit

Dec 1, 2021, 6:07:48 AM

to tesseract-ocr

Hey

I've got a simple question, and then I'll provide more context. I want to know whether I can fine-tune Tesseract using image/text pairs where each pair is only a single word.

My understanding is that training happens on "line-level" data (which is how tesstrain describes it). That clearly rules out multi-line input, but it doesn't necessarily rule out single words. Still, I suspect that if training expects a full line of text, feeding in single words might yield bad results.

It looks like tesstrain lets you set the PSM used for training, but does that change anything, given that training is always on line-level data?

I have looked for example ground truth sets on GitHub and found several. Most of the training examples are full lines, but I've seen the occasional single-word training pair.

The reason I want to use single words is that I have built a curation interface for fixing Tesseract errors after OCR, and the interface operates at the word level. So I am generating a word image / corrected text pair for every word that Tesseract gets wrong, and I want to feed this data back into fine-tuning Tesseract in a kind of batch, reinforcement-learning-style setup.
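
To make the data format concrete, here is a rough sketch of how the corrected pairs could be exported. It assumes the tesstrain ground-truth convention of an image file sitting next to a plain-text transcription with the same base name and a .gt.txt extension. The function name export_word_pairs, the (image_path, corrected_text) tuple shape, and the word_NNNNNN naming are made up for illustration and are not part of tesstrain or of the curation interface. Whether single-word pairs laid out this way train well is exactly the open question above.

```python
import shutil
from pathlib import Path


def export_word_pairs(pairs, out_dir):
    """Write word image / text pairs as <stem>.<ext> plus <stem>.gt.txt.

    pairs: iterable of (image_path, corrected_text) tuples. This shape is
    an assumption about what the curation interface can hand over.
    Returns the list of .gt.txt paths that were written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for index, (image_path, text) in enumerate(pairs):
        image_path = Path(image_path)
        # Collapse any whitespace or newlines so the transcription is one line.
        text = " ".join(text.split())
        if not text:
            # Skip pairs whose corrected text is empty.
            continue
        stem = f"word_{index:06d}"  # hypothetical naming scheme
        # Copy the image unchanged, keeping its original extension.
        shutil.copyfile(image_path, out_dir / f"{stem}{image_path.suffix}")
        gt_path = out_dir / f"{stem}.gt.txt"
        gt_path.write_text(text + "\n", encoding="utf-8")
        written.append(gt_path)
    return written
```

For example, export_word_pairs(pairs, "data/mymodel-ground-truth") would fill a directory named after a hypothetical model called mymodel, which is the layout tesstrain looks for as far as I understand it.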