The docs are pretty bad so I'm not surprised you didn't find an answer.
We also needed to train against an unusual font, so here's our experience. Your situation might be different.
1. The training data needs to be much, much bigger than 100 lines. We took the ".wordlist" file from the language data directory, added our own words to the top, and used that to generate the ground truth. It's about 50,000 lines.
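A minimal sketch of that first step. The paths, filenames, and example words here are assumptions for illustration, not the exact ones we used:

```python
# Hypothetical sketch: build a training text by prepending our own
# domain-specific words to the language's ".wordlist" file.

def build_training_text(wordlist_path, extra_words, out_path):
    with open(wordlist_path, encoding="utf-8") as f:
        base = [line.strip() for line in f if line.strip()]
    # Our own words go to the top so they are guaranteed to be covered.
    lines = list(extra_words) + base
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")
    return len(lines)

# Usage (assumed paths):
# n = build_training_text("eng.wordlist", ["ACME", "Widget-3000"], "training_text.txt")
```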
2. Each line should be rendered separately into three files in the ground-truth directory: an image, a .gt.txt file containing the text in question, and a .box file. That's three files for each of the 50,000 lines, 150,000 files in total.
Unfortunately, text2image would not work for our specific font, so we ended up writing our own code to generate the image and box files. It reads the wordlist file line by line, renders an image of the text line, and uses the font info to extract the character boundaries. (Unlike text2image, our program figures out the overall bounding rectangle of the text, adds a margin, and creates the image at exactly the right size. text2image, at least in my experience, often creates a huge image with mostly whitespace around the text.)
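To make the .box side of this concrete, here's a simplified sketch. Our real tool read per-glyph boundaries from the font; to keep this example self-contained I fake them with fixed-width metrics (an assumption -- a proportional font needs the real widths). The one genuinely Tesseract-specific detail is the box format and its bottom-left origin:

```python
# Assumed pixel metrics for a pretend monospace font.
CHAR_W, CHAR_H, MARGIN = 10, 16, 4

def make_box_lines(text, img_height):
    """Emit Tesseract .box lines: 'char left bottom right top page'.
    Box coordinates use a bottom-left origin, so y values must be
    flipped relative to the usual top-left image coordinates."""
    lines = []
    x = MARGIN
    for ch in text:
        if ch != " ":  # spaces get no box entry
            left, right = x, x + CHAR_W
            bottom = img_height - (MARGIN + CHAR_H)  # flip y-axis
            top = img_height - MARGIN
            lines.append(f"{ch} {left} {bottom} {right} {top} 0")
        x += CHAR_W
    return lines

# The image itself is sized tight to the text plus a margin, mirroring
# the approach described above:
#   width  = 2 * MARGIN + CHAR_W * len(text)
#   height = 2 * MARGIN + CHAR_H
```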
3. Use the 50,000 sets of ground-truth files to train the model.
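One way to run that last step is the tesstrain Makefile (github.com/tesseract-ocr/tesstrain), which picks up the image/.gt.txt pairs from the ground-truth directory. The model name, tessdata path, and iteration count below are placeholders, not a recommendation:

```shell
# Sketch, assuming the tesstrain repo is checked out and ground truth
# sits in data/myfont-ground-truth (its default layout).
make training \
  MODEL_NAME=myfont \
  START_MODEL=eng \
  TESSDATA=/usr/share/tesseract-ocr/5/tessdata \
  MAX_ITERATIONS=10000
```

Fine-tuning from an existing model (START_MODEL) converged much faster for us than training from scratch.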
Hope that helps.
Danny