Extracting checkboxes in Tesseract

Winston Shaji Jacob

Mar 11, 2021, 7:56:40 PM

to tesseract-ocr

I'm surprised there's no easy way to extract marked and unmarked checkboxes (ballot boxes),
basically U+2610 ☐ and U+2612 ☒.
I can't figure out how to make Tesseract recognize them.

Antonio Felipe Cechi

Mar 12, 2021, 7:01:48 AM

to tesseract-ocr

I really don't know if it's the correct way, but I achieved this with fine-tuning.

If there is a better way, I would be happy to know.

Winston Shaji Jacob

Apr 11, 2021, 5:46:22 PM

to tesseract-ocr

How did you fine-tune it?

Netão

Apr 12, 2021, 3:56:14 PM

to tesser...@googlegroups.com

Oh boy....

Well, there are some steps to follow (again, I put this together by searching on Google; if someone knows a better way, please let me know). I'll enumerate them with a short description each. If you need more details, we can talk later.

1. Prepare the training set: you'll need some examples to work with. The more, the merrier. After that, standardize the training set. I got better results with 300 dpi images in TIFF format.
2. Process the training set: one of the mistakes I made was applying some filters to the images I wanted to recognize and not applying the same filters to the training set. If you use any preprocessing or filter (I used binarization and noise removal), you need to apply it to the training set as well.
3. Create the truth files: the training is based on these truth files. In early versions of Tesseract, you had to cut up the images and provide text files. It's easier now: you can create .box files from your images using tesseract itself. The command is: tesseract <image>.tiff <output_name> -l <language> wordstrbox
4. Correct the .box files: go through the truth files (these .box files) and correct them. They are the basis for the fine-tuning. If the output was an "a" and it should be an "s", change it in these files.
5. Create the training files: after correcting every box file in the training set, create the training files. The command is: tesseract <image>.tiff <output_name> lstm.train
6. Generate the training list file: no mystery here, the training requires a file listing the paths of ALL the .lstmf files created in the previous step. On Linux, you can do this with: ls -1 *.lstmf > all_lstmf.txt
7. Tuning: now comes the real training. The command is:

lstmtraining \
  --model_output <path_output> \
  --continue_from <path_language_lstm> \
  --traineddata <path_traineddata> \
  --train_listfile <path_all_lstmf.txt> \
  --max_iterations <max_iterations>

Some notes on the command above: you'll need the lstm file for the language you are fine-tuning. You can get it from the Tesseract GitHub (ALWAYS USE THE BEST FOLDER, i.e. tessdata_best). You need the traineddata for this language too. Again, use the BEST version.
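For the checkbox case, the "correct the .box files" step mostly means replacing whatever Tesseract guessed for a box with ☐ or ☒. Here is a minimal Python sketch of that edit. It assumes the wordstrbox layout in which each text line is stored as "WordStr <left> <bottom> <right> <top> <page> #<transcription>"; the function name set_wordstr_text and the sample coordinates are made up for illustration and are not part of Tesseract.

```python
UNCHECKED = "\u2610"  # ☐ BALLOT BOX
CHECKED = "\u2612"    # ☒ BALLOT BOX WITH X


def set_wordstr_text(box_text, line_index, new_text):
    """Return box_text with the transcription of one WordStr entry replaced.

    line_index counts only the lines that start with "WordStr " (0-based).
    Assumption: the transcription is everything after the first "#".
    """
    lines = box_text.split("\n")
    seen = -1
    for i, line in enumerate(lines):
        if not line.startswith("WordStr "):
            continue  # skip the tab-prefixed end-of-line marker lines
        seen += 1
        if seen == line_index:
            head = line.partition("#")[0]  # keep the coordinates untouched
            lines[i] = f"{head}#{new_text}"
            return "\n".join(lines)
    raise IndexError(f"box file has no WordStr entry number {line_index}")


if __name__ == "__main__":
    # Hypothetical box content: the empty checkbox was recognized as "[]".
    sample = (
        "WordStr 10 20 300 60 0 #[] Yes\n"
        "\t 300 20 304 60 0\n"
    )
    print(set_wordstr_text(sample, 0, UNCHECKED + " Yes"))
```

In practice you would read each .box file as UTF-8, apply the corrections, and write it back before running the lstm.train step. This sketch does not check whether the base model's character set already contains ☐ and ☒; the steps in this post do not cover that.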

After the training finishes, create the traineddata for the new fine-tuned language:

lstmtraining \
  --stop_training \
  --continue_from <path_output>_checkpoint \
  --traineddata <path_traineddata> \
  --model_output <path_output_new_language>.traineddata

With these steps, you'll have a new .traineddata file. Put it in your tessdata directory and you're ready to go.
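If you have many images, the command sequence from this post can be scripted. The following is a rough Python sketch, not a tested pipeline: it only builds the argument lists for the commands quoted in this post and writes the .lstmf list file. All function names are hypothetical, and using the image path without its extension as the output name (so that the .box and .lstmf files end up next to the image) is an assumption of the sketch.

```python
from __future__ import annotations

from pathlib import Path


def box_command(image: Path, lang: str) -> list[str]:
    """tesseract <image>.tiff <output_name> -l <language> wordstrbox"""
    base = image.with_suffix("")  # assumption: output next to the image
    return ["tesseract", str(image), str(base), "-l", lang, "wordstrbox"]


def lstmf_command(image: Path) -> list[str]:
    """tesseract <image>.tiff <output_name> lstm.train"""
    base = image.with_suffix("")
    return ["tesseract", str(image), str(base), "lstm.train"]


def write_listfile(train_dir: Path, listfile: Path) -> int:
    """Write one absolute .lstmf path per line; return how many were found."""
    paths = sorted(p.resolve() for p in train_dir.glob("*.lstmf"))
    listfile.write_text("".join(f"{p}\n" for p in paths), encoding="utf-8")
    return len(paths)


def train_command(model_output, continue_from, traineddata,
                  listfile, max_iterations) -> list[str]:
    """The fine-tuning call, with the flags used in this post."""
    return [
        "lstmtraining",
        "--model_output", str(model_output),
        "--continue_from", str(continue_from),
        "--traineddata", str(traineddata),
        "--train_listfile", str(listfile),
        "--max_iterations", str(max_iterations),
    ]


def finalize_command(model_output, traineddata, new_traineddata) -> list[str]:
    """The --stop_training call that turns the checkpoint into a traineddata."""
    return [
        "lstmtraining",
        "--stop_training",
        "--continue_from", f"{model_output}_checkpoint",
        "--traineddata", str(traineddata),
        "--model_output", str(new_traineddata),
    ]
```

Each list can be passed to subprocess.run(cmd, check=True). The box files still have to be corrected by hand (or by your own script) between the wordstrbox step and the lstm.train step.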

I may have missed something, since I'm doing this from memory, but I'm almost sure that's all I did.

Hope this helps.

Best regards.


--

Netão

“The trouble with being punctual is that nobody's there to appreciate it.”

Franklin P. Jones
