Extracting checkboxes in Tesseract


Winston Shaji Jacob

Mar 11, 2021, 7:56:40 PM
to tesseract-ocr
I'm surprised there's no easy way to extract marked and unmarked checkboxes (ballot boxes),
basically U+2610 ☐ and U+2612 ☒.
I can't figure out how to make Tesseract recognize these.

Antonio Felipe Cechi

Mar 12, 2021, 7:01:48 AM
to tesseract-ocr
I really don't know if it's the correct way, but I achieved this with fine-tuning.

If there is a better way, I would be happy to know.

Winston Shaji Jacob

Apr 11, 2021, 5:46:22 PM
to tesseract-ocr
How did you fine-tune it?

Netão

Apr 12, 2021, 3:56:14 PM
to tesser...@googlegroups.com
Oh boy....

Well, there are some steps to follow (again, I pieced this together from Google searches; if someone knows a better way, please let me know). I'll list them with a short description; if you need more details, we can talk later.
  1. Prepare the training set: you'll need some examples to work with. The more, the merrier. After that, standardize the training set. I got better results with 300 dpi images in TIFF format.
  2. Process the training set: one of the mistakes I made was applying some filters to the images I wanted to recognize but not to the training set. If you use any preprocessing or filter (I used binarization and noise removal), apply it to the training set as well.
  3. Create the truth files: the training is driven by these truth files. In earlier versions of Tesseract, you had to cut up the images and provide some text files. It's easier now: you can create .box files from your images using tesseract itself. The command is tesseract <image>.tiff <output_name> -l <language> wordstrbox
  4. Correct the .box files: go through the truth files (these .box files) and fix them. They will be the basis for the fine-tuning. If the output was an "a" and it should be an "s", change it in these files.
  5. Create the training files: after correcting every .box file in the training set, create the training files. The command is tesseract <image>.tiff <output_name> lstm.train
  6. Generate the training list file: no mystery here, the training requires a file listing the paths of ALL the .lstmf files created in the previous step. On Linux, you can do this with the command ls -1 *.lstmf > all_lstmf.txt
  7. Tuning: now comes the real training. The command is:
lstmtraining \
--model_output <path_output> \
--continue_from <path_language_lstm> \
--traineddata <path_traineddata> \
--train_listfile <path_all_lstmf.txt> \
--max_iterations <max_iterations>
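
Steps 3, 5 and 6 from the list can be run over a whole folder instead of one image at a time. This is only a rough sketch: the train directory, the eng base language and the function names are my assumptions for illustration, not something from the steps themselves. Step 4 (correcting the .box files) stays manual and goes between make_boxes and make_lstmf.

```shell
# Assumed layout: preprocessed images in ./train, base language "eng".
TRAIN_DIR="${TRAIN_DIR:-train}"
BASE_LANG="${BASE_LANG:-eng}"

# Step 3: write <name>.box next to every <name>.tiff.
make_boxes() {
    for mb_img in "$TRAIN_DIR"/*.tiff; do
        [ -e "$mb_img" ] || continue    # skip the literal pattern if no images exist
        tesseract "$mb_img" "${mb_img%.tiff}" -l "$BASE_LANG" wordstrbox
    done
}

# Step 5: run only after the .box files have been corrected by hand (step 4).
# Writes <name>.lstmf next to every <name>.tiff.
make_lstmf() {
    for ml_img in "$TRAIN_DIR"/*.tiff; do
        [ -e "$ml_img" ] || continue
        tesseract "$ml_img" "${ml_img%.tiff}" lstm.train
    done
}

# Step 6: one .lstmf path per line, ready for --train_listfile.
make_listfile() {
    ls -1 "$TRAIN_DIR"/*.lstmf > "$TRAIN_DIR/all_lstmf.txt"
}
```

Typical use would be make_boxes, then the manual corrections, then make_lstmf followed by make_listfile.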

Some considerations about the lstmtraining command in step 7: you'll need the lstm file for the language you are fine-tuning. You can get it from the Tesseract GitHub (ALWAYS USE THE BEST MODELS, from the tessdata_best repository). You need the traineddata of this language too. Again, use the BEST.
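
As far as I know, the lstm file is not a separate download: it is packed inside the traineddata and can be pulled out with combine_tessdata -e, which picks the component from the extension of the output file. A small sketch, assuming the BEST traineddata is already on disk (extract_lstm is just a name I made up):

```shell
# Extract <lang>.lstm from <lang>.traineddata, writing it next to the input.
# $1 = path to the traineddata file (for example eng.traineddata).
extract_lstm() {
    el_traineddata="$1"
    if [ ! -f "$el_traineddata" ]; then
        echo "missing $el_traineddata" >&2
        return 1
    fi
    # The .lstm extension on the output path selects the LSTM component.
    combine_tessdata -e "$el_traineddata" "${el_traineddata%.traineddata}.lstm"
}
```

Calling extract_lstm eng.traineddata should leave eng.lstm in the same folder, and that path is what goes into --continue_from.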

After the training finishes, create the traineddata for the new fine-tuned language:
lstmtraining \
--stop_training \
--continue_from <path_output>_checkpoint \
--traineddata <path_traineddata> \
--model_output <path_output_new_language>.traineddata

With these steps, you'll have a new .traineddata file. Put it in your tessdata directory and you're ready to go.
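
For completeness, a sketch of that last step. The function name and the example file names are hypothetical; the language passed to -l is simply the file name without the .traineddata extension.

```shell
# Copy a fine-tuned model into a tessdata directory and run it on one image.
# $1 = fine-tuned .traineddata file, $2 = image, $3 = tessdata directory.
install_and_run() {
    ir_model="$1"
    ir_image="$2"
    ir_tessdata="$3"
    cp "$ir_model" "$ir_tessdata"/ || return 1
    # Language name = file name minus the .traineddata extension.
    ir_lang=$(basename "$ir_model" .traineddata)
    # "stdout" as the output base prints the recognized text to the terminal.
    tesseract "$ir_image" stdout --tessdata-dir "$ir_tessdata" -l "$ir_lang"
}
```

For example, install_and_run checkbox.traineddata form.tiff /path/to/tessdata would run the image through the new model under the language name checkbox.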

I may have missed something, as I'm writing this from memory, but I'm almost sure that's all I did.

Hope this helps.

Best regards.


--
Netão

“The trouble with being punctual is that nobody's there to appreciate it.”
Franklin P. Jones