Training without font files

aaron...@stonybrook.edu

Jan 26, 2019, 3:06:59 AM

to tesser...@googlegroups.com

Hello,

I’m trying to train Tesseract 4 using images (and associated box files). I can’t pinpoint the font name and prefer to avoid sourcing the font itself.

I’m currently trying to train on macOS High Sierra, but I also have access to Trisquel and Windows 8.1.

The directions on the wiki seem to assume either considerable prior knowledge of how the training commands are structured (and why each one matters) or a leap of faith with trial and error.

Any help would be most appreciated.

-Aaron

Shree Devi Kumar

Jan 26, 2019, 3:28:27 AM

to tesser...@googlegroups.com

Check out https://github.com/OCR-D/ocrd-train

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/C3A0C93E-FFBC-474C-87DC-C3F53F3F0F70%40stonybrook.edu.
For more options, visit https://groups.google.com/d/optout.

aaron...@stonybrook.edu

Jan 26, 2019, 12:15:51 PM

to tesser...@googlegroups.com

Thank you for the suggestion, but I have tried OCR-D Train previously and could not get even the training example to run. I get errors from make as well as ASCII encoding errors (likely from the included Python script). Might you have advice for accomplishing my initial goal without the helper app?

Shree Devi Kumar

Jan 29, 2019, 12:51:30 PM

to tesser...@googlegroups.com

>I’m trying to train Tesseract 4 using images (and associated box files). I can’t pinpoint the font name and prefer to avoid sourcing the font itself.

LSTM training is much easier with font files, since a large amount of training data is needed. You could try https://www.whatfontis.com/ to identify the font and then fine-tune with it.

The box files generated by Tesseract are NOT in the format needed for LSTM training. They will need to be modified by hand before they can be used for training.

I am attaching a modified version of the tesstrain bash scripts, which adds an OPTIONAL flag for specifying a directory of user-specified box/tiff pairs.

Files should be named similar to ${LANG_CODE}.${fontname}.exp${EXPOSURE}.box/tif

--my_boxtiff_dir MY_BOXTIFF_DIR # Location of user specified box/tiff files.
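Before pointing the flag at a directory, it may help to check that every file follows the naming convention and that each image has a matching box file. The following is only a rough sketch in Python, not part of the attached scripts: the regular expression is an assumption about what "named similar to" means (a language code without dots, a font name, and an integer exposure), and `parse_name` and `find_pairs` are hypothetical helper names.

```python
import re
import sys
from pathlib import Path

# Assumed reading of ${LANG_CODE}.${fontname}.exp${EXPOSURE}.box/tif:
# language code without dots, any font name, integer exposure.
NAME_RE = re.compile(
    r"^(?P<lang>[^.]+)\.(?P<font>.+)\.exp(?P<exposure>-?\d+)\.(?P<ext>box|tif)$"
)


def parse_name(filename):
    """Return the name parts as a dict, or None if the name does not match."""
    match = NAME_RE.match(filename)
    return match.groupdict() if match else None


def find_pairs(directory):
    """Return (complete_pairs, problems) for the files in `directory`.

    complete_pairs lists base names that have both a .box and a .tif file;
    problems lists human-readable messages for everything else.
    """
    extensions_by_base = {}
    problems = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        parts = parse_name(path.name)
        if parts is None:
            problems.append(f"{path.name}: name does not match the pattern")
            continue
        base = path.name[: -(len(parts["ext"]) + 1)]  # strip ".box" / ".tif"
        extensions_by_base.setdefault(base, set()).add(parts["ext"])

    pairs = []
    for base, extensions in sorted(extensions_by_base.items()):
        missing = {"box", "tif"} - extensions
        if missing:
            problems.append(f"{base}: missing .{sorted(missing)[0]} file")
        else:
            pairs.append(base)
    return pairs, problems


if __name__ == "__main__" and len(sys.argv) > 1:
    ok, bad = find_pairs(sys.argv[1])
    print(f"{len(ok)} complete box/tif pair(s)")
    for message in bad:
        print("problem:", message)
```

Saved under any file name and run with the directory as its only argument, it prints the number of complete pairs and one line per file that is misnamed or unpaired. It checks names only; it does not inspect the contents of the box files.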

Attachments: tesstrain.sh, tesstrain_utils.sh

Aaron Knoll

Jan 29, 2019, 2:59:20 PM

to tesser...@googlegroups.com

Thank you for the follow-up. Is it possible to fine-tune Tesseract without font files if I can't accurately source them (and without training from scratch)?
