Original training data for eng.traineddata


Duy Khanh

Jun 20, 2023, 10:38:33 PM

to tesseract-ocr

Hi. Is the existing eng.training_text in langdata_lstm the full text corpus that was used to train eng.traineddata? Is there a list of the fonts that were used to generate the training images?

Zdenko Podobny

Jun 21, 2023, 1:15:08 AM

to tesser...@googlegroups.com

With the open-sourced data you will not be able to create, from scratch, traineddata of the same quality as the one Google provided.

However, some projects have fine-tuned the Google model successfully, e.g. (UB-Mannheim/: https://madoc.bib.uni-mannheim.de/53748/ )

Zdenko

