Original training data for eng.traineddata


Duy Khanh

Jun 20, 2023, 10:38:33 PM
to tesseract-ocr
Hi. Is the existing eng.training_text in langdata_lstm the full text corpus used for training eng.traineddata? Is there a list of the fonts used for generating the images?

Zdenko Podobny

Jun 21, 2023, 1:15:08 AM
to tesser...@googlegroups.com
With the open-source data alone, you will not be able to create (from scratch) traineddata of the same quality as what Google provided.
However, some projects have successfully fine-tuned the Google model, e.g. (UB-Mannheim/: https://madoc.bib.uni-mannheim.de/53748/ )


Zdenko


