tesseract data language model sources

abram stern

Oct 17, 2019, 11:40:33 PM

to tesseract-ocr

Hi tesseract community,

I'm working on a research project about OCR, and I'm wondering where the included data models (e.g., 'fast', 'best') come from. Put another way, what source material is used to train them? I haven't been able to find this documented anywhere, and I'd like to know whether it involves public domain corpora, data obtained through book scanning, or other sources.

Best regards,

Abram

Shree Devi Kumar

Oct 18, 2019, 12:10:25 AM

to tesseract-ocr

See https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/bdb45c2b-1764-4384-95e5-a5d884e2c5ab%40googlegroups.com.

--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Shree Devi Kumar

Oct 18, 2019, 12:11:14 AM

to tesseract-ocr

https://github.com/tesseract-ocr/langdata_lstm has the files used.
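For anyone who wants to look through those files locally, here is a minimal sketch that lists what a language directory contains. It assumes you have already cloned the repository and that the clone has one subdirectory per language code; the path `langdata_lstm`, the code `eng`, and the helper name `list_language_files` are illustrative assumptions, not something confirmed in this thread.

```python
from pathlib import Path


def list_language_files(repo_root, lang):
    """Return sorted (file name, size in bytes) pairs for one language directory.

    Assumes a local clone with one subdirectory per language code.
    That layout is an assumption; check it against your own checkout.
    """
    lang_dir = Path(repo_root) / lang
    if not lang_dir.is_dir():
        # Unknown language code or wrong repo path: nothing to report.
        return []
    return sorted(
        (p.name, p.stat().st_size) for p in lang_dir.iterdir() if p.is_file()
    )


if __name__ == "__main__":
    # Hypothetical path to a local clone of tesseract-ocr/langdata_lstm.
    for name, size in list_language_files("langdata_lstm", "eng"):
        print(f"{size:>12}  {name}")
```

The function returns an empty list rather than raising when the directory is missing, so it can be run over a list of language codes without extra error handling.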

abram stern

Oct 18, 2019, 1:00:59 AM

to tesser...@googlegroups.com

Thanks, this is exactly what I was looking for! -a

--

Abram Stern (aphid)

PhD Candidate, Film and Digital Media

University of California, Santa Cruz

ap...@ucsc.edu // a...@aphid.org ⚛ // (831) 224-0334 (mobile/signal)
