What is the purpose of trained data files present under tessdata/script folder

Vikas Goel

Jul 19, 2018, 2:29:54 PM
to tesseract-ocr
After installing Tesseract, I see trained data files under "C:\Program Files (x86)\Tesseract-OCR\tessdata" as well as under "C:\Program Files (x86)\Tesseract-OCR\tessdata\script". As per my understanding, the Tesseract engine uses the files under "C:\Program Files (x86)\Tesseract-OCR\tessdata". Please confirm, and let me know the purpose of the trained data files under "C:\Program Files (x86)\Tesseract-OCR\tessdata\script".

Shree Devi Kumar

Jul 20, 2018, 12:02:39 AM
to tesser...@googlegroups.com
Files in tessdata are for a particular language, e.g. Hindi, Sanskrit, Marathi, Nepali.

Files in tessdata/script are for a particular script used for writing those languages, e.g. Devanagari.

Also note that most script files also include support for English.

So, if you have a document with Hindi + English + Sanskrit, you can use Devanagari.traineddata.

In some cases you may find it works better than the language data.
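
To make the difference concrete, here is a rough sketch in Python that only builds the two command lines, one using per-language data and one using the script data. The names page.png and out are placeholders, the helper build_tesseract_command is hypothetical, and the value script/Devanagari assumes the script file is selected by its path relative to the tessdata folder; please verify that against your installation (for example with tesseract --list-langs).

```python
def build_tesseract_command(image, output_base, models):
    """Return the argument list for a tesseract run using the given models."""
    # Several models are combined into one -l value, joined with '+'.
    return ["tesseract", image, output_base, "-l", "+".join(models)]


# Per-language data: one traineddata file for each language in the document.
per_language = build_tesseract_command("page.png", "out", ["hin", "eng", "san"])

# Script data: a single file from the tessdata/script subfolder
# (assumed to be addressed as 'script/Devanagari').
per_script = build_tesseract_command("page.png", "out", ["script/Devanagari"])

print(" ".join(per_language))
print(" ".join(per_script))
```

Running both commands on the same page and comparing the output is the simplest way to see which data works better for your documents.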

chandra churh chatterjee

Jul 20, 2018, 12:35:27 AM
to tesser...@googlegroups.com
The traineddata files present under the script folder are located through the TESSDATA_PREFIX environment variable, so that when running tesseract.exe from the command line, those trained data files can be selected with the "-l" option.
For example:
Command: tesseract.exe input out -l eng --oem 1
This command makes Tesseract 4 use eng.traineddata for recognition.
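
As a rough illustration of how a "-l" value relates to a file on disk, here is a small Python sketch. It assumes Tesseract finds the file by appending ".traineddata" to the "-l" value inside the tessdata folder; that is my understanding, not something confirmed in this thread. The helper traineddata_path is hypothetical, and the folder is the one from the original question.

```python
from pathlib import PureWindowsPath


def traineddata_path(tessdata_dir, name):
    """Map a -l value to the file it would correspond to under tessdata_dir."""
    # Assumption: the file name is the -l value plus '.traineddata'.
    return PureWindowsPath(tessdata_dir) / f"{name}.traineddata"


TESSDATA = r"C:\Program Files (x86)\Tesseract-OCR\tessdata"

# A language code maps to a file directly in the tessdata folder.
print(traineddata_path(TESSDATA, "eng"))

# A value with a 'script/' prefix maps to a file in the script subfolder.
print(traineddata_path(TESSDATA, "script/Devanagari"))
```

Under that assumption, "-l eng" never touches the script subfolder; only a value that includes the subfolder name would.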

Chandra Churh Chatterjee

Vikas Goel

Jul 20, 2018, 7:22:21 AM
to tesseract-ocr
Thanks Shree

Vikas Goel

Jul 20, 2018, 7:24:18 AM
to tesseract-ocr
Thanks Chandra. But I think the command you mentioned is the default command, which will use the trained data under the tessdata folder, not the data under the tessdata/script folder.

Vikas Goel

Jul 20, 2018, 7:25:18 AM
to tesseract-ocr
So, if I need to use the trained data under the script folder, do I need to copy it to the tessdata folder, or do I need to specify a particular language code to use the script data?

