Does the number in the .exp# file type matter?


Dan9er

Sep 23, 2017, 11:16:48 AM
to tesseract-ocr
I'm making a unicharset file so that I can compile DAWG dictionary files and use them with tesstrain.sh. I want to use multiple exposures (-1, 0, 1) for the tiff/box pairs. How should I name the files to keep the different exposures apart?

Can I do this?

lang.Arial.exp0
lang.Arial.exp1
lang.Arial.exp2

Or will changing the file numbers screw things up? As an alternative, can I do this?

lang.Arial0.exp0
lang.Arial1.exp0
lang.Arial2.exp0

ShreeDevi Kumar

Sep 23, 2017, 12:05:37 PM
to tesser...@googlegroups.com
You cannot use an arbitrary unicharset; it needs to be the same one that was used for training the model.

For multiple exposures, pass them to tesstrain.sh with --exposures:

training/tesstrain.sh \
  --fonts_dir /mnt/c/Windows/Fonts \
  --lang eng \
  --noextract_font_properties --linedata_only \
  --exposures "-1, 0, 1" \
  --langdata_dir ../langdata \
  --tessdata_dir ../tessdata \
  --fontlist \
    "Arial" \
    "Tahoma" \
    "Times New Roman," \
    "Sanskrit 2003," \
    "FreeSerif Italic" \
    "Times New Roman, Italic" \
  --output_dir ../tesstutorial/eng
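
On the naming question itself: as far as I can tell from tesstrain_utils.sh, the script names each rendered tif/box pair lang.fontname.expN, where N is simply the exposure value and the font name has its spaces replaced by underscores and its commas dropped. The sketch below only reproduces that pattern so you can see what to expect. The pattern is my reading of the script, so treat it as an assumption, and training_basename is a hypothetical helper, not something tesstrain.sh provides.

```shell
#!/usr/bin/env bash
# Sketch: print the tif/box base names expected for each font/exposure pair.
# Assumption: names follow <lang>.<fontname>.exp<N>, with spaces in the font
# name turned into underscores and commas removed.

# Hypothetical helper, not part of tesstrain.sh.
training_basename() {
  local lang=$1 font=$2 exposure=$3
  local fontname
  fontname=$(printf '%s' "$font" | tr ' ' '_' | sed 's/,//g')
  printf '%s.%s.exp%s\n' "$lang" "$fontname" "$exposure"
}

for font in "Arial" "Times New Roman, Italic"; do
  for exposure in -1 0 1; do
    training_basename eng "$font" "$exposure"
  done
done
```

Under that assumption, exposures -1, 0 and 1 for Arial would come out as eng.Arial.exp-1, eng.Arial.exp0 and eng.Arial.exp1, so there should be no need to renumber them or to encode the exposure in the font name.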


ShreeDevi

Dan9er

Sep 24, 2017, 9:35:38 AM
to tesseract-ocr
That answer doesn't help me.

How can I add dictionary files to tesstrain?

ShreeDevi Kumar

Sep 24, 2017, 12:08:32 PM
to tesser...@googlegroups.com
Please read tesstrain_utils.sh if you want to know the details.

Dictionary files are built from your sources in langdata. The unicharset is also built from your training_text in langdata.
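
If it helps, here is a rough way to check which source files are present before running tesstrain.sh. It assumes the files sit at langdata_dir/lang/lang.suffix and that the relevant suffixes are training_text, wordlist, punc and numbers. That list is my reading of tesstrain_utils.sh rather than a guarantee, and check_langdata is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch: report which langdata source files are present for one language.
# Assumption: files are laid out as <langdata_dir>/<lang>/<lang>.<suffix>,
# and the suffix list below may not match every tesstrain_utils.sh version.

# Hypothetical helper, not part of tesstrain.sh.
check_langdata() {
  local langdata_dir=$1 lang=$2 suffix path
  for suffix in training_text wordlist punc numbers; do
    path="$langdata_dir/$lang/$lang.$suffix"
    if [ -f "$path" ]; then
      echo "found   $path"
    else
      echo "missing $path"
    fi
  done
}

# Defaults mirror the --langdata_dir and --lang values used earlier in the thread.
check_langdata "${1:-../langdata}" "${2:-eng}"
```

Anything reported as missing is a source that a dictionary or the unicharset could not be built from, so add your own word list there rather than compiling DAWG files by hand.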
