Multiple problems with fine-tuning


Hosein Khoshdel

Jul 16, 2018, 4:42:54 AM
to tesseract-ocr
Hi. Before asking my question, I want to thank Shree, whose comments are very helpful both here and in the Tesseract GitHub repo.

I want to fine-tune fas.traineddata to support some new fonts. The first problem arises when I use the following command:

tesstrain.sh --fonts_dir /c/folder/fonts/ --lang fas --noextract_font_properties --linedata_only --exposures "0" --langdata_dir ../langdata --tessdata_dir ../tessdata --fontlist "b nazanin" --output_dir ../../tessdata/fas/

I put fas.traineddata, which I downloaded from the tessdata_best repo, in the ../tessdata folder, but the command fails with an error saying that it cannot find eng.traineddata. The problem goes away when I put eng.traineddata in ../tessdata, but why does it need eng when I specify that the language is fas?

Anyway, for now I copied eng.traineddata there and moved on. The second problem is with the tiff/box pair generated by the above command. First of all, in some words in the tiff file the characters are not joined. For example, there is:

but it should be 

Another problem is that the generated box file is ordered left to right, but it should be RTL. This problem is addressed here, but I did not understand whether there is a solution for it or not.

Lastly, I am confused about the fine-tuning process. Is tesstrain.sh only for generating tiff/box pairs? What are the next steps? Is running lstmtraining.exe the next and final step?

By the way, I'm using:

tesseract 4.0.0-beta.3
 leptonica-1.76.0 (Jul 10 2018, 21:36:38) [MSC v.1900 LIB Debug x64]
  libgif 5.1.4 : libjpeg 9b : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found AVX
 Found SSE

which I built with VS2015. I'm also using Windows 8.1.
fas.b_nazanin.exp0.tif
fas.b_nazanin.exp0.box

Shree Devi Kumar

Jul 16, 2018, 12:38:38 PM
to tesser...@googlegroups.com
> first of all in some words in tiff files the characters are not joined.

Make sure to include ZWNJ and ZWJ in your unicharset.
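
One quick way to check this is to search the file for the UTF-8 byte sequences of the two joiners. The snippet below is only a sketch: the function name `check_joiners` and the example path are assumptions for illustration, and the same check can be pointed at the training text instead of the unicharset.

```shell
#!/usr/bin/env bash
# Sketch: report whether a text file (e.g. a unicharset or a training text)
# contains the two Unicode joiner characters.
# check_joiners is a hypothetical helper name, not part of Tesseract.

ZWNJ=$(printf '\342\200\214')   # U+200C ZERO WIDTH NON-JOINER (UTF-8: E2 80 8C)
ZWJ=$(printf '\342\200\215')    # U+200D ZERO WIDTH JOINER     (UTF-8: E2 80 8D)

check_joiners() {
  local file=$1
  # LC_ALL=C makes grep compare raw bytes; -F treats the pattern as a fixed string.
  if LC_ALL=C grep -qF -- "$ZWNJ" "$file"; then
    echo "ZWNJ: present"
  else
    echo "ZWNJ: missing"
  fi
  if LC_ALL=C grep -qF -- "$ZWJ" "$file"; then
    echo "ZWJ: present"
  else
    echo "ZWJ: missing"
  fi
}

# Example call (the path is an assumption; use your own file):
# check_joiners ../langdata/fas/fas.unicharset
```

If either character is reported as missing, it most likely needs to be added to the text the unicharset is built from.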

>  box file generated is from left to right but it should be RTL

According to Ray that is intentional.

>  is using lstmtraining.exe the next and final step

Yes. The tesstrain.sh process only creates a 'starter traineddata' (unlike in Tesseract 3).
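
The steps after tesstrain.sh can be sketched roughly as follows, using the flags documented for Tesseract 4 LSTM training. Every path, the output name `b_nazanin`, the iteration count, and the assumption that tesstrain.sh wrote `fas.training_files.txt` into its output directory are illustrative; check them against your own setup. The script only prints the commands by default; set `RUN=` (empty) to actually execute them.

```shell
#!/usr/bin/env bash
# Sketch of fine-tuning after tesstrain.sh has produced the line data.
# All paths and names below are assumptions; adjust them to your layout.
# Dry run by default: commands are printed, not executed. Use RUN= to run them.
RUN=${RUN-echo}

BEST=../tessdata/fas.traineddata   # model downloaded from tessdata_best
TRAIN_DIR=../../tessdata/fas       # --output_dir that was passed to tesstrain.sh
OUT=./fas_finetune                 # hypothetical working directory

finetune_steps() {
  $RUN mkdir -p "$OUT"

  # 1. Extract the LSTM model from the existing traineddata.
  $RUN combine_tessdata -e "$BEST" "$OUT/fas.lstm"

  # 2. Continue training from that model on the newly generated line data.
  $RUN lstmtraining \
    --continue_from "$OUT/fas.lstm" \
    --traineddata "$BEST" \
    --train_listfile "$TRAIN_DIR/fas.training_files.txt" \
    --model_output "$OUT/b_nazanin" \
    --max_iterations 400

  # 3. Convert the training checkpoint into a usable traineddata file.
  $RUN lstmtraining --stop_training \
    --continue_from "$OUT/b_nazanin_checkpoint" \
    --traineddata "$BEST" \
    --model_output "$OUT/fas.traineddata"
}

finetune_steps
```

On Windows the same commands should apply with the .exe binaries, run from a shell such as the one already used for tesstrain.sh.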

