Error when running the set_unicharset_properties command on Windows


Jehan

Feb 23, 2018, 5:38:31 AM
to tesseract-ocr
I'm training Tesseract on Windows for a new font, and everything went well until the set_unicharset_properties step:

set_unicharset_properties -U .\unicharset -O .\unicharset2 -F "C:\Windows\Fonts\Roman.tff" --script_dir='C:\Program Files (x86)\Tesseract-OCR\training'

Loaded unicharset of size 7 from file .\unicharset
Setting unichar properties
Other case c of C is not in unicharset
Other case f of F is not in unicharset
Setting script properties
Failed to load script unicharset from:C:\Program Files (x86)\Tesseract-OCR\training/Latin.unicharset
Warning: properties incomplete for index 3 = C
Warning: properties incomplete for index 4 = 0
Warning: properties incomplete for index 5 = 1
Warning: properties incomplete for index 6 = F
Writing unicharset to file .\unicharset2

I've verified that Latin.unicharset is in the right directory.

The problem, I'm fairly sure, is at the end of this line:

Failed to load script unicharset from:C:\Program Files (x86)\Tesseract-OCR\training/Latin.unicharset

The training tool joins the directory and the file name with a "/" instead of a "\".
I looked at unicharset_training_utils.cpp: at line 166, the "/" is appended regardless of whether the command runs on Windows or Linux.

Is there a way on Windows to load Latin.unicharset despite this "/"?
If not, what is the easiest workaround?
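
One way to test the hypothesis above is to check whether the exact path string from the error message can be opened at all. The following is only a minimal diagnostic sketch in standard C++; `CanOpenForReading` is a hypothetical helper and is not part of Tesseract.

```cpp
#include <fstream>
#include <string>

// Hypothetical diagnostic helper (not part of Tesseract): returns true if the
// file at `path` can be opened for reading, false otherwise.
bool CanOpenForReading(const std::string& path) {
  std::ifstream in(path, std::ios::binary);
  return in.is_open();
}
```

Calling it with the mixed-separator string from the log, written as the C++ literal "C:\\Program Files (x86)\\Tesseract-OCR\\training/Latin.unicharset", would show whether the "/" alone prevents the file from being opened. If it returns true, the failure probably lies elsewhere, for example in the contents of the file.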

For information, my unicharset2 file looks like this:
7
NULL 0 Common 0
Joined 7 0,255,0,255,0,0,0,0,0,0 Latin 1 0 1 Joined # Joined [4a 6f 69 6e 65 64 ]a
|Broken|0|1 f 0,255,0,255,0,0,0,0,0,0 Common 2 10 2 |Broken|0|1 # Broken
C 5 0,255,0,255,0,0,0,0,0,0 Latin 3 0 3 C # C [43 ]A
0 8 0,255,0,255,0,0,0,0,0,0 Common 4 2 4 0 # 0 [30 ]0 
...
 

ShreeDevi Kumar

Feb 23, 2018, 6:04:53 AM
to tesser...@googlegroups.com
Please open this as an issue in the GitHub repo: https://github.com/tesseract-ocr/tesseract/issues

the "/" is added without taking care if the command is used on Windows or Linux.

Found a couple of places in that file where this is the case.

    // Load the unicharset for the script if available.
    string filename = script_dir + "/" +
                      unicharset->get_script_from_script_id(s) + ".unicharset";

and

    // Load the xheights for the script if available.
    string filename = script_dir + "/" +
                      unicharset.get_script_from_script_id(s) + ".xheights";
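
For illustration, here is a sketch of how the two file names above could be built with the platform's own separator. It assumes a C++17 compiler with `std::filesystem`, and `ScriptFilePath` is a hypothetical helper, not a function in the Tesseract sources. Nothing in this thread confirms that the mixed separator is what actually makes the load fail, so treat this as a portability sketch rather than a verified fix.

```cpp
#include <filesystem>
#include <string>

// Hypothetical helper (not part of Tesseract): builds
// <script_dir><separator><script_name><extension> using the separator that
// the current platform prefers ("\" on Windows, "/" elsewhere).
std::string ScriptFilePath(const std::string& script_dir,
                           const std::string& script_name,
                           const std::string& extension) {
  std::filesystem::path p(script_dir);
  // operator/= inserts a separator only if script_dir does not already
  // end with one.
  p /= script_name + extension;
  // make_preferred() rewrites any remaining "/" to the preferred separator;
  // on POSIX systems it leaves the path unchanged.
  return p.make_preferred().string();
}
```

With this helper, the two assignments above would read `ScriptFilePath(script_dir, unicharset->get_script_from_script_id(s), ".unicharset")` and `ScriptFilePath(script_dir, unicharset.get_script_from_script_id(s), ".xheights")`.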

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/aa3a131c-51fe-42ea-9fba-336ef89737cd%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Jehan

Feb 23, 2018, 8:30:18 AM
to tesseract-ocr
Again, thank you for posting it before I did :)

Anyway, do you know how I could get around this problem? Is there any trick that could help me? Maybe using Git Bash or something?

ShreeDevi Kumar

Feb 23, 2018, 8:35:38 AM
to tesser...@googlegroups.com
I use MobaXterm and WSL (bash under Windows) on Windows 10.

If you are training for the legacy Tesseract engine (not LSTM), you can use jTessBoxEditor for training.

ShreeDevi

ShreeDevi Kumar

Feb 23, 2018, 8:43:12 AM
to tesser...@googlegroups.com
I have used Git Bash for running Tesseract, but I have not tried it for training.

You can use the ppa from the link below, rather than trying to build it.
