Error when running the set_unicharset_properties command on Windows

Jehan

Feb 23, 2018, 5:38:31 AM

to tesseract-ocr

I'm training Tesseract on Windows for a new font, and everything went well until the set_unicharset_properties step:

set_unicharset_properties -U .\unicharset -O .\unicharset2 -F "C:\Windows\Fonts\Roman.tff" --script_dir='C:\Program Files (x86)\Tesseract-OCR\training'

Loaded unicharset of size 7 from file .\unicharset
Setting unichar properties
Other case c of C is not in unicharset
Other case f of F is not in unicharset
Setting script properties
Failed to load script unicharset from:C:\Program Files (x86)\Tesseract-OCR\training/Latin.unicharset
Warning: properties incomplete for index 3 = C
Warning: properties incomplete for index 4 = 0
Warning: properties incomplete for index 5 = 1
Warning: properties incomplete for index 6 = F
Writing unicharset to file .\unicharset2

I've verified that Latin.unicharset is in the right directory.

The problem, I'm fairly sure, is at the end of this line:

Failed to load script unicharset from:C:\Program Files (x86)\Tesseract-OCR\training/Latin.unicharset

The training software appends a "/" instead of a "\".
I looked at unicharset_training_utils.cpp: on line 166, the "/" is added regardless of whether the command is running on Windows or Linux.

Is there a way on Windows to load Latin.unicharset even with this "/"?

If not, what is the easiest workaround?
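
One quick check before blaming the separator: Windows file APIs generally accept "/" as well as "\", so the mixed path from the error message may open fine on its own. The following is only a minimal sketch under that assumption. CanOpen is a hypothetical helper, not part of Tesseract, and it only tests whether a file can be opened for reading with the separators exactly as given.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper (not part of Tesseract): returns true if the file
// at `path` can be opened for reading, using the separators as given.
bool CanOpen(const std::string& path) {
  std::ifstream in(path.c_str(), std::ios::binary);
  return in.is_open();
}
```

Calling CanOpen("C:\\Program Files (x86)\\Tesseract-OCR\\training/Latin.unicharset") from a small test program would show whether the mixed path is readable. If it returns true, the "/" is probably not what makes the load fail, and something else, such as file permissions or how the --script_dir value is quoted, may be worth checking.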

For reference, my unicharset2 file looks like this:

7
NULL 0 Common 0
Joined 7 0,255,0,255,0,0,0,0,0,0 Latin 1 0 1 Joined # Joined [4a 6f 69 6e 65 64 ]a
|Broken|0|1 f 0,255,0,255,0,0,0,0,0,0 Common 2 10 2 |Broken|0|1 # Broken
C 5 0,255,0,255,0,0,0,0,0,0 Latin 3 0 3 C # C [43 ]A
0 8 0,255,0,255,0,0,0,0,0,0 Common 4 2 4 0 # 0 [30 ]0
...

ShreeDevi Kumar

Feb 23, 2018, 6:04:53 AM

to tesser...@googlegroups.com

Please open this as an issue in the GitHub repo: https://github.com/tesseract-ocr/tesseract/issues

> the "/" is added without taking care if the command is used on Windows or Linux.

I found a couple of places in that file where this is the case:

// Load the unicharset for the script if available.
string filename = script_dir + "/" +
    unicharset->get_script_from_script_id(s) + ".unicharset";

and

// Load the xheights for the script if available.
string filename = script_dir + "/" +
    unicharset.get_script_from_script_id(s) + ".xheights";
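
For illustration only, here is one way such a join could be made separator-aware. This is a sketch, not the fix that went into Tesseract: JoinPath and kPathSeparator are hypothetical names, and the choice of "\" on Windows is an assumption about what a fix might prefer.

```cpp
#include <string>

// Default separator for the platform this is compiled on.
#ifdef _WIN32
const char kPathSeparator = '\\';
#else
const char kPathSeparator = '/';
#endif

// Hypothetical helper (not in Tesseract): joins a directory and a file name,
// adding `sep` only when `dir` is non-empty and does not already end with
// "/" or "\".
std::string JoinPath(const std::string& dir, const std::string& name,
                     char sep = kPathSeparator) {
  if (dir.empty()) return name;
  const char last = dir[dir.size() - 1];
  if (last == '/' || last == '\\') return dir + name;
  return dir + sep + name;
}
```

With a helper like this, the two lines above could build the file name as JoinPath(script_dir, std::string(unicharset->get_script_from_script_id(s)) + ".unicharset"), and likewise for ".xheights". Taking the separator as a parameter keeps the function easy to test on any platform.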

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/aa3a131c-51fe-42ea-9fba-336ef89737cd%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Jehan

Feb 23, 2018, 8:30:18 AM

to tesseract-ocr

Thanks again for posting it before I could :)

Anyway, do you know how I could get around this problem? Is there a trick that could help? Maybe using Git Bash or something?

ShreeDevi Kumar

Feb 23, 2018, 8:35:38 AM

to tesser...@googlegroups.com

I use MobaXterm and WSL (Bash on Windows) on Windows 10.

If you are training for the legacy Tesseract engine (not LSTM), you can use jTessBoxEditor for training.

ShreeDevi

ShreeDevi Kumar

Feb 23, 2018, 8:43:12 AM

to tesser...@googlegroups.com

I have used Git Bash for running Tesseract, but I have not tried it for training.

You can use the PPA from the link below rather than trying to build it yourself.

https://launchpad.net/~alex-p/+archive/ubuntu/tesseract-ocr/+packages
