In some recent posts, I've seen people with problems similar to mine, but no answer on how to fix them. I'm trying to train Tesseract to be more accurate with a new font. When I create the unicharset by running unicharset_extractor on my box file:
```
a 32 692 165 958 0
b 221 734 354 958 0
c 32 446 165 628 0
d 221 488 354 628 0
e 32 275 165 373 0
f 221 317 277 373 0
```
I get the following output:
```
9
NULL 0 NULL 0
Joined 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # Joined [4a 6f 69 6e 65 64 ]
|Broken|0|1 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # Broken
a 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # a [61 ]
b 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # b [62 ]
c 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # c [63 ]
d 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # d [64 ]
e 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # e [65 ]
f 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # f [66 ]
```
and when I run shapeclustering, the first few lines of its output are:
```
Bad properties for index 3, char a: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 4, char b: 0,255 0,
```
It seems that unicharset_extractor isn't properly parsing the box file. Some obvious problems with the unicharset file: the "properties" bit mask is 0, the "glyph_metrics" field appears invalid (0,255,0,255,0,0,0,0,0,0), and the "script" field is NULL when it should be either "Latin" or "Common".
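To make the checks I'm describing concrete, here is a minimal Python sketch that flags those three fields for a single unicharset line. It assumes the fields are whitespace-separated in the order character, properties, glyph_metrics, script, and that properties is written in hexadecimal; that is how I read the file, not something I have confirmed. The function name `check_unicharset_line` and the constant `UNSET_METRICS` are my own and are not part of Tesseract.

```python
# Placeholder metrics seen in my output; assumed to mean "never filled in".
UNSET_METRICS = "0,255,0,255,0,0,0,0,0,0"


def check_unicharset_line(line):
    """Return the names of suspicious fields for one unicharset entry.

    Assumes whitespace-separated fields in the order:
    character, properties, glyph_metrics, script, ...
    """
    fields = line.split()
    if len(fields) < 4:
        return []  # the count line, or an entry too short to check

    char, properties, metrics, script = fields[:4]
    if not (len(char) == 1 and char.isalpha()):
        return []  # only check single letters; skips NULL, Joined, |Broken|0|1

    problems = []
    # Assumption: properties is a hex bit mask, and a letter should set at least one bit.
    if int(properties, 16) == 0:
        problems.append("properties")
    if metrics == UNSET_METRICS:
        problems.append("glyph_metrics")
    if script == "NULL":
        problems.append("script")
    return problems
```

Run against the `a` line from my output, this reports all three fields (`properties`, `glyph_metrics`, `script`), and the same holds for `b` through `f`.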
Does anyone have an idea why this is happening?
O/S: Ubuntu 15.10
Tesseract Ver: 3.04
Posts with no simple resolution: