Training Tesseract: unicharset extractor producing "Bad properties"

guspo...@gmail.com

Nov 30, 2015, 5:42:23 PM

to tesseract-ocr

Several recent posts describe a problem similar to mine, but none has an answer on how to fix it. I'm trying to train Tesseract to be more accurate with a new font. I create the unicharset by running unicharset_extractor on my box file:

```
a 32 692 165 958 0
b 221 734 354 958 0
c 32 446 165 628 0
d 221 488 354 628 0
e 32 275 165 373 0
f 221 317 277 373 0
```

I get the following output:

```
NULL 0 NULL 0
Joined 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # Joined [4a 6f 69 6e 65 64 ]
|Broken|0|1 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # Broken
a 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # a [61 ]
b 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # b [62 ]
c 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # c [63 ]
d 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # d [64 ]
e 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # e [65 ]
f 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # f [66 ]
```

When I run shapeclustering, the first few lines of its output are:

```
Bad properties for index 3, char a: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 4, char b: 0,255 0,
```

It seems that unicharset_extractor isn't parsing the box file properly. Some obvious problems with the unicharset file: the "properties" bit mask is 0, the "glyph_metrics" field appears invalid (0,255,0,255,0,0,0,0,0,0), and the "script" field is NULL when it should be either "Latin" or "Common".
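To make those three symptoms easy to spot in a longer unicharset file, here is a minimal Python sketch that flags them line by line. The field order (character, properties, glyph_metrics, script) is assumed from the output quoted above, and `check_unicharset_line` and `UNSET_METRICS` are hypothetical names, not part of Tesseract.

```python
# Sketch only: the field order (char, properties, glyph_metrics, script) is
# assumed from the output quoted above, not taken from Tesseract's source.
UNSET_METRICS = "0,255,0,255,0,0,0,0,0,0"


def check_unicharset_line(line):
    """Return (char, problems) for a full entry, or None for short lines."""
    fields = line.split()
    # Short entries such as "NULL 0 NULL 0" have no glyph_metrics field.
    if len(fields) < 4 or "," not in fields[2]:
        return None
    char, properties = fields[0], int(fields[1])
    metrics, script = fields[2], fields[3]
    problems = []
    if properties == 0:
        problems.append("properties bit mask is 0")
    if metrics == UNSET_METRICS:
        problems.append("glyph_metrics look unset")
    if script == "NULL":
        problems.append("script is NULL")
    return char, problems


print(check_unicharset_line("a 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # a [61 ]"))
```

Note that the special entries (Joined, Broken) would be flagged too, so a report from this check is only a starting point for inspection.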

Does anyone have an idea why this is happening?

O/S: Ubuntu 15.10

Tesseract Ver: 3.04

Related report with no simple resolution:

https://github.com/tesseract-ocr/tesseract/issues/139

Meltem Çetiner

Mar 9, 2016, 7:12:41 AM

to tesseract-ocr

Hi, I'm trying to train as well and I have the same problem. I got this result:

"P 5 0,255,0,255,0,32767,0,32767,0,32767 NULL 54 0 0 # # P [50 ]A

A 5 0,255,0,255,0,32767,0,32767,0,32767 NULL 38 0 0 # # A [41 ]A

S 5 0,255,0,255,0,32767,0,32767,0,32767 NULL 53 0 0 # # S [53 ]A"

My problem is with the glyph_metrics and script fields. Does anyone have an idea?
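For anyone comparing the two outputs in this thread, the properties value can be decoded to check whether it at least is set correctly. This is a rough sketch that assumes the bit values given in the Tesseract training documentation (1 alpha, 2 lower, 4 upper, 8 digit, 16 punctuation); `decode_properties` is a hypothetical helper name.

```python
# Assumed bit values: 1 alpha, 2 lower, 4 upper, 8 digit, 16 punctuation.
PROPERTY_BITS = [
    (1, "alpha"),
    (2, "lower"),
    (4, "upper"),
    (8, "digit"),
    (16, "punctuation"),
]


def decode_properties(mask):
    """List the property names whose bits are set in the mask."""
    return [name for bit, name in PROPERTY_BITS if mask & bit]


print(decode_properties(5))  # the value shown for P, A and S above
```

Under that assumption, 5 decodes to alpha plus upper, which is what one would expect for P, A and S, so in this output only the glyph_metrics and script fields look suspect.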
