Fwd: tesseract-ocr - Google Groups: Message Pending [{INLZp6-bu9eaHioCaWcwAW2_BwmznK2y0}]

zdenko podobny

Sep 22, 2015, 2:49:54 AM
to tesseract-ocr

---------- Forwarded message ----------
From: Juan Pablo Aveggio <jpav...@gmail.com>
To: tesseract-ocr <tesser...@googlegroups.com>
Cc: 
Date: Mon, 21 Sep 2015 16:17:45 -0700 (PDT)
Subject: Train tesseract 3.04 for recognition of six patterns no existents in UTF-8
Hello
I'm trying to train tesseract to recognize patterns printed on tickets. Each ticket has a unique pattern in a predetermined place, and that pattern determines the ticket's value. Since these patterns do not correspond to any Unicode characters, I assigned them the characters 'a' to 'f'.
I created a .tif image containing the six patterns and the corresponding box file:
a 32 692 165 958 0
b 221 734 354 958 0
c 32 446 165 628 0
d 221 488 354 628 0
e 32 275 165 373 0
f 221 317 277 373 0
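
For reference, each box file line has the form `<symbol> <left> <bottom> <right> <top> <page>`, with the origin at the bottom-left corner of the image. The following is a minimal sketch of a sanity check for such a file. The names `Box`, `parse_box_line` and `check_box` are hypothetical helpers, and the page size of 400 x 1000 pixels is an assumption, since the post does not give the real dimensions of the image.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    """Parse one '<symbol> <left> <bottom> <right> <top> <page>' line."""
    parts = line.split()
    if len(parts) != 6:
        raise ValueError(f"expected 6 fields, got {len(parts)}: {line!r}")
    left, bottom, right, top, page = (int(p) for p in parts[1:])
    return Box(parts[0], left, bottom, right, top, page)


def check_box(box: Box, width: int, height: int) -> List[str]:
    """Return a list of geometry problems (empty if the box looks sane)."""
    problems = []
    if not 0 <= box.left < box.right <= width:
        problems.append(f"bad x range {box.left}..{box.right}")
    if not 0 <= box.bottom < box.top <= height:
        problems.append(f"bad y range {box.bottom}..{box.top}")
    return problems


# Box file contents as given in the post.
BOX_TEXT = """\
a 32 692 165 958 0
b 221 734 354 958 0
c 32 446 165 628 0
d 221 488 354 628 0
e 32 275 165 373 0
f 221 317 277 373 0
"""

# Hypothetical page size; replace with the real size of bil.pat.exp0.tif.
PAGE_WIDTH, PAGE_HEIGHT = 400, 1000

boxes = [parse_box_line(ln) for ln in BOX_TEXT.splitlines() if ln.strip()]
for box in boxes:
    for problem in check_box(box, PAGE_WIDTH, PAGE_HEIGHT):
        print(f"{box.symbol}: {problem}")
```

Under the assumed page size the script prints nothing, which only shows that the six boxes are internally consistent; it says nothing about whether they line up with the patterns in the image.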

Then I ran:
tesseract bil.pat.exp0.tif bil.pat.exp0 box.train
and the output was:
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Page 1
APPLY_BOXES:
   Boxes read from boxfile:       6
APPLY_BOXES: Unlabelled word at :Bounding box=(-958,221)->(-734,277)
APPLY_BOXES: Unlabelled word at :Bounding box=(-628,221)->(-488,277)
APPLY_BOXES: Unlabelled word at :Bounding box=(-958,32)->(-734,88)
APPLY_BOXES: Unlabelled word at :Bounding box=(-628,32)->(-488,88)
APPLY_BOXES: Unlabelled word at :Bounding box=(-373,32)->(-317,88)
   Found 6 good blobs.
   5 remaining unlabelled words deleted.
Generated training data for 6 words
The negative coordinates cannot be right. Despite this, I tried to keep going.
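
To make the suspicious entries easier to spot in a longer log, the bounding boxes can be pulled out mechanically. This is only a sketch: the regular expression is written against the exact "Bounding box=(x1,y1)->(x2,y2)" wording seen in the output above, and `negative_boxes` is a hypothetical helper name, not part of tesseract.

```python
import re

# Matches the "Bounding box=(x1,y1)->(x2,y2)" fragment of an APPLY_BOXES line.
LOG_BOX = re.compile(r"Bounding box=\((-?\d+),(-?\d+)\)->\((-?\d+),(-?\d+)\)")


def negative_boxes(log: str):
    """Return every (x1, y1, x2, y2) tuple that has a negative coordinate."""
    found = []
    for match in LOG_BOX.finditer(log):
        coords = tuple(int(g) for g in match.groups())
        if min(coords) < 0:
            found.append(coords)
    return found


# The "Unlabelled word" lines from the box.train run in the post.
LOG = """\
APPLY_BOXES: Unlabelled word at :Bounding box=(-958,221)->(-734,277)
APPLY_BOXES: Unlabelled word at :Bounding box=(-628,221)->(-488,277)
APPLY_BOXES: Unlabelled word at :Bounding box=(-958,32)->(-734,88)
APPLY_BOXES: Unlabelled word at :Bounding box=(-628,32)->(-488,88)
APPLY_BOXES: Unlabelled word at :Bounding box=(-373,32)->(-317,88)
"""

for coords in negative_boxes(LOG):
    print(coords)
```

All five unlabelled words are reported, and in each case it is the x values that are negative while the y values stay positive.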
My font_properties is:
bil.pat.box 0 0 1 0 0
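
As far as the Tesseract 3.x training instructions describe it, each font_properties line has the form `<fontname> <italic> <bold> <fixed> <serif> <fraktur>`, and the font name is the middle component of a training file named `lang.fontname.expN`. Assuming that reading is correct, a rough check of whether the file above has an entry for the font used in `bil.pat.exp0.tif` could look like this. Both function names are hypothetical helpers.

```python
def font_name_from_training_file(filename: str) -> str:
    """Extract 'fontname' from a name shaped like lang.fontname.expN.ext."""
    parts = filename.split(".")
    if len(parts) < 4:
        raise ValueError(f"not in lang.fontname.expN.ext form: {filename!r}")
    return parts[1]


def parse_font_properties(text: str) -> dict:
    """Map each font name to its (italic, bold, fixed, serif, fraktur) flags."""
    entries = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if len(fields) != 6:
            raise ValueError(f"expected 6 fields, got {len(fields)}: {line!r}")
        entries[fields[0]] = tuple(int(f) for f in fields[1:])
    return entries


# The font_properties contents as given in the post.
FONT_PROPERTIES = "bil.pat.box 0 0 1 0 0\n"

entries = parse_font_properties(FONT_PROPERTIES)
expected = font_name_from_training_file("bil.pat.exp0.tif")
missing = expected not in entries

if missing:
    print(f"no font_properties entry for {expected!r}; found {sorted(entries)}")
```

With the file as posted, the expected name is `pat` while the only entry is `bil.pat.box`, so the check reports a mismatch. Whether this explains the later warnings is not established by the post.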
bil.words_list is:
a
b
c
d
e
f

Then I ran:
$ unicharset_extractor bil.pat.exp0.box
Extracting unicharset from bil.pat.exp0.box
Wrote unicharset file ./unicharset.
but the unicharset file contains:
9
NULL 0 NULL 0
Joined 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0     # Joined [4a 6f 69 6e 65 64 ]
|Broken|0|1 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0        # Broken
a 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # a [61 ]
b 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # b [62 ]
c 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # c [63 ]
d 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # d [64 ]
e 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # e [65 ]
f 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # f [66 ]
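
The file can be checked mechanically: the first line declares the number of entries, and every following line is one entry. The sketch below confirms the count and lists the entries whose third field is still `0,255,0,255,0,0,0,0,0,0`. The name `parse_unicharset` is a hypothetical helper, and treating everything after `#` as a comment is an assumption that holds for this file but would break for a unicharset containing `#` as a character.

```python
PLACEHOLDER_METRICS = "0,255,0,255,0,0,0,0,0,0"


def parse_unicharset(text: str):
    """Return (declared_count, entries); each entry is a list of fields."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    declared = int(lines[0])
    entries = []
    for ln in lines[1:]:
        body = ln.split("#", 1)[0]  # assumption: '#' only starts a comment
        entries.append(body.split())
    return declared, entries


# The unicharset contents as given in the post.
UNICHARSET = """\
9
NULL 0 NULL 0
Joined 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0     # Joined [4a 6f 69 6e 65 64 ]
|Broken|0|1 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0        # Broken
a 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # a [61 ]
b 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # b [62 ]
c 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # c [63 ]
d 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # d [64 ]
e 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # e [65 ]
f 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # f [66 ]
"""

declared, entries = parse_unicharset(UNICHARSET)
placeholders = [
    fields[0]
    for fields in entries
    if len(fields) >= 3 and fields[2] == PLACEHOLDER_METRICS
]

print(f"declared {declared}, found {len(entries)} entries")
print("entries with placeholder metrics:", placeholders)
```

The declared count of 9 matches the number of entries, so the file itself is well-formed. Every entry except NULL carries the same placeholder values.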
Then I ran:
$ mftraining -F font_properties -U unicharset -O bil.unicharset bil.pat.exp0.tr  
Read shape table shapetable of 0 shapes
Reading bil.pat.exp0.tr ...
Bad properties for index 3, char a: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 4, char b: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 5, char c: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 6, char d: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 7, char e: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 8, char f: 0,255 0,255 0,0 0,0 0,0
Warning: no protos/configs for Joined in CreateIntTemplates()
Warning: no protos/configs for |Broken|0|1 in CreateIntTemplates()
Warning: no protos/configs for a in CreateIntTemplates()
Warning: no protos/configs for b in CreateIntTemplates()
Warning: no protos/configs for c in CreateIntTemplates()
Warning: no protos/configs for d in CreateIntTemplates()
Warning: no protos/configs for e in CreateIntTemplates()
Warning: no protos/configs for f in CreateIntTemplates()
Done!
What am I doing wrong?
I am on Debian.
tesseract 3.04.00
 leptonica-1.72
  libgif 4.1.6(?) : libjpeg 6b (libjpeg-turbo 1.4.0) : libpng 1.2.50 : libtiff 4.0.5 : zlib 1.2.8 : libwebp 0.4.3 : libopenjp2 2.1.0
Thank you very much in advance!




