Issue 698 in tesseract-ocr: "FAILURE! Couldn't find a matching blob" on a perfectly created .box file and clear image

tesser...@googlecode.com

Apr 30, 2012, 6:10:26 AM
to tesserac...@googlegroups.com
Status: New
Owner: ----

New issue 698 by wolverin...@gmail.com: "FAILURE! Couldn't find a matching
blob" on a perfectly created .box file and clear image
http://code.google.com/p/tesseract-ocr/issues/detail?id=698

What steps will reproduce the problem?
1. In the attached files below, ResegmentCharBox returns false inside the
ApplyBoxes function, which means it failed to find a major overlap between
the given box and the words it found in earlier steps.
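
To make the "major overlap" idea concrete, here is a minimal sketch of such
a check. It is not Tesseract's actual ResegmentCharBox code: the Rect type,
the function names and the 0.5 threshold are all assumptions made for
illustration. Coordinates follow the box-file convention (origin at the
bottom-left, so bottom < top).

```python
from typing import NamedTuple


class Rect(NamedTuple):
    left: int
    bottom: int
    right: int
    top: int


def area(r: Rect) -> int:
    """Area of a rectangle; degenerate or inverted rectangles count as 0."""
    return max(0, r.right - r.left) * max(0, r.top - r.bottom)


def overlap_fraction(box: Rect, word: Rect) -> float:
    """Fraction of `box` covered by `word` (0.0 if they do not intersect)."""
    inter = Rect(
        max(box.left, word.left),
        max(box.bottom, word.bottom),
        min(box.right, word.right),
        min(box.top, word.top),
    )
    box_area = area(box)
    return area(inter) / box_area if box_area else 0.0


def has_major_overlap(box: Rect, word: Rect, threshold: float = 0.5) -> bool:
    # The 0.5 threshold is an assumption for illustration only.
    return overlap_fraction(box, word) >= threshold
```

In this sketch a box that lies outside every detected word, or one whose
bottom and top values are inverted, yields a fraction of 0.0 and is reported
as unmatched.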

What is the expected output? What do you see instead?
1. I expected to see no "FAILURE! Couldn't find a matching blob" messages,
since the box file and the input image are both produced by a separate
program and have no errors or defects.

Please use labels and text to provide additional information.
I just wonder why Tesseract could not find some of the words, so that good
boxes were rejected.

The first two files are generated by my program and passed to Tesseract for
training (this is just one page from a 600-page TIFF and box file). The
third file contains the boxes that Tesseract rejected (just rename the PNG
file to "notfound.png" and open it with a box editor).

Attachments:
fas.generatedBox.exp00.box 63.8 KB
fas.generatedBox.exp00.png 227 KB
notfound.box 1.4 KB

tesser...@googlecode.com

Apr 30, 2012, 6:39:23 AM
to tesserac...@googlegroups.com

Comment #1 on issue 698 by wolverin...@gmail.com: "FAILURE! Couldn't find a
matching blob" on a perfectly created .box file and clear image
http://code.google.com/p/tesseract-ocr/issues/detail?id=698

I'm using Tesseract 3.01 on Windows 7 with VC++ 2010.

tesser...@googlecode.com

May 1, 2012, 2:40:13 AM
to tesserac...@googlegroups.com

Comment #2 on issue 698 by wolverin...@gmail.com: "FAILURE! Couldn't find a
matching blob" on a perfectly created .box file and clear image
http://code.google.com/p/tesseract-ocr/issues/detail?id=698

Actually, I found out that if you pass only this page, Tesseract accepts
all the boxes and says:

Page 0
APPLY_BOXES:
[...]
Boxes failed resegmentation: 0
TRAINING ...

But if I give it the 533-page TIFF file, resegmentation fails for the boxes
mentioned previously. So is the problem with the TIFF file, or something
else? [Since the original TIFF is over 450 MB, I can't post it here.]


tesser...@googlecode.com

May 1, 2012, 12:37:32 PM
to tesserac...@googlegroups.com

Comment #3 on issue 698 by webdatak...@gmail.com: "FAILURE! Couldn't find a
matching blob" on a perfectly created .box file and clear image
http://code.google.com/p/tesseract-ocr/issues/detail?id=698

I have the same problem (Issue 694).
The image is clear enough and the bounding boxes have been corrected,
but Tesseract says "FAILURE! Couldn't find a matching blob".
So far nobody from the project has commented on this problem :(

tesser...@googlecode.com

May 2, 2012, 2:36:53 AM
to tesserac...@googlegroups.com

Comment #4 on issue 698 by wolverin...@gmail.com: "FAILURE! Couldn't find a
matching blob" on a perfectly created .box file and clear image
http://code.google.com/p/tesseract-ocr/issues/detail?id=698

I found a workaround for this problem, but I don't know its side effects.
Add the following lines to the end of the box.train config file:

textord_noise_rejwords F
textord_noise_rejrows F

These options reduce the number of "FAILURE!..." messages significantly.

Details of what happens:
When "textord_noise_rejwords" is set to false, Tesseract does NOT filter
noise from words, because it skips the call to
clean_noise_from_words(row_it.data ()) in the cleanup_blocks(...) function
in tordmain.cpp. This implies that those boxes were being treated as noise.
I think that when a ligature fills only a small portion of a box's area, it
is considered noise and thus rejected.

I'm still working on it :)
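
For anyone who would rather not edit the stock config in place, the
workaround can be scripted. This is only a sketch: the two parameter names
come from the comment above, while the function name and the file paths in
the usage note are hypothetical. It assumes the config file is plain text
with one "name value" pair per line.

```python
from pathlib import Path

# Parameter names are taken from the comment above.
NOISE_PARAMS = {
    "textord_noise_rejwords": "F",
    "textord_noise_rejrows": "F",
}


def write_training_config(source: Path, target: Path, params: dict) -> None:
    """Copy a config file and append any 'name value' lines it lacks."""
    lines = source.read_text().splitlines()
    # Names already set in the source file are left untouched.
    present = {line.split()[0] for line in lines if line.split()}
    for name, value in params.items():
        if name not in present:
            lines.append(f"{name} {value}")
    target.write_text("\n".join(lines) + "\n")
```

A call such as write_training_config(Path("box.train"),
Path("box.train.nonoise"), NOISE_PARAMS) would leave the original file
unchanged and write the extended copy next to it (both paths are
placeholders).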


tesser...@googlecode.com

May 3, 2012, 11:14:46 AM
to tesserac...@googlegroups.com

Comment #5 on issue 698 by webdatak...@gmail.com: "FAILURE! Couldn't find a
matching blob" on a perfectly created .box file and clear image
http://code.google.com/p/tesseract-ocr/issues/detail?id=698

I added these extra options to box.train.
According to my training result, I still get the same output as without
these options. :(

tesser...@googlecode.com

May 8, 2012, 12:27:41 AM
to tesserac...@googlegroups.com

Comment #6 on issue 698 by wolverin...@gmail.com: "FAILURE! Couldn't find a
matching blob" on a perfectly created .box file and clear image
http://code.google.com/p/tesseract-ocr/issues/detail?id=698

I've got about 1% better word precision with these options. Not as good as
expected, but it's something. In my training TIFF image every word is
repeated at least 16 times.


tesser...@googlecode.com

Jul 24, 2012, 8:00:23 AM
to tesserac...@googlegroups.com

Comment #7 on issue 698 by zde...@gmail.com: "FAILURE! Couldn't find a
matching blob" on a perfectly created .box file and clear image
http://code.google.com/p/tesseract-ocr/issues/detail?id=698

@wolverine.shy: can you try the current svn code (3.02 r732)? It works
for me (3.01 behaves as you reported):

tesseract fas.generatedBox.exp00.png fas.generatedBox.exp00 nobatch
box.train box.train
Tesseract Open Source OCR Engine v3.02 with Leptonica
row xheight=15, but median xheight = 20.5
row xheight=29, but median xheight = 20.5
row xheight=29, but median xheight = 20.5
row xheight=15.3333, but median xheight = 20.5
row xheight=15, but median xheight = 20.5
row xheight=17, but median xheight = 20.5
row xheight=15, but median xheight = 20.5
row xheight=6.66667, but median xheight = 20.5
row xheight=23.1842, but median xheight = 20.5
row xheight=25, but median xheight = 20.5
row xheight=23.1842, but median xheight = 20.5
APPLY_BOXES:
Boxes read from boxfile: 2525
Found 2525 good blobs.
TRAINING ... Font name = generatedBox
Generated training data for 895 words

Tested on Windows XP.

tesser...@googlecode.com

Jul 27, 2012, 4:56:35 PM
to tesserac...@googlegroups.com

Comment #8 on issue 698 by nine.ele...@gmail.com: "FAILURE! Couldn't find a
matching blob" on a perfectly created .box file and clear image
http://code.google.com/p/tesseract-ocr/issues/detail?id=698

What is the significance of ...
"row xheight=47, but median xheight = 18.4848"
It prints similar data for about 20 lines/warnings ... why?

Also, I notice a trend in the ratio of 'boxes read' to 'data for words':
2525 to 895 ... a ratio of about one third.... Is there a way of checking
or seeing the rejected data?
And then is there a way of improving the percentage of 'training data'?

kind regards

Richard

tesser...@googlecode.com

Jul 28, 2012, 4:40:12 AM
to tesserac...@googlegroups.com

Comment #9 on issue 698 by zde...@gmail.com: "FAILURE! Couldn't find a
matching blob" on a perfectly created .box file and clear image
http://code.google.com/p/tesseract-ocr/issues/detail?id=698

@Richard: "row xheight=47, but median xheight = 18.4848" - this message
says that the row deviates significantly from the median xheight (e.g.
there could be letters with a different font/size...)

"rejected data" - boxes are created for "letters", so there should be a
difference between the number of letters and the number of words (i.e.
there is no rejected data from this point of view).

Please use the forum for discussion and not the issue tracker.
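
To see how many boxes (letters) a box file holds per page, and compare that
with the "Boxes read from boxfile" figure in the log, a small script along
these lines can help. It is a sketch that assumes the usual Tesseract 3.x
box-file layout of one "symbol left bottom right top page" entry per line;
the Box type and function names are made up for this example.

```python
from typing import Dict, Iterable, NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    """Parse one 'symbol left bottom right top page' line."""
    # Split from the right so the symbol is kept intact whatever it contains.
    symbol, left, bottom, right, top, page = line.strip().rsplit(" ", 5)
    return Box(symbol, int(left), int(bottom), int(right), int(top), int(page))


def boxes_per_page(lines: Iterable[str]) -> Dict[int, int]:
    """Count box entries per page number, skipping blank lines."""
    counts: Dict[int, int] = {}
    for line in lines:
        if line.strip():
            box = parse_box_line(line)
            counts[box.page] = counts.get(box.page, 0) + 1
    return counts
```

Since each box is one letter and each word contains several letters, a
letter count well above the "training data for N words" figure is expected
and does not by itself indicate rejected boxes.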


tesser...@googlecode.com

Sep 26, 2012, 3:19:18 PM
to tesserac...@googlegroups.com
Updates:
Status: WorksForMe

Comment #10 on issue 698 by zde...@gmail.com: "FAILURE! Couldn't find a
matching blob" on a perfectly created .box file and clear image
http://code.google.com/p/tesseract-ocr/issues/detail?id=698

Works for me in tesseract 3.02

tesser...@googlecode.com

Dec 18, 2012, 2:59:19 AM
to tesserac...@googlegroups.com

Comment #11 on issue 698 by szhai...@gmail.com: "FAILURE! Couldn't find a
matching blob" on a perfectly created .box file and clear image
http://code.google.com/p/tesseract-ocr/issues/detail?id=698

$ tesseract chi_sim.simhei.exp0.tif chi_sim.simhei.exp0 box.train box.train

Tesseract Open Source OCR Engine v3.02.02 with Leptonica
FAIL!
APPLY_BOXES: boxfile line 0/a ((20,580),(33,551)): FAILURE! Couldn't find a
matching blob
FAIL!
APPLY_BOXES: boxfile line 1/b ((33,580),(46,551)): FAILURE! Couldn't find a
matching blob
FAIL!
APPLY_BOXES: boxfile line 2/c ((46,580),(59,551)): FAILURE! Couldn't find a
matching blob
FAIL!
APPLY_BOXES: boxfile line 3/d ((59,580),(72,551)): FAILURE! Couldn't find a
matching blob
FAIL!
APPLY_BOXES: boxfile line 4/e ((72,580),(85,551)): FAILURE! Couldn't find a
matching blob
FAIL!
APPLY_BOXES: boxfile line 5/f ((85,580),(98,551)): FAILURE! Couldn't find a
matching blob
FAIL!
APPLY_BOXES: boxfile line 6/g ((98,580),(111,551)): FAILURE! Couldn't find
a matching blob
FAIL!
APPLY_BOXES: boxfile line 7/h ((111,580),(124,551)): FAILURE! Couldn't find
a matching blob
FAIL!
APPLY_BOXES: boxfile line 8/i ((124,580),(137,551)): FAILURE! Couldn't find
a matching blob
FAIL!
APPLY_BOXES: boxfile line 9/j ((137,580),(150,551)): FAILURE! Couldn't find
a matching blob
FAIL!
APPLY_BOXES: boxfile line 10/k ((150,580),(163,551)): FAILURE! Couldn't
find a matching blob
FAIL!
APPLY_BOXES: boxfile line 11/l ((163,580),(176,551)): FAILURE! Couldn't
find a matching blob
FAIL!
APPLY_BOXES: boxfile line 12/m ((176,580),(189,551)): FAILURE! Couldn't
find a matching blob
FAIL!
APPLY_BOXES: boxfile line 13/n ((189,580),(202,551)): FAILURE! Couldn't
find a matching blob
FAIL!
APPLY_BOXES: boxfile line 14/o ((202,580),(215,551)): FAILURE! Couldn't
find a matching blob
FAIL!
APPLY_BOXES: boxfile line 15/p ((215,580),(228,551)): FAILURE! Couldn't
find a matching blob
FAIL!
APPLY_BOXES: boxfile line 16/q ((228,580),(241,551)): FAILURE! Couldn't
find a matching blob
FAIL!
APPLY_BOXES: boxfile line 17/r ((241,580),(254,551)): FAILURE! Couldn't
find a matching blob
FAIL!
APPLY_BOXES: boxfile line 18/s ((254,580),(267,551)): FAILURE! Couldn't
find a matching blob
FAIL!
APPLY_BOXES: boxfile line 19/t ((267,580),(280,551)): FAILURE! Couldn't
find a matching blob
FAIL!
APPLY_BOXES: boxfile line 20/u ((280,580),(293,551)): FAILURE! Couldn't
find a matching blob
FAIL!
APPLY_BOXES: boxfile line 21/v ((293,580),(306,551)): FAILURE! Couldn't
find a matching blob
FAIL!
APPLY_BOXES: boxfile line 22/w ((306,580),(319,551)): FAILURE! Couldn't
find a matching blob
FAIL!
APPLY_BOXES: boxfile line 23/x ((319,580),(332,551)): FAILURE! Couldn't
find a matching blob
FAIL!
APPLY_BOXES: boxfile line 24/y ((332,580),(345,551)): FAILURE! Couldn't
find a matching blob
FAIL!
APPLY_BOXES: boxfile line 25/z ((345,580),(358,551)): FAILURE! Couldn't
find a matching blob
FAIL!
APPLY_BOXES: boxfile line 26/测 ((371,580),(396,551)): FAILURE! Couldn't
find a matching blob
FAIL!
APPLY_BOXES: boxfile line 27/试 ((396,580),(421,551)): FAILURE! Couldn't
find a matching blob
FAIL!
APPLY_BOXES: boxfile line 28/文 ((421,580),(446,551)): FAILURE! Couldn't
find a matching blob
FAIL!
APPLY_BOXES: boxfile line 29/字 ((446,580),(471,551)): FAILURE! Couldn't
find a matching blob
APPLY_BOXES:
Boxes read from boxfile: 30
Boxes failed resegmentation: 30
APPLY_BOXES: Unlabelled word at :Bounding box=(21,555)->(122,575)
APPLY_BOXES: Unlabelled word at :Bounding box=(129,558)->(131,575)
APPLY_BOXES: Unlabelled word at :Bounding box=(139,555)->(161,575)
APPLY_BOXES: Unlabelled word at :Bounding box=(168,558)->(170,575)
APPLY_BOXES: Unlabelled word at :Bounding box=(176,555)->(356,573)
APPLY_BOXES: Unlabelled word at :Bounding box=(372,555)->(470,578)
Found 0 good blobs.
6 remaining unlabelled words deleted.
Generated training data for 0 words


tesser...@googlecode.com

Dec 18, 2012, 3:04:01 AM
to tesserac...@googlegroups.com

Comment #12 on issue 698 by szhai...@gmail.com: "FAILURE! Couldn't find a
matching blob" on a perfectly created .box file and clear image
http://code.google.com/p/tesseract-ocr/issues/detail?id=698

It doesn't work for me in 3.02.

When I run "tesseract chi_sim.simhei.exp0.tif chi_sim.simhei.exp0 nobatch
box.train"

it cries:

Tesseract Open Source OCR Engine v3.02.02 with Leptonica
FAIL!
APPLY_BOXES: boxfile line 0/a ((20,580),(33,551)): FAILURE! Couldn't find a
matching blob
FAIL!
APPLY_BOXES: boxfile line 1/b ((33,580),(46,551)): FAILURE! Couldn't find a
matching blob
...
...
...
APPLY_BOXES: boxfile line 28/文 ((421,580),(446,551)): FAILURE! Couldn't
find a matching blob
FAIL!
APPLY_BOXES: boxfile line 29/字 ((446,580),(471,551)): FAILURE! Couldn't
find a matching blob
APPLY_BOXES:
Boxes read from boxfile: 30
Boxes failed resegmentation: 30
APPLY_BOXES: Unlabelled word at :Bounding box=(21,555)->(122,575)
APPLY_BOXES: Unlabelled word at :Bounding box=(129,558)->(131,575)
APPLY_BOXES: Unlabelled word at :Bounding box=(139,555)->(161,575)
APPLY_BOXES: Unlabelled word at :Bounding box=(168,558)->(170,575)
APPLY_BOXES: Unlabelled word at :Bounding box=(176,555)->(356,573)
APPLY_BOXES: Unlabelled word at :Bounding box=(372,555)->(470,578)
Found 0 good blobs.
6 remaining unlabelled words deleted.
Generated training data for 0 words

And the following files are my data files for training:



Attachments:
chi_sim.simhei.exp0.tif 469 KB
chi_sim.simhei.exp0.box 595 bytes

tesser...@googlecode.com

Dec 18, 2012, 3:05:52 AM
to tesserac...@googlegroups.com

Comment #13 on issue 698 by szhai...@gmail.com: "FAILURE! Couldn't find a
matching blob" on a perfectly created .box file and clear image
http://code.google.com/p/tesseract-ocr/issues/detail?id=698

By the way, I am using the Python script from
https://github.com/BaltoRouberol/TesseractTrainer.

Have you tested the script there?

tesser...@googlecode.com

Dec 27, 2012, 12:11:43 PM
to tesserac...@googlegroups.com

Comment #14 on issue 698 by roubero...@gmail.com: "FAILURE! Couldn't find a
matching blob" on a perfectly created .box file and clear image
http://code.google.com/p/tesseract-ocr/issues/detail?id=698

TesseractTrainer developer here. I'm currently trying to add Tesseract
3.02 training support to TesseractTrainer. I'm facing the exact same bug
when using Tesseract 3.02.

I'm getting the usual "FAILURE! Couldn't find a matching blob" messages.
Tesseract Open Source OCR Engine v3.02 with Leptonica
Page 0
FAIL!
APPLY_BOXES: boxfile line 294/T ((20,580),(45,522)): FAILURE! Couldn't find
a matching blob
FAIL!
APPLY_BOXES: boxfile line 294/h ((45,580),(68,522)): FAILURE! Couldn't find
a matching blob
FAIL!
APPLY_BOXES: boxfile line 294/e ((68,580),(91,522)): FAILURE! Couldn't find
a matching blob
FAIL!
...
APPLY_BOXES:
Boxes read from boxfile: 294
Boxes failed resegmentation: 294
APPLY_BOXES: Unlabelled word at :Bounding box=(21,531)->(89,568)
APPLY_BOXES: Unlabelled word at :Bounding box=(104,531)->(261,565)
APPLY_BOXES: Unlabelled word at :Bounding box=(276,531)->(307,568)
APPLY_BOXES: Unlabelled word at :Bounding box=(319,531)->(380,568)
...

I then tried to train Tesseract 3.01 on the same TIFF, box file, font, font
size, etc.,
which works like a charm.

Is there a major sensitivity difference between 3.01 and 3.02 which could
explain such massive failures?

I use the following data files for training (tif and boxfile).

--
Cheers
Balthazar

Attachments:
test.helveticanarrow.exp0.box 25.2 KB
test.helveticanarrow.exp0.tif 938 KB

tesser...@googlecode.com

Dec 27, 2012, 12:22:46 PM
to tesserac...@googlegroups.com

Comment #15 on issue 698 by roubero...@gmail.com: "FAILURE! Couldn't find a
matching blob" on a perfectly created .box file and clear image
http://code.google.com/p/tesseract-ocr/issues/detail?id=698

By the way, I'm getting the same output when using TesseractTrainer and
when using the following command:

$ tesseract test.helveticanarrow.exp0.tif test.helveticanarrow.exp0 nobatch
box.train

I just thought it was worth mentioning, as the error could have come from
my side too.

--
Cheers
Balthazar

tesser...@googlecode.com

Dec 28, 2012, 8:17:35 AM
to tesserac...@googlegroups.com

Comment #16 on issue 698 by zde...@gmail.com: "FAILURE! Couldn't find a
matching blob" on a perfectly created .box file and clear image
http://code.google.com/p/tesseract-ocr/issues/detail?id=698

Balthazar, you are right - the error is coming from your side ;-). The box
file created by your tool causes problems for Tesseract.

Please find attached a box file that produces no errors. I created it as
suggested by the wiki:
tesseract test.helveticanarrow.exp0.tif test.helveticanarrow.exp0x
makebox

The output of that causes 2 errors:
APPLY_BOXES: boxfile line 221/t ((184,469),(189,486)): FAILURE! Couldn't
find a matching blob
APPLY_BOXES: boxfile line 399/t ((657,411),(662,428)): FAILURE! Couldn't
find a matching blob

I did a quick fix by joining lines (counting from 0) 220+221 and 398+399.
Reason: "r" and "t" are too close, so after binarization of your image they
are joined (see the attached screenshot for a better explanation). Maybe you
can increase the space between symbols (or a DPI of 72 is too low).

I hope this helps you to improve your tool.
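
The "joining lines" quick fix can also be done programmatically. The sketch
below merges two adjacent box-file entries into one entry whose label is
both symbols and whose rectangle is the union of the two. It assumes the
"symbol left bottom right top page" line layout; the function name is
invented for this example, and the "r" coordinates in the usage note are
hypothetical (only the "t" box appears in the log above).

```python
def merge_box_lines(first: str, second: str) -> str:
    """Join two box-file lines into one box covering both symbols."""
    s1, l1, b1, r1, t1, p1 = first.split()
    s2, l2, b2, r2, t2, p2 = second.split()
    if p1 != p2:
        raise ValueError("boxes are on different pages")
    # The merged rectangle is the smallest one containing both boxes.
    left = min(int(l1), int(l2))
    bottom = min(int(b1), int(b2))
    right = max(int(r1), int(r2))
    top = max(int(t1), int(t2))
    return f"{s1}{s2} {left} {bottom} {right} {top} {p1}"
```

For example, merge_box_lines("r 178 469 184 481 0", "t 184 469 189 486 0")
gives "rt 178 469 189 486 0", which would replace the two original lines.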


Attachments:
test.helveticanarrow.exp0.box 26.5 KB
qtb-test.helveticanarrow.exp0.png 37.1 KB

tesser...@googlecode.com

Feb 11, 2013, 5:52:44 PM
to tesserac...@googlegroups.com

Comment #17 on issue 698 by apej...@gmail.com: "FAILURE! Couldn't find a
matching blob" on a perfectly created .box file and clear image
http://code.google.com/p/tesseract-ocr/issues/detail?id=698

The bug is in TesseractTrainer, more exactly in the _write_boxline method
(multipage_tif.py). Switching the Y coordinates helped.

Otherwise great little piece of software ( tesseract & trainer ;-) )
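
The actual change to _write_boxline is not shown in this thread, so the
following is only a guess at what "switching" amounts to. In the failing
logs above the boxes print as ((20,580),(33,551)), with the first Y value
larger than the second, while the working ones print as
((184,469),(189,486)). Assuming those pairs are (left,bottom),(right,top),
a box-file line can be repaired by making sure bottom <= top. The function
name is hypothetical.

```python
def normalize_box_line(line: str) -> str:
    """Return the box line with its two Y fields ordered as bottom, top."""
    symbol, left, y1, right, y2, page = line.split()
    # Box files use a bottom-left origin, so bottom must not exceed top.
    bottom, top = sorted((int(y1), int(y2)))
    return f"{symbol} {left} {bottom} {right} {top} {page}"
```

Lines that are already in the right order pass through unchanged, so the
function can be applied to a whole file.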

tesser...@googlecode.com

Jun 17, 2015, 5:01:49 AM
to tesserac...@googlegroups.com

Comment #18 on issue 698 by a.keshri...@gmail.com: "FAILURE! Couldn't find
a matching blob" on a perfectly created .box file and clear image
https://code.google.com/p/tesseract-ocr/issues/detail?id=698

Hello sir,
While training Tesseract 3.01, I am getting the error "boxes failed
resegmentation".
I have attached a screenshot of the error (error.png) along with the TIFF
image and box file.
It is simply ignoring some characters (digits) in the image.
This is happening with most of the image (.tiff) and box file pairs I have,
and I am not able to figure out the reason for it.
I want to ask what the reason for this is and how I can resolve it.
Please answer as soon as possible, as I am stuck at this point and unable
to proceed further.

Thanks.

Attachments:
error.PNG 19.0 KB
sseg.myfont21.exp21.tiff 200 KB
sseg.myfont21.exp21.box 271 bytes

tesser...@googlecode.com

Jul 29, 2015, 3:29:09 AM
to tesserac...@googlegroups.com

Comment #19 on issue 698 by zyzhe...@gmail.com: "FAILURE! Couldn't find a
matching blob" on a perfectly created .box file and clear image
https://code.google.com/p/tesseract-ocr/issues/detail?id=698

Hi there,

I also ran into the same problem. Can anyone help me?

I'm running Tesseract 3.03.02 on Ubuntu 14.04. Many thanks

Attachments:
digit.meter.exp0.tif 41.9 KB
digit.meter.exp0.box 8.2 KB