Training error "Couldn't find a matching blob"


Paul Kitchen

May 26, 2018, 2:52:39 AM
to tesseract-ocr
I am creating training data for GD&T symbols using Tesseract 3.05.01. One of the TIFF files I use for training is the attached gdt.symbols.exp10.tif. When I run training on this TIFF with the corresponding gdt.symbols.exp10.box, I get this output:

Tesseract Open Source OCR Engine v3.05.01 with Leptonica
Page 1
FAIL!
APPLY_BOXES: boxfile line 7/Ⓜ ((1153,69),(1431,346)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 10/Ⓜ ((1993,69),(2268,346)): FAILURE! Couldn't find a matching blob
APPLY_BOXES:
   Boxes read from boxfile:      10
   Boxes failed resegmentation:       2
   Found 8 good blobs.
Generated training data for 5 words

Basically, both circled M symbols are failing.

I've attached ImageWithBoxes.PNG, which is a screen capture from jTessBoxEditor showing the TIFF image with boxes. As you can see, the boxes appear to be correct.
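
For readers who want to check box coordinates without a GUI editor, here is a minimal C++ sketch. It assumes the Tesseract 3.x box file layout of `symbol left bottom right top page`, with the origin at the bottom-left corner of the image. The `Box` struct and the `ParseBoxLine` and `FitsInImage` helpers are hypothetical names for illustration and are not part of Tesseract.

```cpp
#include <sstream>
#include <string>

// One entry of a box file (hypothetical struct, not Tesseract's TBOX).
struct Box {
  std::string symbol;
  int left = 0, bottom = 0, right = 0, top = 0, page = 0;
};

// Parses "symbol left bottom right top page".
// Returns false if any of the six fields is missing or malformed.
bool ParseBoxLine(const std::string& line, Box* box) {
  std::istringstream in(line);
  return static_cast<bool>(in >> box->symbol >> box->left >> box->bottom >>
                           box->right >> box->top >> box->page);
}

// True if the box is non-empty and lies inside a width x height image.
bool FitsInImage(const Box& box, int width, int height) {
  return box.left >= 0 && box.bottom >= 0 && box.left < box.right &&
         box.bottom < box.top && box.right <= width && box.top <= height;
}
```

For example, `ParseBoxLine("Ⓜ 1153 69 1431 346 0", &box)` (assuming page 0) fills in the coordinates reported for boxfile line 7 above, and `FitsInImage(box, width, height)` can then be called with the actual TIFF dimensions.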

Why isn't tesseract able to use the circle M symbols for training? Can I change the image of the symbols somehow to help tesseract... maybe connect the circle and M parts with a line?

Thanks in advance.
gdt.symbols.exp10.box
gdt.symbols.exp10.tif
ImageWithBoxes.PNG

Quan Nguyen

May 27, 2018, 10:07:13 AM
to tesseract-ocr
You need a much larger sample, in the range of hundreds or at least several dozens, so that even though some symbols could experience "Couldn't find a matching blob" errors, other samples would get picked up.

Paul Kitchen

May 29, 2018, 4:54:05 PM
to tesseract-ocr
I'm actually training with several other TIFF images which contain the "Circle M" symbol (uppercase M inside a circle). In all cases, tesseract reports the error message "Couldn't find a matching blob". So I think the issue is something fundamental with the algorithm rather than just an anomaly with the image I posted. I suspect that the circle around the M might have something to do with it but I don't know enough about tesseract's algorithm to know how it handles this situation. Are there any parameters I could use that would instruct tesseract to use the raw image as-is rather than trying to match blobs?

Paul Kitchen

May 31, 2018, 4:25:25 PM
to tesseract-ocr
After a lot of stepping through tesseract code, I found the problem. 

1) In file coutln.cpp, function C_OUTLINE::IsLegallyNested(), we assign outer_area() to an inT32, parent_area. Lower in the function, we multiply child->outer_area() by parent_area. This caused an integer overflow, which gave the product the wrong sign. The fix was to make parent_area an inT64 so that the multiplication cannot overflow.

The two 32-bit integers being multiplied were -51874 and 60218. The true result is -3123748532, but a signed 32-bit integer can only hold values from -2^31 to 2^31 - 1 (-2147483648 to 2147483647), so the product overflowed. The computed result was 1171218764, causing the if-statement to go down the wrong path.
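
The arithmetic can be reproduced in isolation. The sketch below is not the Tesseract code: `WrappedProduct32` and `Product64` are hypothetical helpers, and the 32-bit wraparound is emulated with unsigned arithmetic because overflowing a signed integer is undefined behavior in C++.

```cpp
#include <cstdint>

// Emulates what a wrapping 32-bit multiply yields. Unsigned arithmetic is
// used because overflowing a signed integer is undefined behavior in C++.
std::int32_t WrappedProduct32(std::int32_t a, std::int32_t b) {
  const std::uint32_t wrapped =
      static_cast<std::uint32_t>(a) * static_cast<std::uint32_t>(b);
  return static_cast<std::int32_t>(wrapped);
}

// Widens one operand first so the multiplication is carried out in 64 bits.
std::int64_t Product64(std::int32_t a, std::int32_t b) {
  return static_cast<std::int64_t>(a) * b;
}
```

With the values from the post, `WrappedProduct32(-51874, 60218)` gives 1171218764 (positive), while `Product64(-51874, 60218)` gives -3123748532, so a sign test on the 32-bit product takes the wrong branch.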

shree

May 31, 2018, 4:39:08 PM
to tesseract-ocr
This has been an issue for a long time. Thanks for finding the problem.

Please submit a PR on github.

Zdenko Podobny

Jun 2, 2018, 4:16:49 AM
to tesser...@googlegroups.com

On Thu, May 31, 2018 at 22:39, shree <shree...@gmail.com> wrote:
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/1ef0e822-9518-4cbb-af39-5a8ec6370d00%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Zdenko Podobny

Jun 2, 2018, 4:22:16 AM
to tesser...@googlegroups.com
Please check if this is ok now. If yes, I am willing to make 3.05.02 release ;-)

Zdenko



Paul Kitchen

Jun 2, 2018, 10:03:10 PM
to tesseract-ocr
Zdenko,

Thanks for making that fix. I am currently running tesseract from source on my computer. I've already made the fix on my source. However, if the fix were in an official release, then I could go back to using the officially released product.

I did find one other bug that I fixed locally in my tesseract code. Unless this bug is also fixed in the official version, I won't be able to drop my custom code. Here are the bug details:

1) In file boxread.cpp, function ReadAllBoxes(), we convert a GenericVector<char> to const char* without a trailing '\0'. This can cause a buffer read overrun inside the call to ReadMemBoxes(). To fix this, change function LoadDataFromFile() to always reserve an extra byte so the caller can append a '\0' if they want. Then, in ReadAllBoxes(), append '\0' to the vector after calling LoadDataFromFile(). Here are the fixed functions:


inline bool LoadDataFromFile(const STRING& filename,
                             GenericVector<char>* data) {
  bool result = false;
  FILE* fp = fopen(filename.string(), "rb");
  if (fp != NULL) {
    fseek(fp, 0, SEEK_END);
    size_t size = ftell(fp);
    fseek(fp, 0, SEEK_SET);
    if (size > 0) {
      // reserve an extra byte in case caller wants to append a '\0' character
      data->reserve(size + 1);
      data->resize_no_init(size);
      result = fread(&(*data)[0], 1, size, fp) == size;
    }
    fclose(fp);
  }
  return result;
}

bool ReadAllBoxes(int target_page, bool skip_blanks, const STRING& filename,
                  GenericVector<TBOX>* boxes,
                  GenericVector<STRING>* texts,
                  GenericVector<STRING>* box_texts,
                  GenericVector<int>* pages) {
  GenericVector<char> box_data;
  if (!tesseract::LoadDataFromFile(BoxFileName(filename), &box_data))
    return false;
  box_data.push_back('\0');
  return ReadMemBoxes(target_page, skip_blanks, &box_data[0], boxes, texts,
                      box_texts, pages);
}

Zdenko Podobny

Jun 4, 2018, 2:42:05 AM
to tesser...@googlegroups.com
Paul,

At the moment the focus is on the 4.0 release, but I understand that some users still need or prefer to use 3.05.

Can you create a test/demonstration case for your last bugfix? Is it not fixed in 4.00?

Zdenko



Stefan Weil

Jun 4, 2018, 11:32:44 AM
to tesseract-ocr
As far as I can see, 4.0.0 is good. I have sent a pull request which backports the fix from 4.0.0 (a simplified variant of Paul's fix) to 3.05.

Stefan

Zdenko Podobny

Jun 4, 2018, 1:15:18 PM
to tesser...@googlegroups.com
Stefan,

Paul also suggested modifying LoadDataFromFile (ccutil/genericvector.h). Is that modification not needed?

Zdenko



Paul Kitchen

Jun 4, 2018, 9:48:41 PM
to tesseract-ocr
Here is a sample of the problem it causes. I run the following to train the attached image and box file:

tesseract gdt.symbols.exp0.tif gdt.symbols.exp0 box.train

And here is the output:

Tesseract Open Source OCR Engine v3.05.00dev with Leptonica
Page 1
Bad box coordinates in boxfile string! ²²²²▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌╦ÇƧ≡¿←
APPLY_BOXES:
   Boxes read from boxfile:       7
   Found 7 good blobs.
Generated training data for 3 words

The message about the bad box coordinates appears because function ReadMemBoxes() reads memory past the end of the buffer passed as its const char* box_data parameter.
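
To illustrate why the missing terminator matters, here is a minimal sketch that uses `std::vector<char>` in place of Tesseract's `GenericVector<char>`. `CountLines` is a hypothetical stand-in for any parser that, like a C-string routine, scans until it meets '\0'; it is not the real `ReadMemBoxes()`. `MakeTerminated` is likewise a made-up helper.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical C-string style parser: counts non-empty lines up to the
// first '\0'. It is only safe if the buffer really contains a terminator;
// otherwise the loop keeps reading past the end of the allocation.
std::size_t CountLines(const char* text) {
  std::size_t lines = 0;
  bool in_line = false;
  for (const char* p = text; *p != '\0'; ++p) {
    if (*p == '\n') {
      in_line = false;
    } else if (!in_line) {
      in_line = true;
      ++lines;
    }
  }
  return lines;
}

// Copies raw file bytes and appends the terminator the parser relies on.
std::vector<char> MakeTerminated(const std::string& file_bytes) {
  std::vector<char> data(file_bytes.begin(), file_bytes.end());
  data.push_back('\0');
  return data;
}
```

Passing `MakeTerminated(bytes).data()` to `CountLines` is well defined, whereas passing a pointer to the raw bytes without the appended '\0' would let the loop run into whatever memory follows the buffer, which is consistent with the garbage characters in the output above.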

With the fix I suggested, this is the output:

Tesseract Open Source OCR Engine v3.05.00dev with Leptonica
Page 1
APPLY_BOXES:
   Boxes read from boxfile:       7
   Found 7 good blobs.
Generated training data for 3 words
gdt.symbols.exp0.tif
gdt.symbols.exp0.box

Paul Kitchen

Jun 4, 2018, 10:38:39 PM
to tesseract-ocr
Zdenko,

I checked out the latest tesseract code and updated to branch 3.05. I see that the int64_t area bug is already fixed (thanks!). I also see that the buffer read overrun is partially fixed. There is this line in ReadAllBoxes():

box_data.push_back('\0');

Because the vector has no spare capacity at that point, its memory has to be freed and reallocated to make room for the extra byte, which is quite inefficient. That is why I added this line to LoadDataFromFile():

data->reserve(size + 1);
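
The effect of reserving the extra byte can be sketched with `std::vector<char>`, used here as a stand-in for `GenericVector<char>` (an assumption; the two containers differ in how they grow). `LoadBytes` and `AppendTerminatorReallocates` are hypothetical helpers, and the file contents are replaced by placeholder bytes.

```cpp
#include <cstddef>
#include <vector>

// Mimics loading `size` bytes. When reserve_extra is true, room for one
// trailing '\0' is reserved up front, as in the proposed LoadDataFromFile().
void LoadBytes(std::size_t size, bool reserve_extra, std::vector<char>* data) {
  data->reserve(reserve_extra ? size + 1 : size);
  data->resize(size, 'x');  // placeholder for the bytes fread() would store
}

// Appends the terminator and reports whether the capacity changed,
// which for std::vector means the buffer was reallocated.
bool AppendTerminatorReallocates(std::vector<char>* data) {
  const std::size_t capacity_before = data->capacity();
  data->push_back('\0');
  return data->capacity() != capacity_before;
}
```

After `LoadBytes(size, true, &data)`, the capacity is at least `size + 1`, so `AppendTerminatorReallocates(&data)` returns false. Without the extra byte, whether the append reallocates depends on how much capacity the container happened to allocate.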

I'm willing to make the change in a feature branch and then create the pull request. I tried to create a branch on GitHub, but apparently I don't have branch-creation privileges. I thought about forking, but I'm not familiar with how that works or whether it would even be appropriate. Can you either make the change yourself or grant me branch-creation privileges in the repo so I can make the change in a branch and then create a pull request?

By the way, I checked out master branch and it also has the same problem in LoadDataFromFile().

Zdenko Podobny

Jun 5, 2018, 5:00:23 AM
to tesser...@googlegroups.com
Please make a PR against the master (4.0) branch and I will cherry-pick it for 3.05.

Zdenko



Paul Kitchen

Jun 5, 2018, 9:06:17 AM
to tesseract-ocr
Zdenko,

Unfortunately I don't seem to have write permissions on the tesseract repo so I am unable to create a branch off of master to make the changes. Who do I need to lobby to get write permission?

Zdenko Podobny

Jun 5, 2018, 9:23:41 AM
to tesser...@googlegroups.com
You need to fork the official repository; in your fork you have all the permissions you need. Once you have made your changes, you can send a pull request with them to the official repository.

Zdenko



Paul Kitchen

Jun 5, 2018, 10:52:54 AM
to tesseract-ocr
Zdenko,

I'm new to this so hopefully I did everything correctly. Here is the issue I created:


And here is the pull request:

Zdenko Podobny

Jun 5, 2018, 10:59:08 AM
to tesser...@googlegroups.com
Yes, it is OK, but you do not have to create a separate issue for a PR (a PR is an issue too).

Zdenko



Paul Kitchen

Jun 5, 2018, 11:53:16 AM
to tesseract-ocr
Thank you for your help with these issues. The 3.05 branch now has all the issues fixed that I found.

Mehul Bhardwaj

Aug 10, 2018, 7:51:59 AM
to tesseract-ocr
Hi,

I went through this discussion thread and updated to Tesseract 3.05.02. Previously I was working with version 3.05. I was getting the same error of "FAILURE: Couldn't find a matching blob" for about 15% of my training characters. 

But even after updating, I am still getting the exact same number of errors as before.

Could there be any other reason for this?

I have about 174 training images, which are nearly identical in brightness, sharpness, and background noise, and have identical character spacing and resolution.
Out of 174 images, 48 had no such error and 106 had 5 or fewer such errors. Each image has, on average, 170 characters. So I am fairly certain that the image type and other factors such as character size, scaling, and spacing have nothing to do with it.

Any recommended tests to identify the issue would be much appreciated.

Best Regards
Mehul

ry...@inspectionxpert.com

Mar 19, 2019, 2:54:52 PM
to tesseract-ocr
Wondering if this issue was fixed in Tesseract 3.05.02. Any ideas?

Tairen Chen

Jun 13, 2019, 5:15:24 PM
to tesseract-ocr
Hi Paul, @Zdenko, and all,

I am trying to use Tesseract 3.05.02 for License Plate (LP) recognition (unclear and fuzzy LPs) on Ubuntu 16.04.

I tried Tesseract 4.10 with lstmtraining, but the results were not good. I suspect this is because lstmtraining treats all characters and digits in the LP as having the same position indices. So I went back to Tesseract 3.05, where we can specify the character and digit positions in detail.

After generating the TIFF and BOX files, I get errors similar to the ones mentioned here when I try to generate the "box.train" output with Tesseract 3.05.02.

The error output is below:
"""
Tesseract Open Source OCR Engine v3.05.02 with Leptonica
Page 1
Error in pixConvertRGBToGray: pixs not 32 bpp
Error in pixGetWidth: pix not defined
Error in pixGetHeight: pix not defined
Error in pixGetDepth: pix not defined
Error in pixCreateHeader: width must be > 0
Error in pixCreateNoInit: pixd not made
Error in pixCreate: pixd not made
Error in pixSetAllArbitrary: pix not defined
Error in pixConvertRGBToGray: pixs not 32 bpp
Detected 24 diacritics
row xheight=61, but median xheight = 10.75
FAIL!
APPLY_BOXES: boxfile line 10/0 ((529,1434),(564,1482)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 12/T ((609,1430),(638,1481)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 17/7 ((821,1430),(848,1480)): FAILURE! Couldn't find a matching blob
APPLY_BOXES:
   Boxes read from boxfile:      21
   Boxes failed resegmentation:       3
APPLY_BOXES: Unlabelled word at :Bounding box=(354,1478)->(389,1494)
APPLY_BOXES: Unlabelled word at :Bounding box=(785,1479)->(961,1497)
APPLY_BOXES: Unlabelled word at :Bounding box=(1416,-425)->(1468,-367)
   Found 18 good blobs.
   Leaving 9 unlabelled blobs in 0 words.
   3 remaining unlabelled words deleted.
Generated training data for 4 words
"""
I do not know how to debug Tesseract 3.05. I attach the TIFF and BOX files for your reference ("eng.hollow2.exp0.box" is the originally generated file and "eng.hollow2.exp0_revised.box" is the one I edited using jTessBoxEditor).
Any hints or help would be appreciated. Thank you in advance!

All the best,
Tairen
eng.hollow2.exp0.box
eng.hollow2.exp0_revised.box
eng.hollow2.exp0.tif