Train Tesseract 3.04 to recognize six patterns that do not exist in UTF-8

Juan Pablo Aveggio

Sep 22, 2015, 3:00:20 AM

to tesseract-ocr

Hello

I'm trying to train Tesseract to recognize patterns printed on tickets. Each ticket has a unique pattern in a fixed position that determines its value. As these patterns are not Unicode characters, I assigned them the characters 'a' to 'f'.

I created a .tif image with six patterns:

bil.pat.exp0.tif

and the corresponding box file:

bil.pat.exp0.box

a 32 692 165 958 0 
b 221 734 354 958 0 
c 32 446 165 628 0 
d 221 488 354 628 0 
e 32 275 165 373 0 
f 221 317 277 373 0
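
For reference, each line of a Tesseract box file has the form "symbol left bottom right top page", with the origin at the bottom-left corner of the image. Most imaging libraries put the origin at the top-left instead, so a small conversion helps when checking boxes against the picture. The following is a minimal sketch of that conversion; the names BoxEntry, ImageRect, parseBoxLine and toImageRect are hypothetical and not part of Tesseract, and the image height must be supplied by the caller because it is not stated in this thread.

```cpp
#include <sstream>
#include <string>

// One entry of a box file: "<symbol> <left> <bottom> <right> <top> <page>".
// Coordinates are measured from the bottom-left corner of the image.
struct BoxEntry {
    std::string symbol;
    int left, bottom, right, top, page;
};

// The same rectangle expressed with a top-left origin (hypothetical helper type).
struct ImageRect {
    int x, y, width, height;
};

// Parses one box-file line; returns false if any field is missing or malformed.
bool parseBoxLine(const std::string& line, BoxEntry& out) {
    std::istringstream in(line);
    in >> out.symbol >> out.left >> out.bottom >> out.right >> out.top >> out.page;
    return !in.fail();
}

// Flips the vertical axis: the top edge of the box becomes the y offset
// from the top of the image.
ImageRect toImageRect(const BoxEntry& b, int imageHeight) {
    ImageRect r;
    r.x = b.left;
    r.y = imageHeight - b.top;
    r.width = b.right - b.left;
    r.height = b.top - b.bottom;
    return r;
}
```

With an assumed image height of 1000 pixels, the line "a 32 692 165 958 0" maps to x=32, y=42, width=133, height=266.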

Then I ran:

tesseract bil.pat.exp0.tif bil.pat.exp0 box.train

and the output was:

Tesseract Open Source OCR Engine v3.04.00 with Leptonica 
Page 1 
APPLY_BOXES: 
   Boxes read from boxfile:       6 
APPLY_BOXES: Unlabelled word at :Bounding box=(-958,221)->(-734,277) 
APPLY_BOXES: Unlabelled word at :Bounding box=(-628,221)->(-488,277) 
APPLY_BOXES: Unlabelled word at :Bounding box=(-958,32)->(-734,88) 
APPLY_BOXES: Unlabelled word at :Bounding box=(-628,32)->(-488,88) 
APPLY_BOXES: Unlabelled word at :Bounding box=(-373,32)->(-317,88) 
   Found 6 good blobs. 
   5 remaining unlabelled words deleted. 
Generated training data for 6 words

I don't know what the negative coordinates could mean. Despite this, I tried to keep going.

My font_properties is:

bil.pat.box 0 0 1 0 0

bil.words_list is:

a 
b 
c 
d 
e 
f 

Then I ran:

$ unicharset_extractor bil.pat.exp0.box
Extracting unicharset from bil.pat.exp0.box 
Wrote unicharset file ./unicharset.

but the unicharset file has:

9 
NULL 0 NULL 0 
Joined 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0     # Joined [4a 6f 69 6e 65 64 ] 
|Broken|0|1 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0        # Broken 
a 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # a [61 ] 
b 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # b [62 ] 
c 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # c [63 ] 
d 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # d [64 ] 
e 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # e [65 ] 
f 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # f [66 ]

Then I ran:

$ mftraining -F font_properties -U unicharset -O bil.unicharset bil.pat.exp0.tr  
Read shape table shapetable of 0 shapes 
Reading bil.pat.exp0.tr ... 
Bad properties for index 3, char a: 0,255 0,255 0,0 0,0 0,0 
Bad properties for index 4, char b: 0,255 0,255 0,0 0,0 0,0 
Bad properties for index 5, char c: 0,255 0,255 0,0 0,0 0,0 
Bad properties for index 6, char d: 0,255 0,255 0,0 0,0 0,0 
Bad properties for index 7, char e: 0,255 0,255 0,0 0,0 0,0 
Bad properties for index 8, char f: 0,255 0,255 0,0 0,0 0,0 
Warning: no protos/configs for Joined in CreateIntTemplates() 
Warning: no protos/configs for |Broken|0|1 in CreateIntTemplates() 
Warning: no protos/configs for a in CreateIntTemplates() 
Warning: no protos/configs for b in CreateIntTemplates() 
Warning: no protos/configs for c in CreateIntTemplates() 
Warning: no protos/configs for d in CreateIntTemplates() 
Warning: no protos/configs for e in CreateIntTemplates() 
Warning: no protos/configs for f in CreateIntTemplates() 
Done!

What am I doing wrong?

I am on Debian.

tesseract 3.04.00 
 leptonica-1.72 
  libgif 4.1.6(?) : libjpeg 6b (libjpeg-turbo 1.4.0) : libpng 1.2.50 : libtiff 4.0.5 : zlib 1.2.8 : libwebp 0.4.3 : libopenjp2 2.1.0

Thank you very much in advance!

Dmitri Silaev

Sep 23, 2015, 5:26:43 PM

to tesser...@googlegroups.com

Hi Juan Pablo,

The problem seems interesting. However, I'm not sure you can use Tesseract for that. Could you show one or more example tickets?

Best regards,
Dmitri Silaev
www.CustomOCR.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/a619104a-79d5-40ec-8a08-a6a9941ec292%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Juan Pablo Aveggio

Sep 25, 2015, 9:12:03 PM

to tesseract-ocr

Hi Dmitri Silaev.

Thanks for the reply. They are bills (banknotes); sorry for the mistranslation. Here are some examples:

2 5 10 20 50 100

These patterns are printed in relief for the blind, but on the bills they are so worn that the relief no longer works. So I'm working on an Android app that detects the value and speaks it to the user.

Dmitri Silaev

Sep 26, 2015, 10:36:18 AM

to tesser...@googlegroups.com

Hi Juan Pablo,

The problem cannot be solved by Tesseract as is. Even given images as perfect as the ones you've shown, Tesseract would fail, since your "characters" are too disjointed, have no meaningful baseline, and only occur as singletons.

However, simple and robust recognition can be implemented without Tesseract, using common sense and a bit of programming. As for image processing operations, you would only need trivial thresholding. Some more involved preprocessing is required, though, to bring the captured image close to the form you've shown in your sample images.

That preprocessing would be needed anyway, even if Tesseract worked for your "characters". Tell me what you have done so far in this direction so I can share more details about the above method, if you wish.

-Dmitri

Juan Pablo Aveggio

Sep 27, 2015, 5:18:47 PM

to tesseract-ocr

Hi Dmitri Silaev

Thanks for your helpful reply. Actually, I have made almost no progress on image preprocessing; I just convert the image to grayscale before applying OCR. And I could not get good training data. The test code is as follows:

#include <opencv2/highgui/highgui.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>
#include <iostream>

using namespace cv;

int main(int, char**)
{
    VideoCapture cap(0); // open the default camera
    if(!cap.isOpened())  // check if we succeeded
        return -1;
    int c;
    Mat gray;
    namedWindow("gray", 1);
    tesseract::TessBaseAPI tess;
    // Init() returns 0 on success
    if(tess.Init("/usr/share/tesseract-ocr/tessdata/", "bil", tesseract::OEM_DEFAULT) != 0)
        return -1;
    tess.SetPageSegMode(tesseract::PSM_SINGLE_WORD);

    for(;;)
    {
        Mat frame;
        cap >> frame;
        if(frame.empty()) break; // no frame delivered by the camera
        cvtColor(frame, gray, CV_BGR2GRAY);

        c = waitKey(30);
        if(c == 27) break;   // Esc quits
        else if(c > 0) {     // any other key runs OCR on the current frame
            tess.SetImage((uchar*)gray.data, gray.cols, gray.rows, 1, gray.cols);
            Boxa* boxes = tess.GetComponentImages(tesseract::RIL_WORD, true, NULL, NULL);
            if(boxes != NULL){
                for(int i=0; i < boxaGetCount(boxes); i++){
                    BOX* box = boxaGetBox(boxes, i, L_CLONE);
                    rectangle(gray, Point(box->x, box->y), Point(box->x+box->w, box->y+box->h), Scalar(255, 0, 0));
                    boxDestroy(&box); // release the clone
                }
                boxaDestroy(&boxes);
            }
            char* out = tess.GetUTF8Text();
            std::cout << out << std::endl;
            delete [] out; // the caller owns the buffer returned by GetUTF8Text()
            imshow("gray", gray);
            waitKey(4000);
        }else imshow("gray", gray);
    }
    // End() releases Tesseract's resources; calling the destructor explicitly
    // would destroy the object twice when it goes out of scope.
    tess.End();
    // the camera will be deinitialized automatically in VideoCapture destructor
    return 0;
}

With this code and my training data, the only thing I have achieved is drawing a rectangle around the pattern, and only when I hold the bill close enough to the camera and exactly horizontal. It has produced some results, but with a very low hit rate. I also tested the PSM_SINGLE_CHAR page segmentation mode, with similar results.

I thought all this could be due to the errors thrown during the training process, which resulted in bad training data. Now I understand that it is because my characters appear disconnected and isolated, while Tesseract is designed to detect horizontal lines of text made of words, mostly of several characters.

So could I just use OpenCV to solve this problem? The hardest part seems to be finding the region of the bill that contains the pattern, and its rotation. Once I have this information, I could straighten the region and deal with this small subimage more easily.

I do not have much experience with OpenCV, but I'm willing to learn. I imagine that we will have to apply an algorithm to detect edges or corners, to try to get the contour of the bill. We have to consider that the bill might be only partially captured by the camera. The camera could even be seeing the reverse side, in which case the pattern will not be found. I see it as quite difficult, but it's a good challenge.

Finally, I would note that I selected this pattern because I thought it would be the easiest to detect. New currency is also being issued, with many different typefaces and designs, but the patterns have not changed. Any suggestions are welcome.

Thank you very much for your interest.

Best regards

Juan Pablo Aveggio

Dmitri Silaev

Oct 2, 2015, 4:39:12 PM

to tesser...@googlegroups.com

Hi Juan Pablo,

Here are my thoughts on how I'd approach the initial version of the image processor. I'm not sure OpenCV is the best tool for all of this, and you're free to choose how to implement it.

As far as I understand the idea of your app, bills can be captured in an arbitrary manner, but in most cases close to horizontal position and at a projection close to fronto-parallel. This allows us to assume that images would have minimal distortions of characters and symbols. Also, the user would try to capture as much bill area as possible, hence another assumption - most of the white area containing tactile symbols would be visible in the captured image.

That said, for the initial version I wouldn't even bother with locating the full bill area (edge detection, etc.); I would only do a simple search for the biggest and lightest area. That should be the "white area". In this area, apply a simple threshold and you'll get connected components (CCs). Among them, you need only those that are close in shape to rotated squares. They are also easy to find, as such CCs fill up almost the entire area of their rotated bounding rectangle. These CCs would be the tactile symbols you are after.

The rest is trivial - count tactile symbols and get the denomination of your bill.
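
The thresholding, connected-component and fill-ratio steps described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not a tested solution for real photos: it is written in plain C++ without OpenCV so that it stands alone, GrayImage and countSolidBlobs are hypothetical names, the threshold, minimum area and fill ratio are parameters the caller has to tune, and the fill ratio is measured against the axis-aligned bounding box rather than the rotated bounding rectangle mentioned above (with OpenCV, cv::minAreaRect could supply the rotated one). Locating the white area and mapping the symbol count to a denomination are left out.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Grayscale image stored row by row, one byte per pixel (hypothetical helper type).
struct GrayImage {
    int width;
    int height;
    std::vector<std::uint8_t> pixels;
};

// Counts dark connected components that contain at least minArea pixels and
// cover at least minFill of their axis-aligned bounding box.
int countSolidBlobs(const GrayImage& img, std::uint8_t threshold,
                    int minArea, double minFill) {
    std::vector<char> visited(img.pixels.size(), 0);
    const int dx[4] = {1, -1, 0, 0};
    const int dy[4] = {0, 0, 1, -1};
    int count = 0;

    for (int y0 = 0; y0 < img.height; ++y0) {
        for (int x0 = 0; x0 < img.width; ++x0) {
            const std::size_t start = static_cast<std::size_t>(y0) * img.width + x0;
            // Thresholding: only pixels darker than `threshold` are foreground.
            if (visited[start] || img.pixels[start] >= threshold) continue;

            // Flood-fill one connected component (4-connectivity),
            // tracking its pixel count and bounding box.
            int area = 0, minX = x0, maxX = x0, minY = y0, maxY = y0;
            std::vector<std::pair<int, int> > stack;
            stack.push_back(std::make_pair(x0, y0));
            visited[start] = 1;
            while (!stack.empty()) {
                const std::pair<int, int> p = stack.back();
                stack.pop_back();
                ++area;
                if (p.first < minX) minX = p.first;
                if (p.first > maxX) maxX = p.first;
                if (p.second < minY) minY = p.second;
                if (p.second > maxY) maxY = p.second;
                for (int k = 0; k < 4; ++k) {
                    const int nx = p.first + dx[k];
                    const int ny = p.second + dy[k];
                    if (nx < 0 || ny < 0 || nx >= img.width || ny >= img.height) continue;
                    const std::size_t n = static_cast<std::size_t>(ny) * img.width + nx;
                    if (visited[n] || img.pixels[n] >= threshold) continue;
                    visited[n] = 1;
                    stack.push_back(std::make_pair(nx, ny));
                }
            }

            // Keep components that are large enough and nearly fill their box.
            const double boxArea =
                static_cast<double>(maxX - minX + 1) * (maxY - minY + 1);
            if (area >= minArea && area / boxArea >= minFill) ++count;
        }
    }
    return count;
}
```

The minimum-area filter discards isolated noise pixels, and the fill-ratio filter discards thin or L-shaped components such as printed strokes, which cover only a small part of their bounding box.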

Of course, you'd add more sophistication to cope with real world images but the backbone of the algorithm looks to me like this. All work is done in grayscale.

HTH

Shishir Singhal

Oct 3, 2015, 1:35:16 PM

to tesseract-ocr

Sir, I am doing a project on handwritten character recognition based on Google Tesseract, but the problem is that I can't find any suitable tool to train it on handwriting. After some research on the internet, I understand that I first have to build a box file for the image to be learned and then edit this file with a box file editor, but I am not able to make a box file for the image. Can you please tell me how to make a box file?

Dmitri Silaev

Oct 5, 2015, 9:01:57 AM

to tesser...@googlegroups.com

Shishir,

Do not hijack this thread. Go create a separate one with your own question.

-Dmitri

Tom Morris

Oct 5, 2015, 12:16:21 PM

to tesseract-ocr

I think Dmitri's suggestion to start simple is a good one, but, if you need it, don't forget that you've got a lot of other information that can be leveraged to help. The notes all have a fixed aspect ratio (and size?). They've got a relatively standard layout. The denomination is encoded in multiple places on the note. You can use as much of this additional information as you need to help make the solution easier (or to cross-check the quality of your result).
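
Two of these cross-checks could be sketched as below. This is a rough illustration only: hasExpectedAspect and agreedDenomination are hypothetical helper names, and the expected aspect ratio and tolerance are values the caller would have to measure for the actual notes, since they are not given in this thread.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// True when a detected rectangle has roughly the expected long-side to
// short-side ratio, regardless of orientation. `tolerance` is relative
// (0.05 means within 5 percent of expectedRatio).
bool hasExpectedAspect(double width, double height,
                       double expectedRatio, double tolerance) {
    if (width <= 0.0 || height <= 0.0) return false;
    const double longSide = width > height ? width : height;
    const double shortSide = width > height ? height : width;
    return std::fabs(longSide / shortSide - expectedRatio)
           <= tolerance * expectedRatio;
}

// Returns the denomination on which all independent readings agree,
// or -1 when there are no readings or they disagree.
int agreedDenomination(const std::vector<int>& readings) {
    if (readings.empty()) return -1;
    for (std::size_t i = 1; i < readings.size(); ++i) {
        if (readings[i] != readings[0]) return -1;
    }
    return readings[0];
}
```

A candidate note outline that fails the aspect check can be rejected before any symbol counting, and a result is reported only when every place the denomination was read from gives the same value.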