Extra characters showing up


Ed Dow

Feb 25, 2022, 1:02:27 AM
to tesseract-ocr
Greetings,

I'm using tesseract 4.0.0 in a C/C++ application where I capture an image and then "scrape" text/data from it. Tesseract has trouble recognizing an ROI that contains only a few characters (see attached).

The attached image is:  014
Recognized as:  /~—6h014 5

If I trim the extra space around the number, the result improves. The problem is that the string of characters sometimes extends outside the ROI, so I have to enlarge the ROI to capture all of them.

I've tried using OpenCV to grayscale, blur, and resize the image, which seems to help a little. I've also tried all the page segmentation modes (PSM).

The other puzzling thing is that it works great from the command line. Maybe that is because the image is saved as a JPG before the OCR is done, whereas inside the application it's raw data.

Any thoughts?
Ed


Tesseract Version:

tesseract 4.0.0-beta.1
 leptonica-1.75.3
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
screenNum_014.jpg
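
One way to check whether the JPG round trip explains the command-line difference is to dump the exact buffer the application hands to Tesseract and run the command-line tool on that file. The following is a minimal sketch under stated assumptions: encodePgm is a hypothetical helper (not part of Tesseract or OpenCV), the buffer is 8-bit with one byte per pixel, and the row stride is known. It encodes the buffer as a binary PGM, a format Leptonica can read.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: encode an 8-bit, single-channel image as binary PGM ("P5").
// `stride` is the number of bytes between the starts of consecutive rows; it can
// be larger than `width` when rows are padded.
std::string encodePgm(const std::uint8_t* data, int width, int height,
                      std::size_t stride) {
    assert(data != nullptr && width > 0 && height > 0);
    assert(stride >= static_cast<std::size_t>(width));

    // PGM header: magic number, dimensions, maximum gray value.
    std::string out = "P5\n" + std::to_string(width) + " " +
                      std::to_string(height) + "\n255\n";

    // Copy only the `width` meaningful bytes of each row, skipping any padding.
    for (int y = 0; y < height; ++y) {
        const std::uint8_t* row = data + static_cast<std::size_t>(y) * stride;
        out.append(reinterpret_cast<const char*>(row),
                   static_cast<std::size_t>(width));
    }
    return out;
}
```

Writing the returned string to a file opened in binary mode (for example debug.pgm) and running "tesseract debug.pgm -" on it should show whether the command line and the application disagree on identical pixels, or only on different inputs.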

Ed Dow

Mar 1, 2022, 12:22:19 PM
to tesseract-ocr
Greetings,

I found a potential solution: rewrite each pixel as either white or black based on a set threshold. OpenCV's "threshold" function does just that, but Tesseract was still finding "ghost" characters in the white areas of the image, so I had to find where the string starts and grab an ROI from that point. Note that the THRESH_BINARY_INV flag also inverts the result: pixels above the threshold become black and all others become white. From what I've read, Tesseract prefers black characters on a white background.
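
The two steps described above (inverse binary thresholding, then locating where the characters start) can be sketched in plain C++ without OpenCV. This is only an illustration under assumptions: Box, thresholdBinaryInv and inkBounds are hypothetical names, the image is 8-bit grayscale in row-major order with no row padding, and the characters end up black (0) after thresholding.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for illustration; not OpenCV or Tesseract APIs.
struct Box { int x, y, width, height; };  // width == 0 means "no ink found"

// Same rule as OpenCV's THRESH_BINARY_INV: values above `thresh` become 0,
// all other values become `maxval`.
std::vector<std::uint8_t> thresholdBinaryInv(const std::vector<std::uint8_t>& gray,
                                             std::uint8_t thresh,
                                             std::uint8_t maxval) {
    std::vector<std::uint8_t> out(gray.size());
    for (std::size_t i = 0; i < gray.size(); ++i) {
        out[i] = gray[i] > thresh ? 0 : maxval;
    }
    return out;
}

// Tight bounding box around all black (0) pixels, grown by `margin` pixels on
// every side and clamped to the image.
Box inkBounds(const std::vector<std::uint8_t>& bin, int width, int height,
              int margin) {
    assert(width > 0 && height > 0 && margin >= 0);
    assert(bin.size() == static_cast<std::size_t>(width) * height);

    int minX = width, minY = height, maxX = -1, maxY = -1;
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            if (bin[static_cast<std::size_t>(y) * width + x] == 0) {
                minX = std::min(minX, x);
                maxX = std::max(maxX, x);
                minY = std::min(minY, y);
                maxY = std::max(maxY, y);
            }
        }
    }
    if (maxX < 0) return Box{0, 0, 0, 0};  // image contains no black pixels

    minX = std::max(0, minX - margin);
    minY = std::max(0, minY - margin);
    maxX = std::min(width - 1, maxX + margin);
    maxY = std::min(height - 1, maxY + margin);
    return Box{minX, minY, maxX - minX + 1, maxY - minY + 1};
}
```

With OpenCV, the fields of the returned box could be passed to Rect(x, y, width, height) to crop the thresholded Mat before handing it to Tesseract. A small margin keeps a little white around the characters while still cutting away the large empty areas.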

So the solution I came up with is the following using OpenCV and tesseract:
 
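    #include <opencv2/opencv.hpp>
    #include <tesseract/baseapi.h>
    #include <string>
    using namespace cv;
    using namespace std;

    // The statements below belong inside a function; xStart, yStart, xMove,
    // yMove and popupNum are declared elsewhere in the application.
    Mat img;  // should already hold the captured image
    Mat cropped;
    Mat grayed;
    Mat inverted;

    // Crop the original image to the defined ROI (x, y, width, height)
    Rect roi(xStart, yStart, xMove, yMove);
    cropped = img(roi);

    // Convert the image to grayscale
    cvtColor(cropped, grayed, COLOR_BGR2GRAY);

    // Binarize and invert: pixels above 100 become black, the rest white
    threshold(grayed, inverted, 100, 255, THRESH_BINARY_INV);

    // Use tesseract to OCR (Init returns 0 on success; check it in real code)
    tesseract::TessBaseAPI *ocr = new tesseract::TessBaseAPI();
    ocr->Init(NULL, "eng", tesseract::OEM_LSTM_ONLY);

    ocr->SetPageSegMode(tesseract::PSM_SINGLE_WORD);

    // 4th argument: bytes per pixel (1 for grayscale); 5th: bytes per line
    ocr->SetImage(inverted.data, inverted.cols, inverted.rows, 1, inverted.step);

    // GetUTF8Text returns a new[]-allocated buffer that the caller must free
    char *text = ocr->GetUTF8Text();
    popupNum = text ? string(text) : string();
    delete[] text;

    // Release the engine's resources
    ocr->End();
    delete ocr;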


    NOTE: Be careful with the 4th parameter of the ocr->SetImage function. It is the number of bytes per pixel.
          After converting to grayscale it's 1, not 3. I forgot about this and was getting 3 strings back. Quite strange.
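
The mismatch described in the note can be caught before calling SetImage with a simple consistency check. This is a minimal sketch, not part of the Tesseract API: layoutFitsBuffer and checkGrayscaleExample are hypothetical names, and the 640x480 size is only an example. It tests whether the declared bytes per pixel and bytes per line are consistent with each other and with the size of the buffer.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: returns true when a buffer of `bufferBytes` bytes can
// hold an image with the declared layout (the same four values that are passed
// to SetImage after the data pointer).
bool layoutFitsBuffer(std::size_t bufferBytes, int width, int height,
                      int bytesPerPixel, int bytesPerLine) {
    if (width <= 0 || height <= 0 || bytesPerPixel <= 0) return false;
    // Each row must be long enough for `width` pixels of the declared size.
    if (bytesPerLine < width * bytesPerPixel) return false;
    // The whole image must fit in the buffer.
    return bufferBytes >= static_cast<std::size_t>(bytesPerLine) *
                              static_cast<std::size_t>(height);
}

// Usage: a 640x480 grayscale buffer holds 640 * 480 bytes.
void checkGrayscaleExample() {
    const std::size_t grayBytes = 640u * 480u;
    assert(layoutFitsBuffer(grayBytes, 640, 480, 1, 640));   // 1 byte per pixel: consistent
    assert(!layoutFitsBuffer(grayBytes, 640, 480, 3, 640));  // 3 bytes per pixel: rows too short
}
```

For an OpenCV Mat, the inputs would come from its total byte size, cols, rows, channels() and step, assuming 8-bit channels.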

Zdenko Podobny

Mar 1, 2022, 12:32:10 PM
to tesser...@googlegroups.com
;-) 

> tesseract screenNum_014.jpg -
Estimating resolution as 903
014

> tesseract -v
tesseract 5.1.0
 leptonica-1.83.0 (Jan 26 2022, 19:15:03) [MSC v.1929 LIB Release x64]
  libgif 5.2.1 : libjpeg 6b (libjpeg-turbo 2.0.91) : libpng 1.6.37 : libtiff 4.3.0 : zlib 1.2.11
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1
 Found OpenMP 2019
 Found libarchive 3.5.1 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.6 libzstd/1.4.9
 Found libcurl/7.75.0 zlib/1.2.11 libssh2/1.10.1_DEV



Zdenko

