my scan of alphanumeric data needs TLC

Stephane Charette

Aug 27, 2019, 5:12:15 AM

to tesseract-ocr

I have a large number of images that contain a single line of alphanumeric data. My scans so far have not been great, and I could use some assistance.

Several variables are turned off, as recommended in the docs:

    // Disable the dictionary word lists, since the text is not natural language.
    key.push_back("load_system_dawg");
    val.push_back("false");
    key.push_back("load_freq_dawg");
    val.push_back("false");

These are set at initialization:

    tess->Init(nullptr, "eng", tesseract::OEM_DEFAULT, nullptr, 0, &key, &val, false);
    tess->SetPageSegMode(tesseract::PageSegMode::PSM_SINGLE_LINE);

Some images are close, such as this one:

...which is interpreted as "SZ2EC 3".

Others, like this one, return a blank string:

And then I have some, like this one, which are so close, but Tesseract removes the spaces between the characters, so this example results in "1201":

I've posted my full .cpp test file and more example images showing the problem on StackOverflow: https://stackoverflow.com/questions/57670769/how-to-get-tesseract-to-recognize-these-alphanumeric-strings

Thanks,

Stéphane

Shree Devi Kumar

Aug 27, 2019, 5:26:08 AM

to tesseract-ocr

If all your images use this thick, bold font, fine-tuning for the Impact font may help with some of the recognition errors.

Lorenzo Bolzani

Aug 27, 2019, 6:17:42 AM

to tesser...@googlegroups.com

Try manually cleaning the images with GIMP: remove the black noise and see if it helps. Also try removing the white border. After each step, run Tesseract again to see whether that was the problem.
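
Once the manual test shows which cleanup helps, the same steps could be done in code. Here is a minimal sketch in plain C++, not tied to any image library. The `GrayImage` struct, the function names, and the thresholds are all hypothetical, and it assumes dark ink on a light background:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical minimal 8-bit grayscale image, stored row by row.
// Assumption: dark ink (value < dark_below) on a light background.
struct GrayImage {
    int width = 0;
    int height = 0;
    std::vector<unsigned char> px;

    std::size_t index(int x, int y) const {
        return static_cast<std::size_t>(y) * width + x;
    }
};

// Paint every 4-connected dark blob smaller than min_area pixels white.
// Returns the number of blobs removed.
int remove_small_blobs(GrayImage& img, int min_area, unsigned char dark_below = 128) {
    assert(img.px.size() == static_cast<std::size_t>(img.width) * img.height);
    std::vector<char> seen(img.px.size(), 0);
    const int dx[4] = {1, -1, 0, 0};
    const int dy[4] = {0, 0, 1, -1};
    int removed = 0;
    for (int y0 = 0; y0 < img.height; ++y0) {
        for (int x0 = 0; x0 < img.width; ++x0) {
            const std::size_t start = img.index(x0, y0);
            if (seen[start] || img.px[start] >= dark_below) continue;
            // Flood fill to collect the pixels of one blob.
            std::vector<std::pair<int, int>> blob;
            std::vector<std::pair<int, int>> stack(1, std::make_pair(x0, y0));
            seen[start] = 1;
            while (!stack.empty()) {
                const std::pair<int, int> p = stack.back();
                stack.pop_back();
                blob.push_back(p);
                for (int i = 0; i < 4; ++i) {
                    const int nx = p.first + dx[i];
                    const int ny = p.second + dy[i];
                    if (nx < 0 || ny < 0 || nx >= img.width || ny >= img.height) continue;
                    const std::size_t idx = img.index(nx, ny);
                    if (seen[idx] || img.px[idx] >= dark_below) continue;
                    seen[idx] = 1;
                    stack.push_back(std::make_pair(nx, ny));
                }
            }
            if (static_cast<int>(blob.size()) < min_area) {
                for (std::size_t i = 0; i < blob.size(); ++i)
                    img.px[img.index(blob[i].first, blob[i].second)] = 255;
                ++removed;
            }
        }
    }
    return removed;
}

// Crop to the bounding box of the dark pixels, keeping up to `margin` pixels
// of background around it. Returns a copy of the input if there is no ink.
GrayImage crop_to_ink(const GrayImage& img, int margin, unsigned char dark_below = 128) {
    assert(img.px.size() == static_cast<std::size_t>(img.width) * img.height);
    int x_min = img.width, y_min = img.height, x_max = -1, y_max = -1;
    for (int y = 0; y < img.height; ++y) {
        for (int x = 0; x < img.width; ++x) {
            if (img.px[img.index(x, y)] >= dark_below) continue;
            x_min = std::min(x_min, x);
            x_max = std::max(x_max, x);
            y_min = std::min(y_min, y);
            y_max = std::max(y_max, y);
        }
    }
    if (x_max < 0) return img;
    x_min = std::max(0, x_min - margin);
    y_min = std::max(0, y_min - margin);
    x_max = std::min(img.width - 1, x_max + margin);
    y_max = std::min(img.height - 1, y_max + margin);
    GrayImage out;
    out.width = x_max - x_min + 1;
    out.height = y_max - y_min + 1;
    out.px.reserve(static_cast<std::size_t>(out.width) * out.height);
    for (int y = y_min; y <= y_max; ++y)
        for (int x = x_min; x <= x_max; ++x)
            out.px.push_back(img.px[img.index(x, y)]);
    return out;
}
```

The `min_area` and `margin` values would need tuning per image set: too large a `min_area` can erase small strokes, and a small margin is kept as an option in case cropping right up to the glyphs turns out to hurt.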

Also try downscaling the images so that the text is 40 to 60 px tall; try different sizes and see what works best. Alternatively, you can play with the DPI settings (though I have never done this). Tesseract does not know how tall your text is or where the lines are, so it cannot tell whether the 0 is a zero or a big dot, or whether the 1 is a one or a quote mark.
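
To make the sizing concrete, here is a small sketch of the arithmetic only. The `Size` struct and the function names are hypothetical, and it assumes you already know the current text height in pixels. The resize itself would be done with whatever image library you use (Leptonica's `pixScale` or OpenCV's `cv::resize`, for example):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type: image dimensions in pixels.
struct Size {
    int width;
    int height;
};

// Integer division rounded to the nearest value, for positive operands.
static long long div_round(long long num, long long den) {
    return (num + den / 2) / den;
}

// Dimensions to resize `image` to so that text currently text_height_px tall
// ends up target_height_px tall. The aspect ratio is preserved.
Size size_for_text_height(Size image, int text_height_px, int target_height_px) {
    assert(image.width > 0 && image.height > 0);
    assert(text_height_px > 0 && target_height_px > 0);
    Size out;
    out.width = static_cast<int>(std::max(
        1LL, div_round(static_cast<long long>(image.width) * target_height_px, text_height_px)));
    out.height = static_cast<int>(std::max(
        1LL, div_round(static_cast<long long>(image.height) * target_height_px, text_height_px)));
    return out;
}

// One candidate size per target text height, so each can be tried in turn.
// The default targets (40, 50, 60 px) just sample the suggested range.
std::vector<Size> candidate_sizes(Size image, int text_height_px,
                                  const std::vector<int>& targets = {40, 50, 60}) {
    std::vector<Size> sizes;
    for (std::size_t i = 0; i < targets.size(); ++i)
        sizes.push_back(size_for_text_height(image, text_height_px, targets[i]));
    return sizes;
}
```

For a 1200x300 image whose text is 200 px tall, this gives 240x60, 300x75 and 360x90 as the sizes to try.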

Also try PSM single block.

Once you have found the problem, fix the image in code before passing it to Tesseract.

Bye

Lorenzo

Shree Devi Kumar

Aug 27, 2019, 6:44:30 AM

to tesseract-ocr

You can try the fine-tuned traineddata from the tutorial at

https://github.com/Shreeshrii/tess4tutorial/tree/master/impact_from_full

Here are the results I get using that traineddata versus eng.traineddata from tessdata_best, in that order for each image:

***** 2v2Xj ****
1 K 45
1 K45

***** 3VtsA ****
308 8
308 8

***** FxcEl ****
1 Ka
1a

***** gwrBt ****
23 B 13
238 13

***** hAJOM ****
1_C 15
1 C15

***** kATPl ****
20°F C 13
Fr C 13

***** Oj222 ****
12 0 1
120 1

***** rOexn ****
1 C 13
1013

***** UBqvX ****
34 E 1
34 E 1

***** unnamed ****
32 EC 9
32EC 9

***** Vwv5G ****
32 EC 5
32EC 5

Timothy Snyder

Aug 27, 2019, 9:11:22 AM

to tesser...@googlegroups.com

Try out the single-line PSM modes (7 and 13). I've had the best luck with 13 on single-line images. Also, look into removing the extra black marks that aren't part of the letters.
