Recognition of "5" instead of "S"

RangerRick

Apr 28, 2019, 1:41:35 AM

to tesseract-ocr

Hi,

I'm new to Tesseract, using the latest version 4 executable on Windows 7.

I'm converting Morse code CW from JPG into text using Tesseract. It works almost right, just missing on the number 5, which is usually misinterpreted as an "S". Here's an example of the issue.

Here's how it's being interpreted:

3AMWA DE FASMX QFSMXQ CQ CQ DE FSMXQ FSMXQ CQ DE FSMXQ ENSMAA I III FSMXQ FSMXQ NHE K »

I have tried adjusting the various command-line parameters, but no joy. I believe the font is Fontcraft Courier DemiBold, but that shouldn't matter. In this case, the image is 96 DPI and 24 pixels tall (total, including border).

I started to try and retrain to optimize for this font, but that looks like a pretty daunting task.

Any guidance would be greatly appreciated.

Rick

RangerRick

Apr 28, 2019, 10:03:49 AM

to tesseract-ocr

Attaching the bitmap image

output.jpg

RangerRick

Apr 28, 2019, 10:49:09 AM

to tesseract-ocr

Ok. Now I have tried the "best" traineddata file (no difference) and removing the alpha layer (no difference). I even created a new, simpler bitmap using the Courier New font (attached), which still fails.

Tesseract just can't distinguish between the number 5 and an S.

On Sunday, April 28, 2019 at 12:41:35 AM UTC-5, RangerRick wrote:

output2.jpg

Zdenko Podobny

Apr 28, 2019, 12:55:59 PM

to tesser...@googlegroups.com

JPEG is not suitable for OCR. Your image has so many compression artifacts that I am surprised the "5" vs "S" confusion is the only problem.

Zdenko

On Sun, Apr 28, 2019 at 16:49, RangerRick <rbr...@gmail.com> wrote:

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/ab572776-22f8-4259-a7b4-ec6615d11bb4%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Shree Devi Kumar

Apr 28, 2019, 2:51:02 PM

to tesser...@googlegroups.com

Fine-tuning on the Courier font, with a training text similar to the image you are recognizing and more samples of 5, will give a better result.

Shree Devi Kumar

Apr 28, 2019, 3:02:50 PM

to tesser...@googlegroups.com

You can test with the finetuned traineddata file from

https://github.com/Shreeshrii/tessdata_shreetest/blob/master/engmorse.traineddata

Download the file (raw file)

Use it with `-l engmorse`

If you have not placed it in the tessdata directory identified by TESSDATA_PREFIX, also provide the path with `--tessdata-dir /path/to/finetuned/traineddata`

ubuntu@tesseract-ocr:~/TEST$ tesseract morse.jpg - -l eng
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 125

3AMWA DE FASMX QFSMXQ CQ CQ DE FS5MXQ FSMXQ CQ DE FSMXQ ENSMAR I III FSMXQ FSMXQ NHE K »
ubuntu@tesseract-ocr:~/TEST$ tesseract morse.jpg - -l engmorse --tessdata-dir ~/tesstutorial
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 125

3AMWA DE FASMX QF5MXQ CQ CQ DE F5MXQ F5MXQ CQ DE F5MXQ ENS5MAA I III F5MXQ F5MXQ NHE K
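For scripted use, the same invocation can be wrapped in a few lines of Python. This is only a sketch: it assumes the `tesseract` executable is on the PATH, and the function names `build_tesseract_command` and `run_ocr` are made up for illustration. The flags are the ones shown in the commands above.

```python
import subprocess


def build_tesseract_command(image_path, lang="engmorse", tessdata_dir=None):
    """Build the argument list for a Tesseract run that writes text to stdout."""
    # "-" as the output base tells Tesseract to print the result to stdout.
    command = ["tesseract", image_path, "-", "-l", lang]
    if tessdata_dir is not None:
        # Only needed when the traineddata file is not in the default tessdata directory.
        command += ["--tessdata-dir", tessdata_dir]
    return command


def run_ocr(image_path, lang="engmorse", tessdata_dir=None):
    """Run Tesseract and return the recognized text (assumes tesseract is installed)."""
    result = subprocess.run(
        build_tesseract_command(image_path, lang, tessdata_dir),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

Calling `run_ocr("morse.jpg", tessdata_dir="/path/to/finetuned/traineddata")` would then correspond to the second command in the transcript.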


Lorenzo Bolzani

Apr 28, 2019, 3:07:20 PM

to tesser...@googlegroups.com

I think the problem is also that the network does not expect a mix of letters and numbers. The text is processed as a continuous stream and not as individual characters. This is good for text but not for codes.

So if you want to fine tune you need to provide similar mixed sequences.
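One way to get such mixed sequences is to generate the training text programmatically. The sketch below is an assumption-laden illustration, not part of any Tesseract tooling: it invents a call-sign shape (one letter, one digit, two or three letters), a few filler words taken from the sample output, and an extra weighting toward the digit 5.

```python
import random
import string

# Filler words seen in the sample text; the choice is illustrative.
FILLERS = ["CQ", "DE", "K"]


def random_callsign(rng, five_weight=3):
    """Return a token shaped like letter + digit + 2-3 letters (assumed shape)."""
    # Repeating "5" makes it more likely than the other digits.
    digits = "0123456789" + "5" * five_weight
    prefix = rng.choice(string.ascii_uppercase)
    digit = rng.choice(digits)
    suffix = "".join(
        rng.choice(string.ascii_uppercase) for _ in range(rng.randint(2, 3))
    )
    return prefix + digit + suffix


def make_training_lines(line_count, tokens_per_line=8, seed=0):
    """Generate lines that mix filler words with call-sign-like tokens."""
    rng = random.Random(seed)  # seeded so the text is reproducible
    lines = []
    for _ in range(line_count):
        tokens = []
        for _ in range(tokens_per_line):
            if rng.random() < 0.3:
                tokens.append(rng.choice(FILLERS))
            else:
                tokens.append(random_callsign(rng))
        lines.append(" ".join(tokens))
    return lines
```

The resulting lines could be written to a text file and used as the training text for fine-tuning.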

Also, if possible, try to use bigger text. Here it is 13 px tall; something between 30 and 50 px should work better. Preprocessing (or generating) the image so that it is high-contrast black on white might also help (not a binary threshold, just a little more contrast).
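To make the two suggestions concrete, here is a dependency-free sketch that works on a grayscale image held as a list of rows of 0..255 values. The function names, the 40 px target, and the gain value are assumptions chosen for illustration; in practice an image library would do the resizing and contrast adjustment.

```python
def upscale_factor(current_height_px, target_height_px=40):
    """Integer factor that brings the text height close to the target."""
    return max(1, round(target_height_px / current_height_px))


def upscale_nearest(pixels, factor):
    """Nearest-neighbour upscaling: repeat every pixel and every row."""
    out = []
    for row in pixels:
        wide = [value for value in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def stretch_contrast(pixels, gain=1.3, midpoint=128):
    """Mild linear contrast boost around the midpoint, clamped to 0..255.

    This is deliberately not a binary threshold: gray levels are kept.
    """
    return [
        [min(255, max(0, round(midpoint + gain * (value - midpoint)))) for value in row]
        for row in pixels
    ]
```

For 13 px text, `upscale_factor(13)` gives 3, which puts the text at 39 px, inside the suggested range.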

If you can choose which font to use try a few different ones.

Of course, if the structure of the codes is regular you can simply replace S with 5.
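As a sketch of that kind of post-processing, the snippet below assumes a hypothetical rule: any standalone five-character token shaped like a letter, an "S", and three letters is a call sign whose digit 5 was misread. That rule fits tokens like FSMXQ in the sample output, but it would also change ordinary five-letter words with "S" in second position, so it only makes sense when the text is known to contain call signs.

```python
import re

# Hypothetical pattern: letter, "S", three letters, as a whole token.
CALLSIGN_MISREAD = re.compile(r"\b([A-Z])S([A-Z]{3})\b")


def fix_misread_fives(text):
    """Replace the second-position "S" with "5" in call-sign-shaped tokens."""
    # \g<1> is used instead of \1 so the following "5" is not read as part
    # of the group number.
    return CALLSIGN_MISREAD.sub(r"\g<1>5\g<2>", text)
```

Tokens that do not fit the assumed shape, such as QFSMXQ or FASMX, are left unchanged.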

Lorenzo

RangerRick

Apr 28, 2019, 5:11:37 PM

to tesseract-ocr

Shree.

That works perfectly! Thank you very much. Don't know where the engmorse training data originated, but it certainly did the trick.

Best,

Rick W5FCX
