Recognition of "5" instead of "S"


RangerRick

Apr 28, 2019, 1:41:35 AM
to tesseract-ocr
Hi,

I'm new to Tesseract and am using the latest version 4 executable on Windows 7.

I'm using Tesseract to convert Morse code (CW) from a JPG into text. It works almost perfectly, except for the number 5, which is usually misread as an "S". Here's an example of the issue.


output.jpg



Here's how it's being interpreted:

                                                                  3AMWA DE FASMX QFSMXQ CQ CQ DE FSMXQ FSMXQ CQ DE FSMXQ ENSMAA I III FSMXQ FSMXQ NHE K »


I have tried adjusting the various command-line parameters, without success. I believe the font is Fontcraft Courier DemiBold, but that shouldn't matter. In this case, the image is 96 DPI and 24 pixels tall (total, including border).

I started to try and retrain to optimize for this font, but that looks like a pretty daunting task.

Any guidance would be greatly appreciated.

Rick


RangerRick

Apr 28, 2019, 10:03:49 AM
to tesseract-ocr
Attaching the bitmap image
output.jpg

RangerRick

Apr 28, 2019, 10:49:09 AM
to tesseract-ocr
OK. I have now tried the "best" traineddata file (no difference) and removing the alpha layer (no difference). I even created a new, simpler bitmap using the Courier New font (attached), which still fails.

Tesseract just can't distinguish between the number 5 and an S.


On Sunday, April 28, 2019 at 12:41:35 AM UTC-5, RangerRick wrote:
output2.jpg

Zdenko Podobny

Apr 28, 2019, 12:55:59 PM
to tesser...@googlegroups.com
JPEG is not suitable for OCR. Your image has so many artifacts that I am surprised the "5" vs "S" confusion is the only problem.

Zdenko


On Sun, Apr 28, 2019 at 16:49, RangerRick <rbr...@gmail.com> wrote:

Shree Devi Kumar

Apr 28, 2019, 2:51:02 PM
to tesser...@googlegroups.com
Fine-tuning on the Courier font, with a training text similar to the image you are recognizing and more samples of "5", should give better results.


Shree Devi Kumar

Apr 28, 2019, 3:02:50 PM
to tesser...@googlegroups.com
You can test with the finetuned traineddata file from

Download the file (raw file).
Use it with `-l engmorse`.
If you have not placed it in the tessdata directory identified by TESSDATA_PREFIX, also provide the path with `--tessdata-dir /path/to/finetuned/traineddata`.

ubuntu@tesseract-ocr:~/TEST$ tesseract morse.jpg - -l eng
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 125


3AMWA DE FASMX QFSMXQ CQ CQ DE FS5MXQ FSMXQ CQ DE FSMXQ ENSMAR I III FSMXQ FSMXQ NHE K »
ubuntu@tesseract-ocr:~/TEST$ tesseract morse.jpg - -l engmorse --tessdata-dir ~/tesstutorial
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 125


3AMWA DE FASMX QF5MXQ CQ CQ DE F5MXQ F5MXQ CQ DE F5MXQ ENS5MAA I III F5MXQ F5MXQ NHE K
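
For anyone who wants to script the two runs above, here is a minimal Python sketch. It only uses the flags already shown in this thread (`-l`, `--tessdata-dir`, and `-` as the output base so the text goes to stdout). The function names `build_tesseract_cmd` and `run_ocr` are made up for illustration, and it assumes `tesseract` is on your PATH.

```python
import os
import subprocess
from typing import List, Optional


def build_tesseract_cmd(image: str, lang: str,
                        tessdata_dir: Optional[str] = None) -> List[str]:
    """Build a tesseract command line that writes recognized text to stdout."""
    # "-" as the output base sends the text to stdout, as in the session above.
    cmd = ["tesseract", image, "-", "-l", lang]
    if tessdata_dir is not None:
        # No shell is involved, so expand a leading "~" ourselves.
        cmd += ["--tessdata-dir", os.path.expanduser(tessdata_dir)]
    return cmd


def run_ocr(image: str, lang: str, tessdata_dir: Optional[str] = None) -> str:
    """Run tesseract and return its stdout (raises if tesseract fails)."""
    result = subprocess.run(build_tesseract_cmd(image, lang, tessdata_dir),
                            capture_output=True, text=True, check=True)
    return result.stdout
```

Calling `run_ocr("morse.jpg", "engmorse", "~/tesstutorial")` would correspond to the second command in the session; comparing it with `run_ocr("morse.jpg", "eng")` makes it easy to diff the stock and fine-tuned models on the same image.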

--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Lorenzo Bolzani

Apr 28, 2019, 3:07:20 PM
to tesser...@googlegroups.com

I think the problem is also that the network does not expect a mix of letters and numbers. The text is processed as a continuous stream and not as individual characters. This is good for text but not for codes.

So if you want to fine tune you need to provide similar mixed sequences.
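
As a rough illustration of what "similar mixed sequences" could look like, here is a small Python sketch that generates lines of call-sign-like tokens with a digit sitting between letters, biased toward "5". The token shape (one letter, one digit, three letters) and the name `make_training_lines` are assumptions made for the example, not anything prescribed by Tesseract; how the text is then fed into the training tools is not covered here.

```python
import random
import string
from typing import List


def make_training_lines(n_lines: int, tokens_per_line: int = 8,
                        seed: int = 0) -> List[str]:
    """Generate lines of letter-digit-letters tokens, with extra 5s."""
    rng = random.Random(seed)  # seeded so the output is reproducible
    letters = string.ascii_uppercase  # includes S, so 5 and S both appear
    lines = []
    for _ in range(n_lines):
        tokens = []
        for _ in range(tokens_per_line):
            prefix = rng.choice(letters)
            # Roughly half the tokens use 5; the rest use any digit.
            digit = "5" if rng.random() < 0.5 else rng.choice(string.digits)
            suffix = "".join(rng.choice(letters) for _ in range(3))
            tokens.append(prefix + digit + suffix)
        lines.append(" ".join(tokens))
    return lines
```

Writing `"\n".join(make_training_lines(500))` to a text file would give a training text in which digits are routinely embedded in letter runs, which is the situation the stock model seems to handle poorly.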

Also, if possible, try to use larger text: here it is 13px, and something between 30 and 50px should work better. Preprocessing (or generating) the image so that it is high-contrast black on white might also help (not a binary threshold, just a little more contrast).
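
To make the arithmetic concrete, here is a small pure-Python sketch of the two adjustments: computing the upscaled image size for a target text height, and applying a mild linear contrast boost to 0-255 grayscale values. The function names and the default values (40px target, factor 1.3) are hypothetical; the actual resampling would be done with whatever image library you already use.

```python
from typing import List, Tuple


def upscale_size(width: int, height: int, text_px: float,
                 target_px: float = 40) -> Tuple[int, int]:
    """Return the image size that makes text_px-tall text target_px tall."""
    scale = target_px / text_px
    return round(width * scale), round(height * scale)


def boost_contrast(pixels: List[int], factor: float = 1.3) -> List[int]:
    """Push each gray value away from mid-gray (128) by factor, clamped to 0-255.

    This is a gentle stretch, not a binary threshold: mid-tones stay mid-tones.
    """
    return [max(0, min(255, round(128 + (p - 128) * factor))) for p in pixels]
```

For example, with 13px text and a 39px target the scale factor is 3, so a 24-pixel-tall image would become 72 pixels tall.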

If you can choose which font to use try a few different ones.

Of course, if the structure of the codes is regular you can simply replace S with 5.
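
A minimal sketch of that kind of replacement in Python, assuming (purely for illustration) that the codes of interest are five-character call signs shaped like F5MXQ: one letter, one digit, three letters. Under that assumption, any five-letter token whose second character is "S" is treated as a misread "5". The name `fix_misread_fives` is made up, and the rule would also rewrite a genuine five-letter word with "S" in second position, so it only makes sense where the structure really is that regular.

```python
import re

# Hypothetical rule: whole token = letter, "S", three letters (e.g. FSMXQ).
MISREAD_CALLSIGN = re.compile(r"\b([A-Z])S([A-Z]{3})\b")


def fix_misread_fives(text: str) -> str:
    """Replace the S in tokens shaped like a misread call sign with 5."""
    return MISREAD_CALLSIGN.sub(r"\g<1>5\g<2>", text)
```

Applied to `"CQ DE FSMXQ FSMXQ K"` this gives `"CQ DE F5MXQ F5MXQ K"`, while tokens of other shapes such as `FASMX` or `QFSMXQ` are left alone.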
 

Lorenzo

RangerRick

Apr 28, 2019, 5:11:37 PM
to tesseract-ocr
Shree.

That works perfectly!  Thank you very much. Don't know where the engmorse training data originated, but it certainly did the trick.

Best,
Rick W5FCX