Re: [tesseract-ocr] '33' recognized correctly, '3' not recognized at all...

Shree Devi Kumar

Aug 31, 2019, 1:02:20 PM
to tesseract-ocr
ubuntu@tesseract-ocr:~/TEST$ tesseract twonumbers.png - --psm 6 --tessdata-dir ~/tessdata --oem 1
2 127

a 15

7 56

7 58

9 58

19 65
24 91
3375
ubuntu@tesseract-ocr:~/TEST$ tesseract twonumbers.png - --psm 6 --tessdata-dir ~/tessdata_best --oem 1
2 127

a 15

7 56

7 58

9 58

19 65
24 91
3375
ubuntu@tesseract-ocr:~/TEST$ tesseract twonumbers.png - --psm 6 --tessdata-dir ~/tessdata_fast --oem 1
2 127

4 15

7 56

7 58

9 58

19 65
24 «(91
33 «75

On Sat, Aug 31, 2019 at 9:54 PM Jack <otak...@gmail.com> wrote:
I have a weird niche project here, essentially I have about 4,000 images, each with 2 numbers between 0 and 127.
I've tweaked the images in a million different ways and I still can't get tesseract to recognize single-digit numbers: with the exception of 2, no 1-digit number is recognized.

Also, for some reason, running tesseract directly gives me way worse results than converting to PDF first and using ocrmypdf, which apparently uses tesseract itself. I don't understand why.

The font is very straightforward, I think, so I'm not sure training would help, but I'm open to the idea if needed.

Here are the sample images I'm using for testing, before and after I modified them:
Okay some of them failed to upload but that's the gist.

Thanks,
Jack

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/7be5ed42-df44-4530-b7a2-0d0fa340918e%40googlegroups.com.


--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
twonumbers.png

Jack

Aug 31, 2019, 10:34:32 PM
to tesseract-ocr
Thank you for replying, that was very helpful.
I've now tried the tessdata_best and tessdata_fast trained data from the tesseract GitHub, which drastically improved my results, but they are still not as accurate as yours.
Here are my outputs:

tesseract listpng output2 --psm 6 --tessdata-dir ~/tessdata/tessdata_best --oem 1
3 70
2 127
4 15
7 96
7 98
9 B58
9 65
19 695
29 91
33 75

tesseract listpng output_fast --psm 6 --tessdata-dir ~/tessdata/tessdata_fast --oem 1
3 70

2 127
4 15
7 56
7 58
9 #58
9 #65
19 ~=665
24 #691
33 #675
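
Since every image is supposed to contain exactly two numbers between 0 and 127, rows like the ones above can be screened automatically. This is only a minimal sketch under that assumption; the names check_line and split_output are made up for illustration and are not part of tesseract.

```python
import re

# Assumption from the original post: each row is two integers in 0..127,
# separated by a single space.
PAIR = re.compile(r"^([0-9]{1,3}) ([0-9]{1,3})$")


def check_line(line):
    """Return (a, b) if the line is two in-range integers, else None."""
    match = PAIR.match(line.strip())
    if match is None:
        return None
    a, b = int(match.group(1)), int(match.group(2))
    if a > 127 or b > 127:
        return None
    return a, b


def split_output(text):
    """Split raw OCR text into accepted pairs and rows needing manual review."""
    good, suspect = [], []
    for line in text.splitlines():
        if not line.strip():
            continue  # tesseract prints blank lines between some rows
        pair = check_line(line)
        if pair is None:
            suspect.append(line)
        else:
            good.append(pair)
    return good, suspect
```

Run on the tessdata_fast output above, this accepts the first five rows and flags the five rows containing stray symbols. It cannot catch a misread that is still a valid number, such as "7 96" in place of "7 56", so it only narrows down what has to be checked by hand.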

Jack

Aug 31, 2019, 11:04:46 PM
to tesseract-ocr
Thank you very much for your reply, that was very helpful. I think that should do the trick.

Shree Devi Kumar

Sep 1, 2019, 12:11:47 AM
to tesseract-ocr
I am using the latest code from the master branch.

I would expect the same result with the same image and the same traineddata files.


Shree Devi Kumar

Sep 1, 2019, 12:13:55 AM
to tesseract-ocr
Well, I just took a screenshot of your images from the link, since I could not figure out how to get the individual images. The only doctoring was to save it at 300 dpi in IrfanView.
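
For a batch of images, one thing that might be worth trying is passing the resolution on the command line instead of resaving each file. The sketch below assumes a tesseract 4.x or later binary on the PATH, where a --dpi option is available; whether it has the same effect as resaving at 300 dpi in IrfanView has not been tested here. The function names build_cmd and ocr_folder are hypothetical helpers, and the other options are copied from the commands earlier in this thread.

```python
import subprocess
from pathlib import Path


def build_cmd(image, tessdata_dir, dpi=300, psm=6, oem=1):
    """Build the argument list for one tesseract run that prints to stdout."""
    return [
        "tesseract", str(image), "-",
        "--psm", str(psm),
        "--tessdata-dir", str(tessdata_dir),
        "--oem", str(oem),
        "--dpi", str(dpi),  # assumption: stands in for resaving at 300 dpi
    ]


def ocr_folder(folder, tessdata_dir):
    """Run tesseract on every PNG in folder; return {file name: raw text}."""
    results = {}
    for image in sorted(Path(folder).glob("*.png")):
        done = subprocess.run(
            build_cmd(image, tessdata_dir),
            capture_output=True, text=True, check=True,
        )
        results[image.name] = done.stdout
    return results
```

Note that subprocess does not expand "~", so a path such as ~/tessdata_best should go through os.path.expanduser before being passed as tessdata_dir.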

On Sun, 1 Sep 2019, 08:27 Jack, <otak...@gmail.com> wrote:
Ah, now I see it has something to do with the way you doctored the images: I get the same output as you did when I run your picture through. So what's the secret?
