Why does Tesseract 4.0 miss every word? What am I missing here?


cohen...@gmail.com

Jun 28, 2018, 3:34:37 PM
to tesseract-ocr
I'm quite new to Tesseract and would like to use it for OCR in a project.
I found a tutorial on the web with photos, so I ran Tesseract (4.0.0-beta.2) on the tutorial photo
and noticed it successfully retrieved every single word. Wow, IMPRESSIVE!!

So I took a crystal-clear photo (not blurry) with my smartphone, hoping it would work for me too,
but NOTHING. It failed miserably (every word was missed :/ bummer).

I read this too: https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality

I tried to figure out what I'm doing wrong by comparing the EXIF metadata of each photo,
but apparently the metadata of the photo from the web tutorial has been stripped :/

Can someone explain what I am missing here?
I'm attaching the two photos.


Thank you in advance :)


myShot.jpg
tutorialShot.jpg

Shree Devi Kumar

Jun 28, 2018, 3:38:20 PM
to tesser...@googlegroups.com
Rotate your shot to correct orientation and try.



--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

cohen...@gmail.com

Jun 28, 2018, 4:42:55 PM
to tesseract-ocr
Thank you Shree!! :)

OK, after rotating it,
Tesseract still didn't succeed in retrieving the text.

BUT I kept experimenting with the convert tool (part of ImageMagick 6.8.9) and resized the photo twice,
and eventually the words got retrieved!! Hooray! :)

So I was wondering:
what is good practice when taking shots with a smartphone?

It was mentioned:
Tesseract works best on images which have a DPI of at least 300 dpi, so it may be beneficial to resize images.
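
To make the 300 dpi guideline concrete, here is a minimal arithmetic sketch in Python. It assumes you know roughly how wide the photographed page is in inches, which a phone photo does not record; the function names and the example numbers are hypothetical and not from this thread.

```python
def effective_dpi(pixels_across, inches_across):
    """Pixels per inch of the captured page, from its pixel and physical width."""
    return pixels_across / inches_across


def scale_percent_for(current_dpi, target_dpi=300):
    """Resize percentage needed to reach target_dpi (never below 100)."""
    return max(100, round(100 * target_dpi / current_dpi))
```

For example, a page 8.5 inches wide that spans 1275 pixels in the photo works out to 150 dpi, so it would need to be enlarged to about 200% to reach the 300 dpi mark.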

Moreover, it was mentioned in the git repo:
Tesseract does various image processing operations internally (using the Leptonica library) before doing the actual OCR.

By the way, is there a config parameter that makes Tesseract resize the image itself when the input is problematic?
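
For anyone who wants to script the rotate-then-resize workaround described in this message, here is a rough sketch in Python. It assumes the ImageMagick `convert` and `tesseract` binaries are on the PATH; the function names, file names, rotation angle and scale are hypothetical placeholders, not the values that were actually used here.

```python
import shutil
import subprocess


def build_convert_cmd(src, dst, rotate_deg=0, scale_percent=100):
    """Build an ImageMagick `convert` command that rotates, then resizes."""
    cmd = ["convert", src]
    if rotate_deg % 360 != 0:
        cmd += ["-rotate", str(rotate_deg)]
    if scale_percent != 100:
        cmd += ["-resize", f"{scale_percent}%"]
    cmd.append(dst)
    return cmd


def build_tesseract_cmd(image, output_base, lang="eng"):
    """Build a `tesseract` command; the text is written to <output_base>.txt."""
    return ["tesseract", image, output_base, "-l", lang]


def run_pipeline(src, dst, output_base, rotate_deg=0, scale_percent=100):
    """Preprocess the photo with ImageMagick, then run Tesseract on the result."""
    for tool in ("convert", "tesseract"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")
    subprocess.run(build_convert_cmd(src, dst, rotate_deg, scale_percent), check=True)
    subprocess.run(build_tesseract_cmd(dst, output_base), check=True)
```

A call such as `run_pipeline("myShot.jpg", "myShot_fixed.png", "myShot", rotate_deg=90, scale_percent=200)` would rotate and enlarge the photo before OCR. The right angle and scale depend on the photo, so treat those numbers as starting points to experiment with.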

Dattatraya Tembare

Jun 29, 2018, 12:53:49 PM
to tesseract-ocr
You could do the image editing with ImageMagick (command line or Java API).

Martin Jenniges

Jun 29, 2018, 1:37:47 PM
to tesser...@googlegroups.com
Hello,

when I open the TXT file that Tesseract created from the Windows command prompt
in LibreOffice Writer, the German special characters (ü, ö, ä,
etc.) are wrong.

As a workaround, I open the TXT file in Notepad++ and copy and paste the
text into Writer.

Is there anything I can do so that LibreOffice Writer opens the TXT file with the
correct characters?

Thank you for your answers!

Kind regards

Martin Jenniges

Zdenko Podobny

Jun 29, 2018, 1:45:49 PM
to tesser...@googlegroups.com

On Fri, Jun 29, 2018 at 19:37, Martin Jenniges <martinj...@skynet.be> wrote:

Martin Jenniges

Jun 30, 2018, 3:11:00 AM
to tesser...@googlegroups.com
Hello,

thank you for your answer.

I found the solution in LibreOffice: open the file via File > Open using the text filter that offers a text encoding choice, then choose UTF-8.

Kind regards
Martin
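
For later readers, a small Python sketch of what goes wrong here. Tesseract writes its text output as UTF-8, and the garbled umlauts are what you get when those bytes are decoded with a different encoding. The choice of cp1252 as the wrong encoding is an assumption for illustration, and both function names are hypothetical.

```python
def misread(text, wrong_encoding="cp1252"):
    """Show how UTF-8 text looks when it is decoded with the wrong encoding."""
    return text.encode("utf-8").decode(wrong_encoding)


def resave_with_bom(src_path, dst_path):
    """Rewrite a UTF-8 text file so that it starts with a byte order mark."""
    # "utf-8-sig" drops an existing BOM on read and writes exactly one on save.
    with open(src_path, "r", encoding="utf-8-sig", newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8-sig", newline="") as dst:
        dst.write(text)
```

`misread("üöä")` returns `"Ã¼Ã¶Ã¤"`, which is the typical look of UTF-8 read as a Western European code page. Some programs use a leading BOM as a hint that a file is UTF-8; whether that alone makes Writer pick the right encoding is not verified here, so choosing UTF-8 in the open dialog as described above remains the confirmed fix.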