Why does Tesseract 4.0 miss every word? What am I missing here?

cohen...@gmail.com

Jun 28, 2018, 3:34:37 PM

to tesseract-ocr

I'm quite new to Tesseract and would like to use it for OCR in a project.
I found a tutorial on the web with photos, ran Tesseract (4.0.0-beta.2) on one of them,
and noticed it successfully retrieved every single word. Wow, IMPRESSIVE!!

So I took a crystal-clear photo (not blurry) with my smartphone and hoped it would work for me too.
But NOTHING, it failed miserably (every word was missed :/ bummer).

I read this too: https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality

I tried to figure out what I'm doing wrong by comparing the EXIF metadata of the two photos,
but apparently the metadata of the photo from the web tutorial has been stripped :/

Can someone explain what I am missing here?
I'm attaching the two photos.

Thank you in advance :)

myShot.jpg

tutorialShot.jpg

Shree Devi Kumar

Jun 28, 2018, 3:38:20 PM

to tesser...@googlegroups.com

Rotate your shot to the correct orientation and try again.
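
For readers following along, here is a minimal sketch of one way to do that rotation, assuming ImageMagick's `convert` tool is installed. The helper name `build_rotate_command`, the output file name, and the 90-degree angle are illustrative assumptions, not something stated in this thread; only the `-auto-orient` and `-rotate` options come from ImageMagick itself.

```python
def build_rotate_command(src, dst, degrees=None):
    """Return an ImageMagick argument list that fixes the orientation of src.

    With degrees=None the EXIF orientation tag is applied (-auto-orient);
    otherwise the image is rotated by the given angle (-rotate).
    The function name and defaults are hypothetical, for illustration only.
    """
    cmd = ["convert", src]
    if degrees is None:
        cmd.append("-auto-orient")
    else:
        cmd += ["-rotate", str(degrees)]
    cmd.append(dst)
    return cmd


if __name__ == "__main__":
    # 90 degrees is an assumed angle; use whatever makes the text upright.
    cmd = build_rotate_command("myShot.jpg", "myShot_rotated.jpg", degrees=90)
    print(" ".join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

The sketch only builds and prints the command, so it is safe to run even when ImageMagick or the photo is not present.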


--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

cohen...@gmail.com

Jun 28, 2018, 4:42:55 PM

to tesseract-ocr

Thank you Shree!! :)

OK, after rotating it,
Tesseract still did not manage to retrieve the text.

BUT I kept experimenting with the convert tool (part of ImageMagick 6.8.9) and resized the photo twice;
eventually the words got retrieved!! Hooray! :)

So I was wondering:
what is good practice when taking shots with a smartphone?

The wiki page mentions:

Tesseract works best on images which have a DPI of at least 300 dpi, so it may be beneficial to resize images.

Moreover, it is mentioned in the git repo:

Tesseract does various image processing operations internally (using the Leptonica library) before doing the actual OCR.

BTW, is there a config parameter that makes Tesseract resize the image itself when the input is problematic?
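
As a rough illustration of the 300 dpi guideline quoted above, the sketch below computes a resize percentage and builds the matching ImageMagick `convert` command. It assumes you can estimate the effective resolution of your photo yourself; the helper names `resize_percent` and `build_resize_command` and the 150 dpi figure are hypothetical and do not come from this thread. Only the `-resize` option is ImageMagick's own.

```python
def resize_percent(current_dpi, target_dpi=300):
    """Percentage by which to scale an image so it reaches target_dpi.

    current_dpi is an estimate supplied by the caller (an assumption here);
    300 is the minimum recommended in the quoted guideline.
    """
    if current_dpi <= 0:
        raise ValueError("current_dpi must be positive")
    return round(100 * target_dpi / current_dpi)


def build_resize_command(src, dst, percent):
    """Return an ImageMagick argument list that scales src by percent."""
    return ["convert", src, "-resize", f"{percent}%", dst]


if __name__ == "__main__":
    # 150 dpi is an assumed estimate, chosen only to show the arithmetic.
    percent = resize_percent(150)
    print(" ".join(build_resize_command("myShot.jpg", "myShot_resized.png", percent)))
```

The same function also yields a value below 100 when the estimate is above the target, so it covers shrinking an oversized photo as well as enlarging a small one.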

Dattatraya Tembare

Jun 29, 2018, 12:53:49 PM

to tesseract-ocr

You could do the image editing with ImageMagick (command line or Java API).

Martin Jenniges

Jun 29, 2018, 1:37:47 PM

to tesser...@googlegroups.com

Hello,

when I open the TXT file that Tesseract created from the Windows
command prompt in LibreOffice Writer, the German special characters
(ü, ö, ä, etc.) are wrong.

As a workaround, I open the txt file with Notepad++ and copy and paste
the text into Writer.

Can I do anything so that LibreOffice Writer opens the txt file with the
correct characters?

Thank you for your answers!

Kind regards

Martin Jenniges

Zdenko Podobny

Jun 29, 2018, 1:45:49 PM

to tesser...@googlegroups.com

This is not a Tesseract problem:

https://ask.libreoffice.org/en/question/97993/why-doesnt-lo-writer-open-and-save-text-documents-encoded-in-utf-8-without-bom-any-plans-to-fix-this-soon/

Tesseract output is UTF-8 encoded.

Zdenko
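
One possible workaround on the file side, sketched here under the assumption (suggested by the linked question) that Writer recognizes UTF-8 when the file starts with a byte-order mark: copy the Tesseract output and prepend a BOM. The function name `add_utf8_bom` and the file names are hypothetical.

```python
import codecs
from pathlib import Path


def add_utf8_bom(src, dst):
    """Copy a UTF-8 text file to dst, prepending a BOM if it has none."""
    data = Path(src).read_bytes()
    if not data.startswith(codecs.BOM_UTF8):
        data = codecs.BOM_UTF8 + data
    Path(dst).write_bytes(data)


if __name__ == "__main__":
    import os
    import tempfile

    # Demo on a temporary file standing in for Tesseract's text output.
    folder = tempfile.mkdtemp()
    src = os.path.join(folder, "output.txt")
    dst = os.path.join(folder, "output_bom.txt")
    Path(src).write_text("üöä", encoding="utf-8")
    add_utf8_bom(src, dst)
    print(Path(dst).read_bytes().startswith(codecs.BOM_UTF8))
```

The text itself is not re-encoded; only the three BOM bytes are added, and a file that already has them is copied unchanged.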

On Fri, Jun 29, 2018 at 19:37, Martin Jenniges <martinj...@skynet.be> wrote:

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.

To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/27334e1f-beae-3a97-4bde-02bc45d18c0e%40skynet.be.

Martin Jenniges

Jun 30, 2018, 3:11:00 AM

to tesser...@googlegroups.com

Hello,

Thank you for your answer.

I found the solution in LibreOffice: in the File > Open dialog, choose the text file type that lets you pick the encoding, then choose UTF-8.

Kind regards
Martin
