Works perfectly...except skips several lines

louis...@gmail.com

Nov 13, 2016, 1:55:18 PM
to tesseract-ocr
Tesseract is reading every line of my input image perfectly, except that it skips a large chunk of the text.

My image (image_to_process.jpg) consists of a few words on the first line, then an empty line, then a larger paragraph. Tesseract is correctly outputting that first line, then skipping most of the paragraph, before correctly reading the last few lines of the paragraph. Specifically, from the line starting "Some of the resistance..." to the line starting "homes, or their ranks..." (the first 7 lines of that paragraph) there is just no output.

I thought maybe this was a problem with the image pre-processing, so I tried tessedit_write_images. The resulting tessinput.tif is attached, but I didn't see anything noteworthy. And I couldn't find anything relevant in this message board or on Stack Overflow.

Any ideas?

Thanks,
Louis
---

Input and output:

>>>> tesseract ~/Downloads/image_to_process.jpg stdout -c tessedit_write_images=true
Warning in pixReadMemJpeg: work-around: writing to a temp file
V _-_ V--v IIIV'J V r] ''''' U U' a

city punched by money. 
no replacements among those who have to work far more
than what we used to call full-time at high-end corporate
jobs to pay for housing in this gilded cage. And those
peoplemay have never known why bodies in places
matter, or how they mattered in this place.


image_to_process.jpg
tessinput.tif

Andrew J Freyer

Dec 2, 2016, 3:44:06 PM
to tesseract-ocr
I can confirm I am experiencing the same issue described above. Entire lines in (what should be) very readable images are skipped consistently. 

S

Dec 2, 2016, 5:42:50 PM
to tesseract-ocr
Just a guess, but it looks like the baseline / text angle isn't consistent on those omitted lines. E.g. in "Some of the resistance during," the bottom of the S is noticeably higher than the bottom of the d, but by the last three words, there's no noticeable slope. By the "no replacements" line, things have evened out a bit, and the final line seems quite flat.

I've attached an image overlaying perfectly horizontal blue lines over the text to better show what I'm seeing.

Not sure if this is the actual cause, but this is what jumped out at me when looking at the image.
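
One way to put a number on the slope described above is to mark a few points along the bottom of a text line and fit a straight line through them. The sketch below does that with a plain least-squares fit. The coordinates are hypothetical, not measured from the attached image, and the thread does not establish whether a slope of this size is what makes Tesseract drop lines.

```python
import math


def baseline_angle_degrees(points):
    """Return the angle (in degrees) of the least-squares line through
    (x, y) baseline points given in image pixel coordinates.

    Image y grows downward, so a positive angle means the baseline
    drops toward the right of the page.
    """
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("points must not all share the same x")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return math.degrees(math.atan(sxy / sxx))


# Hypothetical baseline samples (assumptions, not real measurements):
# one line whose baseline drops 18 px over 1500 px, and one flat line.
sloped_line = [(100, 400), (600, 406), (1100, 412), (1600, 418)]
flat_line = [(100, 900), (600, 900), (1100, 900), (1600, 900)]

print("sloped line angle:", baseline_angle_degrees(sloped_line))
print("flat line angle:", baseline_angle_degrees(flat_line))
```

Comparing the angle for a skipped line against the angle for a line that was read correctly would show whether the two groups really differ.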
tessinput_tif-with-horizontal-reference.png

Art Rhyno.

Dec 2, 2016, 8:04:13 PM
to tesser...@googlegroups.com

The source image is really large; maybe try downsizing it to 800x450 or so.
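
A minimal sketch of that downsizing step, assuming Python with the Pillow library installed. The output filename and the helper names are made up for illustration. The thread does not state the actual image dimensions, and shrinking too far can make the text too small to recognize, so the 800x450 target is only a starting point.

```python
def fit_within(width, height, max_width=800, max_height=450):
    """Scale (width, height) down to fit inside the given box while
    keeping the aspect ratio. Images already small enough are unchanged."""
    scale = min(max_width / width, max_height / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


def downscale_image(src_path, dst_path, max_width=800, max_height=450):
    """Write a downscaled copy of src_path to dst_path and return its size.

    Pillow is imported here so that fit_within() can be used without it.
    """
    from PIL import Image  # assumption: Pillow is installed

    with Image.open(src_path) as img:
        new_size = fit_within(img.width, img.height, max_width, max_height)
        img.resize(new_size, Image.LANCZOS).save(dst_path)
    return new_size


# Example call (hypothetical output filename):
# downscale_image("image_to_process.jpg", "image_to_process_small.png")
```

The smaller file can then be passed to the same tesseract command as before to see whether the skipped lines come back.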

 

art

