OCR Newspaper Obit No Spaces Between Words From One Line To The Next

Ken Oates

Mar 10, 2016, 2:26:24 AM

to tesseract-ocr

When OCR'ing a newspaper clipping, the text produced treats each line as discrete rather than joining the lines into a paragraph. The word at the end of one line is attached directly to the first word of the next line, with no space between them, and hyphenated words keep their hyphens. Am I missing something? The same image uploaded and processed in Google Docs works perfectly. I have attached the source .jpg and the resulting text.

Thanks. Ken

Document_07777_2.jpg

Document_07777_2.txt

Tom Morris

Mar 10, 2016, 7:35:04 PM

to tesseract-ocr

On Thursday, March 10, 2016 at 2:26:24 AM UTC-5, Ken Oates wrote:

When OCR'ing a newspaper clipping, the text produced treats each line as discrete rather than joining the lines into a paragraph. The word at the end of one line is attached directly to the first word of the next line, with no space between them, and hyphenated words keep their hyphens. Am I missing something? The same image uploaded and processed in Google Docs works perfectly. I have attached the source .jpg and the resulting text.

Tesseract returns the text broken into blocks and lines, as it appears on the page. If you want to join the lines within a block, you'll need to concatenate them yourself.
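
As a rough illustration of joining the lines yourself, here is a minimal Python sketch that post-processes the plain-text output. It is not part of Tesseract: the function name join_block_lines is made up for this example, and it assumes that blocks are separated by blank lines in the .txt file and that a hyphen at the end of a line always marks a word split across lines. That second assumption will wrongly merge genuine compounds such as "well-" / "known", so treat the result as a starting point.

```python
import re


def join_block_lines(ocr_text: str) -> str:
    """Join the lines of each text block into a single paragraph.

    Hypothetical helper, not a Tesseract API. Assumes blocks are
    separated by one or more blank lines in the OCR output.
    """
    blocks = re.split(r"\n\s*\n", ocr_text.strip())
    paragraphs = []
    for block in blocks:
        # Drop empty lines and surrounding whitespace within the block.
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        paragraph = ""
        for line in lines:
            if paragraph.endswith("-"):
                # Assumption: a trailing hyphen is a line-break hyphen,
                # so remove it and glue the two word halves together.
                paragraph = paragraph[:-1] + line
            elif paragraph:
                # Ordinary line break: join with a single space.
                paragraph += " " + line
            else:
                paragraph = line
        paragraphs.append(paragraph)
    # Keep one blank line between paragraphs.
    return "\n\n".join(paragraphs)
```

For example, join_block_lines("He was a mem-\nber of the\nchurch.") returns "He was a member of the church." To run it on the output file, read the .txt file into a string, pass it to the function, and write the result back out.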