OCR Newspaper Obit No Spaces Between Words From One Line To The Next

Ken Oates

Mar 10, 2016, 2:26:24 AM
to tesseract-ocr
When OCR'ing a newspaper clip, the text produced treats each line as discrete rather than as part of a paragraph. The word at the end of a line is attached directly to the first word on the next line, and hyphenated words keep their hyphens. Am I missing something? The same image uploaded and processed in Google Docs works perfectly. I have attached the source .jpg and the resulting text.

Thanks.  Ken


Document_07777_2.jpg
Document_07777_2.txt

Tom Morris

Mar 10, 2016, 7:35:04 PM
to tesseract-ocr
On Thursday, March 10, 2016 at 2:26:24 AM UTC-5, Ken Oates wrote:
When OCR'ing a newspaper clip, the text produced treats each line as discrete rather than as part of a paragraph. The word at the end of a line is attached directly to the first word on the next line, and hyphenated words keep their hyphens. Am I missing something? The same image uploaded and processed in Google Docs works perfectly. I have attached the source .jpg and the resulting text.

Tesseract returns the text broken into blocks and lines, as it appears on the page. If you want to join the lines within a block, you'll need to concatenate them yourself.
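
A minimal sketch of that concatenation in Python, in case it helps. It assumes the text output separates blocks with blank lines, and that a line ending in a hyphen is a word split across the line break. Neither assumption is guaranteed for every image, and genuine compounds such as "well-known" that happen to break at the hyphen will lose it. The function name `join_lines` and the sample text are made up for illustration; the sample is not taken from the attached file.

```python
def join_lines(ocr_text: str) -> str:
    """Merge the lines of each blank-line-separated block into one paragraph.

    Assumption: a trailing hyphen marks a word split across two lines,
    so the hyphen is dropped and the halves are joined without a space.
    """
    paragraphs = []
    current = ""
    for raw in ocr_text.splitlines():
        line = raw.strip()
        if not line:
            # A blank line closes the current block.
            if current:
                paragraphs.append(current)
                current = ""
            continue
        if not current:
            current = line
        elif current.endswith("-"):
            # Rejoin a word hyphenated at the line break.
            current = current[:-1] + line
        else:
            # Ordinary line break: separate the words with one space.
            current = current + " " + line
    if current:
        paragraphs.append(current)
    return "\n\n".join(paragraphs)


# Made-up sample, not the poster's actual output.
sample = (
    "He was a life-\n"
    "long resident of\n"
    "the county.\n"
    "\n"
    "Services will be\n"
    "held Friday.\n"
)
print(join_lines(sample))
```

To use it on a saved result, read the .txt file into a string and pass it to `join_lines`. If the hyphen rule is too aggressive for your material, remove the `endswith("-")` branch and handle hyphens by hand.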

tom