Odd behavior when trying to force a box to split

Dan Vanderkam

Dec 30, 2014, 7:48:20 PM

to tesser...@googlegroups.com

More context here. I'm trying to get Tesseract to split some of its detected boxes in half or thirds.

My approach has been to draw white vertical lines through the joined letters, so from before:

to after:

(http://i.imgur.com/TPcCsi0.png)

If you can't see the lines, here they are in red:

(http://i.imgur.com/MjSa0FS.png)
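For anyone who wants to reproduce the idea without the images, here is a minimal sketch of the line-drawing step on a toy grayscale raster. The function name and the list-of-lists representation are made up for illustration; this is not the code used to produce the images above. The line is limited to a row range so it only cuts through one line of text.

```python
def draw_white_vline(pixels, x, y0, y1, white=255):
    """Return a copy of a grayscale raster with column x set to white
    for rows y0..y1 (inclusive). The input raster is left unchanged."""
    out = [row[:] for row in pixels]
    # Clamp the row range so an oversized span cannot index out of bounds.
    for y in range(max(y0, 0), min(y1, len(out) - 1) + 1):
        out[y][x] = white
    return out

# A 4x6 toy raster: 0 = ink, 255 = paper. Rows 1-2 hold one wide ink blob.
page = [
    [255, 255, 255, 255, 255, 255],
    [255,   0,   0,   0,   0, 255],
    [255,   0,   0,   0,   0, 255],
    [255, 255, 255, 255, 255, 255],
]
split = draw_white_vline(page, x=3, y0=1, y1=2)
```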

I would have expected that drawing the white lines would split these boxes apart. It does that, but it also has a side effect: it joins the "9" on the first line with the "s" below it on the next line:

Even if I draw a white line below the "9" and the "0", this still happens. As you might expect, these tall letters wreak havoc on the resulting OCR'd text.

I'm baffled why this is happening. Based on this SO answer, my understanding was that Tesseract looked at connected components to find boxes, so I would have expected the white lines to force apart two components.
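To make that expectation concrete, here is a small sketch of plain connected-component labeling on a toy bitmap. It only illustrates the mental model described above and is an assumption about how box finding could work, not Tesseract's actual code, which also groups blobs into lines and words afterwards. The function name and the '#'/'.' grid format are invented for this example.

```python
from collections import deque

def component_boxes(grid):
    """Find 8-connected ink regions and return their bounding boxes.

    grid is a list of equal-length strings where '#' is ink and '.' is paper.
    Each box is (left, top, right, bottom), inclusive, in scan order.
    """
    height, width = len(grid), len(grid[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if grid[y][x] != '#' or seen[y][x]:
                continue
            # Breadth-first flood fill from this unvisited ink pixel.
            left = right = x
            top = bottom = y
            queue = deque([(x, y)])
            seen[y][x] = True
            while queue:
                cx, cy = queue.popleft()
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = cx + dx, cy + dy
                        if (0 <= nx < width and 0 <= ny < height
                                and grid[ny][nx] == '#' and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            boxes.append((left, top, right, bottom))
    return boxes

joined = [
    "........",
    ".######.",
    ".######.",
    "........",
]
# Blank out column 4 to simulate a white vertical line through the joined letters.
split = [row[:4] + "." + row[5:] for row in joined]

print(component_boxes(joined))  # [(1, 1, 6, 2)]
print(component_boxes(split))   # [(1, 1, 3, 2), (5, 1, 6, 2)]
```

Under this model the white column turns one box into two and cannot merge anything, which is why the joined "9" and "s" are surprising.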

Is it possible to give Tesseract an explicit list of boxes? If not, is there a more effective way to force apart two letters than what I'm doing?

Thanks!

- Dan

ShreeDevi Kumar

Dec 30, 2014, 11:10:05 PM

to tesser...@googlegroups.com

What page segmentation mode are you using?

https://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseract.1.html

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


Dan Vanderkam

Jan 1, 2015, 4:33:51 PM

to tesser...@googlegroups.com

I'm not specifying psm explicitly, so it must be the default, 3 (fully automatic page segmentation, but no OSD).
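For reference, a rough sketch of how the mode could be set explicitly from a script. It only builds the argument list and does not run anything. The file names are placeholders, and the flag spelling is an assumption that depends on the installed version: Tesseract 3.x takes -psm, while 4.x and later take --psm.

```python
def tesseract_command(image_path, output_base, psm=3, psm_flag="-psm"):
    """Build the argument list for a Tesseract CLI run with an explicit
    page segmentation mode. Pass psm_flag="--psm" for Tesseract 4 or later."""
    if not isinstance(psm, int) or psm < 0:
        raise ValueError("psm must be a non-negative integer")
    return ["tesseract", image_path, output_base, psm_flag, str(psm)]

# Hypothetical file names; mode 6 treats the page as a single uniform block of text.
cmd = tesseract_command("caption.png", "caption", psm=6)
print(" ".join(cmd))  # tesseract caption.png caption -psm 6
```

The list can be handed to subprocess.run(cmd, check=True) once the paths point at real files.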

ShreeDevi Kumar

Jan 1, 2015, 11:55:12 PM

to tesser...@googlegroups.com

I think you need to deskew/dewarp the lines, increase brightness, get the images at 300 dpi, and try again.
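As a rough illustration of two of those steps, the sketch below computes the resize factor needed to reach 300 dpi and brightens a toy grayscale raster with clamping. The 150 dpi source resolution and the offset of 40 are made-up values, the helper names are hypothetical, and deskewing is not shown.

```python
def scale_factor_for_dpi(current_dpi, target_dpi=300):
    """Factor by which to resize an image so its effective resolution
    becomes target_dpi."""
    if current_dpi <= 0:
        raise ValueError("current_dpi must be positive")
    return target_dpi / current_dpi

def brighten(pixels, offset):
    """Add offset to every grayscale value, clamping to the 0..255 range."""
    return [[min(255, max(0, value + offset)) for value in row] for row in pixels]

# Assumed example: a scan made at 150 dpi needs to be doubled in size.
factor = scale_factor_for_dpi(150)
lighter = brighten([[0, 200, 250]], 40)
print(factor, lighter)  # 2.0 [[40, 240, 255]]
```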

I tested using your images with vietocr (4.0 beta) with the following output ...

----------------------

East 133rd Street, cast from Cypress Ave. In the background is

the United Electric Light and Power Co. plant on the East River Shore.

April 12, 1931.

P. L. Sperr.

NO REPRODUCTIONS.

------------------

901 Harrie Ave., west aide, between East lGlet and East 162nd

Streets.

About 1925 .

W. B. Vernem.

MAY BE REPRODUCED.

ShreeDevi
