underlined text problem - tess4j

iShahad thobaiti

Jul 20, 2017, 9:47:25 AM

to tesseract-ocr

Hello,

I'm trying to extract text from a PDF file that contains underlined text, which the OCR cannot recognize accurately. It either skips the text or recognizes it incorrectly.

What is the best way to overcome this issue?

Thanks

akhil katpally

Jul 21, 2017, 12:24:53 AM

to tesseract-ocr

You can remove the underlines using Leptonica's line removal algorithm.
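
To illustrate the general idea of removing underlines before OCR, here is a minimal plain-Java sketch. It is not Leptonica's algorithm and does not use Tess4J: it simply whitens long horizontal runs of dark pixels, row by row. The class name `UnderlineRemover`, the method `removeHorizontalLines`, and both parameters (`darkThreshold`, `minRunLength`) are hypothetical names chosen for this example, and suitable values depend on the scan resolution.

```java
import java.awt.image.BufferedImage;
import java.awt.image.WritableRaster;

public class UnderlineRemover {

    /**
     * Returns an 8-bit grayscale copy of src in which every horizontal run of
     * dark pixels at least minRunLength wide is painted white.
     * Pixels with luminance below darkThreshold count as dark.
     */
    public static BufferedImage removeHorizontalLines(BufferedImage src,
                                                      int darkThreshold,
                                                      int minRunLength) {
        int w = src.getWidth();
        int h = src.getHeight();
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_GRAY);
        WritableRaster raster = out.getRaster();

        for (int y = 0; y < h; y++) {
            int runStart = -1; // start column of the current dark run, or -1
            // x == w is a sentinel step that closes a run touching the right edge
            for (int x = 0; x <= w; x++) {
                boolean dark = false;
                if (x < w) {
                    int gray = luminance(src.getRGB(x, y));
                    raster.setSample(x, y, 0, gray); // copy the pixel as-is first
                    dark = gray < darkThreshold;
                }
                if (dark) {
                    if (runStart < 0) {
                        runStart = x;
                    }
                } else if (runStart >= 0) {
                    // The run ended; erase it only if it is long enough to be a line
                    if (x - runStart >= minRunLength) {
                        for (int i = runStart; i < x; i++) {
                            raster.setSample(i, y, 0, 255);
                        }
                    }
                    runStart = -1;
                }
            }
        }
        return out;
    }

    // Integer approximation of perceived brightness (0 = black, 255 = white)
    private static int luminance(int rgb) {
        int r = (rgb >> 16) & 0xFF;
        int g = (rgb >> 8) & 0xFF;
        int b = rgb & 0xFF;
        return (299 * r + 587 * g + 114 * b) / 1000;
    }
}
```

A call such as `UnderlineRemover.removeHorizontalLines(image, 128, 40)` would erase dark runs of 40 pixels or more. This naive approach also cuts through descenders (g, p, y) that cross the underline and does not handle skewed pages, which is why a morphology-based method like Leptonica's is usually the better choice.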

Quan Nguyen

Jul 22, 2017, 6:47:34 PM

to tesseract-ocr

Pix LeptUtils.removeLines(Pix pixs)

iShahad thobaiti

Jul 23, 2017, 3:56:28 AM

to tesseract-ocr

I tried using the method as follows:

Pix pixTemp= LeptUtils.removeLines(LeptUtils.convertImageToPix(bi));

BufferedImage imageRemovedLines = LeptUtils.convertPixToImage(pixTemp);

but when I run it in Eclipse, the program terminates at: LeptUtils.convertPixToImage

THintz

Jul 23, 2017, 7:28:09 AM

to tesseract-ocr

I think that method only supports grayscale.

iShahad thobaiti

Jul 23, 2017, 7:38:12 AM

to tesseract-ocr

I'm applying it to a grayscale image.
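
The thread does not show how the image was converted, and an image that looks gray can still be stored with three color channels. As a minimal sketch, assuming the goal is an 8-bit single-channel `BufferedImage`, the conversion could look like this in plain Java. The class `GrayscaleConverter` and method `toByteGray` are hypothetical names for this example, and whether this resolves the termination in `convertPixToImage` is not confirmed anywhere in the thread.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

public class GrayscaleConverter {

    /**
     * Returns src unchanged if it is already TYPE_BYTE_GRAY; otherwise draws it
     * into a new 8-bit single-channel image of the same size.
     */
    public static BufferedImage toByteGray(BufferedImage src) {
        if (src.getType() == BufferedImage.TYPE_BYTE_GRAY) {
            return src;
        }
        BufferedImage gray = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = gray.createGraphics();
        try {
            // Drawing into a gray target lets Java2D do the color conversion
            g.drawImage(src, 0, 0, null);
        } finally {
            g.dispose();
        }
        return gray;
    }
}
```

Checking `bi.getType()` before the Leptonica calls is a quick way to see which pixel layout is actually being passed in.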
