underlined text problem - tess4j


iShahad thobaiti

Jul 20, 2017, 9:47:25 AM
to tesseract-ocr

Hello, 

I'm trying to extract text from a PDF file that contains underlined text, which the OCR cannot recognize accurately: it either skips the text or recognizes it incorrectly.

What is the best way to overcome this issue?


Thanks 

akhil katpally

Jul 21, 2017, 12:24:53 AM
to tesseract-ocr
You can remove the underlines using Leptonica's line removal algorithm.
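
To illustrate the general idea of stripping underlines before OCR, here is a minimal pure-Java sketch. It is not Leptonica's algorithm: it is a much simpler stand-in that whitens any long horizontal run of dark pixels in a grayscale image. The class name, method name, and both parameters (`darkThreshold`, `minRun`) are hypothetical and would need tuning for real scans; letters that touch the underline may also lose a few pixels.

```java
import java.awt.image.BufferedImage;
import java.awt.image.WritableRaster;

public class UnderlineRemovalSketch {

    /**
     * Returns a copy of a grayscale image in which every horizontal run of
     * dark pixels that is at least minRun pixels long is painted white.
     * Only band 0 is read, so the input is assumed to be grayscale.
     */
    public static BufferedImage removeLongHorizontalRuns(
            BufferedImage gray, int darkThreshold, int minRun) {
        int w = gray.getWidth();
        int h = gray.getHeight();
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_GRAY);
        WritableRaster src = gray.getRaster();
        WritableRaster dst = out.getRaster();

        for (int y = 0; y < h; y++) {
            int runStart = -1; // start column of the current dark run, or -1
            // Iterate one step past the last column so a run that reaches
            // the right edge is closed as well.
            for (int x = 0; x <= w; x++) {
                boolean dark = false;
                if (x < w) {
                    int v = src.getSample(x, y, 0);
                    dst.setSample(x, y, 0, v); // copy the pixel unchanged first
                    dark = v < darkThreshold;
                }
                if (dark) {
                    if (runStart < 0) {
                        runStart = x;
                    }
                } else if (runStart >= 0) {
                    if (x - runStart >= minRun) {
                        // Long run: treat it as part of a line and whiten it.
                        for (int i = runStart; i < x; i++) {
                            dst.setSample(i, y, 0, 255);
                        }
                    }
                    runStart = -1;
                }
            }
        }
        return out;
    }
}
```

A call such as `removeLongHorizontalRuns(image, 128, 100)` would treat runs of 100 or more pixels darker than 128 as underline candidates; short strokes belonging to characters are left untouched.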

Quan Nguyen

Jul 22, 2017, 6:47:34 PM
to tesseract-ocr

iShahad thobaiti

Jul 23, 2017, 3:56:28 AM
to tesseract-ocr
I tried using the method as follows:

Pix pixTemp = LeptUtils.removeLines(LeptUtils.convertImageToPix(bi));

BufferedImage imageRemovedLines = LeptUtils.convertPixToImage(pixTemp);

but when I run it in Eclipse, the program terminates at LeptUtils.convertPixToImage.

THintz

Jul 23, 2017, 7:28:09 AM
to tesseract-ocr
I think that method only supports grayscale.
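
If the input does turn out not to be grayscale, one way to convert it before calling the line-removal method is to redraw it into an 8-bit gray `BufferedImage` using only the standard `java.awt` classes. This is a minimal sketch; the class and method names are hypothetical, and it has not been checked against what `LeptUtils` actually requires.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

public class GrayscaleSketch {

    /** Redraws any BufferedImage into a new single-band, 8-bit gray image. */
    public static BufferedImage toGray(BufferedImage src) {
        BufferedImage gray = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = gray.createGraphics();
        try {
            // Drawing into a TYPE_BYTE_GRAY target performs the color conversion.
            g.drawImage(src, 0, 0, null);
        } finally {
            g.dispose();
        }
        return gray;
    }
}
```

Checking `image.getType() == BufferedImage.TYPE_BYTE_GRAY` before and after the conversion is a quick way to confirm what is actually being handed to the next step.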

iShahad thobaiti

Jul 23, 2017, 7:38:12 AM
to tesseract-ocr
I'm applying it to a grayscale image.