How to get "tables" OCR-ed


V S Rawat

Aug 10, 2014, 10:17:07 AM
to tocr
We often get images or PDFs that contain tables.

The text is in several columns, which should be treated as separate cells
and put on the same line, with a separator such as a tab and with quotes,
to get CSV format.

However, my method of running Tesseract through VietOCR.NET doesn't help there.

It does recognize the separate areas and OCRs them separately, but it puts
one column below the other: all rows of the first column at the top,
then all rows of the second column, then all rows of the next column, and so on.

That is not much help, because it takes a lot of effort to put all the text
of one row back together.
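
To illustrate the regrouping being asked for, here is a minimal Python sketch. It assumes that word bounding boxes are available alongside the recognized text (for example from Tesseract's hOCR output) rather than plain text only. The `Box` layout, the function names `boxes_to_rows` and `rows_to_tsv`, and the `row_tolerance` parameter are hypothetical names chosen for illustration, not part of Tesseract or VietOCR.NET.

```python
import csv
import io
from typing import List, Tuple

# Hypothetical record layout: (text, left, top, width, height), in pixels.
Box = Tuple[str, int, int, int, int]


def boxes_to_rows(boxes: List[Box], row_tolerance: int = 10) -> List[List[str]]:
    """Group word boxes into rows by vertical centre, then order left to right."""
    rows = []  # each entry: [reference centre y, [(left, text), ...]]
    for text, left, top, width, height in sorted(
        boxes, key=lambda b: b[2] + b[4] / 2
    ):
        centre_y = top + height / 2
        if rows and abs(centre_y - rows[-1][0]) <= row_tolerance:
            # Close enough to the current row's reference centre: same row.
            rows[-1][1].append((left, text))
        else:
            rows.append([centre_y, [(left, text)]])
    return [
        [text for _, text in sorted(cells, key=lambda c: c[0])]
        for _, cells in rows
    ]


def rows_to_tsv(rows: List[List[str]]) -> str:
    """Write rows as tab-separated, fully quoted CSV text."""
    buf = io.StringIO()
    writer = csv.writer(
        buf, delimiter="\t", quoting=csv.QUOTE_ALL, lineterminator="\n"
    )
    writer.writerows(rows)
    return buf.getvalue()


# Boxes listed column by column, the way the OCR output arrives.
boxes = [
    ("Name", 10, 10, 50, 20),
    ("Alice", 10, 40, 50, 20),
    ("Age", 120, 12, 30, 20),
    ("30", 120, 41, 20, 20),
]
table = boxes_to_rows(boxes)
tsv_text = rows_to_tsv(table)
```

This sketch treats every box as one cell, so a cell containing several words would be split into several fields. Merging words within a cell would additionally need the column boundaries, and `row_tolerance` would have to be tuned to the line height of the scan.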

Is there any way to make Tesseract identify tables and OCR them in a
more helpful way?

Or should this problem be addressed to the developers of the VietOCR.NET frontend?

Thanks.
--
Rawat




Quan Nguyen

Aug 10, 2014, 10:26:35 AM
to tesser...@googlegroups.com
Tables are a known limitation of the Tesseract OCR engine.

If you can eliminate the table borders, you will get better results from Tesseract.
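
As a rough illustration of what eliminating borders could look like, the Python sketch below erases long straight runs of ink from a binary image before it is passed to OCR. It is only one possible approach under stated assumptions: the image is already binarized into a list of rows with 1 for ink and 0 for background, the ruling lines are perfectly horizontal or vertical, and the names `remove_long_lines` and `min_run` are hypothetical. Real scans usually have skewed or broken lines and would need an image-processing library (OpenCV's morphological operations are a common choice).

```python
from typing import Iterable, Iterator, List, Tuple


def _long_runs(values: Iterable[int], min_run: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) index pairs for runs of ink at least min_run long."""
    start = None
    # The trailing 0 is a sentinel that closes a run reaching the edge.
    for i, value in enumerate(list(values) + [0]):
        if value and start is None:
            start = i
        elif not value and start is not None:
            if i - start >= min_run:
                yield start, i
            start = None


def remove_long_lines(image: List[List[int]], min_run: int) -> List[List[int]]:
    """Return a copy of the image with long horizontal and vertical runs erased.

    Runs are detected on the original image, so a horizontal and a vertical
    border that cross are both found even though they share a pixel.
    """
    out = [row[:] for row in image]
    for y, row in enumerate(image):
        for start, end in _long_runs(row, min_run):
            for x in range(start, end):
                out[y][x] = 0
    width = len(image[0]) if image else 0
    for x in range(width):
        column = [row[x] for row in image]
        for start, end in _long_runs(column, min_run):
            for y in range(start, end):
                out[y][x] = 0
    return out


# A tiny "table": a top border, a left border, and a two-pixel glyph.
page = [
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 0],
    [1, 0, 1, 1, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
]
cleaned = remove_long_lines(page, min_run=4)
```

The value of `min_run` has to be longer than any stroke in the text itself, otherwise characters such as long dashes or tall letters could be erased along with the borders.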