Failure to recognize columns


fuzzy7k

Oct 12, 2016, 5:21:17 PM
to tesseract-ocr
I have scanned some index pages that I would like to OCR for rapid searching. I am using tesseract from the command line. The problem is that tesseract ignores the whitespace between columns and merges everything together, essentially fragmenting the contents. Using some debug output, I see that no "columns" are detected. Probably more important, three "blocks" are detected: one each around the first and last lines, and one encompassing everything in between. Is there a way to train block detection, or are there some parameters that I can tweak to optimize this?

I have attached the image merely as an abstract representation of the text layout, to show the types of columns I am dealing with. Ideally, I would also like to know whether tab stops can be trained and used to put each individual topic on a single line, which I could do in post-processing if I could get the tab stops printed.
Screenshot_20161012_164633.png

ShreeDevi Kumar

Oct 13, 2016, 1:46:45 AM
to tesser...@googlegroups.com

Which page segmentation mode (psm) did you try?



--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/5b4800f9-cead-4959-9260-52e98ee596b7%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

fuzzy7k

Oct 13, 2016, 7:13:53 AM
to tesseract-ocr
I tried psm 0 through 3.



ShreeDevi Kumar

Oct 13, 2016, 8:21:09 AM
to tesser...@googlegroups.com

Try psm 6, also 11, 12

https://github.com/tesseract-ocr/tesseract/issues/434



fuzzy7k

Oct 13, 2016, 8:30:05 PM
to tesseract-ocr
psm 6 gives exactly the same results as psm 3 (i.e. no column separation). psm 11 and 12 are essentially the same, in that they pull text from left to right across the columns, but with three times as many newlines.



fuzzy7k

Oct 13, 2016, 8:53:25 PM
to tesseract-ocr
Going back to psm 3, I did find that setting textord_tabfind_find_tables to 0 helped, in that it draws only one box around the "block" of text instead of the three that I was getting at first. That is effectively the same result as psm 6, but psm 6 should not run column detection, which is something that I want unless I can get tesseract to draw "blocks" vertically around the individual columns.
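
For anyone scripting this, here is a minimal sketch of building that command line in Python. The helper name `tesseract_args` and the output base name are made up for illustration. It assumes the Tesseract 4+ spelling `--psm` (3.x releases spell it `-psm`) and passes the variable with the `-c name=value` option.

```python
def tesseract_args(image, out_base, psm=3, config=None):
    """Build an argument list for the tesseract CLI (hypothetical helper).

    Assumes the Tesseract 4+ flag spelling "--psm"; 3.x releases use "-psm".
    """
    args = ["tesseract", image, out_base, "--psm", str(psm)]
    # Each config variable is passed as: -c name=value
    for name, value in (config or {}).items():
        args += ["-c", f"{name}={value}"]
    return args


# psm 3 with table detection switched off, as described in the post above.
print(tesseract_args("index-3.pnm", "index-3",
                     config={"textord_tabfind_find_tables": 0}))
```

The resulting list can be handed to `subprocess.run(args, check=True)`.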

ShreeDevi Kumar

Oct 14, 2016, 3:29:53 AM
to tesser...@googlegroups.com

You can also experiment with hocr and tsv output modes to see if they help.



fuzzy7k

Oct 14, 2016, 5:08:51 PM
to tesseract-ocr
negative

Tom Morris

Oct 15, 2016, 9:49:20 PM
to tesseract-ocr

Tesseract is probably getting confused by the indents for the entries. It should be pretty easy to identify the columns using image processing (e.g. create a histogram of black pixel counts for each vertical pixel column). Why not just do the page segmentation yourself and pass the three columns to Tesseract separately?

Tom 
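
The histogram idea described above could be sketched roughly as follows. This is only an illustration, not code from Tesseract or from anyone in the thread: the function names and the `min_width` / `max_ink` thresholds are invented, and the page is assumed to be already binarized into rows of 0/1 values (1 = black pixel), which an image library would have to supply.

```python
def column_profile(rows):
    """Count black pixels in each vertical pixel column.

    rows: list of equal-length rows, where 1 = black and 0 = white.
    """
    return [sum(col) for col in zip(*rows)]


def find_gutters(profile, min_width=3, max_ink=0):
    """Return (start, end) x-ranges, end exclusive, of runs that are at
    least min_width pixels wide and hold at most max_ink black pixels."""
    gutters = []
    start = None
    for x, count in enumerate(profile):
        if count <= max_ink:
            if start is None:
                start = x
        else:
            if start is not None and x - start >= min_width:
                gutters.append((start, x))
            start = None
    # A blank run may extend to the right edge of the page.
    if start is not None and len(profile) - start >= min_width:
        gutters.append((start, len(profile)))
    return gutters


def column_bounds(profile, min_width=3, max_ink=0):
    """Return (left, right) x-ranges of the text columns between gutters."""
    bounds = []
    left = 0
    for g_start, g_end in find_gutters(profile, min_width, max_ink):
        if g_start > left:
            bounds.append((left, g_start))
        left = g_end
    if left < len(profile):
        bounds.append((left, len(profile)))
    return bounds


# Toy page: three text columns, the second row indented by one pixel.
page = [
    [1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1],
    [0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1],
]
print(column_bounds(column_profile(page)))
```

Because the counts are summed over the full page height, an indent on some lines does not open a false gutter as long as other lines in the same column start at the margin. On real scans, noise and skew would likely call for a `max_ink` above zero and a much larger `min_width`.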

fuzzy7k

Oct 23, 2016, 9:35:21 PM
to tesseract-ocr
Well, I have used ocrfeeder to draw up the columns individually, but that is a lot of mouse clicking and copy/pasting. I don't care to do that for 40 pages of index material, considering most of the text will probably never even be looked at. That's why I was hoping to find a line of code that I could tweak, so that I could just whip up a script to take on the whole batch with the press of a finger. I made a few changes in textord/colfind.cpp, but concluded that I was chasing a rabbit into a hole. I did have success with drawing a freehand line between the columns. I'm currently looking into how to do that with convert.

I like the histogram idea. That sounds like a good feature request.

fuzzy7k

Oct 23, 2016, 9:51:16 PM
to tesseract-ocr
It's less than elegant, but it works:
convert -draw "line 800,0 800,10000" -draw "line 1500,0 1500,10000" index-3.pnm x.pnm
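
For a batch of pages, the same command could be generated from a list of x positions. The sketch below is only an illustration: the helper name `convert_args` is made up, and it places the input file before the `-draw` operations (the order ImageMagick's documentation describes), whereas the command above puts them first.

```python
def convert_args(src, dst, xs, height=10000):
    """Build an ImageMagick convert command that draws one vertical
    separator line at each x position in xs (hypothetical helper)."""
    args = ["convert", src]
    for x in xs:
        # Same primitive as in the post: a line from (x, 0) to (x, height).
        args += ["-draw", f"line {x},0 {x},{height}"]
    args.append(dst)
    return args


# Reproduces the two separators used for index-3.pnm above.
print(convert_args("index-3.pnm", "x.pnm", [800, 1500]))
```

Each argument list can be passed to `subprocess.run(args, check=True)` inside a loop over the scanned pages.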