Page segmentation and preserve_interword_space are not working

Prav

Jul 26, 2017, 1:00:57 PM
to tesseract-ocr
Hi,

I am trying to extract tabular data from an image by converting it to hOCR.
The hOCR output is not ordered as a table: it lists all the data for one column first and then the data for the next. I want the data row by row, with the columns kept side by side, so that it can be extracted as a proper table.

I have tried both -psm 5 and -psm 6, but the hOCR output is identical in both cases.

I am using Tesseract 3.05.

Even setting preserve_interword_space to 1 has no effect.

Any help would be appreciated.
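
For reference, here is a minimal sketch of how the invocation described above might be assembled from Python. It assumes the 3.05-style `-psm` flag used in this post and the parameter name `preserve_interword_spaces` (the plural spelling is the one Tesseract defines). The helper name `tesseract_command` and the file names are hypothetical, and the command is only built here, not run.

```python
def tesseract_command(image_path, output_base, psm=6, configs=("hocr",)):
    """Build a Tesseract 3.05-style command line as a list of arguments.

    Hypothetical helper for illustration: it only assembles the arguments
    and does not check that tesseract is installed.
    """
    command = ["tesseract", image_path, output_base]
    # Page segmentation mode; 3.05 accepts the single-dash -psm form.
    command += ["-psm", str(psm)]
    # Config variables are passed with -c name=value.
    command += ["-c", "preserve_interword_spaces=1"]
    # Config files such as hocr or tsv select the output format.
    command += list(configs)
    return command


if __name__ == "__main__":
    # table.png and out are placeholder names.
    print(" ".join(tesseract_command("table.png", "out")))
```

The resulting list can be handed to `subprocess.run` as is. Passing `configs=("tsv",)` would request TSV output instead of hOCR.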

For example, the image contains the following:

Column 1             Column 2
X                           1
Y                           2
Z                           3

The hOCR output gives:

X
Y
Z
1
2
3

I would like the output to be:

X     1
Y     2
Z     3

I would be grateful for any help or ideas.

Thanks

ShreeDevi Kumar

Jul 26, 2017, 1:56:18 PM
to tesser...@googlegroups.com
Try 'tsv' instead of 'hocr'.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/d2b68f4a-8f1b-473b-bd27-818d9d1a28be%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Prav

Jul 26, 2017, 10:38:16 PM
to tesseract-ocr
Thanks for the reply.

TSV also gives the data as a single column: it lists column 1, then column 2, and finally column 3, one below the other.
I cannot figure out how to construct a table from the TSV output.
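
One possible way to build a table from the TSV is to ignore the reading order entirely and regroup the words by their bounding boxes. The sketch below assumes the standard Tesseract TSV columns (`level`, `left`, `top`, `width`, `height`, `text`, with word entries at level 5). The function name `rows_from_tsv`, the `tolerance` parameter, and the sample coordinates are hypothetical and made up for illustration. Words whose vertical centres lie within half a word height of each other are treated as one row, and each row is then sorted from left to right.

```python
import csv
import io


def rows_from_tsv(tsv_text, tolerance=0.5):
    """Group Tesseract TSV words into visual rows, ordered left to right."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    words = []
    for rec in reader:
        text = (rec.get("text") or "").strip()
        # Level 5 entries are individual words; skip everything else.
        if rec.get("level") != "5" or not text:
            continue
        top, height = int(rec["top"]), int(rec["height"])
        words.append({"text": text, "left": int(rec["left"]),
                      "mid": top + height / 2, "height": height})

    # Sort by vertical centre, then start a new row whenever the centre
    # moves further than tolerance * word height from the row's first word.
    words.sort(key=lambda w: w["mid"])
    rows = []
    for w in words:
        if rows and abs(w["mid"] - rows[-1]["mid"]) <= tolerance * w["height"]:
            rows[-1]["words"].append(w)
        else:
            rows.append({"mid": w["mid"], "words": [w]})

    # Within each row, order the words by their left edge.
    return [[w["text"] for w in sorted(r["words"], key=lambda w: w["left"])]
            for r in rows]


# Made-up sample in column-first order, similar to what the thread describes.
HEADER = ["level", "page_num", "block_num", "par_num", "line_num",
          "word_num", "left", "top", "width", "height", "conf", "text"]
SAMPLE_WORDS = [
    [5, 1, 1, 1, 1, 1, 10, 50, 12, 14, 95, "X"],
    [5, 1, 1, 1, 2, 1, 10, 80, 12, 14, 94, "Y"],
    [5, 1, 1, 1, 3, 1, 10, 110, 12, 14, 96, "Z"],
    [5, 1, 2, 1, 1, 1, 200, 51, 8, 14, 93, "1"],
    [5, 1, 2, 1, 2, 1, 200, 81, 8, 14, 92, "2"],
    [5, 1, 2, 1, 3, 1, 200, 111, 8, 14, 91, "3"],
]
SAMPLE = "\n".join("\t".join(str(v) for v in line)
                   for line in [HEADER] + SAMPLE_WORDS)

if __name__ == "__main__":
    for row in rows_from_tsv(SAMPLE):
        print("\t".join(row))
```

This only recovers rows. Assigning words to named columns would need a further step, such as clustering the `left` values, and skewed scans or multi-line cells would likely need a looser tolerance.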