Read 2 column Image Horizontally (line by line) rather than Vertically (column by column)


Mike Hall

Apr 6, 2017, 1:36:22 PM
to tesseract-ocr
We have a C# .Net app that is using Tesseract to do Optical Character Recognition (OCR) on .tiff files.  I've attached a sample tiff file.

We are then outputting the data to a text file. However, Tesseract is reading the data vertically. In my example image, it treats the tiff as two columns of data, and the output from Tesseract looks like this:
 
TYPE:
DATE:
Address:
City:
State:
Owner:
Owner Type:
Acreage:
Mortgage:
12345
2017-04-06
100 Main St.
Some City
Some State
John Doe
Primary
10.25
Yes

What we want is for Tesseract to read the tiff file horizontally and produce output like this:

TYPE:
12345
DATE:
2017-04-06
Address:
100 Main St.
City:
Some City
State:
Some State
Owner:
John Doe
Owner Type:
Primary
Acreage:
10.25
Mortgage:
Yes

We've tried the various Page Segmentation options for Tesseract, but they all produce the same result.
Has anyone run into this same issue? Does anybody have any ideas?
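
To make the two reading orders concrete, here is a minimal Python sketch (an illustration, not something from the original post) that rearranges column-by-column output into the row-by-row order described above. It assumes the label column comes first and that every label has exactly one value; real OCR output may not satisfy that, so treat it as a way to picture the desired order rather than as a fix. The function name `interleave_columns` is hypothetical.

```python
def interleave_columns(lines):
    """Reorder [label1..labelN, value1..valueN] into [label1, value1, ...].

    Assumption: the first half of the non-blank lines are labels and the
    second half are the matching values, in the same order.
    """
    # Ignore blank lines and surrounding whitespace.
    cleaned = [ln.strip() for ln in lines if ln.strip()]
    if len(cleaned) % 2 != 0:
        raise ValueError("expected the same number of labels and values")

    half = len(cleaned) // 2
    labels, values = cleaned[:half], cleaned[half:]

    # Pair each label with the value at the same position.
    result = []
    for label, value in zip(labels, values):
        result.extend([label, value])
    return result


if __name__ == "__main__":
    column_order = ["TYPE:", "DATE:", "Address:",
                    "12345", "2017-04-06", "100 Main St."]
    print("\n".join(interleave_columns(column_order)))
```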

MyTestFile.tiff

ShreeDevi Kumar

Apr 6, 2017, 2:12:18 PM
to tesser...@googlegroups.com
Have you tried --psm 6?

- excuse the brevity, sent from mobile

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/790b41ef-f97f-4695-b7c8-1c68bdd1cd38%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Mike Hall

Apr 6, 2017, 4:48:55 PM
to tesseract-ocr
Yes, we are using the -psm 6 command line argument, and it was not working.

But I figured out the issue. 

Tesseract has a set of config files. Several of them (hocr, pdf, tsv, unlv) contain the setting tessedit_pageseg_mode, and it was set to 1 in all of them. Once I removed the tessedit_pageseg_mode parameter from those config files, our command line argument of -psm 6 worked.

Alternatively, I did experiment with the config files.  When I changed the tessedit_pageseg_mode setting to 6 in all the config files and ran Tesseract with the -psm 6 command line argument, it also worked.
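
For anyone wanting to check their own installation, a small Python sketch along these lines could list which config files set tessedit_pageseg_mode. It assumes the config files are plain text with one whitespace-separated "name value" pair per line, and that you pass in the path to your own configs directory; the function name `find_pageseg_overrides` is hypothetical.

```python
from pathlib import Path

SETTING = "tessedit_pageseg_mode"


def find_pageseg_overrides(config_dir):
    """Return {config file name: value} for files that set the page segmentation mode."""
    overrides = {}
    for path in sorted(Path(config_dir).iterdir()):
        if not path.is_file():
            continue
        for line in path.read_text(errors="replace").splitlines():
            parts = line.split()
            # Assumption: each setting is written as "name value" on its own line.
            if len(parts) >= 2 and parts[0] == SETTING:
                overrides[path.name] = parts[1]
    return overrides
```

Calling `find_pageseg_overrides(...)` with the location of the configs directory returns a dictionary such as `{"hocr": "1"}` for any file that carries the setting, which shows where a value might be competing with the command line argument.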

Thanks

ShreeDevi Kumar

Apr 6, 2017, 9:06:27 PM
to tesser...@googlegroups.com
Normally, for text output, the other config files should have no impact.



- excuse the brevity, sent from mobile

Mike Hall

Apr 7, 2017, 12:38:58 PM
to tesseract-ocr
We are using hocr and pdf outputs as well.
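
As a rough sketch of the kind of invocation discussed in this thread, the command line could be assembled in Python as below. It assumes a `tesseract` executable on the PATH and the usual argument order of image, output base, options, then config file names. The thread uses both spellings of the flag (-psm and --psm); which one a given build accepts depends on the version, so the `legacy_flag` switch and the helper name `build_tesseract_command` are illustrative assumptions.

```python
def build_tesseract_command(image, output_base, psm=6, configs=(), legacy_flag=True):
    """Build an argument list for a tesseract run with an explicit page segmentation mode."""
    # "-psm" is the spelling used in the thread; "--psm" is the one suggested in the reply.
    flag = "-psm" if legacy_flag else "--psm"
    # Config file names (for example hocr, pdf) go after the options.
    return ["tesseract", image, output_base, flag, str(psm), *configs]
```

The resulting list can be handed to `subprocess.run(...)`. If one of the named config files also sets tessedit_pageseg_mode, the behaviour reported above suggests that value may win over the command line flag.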