How to make tesseract not split the image into sections


Benjamin Sølberg

Jan 2, 2014, 7:12:53 PM
to tesser...@googlegroups.com
Hi all

I am training tesseract to work with a custom font.
Things are moving forward but there are clouds in the sky.

When I run tesseract, it insists on cutting the text into sections.
I understand why, as the text layout may seem to be column-based.
The text is very much like an Excel sheet with columns.
But I would very much like tesseract to give me all the text in one giant line-based blob instead of many sections.

Is it possible to make tesseract not chop the image up into sections (columns, in my case)?

Regards
Benjamin

Quan Nguyen

Jan 3, 2014, 11:48:06 AM
to tesser...@googlegroups.com
Try with PSM (page segmentation mode) 4, 5, or 6.
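
For reference, here is a minimal sketch of what those three modes mean, using the wording from tesseract's own command-line help. The `PSM_DESCRIPTIONS` table and the `describe_psm` helper are illustrative names only, not part of tesseract.

```python
# Page segmentation modes suggested in this thread, described with the
# wording from tesseract's command-line help. The names below are
# illustrative and not part of tesseract itself.
PSM_DESCRIPTIONS = {
    4: "Assume a single column of text of variable sizes.",
    5: "Assume a single uniform block of vertically aligned text.",
    6: "Assume a single uniform block of text.",
}


def describe_psm(mode: int) -> str:
    """Return a one-line description for one of the suggested PSM values."""
    if mode not in PSM_DESCRIPTIONS:
        raise ValueError(f"PSM {mode} is not one of the modes suggested here")
    return f"PSM {mode}: {PSM_DESCRIPTIONS[mode]}"


if __name__ == "__main__":
    for mode in sorted(PSM_DESCRIPTIONS):
        print(describe_psm(mode))
```

For a spreadsheet-like page that should come out as one block, mode 6 is probably the closest match of the three.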

Benjamin Sølberg

Jan 3, 2014, 2:02:24 PM
to tesser...@googlegroups.com
Thank you, I'll try that.

Is it possible to achieve the same thing with a config parameter? I also need to run this on an iPhone.

Regards
Benjamin

Quan Nguyen

Jan 3, 2014, 3:09:06 PM
to tesser...@googlegroups.com
I think it is possible.
 
tessedit_pageseg_mode 6
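
A rough sketch of how that line could be put into a config file, which can then be passed as the trailing `configfile` argument on the command line. The file name `singleblock` and the helper `write_psm_config` are made up for illustration. Tesseract normally looks up config names under `tessdata/configs/`, so the file may need to live there.

```python
from pathlib import Path


def write_psm_config(path: Path, mode: int = 6) -> Path:
    """Write a tesseract config file holding a single 'name value' line."""
    # Config files are plain text: one parameter name and its value per line.
    path.write_text(f"tessedit_pageseg_mode {mode}\n", encoding="ascii")
    return path


if __name__ == "__main__":
    # "singleblock" is a hypothetical config name chosen for this example.
    cfg = write_psm_config(Path("singleblock"))
    print(cfg.read_text(), end="")
```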
 

zdenko podobny

Jan 3, 2014, 4:22:29 PM
to tesser...@googlegroups.com
You can set the page segmentation mode with SetPageSegMode. See the example on the wiki [1] or have a look at the tesseract source code.


Benjamin Sølberg

Jan 3, 2014, 6:43:18 PM
to tesser...@googlegroups.com
Thanks for the info and the link.

Benjamin

Benjamin Sølberg

Jan 3, 2014, 6:56:50 PM
to tesser...@googlegroups.com
Have given it a try.

The output is now in one block as needed, that's good.

But the problem now seems to be that it does not take my training data into account much.
Special characters are no longer recognized.
I guess the "-psm 6" option makes it stop earlier in the process.
Is it possible to make it skip just the segmentation step and do the rest as usual?
I am just taking a pure guess here on how it works.

Benjamin

Quan Nguyen

Jan 3, 2014, 9:24:22 PM
to tesser...@googlegroups.com
Make sure the command and parameters/options are in proper order.
 
Usage: tesseract.exe imagename outputbase|stdout [-l lang] [-psm pagesegmode] [configfile...]
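
To make the ordering concrete, here is a small sketch that assembles the arguments in the order the usage line gives. The helper `build_tesseract_args`, the image name `sheet.png`, and the language name `myfont` are assumptions for illustration. If `-l` is left out, tesseract uses its default language, `eng`, rather than the custom traineddata.

```python
from typing import List, Optional, Sequence


def build_tesseract_args(
    image: str,
    outputbase: str,
    lang: Optional[str] = None,
    psm: Optional[int] = None,
    configs: Sequence[str] = (),
) -> List[str]:
    """Assemble arguments as: imagename outputbase [-l lang] [-psm N] [configfile...]."""
    args = ["tesseract", image, outputbase]  # the two positional arguments come first
    if lang is not None:
        args += ["-l", lang]
    if psm is not None:
        # "-psm" matches the usage line quoted here; newer releases spell it "--psm".
        args += ["-psm", str(psm)]
    args += list(configs)  # config file names always go last
    return args


if __name__ == "__main__":
    # "sheet.png" and "myfont" are placeholder names, not from the thread.
    print(" ".join(build_tesseract_args("sheet.png", "out", lang="myfont", psm=6)))
```

The resulting list can be handed to `subprocess.run` as-is on a machine where tesseract is installed.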

Benjamin Sølberg

Jan 4, 2014, 5:21:51 PM
to tesser...@googlegroups.com
Thanks.

Did just that.

Strange.