Tesseract is giving column data on the last line of file


ada...@turningcloud.com

Feb 22, 2018, 5:18:40 AM
to tesseract-ocr

The issue I am facing is that when I scan a file whose lines contain column data separated by "|", Tesseract prints the last column's data after the last line of the file instead of on the line it belongs to.
I am attaching the image for your reference. The output image shows the discrepancy on the last line. I hope to receive some help soon.

Can anyone suggest a solution? @shree, your help is much needed.


ShreeDevi Kumar

Feb 22, 2018, 5:52:21 AM
to tesser...@googlegroups.com
What --psm are you using?

Tesseract might be treating the last portion as a different column.

Try PSM 4 or 6.
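
As a rough illustration of the suggestion above, this Python sketch builds a Tesseract command line with an explicit page segmentation mode. The helper `build_command` and the file names `scan.png` and `out` are hypothetical; `--psm` is the option spelling used in this thread (some older releases spell it `-psm`).

```python
import os
import shutil
import subprocess


def build_command(image_path, output_base, psm=6):
    """Return the argument list for a Tesseract run with an explicit --psm.

    PSM 4 assumes a single column of text of variable sizes;
    PSM 6 assumes a single uniform block of text.
    """
    return ["tesseract", image_path, output_base, "--psm", str(psm)]


# Hypothetical input image and output base name, for illustration only.
command = build_command("scan.png", "out", psm=6)
print(" ".join(command))

# Run only when Tesseract is installed and the sample image actually exists.
if shutil.which("tesseract") and os.path.exists("scan.png"):
    subprocess.run(command, check=True)
```

If PSM 6 still splits the last column off, the same helper can be called with `psm=4` to compare the two outputs.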

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/76378a71-f459-454e-9c6c-a0e3f682b1b9%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

ada...@turningcloud.com

Feb 23, 2018, 12:07:23 AM
to tesseract-ocr
@shree

You are awesome.

Your suggestion solved the problem straight away. You are awesome, man. I really appreciate your help. You have responded whenever I needed it. :)

Keep up the good work.

Regards
Adarsh SHUKLA



ada...@turningcloud.com

Feb 23, 2018, 1:59:36 AM
to tesseract-ocr
Is there any way to remove the end-of-page symbol that appears in the image? It looks like a box with 000C written in it, at the end.

Regards
Adarsh




ShreeDevi Kumar

Feb 23, 2018, 7:33:32 AM
to tesser...@googlegroups.com
That is probably a form feed (FF, character code 000C).

Tesseract adds a page break (normally a form feed) at the end of each page by default.

It is still possible to suppress page breaks by setting an empty page_separator.
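
A minimal sketch of both ways to get rid of the form feed, assuming Python is used to drive Tesseract: passing an empty `page_separator` through `-c`, or stripping the character from the text afterward. The helper names and file names here are hypothetical.

```python
FORM_FEED = "\x0c"  # the character shown as a box labelled 000C


def build_command_without_page_break(image_path, output_base):
    """Return a Tesseract argument list that sets page_separator to an empty string."""
    # Nothing follows "=", so the configured separator is empty.
    return ["tesseract", image_path, output_base, "-c", "page_separator="]


def strip_form_feed(text):
    """Remove form feed characters from text that has already been produced."""
    return text.replace(FORM_FEED, "")


print(build_command_without_page_break("scan.png", "out"))
print(repr(strip_form_feed("col1 | col2\n\x0c")))
```

The post-processing variant is useful when the OCR output already exists and rerunning Tesseract is not convenient.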



ShreeDevi

ada...@turningcloud.com

Feb 26, 2018, 2:19:16 AM
to tesseract-ocr
Can you please suggest a way to print a newline instead of FF? I am able to print any string other than the form feed by using the -c page_separator="Hello" option, but I don't know how to print a newline.

Thanks in advance.

Regards
Adarsh

ShreeDevi Kumar

Feb 26, 2018, 3:34:04 AM
to tesser...@googlegroups.com
Try

-c page_separator="\n"

or the code for CRLF.
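
How a quoted `\n` is interpreted depends on the shell and on Tesseract itself, so one alternative, sketched here as an assumption rather than something proposed in this thread, is to keep the default separator and replace the form feed with a newline after the fact. The helper name `form_feed_to_newline` is hypothetical.

```python
FORM_FEED = "\x0c"  # default page separator written by Tesseract


def form_feed_to_newline(text):
    """Replace every form feed in OCR output with a newline."""
    return text.replace(FORM_FEED, "\n")


# Stand-in for text read from a Tesseract output file.
sample = "page one\n\x0cpage two\n\x0c"
print(repr(form_feed_to_newline(sample)))  # 'page one\n\npage two\n\n'
```

Using `"\r\n"` as the replacement string would give the CRLF variant mentioned above.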

ada...@turningcloud.com

Feb 26, 2018, 6:49:10 AM
to tesseract-ocr
Thanks a lot, Shree.