Re: New line recognition

zdenko podobny

Apr 6, 2013, 5:29:33 AM

to tesser...@googlegroups.com

On Fri, Apr 5, 2013 at 11:20 PM, Ruud van Houtum <ruudv...@gmail.com> wrote:

Hello,

I am using Tesseract to output text files from scanned documents.
All text images contain typed text and are fairly clear/clean. So far Tesseract has a pretty good accuracy and I am quite content.

However Tesseract doesn't seem to recognize line breaks, and I was wondering if this is an available option or not?

It does. If it does not, then please provide an example.

At first I thought this was not possible, but searching online brings up topics (such as: http://code.google.com/p/tesseract-ocr/issues/detail?id=575) which seem to show that it should be possible.

Is there a parameter that should be included in the command prompt?
I am using Windows 7, cmd.exe.

Thanks in advance,
R

BTW I would recommend adding http://tesseract-ocr.googlecode.com/svn-history/trunk/doc/tesseract.1.html to the wiki page. It took me a very long time to find this page (it's hidden in the FAQ) and it provides some helpful information about the parameters.

Did you read https://code.google.com/p/tesseract-ocr/wiki/Documentation?

--
--
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to tesser...@googlegroups.com
To unsubscribe from this group, send email to
tesseract-oc...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en


Dan Edwards

Feb 26, 2015, 7:03:54 AM

to tesser...@googlegroups.com

Ruud,

I experience the same issue you describe, but after looking at an output file in a hex editor the reason is clear. Tesseract seems to detect line breaks perfectly fine, but it inserts only the line feed character (0x0A) and not the carriage return plus line feed pair (0x0D 0x0A) that a Windows text file expects.

So it would be fairly easy to take the output from tesseract and then feed it through another converter that changes all the 0x0A characters to 0x0D 0x0A. But it is unfortunate that it does not support such an option inherently.
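
For what it's worth, here is a minimal sketch of such a converter in Python. It is not part of Tesseract; the function names `lf_to_crlf` and `convert_file` are my own, and the file names in the usage note are placeholders. It assumes the output file is small enough to read into memory, and it first collapses any existing 0x0D 0x0A pairs so they are not doubled.

```python
def lf_to_crlf(data: bytes) -> bytes:
    """Return data with every line ending written as CRLF (0x0D 0x0A)."""
    # Collapse existing CRLF pairs to LF first so they are not turned into CR CR LF.
    normalized = data.replace(b"\r\n", b"\n")
    # Expand every LF (0x0A) into CRLF (0x0D 0x0A).
    return normalized.replace(b"\n", b"\r\n")


def convert_file(src_path: str, dst_path: str) -> None:
    """Read src_path as raw bytes and write a CRLF version to dst_path."""
    # Binary mode keeps Python from translating line endings on its own.
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(lf_to_crlf(data))
```

Calling `convert_file("out.txt", "out_crlf.txt")` on a Tesseract output file (both names hypothetical) should give a file that displays line breaks correctly in editors that expect Windows line endings.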

On Friday, April 5, 2013 at 4:20:21 PM UTC-5, Ruud van Houtum wrote:

Hello,

I am using Tesseract to output text files from scanned documents.
All text images contain typed text and are fairly clear/clean. So far Tesseract has a pretty good accuracy and I am quite content.

However Tesseract doesn't seem to recognize line breaks, and I was wondering if this is an available option or not?
