PATCH: output form feed control character between pages


dhara...@gmail.com

Jan 29, 2015, 6:05:44 PM
to tesser...@googlegroups.com
Attached is another very trivial patch that the project may find useful.

We have found that during post-processing of tesseract output text, it can be very helpful to have the form feed (page break) control character present at the end of a page.
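As a minimal sketch of the kind of post-processing this enables, the following Python splits text output back into pages. It assumes the output has already been read into a string and that each page ends with a form feed; `split_pages` is a hypothetical helper name, not part of tesseract.

```python
def split_pages(text: str) -> list[str]:
    """Split OCR text into pages on the form feed character (U+000C)."""
    pages = text.split("\f")
    # A form feed at the very end of the output leaves one empty
    # trailing chunk; drop it so it is not counted as a page.
    if pages and pages[-1] == "":
        pages.pop()
    return pages


# Illustrative input only; real input would come from the output text file.
sample = "first page\n\fsecond page\n\f"
for number, page in enumerate(split_pages(sample), start=1):
    print(number, repr(page))
```

Without the form feed there is no reliable marker in plain text output to split on, which is the motivation for the patch.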

This patch adds a configuration parameter called "include_formfeed_pagebreaks" which enables this behavior. It applies to TessTextRenderer only: hOCR and box output already seem to contain page number metadata, and I don't know what UNLV text is.

I'm also including a sample tiff image and the output with the parameter disabled (the default behavior) and enabled.

Thanks again.

David
109359.tiff
include_formfeed_pagebreaks.patch
output-default.txt
output-include-formfeeds.txt

Jan Ruzicka

Jan 30, 2015, 1:01:16 AM
to tesser...@googlegroups.com
Hi David,

Thanks for your contributions.

Can the page separator be configurable?
The line separator and the paragraph separator are.

Can the page separator also be defined as a string?
That would allow for text that is easy to post-process in a text editor that does not handle form feeds well, with "[PAGE BREAK HERE]\n" being an extreme example.
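To illustrate the idea, here is a rough Python sketch of a string-valued page separator. It is not tesseract's renderer code; `join_pages` and its `page_separator` argument are hypothetical names used only for this example, and the form feed default is an assumption.

```python
def join_pages(pages, page_separator="\f"):
    """Concatenate per-page text, appending the separator after every page."""
    return "".join(page + page_separator for page in pages)


pages = ["first page\n", "second page\n"]

# Default: a form feed after each page.
default_output = join_pages(pages)

# Any string works, including a marker that is easy to spot in a simple editor.
marked_output = join_pages(pages, page_separator="[PAGE BREAK HERE]\n")
print(marked_output)
```

A string parameter covers the true/false case as well: an empty separator is equivalent to disabling the feature.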

Thanks
Jan

dhara...@gmail.com

Feb 1, 2015, 3:27:33 PM
to tesser...@googlegroups.com
The line and paragraph separators are configurable at the API level, but there doesn't appear to be a command-line/config parameter for them. They also both live at the result iterator level. If I understand the code correctly, all of the iterator-related code deals with everything in the context of a single page, which is why I had to add the output to the renderer.

Are you suggesting I add a getter and setter to the renderer API for this?

Or are you suggesting that I adjust the config parameter to allow the value to be set to an arbitrary string, instead of just a true/false flag?

FWIW, I would think most text editors handle form feeds just fine; in fact, both emacs and vi support jumping between form feeds as a code navigation feature. Form feeds are also usually treated as whitespace, so even enabling the option should in theory not hurt most existing post-processing of the text. In any case, it certainly doesn't hurt to have it configurable; I am just wondering where the best place to put this would be.
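The whitespace point can be checked quickly in Python, used here only as one example of a post-processing environment. The sample text is made up for illustration.

```python
text = "last word of page one\n\ffirst word of page two\n"

# Python classifies the form feed (U+000C) as whitespace.
assert "\f".isspace()

# Whitespace-based tokenizing therefore never sees the form feed.
tokens = text.split()
print(tokens)

# Code that does care about pages can still count the breaks.
pages_seen = text.count("\f") + 1
print(pages_seen)
```

Tools that split on a fixed set of characters such as space and newline only would behave differently, so this holds for whitespace-aware processing rather than for every possible consumer.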

David

Jan Ruzicka

Feb 3, 2015, 2:49:04 AM
to tesser...@googlegroups.com
Hi David,
Sorry for the long delay in responding.
I have also realized that the line and paragraph separators don't have any external settings associated.
Of course, it happened 5 minutes after sending the e-mail.

With text editors, I'm concerned for poor souls doomed to eternal hells of notepad.

I don't know enough about the code and architecture to suggest a good place for the page separator settings.

Can more familiar contributors suggest a good solution/place?

Thank you
Jan

PS: I was considering a page start/end separator.
This additional complication would allow placing XML (e.g. <p></p>) around the page text.
The page start/end would take care of replacing the first and last tag in the case of using "</p><p>" as a single separator.
On the other hand, do additional page separators bring any significant advantages to justify this complication?
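A small Python sketch of the two options, to make the trade-off concrete. Both function names and all of their parameters are hypothetical; nothing here corresponds to an existing tesseract setting.

```python
def wrap_pages(pages, page_start="<p>", page_end="</p>"):
    """Separate start and end markers: every page is wrapped on its own."""
    return "".join(page_start + page + page_end for page in pages)


def wrap_with_single_separator(pages, separator="</p><p>", first="<p>", last="</p>"):
    """One separator between pages; the first and last tags are handled specially."""
    return first + separator.join(pages) + last


pages = ["page one text", "page two text"]
print(wrap_pages(pages))
print(wrap_with_single_separator(pages))
```

For a non-empty document both produce the same markup, so the start/end pair mostly saves the special handling of the first and last tag. Neither sketch escapes characters such as "<" inside the page text.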

Zdenko Podobný

Feb 7, 2015, 4:30:39 PM
to tesser...@googlegroups.com
I committed this and made the page separator configurable through the parameter page_separator.
So something like this can be used (by poor souls doomed to eternal hells of notepad):
    tesseract -c include_page_breaks=1 -c page_separator="[PAGE SEPARATOR]" 109359.tiff 109359