Page Breaks

asv...@gmail.com

Mar 12, 2016, 1:44:12 PM
to tesseract-ocr
If I OCR a multipage TIFF file using Tesseract, the output is a single .txt file with no page divisions. Is there a way to maintain the page breaks?
Thanks.

zdenko podobny

Mar 12, 2016, 1:55:38 PM
to tesser...@googlegroups.com
The default page separator is the form feed control character.
You can change it with the page_separator parameter.

Zdenko
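
For anyone post-processing the output, here is a minimal Python sketch of splitting the .txt file back into pages on the form feed character. It assumes the separator is written after each page and that your Tesseract build supports it; the function names and the output file naming are made up for illustration.

```python
from pathlib import Path

FORM_FEED = "\f"  # ASCII 12, the default page separator mentioned above


def split_pages(text, separator=FORM_FEED):
    """Split OCR output into one string per page."""
    pages = text.split(separator)
    # Assumption: the separator follows every page, so the chunk after
    # the last separator is empty (or whitespace) and can be dropped.
    if pages and pages[-1].strip() == "":
        pages.pop()
    return pages


def write_pages(text, out_dir, stem="page"):
    """Write each page to its own file (hypothetical naming scheme)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, page in enumerate(split_pages(text), start=1):
        path = out / f"{stem}_{number:03d}.txt"
        path.write_text(page, encoding="utf-8")
        paths.append(path)
    return paths
```

Pass a different `separator` to `split_pages` if you changed page_separator to a custom string.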

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/f68e3922-5868-4b88-827a-d75332b3f6e8%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

asv...@gmail.com

Mar 13, 2016, 1:02:20 AM
to tesseract-ocr
Thanks Zdenko. I'm still stuck. I OCR'd an 81-page TIFF file and searched the output .txt file for the form feed character (ASCII 12), but didn't find one. I'm using the Windows version of Tesseract 3.02. I also don't see a page_separator parameter in the command-line options. Do you know what I'm doing wrong?

zdenko podobny

Mar 13, 2016, 1:17:36 AM
to tesser...@googlegroups.com
You have a very old version of Tesseract.
page_separator was implemented after the 3.02 release.

Zdenko

Charles L Bunders

Aug 27, 2016, 2:25:38 AM
to tesseract-ocr
I am running tesseract 3.03 and I don't see this option available. Is there another way to do it via an option?

$ tesseract -v
tesseract 3.03
 leptonica-1.71
  libgif 4.1.6(?) : libjpeg 6b : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : libwebp 0.4.1 : libopenjp2 2.1.0

$ tesseract --print-parameters | grep -i 'page'
textord_show_page_cuts 0
tessedit_pageseg_mode 6
pageseg_devanagari_split_strategy 0
applybox_page 0
tessedit_page_number -1
tessedit_dump_pageseg_images 0

Thanks!

Quan Nguyen

Aug 27, 2016, 11:01:45 AM
to tesseract-ocr
tesseract -c include_page_breaks=1 -c page_separator="[PAGE SEPARATOR]" input.tiff output
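
If you want to run this from a script, a rough Python sketch of the same command might look like the following. It assumes a `tesseract` binary on your PATH that is new enough to know both parameters (they are not in 3.02); the helper names are hypothetical, and the arguments are in the same order as the command line above.

```python
import subprocess


def build_command(image, output_base, separator="[PAGE SEPARATOR]"):
    """Build the argument list for the command shown above."""
    return [
        "tesseract",
        "-c", "include_page_breaks=1",
        "-c", f"page_separator={separator}",
        str(image),
        str(output_base),
    ]


def run_ocr(image, output_base, separator="[PAGE SEPARATOR]"):
    """Run Tesseract; raises CalledProcessError if it exits non-zero."""
    # Passing a list (no shell) means the separator needs no quoting.
    subprocess.run(build_command(image, output_base, separator), check=True)
```

For example, `run_ocr("input.tiff", "output")` should write output.txt with the separator string after each page, provided the installed version supports these parameters.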

Zdenko Podobný

Aug 27, 2016, 11:26:24 AM
to tesser...@googlegroups.com
Try upgrading to Tesseract 3.04. Version 3.03 was never officially released.

Zdenko
