Page Breaks


asv...@gmail.com

12 Mar 2016, 13:44:12
to tesseract-ocr
If I OCR a multipage TIFF file with Tesseract, the output is one .txt file with no page divisions. Is there a way to preserve the page breaks?
Thanks.

zdenko podobny

12 Mar 2016, 13:55:38
to tesser...@googlegroups.com
The default page separator is the form feed control character.
You can change it with the page_separator parameter.
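
To illustrate how a form feed separator can be used downstream, here is a minimal Python sketch that splits OCR output back into pages. The function name `split_pages` is hypothetical, and the sketch assumes the separator is written after each page, so a trailing empty chunk is discarded.

```python
FORM_FEED = "\f"  # ASCII 12, the default page separator mentioned above


def split_pages(text, separator=FORM_FEED):
    """Split OCR output text into a list of per-page strings.

    Assumption: the separator follows each page, so the final
    chunk after the last separator is empty and can be dropped.
    """
    pages = text.split(separator)
    # Drop the empty (or whitespace-only) chunk left after the last separator.
    if pages and pages[-1].strip() == "":
        pages.pop()
    return pages


if __name__ == "__main__":
    sample = "first page\fsecond page\f"
    for number, page in enumerate(split_pages(sample), start=1):
        print(number, page)
```

If a custom page_separator is configured, the same string can be passed as the `separator` argument.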

Zdenko


--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/f68e3922-5868-4b88-827a-d75332b3f6e8%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

asv...@gmail.com

13 Mar 2016, 1:02:20
to tesseract-ocr
Thanks, Zdenko. I'm still stuck. I OCR'd an 81-page TIFF file and searched the output .txt file for the form feed character (ASCII 12), but didn't find one. I have the Windows version of Tesseract 3.02. I also don't see a page_separator parameter in the command-line options. Do you know what I'm doing wrong?
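
For anyone repeating this check, a text editor's search can miss control characters, so counting the raw bytes is a more reliable test. This is a minimal Python sketch; the helper name `count_form_feeds` and the file name `output.txt` are placeholders, not anything Tesseract defines.

```python
def count_form_feeds(data: bytes) -> int:
    """Return how many form feed bytes (ASCII 12, 0x0C) appear in data."""
    return data.count(b"\x0c")


if __name__ == "__main__":
    # "output.txt" is a placeholder for the file Tesseract produced.
    with open("output.txt", "rb") as handle:
        print(count_form_feeds(handle.read()))
```

A count of zero for a multipage input means no form feed separators were written to the file.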

zdenko podobny

13 Mar 2016, 1:17:36
to tesser...@googlegroups.com
You have a very old version of Tesseract.
The page_separator parameter was implemented after the 3.02 release.

Zdenko


Charles L Bunders

27 Aug 2016, 2:25:38
to tesseract-ocr
I am running tesseract 3.03 and I don't see this option available. Is there another way to do it via an option?

$ tesseract -v
tesseract 3.03
 leptonica-1.71
  libgif 4.1.6(?) : libjpeg 6b : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : libwebp 0.4.1 : libopenjp2 2.1.0

$ tesseract --print-parameters | grep -i 'page'
textord_show_page_cuts 0
tessedit_pageseg_mode 6
pageseg_devanagari_split_strategy 0
applybox_page 0
tessedit_page_number -1
tessedit_dump_pageseg_images 0

Thanks!

Quan Nguyen

27 Aug 2016, 11:01:45
to tesseract-ocr
tesseract -c include_page_breaks=1 -c page_separator="[PAGE SEPARATOR]" input.tiff output
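
The same invocation can be scripted. The following is a rough Python sketch that assumes a `tesseract` binary new enough to support these parameters is on the PATH; the function names `build_tesseract_command` and `ocr_with_page_breaks` are hypothetical, and only the two `-c` settings come from the command above.

```python
import subprocess


def build_tesseract_command(image_path, output_base, separator="[PAGE SEPARATOR]"):
    """Build the argument list for the command shown above.

    Passing a list avoids shell quoting, so the separator needs no quotes.
    """
    return [
        "tesseract",
        "-c", "include_page_breaks=1",
        "-c", f"page_separator={separator}",
        image_path,
        output_base,
    ]


def ocr_with_page_breaks(image_path, output_base, separator="[PAGE SEPARATOR]"):
    """Run Tesseract and return the text it wrote to <output_base>.txt."""
    subprocess.run(
        build_tesseract_command(image_path, output_base, separator),
        check=True,  # raise CalledProcessError if tesseract exits with an error
    )
    with open(output_base + ".txt", encoding="utf-8") as handle:
        return handle.read()
```

Calling `ocr_with_page_breaks("input.tiff", "output")` would write output.txt and return its contents, which can then be split on the chosen separator string.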

Zdenko Podobný

27 Aug 2016, 11:26:24
to tesser...@googlegroups.com
Try upgrading to Tesseract 3.04. Version 3.03 was never officially released.

Zdenko
