How to OCR a Multipage TIFF?


Pedro Correia

Feb 16, 2017, 11:18:39 AM
to ocropus
Hi there, I've read that multipage TIFF support has been available since v0.4.1.
Currently, I need OCRopus to run on a multipage TIFF (a book) and output a single hOCR file containing the whole book's text. However, when I run it, the output is the OCR of the first page only; the other pages are simply ignored.
Is there an argument or option I can use to tell OCRopus that the input is a multipage TIFF and not a regular TIFF file?
Thanks in advance,
Pedro

Pedro Correia

Feb 16, 2017, 11:20:38 AM
to ocropus
PS: I can't afford to split the multipage TIFF into several TIFF files, because my ground truth is a single txt file.

Philipp Zumstein

Feb 16, 2017, 2:37:30 PM
to ocr...@googlegroups.com
I think this option is no longer supported. By the way, where did you read that?

However, it should be possible to achieve your goals with commands like these:

convert multipage.tiff page.png
ocropus-nlbin page*.png
ocropus-gpageseg page*.bin.png
ocropus-rpred page*/*.bin.png
ocropus-hocr page*.bin.png -o book.html
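
The steps above can be sketched as a small shell script. This is a minimal sketch, not a tested recipe for any particular ocropy version. It assumes ImageMagick's convert and the ocropy command-line tools are on the PATH. The function name ocr_book, the DRY_RUN switch, and the book.html output name are hypothetical and only there for illustration. Page numbers are zero-padded (%04d) so that the shell globs keep the pages in order once the book has more than ten pages. The final step passes the binarized page images to ocropus-hocr, following the usage shown in the ocropy README.

```shell
#!/bin/sh
# Sketch only: ocr_book and DRY_RUN are hypothetical names, not part of ocropy.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# ocr_book TIFF PREFIX OUTPUT
# Split a multipage TIFF into pages and OCR all of them into one hOCR file.
ocr_book() {
    tiff=$1
    prefix=$2
    output=$3

    # One PNG per TIFF page; %04d keeps glob order equal to page order.
    run convert "$tiff" "$prefix-%04d.png" || return 1
    # Binarize each page (expected to write PREFIX-NNNN.bin.png next to the input).
    run ocropus-nlbin "$prefix"-????.png || return 1
    # Segment each page into text lines (one directory per page).
    run ocropus-gpageseg "$prefix"-????.bin.png || return 1
    # Recognize every line image with the default model.
    run ocropus-rpred "$prefix"-????/??????.bin.png || return 1
    # Collect all pages, in glob order, into a single hOCR file.
    run ocropus-hocr "$prefix"-????.bin.png -o "$output"
}
```

Running `DRY_RUN=1 ocr_book multipage.tiff page book.html` in an empty directory prints the five commands without executing them, which is a cheap way to check the file patterns before starting a long OCR run.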



--
You received this message because you are subscribed to the Google Groups "ocropus" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ocropus+unsubscribe@googlegroups.com.
To post to this group, send email to ocr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/ocropus/ba79be37-9332-4ce8-b6f8-16821bd47e32%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Pedro Correia

Feb 22, 2017, 8:43:04 AM
to ocropus
Dear Philipp,
as I said, I can't afford to convert the multipage TIFF into several images, because my ground truth is a single txt file. Is there really no way to apply OCRopus to a multipage TIFF?


Here, Tom refers to the "Subversion version of ocropus", which could supposedly work on multipage files: https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!searchin/ocropus/multipage|sort:relevance/ocropus/OcvP0Z2tFj4/IW_3Wt3WFpoJ
However, I couldn't find it anywhere to download.



Philipp Zumstein

Feb 22, 2017, 1:29:05 PM
to ocr...@googlegroups.com
The steps I outlined will produce ONE hOCR file in the end, containing all your pages. Did you try it out? Afterwards you can, for example, use something like hocr-eval-lines (see https://github.com/tmbdev/hocr-tools#hocr-eval-lines) for comparison.
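
A minimal sketch of that comparison step, assuming hocr-tools is installed and that the ground-truth txt file has one text line per recognized line (hocr-eval-lines expects the line breaks in the text file to agree with the ocr_line elements in the hOCR file). The function name eval_book and the file names in the usage note are hypothetical.

```shell
#!/bin/sh
# Sketch only: eval_book is a hypothetical helper, not part of hocr-tools.

# eval_book GROUNDTRUTH_TXT HOCR_FILE
# Compare a single ground-truth text file against a single hOCR file.
eval_book() {
    truth=$1
    hocr=$2

    # Fail early with a clear message if either input is missing.
    for f in "$truth" "$hocr"; do
        if [ ! -f "$f" ]; then
            printf 'eval_book: no such file: %s\n' "$f" >&2
            return 1
        fi
    done

    # Usage per the hocr-tools README: hocr-eval-lines [-v] true-lines.txt hocr-actual.html
    hocr-eval-lines -v "$truth" "$hocr"
}
```

It would be called as `eval_book groundtruth.txt book.html`, so the ground truth can stay in one file as long as its line breaks match the OCR output.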

By the way, you can find some of the old versions linked here: https://github.com/tmbdev/ocropy/wiki/Older-versions, but I don't think you have to use 7-year-old versions for your task.
