Tesseract max pages while OCRing?


Nikolai Velkov

Nov 10, 2017, 7:47:31 AM
to tesseract-ocr
We're using Tesseract 3.0.5 to OCR PDF files (from Java, via Tess4J). When we OCR a PDF with 1000+ pages, Tesseract gets to page 999 and then stops, with no error or anything. It's not about file size either: I tested it with a PDF of 1000+ pages containing only the letter 'A' on each page, and that file is about 2.3 MB. Is there a configuration setting that specifies a maximum number of pages to OCR?

Quan Nguyen

Nov 13, 2017, 9:47:56 AM
to tesseract-ocr
The GhostScript-based PDF module in Tess4J caps the page count at 999. The assumption was that users would never go beyond that, since loading even a few hundred full-size 300-DPI page images into memory would already cause out-of-memory exceptions.
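For anyone stuck with such a cap, one possible workaround is to process the document in page ranges that each stay below the limit, which also keeps the number of page images in memory bounded. The sketch below only computes the ranges; `PageBatcher`, `PageRange`, and `split` are hypothetical names made up for illustration and are not part of Tess4J. Splitting the PDF by range and running OCR on each piece are left to whatever tools you already use.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Tess4J): splits a page count into
// consecutive ranges so each batch stays under a per-run page limit.
public class PageBatcher {

    /** Inclusive, 1-based page range. */
    public static final class PageRange {
        public final int first;
        public final int last;

        PageRange(int first, int last) {
            this.first = first;
            this.last = last;
        }

        @Override
        public String toString() {
            return first + "-" + last;
        }
    }

    /**
     * Returns consecutive ranges covering pages 1..totalPages,
     * each containing at most maxPerBatch pages.
     */
    public static List<PageRange> split(int totalPages, int maxPerBatch) {
        if (totalPages < 0 || maxPerBatch <= 0) {
            throw new IllegalArgumentException("totalPages must be >= 0 and maxPerBatch > 0");
        }
        List<PageRange> ranges = new ArrayList<>();
        for (int first = 1; first <= totalPages; first += maxPerBatch) {
            int last = Math.min(first + maxPerBatch - 1, totalPages);
            ranges.add(new PageRange(first, last));
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Example: a 1200-page document processed 500 pages at a time.
        for (PageRange r : split(1200, 500)) {
            System.out.println(r); // prints 1-500, 501-1000, 1001-1200 on separate lines
        }
    }
}
```

A batch size well below 999 is probably the safer choice here, given the memory concern above.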

Nikolai Velkov

Nov 15, 2017, 2:43:08 AM
to tesseract-ocr
So is there a fix for that?

Quan Nguyen

Nov 15, 2017, 10:09:42 AM
to tesseract-ocr
Try the latest version, 3.4.2.

Nikolai Velkov

Nov 16, 2017, 4:41:17 AM
to tesseract-ocr
We are using 3.5.x

ShreeDevi Kumar

Nov 16, 2017, 4:47:35 AM
to tesser...@googlegroups.com
I think Quan is referring to the Tess4J version:

Version 3.4.2 (14 November 2017)
- Update Lept4J to 1.6.2
- Update GhostScript to 9.22
- Improve handling of PDF files in multi-threaded environment
- Lift limits on number of pages in PDF
- Use TESSDATA_PREFIX environment variable by default, if defined


ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
