Parallel processing of PDF - tess4j

iShahad thobaiti

Aug 2, 2017, 3:07:02 AM

to tesseract-ocr

Hello,
I'm trying to parallelize the OCR process because I have a lot of PDF documents. I used this sample code as a guide:
https://sourceforge.net/p/tess4j/discussion/1202293/thread/4562eccb/
It works for PNG images but not for PDF files.
Is it possible to parallelize OCR for PDFs?

The error I'm getting:
A fatal error has been detected by the Java Runtime Environment:
SIGSEGV (0xb) at pc=0x0000000124482bc9, pid=36535, tid=0x0000000000001c03
JRE version: Java(TM) SE Runtime Environment (8.0_121-b13) (build 1.8.0_121-b13)
Java VM: Java HotSpot(TM) 64-Bit Server VM (25.121-b13 mixed mode bsd-amd64 compressed oops)
Problematic frame:
C [libgs.dylib+0x3afbc9] copy_error_string+0xd
Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
An error report file with more information is saved as:
/Users/iShahad/NetBeansProjects/OCR/hs_err_pid36535.log
If you would like to submit a bug report, please visit:
http://bugreport.java.com/bugreport/crash.jsp
The crash happened outside the Java Virtual Machine in native code.
See problematic frame for where to report the bug.

I noticed that if the number of PDF files equals the number of threads, they are processed with no errors.

But when I add more files, I get the error :|

One solution is to convert all the PDFs to PNG images and then parallelize over those.

I don't want to do that; it's not a practical solution.

I want to understand why the PDF files can't be processed in parallel the way the PNG images can.

Is there a way to overcome this, other than converting the PDFs to PNGs? :(
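
One possible direction, sketched under an assumption: the problematic frame is in libgs.dylib, which suggests the crash comes from several threads calling into Ghostscript at once during the PDF-to-image step. If that is the cause, it may be enough to serialize only that step behind a lock and keep the OCR step parallel. In the sketch below, `rasterize` and `ocr` are hypothetical placeholders passed in by the caller; they are not Tess4J APIs, and the class name is made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class PdfOcrSketch {
    // One lock shared by all workers: only one thread at a time may run the
    // PDF-to-image step (assumed here to be the non-thread-safe part).
    private static final Object RASTER_LOCK = new Object();

    // rasterize: hypothetical function, PDF path -> image path.
    // ocr: hypothetical function, image path -> recognized text.
    // Results are returned in the same order as pdfPaths.
    static List<String> processAll(List<String> pdfPaths, int threads,
                                   Function<String, String> rasterize,
                                   Function<String, String> ocr) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String path : pdfPaths) {
                futures.add(pool.submit(() -> {
                    String image;
                    synchronized (RASTER_LOCK) {
                        // Serialized: at most one conversion runs at a time.
                        image = rasterize.apply(path);
                    }
                    // Not serialized: OCR of different files can overlap.
                    return ocr.apply(image);
                }));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                results.add(f.get()); // waits for each task in submission order
            }
            return results;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

This would still convert each PDF to an image internally, but per file and inside the worker, so there is no separate up-front conversion pass. Whether it avoids the SIGSEGV depends on the assumption about Ghostscript being correct.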
