Parallel processing of PDF - tess4j


iShahad thobaiti

Aug 2, 2017, 3:07:02 AM
to tesseract-ocr

Hello, 

I'm trying to parallelize the OCR process because I have a lot of PDF documents, and I'm using this sample code as a guide:

https://sourceforge.net/p/tess4j/discussion/1202293/thread/4562eccb/

It works for PNG images but not for PDF files.

Is it possible to parallelize OCR for PDFs?

The error I'm getting:

A fatal error has been detected by the Java Runtime Environment:

SIGSEGV (0xb) at pc=0x0000000124482bc9, pid=36535, tid=0x0000000000001c03

JRE version: Java(TM) SE Runtime Environment (8.0_121-b13) (build 1.8.0_121-b13)

Java VM: Java HotSpot(TM) 64-Bit Server VM (25.121-b13 mixed mode bsd-amd64 compressed oops)

Problematic frame:

C  [libgs.dylib+0x3afbc9] copy_error_string+0xd

Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again

An error report file with more information is saved as:

/Users/iShahad/NetBeansProjects/OCR/hs_err_pid36535.log

If you would like to submit a bug report, please visit:

http://bugreport.java.com/bugreport/crash.jsp

The crash happened outside the Java Virtual Machine in native code.

See problematic frame for where to report the bug.

 

I noticed that if the number of PDF files equals the number of threads, they are processed with no errors.
But when I add more files, I get the error above :|

One solution is to convert all the PDFs to PNG images and then parallelize over them.
I don't want to do that; it's not a practical solution.

I want to understand why the PDF files can't be processed in parallel the way the PNG images can.
Is there a way to overcome this, other than converting the PDFs to PNG? :(
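
For illustration, here is a minimal sketch of a middle ground between the two approaches above. It assumes the crash comes from several threads calling into Ghostscript at the same time (the problematic frame is in libgs.dylib), which has not been confirmed. Only the PDF rendering step is serialized behind a lock, while the OCR step still runs in parallel. The `render` and `recognize` parameters are hypothetical placeholders for whatever PDF-to-image and OCR calls are actually used. They are not Tess4J APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

final class SerializedPdfOcr {
    // One lock shared by all workers: only one thread at a time may be
    // inside the PDF renderer (assumed here to be non-thread-safe).
    private static final Object RENDER_LOCK = new Object();

    /**
     * @param render    hypothetical placeholder: turns one PDF path into page images
     * @param recognize hypothetical placeholder: runs OCR on one page image
     * @return one text result per input PDF, in input order
     */
    static <P> List<String> ocrAll(List<String> pdfPaths,
                                   Function<String, List<P>> render,
                                   Function<P, String> recognize,
                                   int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String path : pdfPaths) {
                futures.add(pool.submit(() -> {
                    List<P> pages;
                    // Rendering is serialized across all worker threads.
                    synchronized (RENDER_LOCK) {
                        pages = render.apply(path);
                    }
                    // OCR runs outside the lock, so it stays parallel.
                    StringBuilder text = new StringBuilder();
                    for (P page : pages) {
                        text.append(recognize.apply(page));
                    }
                    return text.toString();
                }));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                results.add(f.get()); // rethrows any worker failure
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

If the renderer is in fact the bottleneck, this would gain little over a single thread. It only helps when the OCR step dominates the run time. It also cannot prevent a crash whose cause is something other than concurrent access to the renderer.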


 