Multithreading Tesseract process issues with .NET Core in Ubuntu

Lucas L.

Mar 4, 2020, 10:26:24 AM
to tesseract-ocr
Hello, I apologize in advance if this is the wrong place to post this. It is Tesseract-related, but the fault may lie more with .NET than with Tesseract. However, I have found almost no one else with this particular issue, and I'm running out of options.


I will summarize to avoid copy-pasting everything from my Stack Overflow (SO) post. We run Tesseract 4.00 on an Ubuntu 18.04 VM, calling it as an external process from a .NET Core 2.1 application (I also tried upgrading to 3.1, but that made no apparent difference). I am aware of the "OMP_THREAD_LIMIT" variable, but we want to process multiple pages of a split document file at once, so we call Tesseract from multiple threads (currently with a degree of parallelism of 8). This caused no issues in the past, but I have recently been making changes to reduce the number of disk reads/writes in the service, and now it crashes at random points while processing a file with the message "Error while reaping child". The stack trace is in the SO post. Occasionally a file gets through without the crash, but usually it occurs, and it is more likely on larger files since Tesseract has to be run more times. It can occur at the very start of processing a document or at the very end.
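For readers unfamiliar with this kind of setup, here is a minimal sketch of what it might look like. It is not the actual service code: the class and method names (`OcrRunner`, `Run`, `OcrPages`) and the output naming are hypothetical, and setting `OMP_THREAD_LIMIT` to 1 for each child is an assumption about how one might keep each Tesseract process single-threaded, not something the service is known to do.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

public static class OcrRunner
{
    // Runs one external command to completion and returns its exit code.
    public static int Run(string fileName, string arguments)
    {
        var psi = new ProcessStartInfo
        {
            FileName = fileName,
            Arguments = arguments,
            UseShellExecute = false,       // required to set environment variables and redirect streams
            RedirectStandardError = true,
        };
        // Assumption: limit each child to one OpenMP thread, so that
        // parallelism comes only from running several processes at once.
        psi.Environment["OMP_THREAD_LIMIT"] = "1";

        using (var process = Process.Start(psi))
        {
            // Drain stderr before waiting so a full pipe cannot block the child.
            string errors = process.StandardError.ReadToEnd();
            process.WaitForExit();
            if (process.ExitCode != 0)
            {
                Console.Error.WriteLine(errors);
            }
            return process.ExitCode;
        }
    }

    // Runs Tesseract on every page image, at most 8 pages at a time.
    public static void OcrPages(string[] pageImages)
    {
        var options = new ParallelOptions { MaxDegreeOfParallelism = 8 };
        Parallel.For(0, pageImages.Length, options, i =>
        {
            // "tesseract <image> <outputbase>" writes <outputbase>.txt
            string outputBase = pageImages[i] + ".ocr"; // hypothetical naming
            Run("tesseract", $"\"{pageImages[i]}\" \"{outputBase}\"");
        });
    }
}
```

Each call to `Run` owns its `Process` object and disposes it after `WaitForExit` returns, so no process handle is shared between threads in this sketch.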

I have tried the prerelease of the API wrapper at https://github.com/charlesw/tesseract, which uses a recent version of Tesseract, but it does not seem to handle multithreading very well. I could simply be using it wrong, but it does not let me process multiple pages simultaneously without disposing the first page.

It seems like an issue with how the Process class in .NET cleans up child resources when a process ends. Tesseract is a child process of the dotnet process when it is called. However, I'm really not sure what I can do to make .NET clean up the children without throwing an error. I was reading the .NET Core source code, and it mentions that a global lock must be taken in order to add/remove process references (https://github.com/dotnet/runtime/blob/master/src/libraries/System.Diagnostics.Process/src/System/Diagnostics/ProcessWaitState.Unix.cs). I'm wondering whether some interaction between multithreading, possibly the GC, and this global ref table is causing the issue.

Lucas L.

Mar 4, 2020, 12:36:16 PM
to tesseract-ocr
OK, so I have been testing this with different files. What I have noticed is that even with extremely small files, such as a 224 KB test PDF containing 3 blank pages, processing the file for OCR still takes 31 seconds. It seems almost as if the Tesseract processes are deadlocked for an extended period before they can execute (or possibly after they execute, while they are trying to close). In the older production version of the code, we frequently see small files that take only 6 seconds or less. Again, as far as I can tell, the way I call Tesseract as an external process hasn't really changed between the old and new versions of the code, aside from the fact that I now call Parallel.For instead of Parallel.ForEach. I checked the .NET source code and it seems Parallel.ForEach resolves to the same worker method as Parallel.For, so I doubt that is the issue.
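One way to narrow down where the time goes is to time the launch and the full lifetime of each child process separately. The following is only a sketch under that assumption; `ProcessTiming` and `Measure` are hypothetical names, not part of the service or of any library.

```csharp
using System;
using System.Diagnostics;

public static class ProcessTiming
{
    // Returns the time spent inside Process.Start and the total time
    // from just before the launch until WaitForExit returned.
    public static (TimeSpan Launch, TimeSpan Total) Measure(string fileName, string arguments)
    {
        var psi = new ProcessStartInfo
        {
            FileName = fileName,
            Arguments = arguments,
            UseShellExecute = false,
        };

        var clock = Stopwatch.StartNew();
        using (var process = Process.Start(psi))
        {
            TimeSpan launch = clock.Elapsed; // time taken to spawn the child
            process.WaitForExit();           // returns once the child has exited
            return (launch, clock.Elapsed);
        }
    }
}
```

Logging both values for each page from inside the parallel loop would show whether the delay sits in spawning the process (a large `Launch`) or between the launch and `WaitForExit` returning (a `Total` far above what the same Tesseract command takes when run by hand).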