Support for multithreaded build with CMake doesn't work


Krzysztof J

May 8, 2022, 8:24:03 AM
to tesseract-ocr

I have a few problems and questions:

1) Question 1: While preparing the build, I noticed that the "OPENMP_BUILD" setting is not included when building the solution; see below:

configuration_tesseract.png

Can anyone say more about this? Is using multiprocessing recommended at the moment? What is its current state? I only saw #1662, where it was turned off, but that was 4 years ago :o

2) Question 2: Are there other ways to take advantage of multithreading in Tesseract 5.1.0 besides OpenMP? Does anyone have experience with this? For now I am working on 1 thread, but ultimately I would like to switch to multiple threads.

Zdenko Podobny

May 9, 2022, 12:44:05 PM
to tesser...@googlegroups.com
Hello,

1) Search the issue tracker for OpenMP[1] reports for more details. Experiences differ. For me, it does not seem to help on Linux (and Mac?); it just consumes CPU. My experience[2] is that it helps on Windows, but that may depend on the hardware and software configuration. To be on the safe side, OpenMP is turned off by default, so anyone who turns it on is responsible for the consequences ;-)

2) I ran some tests with multithreading in tesserocr (Python) and it did not work for me; it worked only with 1 thread. (I never use multithreading, so maybe the problem is on my side.)

Anyway, any expertise and contribution in this area (OpenMP) is warmly welcomed.


Zdenko


On Sun, May 8, 2022 at 14:24, Krzysztof J <k.jerz...@gmail.com> wrote:

Krzysztof J

May 19, 2022, 7:39:30 AM
to tesseract-ocr
Hello zdenop,

My idea is to use multithreading for multi-page TIFFs, e.g. one containing 30 pages. Currently Tesseract works on 1 thread and does not use its full potential on Windows. We could take, for example, half of the available system threads and automatically assign subsets of the TIFF's pages to them as independent images. The results would be collected into one structure, for example a map of results. This could be implemented at the level of a wrapper class prepared for communication with the OCR engine. An example functional diagram for a 4-core processor is shown in the schema below. Is running several Tesseract OCR instances simultaneously a good direction?

Test.png
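
The wrapper idea described above could look roughly like the following Python sketch. It is only an illustration under stated assumptions: `ocr_page` is a hypothetical placeholder for whatever call actually recognizes one page, and `ocr_document` is a made-up name, not part of Tesseract or any binding. The sketch takes half of the available logical CPUs as workers and collects the results into a map keyed by page index.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def ocr_page(page_index):
    """Hypothetical placeholder for the real OCR call on a single page."""
    return f"text of page {page_index}"


def ocr_document(page_count, ocr=ocr_page, workers=None):
    """Run `ocr` on pages 0..page_count-1 in parallel; return {page_index: text}."""
    if workers is None:
        # Half of the available logical CPUs, as proposed above, but at least one.
        workers = max(1, (os.cpu_count() or 2) // 2)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so they line up with page indices.
        texts = pool.map(ocr, range(page_count))
        return dict(zip(range(page_count), texts))


results = ocr_document(30)
```

Whether threads actually run in parallel depends on the real `ocr_page`: the sketch assumes each worker uses its own engine instance or launches its own process rather than sharing one engine object across threads.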

Zdenko Podobny

May 20, 2022, 5:53:42 AM
to tesser...@googlegroups.com
The best way would be to try it ;-)
AFAIR there were similar approaches, e.g. [1], [2], [3]. IMHO using GNU Parallel was quite popular; a search for "tesseract parallel" returns about 1.28 million results on Google. But please be aware of this open issue[4]...
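
A minimal sketch of the "several independent instances" approach, in the spirit of the GNU Parallel setups mentioned above but written in Python: one `tesseract` process per image file, started from a small thread pool. The helper names (`build_command`, `build_env`, `ocr_file`, `ocr_files`) are hypothetical. Setting the standard OpenMP variable `OMP_THREAD_LIMIT` to 1 is an assumption of this sketch, not something from this thread; it only matters for builds with OpenMP enabled and is meant to keep parallel processes from competing for cores.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_command(image_path):
    # "tesseract <image> stdout" prints the recognized text to standard output.
    return ["tesseract", image_path, "stdout"]


def build_env(base_env=None):
    env = dict(os.environ if base_env is None else base_env)
    # Assumption: cap OpenMP at one thread per process (relevant only to OpenMP builds).
    env["OMP_THREAD_LIMIT"] = "1"
    return env


def ocr_file(image_path):
    # Runs one tesseract process; raises CalledProcessError if it exits with an error.
    done = subprocess.run(build_command(image_path), env=build_env(),
                          capture_output=True, text=True, check=True)
    return done.stdout


def ocr_files(image_paths, workers=4, run=ocr_file):
    paths = list(image_paths)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each worker thread just waits on its own child process.
        return dict(zip(paths, pool.map(run, paths)))


# Example (requires tesseract on PATH): ocr_files(["page1.png", "page2.png"])
```

Because each page is handled by a separate process, no engine state is shared between workers; the cost is that every process loads its language data again.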


Zdenko


On Thu, May 19, 2022 at 13:39, Krzysztof J <k.jerz...@gmail.com> wrote: