Hello,
I was planning to post here to ask for help, but I've recently figured out the problem. I'm posting for anyone who runs into this in the future, since it took me a few days of trial and error to figure out (and Python multiprocessing constantly finds ways to present a new challenge).
Background: I'm running a massive data transformation in Python using multiprocessing on 48-96 CPU AWS EC2 machines. My goal is to use pytesseract to transform millions of images into OCR data for machine learning model training. Each process receives its own batch of images, and the plan is that they then spend a few days OCRing. A couple of notable items: I'm using this environment setting as recommended: os.environ["OMP_THREAD_LIMIT"] = "1", and I'm also setting multiprocessing.pool.Pool's maxtasksperchild to ~1000 in hopes of keeping the process environment clean as it runs pytesseract.image_to_data() over many images.
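For anyone who wants to see roughly what that setup looks like, here is a minimal sketch. It is not my exact script: the names chunked, ocr_batch and run, and the batch_size default, are made up for illustration, and it assumes pytesseract and Pillow are installed. One thing to keep in mind is that maxtasksperchild counts tasks, so in this sketch a worker is recycled after ~1000 batches, not ~1000 images.

```python
import os

# Set before any OCR worker starts so each tesseract call stays single-threaded.
os.environ["OMP_THREAD_LIMIT"] = "1"

from multiprocessing import Pool


def chunked(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def ocr_batch(paths):
    """Hypothetical worker: OCR one batch of image paths."""
    # Imported inside the worker; assumes pytesseract and Pillow are installed.
    import pytesseract
    from PIL import Image

    results = []
    for path in paths:
        with Image.open(path) as img:
            results.append(pytesseract.image_to_data(img))
    return results


def run(image_paths, workers=48, batch_size=100):
    """Yield OCR results batch by batch as workers finish them."""
    # maxtasksperchild recycles each worker process after that many tasks
    # (one task is one batch here).
    with Pool(processes=workers, maxtasksperchild=1000) as pool:
        batches = chunked(image_paths, batch_size)
        for batch_result in pool.imap_unordered(ocr_batch, batches):
            yield batch_result
```

On platforms that spawn worker processes, call run() from under an if __name__ == "__main__": guard.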
Problem: After kicking off the script, I watched CPU utilization and was baffled by the left two sessions on the chart below, where CPU utilization slowly drops off over time until I kill the job to troubleshoot. At first glance it was as if tesseract was getting tired and just slowing down over time.
Solution: After much trial and error, I finally figured out that my temp directory (/tmp/) was getting so full (100k+ files) that it seemed to cause some IO overhead somewhere (I guess?), maybe when writing the files locally to OCR, which slowly tanked my CPU utilization over time. The solution is to periodically run pytesseract.pytesseract.cleanup("/tmp/tess*") throughout my script so that this directory stays reasonably sized. In the bottom right session on the chart, you can see how this improved CPU utilization (albeit temporarily) until the directory started to fill again.
I think Linux normally only cleans temp directories after some number of days, which is too slow for a job like this, so it is necessary to do the cleanup periodically in the script.
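If you would rather not call pytesseract's helper directly, the same idea can be sketched with only the standard library. This is a swapped-in alternative, not what I ran: the function name clean_tess_temp and the min_age_seconds threshold are assumptions for illustration. The age check is there so files another worker may still be using are left alone; the "tess*" pattern is the one from above.

```python
import glob
import os
import tempfile
import time


def clean_tess_temp(tmp_dir=None, pattern="tess*", min_age_seconds=600):
    """Delete stale files matching `pattern` in tmp_dir; return the number removed."""
    tmp_dir = tmp_dir or tempfile.gettempdir()
    cutoff = time.time() - min_age_seconds
    removed = 0
    for path in glob.glob(os.path.join(tmp_dir, pattern)):
        try:
            # Skip recent files: another worker may still be using them.
            if os.path.isfile(path) and os.path.getmtime(path) < cutoff:
                os.remove(path)
                removed += 1
        except OSError:
            # The file vanished or could not be removed; move on.
            pass
    return removed
```

Calling clean_tess_temp() every few hundred images from the parent process (or from a timer) should keep the directory from growing without bound.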
Hope this solves the issue for someone in the future. Happy OCRing everyone!
