Hi,
I'm running OCR on a Windows machine using the recent Mannheim 5.0 alpha builds of Tesseract, outputting to hOCR. When I run it on a job with a few hundred pages, CPU usage hovers around 10% (one thread), and memory and GPU usage barely change.
Now, I could split the job by pages, run the pieces in parallel (or across multiple machines), and then write a small script to combine the separate hOCR outputs, but I can't help wondering whether there is a better way. Is there some intermediate format I can get from Tesseract for each piece, and then feed all of them into one hOCR file directly?
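To be concrete about the "little script" option, the combining step I have in mind would look roughly like the Python sketch below. It is only a sketch under some assumptions: each chunk is well-formed XHTML hOCR whose pages are `div` elements with class `ocr_page`, the function name `merge_hocr` is just something I made up, and it does not renumber element ids, so ids such as `page_1` may repeat across chunks.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def merge_hocr(chunks):
    """Merge several hOCR documents (bytes) into one hOCR string.

    Assumption: every chunk is well-formed XHTML. The first chunk supplies
    the <head>; the ocr_page divs of the later chunks are appended to its
    <body> in the order given. The DOCTYPE and XML declaration are not
    carried over, and element ids are left untouched.
    """
    # Serialize XHTML elements without an "ns0:" prefix.
    ET.register_namespace("", XHTML_NS)

    base = ET.fromstring(chunks[0])
    body = base.find(f"{{{XHTML_NS}}}body")
    if body is None:
        # Fall back to documents that carry no XHTML namespace.
        body = base.find("body")
    if body is None:
        raise ValueError("first chunk has no <body> element")

    for chunk in chunks[1:]:
        root = ET.fromstring(chunk)
        for elem in root.iter():
            if elem.get("class") == "ocr_page":
                body.append(elem)

    return ET.tostring(base, encoding="unicode")
```

I would call it with the per-chunk files read as bytes, in page order, for example `merge_hocr([p.read_bytes() for p in sorted(pathlib.Path("out").glob("*.hocr"))])`, where the `out` directory and file naming are placeholders.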
Thanks!