Hi,
Just a few vague suggestions based on experience running it on a cluster
(ymmv):
* A ramdisk could help reduce wear on SSDs, but I don't think it will
matter much for processing speed; the majority of time is not spent in
I/O if you use SSDs
* Run Tesseract with only one thread per process (disable OpenMP) and
run one process per core to get the most out of your CPUs - this will
maximise your throughput
* The average peak RAM (max over the process lifetime) of Tesseract (at
archive.org) is about 100MB, with occasional spikes to 2GB of RAM
(likely for big images/newspapers) - the 90th percentile is about
200MB-300MB.
* The average OCR time per page (at archive.org) is about 7.5 seconds,
but we have a lot of old CPU cores mixed in (some only have SSE2).
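
A minimal sketch of the single-thread setup from the list above, assuming
your Tesseract build uses OpenMP and so honours the standard OpenMP
environment variable OMP_THREAD_LIMIT. The helper names
(single_thread_env, ocr_page) are made up for illustration:

```python
import os
import subprocess


def single_thread_env(base_env=None):
    """Return a copy of the environment with OpenMP limited to one thread."""
    env = dict(os.environ if base_env is None else base_env)
    # OMP_THREAD_LIMIT is a standard OpenMP variable; setting it to 1
    # keeps each Tesseract process on a single thread.
    env["OMP_THREAD_LIMIT"] = "1"
    return env


def ocr_page(image_path, output_base):
    """Run `tesseract IMAGE OUTPUTBASE`, which writes OUTPUTBASE.txt."""
    return subprocess.run(
        ["tesseract", image_path, output_base],
        env=single_thread_env(),
        check=True,
    )
```

You would then start one such process per core (for example from a
worker pool sized to os.cpu_count()) rather than letting a single
process spread over several threads.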
Maybe with the average runtime & RAM usage you can figure out
what you need.
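
As a rough back-of-the-envelope sketch of that sizing: the defaults
below are the archive.org averages quoted above (7.5 s per page, about
300MB per process at the 90th percentile), while the page count and core
count are made-up inputs. Your own hardware will give different numbers,
and this leaves no headroom for the occasional 2GB spike:

```python
def estimate_cluster(pages, cores, sec_per_page=7.5, ram_per_proc_mb=300):
    """Estimate wall-clock hours and a RAM budget for an OCR batch.

    Assumes one single-threaded Tesseract process per core and
    perfectly even load across cores.
    """
    wall_hours = pages * sec_per_page / cores / 3600
    ram_mb = cores * ram_per_proc_mb
    return wall_hours, ram_mb


# Hypothetical job: one million pages on 64 cores.
hours, ram_mb = estimate_cluster(pages=1_000_000, cores=64)
print(f"{hours:.1f} hours, {ram_mb} MB RAM budget")
# prints: 32.6 hours, 19200 MB RAM budget
```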
Finally, keep in mind that in some cases Tesseract can run for many
minutes or hours and slowly consume RAM - this happens very rarely, but
it does happen on some inputs, so be sure to cap the running time and
RAM if you run it on a big cluster.
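
One way to do that capping, as a sketch rather than what we actually
run: wrap each job in a wall-clock timeout and an address-space limit.
This assumes Linux (the resource module and RLIMIT_AS), and run_capped
is a made-up helper name; the limits you pick are up to you:

```python
import resource
import subprocess


def run_capped(cmd, timeout_s, max_ram_bytes):
    """Run cmd with a wall-clock timeout and an address-space cap.

    Returns "ok", "failed" (non-zero exit, e.g. out of memory)
    or "timeout".
    """
    def limit_memory():
        # Runs in the child just before exec. Only the soft limit is
        # lowered, and never above an existing hard limit.
        _, hard = resource.getrlimit(resource.RLIMIT_AS)
        limit = max_ram_bytes
        if hard != resource.RLIM_INFINITY:
            limit = min(limit, hard)
        resource.setrlimit(resource.RLIMIT_AS, (limit, hard))

    try:
        proc = subprocess.run(
            cmd,
            preexec_fn=limit_memory,
            timeout=timeout_s,  # child is killed when this expires
            capture_output=True,
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    return "ok" if proc.returncode == 0 else "failed"
```

For example, run_capped(["tesseract", "page.png", "page"], 600,
4 * 1024**3) would give a page ten minutes and 4GB before giving up;
both numbers and the file names are placeholders.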
Cheers,
Merlijn