Recommended HW for Tesseract


juan carlos hernández

Oct 20, 2021, 5:31:43 AM
to tesseract-ocr
Hi all

I'm managing a project that needs to OCR documents in real time. We expect tens of users to be scanning and OCRing documents simultaneously, maybe 100 users at a time or more. We need OCR of a document of about 50 pages to finish in less than 20 seconds. Our documents will be scanned at 300 dpi.
As we are a huge organization in public administration, we can afford to buy very powerful servers to run Tesseract.

Do you have any advice on what HW is best suited for Tesseract?
I've reviewed the Intel Xeon family of processors, and I think the Xeon Platinum processors would be a good option.
Apart from fast processors, what other components affect the performance of Tesseract: the amount and speed of memory, or having an SSD or a RAM disk?

Thanks in advance
Juan Carlos

Merlijn B.W. Wajer

Oct 20, 2021, 6:20:15 AM
to tesser...@googlegroups.com
Hi,
Just a few vague suggestions based on experience running it on a cluster
(ymmv):

* A ramdisk could help reduce wear on SSDs, but I don't think it will
matter much for processing speed: the majority of the time is not spent
in I/O if you use SSDs.
* Run tesseract with only one thread to get the most out of your CPUs
(disable OpenMP) - this will maximise your throughput.
* The average peak RAM (max in the process lifetime) from Tesseract (at
archive.org) is about 100MB, with occasional max spikes to 2GB of RAM
(likely for big images/newspapers) - the 90th percentile is about
200MB-300MB.
* The average OCR time per page (at archive.org) is about 7.5 seconds,
but we have a lot of old CPU cores mixed in (some only have SSE2).

Maybe with the average runtime & RAM usage you can figure out
what you need.
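
As a rough back-of-the-envelope sketch of that sizing, the figures from this thread can be plugged into a few lines of Python. The function names are made up for illustration, and the estimate assumes one single-threaded Tesseract process per core, pages spread evenly across cores, and every user submitting a full document at the same moment. The 7.5 s/page average includes old CPUs, so newer hardware will probably need fewer cores than this suggests.

```python
import math

def cores_needed(users, pages_per_doc, deadline_s, sec_per_page):
    """Cores required to finish all pages within the deadline.

    Assumes one single-threaded Tesseract process per core and that
    pages can be distributed evenly across cores.
    """
    total_cpu_seconds = users * pages_per_doc * sec_per_page
    return math.ceil(total_cpu_seconds / deadline_s)

def ram_needed_gb(cores, mb_per_process):
    """RAM needed if every core runs one process at the given peak size."""
    return cores * mb_per_process / 1024

# Figures from the thread: 50 pages, 20 s deadline, ~7.5 s per page.
one_user = cores_needed(users=1, pages_per_doc=50, deadline_s=20, sec_per_page=7.5)
worst_case = cores_needed(users=100, pages_per_doc=50, deadline_s=20, sec_per_page=7.5)

print("cores for one document:", one_user)
print("cores for 100 simultaneous documents:", worst_case)
# 300 MB per process is the upper end of the 90th-percentile range above.
print("RAM at 300 MB per process (GB):", round(ram_needed_gb(worst_case, 300), 1))
```

The worst case is dominated by the assumption that all 100 users submit at once; if submissions are spread out, the peak core count drops accordingly.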

Finally, keep in mind that in some cases Tesseract can run for many
minutes or hours and slowly consume RAM - this happens very rarely, but
does happen on some inputs, so be sure to cap the running time / RAM if
you run it on a big cluster.
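
One possible way to apply such caps, sketched in Python for a POSIX system. This is not how archive.org does it; the helper names, the 120 s timeout and the 2 GB limit are assumptions picked for illustration (the 2 GB figure just mirrors the spike mentioned above). `OMP_THREAD_LIMIT=1` is the usual environment variable for keeping an OpenMP-enabled build on one thread, and `RLIMIT_AS` caps the address space of the process, which is only an approximation of its actual RAM use.

```python
import os
import resource
import subprocess

def build_env(base=None):
    """Copy the environment and restrict OpenMP to a single thread."""
    env = dict(os.environ if base is None else base)
    env["OMP_THREAD_LIMIT"] = "1"
    return env

def limit_memory(max_bytes):
    """Return a function that caps the child's address space (POSIX only)."""
    def apply():
        resource.setrlimit(resource.RLIMIT_AS, (max_bytes, max_bytes))
    return apply

def ocr_page(image_path, out_base, timeout_s=120, max_ram_bytes=2 * 1024**3):
    """Run tesseract on one page; return True on success, False otherwise.

    timeout_s and max_ram_bytes are illustrative values, not tuned ones.
    """
    try:
        proc = subprocess.run(
            ["tesseract", image_path, out_base],
            env=build_env(),
            preexec_fn=limit_memory(max_ram_bytes),  # runs in the child before exec
            timeout=timeout_s,                       # child is killed on expiry
            capture_output=True,
        )
    except subprocess.TimeoutExpired:
        return False
    return proc.returncode == 0
```

A call such as `ocr_page("page-001.png", "page-001")` would write `page-001.txt` on success; pages that hit either cap return `False` and can be logged or retried separately.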

Cheers,
Merlijn