Combining output from multiple jobs into one hOCR file

Vidar

Feb 4, 2021, 8:20:10 PM

to tesseract-ocr

Hi,

I'm running some processing on a Windows machine using the recent Mannheim 5.0 alpha builds, outputting to hOCR. When I run it on a job with a few hundred pages, the CPU usage constantly hovers around 10% (1 thread), and memory/GPU usage doesn't seem to change much.

Now, I could split the job by pages, run the pieces in parallel (or across multiple machines), and then write a little script to combine the different hOCR outputs, but I can't help wondering whether there is a better way to do this. Is there some intermediate format I can get from tesseract, so that the results can all be fed into one hOCR file directly?
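
For reference, the split-and-run-in-parallel idea could be sketched roughly as below. This is only an illustration under assumptions: the pages already exist as separate image files, `tesseract` is on the PATH, and the names `build_command`, `ocr_page`, `ocr_pages` and the `scans`/`out` directories are made up for the example. The only tesseract feature relied on is the `tesseract IMAGE OUTPUTBASE hocr` invocation, which writes `OUTPUTBASE.hocr`.

```python
from __future__ import annotations

import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_command(image: Path, out_dir: Path) -> list[str]:
    """Build the tesseract call for one page image (hypothetical helper)."""
    # `tesseract IMAGE OUTPUTBASE hocr` writes OUTPUTBASE.hocr
    out_base = out_dir / image.stem
    return ["tesseract", str(image), str(out_base), "hocr"]


def ocr_page(image: Path, out_dir: Path) -> Path:
    """Run tesseract on one page and return the path of its hOCR file."""
    subprocess.run(build_command(image, out_dir), check=True)
    return out_dir / f"{image.stem}.hocr"


def ocr_pages(images: list[Path], out_dir: Path, workers: int = 4) -> list[Path]:
    """OCR all pages with `workers` tesseract processes at a time."""
    out_dir.mkdir(parents=True, exist_ok=True)
    # Threads are enough here: each one just waits on a tesseract process.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() keeps input order, so the results stay in page order.
        return list(pool.map(lambda img: ocr_page(img, out_dir), images))


if __name__ == "__main__":
    # Assumed layout: one PNG per page in ./scans, sortable by file name.
    pages = sorted(Path("scans").glob("*.png"))
    for hocr_file in ocr_pages(pages, Path("out")):
        print(hocr_file)
```

The resulting per-page hOCR files still have to be merged into one document afterwards.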

Thanks!

Merlijn B.W. Wajer

Feb 4, 2021, 8:36:27 PM

to tesser...@googlegroups.com

Hi Vidar,

I ran into this exact problem and used hocr-combine from hocr-tools
[1] to solve it. That program has a limitation, though: it doesn't
read/write in a streaming manner, and it runs out of memory.

I wrote a streaming replacement here [2], which will not use a lot of RAM.

Cheers,
Merlijn

[1] https://github.com/ocropus/hocr-tools
[2] https://git.archive.org/merlijn/archive-hocr-tools/-/blob/master/bin/hocr-combine-stream
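
To illustrate what combining "in a streaming manner" can mean, here is a minimal sketch using only the Python standard library. It is not the code of either tool linked above. It assumes each input is well-formed XHTML hOCR (as tesseract writes it) whose pages are `div` elements with `class="ocr_page"`, and the function names `iter_pages` and `combine` are made up for the example. As a simplification it writes a bare `<head>` instead of copying the `ocr-system` and `ocr-capabilities` meta tags from the inputs.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET
from typing import Iterable, Iterator, TextIO

XHTML_NS = "http://www.w3.org/1999/xhtml"
# Serialize XHTML elements without an "ns0:" prefix.
ET.register_namespace("", XHTML_NS)


def iter_pages(path: str) -> Iterator[str]:
    """Yield each ocr_page element of one hOCR file as serialized XML."""
    for _event, elem in ET.iterparse(path, events=("end",)):
        local_name = elem.tag.rsplit("}", 1)[-1]  # drop the namespace, if any
        if local_name == "div" and elem.get("class") == "ocr_page":
            elem.tail = None  # ignore whitespace that follows the page
            yield ET.tostring(elem, encoding="unicode")
            elem.clear()  # free this page's subtree before parsing the next


def combine(paths: Iterable[str], out: TextIO) -> int:
    """Write all pages from `paths` into one hOCR document; return page count."""
    out.write('<?xml version="1.0" encoding="UTF-8"?>\n')
    out.write(f'<html xmlns="{XHTML_NS}">\n')
    out.write("<head><title></title></head>\n<body>\n")
    count = 0
    for path in paths:
        for page in iter_pages(path):
            out.write(page + "\n")  # only one page is held in memory at a time
            count += 1
    out.write("</body>\n</html>\n")
    return count


if __name__ == "__main__":
    import sys

    # Hypothetical usage: python combine_sketch.py a.hocr b.hocr > all.hocr
    combine(sys.argv[1:], sys.stdout)
```

Note that the sketch does not renumber element ids, so pages coming from separate runs may carry the same `id` values in the combined file.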

Vidar

Feb 5, 2021, 12:25:11 AM

to tesseract-ocr

Thanks a million, both of these seem like excellent options! :D
