Tesseract takes 30 to 60 minutes to process a single drawing


farhad khalafi

Feb 8, 2020, 3:54:56 PM

to tesseract-ocr

I used the official 64-bit Tesseract 5.0 alpha build under Windows for this test. The document is a single-page TIFF image of a noisy engineering drawing. With page segmentation mode 6, the file took 30 minutes to process. I then tried mode 11 to look for sparse text, and the processing time increased to over one hour.

Normally, I wouldn't attempt to OCR a file like this. However, we have a project with a large number of scanned images, and it is impractical to examine the files individually.

Is there a way to set a timeout or get some preliminary data during segmentation so that we can detect and skip such noisy files?
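One way to get a timeout without any support from the engine is to run the tesseract command-line program as a child process and give up on it after a fixed time. The following is a minimal sketch of that idea, not a feature of Tesseract itself. The helper names `run_with_timeout` and `ocr_command` are hypothetical, and the sketch assumes a `tesseract` executable is on the PATH and that passing `stdout` as the output base prints the recognized text.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return its stdout as text, or None if it exceeds timeout_s seconds."""
    try:
        done = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this exception.
        return None
    return done.stdout


def ocr_command(image_path, psm=6):
    """Build a tesseract command line (assumes tesseract is installed and on PATH)."""
    return ["tesseract", image_path, "stdout", "--psm", str(psm)]
```

A batch job could then call `run_with_timeout(ocr_command("drawing.tif"), 120)` and log the file as skipped whenever the result is `None`. The file name and the 120-second limit are placeholders. The sketch does not inspect the exit code, so a failed run returns whatever was printed before the failure.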

Also, when we run this file in a custom program with the Tesseract monitor class enabled, the engine reaches 100% progress in 10 to 20 minutes but then stalls there, presumably while formatting the results list. It detects more than 30,000 symbols, mostly non-words.

Please note that the attachment is only a screenshot because of copyright restrictions. The actual file is a G4-compressed TIFF of about 3.5 MB.

43.png

Runtime.png

Tom Morris

Feb 10, 2020, 12:33:04 PM

to tesseract-ocr

On Saturday, February 8, 2020 at 3:54:56 PM UTC-5, farhad khalafi wrote:

Is there a way to set a timeout or get some preliminary data during segmentation so that we can detect and skip such noisy files?

... The actual file is about 3.5MB TIFF G4 compressed.

It seems like you could do some simple frequency analysis in a preprocessing program to detect the high-frequency noise. If the volume of images is large enough to justify the engineering effort, you could probably do the analysis directly on the G4 codewords, without decompressing the image at all.
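As a rough illustration of that kind of pre-check, the sketch below measures how often neighboring pixels in a row differ. It is a simple stand-in for a real frequency analysis, not a method taken from Tesseract. It assumes the page has already been decoded into rows of 0/1 pixel values, and both the function names and the 0.2 threshold are hypothetical; the threshold would need to be calibrated on known clean and noisy scans.

```python
def transition_density(rows):
    """Return the fraction of horizontally adjacent pixel pairs whose values differ."""
    transitions = 0
    pairs = 0
    for row in rows:
        for left, right in zip(row, row[1:]):
            transitions += left != right  # True counts as 1
            pairs += 1
    return transitions / pairs if pairs else 0.0


def looks_noisy(rows, threshold=0.2):
    """Flag a bilevel image as noisy when its transition density exceeds threshold."""
    return transition_density(rows) > threshold
```

Line art and text produce long runs of identical pixels and therefore a low density, while speckle noise produces many short runs. The same count is closely related to the number of runs per scan line, which is why a similar measure could in principle be estimated from the compressed data.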

Also, do you need to OCR the entire image? Most engineering drawings are well structured with the important information in specific corners of the image. Could you just OCR those blocks?
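A sketch of how such a block could be selected follows. It only computes the pixel rectangle for one corner of the page; the function name `corner_box` and the size fractions are hypothetical, and where the important information actually sits depends on the drawing standard in use.

```python
def corner_box(width, height, corner="bottom-right", frac_w=0.3, frac_h=0.25):
    """Return (left, upper, right, lower) for a block in one corner of the page.

    frac_w and frac_h are the block's share of the page width and height.
    """
    block_w = round(width * frac_w)
    block_h = round(height * frac_h)
    vertical, horizontal = corner.split("-")
    left = 0 if horizontal == "left" else width - block_w
    upper = 0 if vertical == "top" else height - block_h
    return (left, upper, left + block_w, upper + block_h)
```

The tuple uses the (left, upper, right, lower) order that image libraries such as Pillow accept for cropping, so the cropped block can be saved and passed to the OCR engine instead of the full page.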