Tesseract on technical drawings

Olav Storøy

Nov 9, 2023, 9:04:37 AM

to tesseract-ocr

Hi! I'm trying to use Tesseract OCR on scanned layout drawings of industrial facilities. My main goal is to find tags (unique equipment IDs). I use the tessdata_best eng.traineddata model. The only thing I pass to Tesseract is a custom config file, which looks like this:

tessedit_char_whitelist ABCDEFGHIJKLMNOPQRSTUVWXYZ-_—.1234567890
load_system_dawg 0
load_freq_dawg 0
tessedit_create_tsv 1
user_patterns_file custom-patterns.txt
tessedit_pageseg_mode 3
tessedit_ocr_engine_mode 1
textord_heavy_nr 1

The custom-patterns.txt file consists of the tags that I mentioned earlier. I sometimes switch between PSM 11 (sparse) and PSM 3 (auto). Sometimes PSM 3 works better in areas where PSM 11 struggles, and vice versa.
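Switching between PSM 3 and PSM 11 by hand could be scripted so that every sheet is run in both modes and the two TSV outputs are compared afterwards. The following is a minimal sketch, not a tested pipeline. It assumes the `tessedit_pageseg_mode` line is removed from the config file so that `--psm` on the command line is the only place the mode is set, that `tesseract` is on the PATH, and that the config file still contains `tessedit_create_tsv 1`. The function names and the output naming scheme are made up for illustration.

```python
import subprocess
from pathlib import Path


def build_tesseract_cmd(image, out_base, psm, config_file, lang="eng"):
    """Return the argv list for one Tesseract run with the given page segmentation mode."""
    return [
        "tesseract", str(image), str(out_base),
        "-l", lang,
        "--psm", str(psm),
        str(config_file),  # config file names go last on the command line
    ]


def run_modes(image, config_file, modes=(3, 11)):
    """Run Tesseract once per mode and return {psm: path of the TSV it should write}."""
    outputs = {}
    stem = Path(image).stem
    for psm in modes:
        out_base = f"{stem}_psm{psm}"  # hypothetical naming scheme
        subprocess.run(build_tesseract_cmd(image, out_base, psm, config_file), check=True)
        # With tessedit_create_tsv 1 in the config, the result lands in <out_base>.tsv
        outputs[psm] = Path(out_base + ".tsv")
    return outputs
```

Passing `modes=(3, 11, 12)` would add further modes to the comparison without changing anything else.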

My question is: which config params should I look to play around with most to refine my results? With PSM 11, Tesseract struggles with text rotated by 90 degrees, and text that has neighboring non-text graphical elements. PSM 3 gets nicer and tighter text boxes, but then seemingly rejects the "easiest" texts on the sheet. I am including screenshots to show this.

There are over 600 config variables and there really aren't good resources on which ones are most impactful or useful to control the process. Help would be greatly appreciated.

PSM3 too inclusive.png

PSM11 too inclusive.png

PSM3 no detects.png

PSM11 too inclusive2.png

Tom Morris

Nov 9, 2023, 1:55:18 PM

to tesseract-ocr

On Thursday, November 9, 2023 at 9:04:37 AM UTC-5 olavs...@gmail.com wrote:

With PSM 11, Tesseract struggles with text rotated by 90 degrees, and text that has neighboring non-text graphical elements. PSM 3 gets nicer and tighter text boxes, but then seemingly rejects the "easiest" texts on the sheet.

Why not PSM 12 "Sparse text with OSD" instead of PSM 11, particularly since you want multiple orientations?

I am including screenshots to show this.

It would be helpful if you described what the expected results are. e.g. Does it matter that the centerline (CL) symbol gets included in the bounding box even if it doesn't affect the recognition? Providing an unannotated source image (or section of an image) that people could experiment with might also yield you more useful suggestions (I won't have the time, but others might).

Tom

Olav Storøy

Nov 10, 2023, 3:03:42 AM

to tesseract-ocr

Thanks for your reply!

It isn't clear to me whether OSD is meant for the orientation of the whole page or the orientation of individual text elements on the page. But that's a good point: I should be using PSM 12 anyway, since there are actually rotated pages in my dataset.

As additional context, there are many (hundreds to thousands) such drawings that I would like to process once I'm confident the config gives decent accuracy. I just want it to be as robust as possible. For example, I would prefer that it didn't include the CL symbol, because that gave the word a confidence score of 0 even though the text was in fact recognized correctly. In other cases, such as "41-8304", the box included too many other elements, which disturbed the recognition and caused a miss. Ideally the bounding box should be like the one for "41-8305", tightly around the text.
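One way to make the results less sensitive to loose boxes is to post-filter the TSV output: search each recognized word for something shaped like a tag instead of requiring the whole word to be a tag, and keep the confidence alongside it for a later decision. This is only a sketch. The tag pattern (two digits, a hyphen, four digits) is an assumption modelled on "41-8304" and would need adjusting to the real tag scheme; the function name is made up. The column names are those of Tesseract's TSV output.

```python
import csv
import io
import re

# Assumed tag shape, modelled on "41-8304": two digits, a hyphen, four digits,
# not embedded in a longer run of digits.
TAG_RE = re.compile(r"(?<!\d)\d{2}-\d{4}(?!\d)")


def find_tags(tsv_text, tag_re=TAG_RE):
    """Yield (tag, conf, (left, top, width, height)) for each TSV row whose text contains a tag."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        text = (row.get("text") or "").strip()
        match = tag_re.search(text)
        if match is None:
            continue  # layout rows and non-tag words are skipped
        box = tuple(int(row[key]) for key in ("left", "top", "width", "height"))
        yield match.group(0), float(row["conf"]), box
```

A word that carries extra characters from a neighboring symbol still yields its tag this way, and the caller can decide whether a low `conf` on a well-formed tag should be trusted or flagged for review.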

Tesseract has sort of demonstrated that it is capable of high accuracy, I just don't know how to optimize it with the right config variables.

I am including two screenshots of larger sections for anyone to try on. Both are 300 dpi. "image.png" is the one from above, and "image2" is a more challenging representative.

image.png

image2.png

Tom Morris

Nov 11, 2023, 7:51:58 PM

to tesseract-ocr

On Friday, November 10, 2023 at 3:03:42 AM UTC-5 olavs...@gmail.com wrote:

It isn't clear to me if OSD is meant for orientation of the whole page or orientation of individual text elements on the page

Sorry, I should have mentioned that earlier. I'm pretty sure it's page orientation and while I think it can handle vertical text, I don't think it can handle rotated text, so you'll probably have to run things twice.
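If things are run twice, with the second pass on a copy of the sheet rotated by 90 degrees, the boxes from that pass have to be mapped back onto the original sheet before the two result sets can be combined. A minimal sketch of that mapping follows. It assumes the copy was rotated exactly 90 degrees clockwise (any image library can do the rotation) and that boxes are `(left, top, width, height)` in pixels, as in the TSV output. The function name is made up.

```python
def box_to_original(box, original_height):
    """Map a (left, top, width, height) box found on a copy rotated 90 degrees
    clockwise back to the coordinates of the unrotated image.

    A clockwise rotation sends original pixel (x, y) to (original_height - 1 - y, x),
    so the inverse swaps the axes and flips the new vertical axis.
    """
    left, top, width, height = box
    return (top, original_height - left - width, height, width)
```

Width and height swap because text that is upright in the rotated copy runs vertically on the original sheet.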

For example I would prefer it didn't include the CL symbol because that gave it a 0 confidence score, even though it did in fact recognize correctly.

This may be difficult for cases where the CL symbol is very close in size to your digits, but you might be able to do something based on character confidence scores.

I just don't know how to optimize it with the right config variables.

I think your biggest problem is probably page segmentation and that's one of Tesseract's weakest areas. I'm not sure how much tweaking parameters is going to help, but perhaps someone else has some ideas.

Tom

Olav Storøy

Nov 13, 2023, 5:35:20 AM

to tesser...@googlegroups.com

Thanks again for your reply.

Yeah it seems page segmentation is the crucial issue. If the bounding boxes are good, the recognition is usually very good.

I think I've sort of reached the limit on what I can do with base Tesseract. I think the next step would be custom training / fine-tuning.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/c60cf545-4d52-4333-8790-4f2442fc517fn%40googlegroups.com.

Tom Morris

Nov 13, 2023, 11:13:52 AM

to tesseract-ocr

On Monday, November 13, 2023 at 5:35:20 AM UTC-5 olavs...@gmail.com wrote:

Yeah it seems page segmentation is the crucial issue. If the bounding boxes are good, the recognition is usually very good.

I think I've sort of reached the limit on what I can do with base Tesseract. I think the next step would be custom training / fine-tuning.

Tesseract's page layout analysis / segmentation isn't training based, so I don't think this is going to help you. If you wanted to recognize the C/L glyph, you could do fine tuning training for it, but it's not going to help you with the problem of finding rotated text and accurately determining bounding boxes for text of interest.

It's been ages since I've done serious image processing, but I'd recommend looking at something like OpenCV's text detection:

https://docs.opencv.org/4.8.0/d4/d43/tutorial_dnn_text_spotting.html

Aspirationally, you can get some idea of what's possible by playing with Google's Cloud Vision API demo

https://cloud.google.com/vision/docs/drag-and-drop

It lets you just drag & drop an image and then inspect the results both visually and via the JSON that the API produces.

Good luck!

Tom

Olav Storøy

Nov 13, 2023, 12:18:30 PM

to tesser...@googlegroups.com

Ah! Thanks for the heads-up, that probably saved me a lot of time. I'll definitely have a look at OpenCV text detection and Cloud Vision. I really appreciate the tips.

Art Rhyno

Nov 13, 2023, 2:05:59 PM

to tesser...@googlegroups.com

With such clear diagrams, there might be value in having OpenCV remove the horizontal and vertical lines, and then identifying and merging the blobs that are left to get the regions for recognition. I tried this a bit with one of your examples. It would take more refinement, but there might be a path to getting good bounding boxes at the image level.

art

contours.png

Olav Storøy

Nov 14, 2023, 2:26:20 AM

to tesser...@googlegroups.com

Interesting! I'd be worried it could remove important text features, but maybe tune it to not remove lines shorter than x. I definitely need to look at cv2. Until now I've sort of assumed it's best to make Tesseract do as much of this process as possible... Thanks for your input
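The idea of removing only lines longer than some length x and then merging the leftover blobs can be illustrated on a tiny binary grid. This is a pure-Python toy, not the OpenCV code that produced contours.png; on real scans the same steps would more likely be done with OpenCV's morphological operations and contour finding. The function names, the `min_len` threshold and the `gap` merge distance are all assumptions for illustration.

```python
def remove_long_runs(grid, min_len):
    """Return a copy of a 0/1 grid with horizontal and vertical ink runs of at
    least min_len pixels erased; shorter runs (text strokes) are kept."""
    rows, cols = len(grid), len(grid[0])
    erase = [[False] * cols for _ in range(rows)]

    def mark(cells):
        # cells: the (row, col) positions along one row or one column, in order
        run = []
        for r, c in cells + [(None, None)]:  # sentinel flushes the final run
            if r is not None and grid[r][c]:
                run.append((r, c))
                continue
            if len(run) >= min_len:
                for rr, cc in run:
                    erase[rr][cc] = True
            run = []

    for r in range(rows):
        mark([(r, c) for c in range(cols)])
    for c in range(cols):
        mark([(r, c) for r in range(rows)])
    return [[0 if erase[r][c] else grid[r][c] for c in range(cols)] for r in range(rows)]


def blob_boxes(grid, gap=1):
    """Bounding boxes (left, top, width, height) of ink blobs; ink pixels within
    `gap` pixels of each other end up in the same blob."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r0 in range(rows):
        for c0 in range(cols):
            if not grid[r0][c0] or seen[r0][c0]:
                continue
            seen[r0][c0] = True
            stack = [(r0, c0)]
            top, bottom, left, right = r0, r0, c0, c0
            while stack:  # flood fill over the gap-sized neighborhood
                r, c = stack.pop()
                top, bottom = min(top, r), max(bottom, r)
                left, right = min(left, c), max(right, c)
                for rr in range(max(0, r - gap), min(rows, r + gap + 1)):
                    for cc in range(max(0, c - gap), min(cols, c + gap + 1)):
                        if grid[rr][cc] and not seen[rr][cc]:
                            seen[rr][cc] = True
                            stack.append((rr, cc))
            boxes.append((left, top, right - left + 1, bottom - top + 1))
    return boxes
```

Raising `min_len` protects long character strokes at the cost of leaving short line fragments behind, and a larger `gap` merges neighboring characters into one region per tag. Pixels where a character touches a removed line are still lost, which is the risk raised above.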

Olav Storøy

Nov 14, 2023, 2:53:50 AM

to tesser...@googlegroups.com

Update: I tried the Google Vision API, and it is actually ridiculously good. It hit all the targets except two, and that's without a tag dictionary.
