Training strategy to add a few GDT Symbols to eng


Boot

Dec 1, 2020, 2:17:13 PM
to tesseract-ocr
I'm working on training a model to recognize mechanical engineering drawings that may contain GDT symbols, such as the symbols for depth, counterbore, countersink, diameter, etc. I saw that eng.traineddata already has a number of these GDT symbols, but not all of them. I'm using the Legacy OEM.

I am obtaining two different types of images from these mechanical drawings: images that contain Notes, which are typically English paragraphs/sentences of text, and images that contain dimensions/GDT symbols.
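To make the split concrete, here is a minimal sketch of how each cropped region could be routed to a different language on the tesseract command line. The language name `gdt` for the custom dimension model and the helper `tesseract_args` are placeholders made up for illustration; `-l` and `--oem 0` (legacy engine) are the standard tesseract options, and `stdout` as the output base prints the text instead of writing a file.

```python
from typing import List

# Hypothetical name of the custom dimension/GDT traineddata (assumption).
DIMENSION_LANG = "gdt"


def tesseract_args(image_path: str, region_kind: str) -> List[str]:
    """Build a tesseract command line for one cropped region.

    region_kind is "notes" for paragraph text or "dimension" for
    dimension/GDT callouts; both labels are illustrative.
    """
    if region_kind == "notes":
        lang = "eng"
    elif region_kind == "dimension":
        lang = DIMENSION_LANG
    else:
        raise ValueError(f"unknown region kind: {region_kind!r}")
    # --oem 0 selects the legacy engine; "stdout" sends the text to stdout.
    return ["tesseract", image_path, "stdout", "-l", lang, "--oem", "0"]
```

The returned list could be passed to `subprocess.run(..., capture_output=True, text=True)` on a machine where tesseract and both traineddata files are installed.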

For the Notes regions of the drawing (in general, recognition of all letters, numbers, and punctuation), I'm satisfied with the results that the eng.traineddata language produces.

For images obtained from the drawing that contain dimension text such as "⌀1.05 + .05 - .03 TYP", I have developed a model that is trained with the letters A-Z (uppercase only, which is typical on these drawings; dimensions can have English text before or after them as well), a limited set of punctuation characters, and all the GDT symbols I need. It works OK on some fonts, but it is not as good as the eng.traineddata model at recognizing letters, numbers, and punctuation. I'm assuming the main reason is that I haven't trained it with nearly as many fonts as the eng.traineddata model has been trained with.

So my question is: what's the best way to develop the language I need, which is just the eng model plus a few additional characters? Does it make sense to try to re-create the eng training data on my own? That seems like a daunting task that I'm trying to avoid. Do I have to re-create the eng language to add a few symbols?
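For reference, the character set I have in mind for the dimension model can be sketched roughly like this. The punctuation list, the inclusion of digits, and the Unicode code points chosen for each GDT symbol are assumptions for illustration, not the exact contents of my training data; `uncovered` is just a hypothetical helper for checking sample dimension strings against the set.

```python
import string

# Assumed Unicode code points for the GDT symbols mentioned above.
GDT_SYMBOLS = {
    "\u2300": "diameter",     # DIAMETER SIGN
    "\u2334": "counterbore",  # COUNTERBORE
    "\u2335": "countersink",  # COUNTERSINK
    "\u21a7": "depth",        # DOWNWARDS ARROW FROM BAR
}

# Assumed "limited punctuation" set; adjust to the drawings at hand.
PUNCTUATION = set("+-.,()/")

# Uppercase letters, digits (assumed), punctuation, and GDT symbols.
ALLOWED = (
    set(string.ascii_uppercase)
    | set(string.digits)
    | PUNCTUATION
    | set(GDT_SYMBOLS)
)


def uncovered(text: str) -> list:
    """Return the sorted non-space characters in text that are not in ALLOWED."""
    return sorted({ch for ch in text if not ch.isspace() and ch not in ALLOWED})
```

Running `uncovered` over ground-truth strings from real drawings would show which characters a combined "eng plus symbols" model still needs to cover.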

Thanks for any advice,
Boot
