Training from Scratch

Simon

Nov 22, 2023, 8:46:53 AM
to tesseract-ocr
As it is not really possible to merge a traineddata model trained from scratch with an existing one, I have decided to also train my from-scratch model on numbers. To that end I wrote a script which synthetically generates ground-truth data with text2image.
This script uses dozens of different fonts and creates numbers in the following formats (a rough sketch of such a generator follows the list):
X.XXX
X.XX
X,XX
X,XXX
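
For illustration, a stripped-down sketch of this kind of generator (the font names, fonts directory and file layout are placeholders, not my actual script; text2image ships with Tesseract's training tools):

    import os
    import random
    import subprocess

    FONTS = ["Arial", "DejaVu Sans", "Liberation Sans"]  # placeholder fonts

    def random_number():
        # One of the four formats: X.XXX, X.XX, X,XX, X,XXX
        sep = random.choice([".", ","])
        digits = random.choice([2, 3])
        return f"{random.randint(0, 9)}{sep}{random.randint(0, 10**digits - 1):0{digits}d}"

    os.makedirs("ground-truth", exist_ok=True)
    for i in range(10000):
        base = f"ground-truth/num_{i:05d}"
        with open(base + ".gt.txt", "w") as f:   # transcription for tesstrain
            f.write(random_number() + "\n")
        subprocess.run([
            "text2image",
            "--text", base + ".gt.txt",
            "--outputbase", base,                # writes num_XXXXX.tif
            "--font", random.choice(FONTS),
            "--fonts_dir", "/usr/share/fonts",   # assumed fonts location
            "--max_pages", "1",
        ], check=True)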
I generated 10,000 files to train the numbers, but unfortunately the resulting best model recognizes numbers pretty poorly (most of the time only "0.", "0" or "0," is recognized).
So I wanted to ask whether this is simply not enough training data (ground truth) for proper recognition when I train on several fonts.
Thanks in advance for your help.

Des Bw

Nov 22, 2023, 9:27:02 AM
to tesseract-ocr
From my limited experience, you need a lot more data than that to train from scratch. If you can't produce more data than that, you might first try to fine-tune, and then train by removing the top layer of the best model.

Simon

Nov 23, 2023, 4:15:56 AM
to tesseract-ocr
If I need to train new characters that are not recognized by a default model, is fine-tuning the right approach in this case?
One of these characters is the one for angularity (∠):

This symbol appears in technical drawings and should be recognized there. E.g. in the scenario in the following picture Tesseract should recognize this symbol.



angularity.png

Also, here is one of the images I tried to train with:
angularity_0_r0.jpg
They all look pretty similar to this one. The things that change are the angle, the proportions and the thickness of the lines. All examples have this 64x64 pixel box around them.


Is fine-tuning the right approach for this scenario? I only find information about fine-tuning for specific fonts. Also, the "tesstrain" repository would not be needed for fine-tuning, as it is used for training from scratch, correct?

Des Bw

Nov 23, 2023, 4:28:26 AM
to tesseract-ocr
If the original model lacks the ∠ symbol, fine tuning is not going to add it for you. We have all gone through that process. To introduce a new character, removing the top layer and training from there is the most effective approach.

Simon

Nov 23, 2023, 4:35:12 AM
to tesseract-ocr
Thanks a lot!
This is not possible with the tesstrain repository, right?

Des Bw

Nov 23, 2023, 4:39:21 AM
to tesseract-ocr
Download the best model and try it. If it recognizes the symbol, that is great. You can also look at the unicharset of the best model.

Des Bw

Nov 23, 2023, 4:41:20 AM
to tesseract-ocr
If you are planning to train, you need to make sure that your images contain all those variations in thickness, angle, etc. I don't know if text2image can do that for you. You might need to do it manually, or use some other tool.
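
For example (a sketch only, not a tested tool): since the angularity symbol is just two line segments, something like Pillow can generate the variants directly. All sizes and ranges below are guesses to tune against the real drawings:

    import math
    import random
    from PIL import Image, ImageDraw

    def make_angularity(size=64, angle_deg=45.0, thickness=3):
        img = Image.new("L", (size, size), 255)        # white 64x64 canvas
        draw = ImageDraw.Draw(img)
        vertex = (8, size - 8)
        # Horizontal leg of the symbol
        draw.line([vertex, (size - 8, size - 8)], fill=0, width=thickness)
        # Slanted leg, angle_deg above the horizontal
        a = math.radians(angle_deg)
        tip = (vertex[0] + (size - 16) * math.cos(a),
               vertex[1] - (size - 16) * math.sin(a))
        draw.line([vertex, tip], fill=0, width=thickness)
        return img

    for i in range(1000):
        make_angularity(angle_deg=random.uniform(30, 60),
                        thickness=random.randint(2, 5)).save(f"angularity_{i}.png")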

Zdenko Podobny

Nov 23, 2023, 12:59:01 PM
to tesser...@googlegroups.com

On Thu, Nov 23, 2023 at 10:28 Des Bw <desal...@gmail.com> wrote:
If the original model lacks the ∠ symbol, fine tuning is not going to add it for you.

Really??? 
The Tesseract documentation says: "Fine tuning is the process of training an existing model on new data without changing any part of the network, although you can now add characters to the character set." (See "Fine Tuning for ± a few characters".)

 
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/fb4a1b27-db44-49a6-adfa-ada9e13030aan%40googlegroups.com.

Lorenzo Bolzani

Nov 24, 2023, 4:45:14 AM
to tesser...@googlegroups.com
Hi Simon,
if I understand correctly how Tesseract works, it follows these steps:

- it segments the image into lines of text
- it then takes each individual line and slides a small window, 1px wide I think, over it from one end to the other. For each step the model outputs a prediction. The model, being a bidirectional LSTM, has some memory of the previous and following pixel columns.
- all these predictions are converted into characters using beam search

Please correct me if I got it wrong. So the first thing I think of, looking at your picture, is the segmentation step. Do you want to read the "< 0,05 A" block only? Is the segmentation step able to isolate it? This is the first thing I would try to understand.
Also, your sample image for "<" has a very different angle from the one before 0,05.

In this case I would try to do a custom segmentation, looking for rectangular boxes of a certain height, aspect ratio, etc., then crop these out (maybe dropping the rectangular box and the black vertical lines) and feed them to Tesseract. This of course requires custom programming.

This might give good results even without fine tuning. I would try this manually with GIMP first.


Also, I suppose you are not going to encounter a lot of wild fonts in this kind of diagram. The more fonts you use, the harder the training. I would focus on very few fonts, even just one. I would start with exactly one font and train on that, to quickly see whether my training setup/pipeline is working, and whether the training results carry over to the diagrams later. If the model error rate is good on the individual text lines but bad on the real images, it might be a segmentation problem that training cannot fix. Or the problem might be the external box, which I suppose you do not have in your generated data.

Ideally, I would use real crops from these diagrams rather than images from text2image.

Also, distinguishing 0 from O with many fonts is very hard. Often you have domain knowledge that can help you fix these errors in post-processing; for example, 0,O5 can be easily spotted and fixed. You can, for example, assume that each box contains only one kind of data and guess the most likely one from this, or from the box sequence, etc.
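
For instance, a tiny post-processing sketch of that idea (the rule is an assumption about your data: tokens that are otherwise numeric should not contain the letter O):

    import re

    def fix_numeric_token(token):
        # Inside an otherwise numeric token, a recognized "O" is almost
        # certainly a zero; leave real words alone.
        if re.fullmatch(r"[0-9O.,]+", token) and any(c.isdigit() for c in token):
            return token.replace("O", "0")
        return token

    print(fix_numeric_token("0,O5"))  # -> 0,05
    print(fix_numeric_token("OK"))    # unchanged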

I got good results with 20k samples (real-world scanned docs, multiple fonts). 10k seems reasonable, and I also assume your output character set is very small: the digits, a few capital letters and a couple of symbols (no %, ^, &, {, etc.).



Lorenzo

Des Bw

Nov 24, 2023, 6:12:34 AM
to tesseract-ocr
@zdenop: 
Yes, because the characters start to show up (get recognized) only after you run a few thousand iterations. For me, new characters start to get recognized only after I run 5000 iterations. At that point, the base model will have deteriorated terribly. It is now common knowledge that fine-tuning for more than about 400 iterations highly compromises the base model. For that reason, fine-tuning is not effective for adding new characters (even if the guide says it is possible).

Dear Zdenop, I would love to know if there is a way around it. I have been languishing with Tesseract for months now because the default model is missing one important character.

Simon

Nov 25, 2023, 6:24:59 AM
to tesseract-ocr
Yes, in general I want to recognize this part, "< 0,05 A", except that the "<" is actually the character for angularity (∠).

The segmentation process of Tesseract can't be edited, right? So you mean I would need to make a Tesseract-independent program that localizes the boxes, crops them out and feeds them to Tesseract? In that case I would still need to train Tesseract to recognize ∠. So I am still wondering how to train this sign properly.

Because you asked whether the segmentation step is able to isolate it: I can check this by looking at the hOCR information, right?



Lorenzo Bolzani

Nov 27, 2023, 10:52:46 AM
to tesser...@googlegroups.com

Hi Simon, yes, I think the instructions you can give to the segmentation step are quite limited, mostly the PSM parameter and I suppose a few minor ones. There is something about tables, but I've never used it and yours might be too small for it to work. Yes, you should be able to see what is happening by looking at the hOCR file.

You could also try the attached script; it was made for the 4.x version but might work with 5.x too. It draws boxes around letters according to the Tesseract output. I'm attaching the output on a simple text and on several crops from your image: only in the clean one can you see the text boxes. You can do the same from the hOCR file.
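
A rough equivalent with pytesseract, in case you prefer not to use the script (not the attached script, just the same idea; "drawing.png" is a placeholder):

    import cv2
    import pytesseract

    img = cv2.imread("drawing.png")
    data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)

    # Draw a red box around every recognized word
    for i, text in enumerate(data["text"]):
        if text.strip():
            x, y, w, h = (data[k][i] for k in ("left", "top", "width", "height"))
            cv2.rectangle(img, (x, y), (x + w, y + h), (0, 0, 255), 1)

    cv2.imwrite("boxes.png", img)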

Yes, you still need to fine-tune for the new character. I was able to train up to 57k iterations while still improving the results on a test dataset. You need to fine-tune including the new symbols AND all the other symbols you expect to recognize in the training dataset.


I'm not sure if you are using something like this:

 merge_unicharsets $(TESSDATA)/$(CONTINUE_FROM).lstm-unicharset $(TRAIN)/my.unicharset  "$@"

if so, you can replace it with:

 cp "$(TRAIN)/my.unicharset" "data/unicharset"

and the new model will output only the characters that are present in your new dataset (for example to discard lower case letters, the < character, %, !, #, etc.)

Also, if you do not need to recognize the < symbol, you could reuse it rather than adding a completely new one. I mean that when you generate the images with the "angle" symbol, you put < in the transcription. Maybe it helps, maybe it won't.



Bye

Lorenzo




ocr_boxes.py
ocr boxes_screenshot_27.11.2023-3.png
ocr boxes_screenshot_27.11.2023x.png
ocr boxes_screenshot_27.11.2023.png
ocr boxes_screenshot_27.11.2023-2.png

Simon

Nov 29, 2023, 3:36:05 AM
to tesseract-ocr
Hey Lorenzo,

thanks a lot for your response. I've seen in the hOCR files of different technical drawings that the Tesseract text segmentation has massive problems recognizing zones with text, probably because of the various lines and complex constructions within a technical drawing. Even the zones where text appears are recognized only very rarely. So it seems pretty obvious to me that Tesseract is not built for documents without clear text lines.
Therefore I decided to follow your suggestion to crop out the boxes (feature control frames) and feed them separately to Tesseract. To identify those boxes I will try to use OpenCV. I will also try to generate training data similar to these feature control frames for the Tesseract training. Do you think this approach could be successful?

Lorenzo Bolzani

Nov 29, 2023, 6:57:53 AM
to tesser...@googlegroups.com
Yes, isolating the text fragments gives you a lot more control over the final OCR step.

If you have control over the program generating the diagrams, you could use it to generate the training data with the corresponding gt files, or at least a set of diagrams with known text. You could also consider using the Google Vision API to generate the training data from these diagrams; it could cost just a couple of dollars if you can pack a lot of text into a single image.
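
Something like this, for example (a sketch assuming the google-cloud-vision package and API credentials are set up; the file name is a placeholder):

    from google.cloud import vision

    client = vision.ImageAnnotatorClient()

    with open("diagram_crop.png", "rb") as f:
        image = vision.Image(content=f.read())

    # Use the API's transcription as draft ground truth, to be checked by hand
    response = client.document_text_detection(image=image)
    print(response.full_text_annotation.text)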

To detect the text, you could try the cv2 text-detection model, or the simpler morphology operations used to detect, for example, credit card numbers, MRZ lines, etc.
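
The morphology route could look roughly like this (the thresholds and kernel sizes are guesses to tune per drawing):

    import cv2

    gray = cv2.imread("drawing.png", cv2.IMREAD_GRAYSCALE)

    # Blackhat highlights dark text on a light background
    blackhat = cv2.morphologyEx(gray, cv2.MORPH_BLACKHAT,
                                cv2.getStructuringElement(cv2.MORPH_RECT, (13, 5)))
    _, bw = cv2.threshold(blackhat, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)

    # Close horizontally so the characters of one line merge into one blob
    bw = cv2.morphologyEx(bw, cv2.MORPH_CLOSE,
                          cv2.getStructuringElement(cv2.MORPH_RECT, (21, 3)))

    contours, _ = cv2.findContours(bw, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)
        if w > 2 * h and h > 10:     # keep wide, text-like regions
            print("candidate text region:", x, y, w, h)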

To remove the box around the text, once you have a small crop with the text region only, you could do a floodFill with black color in one white corner of the crop, followed by a white floodFill from the same place: this should completely remove the box around the text and the extra lines.
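
In OpenCV that could look like this (a sketch; it assumes a binarized crop whose top-left corner is white background outside the box):

    import cv2
    import numpy as np

    crop = cv2.imread("frame_crop.png", cv2.IMREAD_GRAYSCALE)  # placeholder
    _, bw = cv2.threshold(crop, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    h, w = bw.shape
    seed = (0, 0)

    # 1) Black fill from the white corner: the outer background merges with
    #    the black box outline and connected lines into one black blob.
    mask = np.zeros((h + 2, w + 2), np.uint8)
    cv2.floodFill(bw, mask, seed, 0)

    # 2) White fill from the same seed: the whole blob disappears, leaving
    #    only the characters, which do not touch the frame.
    mask = np.zeros((h + 2, w + 2), np.uint8)
    cv2.floodFill(bw, mask, seed, 255)

    cv2.imwrite("frame_text_only.png", bw)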

You could also try the black flood fill on the whole image for segmentation; it should leave only the "boxed" regions. Then do a component analysis to select text regions according to size/aspect ratio.
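
Sketched (again, the size/aspect thresholds are guesses):

    import cv2
    import numpy as np

    gray = cv2.imread("drawing.png", cv2.IMREAD_GRAYSCALE)
    _, bw = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)

    # Black-fill the background; only the white interiors of boxes survive
    mask = np.zeros((bw.shape[0] + 2, bw.shape[1] + 2), np.uint8)
    cv2.floodFill(bw, mask, (0, 0), 0)   # assumes (0, 0) is background

    n, labels, stats, _ = cv2.connectedComponentsWithStats(bw)
    for i in range(1, n):                # label 0 is the background
        x, y, w, h, area = stats[i]
        if w > 2 * h and h > 15:         # text-box-like size/aspect ratio
            print("boxed region:", x, y, w, h)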

If you have free text above the boxed one, you could first detect the boxed text, then go back to the original image and examine the region just above the boxed region.

Tesseract does a much better job on the unboxed text, see the attachment (taken from here: https://www.gdandtbasics.com/feature-control-frame). Still, I do not think it would be good enough for a robust solution, even with some fine-tuning. It might work for simple diagrams.


Lorenzo


ocr boxes_screenshot_29.11.2023.png