How to generate training images with noise

Keith Smith

Oct 12, 2023, 2:15:06 PM

to tesseract-ocr

Hello,

I am trying to use tesseract to OCR the MICR line of checks (i.e. the micr-e13b font). The training data that I found at https://github.com/BigPino67/Tesseract-MICR-OCR/blob/master/Tessdata/mcr.traineddata does not produce accurate results on my data set.

I have a set of over 20K check images along with the MICR text for those images; however, I do not have box files for them.

So I started generating box files and manually correcting them via JTessBoxEditor, but I soon learned that it would take a LONG time to do this for enough checks to properly train tesseract. So I have just started generating synthetic images using tesseract's text2image; however, the generated images are perfect (i.e. no blur, skew, etc.), so I doubt that training on them will teach tesseract to handle my less-than-perfect check images.

Does anyone have suggestions for the best methodology to use? Is there a way to get text2image (or another tool) to generate less-than-perfect images? Or can someone suggest a less labor intensive way of using real check images to train tesseract?

Thanks in advance,

Keith
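
One way to approach the "less-than-perfect images" question is to degrade clean synthetic renders after they have been generated. The following is only a minimal sketch of that idea, not a feature of text2image: it treats an image as a list of rows of 0-255 grey values and applies a horizontal shear, a 3x3 box blur, and Gaussian noise. The function names (shear, box_blur, add_noise, degrade) and all default parameters are hypothetical, and a real pipeline would more likely use an imaging library.

```python
import random
from typing import List

Image = List[List[int]]  # rows of grey values, 0 = black, 255 = white


def shear(img: Image, slope: float, fill: int = 255) -> Image:
    """Shift each row horizontally by slope * row_index pixels (a crude skew)."""
    width = len(img[0])
    out = []
    for y, row in enumerate(img):
        shift = round(slope * y)
        new_row = [fill] * width
        for x, value in enumerate(row):
            nx = x + shift
            if 0 <= nx < width:
                new_row[nx] = value
        out.append(new_row)
    return out


def box_blur(img: Image) -> Image:
    """Replace every pixel by the mean of its 3x3 neighbourhood (edges clamp)."""
    height, width = len(img), len(img[0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            total = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), height - 1)
                    xx = min(max(x + dx, 0), width - 1)
                    total += img[yy][xx]
            row.append(total // 9)
        out.append(row)
    return out


def add_noise(img: Image, sigma: float, rng: random.Random) -> Image:
    """Add zero-mean Gaussian noise and clip the result to 0..255."""
    return [
        [min(255, max(0, round(v + rng.gauss(0, sigma)))) for v in row]
        for row in img
    ]


def degrade(img: Image, slope: float = 0.1, sigma: float = 12.0, seed: int = 0) -> Image:
    """Apply skew, blur and noise in that order; all defaults are arbitrary."""
    rng = random.Random(seed)
    return add_noise(box_blur(shear(img, slope)), sigma, rng)
```

Calling degrade on the same clean image with different seeds gives several noisy copies of the same size, each of which can reuse the transcription of the original.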

Shree Devi Kumar

Oct 13, 2023, 1:00:06 AM

to tesseract-ocr

Have you looked at

https://github.com/tesseract-ocr/tesstrain

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/b92d2ab9-3da1-4ef8-bafe-5217821c5601n%40googlegroups.com.

Keith Smith

Oct 13, 2023, 6:13:44 AM

to tesser...@googlegroups.com

Yes, I have. I am asking how to automate the generation of the ground truth images and box files, because from what I understand, tesseract requires on the order of 10K images and box files to train on. However, unless I am missing something, what I read at https://github.com/tesseract-ocr/tesstrain assumes the ground truth (images + box files) already exists.

Shree Devi Kumar

Oct 13, 2023, 7:46:20 AM

to tesseract-ocr

If you have single line images, then you only need matching single line text transcription for the tesstrain makefile training process. It will generate the required box files.

This is different from the old text2image process.

>>Images must be TIFF and have the extension .tif or PNG and have the extension .png, .bin.png or .nrm.png.

>>Transcriptions must be single-line plain text and have the same name as the line image but with the image extension replaced by .gt.txt.

Please try a test run with the example set-up.
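
The two quoted naming rules above can be sketched as follows. This is an illustrative helper, not part of tesstrain: gt_path_for and write_ground_truth are hypothetical names. It strips one of the image extensions listed above, appends .gt.txt, and writes the single-line transcription next to the image.

```python
from pathlib import Path

# Image extensions from the quoted rules above; longer suffixes come first so
# that "check.bin.png" maps to "check.gt.txt" rather than "check.bin.gt.txt".
IMAGE_SUFFIXES = (".bin.png", ".nrm.png", ".png", ".tif")


def gt_path_for(image_path: Path) -> Path:
    """Return the .gt.txt path that belongs to a line image."""
    name = image_path.name
    for suffix in IMAGE_SUFFIXES:
        if name.endswith(suffix):
            return image_path.with_name(name[: -len(suffix)] + ".gt.txt")
    raise ValueError(f"unsupported image extension: {name}")


def write_ground_truth(image_path: Path, text: str) -> Path:
    """Write a single-line transcription next to the image and return its path."""
    if "\n" in text.strip():
        raise ValueError("transcription must be a single line")
    target = gt_path_for(image_path)
    target.write_text(text.strip() + "\n", encoding="utf-8")
    return target
```

With 20K cropped check images and their MICR text already available, a loop over (image path, text) pairs calling write_ground_truth would produce the transcription files without any manual box editing.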

Shree Devi Kumar

Oct 13, 2023, 7:54:31 AM

to tesseract-ocr

Keith Smith

Oct 13, 2023, 10:59:59 AM

to tesser...@googlegroups.com

Thanks Shree for the clarification. I'll give it a try. I was following https://github.com/tesseract-ocr/tessdoc/blob/main/tess5/TrainingTesseract-5.md and obviously misunderstood.

Keith Smith

Oct 18, 2023, 11:27:58 AM

to tesser...@googlegroups.com

I tried using tesstrain but am now getting 0% accuracy, so any help with what I'm doing wrong or misunderstanding would be greatly appreciated.

Specifically, here is what I did given my 20K check images and data from my x9.37 file. For each check, I

1. cropped the image so that it included only the bottom of the check with the MICR line

2. generated the gt.txt file based on the values for the check from the x9.37 file associated with the MICR line

3. ran "make training MODEL_NAME=micr_e13b" until it terminated. The BCER was at about 34%.

I then used the resulting micr_e13b.traineddata file, but it yielded dismal results. So I looked at the box files that were generated, and in each of them every character had the same coordinates, which covered the entire image area.

So I looked at the generate_line_box.py script, and it seems that this is what it is coded to do, judging from https://github.com/tesseract-ocr/tesstrain/blob/main/generate_line_box.py#L26

Shouldn't the box file coordinates be different for each character?

Thanks,

Keith
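
For reference, the box-file layout described above (every character sharing one box that spans the whole line image) can be reproduced in a few lines of Python. This is a simplified sketch of that output, not the actual generate_line_box.py: the function name line_boxes is hypothetical, combining characters are ignored, and the trailing tab entry is an assumption about how the end of the line is marked. Each entry follows the usual box-file order: symbol, left, bottom, right, top, page.

```python
from typing import List


def line_boxes(text: str, width: int, height: int) -> List[str]:
    """Return box-file lines that give every character the whole-line box."""
    boxes = [f"{char} 0 0 {width} {height} 0" for char in text]
    # Assumed end-of-line marker: a tab with a box just outside the image.
    boxes.append(f"\t {width} {height} {width + 1} {height + 1} 0")
    return boxes
```

For a 300x40 pixel line image with the text "12", this yields "1 0 0 300 40 0" and "2 0 0 300 40 0" followed by the tab entry. The name of the script suggests that line-level boxes are what it is meant to produce; the thread does not establish whether they are related to the poor results.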

Des Bw

Nov 1, 2023, 8:06:20 AM

to tesseract-ocr

I am not sure whether you are supposed to use those box files for training purposes. All the guides and manuals I have read use either the text2image script or the manual method (which is presumably outdated).
