Confused about whether Otsu thresholding affects LSTM training


kotom...@gmail.com

Apr 2, 2019, 9:56:59 PM
to tesseract-ocr
Sorry to disturb you again. I sent my issue before, but no one answered. I really need your help.


I went through the source code and found that Tesseract does Otsu thresholding and puts the binary pix in the Thresholder object.
But it seems the Thresholder object is not invoked if I use the LSTM engine.
The same goes for DPI: the Tesseract wiki says 300 dpi works best. That is a requirement for the Tesseract 3.0 engine or even earlier, right?
If I train the LSTM version of Tesseract, it doesn't matter whether I binarize the images or resize them to a particular DPI, right?
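For reference, the internal binarization step the question refers to can be approximated outside Tesseract. Below is a minimal, self-contained sketch of Otsu's method in plain numpy (this is an illustration of the algorithm, not Tesseract's actual implementation; the synthetic image is just a stand-in for a real scan):

```python
import numpy as np

def otsu_threshold(gray: np.ndarray) -> int:
    """Return the Otsu threshold: the level that maximizes the
    between-class variance of the two resulting pixel classes."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0, w1 = prob[:t].sum(), prob[t:].sum()
        if w0 == 0 or w1 == 0:
            continue  # one class empty: no valid split at this level
        mu0 = (np.arange(t) * prob[:t]).sum() / w0
        mu1 = (np.arange(t, 256) * prob[t:]).sum() / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

# Synthetic bimodal "page": dark text-like block on a light background.
gray = np.full((64, 64), 200, dtype=np.uint8)
gray[20:40, 10:50] = 50
t = otsu_threshold(gray)
binary = np.where(gray > t, 255, 0).astype(np.uint8)
```

In practice one would call `cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)` instead of writing the loop by hand.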


I would greatly appreciate any response. Thank you so much!

Du Kotomi

Apr 3, 2019, 3:01:49 AM
to tesseract-ocr
Anybody here?

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/f1004b09-daa5-4d6b-909b-ad8eac267d34%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Shree Devi Kumar

Apr 3, 2019, 4:36:18 AM
to tesser...@googlegroups.com
Usually for LSTM training we use synthetic images created by the text2image program from training text and fonts, via tesstrain.sh or tesstrain.py. Hence there is no question of binarization or DPI, as the program creates images in the form the Tesseract training process expects.




--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Du Kotomi

Apr 3, 2019, 5:08:20 AM
to tesser...@googlegroups.com
If we use the text2image tool, there is no such problem.

What about training with our real data? I have enough images for training. Do I need to do some preprocessing, such as binarization or DPI resizing, before LSTM training?

Shree Devi Kumar

Apr 3, 2019, 6:51:22 AM
to tesser...@googlegroups.com
I haven't trained with real images. I would guess that training images should be similar to what you will be using for OCR. It might be best to test with a small set of images and see what works best for you.



Du Kotomi

Apr 3, 2019, 7:09:38 AM
to tesser...@googlegroups.com
Thank you for your kind reminder. I have done this, and something confusing happens. I trained my model on grayscale images without any 300 dpi resizing. It does not do well when I validate on some of my test data, but if I resize the test images to 300 dpi, the results are better. In fact, an image is resized to a fixed size before going through the network. That is, for the same text in the same image, no matter how I resize it, the same fixed-size image is fed into the network. So why does image size matter so much here?

Besides, the reason I use grayscale is that I have seen the input of the provided Tesseract LSTM model is 1 channel deep, not 3. My real data is very complicated: screenshots from videos.
Would it be better if I used color images to train the model?

Lorenzo Bolzani

Apr 3, 2019, 9:17:15 AM
to tesser...@googlegroups.com
Hi, I train with real data. I use grayscale images; I think color makes no difference.

I do a very thorough image cleanup: background removal, denoising, straightening, sharpening, illumination correction, contrast stretching, etc., before passing the text to Tesseract. This part is likely better done on color images (you can split into RGB/HSV channels depending on what you need).

So my final output is already almost "binary" and I do not do any real binarization/thresholding. I'm not sure whether Tesseract does it or not, but the difference would be minimal.

All the images are rescaled so that the text always has the same height, about 35-40 px, with no border or a small (1-2 px) border. Try with an evaluation set and see what works best for you.


Bye

Lorenzo

Shree Devi Kumar

Apr 3, 2019, 12:31:20 PM
to tesser...@googlegroups.com
Hi Lorenzo,

Do you have a script for image pre-processing? Please share, if possible. It will be helpful to many.



Du Kotomi

Apr 3, 2019, 8:01:09 PM
to tesser...@googlegroups.com
Thank you so much for sharing.

That sounds like a very complicated cleanup. It would be very useful if you could provide a preprocessing script. I am also wondering: are there thresholds that depend on the individual images?

By the way, I have read some papers about LSTM+CTC for OCR. The advantage of such techniques comes from deep learning: the convolutions can learn arbitrarily complicated features, so in theory there should be no need for such preprocessing. What do you think about this?

Lorenzo Bolzani

Apr 8, 2019, 8:56:07 AM
to tesser...@googlegroups.com
Hi Shree,
I'd love to but it is a commercial project I'm working on so I cannot share the current solution.

I will try to find the old scripts I used for the first attempts. Basically it was something like this:

- normalize lightness
- make illumination uniform (CLAHE on HSV "V" channel)
- denoise/divide to remove background (with custom level based on noise estimation)
- normalize text size for a fixed value
- remove "dust" with morphological operations
- remove light gray shades with a "soft threshold"
- stretch contrast/histogram
- straighten text (and dewarp for very long lines)

I used OpenCV and PIL.

The main problem is that a ton of fine-tuning is required for each of these steps if the inputs are random pictures from smartphones, scanners, etc.
It also depends on how noisy the background is, and on whether color can be used as a hint for background detection. For example, converting the image to HSV makes it very simple to remove colored noise or a colored background: you select the parts with high saturation using a numpy mask and set them to white or black depending on their luminance.

Measuring noise, blurriness, contrast, etc. helps you decide which processing to apply, or to apply it proportionally to the measured value.

Many fine tuning values also depend on the image/text size.

Gaussian difference and divide are the best ways I found for general cleanup.

Sometimes multiply works great for enhancing details in low-contrast images.

I can try to put together a small sample script, because there are not many around, or at least not many that are easy to find. I don't have much time, but I'll try.



Bye

Lorenzo


Lorenzo Bolzani

Apr 8, 2019, 9:20:13 AM
to tesser...@googlegroups.com
Hi,
yes, at the very least you can use an adaptive threshold method, like Otsu, to find the best parameters. But Otsu has its own parameters, so you need to fine-tune those too (a little).

What worked best for me was first to do a rough normalization of the images (lightness, contrast) and then do the thresholding. To do this you have to measure the current brightness and/or do a CLAHE adaptive correction.


I think tesseract is an LSTM+CTC based solution. I think by default it uses one convolutional layer (https://github.com/tesseract-ocr/tesseract/wiki/VGSLSpecs).

So yes, theoretically it could do the cleanup and the text conversion too.

Maybe a single conv layer is not enough; you may need more. And I would start from scratch with a ton of synthetic data mixed with real data (+augmentation).

Is it going to work better than external cleanup plus fine-tuning? I do not know; it obviously depends on the specific data.


Note: maybe there is some automatic pre-processing that does thresholding internally before feeding the data to the NN. If that is the case, it obviously needs to be disabled.

BTW: I'm seeing right now that tesseract can accept a three-channel input, but I do not know how the pre-trained models are configured.



Bye

Lorenzo

Shree Devi Kumar

Apr 10, 2019, 12:52:23 PM
to tesser...@googlegroups.com
Hi Lorenzo,
Thanks for the detailed description of the pre-processing steps. I will link it from the wiki so that it is available for easy reference.
Thank you for sharing.