Allow me to elaborate in this thread on some general image-processing questions. I also include one example solution at the end to justify this email.
Personally, I do not think these questions belong in this forum, because tesseract already does a great job at segmentation when you have no additional information about the input document set. Can it be improved? Definitely, but the price/performance ratio is poor, and I would rather see the authors/committers focus on other things than the handling of very specific documents.
That being said, if you really want high(er) precision, you simply have to do image processing.
I have seen references to OpenCV quite a lot, but no matter how great that library is, for document image processing my suggestion is to use Leptonica (https://github.com/DanBloomberg/leptonica/). Yes, the one tesseract uses internally. That library is very powerful and super fast, even without CPU/GPU magic. I have to admit I do not understand why it is not much more popular and more widely used by anyone who is, or has to be, at least a bit serious about document image processing.
The basic keywords you should understand before attempting any processing are: connected components, basic morphological operations (dilate, erode, open, close), structuring elements, and seed fills. With their rather simple usage, many questions in this forum could be answered (at least in a hardcoded way). The reason there are only a few helpful answers might be that writing them takes considerable time; I believe some people have internal frameworks where this can be done easily, but they cannot share them.
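To make the first two keywords concrete, here is a toy, pure-Python sketch (not Leptonica; the function names and the tiny bitmap are mine, for illustration only) of 4-connected component labeling and a 3x3 brick dilation. Leptonica's `pixConnComp` and `pixDilateBrick` do the same things, far faster, on real images; note how dilation merges nearby blobs into one component, which is a common trick for grouping letters into words or lines:

```python
# Toy illustration of connected components and morphological dilation
# on a tiny binary "image" (1 = black/ink pixel). Illustration only;
# for real documents use Leptonica (pixConnComp, pixDilateBrick, ...).
from collections import deque

def connected_components(grid):
    """Count 4-connected groups of 1-pixels via flood fill."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == 1 and not seen[r][c]:
                count += 1
                seen[r][c] = True
                q = deque([(r, c)])
                while q:
                    y, x = q.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and grid[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            q.append((ny, nx))
    return count

def dilate(grid):
    """Dilation with a 3x3 brick structuring element."""
    rows, cols = len(grid), len(grid[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = r + dy, c + dx
                    if 0 <= ny < rows and 0 <= nx < cols and grid[ny][nx]:
                        out[r][c] = 1
    return out

img = [[1, 1, 0, 0, 1],
       [1, 0, 0, 0, 1],
       [0, 0, 0, 0, 0],
       [0, 1, 1, 0, 0]]
print(connected_components(img))          # 3 separate blobs
print(connected_components(dilate(img)))  # 1: dilation merged them
```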
Furthermore, the current (LSTM-based) traineddata are very good, but you will find (even simple) examples where they do not perform well, and you have to either do image processing or retrain (or use an older version that relies on different properties). Have a look at these simple images:
Download the Latin best traineddata and do OCR for both images, e.g.,
tesseract -l Latin --psm 8 --oem 1 ./t1.png stdout
and you should get `MMEA` vs `MEA`.
Well, this might not be the best example, but I hope it illustrates the point.
Answer to the original question
In order to keep this message "short", I will stop here and point you to a
and
The code uses Leptonica. It prepares the image by scaling, deskewing, and binarizing it, then (very) roughly tries to find possible letter descenders of Latin text on a line (here you could traverse the line column by column and look for black pixels above/below the baseline), finds the lines, and computes the result. It is far from perfect, but the result is usable.
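To illustrate just the binarization step of that pipeline, here is a toy pure-Python global Otsu threshold (a sketch only; the function name and sample values are mine, and real code would use Leptonica's thresholding routines on an actual PIX rather than a Python list):

```python
# Toy global Otsu thresholding on a flat list of 8-bit gray values.
# Illustration only; real document binarization should use Leptonica.
def otsu_threshold(pixels):
    """Return the threshold maximizing between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    sum_b = 0.0   # running sum of background-class values
    w_b = 0       # running background-class pixel count
    best_t, best_var = 0, -1.0
    for t in range(256):
        w_b += hist[t]
        if w_b == 0:
            continue
        w_f = total - w_b
        if w_f == 0:
            break
        sum_b += t * hist[t]
        m_b = sum_b / w_b
        m_f = (sum_all - sum_b) / w_f
        var = w_b * w_f * (m_b - m_f) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t

gray = [12, 15, 10, 200, 210, 205, 14, 198]   # dark ink vs light paper
t = otsu_threshold(gray)
binary = [1 if p <= t else 0 for p in gray]   # 1 = "ink" pixel
print(binary)  # prints [1, 1, 1, 0, 0, 0, 1, 0]
```

With the image binarized, the descender search described above reduces to scanning columns of this 0/1 data below the detected baseline.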
Kind Regards,
Jozef