Hello, I am using Tesseract for a project (I'm not used to OCR) and I am running into some issues that I haven't been able to resolve with the documentation, so I'm asking here since I saw that questions can be posted. I am coding in Python, in Jupyter, using pytesseract.
I would like to extract the text contained in tables (see img1). Most of these tables come from scanned PDFs, which is why I am using OCR. To test the accuracy and tune Tesseract, I applied it to individual cells: I created PNG images at different sizes and ran several preprocessing tests. I have three types of images: table titles (title.png), cells with spaced-out text (light_cell.png), and cells with tight text (cellXXprcent.png). This is where I run into problems I cannot solve:
For the cells with tight text (cellXXprcent.png; the 3 images are only a small sample of all the formats I tested), I cannot get good results, even on heavily zoomed-in text of good quality, or on text of a reasonable size (about 30 px high) but of average quality. I have tried changing the image size in several ways (scaling directly from the PDF with PyPDF2's scaleBy method, saving at 300 DPI, and resizing the PNGs with OpenCV) and various preprocessing steps (thresholding, erosion, dilation, opening, top-hat, with different sizes of elliptical and rectangular kernels) without really improving the accuracy. I think I have applied everything the documentation recommends: binarization, a 5- and 10-pixel border around the image, the images are not noisy and not skewed, and there is no alpha channel. I have also tested different OEMs and PSMs and tried disabling the Tesseract dictionaries, since my text is not made of dictionary words (should I write "-c load_system_dawg=false -c load_freq_dawg=false" or "-c load_system_dawg=false+load_freq_dawg=false" in the config? both seem to work, so I don't know which format is correct). Is there a solution I haven't tried yet?
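To make it concrete, here is a simplified version of what I am doing (file names and the exact parameter values are placeholders; I vary the PSM/OEM, kernel sizes and scale factors between tests):

import cv2
import pytesseract

# Load one cell as grayscale and upscale it so the text is roughly 30 px high or more.
img = cv2.imread("cell30prcent.png", cv2.IMREAD_GRAYSCALE)
img = cv2.resize(img, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)

# Binarize (Otsu) and clean up with a small morphological opening.
_, img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
img = cv2.morphologyEx(img, cv2.MORPH_OPEN, kernel)

# Add a 10-pixel white border around the cell.
img = cv2.copyMakeBorder(img, 10, 10, 10, 10, cv2.BORDER_CONSTANT, value=255)

# OCR the single line, with the dictionaries disabled.
config = "--oem 1 --psm 7 -c load_system_dawg=false -c load_freq_dawg=false"
print(pytesseract.image_to_string(img, lang="eng", config=config))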
I also have more general questions:
Would training Tesseract on my own images be the solution, and if so, can I train it on large images or does it need a specific size? I haven't done any training myself yet because my images are of good quality and the font isn't particularly unusual.
How can I add Greek script to the parameters so that lambda is detected, without disrupting the recognition of English characters? Currently, when I pass -l eng+greek, some English characters get recognized as Greek characters; I would like only lambda to be recognized as Greek. Could the whitelist argument be a solution?
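Something like this is what I had in mind (just a guess on my part; I am assuming the Greek traineddata is called ell, and I don't know whether tessedit_char_whitelist is honoured by every OEM):

import pytesseract

# Restrict recognition to the ASCII characters I expect plus the two lambdas.
whitelist = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzΛλ"
config = f"--psm 7 -c tessedit_char_whitelist={whitelist}"
print(pytesseract.image_to_string("light_cell.png", lang="eng+ell", config=config))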
Thank you very much in advance if you take the time to answer me, and have a good weekend.