Hello, I am using Tesseract for a project (I'm not used to OCR) and I am running into some issues that I haven't been able to resolve with the documentation, so I'm asking here since I saw that questions can be posted. I am coding in Python, in Jupyter, using pytesseract.
I would like to extract the text contained in tables (see img1). Most of these tables come from scanned PDFs, which is why I am using OCR. To test the accuracy and tune Tesseract, I applied it to individual cells: I created PNG images at different sizes and ran several preprocessing tests. I have three types of images: table titles (title.png), cells with spaced-out text (light_cell.png), and cells with tight text (cellXXprcent.png). This is where I run into problems I cannot solve:
For the cells with tight text (cellXXprcent.png; the 3 images are only a small sample of all the formats I tested), I cannot get good results, even on heavily zoomed-in text of good quality, or on text of a reasonable size (about 30 px high) but of average quality. I have tried changing the image size in several ways (scaling directly from the PDF with PyPDF2's scaleBy method, saving at 300 DPI, and resizing the PNGs with OpenCV) and various preprocessing steps (thresholding, erosion, dilation, opening, top-hat, with different sizes of elliptical and rectangular kernels) without really improving the accuracy. I think I have applied everything the documentation recommends: binarization, a 5- and 10-pixel border around the image, the images are not noisy and not skewed, and there is no alpha channel. I have also tested different OEMs and PSMs and tried disabling the Tesseract dictionaries, since my text is not made of dictionary words (should I write "-c load_system_dawg=false -c load_freq_dawg=false" or "-c load_system_dawg=false+load_freq_dawg=false" in the config? both seem to work, so I don't know which format is correct). Is there a solution I haven't tried yet?
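To make it concrete, here is a simplified version of what I am doing (file names and the exact parameter values are placeholders; I vary the PSM/OEM, kernel sizes and scale factors between tests):

import cv2
import pytesseract

# Load one cell as grayscale and upscale it so the text is roughly 30 px high or more.
img = cv2.imread("cell30prcent.png", cv2.IMREAD_GRAYSCALE)
img = cv2.resize(img, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)

# Binarize (Otsu) and clean up with a small morphological opening.
_, img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
img = cv2.morphologyEx(img, cv2.MORPH_OPEN, kernel)

# Add a 10-pixel white border around the cell.
img = cv2.copyMakeBorder(img, 10, 10, 10, 10, cv2.BORDER_CONSTANT, value=255)

# OCR the single line, with the dictionaries disabled.
config = "--oem 1 --psm 7 -c load_system_dawg=false -c load_freq_dawg=false"
print(pytesseract.image_to_string(img, lang="eng", config=config))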
I also have more general questions:
Would training Tesseract on my own images be the solution, and if so, can I train it on large images or does it need a specific size? I haven't done any training myself yet because my images are of good quality and the font isn't particularly unusual.
How can I add Greek script to the parameters so that lambda is detected, without disrupting the recognition of English characters? Currently, when I pass -l eng+greek, some English characters get recognized as Greek characters; I would like only lambda to be recognized as Greek. Could the whitelist argument be a solution?
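Something like this is what I had in mind (just a guess on my part; I am assuming the Greek traineddata is called ell, and I don't know whether tessedit_char_whitelist is honoured by every OEM):

import pytesseract

# Restrict recognition to the ASCII characters I expect plus the two lambdas.
whitelist = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzΛλ"
config = f"--psm 7 -c tessedit_char_whitelist={whitelist}"
print(pytesseract.image_to_string("light_cell.png", lang="eng+ell", config=config))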
Thank you very much in advance if you take the time to answer me, and have a good weekend.