Using Tesseract as an OCR solution for blind people

Eigeldinger Simon

Apr 30, 2024, 1:59:05 PM

to tesser...@googlegroups.com

Hi all,

I just want to update the info I have about Tesseract.

I would need an OCR program that can recognize text in scanned documents.

Those are in jpg or multipage pdf format.

Pages may be upside down.

They also might contain images, tables and headings.

Can I recognize those pages out of the box with Tesseract?

Can Tesseract also recognize tables and headings?

A few years ago, one had to preprocess the images first.

Is that still the case?

Greetings,

Simon

Viet Thanh Sai

Apr 30, 2024, 2:07:32 PM

to tesseract-ocr

Hello sir,

I have read your project description.

I recently worked on an OCR project very similar to yours.

In that project, the OCR recognized text, numbers and symbols from a PDF draft.

Tesseract's OCR was good on its own, but to improve accuracy I preprocessed the images from the PDF with OpenCV.

In the end, the OCR results were very good.

You can test the OCR at this link with the attached PDF files.

http://35.78.80.226:5000/

I am sure I can help you with your project.

Thanks.

Kind Regards.

TWR-05P-049_M05L10A29_05L0170X.C06_20230907182557.pdf

03GI-27.1_C02_REV0.pdf

Misti Hamon

Apr 30, 2024, 2:44:07 PM

to tesser...@googlegroups.com

Image quality matters. Upside-down or sideways images really need to be rotated first. That is easy to do without opening an image editor; you just need to get into the JPG's metadata.
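When the JPG metadata carries no usable orientation, as may be the case for scanner output, Tesseract's own orientation and script detection (page segmentation mode 0) can report how far a page is rotated. The following is a minimal sketch of that alternative, not something described in this thread. It assumes the `tesseract` binary and its `osd` traineddata are installed; `parse_osd` and `detect_rotation` are hypothetical helper names.

```python
import re
import subprocess


def parse_osd(osd_text):
    """Pull the rotation angle and its confidence out of Tesseract OSD text.

    Returns None when no "Rotate:" line is present.
    """
    rotate = re.search(r"^Rotate:\s*(\d+)", osd_text, re.MULTILINE)
    conf = re.search(r"^Orientation confidence:\s*([\d.]+)", osd_text, re.MULTILINE)
    if rotate is None:
        return None
    return {
        "rotate": int(rotate.group(1)),
        "confidence": float(conf.group(1)) if conf else None,
    }


def detect_rotation(image_path):
    """Run `tesseract <image> stdout --psm 0` and parse what it reports.

    Assumption: the report may land on stdout or stderr depending on the
    Tesseract version, so both streams are searched.
    """
    result = subprocess.run(
        ["tesseract", image_path, "stdout", "--psm", "0"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_osd(result.stdout + "\n" + result.stderr)
```

For an upside-down page the reported angle should be 180, where the direction of rotation does not matter. For sideways pages, check the direction convention of whichever image library applies the rotation before trusting the result.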

It sounds like you are processing textbooks to turn them into something a screen reader can manage? Headers and such do get recognized, but you will probably have to post-process the Tesseract results: screen readers like hierarchical formats, so you will have to look at the output formats Tesseract provides and see whether one of them can be fed directly to a screen reader. It has problems with charts and tables, especially if they have a background color or row or column striping. If the images you are working with aren't evenly lit, or they are low DPI, there will be problems too. (Personal experience here: I have been processing textbook-format books myself.)
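One way such post-processing could look, as a rough sketch only: read Tesseract's hOCR output, take each line's bounding-box height, and treat lines noticeably taller than the typical line as heading candidates. The height heuristic, the `ratio` threshold, and the names `HocrLines` and `outline` are assumptions made for illustration, not part of Tesseract. The list of line classes is also an assumption about which hOCR classes mark line-level elements.

```python
import re
from html.parser import HTMLParser
from statistics import median

# Assumed set of hOCR classes that mark line-level elements.
LINE_CLASSES = {"ocr_line", "ocr_header", "ocr_caption", "ocr_textfloat"}


class HocrLines(HTMLParser):
    """Collect the text and bounding-box height of each hOCR line."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._current = None  # line currently being filled, if any
        self._tag = None      # tag name of the open line element
        self._depth = 0       # nesting depth of that tag inside the line

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        classes = set((attr.get("class") or "").split())
        if self._current is None and classes & LINE_CLASSES:
            box = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", attr.get("title") or "")
            height = int(box.group(4)) - int(box.group(2)) if box else 0
            self._current = {"height": height, "words": []}
            self.lines.append(self._current)
            self._tag, self._depth = tag, 1
        elif self._current is not None and tag == self._tag:
            self._depth += 1

    def handle_endtag(self, tag):
        if self._current is not None and tag == self._tag:
            self._depth -= 1
            if self._depth == 0:
                self._current = None

    def handle_data(self, data):
        if self._current is not None and data.strip():
            self._current["words"].append(data.strip())


def outline(hocr, ratio=1.3):
    """Label each line 'heading' or 'text' by height relative to the median."""
    parser = HocrLines()
    parser.feed(hocr)
    lines = [ln for ln in parser.lines if ln["words"] and ln["height"] > 0]
    if not lines:
        return []
    typical = median(ln["height"] for ln in lines)
    return [
        ("heading" if ln["height"] >= ratio * typical else "text", " ".join(ln["words"]))
        for ln in lines
    ]
```

The labelled lines could then be written out as HTML with real heading elements, which screen readers can navigate. On real scans the threshold will likely need tuning, and uneven lighting or low DPI will distort the line heights just as it distorts the recognized text.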

Eigeldinger Simon

May 2, 2024, 8:32:35 AM

to tesser...@googlegroups.com

Hi Misti,

Thanks for the info.

Will have a look at that.

Yes, getting a good picture as a blind person isn't all that easy.

Which output format would best preserve formatting, headings and other structure? hOCR?
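For comparing formats side by side, Tesseract can write several outputs from one run by naming more than one config file on the command line. A minimal sketch, assuming the `tesseract` binary is on the PATH and the stock `hocr` and `txt` configs are available; `tesseract_command` and `run_ocr` are hypothetical helper names:

```python
import subprocess


def tesseract_command(image_path, output_base, lang="eng", formats=("hocr", "txt")):
    """Build the argument list: tesseract IMAGE OUTPUTBASE -l LANG CONFIG..."""
    return ["tesseract", image_path, output_base, "-l", lang, *formats]


def run_ocr(image_path, output_base, lang="eng", formats=("hocr", "txt")):
    """Run Tesseract; it writes one file per requested format next to output_base."""
    subprocess.run(tesseract_command(image_path, output_base, lang, formats), check=True)
```

hOCR keeps bounding boxes and the block, paragraph and line grouping, while plain text does not. Whether any format marks headings reliably enough for a screen reader would need checking on real pages.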

Greetings,

Simon
