Train Tesseract (German)

testcoal

Apr 18, 2024, 12:52:50 PM

to tesseract-ocr

Hi,

I've been using Tesseract 4 to extract text from PNG and TIFF images, all of which contain German text. The image quality is fairly good, but the extraction results are poor for some of the images. I understand that training Tesseract with additional data is the recommended way to improve OCR accuracy.

However, I've hit a roadblock: I only have the images, without the exact text (ground truth) or bounding boxes. Creating this data manually seems like a massive undertaking. Do you recommend this as the best course of action? Or are there other solutions, or perhaps existing prepared datasets for German that I could use?

Also, I'm curious about the volume of training data required. Is there a minimum number of images and corresponding texts that you'd consider sufficient to start seeing improved results?

Any guidance or resources you can provide would be greatly appreciated.

Atef

Misti Hamon

Apr 18, 2024, 1:11:26 PM

to tesser...@googlegroups.com

Scanned books?

I can't help with training or choosing datasets. But if these images are photo-scanned book pages, did you run them through book-specific processing software first? scantailor, spreads, and bookscan wizard are the three I know of, plus the Internet Archive's scan tool scripts. These can split your source images into a mixed raster type and enhance the text with a thresholding algorithm. Thresholding (especially if you experiment a bit with its variables) can be extremely helpful when uneven lighting or similar issues make it hard for Tesseract to see the pixels that make up your letters as part of the letters.
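To illustrate why thresholding helps with uneven lighting, here is a minimal pure-Python sketch of local-mean (adaptive) thresholding. It is not the algorithm any of the tools above actually use; the function name and the `window` and `offset` parameters are assumptions made up for this example. In practice you would rely on one of those tools or on an image-processing library.

```python
def adaptive_threshold(gray, window=15, offset=10):
    """Binarize a grayscale image given as a list of rows of 0-255 ints.

    A pixel becomes black (0) if it is darker than the mean of its
    window x window neighbourhood minus `offset`; otherwise white (255).
    `window` and `offset` are illustrative tuning knobs.
    """
    height, width = len(gray), len(gray[0])

    # Integral image: integral[y][x] holds the sum of gray[0:y][0:x],
    # so any rectangular sum can be read off with four lookups.
    integral = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += gray[y][x]
            integral[y + 1][x + 1] = integral[y][x + 1] + row_sum

    half = window // 2
    out = []
    for y in range(height):
        y0, y1 = max(0, y - half), min(height, y + half + 1)
        out_row = []
        for x in range(width):
            x0, x1 = max(0, x - half), min(width, x + half + 1)
            area = (y1 - y0) * (x1 - x0)
            total = (integral[y1][x1] - integral[y0][x1]
                     - integral[y1][x0] + integral[y0][x0])
            # Same test as "pixel < local_mean - offset", multiplied
            # through by area so everything stays in integers.
            is_ink = gray[y][x] * area < total - offset * area
            out_row.append(0 if is_ink else 255)
        out.append(out_row)
    return out


def make_page(width=40, height=20):
    """Synthetic page: lighting fades left to right, with darker ink strokes."""
    page = []
    for y in range(height):
        row = []
        for x in range(width):
            background = 220 - 3 * x
            is_ink = x % 10 == 5 and 5 <= y < 15
            row.append(background - 80 if is_ink else background)
        page.append(row)
    return page


page = make_page()
binary = adaptive_threshold(page, window=7, offset=10)
```

On this synthetic page a single global cutoff (say 128) would turn the dim right-hand background black, while the local comparison picks out only the strokes, because each pixel is judged against its own neighbourhood rather than against the whole page.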
