Tesseract expects book-like text pages, i.e. black (dark) text on white (light, plain) background.
While tesseract includes a few preprocessing features for removing lines (drawings), generally it is better to have everything which is not text removed beforehand.
You mention cropping: that is one way to approach this, but tesseract works best when the "page" keeps a bit of white border (background color) around the text, so may I suggest you look into *masking* the non-text areas of the image instead?
The benefit of the masking approach is that the text (word, character) coordinates reported by tesseract (hOCR, TSV formats) will still match the original image, as masking does not change the image dimensions. Hence the key is to mask any in-page image content and have it replaced by the background color (white).
One tool which can do this for batches is ImageMagick. See https://legacy.imagemagick.org/Usage/masking/ for various ways to create and apply masks there.
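To make the idea concrete, here is a minimal sketch in plain Python, not a recipe for any particular tool. It assumes the image is already loaded as rows of 8-bit grayscale pixels and that you already know the rectangles to blank out. The names `mask_regions`, `Box` and `page` are made up for this illustration; in practice you would let ImageMagick or an imaging library do the same thing on real files.

```python
from typing import List, Tuple

Image = List[List[int]]          # rows of 8-bit grayscale pixels: 0 = black, 255 = white
Box = Tuple[int, int, int, int]  # (left, top, right, bottom); right and bottom are exclusive

WHITE = 255  # assumed background color


def mask_regions(image: Image, boxes: List[Box], fill: int = WHITE) -> Image:
    """Return a copy of `image` with every box painted in the background color."""
    height = len(image)
    width = len(image[0]) if height else 0
    masked = [row[:] for row in image]  # copy, so the original stays untouched
    for left, top, right, bottom in boxes:
        # Clamp each box to the image so out-of-range boxes are harmless.
        for y in range(max(top, 0), min(bottom, height)):
            for x in range(max(left, 0), min(right, width)):
                masked[y][x] = fill
    return masked


# Toy 6x4 "page": a dark drawing in columns 0-1, dark "text" in columns 4-5.
page = [
    [0, 0, 255, 255, 0, 0],
    [0, 0, 255, 255, 0, 0],
    [0, 0, 255, 255, 0, 0],
    [0, 0, 255, 255, 0, 0],
]

# Blank out the drawing only; the text keeps its original position.
masked = mask_regions(page, [(0, 0, 2, 4)])
```

The result has the same width and height as the input, so a pixel at (x, y) in the masked image is the same pixel in the original. Cropping would instead shift every coordinate by the crop offset.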
Ciao,
Ger