Hi tesseract community!
I've found an interesting scenario where a simple 4-digit number cropped from a PDF (i.e. from a region rendered from a vector font, not from an embedded bitmap) is OCR'd incorrectly. I used ImageMagick to extract a .png from the source PDF, like this:
convert -density 1600 -trim input.pdf[42] -rotate 90 +repage -crop 600x720+900+3400 crop.png
...and then used tesseract to OCR it:
tesseract crop.png stdout --psm 6
The digits "1552" in the source image are OCR'd as "15562".
You can try it for yourself like this:
tesseract 0swZuoU.png stdout --psm 6
The image as hosted on imgur is not bitwise-identical to crop.png, but the two are impossible to tell apart by eye. I can upload the original crop.png somewhere else, if necessary.
I'm using the latest commit (30ebb31f) of the tesseract engine, and I tried with the latest commits (4767ea9 & e2aad9b) of both tessdata and tessdata_best.
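In case it helps anyone narrow this down, here is a minimal sketch of how the page segmentation mode could be varied on the same crop. The function name psm_sweep is my own hypothetical helper, and the modes 6, 7, 8 and 13 are an arbitrary selection. I'm not claiming that any of them fixes the problem.

```shell
# Hypothetical helper: OCR one image under several page segmentation modes
# and print each result on its own line, prefixed with the mode used.
psm_sweep() {
  image="$1"
  for psm in 6 7 8 13; do
    printf 'psm %s: ' "$psm"
    # Strip newlines and the form-feed page separator so each result stays on one line.
    tesseract "$image" stdout --psm "$psm" 2>/dev/null | tr -d '\n\f'
    printf '\n'
  done
}
```

It would be called as "psm_sweep crop.png". Tesseract's diagnostics on stderr are discarded, so a mode that fails outright would just show an empty result.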
Can I do anything to improve the OCR result in this sort of scenario?
Chris