I'm preparing text images (JPG) for Tesseract OCR conversion to text files (TXT). I understand that it is important to resize my image docs so that capital letters are about 30-32 pixels in height. See "Optimal image resolution (dpi/ppi) for Tesseract 4.0.0 and eng.traineddata?"
My current procedure for measuring letter height is:
- Open image file
- Enlarge text (zoom in)
- Draw a vertical line parallel to the vertical stroke of a number or straight-edged letter
- Select Analyze>Set Scale (see image below)
How should I count the pixels? Do I count the 'half pixels', where the pixel 'block' is a half-tone? In other words, for my total count, should I estimate the true height by including these half-tones?
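To make the half-tone question concrete, here is a minimal sketch of one possible counting convention. It is an assumption on my part, not an established rule: each pixel in a vertical column through the stroke contributes its fractional darkness, so a mid-grey edge pixel counts as roughly half a pixel. The function name `estimate_height` and the grey levels used are hypothetical.

```python
def estimate_height(column, background=255, ink=0):
    """Estimate stroke height (in pixels) from one vertical column of grey values.

    Assumption: dark text on a light background, 8-bit greyscale.
    Each pixel adds its fractional darkness, so a half-tone
    (anti-aliased) edge pixel counts as about half a pixel.
    """
    span = background - ink
    if span == 0:
        raise ValueError("background and ink levels must differ")
    total = 0.0
    for value in column:
        coverage = (background - value) / span
        # Clamp so values outside the ink/background range do not over- or under-count.
        total += min(1.0, max(0.0, coverage))
    return total


# Hypothetical column: background, one half-tone pixel, four solid pixels,
# one half-tone pixel, background.
column = [255, 255, 128, 0, 0, 0, 0, 128, 255]
print(estimate_height(column))  # close to 5: four full pixels plus two halves
```

Under this convention the two half-tone pixels together add about one pixel, which would sit between counting them fully (6) and ignoring them (4).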
Does anyone have a better procedure than this?
My aim is to come up with a resizing ratio that I can apply to a large collection of text image files using a Python script. This is another step along the way to preparing docs for Tesseract.
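The ratio calculation I have in mind would look roughly like the sketch below. The target of 31 pixels is simply the midpoint of the 30-32 range mentioned above, and the helper names `resize_ratio` and `scaled_size` are hypothetical.

```python
def resize_ratio(measured_cap_height, target_cap_height=31.0):
    """Return the scale factor that brings capital letters to the target height."""
    if measured_cap_height <= 0:
        raise ValueError("measured capital height must be positive")
    return target_cap_height / measured_cap_height


def scaled_size(width, height, ratio):
    """Return the (width, height) of an image after applying the ratio."""
    # Round to whole pixels and never go below 1 pixel in either dimension.
    return max(1, round(width * ratio)), max(1, round(height * ratio))


# Hypothetical example: capitals measured at 18 px in a 1200 x 1600 scan.
ratio = resize_ratio(18)
print(ratio)
print(scaled_size(1200, 1600, ratio))
```

The resulting size tuple is what I would then pass to an image library's resize call (for example Pillow's `Image.resize`) for each file in the collection, assuming all files were scanned at the same resolution and font size.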
Any suggestions would be appreciated.