So I've done a more extensive analysis of Tesseract v4.0.0 accuracy vs. text size / resolution.
1. I used the released Tesseract v4.0.0 library
2. I used the English language training file (22.4 MB in size) from this folder.
3. I created bitmaps for OCR-ing in six different fonts, at 6 pts, 12 pts, and 24 pts in size, each across a wide range of dpi. The six fonts are shown in the attachment.
4. I used the text from the Declaration of Independence--approximately 6600 letters.
5. Overall, I OCR'd over 2 million English alphabet characters.
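For anyone who wants to reproduce the numbers, here is a minimal sketch of one way to score a single OCR run. It is an assumption on my part that "error rate" means character-level edit distance divided by the length of the reference text; the function names `levenshtein` and `char_error_rate` are made up for this example and are not part of Tesseract.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions and substitutions."""
    # Keep the shorter string in the inner loop to limit memory use.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def char_error_rate(reference: str, ocr_output: str) -> float:
    """Edit distance normalised by the reference length (assumed metric)."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, ocr_output) / len(reference)


if __name__ == "__main__":
    # One substitution (l read as 1) in a four-character word.
    print(char_error_rate("line", "1ine"))
```

Summing the edit distances over all bitmaps and dividing by the total number of reference characters would give an overall rate across the whole test set.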
The OCR error rate was most strongly correlated with the height of a capital letter in pixels, regardless of dpi or point size. See plot below.
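To make the "regardless of dpi or point size" part concrete, here is a rough sketch of how capital-letter height in pixels follows from point size and dpi. One point is 1/72 inch, so the em size in pixels is `point_size * dpi / 72`. The `cap_height_ratio` of 0.7 is only an assumed, typical value; the real ratio depends on the font and should be measured from the rendered bitmap.

```python
import math


def cap_height_px(point_size: float, dpi: float,
                  cap_height_ratio: float = 0.7) -> float:
    """Approximate capital-letter height in pixels.

    cap_height_ratio is an assumed fraction of the em size (font-dependent).
    """
    em_px = point_size * dpi / 72.0  # 1 pt = 1/72 inch
    return em_px * cap_height_ratio


if __name__ == "__main__":
    # Different point size / dpi pairs that give the same pixel height.
    for pts, dpi in [(6, 600), (12, 300), (24, 150)]:
        print(pts, dpi, cap_height_px(pts, dpi))
    same = math.isclose(cap_height_px(6, 600), cap_height_px(24, 150))
    print("same pixel height:", same)
```

Under this model, 6 pt at 600 dpi, 12 pt at 300 dpi, and 24 pt at 150 dpi all render capitals at the same pixel height, so by the finding above they would be expected to show similar error rates.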
The most common errors when the letter height got too large depended on the font, but included interpreting i as 1 (most common with serif fonts), f as t, confusing a semicolon with a colon, and interpreting a lower case letter as its capital letter (e.g. i/I, s/S, o/O, k/K).
I have to say, the fact that there is an optimum letter size (in pixels) is quite unexpected. I would have expected that the higher the resolution of the letter (the more pixels), the lower the error rate would be. I can't think of a good reason to expect otherwise. If the OCR algorithm has an optimal letter size and it can down-sample to that size, then it should do that. I hope that future revisions of Tesseract will address this.
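Until then, a possible workaround is to down-sample on the caller's side before handing the image to the OCR engine. The sketch below only computes the new image dimensions. `target_cap_px` is a hypothetical parameter that you would read off your own measurements, and `measured_cap_px` is assumed to have been measured on the input image. This is not something Tesseract does itself, as far as I know.

```python
def downsample_size(width: int, height: int,
                    measured_cap_px: float,
                    target_cap_px: float) -> tuple[int, int, float]:
    """Return (new_width, new_height, scale) so capitals land near the target.

    Never upsamples: if the text is already at or below the target height,
    the original size is returned with a scale of 1.0.
    """
    if measured_cap_px <= 0 or target_cap_px <= 0:
        raise ValueError("cap heights must be positive")
    scale = target_cap_px / measured_cap_px
    if scale >= 1.0:
        return width, height, 1.0
    new_width = max(1, round(width * scale))
    new_height = max(1, round(height * scale))
    return new_width, new_height, scale


if __name__ == "__main__":
    # Capitals measured at 60 px, hypothetical target of 30 px.
    print(downsample_size(2000, 1000, measured_cap_px=60, target_cap_px=30))
```

The resulting size could then be passed to an image library's resize call (for example Pillow's `Image.resize`) with a good resampling filter before running OCR.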
Next step: See if this also occurs on Tesseract v3.0.5 with Cube training data.
Here is the plot: