Re image size, etc.: see:
for *why* resizing and similar preprocessing are often beneficial to OCR confidence numbers and output quality.
Quick meta-question though, as I am quite surprised that people feed financial data into any kind of (fundamentally statistical and thus noise-injecting) OCR process, not just tesseract but any and all of them out there: wouldn't it be more business-smart to scrape financial performance reports like these, or better yet, to get a direct data export from the SAP software at that company? That way you forego the entire risky and costly machine-render-text-to-image + image-to-text OCR round trip altogether.
That financial performance stuff is usually reported in PDF/A format for obvious reasons (chamber of commerce, stock exchange, investors: the kinds of folks who all like their data as *virginal* as can be). When you grab that output, you're one straight text extract away from success, instead of wrangling a risky OCR process chain which, by definition, cannot deliver a 100% accurate reconstruction every time.
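Before reaching for OCR at all, it is worth checking whether a given PDF even needs it. A minimal sketch of that idea, as a rough byte-level heuristic (not a real PDF parser, and the sample fragments below are made up for illustration): born-digital reports embed font resources and text-drawing operators, while scan-only PDFs mostly contain image XObjects.

```python
# Rough heuristic: a PDF with /Font resources and BT...ET text-drawing
# blocks very likely has an extractable text layer, so plain text
# extraction beats OCR. A scan-only PDF typically carries just /Image
# XObjects. This is a sketch, not a substitute for a proper PDF library.

def likely_has_text_layer(pdf_bytes: bytes) -> bool:
    """Heuristic: font resources plus text operators suggest extractable text."""
    return b"/Font" in pdf_bytes and b"BT" in pdf_bytes

# Tiny illustrative fragments (not complete, valid PDFs):
born_digital = (
    b"%PDF-1.7 ... /Type /Page /Resources << /Font << /F1 5 0 R >> >>"
    b" ... BT (Total: 1,234,567.00) Tj ET"
)
scanned_only = (
    b"%PDF-1.4 ... /Type /XObject /Subtype /Image /Filter /DCTDecode ..."
)

print(likely_has_text_layer(born_digital))  # True: go straight to text extraction
print(likely_has_text_layer(scanned_only))  # False: only here does OCR enter the picture
```

On a real file you would feed this the raw file bytes; a positive result is your cue to try a text extractor (pdftotext, pypdf, and friends) before ever rendering pages to images.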
As this clearly is corporate financial data you're processing (and we can thus safely assume the reported data will be fed into follow-up processes where the actual numbers are of some import), I would expect nobody involved to appreciate the implicit risk introduced by injecting an inherently noisy statistical filter into the number-crunching process. That opens you up to a permanent, clear and present risk of random number-value inaccuracies, thanks to the nature of any neural net's output.
You're certainly not the only one attempting to apply OCR to financial data around here (the mailing list is brimming with it), but when I see annual / quarterly corporate performance reports being processed like that, I start to worry a wee bit more than usual. Not for tesseract (it does its job just fine), but for whoever came up with the idea to plonk such data into an image file and feed it to any kind of OCR machinery. That sounds like a failed due diligence exercise to me: the question should have been asked whether this data can be obtained in any kind of text format straight from the source, since it is company- and machine-produced already. txt, csv, pdf, excel, anything? At what cost?
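To make the "straight from the source" point concrete, here is a minimal sketch of the alternative path: a machine-produced CSV export needs no OCR at all, so the numbers arrive bit-exact. The export layout and column names below are invented for illustration only.

```python
# Sketch: consuming a hypothetical direct CSV export instead of OCR'ing
# a rendered report. Every digit is exactly what the source system wrote;
# there is no statistical noise anywhere in the chain.
import csv
import io

# Stand-in for a file exported by the accounting / SAP system:
export = io.StringIO(
    "quarter,revenue_eur,net_income_eur\n"
    "2023-Q1,1250000.00,87500.50\n"
    "2023-Q2,1310000.00,91200.25\n"
)

rows = list(csv.DictReader(export))
total_revenue = sum(float(r["revenue_eur"]) for r in rows)
print(total_revenue)  # 2560000.0
```

One `csv.DictReader` call and you're done; compare that with the render-to-image, OCR, and post-hoc verification steps the image route would require for the exact same numbers.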
Or can't you get the text data (why not?! if you can get the page images, it's published material, correct?), and do you instead intend to use tesseract / your OCR process as an *assistive process*, where the OCR output is reviewed / vetted by a human before it is deemed of sufficient quality for further use?