Hi,
More on this later (I seem to still have issues posting with attachments here, plus running into a few surprises while doing bulk testing, so this is preliminary):
1. Don't use lossy image file formats if you can avoid it, so PNG is better than JPEG. From what I see, if you need lossy compression due to storage limitations, WebP seems to be better than JPEG. This has to do with the type of noise JPEG introduces ("JPEG artifacts").
2. Scale the input image (resize; use ImageMagick or another tool to do this in bulk) so that capital letters are approximately 30px high on each line. That's the ballpark; do try a couple of scales near that measure, e.g. test with a set of images scaled 5% off in either direction to see which scale is 'optimal' for you. It can help to then run an additional test set with scales in a 1-2% geometric range (i.e. the next scale to try is 102% of the previous, smaller test size).
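A rough sketch of how the candidate scale list for that test could be generated. The function names and the measured cap height of 20px are made up for illustration; only the 30px target, the 5% coarse step and the 102% fine step come from the description above.

```python
def coarse_scales(cap_height_px, target_px=30.0, step=0.05, n_steps=2):
    """Scale factors around the one that maps cap_height_px onto target_px.

    The centre factor is target_px / cap_height_px; the neighbours are
    that factor shifted by multiples of `step` (5% by default).
    """
    base = target_px / cap_height_px
    return [round(base * (1 + step * k), 4) for k in range(-n_steps, n_steps + 1)]


def fine_scales(best_scale, ratio=1.02, n_steps=3):
    """Geometric series around best_scale: each entry is `ratio` times the previous one."""
    return [round(best_scale * ratio ** k, 4) for k in range(-n_steps, n_steps + 1)]


if __name__ == "__main__":
    # Hypothetical input: capital letters measured at 20px in the source scan.
    coarse = coarse_scales(20)
    print("coarse pass:", coarse)
    # Pretend the middle candidate scored best, then refine around it.
    print("fine pass:  ", fine_scales(coarse[len(coarse) // 2]))
```

Each factor can then be handed to your resize tool as a percentage (ImageMagick's `-resize` accepts values such as `150%`), producing one scaled copy of every test image per candidate.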
How to check: produce both hOCR and TSV output with character confidence reporting turned on (tesseract's hOCR output for character confidence is broken; those numbers only show up in the TSV), then read those files and check both the character and word confidence values reported by tesseract. Pick the scaling + misc. preprocessing that gives you the highest numbers there on average for your test set.
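As a starting point for that comparison, here is a minimal sketch that averages the word-level `conf` column of a TSV file. It assumes the standard tab-separated layout with a header row containing `conf` and `text`, where rows that are not words carry a confidence of -1; character-level values depend on your build and configuration and are not handled here. The function names and the sample data are invented for illustration.

```python
import csv
import io


def mean_word_confidence(tsv_text):
    """Average the `conf` column over rows that hold a recognised word."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    confs = []
    for row in reader:
        raw = row.get("conf")
        text = (row.get("text") or "").strip()
        if raw is None or not text:
            continue  # structural row (page, block, line) or empty word
        conf = float(raw)
        if conf >= 0:  # -1 marks rows without a confidence value
            confs.append(conf)
    return sum(confs) / len(confs) if confs else 0.0


def best_variant(tsv_by_variant):
    """Return the key (e.g. a scale label) with the highest mean confidence."""
    return max(tsv_by_variant, key=lambda k: mean_word_confidence(tsv_by_variant[k]))


HEADER = "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\tleft\ttop\twidth\theight\tconf\ttext\n"
# Made-up sample: one page row plus two word rows.
SAMPLE = (
    HEADER
    + "1\t1\t0\t0\t0\t0\t0\t0\t100\t50\t-1\t\n"
    + "5\t1\t1\t1\t1\t1\t5\t5\t20\t10\t90\tsusp\n"
    + "5\t1\t1\t1\t1\t2\t30\t5\t20\t10\t70\tiss\n"
)

if __name__ == "__main__":
    print(mean_word_confidence(SAMPLE))
```

In practice you would read one TSV file per image and per preprocessing variant, average across the whole test set, and only then compare variants; a single image is too noisy a signal.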
After that, it depends...
BTW: to my eye your image isn't noisy and you mention noise, hence: you got a few rotten ones for us? ;-)
Re noise and preprocessing: what I find helps is killing (masking) all noise that sits more than a few pixels away from any character, particularly when you are processing low-DPI / JPEG input. This must be done before feeding the image to tesseract: current tesseract does thresholding, etc. to detect where the text (words) is, but the latest engine (LSTM) is fed the raw input pixels, so any useless noise ends up in there and degrades the output.
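To make the masking idea concrete, here is a toy sketch in plain Python, not what I actually run. It assumes a non-empty, already binarised image given as a list of rows (1 = ink, 0 = background), treats any connected blob of at least `min_area` pixels as a character, and erases everything further than `margin` pixels from such a blob. Both parameter names and their defaults are made up; on real scans you would use an image library and apply the resulting mask to the original greyscale pixels.

```python
from collections import deque


def mask_far_noise(ink, min_area=4, margin=2):
    """Erase ink that is more than `margin` pixels away from any large blob."""
    h, w = len(ink), len(ink[0])
    seen = [[False] * w for _ in range(h)]
    keep = [[False] * w for _ in range(h)]  # zone around character-sized blobs

    for y in range(h):
        for x in range(w):
            if not ink[y][x] or seen[y][x]:
                continue
            # Flood fill (8-connected) to collect one connected component.
            comp = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                comp.append((cy, cx))
                for ny in range(max(0, cy - 1), min(h, cy + 2)):
                    for nx in range(max(0, cx - 1), min(w, cx + 2)):
                        if ink[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            if len(comp) < min_area:
                continue  # too small to count as a character
            # Mark a square neighbourhood of `margin` pixels around the blob.
            for cy, cx in comp:
                for ny in range(max(0, cy - margin), min(h, cy + margin + 1)):
                    for nx in range(max(0, cx - margin), min(w, cx + margin + 1)):
                        keep[ny][nx] = True

    # Ink survives only inside the keep zone; everything else becomes background.
    return [[1 if ink[y][x] and keep[y][x] else 0 for x in range(w)] for y in range(h)]
```

Small specks close to a character (dots on an i, punctuation) stay in because they fall inside the keep zone; isolated specks in the margins or between lines are wiped.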
TLDR:
- scale
- denoise
- enhance contrast (not necessary in your case)
- ... other means to make the image more legible, anything goes ...
- dictionary, etc. for tesseract or for post-processing: I see you've got jargon in there (susp, iss, ...) that isn't made up of regular English dictionary words, so it might help to use a custom dict (but I don't have hard data on that one myself yet)