6000*4500?!
Hm, sounds way too large for a simple text.
I'm guessing here, but it might be that you got
thwarted by the various "dpi" notes re ocr/tesseract out
there.
Bottom line: IIRC tesseract was trained on text
of around 30px high. (Note that I use PX = pixels as the
relevant unit of measure; I don't care about dpi because
that's something only really relevant to printing press people:
desktop publishing, etc.)
While a lot of folks hang onto dpi as a unit of
measure, it's derivative and only relevant when you scan
printed pages, which turns "points" (and picas and ...) into
pixels, which is where dpi pops up.
Anyway, the key bit for every image you feed to
an ocr engine like tesseract is to match the "x
height" of the training material as closely as possible,
to give yourself the best chance at a good/optimal match.
For tesseract, this means you should aim for
each line of text to be somewhere between 20 and 50 pixels
high (and as clean looking in black & white / greyscale as
possible, but that comes second, after getting that line
height to the 20-50px range). Computers work in PX, not DPI, so
it's PX that's the driving criterion.
Since you mention "picking out a date" I ASSUME
your text area is one line of text only.
Drop all image areas that do not contain text.
Make sure the text is black on a white
background (you may need to invert your image when this is a
video grab or some such, for example).
There's a long wiki page about improving image
quality for tesseract processing too.
But first try to extract that line of text,
scale it so the digits are between 20-50px high and try some
sizes within that range.
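
To make the "try some sizes within that range" step concrete, here is a minimal sketch of the arithmetic only. It assumes you have already cropped the line and measured how tall the digits are in pixels; the function names and the example numbers (an 1800x240 crop with 180px digits) are made up for illustration, and the actual resize is left to whatever image tool you use (ImageMagick, for instance).

```python
def candidate_scales(text_height_px, targets=(20, 30, 40, 50)):
    """Return (target_height, scale_factor) pairs for a measured text height.

    text_height_px is the current height of the digits in pixels;
    targets are the heights to try, all inside the 20-50px range.
    """
    if text_height_px <= 0:
        raise ValueError("text_height_px must be positive")
    return [(t, t / text_height_px) for t in targets]


def scaled_size(width, height, scale):
    """Image size after uniform scaling, rounded to whole pixels (minimum 1)."""
    return max(1, round(width * scale)), max(1, round(height * scale))


# Hypothetical example: a cropped date line of 1800x240 px whose digits
# measure about 180 px tall. Print one candidate size per target height.
for target, scale in candidate_scales(180):
    print(target, round(scale, 3), scaled_size(1800, 240, scale))
```

Feed each resulting size to your resize tool, run tesseract on every variant and keep whichever reads the date best.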
Second most important bit, I find, is making
sure the input image has black text on white background or
anything greyscale/luminance-wise that approaches this as best
as possible. SOME tesseract modes / settings can cope with
white text on black BG, but that's you getting rather lucky so
don't bet on it.
tesseract is *engineered* for black text on
white background input images (paper book scans).
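
A rough sketch of the "invert when needed" check, in plain Python on a greyscale image held as a list of rows with values 0-255. It ASSUMES the background covers most of the crop, so a low average brightness means light text on a dark background; the function names and the threshold of 128 are my own illustrative choices, not anything tesseract prescribes.

```python
def mean_luminance(pixels):
    """Average grey value (0-255) over all pixels of a 2D list."""
    total = sum(sum(row) for row in pixels)
    count = sum(len(row) for row in pixels)
    return total / count


def ensure_dark_on_light(pixels, threshold=128):
    """Return a copy of the image with dark text on a light background.

    Assumption: the background dominates the crop, so a mean below
    `threshold` is taken to mean light-on-dark, and the image is inverted.
    """
    if mean_luminance(pixels) < threshold:
        return [[255 - v for v in row] for row in pixels]
    return [row[:] for row in pixels]


# Tiny white-on-black example (e.g. a video grab): gets inverted.
grab = [[0, 0, 0, 0],
        [0, 255, 255, 0],
        [0, 0, 0, 0]]
fixed = ensure_dark_on_light(grab)
```

With a real image you would do the same thing through your image library or ImageMagick; the point is only the decision rule plus the 255 - value inversion.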
If you need further assistance on this
forum/mailing list, attach the image and the tesseract command line
you tried; those messages get more feedback as they are less
of a guessing game ;-)
PS: third most important work item that lots of
folks do wrong: when clipping/extracting lines of text,
postprocess those line images by adding a nice large
white=BACKGROUND COLOR boundary around the entire line.
Personally, I favor a "border" like that of about 0.5 to 1.0
the size of the line itself. The added border should be
SMOOTHLY transitioning from the actual image background to
prevent false edge detections in tesseract itself: this
problem doesn't happen for clean paper book scans (which
already have a plain white background) but is an important
aspect when extracting from "busy backgrounds".
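
One possible way to build such a border, as a plain-Python sketch on a greyscale image stored as a list of rows (0-255). This is just my illustration of the idea, not something tesseract ships: each border pixel starts from the nearest edge pixel of the crop and fades linearly to the background colour at the outer rim. The function name and the linear fade are assumptions; a blur-based transition would serve as well.

```python
def pad_with_fade(pixels, pad, background=255):
    """Surround a greyscale image with a `pad`-pixel border that fades
    linearly from the nearest edge pixel to `background` at the rim."""
    h, w = len(pixels), len(pixels[0])
    out = []
    for y in range(-pad, h + pad):
        row = []
        for x in range(-pad, w + pad):
            # Nearest pixel inside the original image (clamped coordinates).
            cy = min(max(y, 0), h - 1)
            cx = min(max(x, 0), w - 1)
            edge = pixels[cy][cx]
            # Distance from the image, in pixels; 0 for pixels inside it.
            dist = max(abs(y - cy), abs(x - cx))
            t = dist / pad if pad else 0.0
            row.append(round(edge + (background - edge) * t))
        out.append(row)
    return out


# Border of about 0.5x the line height, as suggested above
# (a hypothetical 4-row crop on a mid-grey "busy" background).
line = [[100, 100, 100, 100, 100, 100] for _ in range(4)]
padded = pad_with_fade(line, pad=round(0.5 * len(line)))
```

Pixels inside the original crop are left untouched; only the added rim is synthesized.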
Anyway, that topic is the size of a book all by
itself, so take it slow and get prio 1 right first: 1 line of
text to ocr = 20-50px high.
Cheers,
Ger
On Fri, 21 Jul 2023, 13:35
astro, wrote:
Hi Ger,
Thanks for your response. Yes. I found ImageMagick. Looks
to be very powerful and easy to implement. I tried it out
by upping the image to 300 dpi and 6000x4500 and ran
the image thru the OCR process but tesseract had
difficulty in picking out the date on the image. I guess I
will have to play around so to see if I can improve
things.
Cheers
Nor