App to adjust image scaling


nor s

Jul 20, 2023, 9:02:57 AM
to tesseract-ocr
I'm trying to run tesseract-OCR on images that come to me at 72 DPI. The program is unable to decode these images and needs 200 DPI or better to be successful. Is there a program available, similar to tesseract-OCR, that would read a command line, convert a 72 DPI image to 200 DPI or some other specified value, and save it in a specified location? I'm running Windows 10.
I can make these changes in Photoshop, but I'm trying to automate the process since I have a lot of images to scan.

Any suggestions would be greatly appreciated.

Thanks
 Nor

Ger Hobbelt

Jul 21, 2023, 12:14:06 AM
to tesseract-ocr
Check out ImageMagick, an open source image toolset. Specifically the 'convert' tool: look up its command-line usage and parameters/arguments, where you will find several ways to resize/rescale the image.
It is also useful for "tweaking" the image as part of your ocr preprocessing pipeline, before the image reaches tesseract.

Another big one would be OpenCV, but that would require you to write programs (python software or similar), while ImageMagick can accomplish a lot of what you want or might need and can be driven by some simple batch / PowerShell / shell lines: it's much easier to get success that way if you're not already comfortable with coding software.

It may appear overwhelming at first; read and try the various ways mentioned there to get a grasp and discover what you need to do for your scenario specifically. Ocr is not a simple process pipeline, so take your time with it.
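
As a rough illustration of the command-line route, the sketch below builds an ImageMagick call that enlarges an image by the ratio of two DPI values (72 to 200 here). It assumes ImageMagick 7 with the `magick` executable on the PATH; the function names and file names are made up for the example. Note that `-resize` changes the pixel dimensions only, not the DPI tag stored in the file.

```python
import subprocess


def resize_percent(src_dpi: float, dst_dpi: float) -> int:
    """Percentage by which the pixel dimensions grow going from src_dpi to dst_dpi."""
    return round(100 * dst_dpi / src_dpi)


def build_resize_command(src: str, dst: str,
                         src_dpi: float = 72, dst_dpi: float = 200) -> list:
    """Build (but do not run) the ImageMagick argument list for the resize."""
    pct = resize_percent(src_dpi, dst_dpi)
    return ["magick", src, "-resize", f"{pct}%", dst]


def resize_image(src: str, dst: str) -> None:
    """Run the resize; requires ImageMagick to be installed (assumption)."""
    subprocess.run(build_resize_command(src, dst), check=True)
```

Calling `build_resize_command("input.jpg", "output.jpg")` gives the arguments for `magick input.jpg -resize 278% output.jpg`; the same line can be typed directly into a batch file or PowerShell script.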



Ger Hobbelt

Jul 24, 2023, 2:59:57 PM
to tesseract-ocr, astro
L.S.,

Apologies, I think I screwed up: my last replies went private instead of to the group. Anyway, here's the trail up to now: see below.

(Clumsy on the mobile 😰, me)

@astro/nor: this means I'm the only one who got your first sample image, so it might be good to resend it to the group so everyone can follow. Sorry for messing up the reply chain here.


---------- Forwarded message ---------
From: Ger Hobbelt <g...@hobbelt.com>
Date: Mon, 24 Jul 2023, 20:50
Subject: Re: [tesseract-ocr] App to adjust image scaling
To: astro


Hi Nor,

Thanks for the background info and sample image.

I'm away from my machines for at least a week, only online via mobile with very short, sporadic checks, so this will have to wait unless someone else would like to take a swing at it, but the sample looks good at first glance from here.
There's some light grey noise in there, but the tesseract binarization step should easily take care of that (fingers crossed), so ocr is expected to succeed most of the time with this. But, as always, the truth is in the testing, so I'll have to see what a tesseract run does on my rig, for example.

If I read you correctly, you now have a high ocr success rate? (Perfect ocr is always a miracle, but better than 90% would be a good initial target to aim for. Tweaking that upwards is an *art* and I'm not an expert in that yet 😅)

Cheers,

Ger

PS: as a next step it might be handy to show the tesseract command line you issue from VB (plus sample image(s) and the output you get out of tesseract, good & bad): there's a couple of people on here who may suggest improvements if they spot any and have time to respond.


On Fri, 21 Jul 2023, 18:46 astro, wrote:
Hi Ger,
 The images I'm scanning are trail camera images that have the date/time on the picture in the bottom corner. I'm trying to extract the date/time values from the image. Normally the images are 1440x1080 at 96 dpi. The only way I could get tesseract to read some of the time stamp was by upping the image size. I have since changed my strategy and used ImageMagick to crop the bottom corner of the image that contains the date/time to a 540x70 image, leaving it at 96 dpi (see attached). That seems to work very well. I'm currently looking to increase the reliability by trying various things, including correcting the output where possible.

Thanks for the reply.

Cheers
 Nor

On 7/21/2023 12:01 PM, Ger Hobbelt wrote:
6000*4500?!

Hm, sounds way too large for a simple text.

I'm guessing here, but it might be that you got thwarted by the various "dpi" notes re ocr/tesseract out there.

Bottom line: IIRC tesseract was trained on text of around 30px high. (Note that I use PX = pixels as the relevant unit of measure; I don't care about dpi because that's something only really relevant to printing press people: desktop publishing, etc.)
While a lot of folks hang onto dpi as a unit of measure, it's derivative and only relevant when you scan printed pages, which turns "points" (and picas and ....) into pixels, which is where dpi pops up.

Anyway, the key bit for every image you feed to an ocr engine like tesseract is matching the "x height" to the training material as closely as possible for any attempt at a good/optimal match.
For tesseract, this means you should aim for each line of text to be somewhere between 20 and 50 pixels high (and as clean looking in black & white / greyscale as possible, but that comes second, after getting that line height into the 20-50px range). Computers work in PX, not DPI, so it's PX that's the driving criterion.

Since you mention "picking out a date" I ASSUME your text area is one line of text only.

Drop all image areas that do not contain text.
Make sure the text is black on a white background (you may need to invert your image when this is a video grab or some such, for example).
There's a long wiki page about improving image quality for tesseract processing too.
But first try to extract that line of text, scale it so the digits are between 20-50px high and try some sizes within that range.
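
The scaling step can be sketched as a bit of arithmetic. This is only an illustration: the function names and the 32px default target are assumptions picked to sit inside the 20-50px range, and the text height has to be measured by hand (or estimated) first.

```python
def scale_for_text_height(measured_px: float, target_px: float = 32.0,
                          lo: float = 20.0, hi: float = 50.0) -> float:
    """Resize factor that brings text of measured_px height to target_px.

    Returns 1.0 when the text already falls inside the [lo, hi] range.
    """
    if measured_px <= 0:
        raise ValueError("measured_px must be positive")
    if lo <= measured_px <= hi:
        return 1.0
    return target_px / measured_px


def candidate_percentages(measured_px: float, targets=(20, 30, 40, 50)) -> list:
    """Resize percentages to try, one per target text height in pixels."""
    return [round(100 * t / measured_px) for t in targets]
```

For digits measured at 10px high, `candidate_percentages(10)` yields 200, 300, 400 and 500, each of which could be passed to ImageMagick's `-resize` as a percentage and tested against tesseract.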

Second most important bit, I find, is making sure the input image has black text on white background or anything greyscale/luminance-wise that approaches this as best as possible. SOME tesseract modes / settings can cope with white text on black BG, but that's you getting rather lucky so don't bet on it. 

tesseract is *engineered* for input images with black text on a white background (paper book scans).
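
One possible way to decide whether an image needs inverting is sketched below. It assumes 8-bit greyscale values (0-255) and that background pixels outnumber text pixels, so the median value approximates the background; the function names are hypothetical. The ImageMagick options used are `-colorspace Gray` and `-negate`.

```python
from statistics import median


def needs_invert(gray_pixels) -> bool:
    """True when the background (approximated by the median pixel) is darker than mid-grey."""
    return median(gray_pixels) < 128


def build_normalize_command(src: str, dst: str, invert: bool) -> list:
    """Build the ImageMagick arguments: convert to greyscale, optionally invert."""
    cmd = ["magick", src, "-colorspace", "Gray"]
    if invert:
        cmd.append("-negate")  # swaps light and dark, giving dark text on a light background
    cmd.append(dst)
    return cmd
```

The pixel values could come from any image library; only the decision logic and the resulting command line are shown here.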

If you need further assistance on this forum/mailing list, attach the image and the tesseract command line you tried; those messages get more feedback as they are less of a guessing game ;-)

PS: third most important work item that lots of folks get wrong: when clipping/extracting lines of text, postprocess those line images by adding a nice large border in the BACKGROUND COLOR (white) around the entire line. Personally, I favor a "border" of about 0.5 to 1.0 times the height of the line itself. The added border should transition SMOOTHLY from the actual image background to prevent false edge detections in tesseract itself: this problem doesn't happen for clean paper book scans (which already have a plain white background) but is an important aspect when extracting from "busy backgrounds".
Anyway, that topic is the size of a book all by itself, so take it slow and get prio 1 right first: 1 line of text to ocr = 20-50px high.
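
A minimal sketch of the border step, assuming ImageMagick 7 and hypothetical function names. It adds a flat white border via `-bordercolor` and `-border`, which suits an image whose background is already white; it does not attempt the smooth transition needed for busy backgrounds.

```python
def border_size(line_height_px: int, ratio: float = 0.5) -> int:
    """Border thickness in pixels as a fraction of the text line height."""
    return max(1, round(line_height_px * ratio))


def build_border_command(src: str, dst: str, line_height_px: int,
                         ratio: float = 0.5) -> list:
    """Build the ImageMagick arguments that pad the image on all four sides."""
    n = border_size(line_height_px, ratio)
    return ["magick", src, "-bordercolor", "white", "-border", f"{n}x{n}", dst]
```

For a 72px-high line image and the default ratio this pads 36px on every side.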

Cheers,

Ger




On Fri, 21 Jul 2023, 13:35 astro, wrote:
Hi Ger,
Thanks for your response. Yes, I found ImageMagick. It looks to be very powerful and easy to implement. I tried it out by upping the image to 300 dpi and 6000x4500 and ran the image through the OCR process, but tesseract had difficulty picking out the date on the image. I guess I will have to play around to see if I can improve things.

Cheers
 Nor

astro

Jul 24, 2023, 5:07:04 PM
to tesser...@googlegroups.com

Here is the resend for the group.
Cheers
 Nor
outpx.jpg

astro

Jul 25, 2023, 9:22:55 AM
to tesseract-ocr
Hi Ger et al,
      Let me consolidate the solutions I came across in trying to extract the time stamp from trail-cam pictures.
Normally the timestamp for these images can be extracted from the image EXIF metadata. However, in some cases, when the images get manipulated or moved to other media, that information gets lost. So if you want the date and time the image was taken, you have to look at the image itself and read that information. I'm trying to catalog the images with their timestamps in a database, so the best way to do that was to find a way to read that data from the image programmatically and load it into the DB.

The images I'm reading are 1440x1080 at 96 dpi (see attached test2.JPEG).

I'm familiar with OCR but don't have a standalone program that does it. I asked around online and was pointed to Tesseract-OCR.
 My first try was to feed it the entire image, which resulted in no character recognition at all. Not until I upped the image size to 300 dpi and 6000x4500 did I get any indication in the output file that it saw some characters: not great, but making headway. Obviously the OCR process on an image this size was a bit slow.
My next try was to crop out just the section of the image that had the date and time info using Photoshop. To my delight, the result was considerably better and faster.
 Since I need to do the cropping programmatically, I installed ImageMagick and used it to crop out a 564x72 rectangle, retaining the original 96 dpi, which resulted in a JPG as shown in the attached outpx.jpg. Running the OCR on outpx.jpg resulted in the following text output:

02-05-2021 15:54:43

 
Exactly what I wanted. Right now I'm getting better than 90% reliability when scanning images in this fashion. I'm quite happy with that.
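
For loading the result into a database, the OCR text could be validated and converted along these lines. This is a sketch under loud assumptions: the month-day-year order is a guess (the sample `02-05-2021` does not settle it), and the look-alike character substitutions are hypothetical examples of the kind of correction that might help, not observed tesseract errors.

```python
import re
from datetime import datetime

# Hypothetical substitutions for letters that resemble digits.
_LOOKALIKES = str.maketrans({"O": "0", "o": "0", "I": "1", "l": "1", "S": "5", "B": "8"})
_STAMP = re.compile(r"(\d{2}-\d{2}-\d{4})\s+(\d{2}:\d{2}:\d{2})")


def parse_timestamp(ocr_text: str, fmt: str = "%m-%d-%Y %H:%M:%S"):
    """Return a datetime parsed from OCR output, or None if no valid stamp is found.

    fmt assumes month-day-year; pass "%d-%m-%Y %H:%M:%S" if the camera
    writes day-month-year instead.
    """
    cleaned = ocr_text.strip().translate(_LOOKALIKES)
    match = _STAMP.search(cleaned)
    if match is None:
        return None
    try:
        return datetime.strptime(f"{match.group(1)} {match.group(2)}", fmt)
    except ValueError:
        # Digits were read but do not form a real date/time (e.g. month 13).
        return None
```

Returning `None` for unreadable stamps makes it easy to set the failing images aside for manual review instead of storing a wrong date.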

For those of you who are interested in how I incorporated this into my VB application, here is the code snippet:

                    Dim p As New ProcessStartInfo
                    ' Crop the image to isolate the date/time
                    p.FileName = "magick"        ' point to ImageMagick
                    ' The input path is quoted in case it contains spaces
                    p.Arguments = "convert -crop 564x72+930+1015 """ & tempPixFN & """ D:\Tesseract-OCR\result\outpx.jpg"
                    p.WindowStyle = ProcessWindowStyle.Hidden     ' this keeps the process from spawning a command prompt window
                    p.CreateNoWindow = True
                    Process.Start(p).WaitForExit()        ' wait for the process to complete

                    ' Run OCR on the cropped image; tesseract writes the text to outpx.txt
                    p.FileName = "tesseract"
                    p.Arguments = "D:\Tesseract-OCR\result\outpx.jpg  D:\Tesseract-OCR\result\outpx --oem 1"
                    p.WindowStyle = ProcessWindowStyle.Hidden
                    p.CreateNoWindow = True
                    Process.Start(p).WaitForExit()        ' wait for OCR to finish before reading the output


As the comments indicate, the first call is to ImageMagick, which crops the image to a 564x72 rectangle starting at x,y location 930,1015, and waits for its completion. The second call is to tesseract, which reads the cropped image and extracts the text.

I played around with changing the rectangle size and its location to see if I could get better results; however, this seems to be the best so far.
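
One thing worth checking when choosing the rectangle is whether it fits inside the frame. The sketch below (hypothetical helper names, non-negative offsets assumed) computes the part of a crop rectangle that overlaps the image. If the frame really is 1440x1080, a 564x72 rectangle at 930,1015 runs past the right and bottom edges, and to my understanding ImageMagick keeps only the overlapping part.

```python
def crop_geometry(crop_w: int, crop_h: int, x: int, y: int) -> str:
    """Format an ImageMagick geometry string such as 564x72+930+1015."""
    return f"{crop_w}x{crop_h}+{x}+{y}"


def effective_crop(img_w: int, img_h: int,
                   crop_w: int, crop_h: int, x: int, y: int) -> tuple:
    """Width and height of the part of the crop rectangle that lies inside the image."""
    w = max(0, min(img_w, x + crop_w) - max(0, x))
    h = max(0, min(img_h, y + crop_h) - max(0, y))
    return w, h
```

With those numbers `effective_crop(1440, 1080, 564, 72, 930, 1015)` gives a 510x65 region, so the output may be smaller than the requested size; comparing the two is a cheap sanity check before running a large batch.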

Thanks for the info you all provided that pushed me in the right direction.

Cheers
 Nor
test2.jpeg
outpx.jpg