Trouble extracting date and time from image


Michael Schuh

Oct 30, 2025, 1:26:49 PM (6 days ago) Oct 30
to tesseract-ocr
I am trying to extract the date and time from 

time.png

I have successfully used tesseract to extract text from other images.  tesseract does not find any text in the above image:

   michael@argon:~/michael/trunk/src/tides$ tesseract time.png out
   Estimating resolution as 142

   michael@argon:~/michael/trunk/src/tides$ cat out.txt

   michael@argon:~/michael/trunk/src/tides$ ls -l out.txt
   -rw-r----- 1 michael michael 0 Oct 30 08:58 out.txt

Any help you can give me would be appreciated.  I attached the time.png file I used above.

Thanks,
   Michael
time.png

Rucha Patil

Oct 30, 2025, 2:44:05 PM (6 days ago) Oct 30
to tesser...@googlegroups.com
Could be a color/contrast issue, and the text also has shadows. Tesseract does better on black text on a white background. Try thresholding the image: detect the white text, make it black, and then make the rest of the background white. Also make sure you're using the correct psm mode.
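
A rough sketch of that recipe, assuming OpenCV (cv2) and the pytesseract wrapper are available; the 200 threshold and the psm value are guesses to tune against the actual frames:

   import cv2
   import pytesseract  # thin wrapper around the tesseract CLI; assumed installed

   img = cv2.imread("time.png")
   gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

   # Keep only the bright (near-white) text pixels, then invert so the
   # text ends up black on a white background.
   _, mask = cv2.threshold(gray, 200, 255, cv2.THRESH_BINARY)
   bw = cv2.bitwise_not(mask)
   cv2.imwrite("time_bw.png", bw)

   # --psm 6 treats the image as a single uniform block of text; try 7 for one line.
   print(pytesseract.image_to_string(bw, config="--psm 6"))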


Ger Hobbelt

Oct 30, 2025, 2:57:34 PM (6 days ago) Oct 30
to tesser...@googlegroups.com
I cannot emphasize this single item (in a long list of stuff one can/must do before feeding any image to an OCR engine) enough: tesseract has been trained to 'read' books, i.e. black text on a white background. Consequently, any image preprocessing step(s) that get you there are strongly advised.

This, and lots of other "I don't wanna hear this 🥴" important details show up in the documents and emails listed below: 
(I know people like twitter-sized or shorter text, but you've got some reading to do if you want to be successful at OCRing stuff. We all have to, it's not simple.)


and then a bunch of messages that are related; I'd rather not repeat myself, so please take your time and read those threads: some of it may sound crazy at first, but you're doing something that's touching on the edge of the original design goals, and that means you're bound to meet some "weird behaviour" along the way.

Before I let myself out, the second most important piece of advice I can give everyone: use HOCR (which is HTML content plus coordinates) or TSV output instead of anything else; do not, I repeat, !DO NOT! output txt format just because every internet wizard out there does it in their blog. The txt (text) format is minimal-information and you are way better off with a maximal-information output for when you need to diagnose trouble -- plus, once you've seen the workflow diagram that's part of the info above, turning HOCR/TSV into TXT should be part of your postprocessing, AFAIAC.
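
For example, the TSV output already carries a per-word confidence column, which is exactly what you want when diagnosing trouble. On the command line that's "tesseract time.png out tsv" (or "hocr"); a minimal Python equivalent, assuming the pytesseract wrapper is available:

   import pytesseract
   from PIL import Image  # pytesseract accepts PIL images

   img = Image.open("time.png")

   # TSV: one row per word, with bounding box and confidence columns.
   print(pytesseract.image_to_data(img))

   # hOCR: HTML plus coordinates, handy for diagnosing layout problems.
   hocr = pytesseract.image_to_pdf_or_hocr(img, extension="hocr")
   open("time.hocr", "wb").write(hocr)
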
Other directly or sideways relevant blurbs to be read here (again, consider reading the entire threads; OCR is one of those activities where the 'quickly scanning my text books to pass my exam' approach you learned at school is not going to get you closer to success faster -- on the contrary):


HTH

Met vriendelijke groeten / Best regards,

Ger Hobbelt

--------------------------------------------------
web:    http://www.hobbelt.com/
        http://www.hebbut.net/
mail:   g...@hobbelt.com
mobile: +31-6-11 120 978
--------------------------------------------------



Michael Schuh

Oct 30, 2025, 8:21:16 PM (5 days ago) Oct 30
to tesseract-ocr
Thanks.  I figured out how to use ImageMagick to change the mottled gray to green.

michael@argon:~/michael/trunk/src/tides$ convert time.png -fuzz 20% -fill "green" -opaque "gray(60%)" time_green.png

time_green.png

michael@argon:~/michael/trunk/src/tides$ tesseract time_green.png -
Estimating resolution as 147

10/29/2025
9:43:16 PM

Rucha Patil

Oct 31, 2025, 12:46:13 AM (5 days ago) Oct 31
to tesser...@googlegroups.com
Green? Why? I don't know if this will resolve the issue. Let me know the behavior, I'm curious. But you need an image that has a white background and black text. You can achieve this easily using cv2 functions.


Ger Hobbelt

Oct 31, 2025, 8:52:26 PM (4 days ago) Oct 31
to tesser...@googlegroups.com
Indeed, why? (What is the thought that drove you to run this particular imagemagick command?) While it might help with visually debugging something you're trying, the simplest path towards "black text on white background" is:

1. convert the image to greyscale (and see for yourself if that output is easily legible; if it's not, chances are the machine will have trouble too, so more preprocessing /before/ the greyscale transform is needed);
2. use a 'threshold' (a.k.a. binarization) step, which may help further (though tesseract can oftentimes do a better job with greyscale instead of hard black & white, as there's more 'detail' in the image pixels then. YMMV).

You can do this many ways: imagemagick is one, openCV another. For one-offs I use Krita / Photoshop filter layers (stacking the filters to get what I want).
Anything, really, that gets you something approaching 'crisp dark/black text on a clean, white background, text characters about 30px high' (dpi is irrelevant, though often mentioned elsewhere: tesseract works with digital image pixels, not the classical printer mindset of dots-per-inch).
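
A small OpenCV sketch of those two steps, assuming cv2 is available; the scale factor and the Otsu threshold are the only knobs, and whether the binarized or the plain greyscale version OCRs better is exactly the YMMV part, so this writes out both:

   import cv2

   img = cv2.imread("time.png")
   gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

   # Step 1: greyscale, scaled up so the glyphs land near 30px high.
   # Tesseract cares about pixels, not dpi.
   scale = 2  # guess; adjust until character height is roughly 30px
   gray = cv2.resize(gray, None, fx=scale, fy=scale, interpolation=cv2.INTER_CUBIC)
   cv2.imwrite("time_gray.png", gray)

   # Step 2 (optional): Otsu binarization.
   _, bw = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
   # bw = cv2.bitwise_not(bw)  # uncomment if the text came out white on black
   cv2.imwrite("time_otsu.png", bw)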

Note that 'simplest path towards' does not mean 'always the best way'.

Met vriendelijke groeten / Best regards,

Ger Hobbelt

--------------------------------------------------
web:    http://www.hobbelt.com/
        http://www.hebbut.net/
mail:   g...@hobbelt.com
mobile: +31-6-11 120 978
--------------------------------------------------

Michael Schuh

Nov 1, 2025, 1:01:46 AM (4 days ago) Nov 1
to tesser...@googlegroups.com
Rucha > Green? Why?

Ger > Indeed, why? (What is the thought that drove you to run this particular imagemagick command?)

Fair questions.  I saw both black and white in the text so I picked a background color that does not exist in the text and has high contrast.   tesseract did a great job with the green background.  I want to process images to extract Palo Alto California tide data, date, and time and then plot the results against xtide predictions.  I am close to processing a day's worth of images collected once a minute so I will see how well the green background works.  If I have problems, I will definitely try using your (Ger and Rucha's) advice.
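
For that once-a-minute batch run, a rough sketch of the loop, assuming the frames sit in one directory as PNGs and the pytesseract wrapper is available; the glob pattern and the strptime format are guesses based on the output shown earlier:

   import glob
   from datetime import datetime

   import pytesseract
   from PIL import Image

   for path in sorted(glob.glob("frames/*.png")):
       # (Preprocess each frame first, as discussed above, before handing it to tesseract.)
       text = pytesseract.image_to_string(Image.open(path)).strip()
       # Expect two lines like "10/29/2025" and "9:43:16 PM".
       try:
           stamp = datetime.strptime(" ".join(text.split()), "%m/%d/%Y %I:%M:%S %p")
       except ValueError:
           print(f"{path}: could not parse {text!r}")
           continue
       print(path, stamp.isoformat())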

Thank you Ger and Rucha very much for your advice.

Best Regards,
   Michael

Ger Hobbelt

Nov 1, 2025, 11:49:00 AM (4 days ago) Nov 1
to tesseract-ocr
I suspected something like this.

FYI a technical detail that is very relevant for your case: when somebody feeds tesseract a white text on dark background image, tesseract OFTEN SEEMS TO WORK. Until you think it's doing fine and you get a very hard to notice lower total quality of OCR output than with comparable black text on a white background. Here's what's going on under the hood and why I emphatically advise everybody to NEVER feed tesseract white-on-black:

Tesseract code picks up your image and looks at its metadata: width, height and RGB/number of colors. Fine so far.
Now it goes and looks at the image pixels and runs a so-called segmentation process. Fundamentally, it runs its own thresholding filter over your pixels to produce a pure 0/1 black & white picture copy: this one is simpler and faster to search as tesseract applies algorithms to discover the position and size of each b-box of text: the bounding-boxes list. Every box (a horizontal rectangle) surrounds [one] [word] [each]. Like I did with the square brackets [•••] just now. (For C++ code readers: yes, I'm skipping stuff and not being *exact* in what happens. RTFC if you want the absolute and definitive truth.)

Now each of these b-boxes (bounding boxes) is clipped (extracted) from your source image and fed, one vertical pixel line after another, into the LSTM OCR engine, which spits out a synchronous stream of probabilities: think "30% chance that was an 'a' just now, 83% chance it was a 'd' and 57% chance I was looking at a 'b' instead. Meanwhile here's all the rest of the alphabet, but their chances are very low indeed."
So the next bit of tesseract logic looks at this and picks the highest probable occurrence: 'd'. (Again, way more complex than this, but this is the base of it all and very relevant for our "don't ever do white-on-black while it might seem to work just fine right now!")

By the time tesseract has 'decoded' the perceived word in that little b-box image, it may have 'read' the word 'dank', for example. The 'd' was just the first character in there.
Tesseract ALSO has collected the top rankings (you may have noticed that my 'probabilities' did not add up to 100%, so we call them rankings instead of probabilities).
It also calculated a ranking for the word as a whole, say 78% (and rankings are not real percentages so I'm lying through my teeth here. RTFC if you need that for comfort. Meanwhile I stick to the storyline here...)

Now there's a tiny, single line of code in tesseract which gets to look at that number. It is one of the many "heuristics" in there. And it says: "if this word ranking is below 0.7 (70%), we need to TRY AGAIN: Invert(!!!) that word box image and run it through the engine once more! When you're done, compare the ranking of the word you got this second time around and may the best one win!"
For a human, the heuristic seems obvious and flawless. In actual practice however, the engine can be a little crazy sometimes when it's fed horribly unexpected pixel input, and there's a small but noticeable number of times where the gibberish wins because the engine got stoned as a squirrel and announced the inverted pixels have a 71% ranking for 'Q0618'. Highest bidder wins and you get gibberish (at best) or a totally incorrect word like 'quirk' at worst: both are very wrong, but discovering the second example fault is nigh impossible, particularly when you have automated this process as you process images in bulk.

Two ways (3, rather!) this has a detrimental effect on your output OCR quality:

1: if you start with white-on-black, tesseract 'segmentation' has to deal with white-on-black too and my findings are: the b-boxes discovery delivers worse results. That's bad in two ways, as both (2) and (3) don't receive optimal input image clippings.
2: by now you will have guessed it: you started with white-on-black (white-on-green in your specific case) so the first round through tesseract is feeding it a bunch of highly unexpected 'crap' it was never taught to deal with: gibberish is the result and lots of 'words' arrive at that heuristic with rankings way below that 0.7 benchmark, so the second run saves your ass by rerunning the INVERTED image and very probably observing serious winners that time, so everything LOOKS good for the test image.

Meanwhile, we know that the tesseract engine, like any neural net, can go nuts and output gibberish at surprisingly high confidence rankings: assuming your first run delivered gibberish with such a high confidence, barely or quite a lot higher than the 0.7 benchmark, you WILL NOT GET THAT SECOND RUN and thus crazy stuff will be your end result. Ouch.

3: same as (2) but now twisted in the other direction: tesseract has a bout of self-doubt somehow (computer pixel fonts like yours are a candidate for this) and thus produces the intended word 'dank' during the second run but at a surprisingly LOW ranking of, say, 65%, while first round gibberish had the rather idiotic ranking of 67%, still below the 0.7 benchmark but "winner takes all" now has to obey and let the gibberish pass anyhow: 'dank' scored just a wee bit lower!
Again, fat failure in terms of total quality of output, but it happens. Rarely, but often enough to screw you up.

Of course you can argue the same from the by-design black-on-white input, so what's the real catch here?! When you ensure, BEFOREHAND, that tesseract receives black-on-white, high contrast, input images, (1) will do a better job, hence reducing your total error rate. (2) is a non-scenario now because your first round gets black-on-white, as everybody trained for, so no crazy confusion this way. Thus another, notable, improvement in total error rate / quality.
(3) still happens, but in the reverse order: the first round produces the intended 'dank' word at low confidence, so second round is run and gibberish wins, OUCH!, **but** the actual probability of this happening just dropped a lot as your 'not passing the benchmark' test is now dependent on the 'lacking confidence' scenario part, which is (obviously?) *rarer* than the *totally-confused-but-rather-confident* first part of the original scenario (3).

Thus all 3 failure modes have a significantly lower probability of actually occurring when you feed tesseract black-on-white text, as it was designed to eat that kind of porridge.

Therefore: high contrast is good. Better yet: flip it around (invert the image), possibly after having done the to-greyscale conversion yourself as well. Your images will thank you (bonus points! Not having to execute the second run means spending about half the time in the CPU-intensive neural net: higher performance and fewer errors all at the same time 🥳🥳)



Why does tesseract have that 0.7 heuristic then? That's a story for another time, but it has its uses...

Met vriendelijke groeten / Best regards,

Ger Hobbelt

--------------------------------------------------
web:    http://www.hobbelt.com/
        http://www.hebbut.net/
mail:   g...@hobbelt.com
mobile: +31-6-11 120 978
--------------------------------------------------

Ger Hobbelt

Nov 1, 2025, 11:51:09 AM (4 days ago) Nov 1
to tesseract-ocr
(apologies for the typos and uncorrected mobile phone autocorrect eff-ups in that text just now)

Met vriendelijke groeten / Best regards,

Ger Hobbelt

--------------------------------------------------
web:    http://www.hobbelt.com/
        http://www.hebbut.net/
mail:   g...@hobbelt.com
mobile: +31-6-11 120 978
--------------------------------------------------

Michael Schuh

Nov 1, 2025, 12:08:22 PM (4 days ago) Nov 1
to tesser...@googlegroups.com
Ger,

Thank you very, very much for your detailed explanation of how tesseract processes my image!

I now see the wisdom and cpu time benefits of creating a black text on white background image.  I will work on doing this.

Best Regards,
   Michael

Ger Hobbelt

Nov 1, 2025, 5:08:02 PM (4 days ago) Nov 1
to tesser...@googlegroups.com
You're welcome! Good luck and take care!

....

(For posterity / google search, here's a corrected copy of my blurb earlier today on why b/w instead of running with w/b when the test image passes muster):


I suspected something like this.

FYI a technical detail that is very relevant for your case: when somebody feeds tesseract a white text on dark background image, tesseract OFTEN SEEMS TO WORK. Until you think it's doing fine and you get a very hard to notice lower total quality of OCR output than with comparable black text on a white background.

Here's what's going on under the hood and why I emphatically advise everybody to NEVER feed tesseract white-text-on-black-background:

Tesseract code picks up your image and looks at its metadata: width, height and RGB/number of colors. Fine so far.
Now it goes and looks at the image pixels and runs a so-called segmentation process.

Fundamentally, it runs its own thresholding filter over your pixels to produce a pure 0/1 black & white picture copy: this one is simpler and faster to search as tesseract applies algorithms to discover the position and size of each b-box of text: the bounding-boxes list.
Every b-box (a horizontal rectangle) surrounds [one] [word] [each.] Like I did with the square brackets [•••] just now. (For C++ code readers: yes, I'm skipping stuff and not being exact in what happens. RTFC if you want the absolute and definitive truth.)

Now each of these b-boxes (bounding boxes) are clipped (extracted) from your source image and fed, one vertical pixel line after another, into the LSTM OCR engine, which spits out a synchronous stream of probabilities: think "30% chance that was an 'a' just now, 83% chance it was a 'd' and 57% chance I was looking at a 'b' instead. Meanwhile here's all the rest of the alphabet, but their chances are very low indeed."

So the next bit of tesseract logic looks at this and picks the highest probable occurrence: 'd'. (Again, reality is way more complex than this, but this is the base of it all and very relevant for our "don't ever do white-on-black while it might seem to work just fine right now!")

By the time tesseract has 'decoded' the perceived word in that little b-box image, it may have 'read' the word 'dank', for example. The 'd' was just the first character in there.

Meanwhile, tesseract ALSO has memorized the top rankings (you may have noticed that my 'probabilities' did not add up to 100%, so we call them rankings or scores instead of probabilities). It also calculated a ranking for the word as a whole, say 78% (and rankings are not real percentages so I'm lying through my teeth here. RTFC if you need that for comfort. Meanwhile I stick to the storyline here...)

We're still not done: there's a tiny, single line of code in tesseract which now gets to look at that number. It is one of the many "heuristics" in there. And it says: "if this word ranking is below 0.7 (70%), we need to TRY AGAIN: Invert(!!!) that word box image and run it through the engine once more! When you're done, compare the ranking of the word you got this second time around and may the best one win!"

For a human, the heuristic seems obvious and flawless. In actual practice however, the engine can act a little crazy sometimes when it's fed horribly unexpected pixel input and there's a small but noticeable number of times where the gibberish wins because the engine got stoned as a squirrel and announced the inverted pixels have, say, a 71% ranking for nonsense 'Q0618'. Highest bidder wins and you get gibberish (at best) or a totally incorrect word like 'quirk' at worst: both are very wrong, but discovering the second example fault is nigh impossible, particularly when you have automated this process as you process images in bulk.
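
In pseudo-Python, the decision described above boils down to something like this; a toy sketch of the heuristic, not tesseract's actual code, where ocr_word stands in for one LSTM pass over a clipped b-box:

   RETRY_THRESHOLD = 0.7  # the 0.7 benchmark mentioned above

   def invert(clip):
       # Flip black <-> white for an 8-bit greyscale clip.
       return 255 - clip

   def recognize(clip, ocr_word):
       word, ranking = ocr_word(clip)        # first pass on the clip as-is
       if ranking >= RETRY_THRESHOLD:
           return word, ranking              # confident enough: no second run
       # Not confident: rerun on the inverted clip, and "may the best one win".
       word2, ranking2 = ocr_word(invert(clip))
       return (word2, ranking2) if ranking2 > ranking else (word, ranking)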

Two ways (3, rather!) this has a detrimental effect on your output OCR quality:

  1. if you start with white-text-on-black-background, tesseract 'segmentation' has to deal with white-text-on-black-background too and my findings are: the b-boxes discovery delivers worse results. That's bad in two ways as both (2) and (3) don't receive optimal input image clippings.

  2. by now you will have guessed it: you started with white-text-on-black-background (white-text-on-green-background in your specific case) so the first round through tesseract is feeding it a bunch of highly unexpected 'crap' it was never taught to deal with: gibberish is the result and lots of 'words' arrive at that heuristic line with rankings way below that 0.7 benchmark, so the consequence, the second run, saves your ass by rerunning the INVERTED image and very probably observes serious winners this time, so everything LOOKS good for the test image.

    Meanwhile, we know that the tesseract engine, like any neural net, can go nuts and output gibberish with surprisingly high confidence rankings: assuming your first run delivered gibberish with such a high confidence, barely or quite a lot higher than the 0.7 benchmark, you WILL NOT GET THAT SECOND RUN and thus crazy stuff will be your end result. Ouch!

  3. same as (2) but now twisted in the other direction: tesseract has a bout of self-doubt somehow (computer pixel fonts like yours are a candidate for this) and thus produces the intended word 'dank' during the second run but at a surprisingly LOW ranking of, say, 65%, while first round gibberish had the rather idiotic ranking of 67%, still below the 0.7 benchmark but "winner takes all" has to obey and let the gibberish pass anyhow: 'dank' scored just a wee bit lower! Ouch!

    Again, fat failure in terms of total quality of output, but it happens. Rarely, but often enough to screw you up.


Of course you can argue the same from the by-design black-text-on-white-background input, so what's the real catch here?!

When you ensure, BEFOREHAND, that tesseract receives black-text-on-white-background, high contrast, input images:
(1) will do a better job, hence reducing your total error rate.
(2) is a non-scenario now because your first round gets black-text-on-white-background, as everybody trained for, so no crazy confusion this way. Thus another, notable, improvement in total error rate / quality.
(3) still happens, but in the reverse order: the first round produces the intended 'dank' word at low confidence (65%), so the second round is run and gibberish (at 67%) wins, OUCH!, but! the actual probability of this scenario happening just dropped a lot as your 'not passing the benchmark' test is now dependent on the 'lacking confidence' scenario part, which is (obviously?) rarer than the totally-confused-but-rather-confident first part of the original scenario (3).

Thus all 3 failure modes have a significantly lower probability of actually occurring when you feed tesseract black-text-on-white-background images, as it was designed to eat that kind of porridge.

Therefore: high contrast is good.
Better yet: flip it around (Invert the image colors), possibly after having done the to-greyscale conversion yourself, as well.
Your images will thank you.

✨Bonus points!✨ Not having to execute the second run, for every b-box tesseract found, means spending about half the time in the CPU-intensive neural net: higher performance and fewer errors all at the same time 🥳🥳



Why does tesseract have that 0.7 heuristic then? That's a story for another time, but it has its uses...



Met vriendelijke groeten / Best regards,

Ger Hobbelt

--------------------------------------------------
web:    http://www.hobbelt.com/
        http://www.hebbut.net/
mail:   g...@hobbelt.com
mobile: +31-6-11 120 978
--------------------------------------------------
