Info messages to stdout from training utils

Zdenko Podobny

Jan 20, 2023, 7:49:20 AM

to tesser...@googlegroups.com

I realized that several users did not recognize errors during the training process.

IMO part of the problem is that all messages (errors and standard output) from the training tools are written to stderr because they go through tprintf.

While this makes sense in the tesseract executable (OCR output is sent to stdout, all other messages to stderr), in training we should use a different approach: only errors (e.g. those that should stop further processing) should go to stderr, and all other info should go to stdout.

A good example is unicharset_extractor:

https://github.com/tesseract-ocr/tesseract/blob/4142b328157fa5acdb5780308ecf308f1c6e2ec7/src/training/unicharset_extractor.cpp#L75-L84

Are you OK with this proposal? It would mean that tprintf is used only for errors, and std::cout or fprintf(stdout, ...) for the rest.

Zdenko

Ger Hobbelt

Jan 27, 2023, 6:41:49 AM

to tesser...@googlegroups.com

What I have here is a modified tesseract where all feedback\reporting like this still goes through tprintf() -- hence a single log channel for all -- but the entire codebase is checked to ensure that all error lines are prefixed with

ERROR:

while warning messages are prefixed with

WARNING:

and the rest is kept as-is (the 'info' and 'debug' level messages).
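As a rough illustration of that prefix convention (a sketch only, not code from the fork; the `Level` enum and `with_prefix` function are hypothetical names):

```cpp
#include <string>

// Hypothetical severity levels; the names are illustrative only.
enum class Level { Debug, Info, Warning, Error };

// Prepend "ERROR: " or "WARNING: " to the message; info and debug
// messages are returned unchanged.
std::string with_prefix(Level level, const std::string &msg) {
  switch (level) {
    case Level::Error:
      return "ERROR: " + msg;
    case Level::Warning:
      return "WARNING: " + msg;
    default:
      return msg;
  }
}
```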

Rationale?

1. A single channel is easiest to collect and redirect \ store \ hook (think applications where stderr and stdout don't exist or are not obvious, e.g. Win32 \ GUIs where you want to see such stuff in a panel or elsewhere, no console window wanted\needed). tprintf() is modified to call a user-defined callback which does the actual write(stderr) or whatever you like\need.

2: I like cli apps which use stderr for all feedback\reporting so stdout is free to be used as data output channel. Think apps that can --output=[file] or --output=- to indicate stdout as the channel where generated\processed data is written, which is handy when you want to use it with pipes. This as a general concept, not specific to training tools, which probably won't need this, but "general expectations the same everywhere" sort of thing.

3. For all processes, ocr, training, or otherwise, the tprintf() \ printf() stuff in the code is all 'side channel' to me. Meaning: the process itself, whatever it is, works stdin \ file-in to stdout \ file-out, while any reporting, diagnostics, anything-not-the-primary-product, is 'side channel', hence stderr.

I consider tesseract rather more a library \ back end application set than a user-facing front-end app. I do realize nobody has done it (yet?), certainly not for *training* tesseract, but I'd like someone to at least be able to integrate tesseract into their product, so current tprintf + stderr is my preference for this (supposed) context. (I'm biased as I would like to use tesseract that way myself one day ;-) )

4. Having a single "log channel" with *unique* "ERROR: " and "WARNING: " line\message prefixes for those makes it very simple for anyone\any software down the line to filter \postprocess those messages for user\specific purposes. (And, of course, "quiet output" is achieved by ditching everything without those specific prefixes ;-) )
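Points 1 and 4 can be sketched roughly as follows. This is an assumption-laden illustration, not the code from the fork: `LogSink`, `set_log_sink`, `log_line` and `quiet_filter` are hypothetical names, and `log_line` merely stands in for a tprintf-like entry point.

```cpp
#include <functional>
#include <iostream>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical sink type: receives every complete log line.
using LogSink = std::function<void(const std::string &)>;

// The default sink writes to stderr.
LogSink &current_sink() {
  static LogSink sink = [](const std::string &line) {
    std::cerr << line << '\n';
  };
  return sink;
}

// Replace the sink, e.g. to route messages to a GUI panel or a file.
void set_log_sink(LogSink sink) { current_sink() = std::move(sink); }

// Single entry point for all messages (stands in for a tprintf-like call).
void log_line(const std::string &line) { current_sink()(line); }

// True for lines that start with one of the two reserved prefixes.
bool is_error_or_warning(const std::string &line) {
  return line.rfind("ERROR: ", 0) == 0 || line.rfind("WARNING: ", 0) == 0;
}

// "Quiet" filter: keep only the ERROR/WARNING lines of a captured log.
std::string quiet_filter(const std::string &log) {
  std::istringstream in(log);
  std::string line, kept;
  while (std::getline(in, line)) {
    if (is_error_or_warning(line)) kept += line + '\n';
  }
  return kept;
}
```

An application embedding the library would call `set_log_sink` once at startup; anything downstream can then apply `quiet_filter` (or a plain `grep`) to the collected text without knowing where each message came from.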

My 2 cents, HTH. :-)

Cheers,

Ger


Ger Hobbelt

Jan 27, 2023, 6:55:18 AM

to tesser...@googlegroups.com

Oh, and having those ERROR: and WARNING: prefixes for the tprintf messages makes it also pretty obvious to users what is happening when used as-is.

If y'all like it, I can do an extract from my fork and make a pullreq for this. (Though my head has already been picking up Stefan's move to using the fmt lib, I could do a reverse on that relatively easily, so's to mix well with tesseract's mainline?)