L.S.,
I would rather ask "are you interested in seeing this; is this in line with your own strategic (long-term) tesseract intent?" before I expend the additional effort of turning this into PullReqs from my development fork, because that is a nontrivial effort, and a useless expenditure for me if there is no real interest in having this stuff. A clear & opinionated response is worth more to me than socially beautiful, short "submit pr" replies devoid of opinion. Thank you!
The replies to this will impact where and how I focus my own tesseract work in the next 12 months: these items will/did happen anyway; the question is whether I should consider each one for a PR into mainline.
I have labeled the items I consider PR subjects with the letters A..C, so you can easily pick and address specific ones.
First off, my stance: I treat tesseract as an SDK, a software library, which happens to have CLI tools that a lot of peeps are using as their way to interface with the core. Great!
I do not distinguish between the training side and the application side of the OCR software (this is relevant when I talk about logging and diagnostics below): the whole is an SDK that happens to come with usable CLI tools out of the box.
I am also rather strongly opinionated about the means currently available for training diagnostics, where the ScrollView Java app bears the brunt of my ire. Let's try to do something about that as well.
The PR candidates:
A. Using C++ fmt rather than C printf-style message formatting throughout.
Some time ago, Stefan Weil started this and almost completed it, but it never got merged into mainline.
My rationale for having it, and having it done rigorously throughout the codebase: fmt is, to me, the right balance between old-school printf, with its portability problems (%z, int64, etc.) that matter a lot to everyone who doesn't live in a gcc-spans-my-entire-world dev environment (e.g. me), and the very irksome, hard-to-read custom field formatting verbiage of iostream & friends. fmt is type safe, and 99% of the time a simple "{}" suffices as a replacement for any '%xyz' specifier; I like it. It has also been adopted into the C++ standard (std::format), which is a strong additional driver from my point of view.
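To make the contrast concrete, here is a minimal before/after illustration of my own (not taken from Stefan Weil's branch):

```cpp
#include <cinttypes>  // PRId64: the portable printf spelling for int64_t
#include <cstdint>
#include <cstdio>
#include <fmt/core.h>

int main() {
  int64_t items = 42;
  size_t bytes = 1024;
  // printf: the format specifier must match the type exactly, and the
  // portable spellings (PRId64, %zu) are noisy and easy to get wrong once
  // you leave the everything-is-gcc world.
  std::printf("processed %" PRId64 " items (%zu bytes)\n", items, bytes);
  // fmt: type-safe, and a plain "{}" replaces almost any '%xyz' specifier.
  fmt::print("processed {} items ({} bytes)\n", items, bytes);
  // C++20 std::format is the same idea:
  //   std::format("processed {} items ({} bytes)", items, bytes)
  return 0;
}
```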
Question: are you all interested in seeing this as a PR?
Aside/FYI, in case anyone out there thinks otherwise: I don't give a gnat's bottom about credit. Whether it happens with my name included or not does not bother me: I don't live in a what-have-you-accomplished-lately realm regarding tesseract or related work, so rip it any way you like. I get the benefit of having fewer diffs with the mainline tesseract repo, so my work in tracking changes becomes a little easier. :-)
B. All tesseract diagnostics output travels through a single interface (think: tprintf); this includes anything from errors to debug-level info messages **and images**. As it already does today (tprintf), tesseract comes with a default implementation which prints to stderr. i.o.w.: one 'logging' interface to rule them all.
The goal is to have a diagnostics output channel which can be filtered (how much you want to see) and routed (to stderr, a logfile, or a user-provided sink): tesseract is an SDK from my PoV and thus SHOULD come with an adjustable, hookable debug channel.
The chosen means is the C++ spdlog library, as it offers almost all the flexibility I want out of the box (that 'images included' bit is where the extra work is needed) and has demonstrated longevity of its own.
I am emphatically NOT looking for another log4j CVE debacle, so don't expect much beyond the enable/disable and I/O routing offered by spdlog; in my opinion log4j is/was a sign of configuration insanity. I want to land somewhere between the two extremes of rigid tprintf-plus-miscellany and over-the-top log4j configuration that requires certification training.
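To show the shape of what I mean, here is a minimal sketch of the idea, assuming spdlog (this is not the code in my fork): the tprintf-style front door stays, while the route becomes swappable.

```cpp
#include <fmt/core.h>
#include <spdlog/spdlog.h>
#include <spdlog/sinks/stdout_sinks.h>     // stderr_logger_mt
#include <spdlog/sinks/basic_file_sink.h>  // basic_logger_mt
#include <memory>
#include <utility>

// Default route: classic stderr, exactly what tprintf users expect today.
static std::shared_ptr<spdlog::logger> g_diag =
    spdlog::stderr_logger_mt("tesseract");

// SDK hook: reroute diagnostics (logfile, user-provided sink, ...)
// without touching any call site.
void SetDiagLogger(std::shared_ptr<spdlog::logger> logger) {
  g_diag = std::move(logger);
}

// tprintf-shaped front door; filtering comes for free via spdlog levels.
template <typename... Args>
void tprintf(fmt::format_string<Args...> fmt_str, Args &&...args) {
  g_diag->info("{}", fmt::format(fmt_str, std::forward<Args>(args)...));
}

// Usage examples:
//   SetDiagLogger(spdlog::basic_logger_mt("tesseract-file", "diag.log"));
//   g_diag->set_level(spdlog::level::warn);  // filter: warnings and up only
```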
The ultimate goal is tesseract being able to output all its diagnostics (again: errors etc. are part of the diagnostics stream in my PoV!) to text (markdown-ish?) / HTML files which include <IMG> tags pointing at also-stored images that would otherwise have gone to the ScrollView UI or similar. Diagnostics reporting which persists post-run for review at a later time/date.
The goal is: run tesseract, click on the HTML diag report, and review it in the browser while looking at the accompanying images as extracted from the tesseract process while it was running. Meanwhile, folks used to the classic stderr format should keep that: we're all a bunch of conservatives after all (no relation to USA party politics there, mind you. I am an alien to youse 'muricans out there. ;-) We use words differently...)
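For flavor, here is roughly how such an HTML report route can hang off spdlog as a custom sink (a bare-bones sketch assuming spdlog's base_sink; my PoC differs in the details, and the image handling is the part that needs extra plumbing beyond this):

```cpp
#include <spdlog/sinks/base_sink.h>
#include <fstream>
#include <mutex>
#include <string>

// Custom sink that writes each log record as an HTML paragraph (HTML escaping
// omitted for brevity). An image-logging hook would save the picture next to
// the report and emit an <img src="..."> line here instead of a <p>.
class HtmlReportSink : public spdlog::sinks::base_sink<std::mutex> {
 public:
  explicit HtmlReportSink(const std::string &path) : out_(path) {
    out_ << "<html><body>\n";
  }
  ~HtmlReportSink() override { out_ << "</body></html>\n"; }

 protected:
  void sink_it_(const spdlog::details::log_msg &msg) override {
    spdlog::memory_buf_t formatted;
    formatter_->format(msg, formatted);
    out_ << "<p>";
    out_.write(formatted.data(), formatted.size());
    out_ << "</p>\n";
  }
  void flush_() override { out_.flush(); }

 private:
  std::ofstream out_;
};

// Usage:
//   auto sink = std::make_shared<HtmlReportSink>("diag-report.html");
//   SetDiagLogger(std::make_shared<spdlog::logger>("tesseract-html", sink));
```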
If you are curious: I have a PoC of this running, and I can post one or more examples of the HTML output, with images, so you can see what I mean.
The higher-level intent with this is to provide improved, user-understandable diag output. For example, my fork also includes some extra explanatory info texts in its diag stream, so you have a better chance of grokking what's going on and where-we-are-at in a tesseract run while you, the human, are already a little tired after your day's work. Reduce the brain strain, if only a little.
C. Provide a means for tesseract to accept fully preprocessed image input data. This MAY include pre-done greyscale conversion, pre-done segmentation, and pre-done image segment ordering (the box segment list in tesseract, which also determines the order of words written to hocr, tsv, etc. outputs and which is currently determined solely by tesseract-internal processing).
While tesseract currently has a bit of this in the form of an input rect (=bbox) list, that is wholly inadequate for nontrivial page images.
The goal here: provide a means to override the tesseract-internal preproc stage (which is limited; that's fine with me), either in part or in whole, so the OCR user can tune their inputs for maximum tesseract OCR performance in their particular scenarios.
This is not just about old newspapers (with their frugal, tight page segmentation) but also about situations where the input's text lines are not the main content focus: think video subtitles, posters, advertisements and other visually rich publications such as brochures, but also scenarios where the page segmentation is very peculiar, e.g. bills of lading and other semi(!)-tabular paper.
i.o.w.: if your userland input needs specific treatment, tesseract won't stand in the way: you can have full control.
The idea here (not implemented yet, but on my list anyway) is to use an image layer approach akin to the ones used in the movie / animation industry: you, the user, feed tesseract the page scan image as you did before. You also optionally provide the greyscale image, if your own process serves you better than the one already present in tesseract: a greyscale image layer or separate image file. Ditto for the segmentation: you can either hand-paint or produce, through any algorithmic means, an image-based segmentation mask image, like this (a hypothetical API sketch follows after the list below):
- Blue pixels are image content that should be segmented as picture content; relevant output formats such as html/hocr and pdf can include these bboxed picture segments. The same shade of blue means: belongs to the same group/segment.
- Green pixels identify text pixels, which must be OCRed. You provide this at page-segment-block, text-line or per-word segmentation granularity (the jury in my head is still out on whether this should be done as a flag/setting or as an always-per-word layer where the line/block-level segment hierarchy can be extracted from different masks: 1 or 3 text mask layers, that is the question...)
As with the blue ones: the same color value means the pixels belong to the same segment (word/line).
This solves, among other things, bbox-based problems with pages filled with reduced-line-height text, e.g. annotated and classic prints, not just medieval writing where ascenders and descenders occur on the same horizontal pixel scan lines.
- Red pixels are for additional noise removal: while these are theoretically superfluous (all this can be done with a precise green mask), in practice it is handy to have a means to dab a few red Tipp-Ex splotches on top of aggravating scratches, burns and pits, as one finds in classic microfiche and other photographed or scanned (dirt! damage!) imagery.
- Extra #1: an Invert layer which dictates which pixels must be inverted, so we get black text on a white background everywhere. This is 'extra' as it could be encoded in the greyscale layer, but I find elsewhere that it's handy to have this as a separate layer mask.
- Extra #2: a binarized image, for V3 and for when you only wish to override the tesseract-internal binarization process: you MAY provide the black&white binarization result yourself as an optional extra layer/image. The current API has this already.
The blue/green/red colors of these additional mask layers are just a local choice, which I use for some of my other work; anyone who is a little familiar with professional video editing and/or rendering knows that these additional image layers are not colors, nor are they limited to 8-bit colorspaces: 16-bit int and float16 mask layers are quite common.
(Which hands me the argument that such a layer, with its 65K-and-a-bit distinct 'values', is perfectly sufficient to direct the precise user-directed word segmentation of a large page image: you need a crazily huge image before you run into the 2^16 word-count limit. (Yeah, I know what they said about 640K RAM back in the day. I am lining up for addition to the list of people uttering stupid assumptions. Ha! OpenEXR also supports int32 layers, so I have my way out of there, off that list of embarrassments. ;-) )
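To make item C tangible, here is a purely hypothetical API sketch of the layer idea. Nothing like this exists in mainline today; every Set*Layer/Mask name below is invented for illustration, while SetImage and Recognize are the real, existing calls:

```cpp
#include <tesseract/baseapi.h>
#include <allheaders.h>  // Leptonica's Pix

// Hypothetical: feed tesseract a page plus optional user-made preproc layers,
// each one overriding the corresponding internal preprocessing stage.
void RecognizeWithLayers(tesseract::TessBaseAPI &api,
                         Pix *page,        // the original scan, as today
                         Pix *grey,        // optional: your own greyscale conversion
                         Pix *text_mask,   // 'green' layer: segment-ID-coded text pixels
                         Pix *image_mask,  // 'blue' layer: picture regions for hocr/pdf
                         Pix *noise_mask)  // 'red' layer: pixels to erase up front
{
  api.SetImage(page);  // existing API
  // Invented extension points; none of these exist yet:
  // api.SetGreyscaleLayer(grey);                      // skip internal grey conversion
  // api.SetSegmentationMasks(text_mask, image_mask);  // skip internal page segmentation
  // api.SetNoiseMask(noise_mask);                     // pre-recognition cleanup
  api.Recognize(nullptr);  // existing API
}
```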
Note, regarding diagnostics: I am of the opinion that the training process is not one where I, and most(?) others, are willing to sit around waiting to interact with the Java ScrollView tool. While a while-training UI/GUI is a useful concept during research into new, smaller research/PoC language models, I expect training rounds to be run in some sort of (semi)automated batch mode, where one wants to skim through the run reports / logging after the fact, on the lookout for remarkable bits and their accompanying images. So I am far more interested in getting all that debug/diagnostics data into an easy-to-read post-partum format. My preference is HTML+images, written to disk, so I can click through and review at my leisure in the evening / when time allows, then jump in and fiddle with my training rig (inputs, scripts, whatever and wherever I decide to adjust my process), which is long-running and thus not actively monitored 24/7 by Human Eyeball Mk.1.
Ditto for my OCR runs: I want to be able to see everything after the fact. If I find any screwups, those inputs can easily be requeued for a subsequent batch run, and I gain the ability to plan my own valuable time in different ways when tesseract is not my main focus du jour. I'd rather re-run something than have to watch the bugger as it happens, whenever I want or need ScrollView-grade feedback to tune my process.
That's all.
A and B already exist in raw form. B needs a redo, in my opinion, to make it proper / less crappy than what I have today, but what I want and what needs to be done is largely ... only effort; no (re)architecting apart from punching the right hole through fmt+spdlog.
C is a cobbled-together approach I employ elsewhere; the basic "extra layers" idea is to be ported into a tesseract preprocessor I am setting up, which has taken its sweet time, so C is future music, which I intend to compose and execute further in 2025.
(A user-scriptable preproc stage with some GA-based 'autotuning' ideas which I still need to viability-test: they are a napkin-level battle plan today and, cf. von Moltke, must see the light and face harsh reality to be tested, but that's another story.)
Thanks for your time and responses,
Best regards,
Ger Hobbelt