I'm writing a program to convert TIFF images of books to ePubs. I have a bunch (about 4000) of book images which I converted in the 2000s with FineReader. I want to improve the results and am too cheap to buy an updated program. Plus, it's fun.
Tesseract looks like it gives equal or better results than my original system. However, the current incarnation does not support bold or italic, which is important, though arguably not essential.
The last I could find on this was from 2022. A bit more informative is this.
Basically, the latter (from theraysmith) says that the information for bold and italic (at least) is available at some level in the code hierarchy, but would need some work to expose, or at least this is how I interpreted it. There was some indication that this would be desirable, but I'm not sure it's on your roadmap.
If it is, do you know when? If not, could it be added? And if that is not possible, can I run both the Version 3 and Version 5 recognisers?
My concern with the latter is that the version 3 code paths appear to be explicitly disabled in V5 through a #define. This #define seems to be generated early in the compilation process by some Linuxy tools that are well beyond my (limited) Linux experience. How could I generate a library or set of DLLs which would allow me to run both recognisers (probably one after the other, then pick the 'best' result)?
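To make the "run both and pick the best" idea concrete, here is a rough Python sketch of the selection step I have in mind. It is only an illustration under some big assumptions: the `Word` type, the `engine` labels and the `pick_best` function are made-up names, not anything from Tesseract; I'm assuming each recogniser gives me a per-word confidence on a comparable 0-100 scale; and I'm assuming the two word lists are already aligned one-to-one, which real output from two engines would not guarantee.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """One recognised word (hypothetical container, not a Tesseract type)."""
    text: str
    confidence: float  # assumed to be on a comparable 0-100 scale for both engines
    engine: str        # illustrative label, e.g. "legacy" or "lstm"


def pick_best(legacy: List[Word], lstm: List[Word]) -> List[Word]:
    """Choose, position by position, the word with the higher confidence.

    Assumes both lists are already aligned word-for-word. Ties go to the
    second (LSTM) list.
    """
    if len(legacy) != len(lstm):
        raise ValueError("word lists must be aligned and of equal length")
    return [a if a.confidence > b.confidence else b
            for a, b in zip(legacy, lstm)]


if __name__ == "__main__":
    legacy_words = [Word("The", 90.0, "legacy"), Word("qu1ck", 40.0, "legacy")]
    lstm_words = [Word("The", 85.0, "lstm"), Word("quick", 95.0, "lstm")]
    for word in pick_best(legacy_words, lstm_words):
        print(word.engine, word.text)
```

Whether the confidences from the two recognisers are actually comparable is something I don't know, so the "higher confidence wins" rule here is just a placeholder for whatever 'best' turns out to mean.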
Hope this makes sense, and thanks in advance
Iain