Bold, Italic and Tesseract 3


Iain Downs

Nov 16, 2024, 6:51:04 AM
to tesseract-ocr
I'm writing a program to convert TIFF images of books to ePubs.  I have a bunch (4000) of book images which I converted in the 2000s with FineReader.  I want to improve the results and am too cheap to buy an updated program.  Plus, it's fun.

Tesseract looks like it gives equal or better results than my original system.  However, the current incarnation does not support bold or italic, which is important, though arguably not essential.

The last I could find on this was from 2022.  A bit more informative is this.

Basically, the latter (a comment from theraysmith) says that the information for bold and italic (at least) is available at some level in the code hierarchy, but would need some work to expose, or at least this is how I interpreted it.  There was some indication that this would be desirable, but I'm not sure it's on your roadmap.

If it is, do you know when?  If not, could it be added?  If no to that, is it possible to run both Version 3 and Version 5 recognition?

My concern with the latter is that the version 3 code paths appear to be explicitly disabled in V5 through a #define.  This #define seems to be generated early in the compilation process by some Linuxy tools that are well beyond my (limited) Linux experience.  How could I generate a library / set of DLLs which would allow me to run both recognisers (probably one after the other, then pick the 'best' result)?
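
To make the "run both and pick the best" idea concrete, here is a rough Python sketch of what I have in mind.  It assumes a tesseract command-line build in which the legacy engine has not been disabled, and traineddata files that contain both the legacy and LSTM models, so that `--oem 0` and `--oem 1` both work.  The function names and the "highest mean word confidence wins" rule are my own placeholders, and I am not sure that confidences from the two engines are directly comparable.

```python
import csv
import io
import subprocess


def run_tesseract(image_path, oem):
    """Run the tesseract CLI on one image and return its TSV output.

    Assumes `tesseract` is on PATH; oem=0 selects the legacy engine,
    oem=1 the LSTM engine (legacy only works if it was compiled in).
    """
    result = subprocess.run(
        ["tesseract", image_path, "stdout", "--oem", str(oem), "tsv"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def mean_confidence(tsv_text):
    """Mean confidence over word rows (non-empty text, conf >= 0)."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    confs = []
    for row in reader:
        text = (row.get("text") or "").strip()
        conf = row.get("conf")
        if not text or conf in (None, ""):
            continue
        value = float(conf)
        if value >= 0:  # structural rows carry conf = -1
            confs.append(value)
    return sum(confs) / len(confs) if confs else 0.0


def pick_best(tsv_by_engine):
    """Return the engine name whose output has the highest mean confidence.

    Hypothetical selection rule; a per-word or per-line merge may be better.
    """
    return max(tsv_by_engine, key=lambda name: mean_confidence(tsv_by_engine[name]))
```

Usage would be something like `pick_best({"legacy": run_tesseract("page.tif", 0), "lstm": run_tesseract("page.tif", 1)})`, where `page.tif` is just an example file name.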

Hope this makes sense, and thanks in advance

Iain

Iain Downs

Dec 6, 2024, 6:04:55 AM
to tesseract-ocr
Just a nudge to see if there is any feedback on this question.  

Many thanks


Iain
