I have recently been experimenting with training Tesseract 3.04 to differentiate between standard and italic text, after obtaining poor results using the eng.traineddata from the 3.04 tessdata repo. This is with the end goal of using the Tesserocr wrapper for Python to create a script capable of outputting basic HTML files, as the hOCR files generated from the command line are too verbose for my purposes.
As part of the process of improving my language model, I have enabled the tessedit_debug_fonts flag in my hOCR config to view more detailed information on how the font recognition system works.
After noticing some anomalies in both the hOCR and Tesserocr output, I created two scripts. One parses the debug data from Tesseract and outputs whether the current word is italic based on the detected font. The other uses Tesserocr to iterate at the word level, obtaining the UTF-8 text for each word and its italic attribute via the GetUTF8Text and WordFontAttributes methods respectively.
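For reference, a minimal sketch of the Tesserocr side, assuming the bask.traineddata file is installed as the language "bask". This is an illustration of the approach rather than the exact tesserocrItalicTest.py script; the function names and the image path are hypothetical.

```python
def is_italic(attrs):
    """Return the italic flag from a WordFontAttributes() result.

    Treats a missing result (None) as "not italic"; that fallback is an
    assumption made for this sketch.
    """
    return bool(attrs and attrs.get("italic"))


def word_italics(image_path, lang="bask"):
    """Yield (word, italic) pairs for every word Tesseract recognises."""
    # Imported lazily so is_italic() stays usable without tesserocr installed.
    from tesserocr import PyTessBaseAPI, RIL, iterate_level

    with PyTessBaseAPI(lang=lang) as api:
        api.SetImageFile(image_path)
        api.Recognize()
        for word in iterate_level(api.GetIterator(), RIL.WORD):
            text = word.GetUTF8Text(RIL.WORD)
            yield text, is_italic(word.WordFontAttributes())
```

Calling `for text, italic in word_italics("page.png"): print(text, italic)` (with "page.png" standing in for the real image) prints one word and flag per line, in the same shape as the right-hand column of the output shown further down.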
Here is the output of these scripts for the first line of the attached image:
tessDebugParser.py         | tesserocrItalicTest.py
---------------------------|---------------------------
First,         False       | First,         False
there          True        | there          True
is             True        | is             True
no             True        | no             True
physiological  True        | physiological  True
requirement    True        | requirement    True
for            True        | for            True
sugar;         True        | sugar;         True
all            False       | all            True
human          False       | human          False
Whilst the debug output shows that Tesseract is correctly detecting the italics used within the file, the output from Tesserocr does not match this. In this case, the error is in the word 'all' near the end of the line, and persists throughout the rest of the output.
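The discrepancy can also be located mechanically rather than by eye. The following is a small sketch that compares two lists of (word, italic) pairs; the function name is hypothetical, and the data is copied from the first-line output above.

```python
def italic_mismatches(debug_rows, api_rows):
    """Return (index, word, debug_italic, api_italic) wherever the flags differ."""
    mismatches = []
    for i, ((w1, it1), (w2, it2)) in enumerate(zip(debug_rows, api_rows)):
        if w1 != w2:
            # The two sources should agree on the words themselves.
            raise ValueError(f"word lists diverge at {i}: {w1!r} vs {w2!r}")
        if it1 != it2:
            mismatches.append((i, w1, it1, it2))
    return mismatches


words = ["First,", "there", "is", "no", "physiological",
         "requirement", "for", "sugar;", "all", "human"]
debug_flags = [False, True, True, True, True, True, True, True, False, False]
api_flags = [False, True, True, True, True, True, True, True, True, False]

result = italic_mismatches(list(zip(words, debug_flags)),
                           list(zip(words, api_flags)))
print(result)  # [(8, 'all', False, True)]
```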
The raw debug output for this word is as follows:
Examining fonts in a [61 ] l [6c ] l [6c ]
Font baskerville, total score = 161356
Font baskervilleItalic, total score = 31774
Word modal font=baskerville, score=2. No 2nd choice
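A parser for this debug format might look like the sketch below. It is an assumption about the approach, not the actual tessDebugParser.py: the regular expressions are inferred only from the four lines shown above, and treating any font whose name contains "italic" as italic is a naming convention specific to this trained data.

```python
import re

PREFIX = "Examining fonts in "
CHAR_RE = re.compile(r"(\S+) \[[0-9a-f ]*\]")           # e.g. "a [61 ]"
SCORE_RE = re.compile(r"Font (\S+), total score = (\d+)")
MODAL_RE = re.compile(r"Word modal font=(\S+?), score=(\d+)")


def parse_font_debug(text):
    """Parse font debug output into one dict per examined word."""
    results, current = [], None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(PREFIX):
            # Rebuild the word from its per-character entries.
            chars = CHAR_RE.findall(line[len(PREFIX):])
            current = {"word": "".join(chars), "scores": {},
                       "font": None, "italic": None}
            results.append(current)
        elif current is not None:
            m = SCORE_RE.match(line)
            if m:
                current["scores"][m.group(1)] = int(m.group(2))
                continue
            m = MODAL_RE.match(line)
            if m:
                current["font"] = m.group(1)
                # Assumption: italic fonts carry "italic" in their name.
                current["italic"] = "italic" in m.group(1).lower()
    return results


sample = """Examining fonts in a [61 ] l [6c ] l [6c ]
Font baskerville, total score = 161356
Font baskervilleItalic, total score = 31774
Word modal font=baskerville, score=2. No 2nd choice"""

parsed = parse_font_debug(sample)
print(parsed[0]["word"], parsed[0]["italic"])  # all False
```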
These are full attributes for the same word, using Tesserocr:
all {'monospace': False, 'serif': True, 'bold': False, 'smallcaps': False, 'italic': True, 'pointsize': 105, 'font_name': 'baskervilleItalic', 'underlined': False, 'font_id': 1}
I have also included the hOCR output for the same file, which matches the output from Tesserocr, and the bask.traineddata file used to create it.
Is this behaviour an intentional feature of Tesseract? It wouldn't be an issue for me to write a more sophisticated debug file parser that would allow me to fulfill my original intention, but I'm curious as to what is causing this.
Many thanks,
Alan