Tesseract 3.04 font/italics detection - possible bug?

Alan Slater

Feb 21, 2019, 12:17:19 AM
to tesseract-ocr
I have recently been experimenting with training Tesseract 3.04 to differentiate between standard and italic text, after obtaining poor results using the eng.traineddata from the 3.04 tessdata repo. This is with the end goal of using the Tesserocr wrapper for Python to create a script capable of outputting basic HTML files, as the hOCR files generated from the command line are too verbose for my purposes.

As part of the process of improving my language model, I have enabled the tessedit_debug_fonts flag in my hOCR config to view more detailed information on how the font recognition system works.

After noticing some anomalies in both the hOCR and Tesserocr output, I created two scripts. The first parses the debug data from Tesseract and reports whether each word is italic, based on the detected font. The second uses Tesserocr to iterate at the word level, obtaining the UTF-8 text of each word and its italic attribute via the GetUTF8Text and WordFontAttributes methods respectively.

Here is the output of these scripts for the first line of the attached image:

    tessDebugParser.py         | tesserocrItalicTest.py
    ---------------------------|---------------------------
    First, False               | First, False
    there True                 | there True
    is True                    | is True
    no True                    | no True
    physiological True         | physiological True
    requirement True           | requirement True
    for True                   | for True
    sugar; True                | sugar; True
    all False                  | all True
    human False                | human False

 
Whilst the debug output shows that Tesseract is correctly detecting the italics used within the file, the output from Tesserocr does not match this. In this case, the error is in the word 'all' at the end of the line, and persists throughout the rest of the output.

The raw debug output for this word is as follows:

    Examining fonts in a [61 ] l [6c ] l [6c ]
    Font baskerville, total score = 161356
    Font baskervilleItalic, total score = 31774
    Word modal font=baskerville, score=2. No 2nd choice
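
For illustration, here is a minimal sketch of how debug output in the format shown above could be parsed. It is not the attached tessDebugParser.py. The function name `parse_font_debug` is hypothetical, the line patterns are inferred only from the four sample lines above, and treating a font as italic when its name contains "italic" is an assumption that fits names like baskervilleItalic.

```python
import re

# Each character in an "Examining fonts in" line is followed by its bytes in
# hex inside brackets, e.g. "a [61 ]". Pattern inferred from the sample only.
CHAR_RE = re.compile(r"(\S+?) \[[0-9a-f ]*\]")
MODAL_RE = re.compile(r"Word modal font=(.+?), score=")
PREFIX = "Examining fonts in "


def parse_font_debug(text):
    """Return a list of (word, modal_font, is_italic) tuples."""
    results = []
    word = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(PREFIX):
            # Rebuild the word from its per-character entries.
            word = "".join(CHAR_RE.findall(line[len(PREFIX):]))
        elif word is not None:
            match = MODAL_RE.match(line)
            if match:
                font = match.group(1)
                # Assumption: italic fonts have "italic" in their name.
                results.append((word, font, "italic" in font.lower()))
                word = None
    return results


SAMPLE = """\
    Examining fonts in a [61 ] l [6c ] l [6c ]
    Font baskerville, total score = 161356
    Font baskervilleItalic, total score = 31774
    Word modal font=baskerville, score=2. No 2nd choice
"""

for word, font, italic in parse_font_debug(SAMPLE):
    print(word, font, italic)
```

Only the "Word modal font" line is used to decide the result; the per-font total scores are ignored in this sketch.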


These are the full attributes for the same word, as reported by Tesserocr:

     all {'monospace': False, 'serif': True, 'bold': False, 'smallcaps': False, 'italic': True, 'pointsize': 105, 'font_name': 'baskervilleItalic', 'underlined': False, 'font_id': 1}
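
As a rough sketch of the end goal (basic HTML rather than hOCR), per-word attribute dictionaries like the one above could be turned into markup by grouping consecutive italic words. The function name `words_to_html` and the choice of `<p>` and `<em>` tags are assumptions for illustration; only the 'italic' key of each dictionary is used.

```python
from html import escape
from itertools import groupby


def words_to_html(words):
    """Build one HTML paragraph from (text, attrs) pairs.

    attrs is a dict like the one returned per word above; consecutive
    words that share the same 'italic' value are grouped into one run.
    """
    runs = []
    for italic, group in groupby(words, key=lambda w: bool(w[1].get("italic"))):
        run = " ".join(escape(text) for text, _ in group)
        runs.append(f"<em>{run}</em>" if italic else run)
    return "<p>" + " ".join(runs) + "</p>"


words = [
    ("First,", {"italic": False}),
    ("there", {"italic": True}),
    ("is", {"italic": True}),
    ("human", {"italic": False}),
]
print(words_to_html(words))
```

The output is only as reliable as the italic flag it is given, so the mismatch described in this post would carry straight through to the HTML.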

   
I have also included the hOCR output for the same file, which matches the output from Tesserocr, and the bask.traineddata file used to create it.

Is this behaviour an intentional feature of Tesseract? It wouldn't be an issue for me to write a more sophisticated debug file parser that would allow me to fulfill my original intention, but I'm curious as to what is causing this.

Many thanks,
Alan
sugar.png
tessDebugParser.py
DebugOutput.txt
sugar.hocr
bask.traineddata
tesserocrItalicTest.py

yadab sd

Jul 22, 2019, 1:33:08 AM
to tesseract-ocr
I'm getting this error... Please help
RuntimeError: Failed to init API, possibly an invalid tessdata path: C:\Program Files (x86)\Tesseract-OCR\tessdata/