user_patterns_suffix(Bazzar) does not improve accuracy

Devon Yoo

Feb 29, 2016, 11:50:54 AM
to tesseract-ocr

Hi, I am trying to pass a string pattern to a TesseractEngine object when it is initialized.
I am using "A .Net wrapper for tesseract-ocr" 3.0.1.0 in C#.

Here is my code:


C# code

using( TesseractEngine engine = new TesseractEngine( 
    @"./tessdata", 
    "eng", 
    EngineMode.Default, 
    "bazzar" ) )   // loads the config file tessdata/configs/bazzar (important)
{   
    engine.SetVariable( "tessedit_char_whitelist", "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-" );
    engine.SetVariable( "language_model_penalty_non_freq_dict_word", "1" );
    engine.SetVariable( "language_model_penalty_non_dict_word", "1" );

    string user_patterns_suffix;
    engine.TryGetStringVariable( "user_patterns_suffix", out user_patterns_suffix );
    using( Page page = engine.Process( bitmap, PageSegMode.SingleLine ) )
    {
        ...
    }
}


tessdata/configs/bazzar

load_system_dawg     F
load_freq_dawg       F
user_words_suffix    user-words
user_patterns_suffix user-patterns


tessdata/eng.user-patterns

25\w\w\w\d\d
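
One thing that may be worth checking: as far as the comments in Tesseract's trie.h describe it, the user-patterns file does not use regex classes. It uses its own set (\c for an alphabetic character, \d for a digit, \n for a digit or letter, \p for punctuation, \a for lowercase, \A for uppercase, \* for repetition), so \w may simply not be recognized. A variant written with those classes would look like the line below. This is an untested assumption, not a confirmed fix, and it is unclear whether the 3.0.1.0 wrapper's bundled Tesseract behaves the same way.

```
25\A\A\A\d\d
```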


tessdata/eng.user-words

JAN
FEB
MAR
APR
MAY
JUN
JUL
AUG
OCT
SEP
NOV
DEC


TestImage.jpg

25MAR16


Output from tesseract:

25HAR16

The user-words and user-patterns files appear to load successfully into the Tesseract engine.
However, Tesseract does not seem to consult my user-words list, because it keeps returning
HAR instead of MAR.
How can I force the \w\w\w part of the pattern to be matched against the user-words list?
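
If the dictionary settings cannot be made to work, one possible fallback is to correct the result after recognition. The following is only a minimal sketch of that idea: `MonthSnap` and `FixMonth` are hypothetical names (not part of the wrapper), and it assumes the recognized text is exactly seven characters in the form DDMMMYY. It replaces the three letters with the month from the user-words list that differs in the fewest positions.

```csharp
using System;
using System.Linq;

static class MonthSnap
{
    // Same entries as tessdata/eng.user-words.
    static readonly string[] Months =
    {
        "JAN", "FEB", "MAR", "APR", "MAY", "JUN",
        "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"
    };

    // Number of positions at which two equal-length strings differ.
    static int Mismatches(string a, string b)
    {
        return a.Zip(b, (x, y) => x == y ? 0 : 1).Sum();
    }

    // Hypothetical helper: for text shaped like DDMMMYY, replace the three
    // letters with the closest month; return the input unchanged otherwise.
    public static string FixMonth(string text)
    {
        if (text == null || text.Length != 7) return text;
        string month = text.Substring(2, 3);
        // OrderBy is stable, so ties go to the month listed first.
        string best = Months.OrderBy(m => Mismatches(m, month)).First();
        return text.Substring(0, 2) + best + text.Substring(5);
    }
}
```

With the output above, `MonthSnap.FixMonth("25HAR16")` returns "25MAR16", since MAR is the only month within one character of HAR. This does not make Tesseract itself use the word list; it only repairs the string afterwards, and it can pick the wrong month when two candidates are equally close.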
