I'm trying to figure out how to disable all of Tesseract's language-model behaviour and do only character recognition and word-splitting on whitespace. I've tried different `--oem` modes, including mode 0 with a legacy language file, but Tesseract still tries to correct words/characters based on the surrounding characters.
Say I have a "word" consisting of a letter and a number, like "S9" or "S99". Depending on the combination of settings I use, I usually get one of these incorrect behaviours:
- The S is replaced with a "$" (dollar sign) because it thinks it's currency
- The S is replaced with an "8" because it thinks it's a number
In most other situations it reads the same S correctly (i.e., as part of an actual word). The behaviour is only triggered when I mix letters and numbers, which suggests this is not a character recognition issue in the traditional sense of just detecting the outline.
I should add that the input I'm scanning comes from a digital file: it's a high-res, low-noise document with high contrast and a clean serif font. Noise/artifacts are not really an issue and the DPI can be as high as required. I'm currently scanning at 300 DPI (approx 8000x12000px) but I can increase or decrease it if it will help (it doesn't seem to).
I've tried disabling every relevant option I can find and it still keeps happening. Here is the full list of settings I'm passing:
CUSTOM_TESSERACT_CONFIG = (
    '--oem 0 --psm 6 '
    f'-c tessedit_char_whitelist="{VALID_CHARS}" '
    '-c tessedit_enable_dict_correction=0 '
    '-c load_system_dawg=0 '
    '-c load_freq_dawg=0 '
    '-c load_punc_dawg=0 '
    '-c load_number_dawg=0 '
    '-c load_unambig_dawg=0 '
    '-c load_bigram_dawg=0 '
    '-c load_fixed_length_dawgs=0 '
    '-c wordrec_enable_assoc=0 '
    '-c language_model_penalty_non_freq_dict_word=0 '
    '-c language_model_penalty_non_dict_word=0 '
    '-c tessedit_prefer_joined_punct=1 '
    '-c textord_enable_word_ngrams=0 '
    '-c tessedit_good_quality_unrej=1 '
    '-c tessedit_enable_bigram_correction=0 '
    '-c tessedit_enable_doc_dict=0 '
    '-c textord_enable_out_of_punct=0 '
    '-c textord_enable_xheight_stats=0 '
    '-c enable_noise_removal=0 '
    '-c classify_enable_adaptive_matcher=0 '
    '-c classify_enable_learning=0 '
    '-c tessedit_preserve_blk_rej_perfect_wds=1 '
    '-c preserve_interword_spaces=1 '
    '-c segment_penalty_dict_case=0 '
    '-c segment_penalty_garbage=0 '
    '-c textord_split_num_pattern=0'
)
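For context, a config string like this is typically handed to a wrapper such as pytesseract. The post does not say which wrapper is in use, so the following is only a minimal sketch under that assumption. The helper names `words_with_boxes` and `run_ocr` are hypothetical; `pytesseract.image_to_data` with `output_type=Output.DICT` is the real call that returns per-word text and boxes.

```python
def words_with_boxes(data):
    """Convert the dict form of image_to_data output into
    (text, (left, top, width, height)) tuples, skipping empty entries."""
    results = []
    for i, text in enumerate(data["text"]):
        if not text.strip():
            # Block/paragraph/line rows carry an empty text field.
            continue
        box = (data["left"][i], data["top"][i],
               data["width"][i], data["height"][i])
        results.append((text, box))
    return results


def run_ocr(image, config):
    """Hypothetical wrapper: assumes pytesseract is installed and that
    `config` is a string such as CUSTOM_TESSERACT_CONFIG."""
    import pytesseract
    from pytesseract import Output

    data = pytesseract.image_to_data(image, config=config,
                                     output_type=Output.DICT)
    return words_with_boxes(data)
```

Calling `run_ocr(image, CUSTOM_TESSERACT_CONFIG)` would then give one entry per word Tesseract reports, which is where the wrong "$9" or "89" readings show up.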
Basically what I'm after is for tesseract to do ONLY these things:
a.) Detect each character based only on its outline, not the surrounding context, and use the best match.
b.) Group nearby characters into words based only on whitespace (no splitting on commas or other punctuation). I still want the punctuation captured as part of the group (e.g. $9,999.00).
c.) Give me the bounding box of each group (because I need the position for further processing).
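To make (b) and (c) concrete, here is a rough sketch of the grouping being described. It is not Tesseract's algorithm: it assumes per-character boxes for a single text line are already available as `(char, left, top, right, bottom)` tuples with a top-left origin, and both `group_by_whitespace` and the `max_gap` threshold are hypothetical names chosen for illustration.

```python
def group_by_whitespace(char_boxes, max_gap):
    """Merge characters on one line into groups wherever the horizontal
    gap between neighbours is at most max_gap pixels.

    Returns a list of (text, (left, top, right, bottom)) tuples, where the
    box is the union of the boxes of the characters in the group.
    """
    groups = []
    # Process characters left to right.
    for ch, left, top, right, bottom in sorted(char_boxes, key=lambda b: b[1]):
        if groups and left - groups[-1][1][2] <= max_gap:
            # Small gap: extend the current group, punctuation included.
            text, (gl, gt, gr, gb) = groups[-1]
            groups[-1] = (
                text + ch,
                (min(gl, left), min(gt, top), max(gr, right), max(gb, bottom)),
            )
        else:
            # Large gap (whitespace): start a new group.
            groups.append((ch, (left, top, right, bottom)))
    return groups
```

With this kind of grouping, a comma or period never splits a group; only a gap wider than `max_gap` does.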
How can I do this? Is it even possible?
----
tesseract -v
tesseract 5.5.0
leptonica-1.85.0
libgif 5.2.2 : libjpeg 8d (libjpeg-turbo 3.0.4) : libpng 1.6.47 : libtiff 4.7.0 : zlib 1.2.12 : libwebp 1.5.0 : libopenjp2 2.5.3
Found NEON
Found libarchive 3.7.7 zlib/1.2.12 liblzma/5.6.3 bz2lib/1.0.8 liblz4/1.10.0 libzstd/1.5.6
Found libcurl/8.7.1 SecureTransport (LibreSSL/3.3.6) zlib/1.2.12 nghttp2/1.61.0