Tesseract LSTM competitive word recognition (at least for certain use cases)


Jozef M.

Nov 10, 2025, 4:54:06 PM (12 days ago) Nov 10
to tesser...@googlegroups.com

Dear Tesseract Community,

We run a high-volume, multi-engine OCR pipeline that includes Tesseract 4 (LSTM, Latin), a custom Tesseract 3 (outline-based) model for specific cases, and newer OCR models in a low-resource serverless environment.
We wanted to share some brief internal results that may be useful to the community.

Key points

  • Data: Internal printed-word sets. We cannot publish the datasets. The goal here is practical relevance, not reproducibility claims.
  • Scope: These results focus strictly on word recognition.
  • Context: We did not evaluate other valuable Tesseract features (e.g., segmentation, CPU performance) or address known limitations (e.g., GPU support or the practicality of generic LSTM retraining); however, they might be important for your use case.

Findings

Confidence Calibration
For Tesseract LSTM-based models, confidence and correctness are strongly linked: most errors sit at lower confidence levels.
This makes thresholding and model voting reasonably straightforward. In our tests, the confidence distributions of Tesseract LSTM models are usable for such decisions.
Note that the Tesseract 3 outline-based matching model is more noise-sensitive on our data, which supports our view that the tested dataset is not "easy".
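To make the thresholding and voting idea concrete, here is a minimal sketch. It is not the code from the pipeline described above: the engine names, the threshold of 60, and the confidence-weighted voting rule are all assumptions chosen for illustration.

```python
from collections import defaultdict

def vote(candidates, threshold=60.0):
    """Pick one word from (engine, text, confidence) candidates.

    Candidates below the (hypothetical) threshold are ignored. The rest
    vote for their text, weighted by confidence. Returns None if no
    candidate passes the threshold.
    """
    scores = defaultdict(float)
    for _engine, text, conf in candidates:
        if conf >= threshold:
            scores[text] += conf
    if not scores:
        return None
    # The text with the highest summed confidence wins.
    return max(scores, key=scores.get)

# Hypothetical engine outputs for a single word image.
candidates = [
    ("tess4_lstm", "invoice", 93.0),
    ("tess3", "lnvoice", 41.0),
    ("other", "invoice", 88.0),
]
print(vote(candidates))  # invoice
```

A rule like this only works when low confidence really does track errors, which is the property reported above for the LSTM models.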
[Figures: three confidence-distribution charts (images not included)]
Confidence scores are limited to the [0, 100] range. Each confidence level has two corresponding values, red and green. Ideally, high confidence levels show a low red value and a high green value, and low confidence levels show the reverse.
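For readers who want to build a similar chart from their own data, the following sketch bins word results by confidence and counts correct and incorrect words per bin. It assumes green corresponds to correct words and red to incorrect ones, which the caption implies but does not state. The bin width and the sample data are made up.

```python
from collections import Counter

def confidence_histogram(results, bin_width=10):
    """Count correct and incorrect words per confidence bin.

    results: iterable of (confidence, is_correct) pairs.
    Returns {bin_start: (n_correct, n_incorrect)}.
    Values outside [0, 100] are skipped in this sketch.
    """
    correct, incorrect = Counter(), Counter()
    for conf, is_correct in results:
        if not 0 <= conf <= 100:
            continue
        # Put a confidence of exactly 100 into the top bin.
        start = min(int(conf // bin_width) * bin_width, 100 - bin_width)
        (correct if is_correct else incorrect)[start] += 1
    bins = sorted(set(correct) | set(incorrect))
    return {b: (correct[b], incorrect[b]) for b in bins}

# Made-up sample: (confidence, word was correct)
sample = [(97.0, True), (95.5, True), (100.0, True),
          (42.0, False), (48.0, True), (12.0, False)]
print(confidence_histogram(sample))
```

With well-calibrated scores, the incorrect counts should concentrate in the low bins and the correct counts in the high bins.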


Head-to-Head Comparisons
Direct word-level comparisons show a meaningful share of cases where the Tesseract LSTM model is correct while the others are not.
This complementary behavior means the Tesseract LSTM model still adds significant value in an ensemble, despite being an older engine.
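A head-to-head comparison of this kind can be sketched as below. This is an illustration under assumptions, not the evaluation code behind the results above: it assumes exact string match against ground truth and aligned word lists, and the sample words are invented.

```python
def head_to_head(truth, engine_a, engine_b):
    """Compare two engines word by word against ground truth.

    Returns counts of words where both are correct, only A, only B,
    or neither. Correctness here means exact string equality.
    """
    if not (len(truth) == len(engine_a) == len(engine_b)):
        raise ValueError("word lists must be aligned and of equal length")
    counts = {"both": 0, "only_a": 0, "only_b": 0, "neither": 0}
    for expected, word_a, word_b in zip(truth, engine_a, engine_b):
        ok_a, ok_b = word_a == expected, word_b == expected
        if ok_a and ok_b:
            counts["both"] += 1
        elif ok_a:
            counts["only_a"] += 1
        elif ok_b:
            counts["only_b"] += 1
        else:
            counts["neither"] += 1
    return counts

# Invented example: engine A stands in for the LSTM model.
truth = ["total", "amount", "2025", "due"]
lstm = ["total", "amount", "2O25", "due"]
other = ["total", "arnount", "2O25", "duc"]
print(head_to_head(truth, lstm, other))
```

The "only_a" and "only_b" counts are the interesting ones for an ensemble: they measure how often one engine can rescue the other.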

[Figure: head-to-head comparison chart (image not included)]


Conclusion
Mature engines like Tesseract are not obsolete (at least for certain use cases). In our pipeline, Tesseract LSTM word recognition remains competitive and, importantly, provides well-calibrated confidence scores that are useful for filtering and ensemble voting.
Best regards,
Jozef Misutka



Ger Hobbelt

Nov 11, 2025, 4:48:16 AM (12 days ago) Nov 11
to tesseract-ocr
Thank you for publishing this.

Question: the -1 confidence numbers for T3 and T4 in the charts: could you tell us what happened there? (Smells like mapping software failures to a score number; the word counts for these are pretty high so I'm very curious what went on there!)


Met vriendelijke groeten / Best regards,

Ger Hobbelt

--------------------------------------------------
web:    http://www.hobbelt.com/
        http://www.hebbut.net/
mail:   g...@hobbelt.com
mobile: +31-6-11 120 978
--------------------------------------------------
