--user-patterns not working with tesseract 5.2.0.20220712


Louis D

Aug 14, 2022, 1:55:49 PM
to tesseract-ocr
I'm using pytesseract with tesseract 5.2.0.20220712 to try to read a float number from an image.
Here is the image I'm trying to read:
help.png
When using tesseract with the config '--psm 7 --user-patterns "D:\PyCharmProjects\SpiralBattle\patterns.txt"', it returns "4.43003" instead of the expected "4.43e03".
My patterns.txt file is the following:
\d.\d\de\d\d
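
For readers who want to reproduce the setup, here is a minimal sketch of how such a call might look in pytesseract. The helper names `build_config` and `read_line` are hypothetical and not from the original post; the sketch assumes pytesseract and Pillow are installed and that the tesseract binary is on the PATH.

```python
def build_config(patterns_path, psm=7):
    """Build the config string described in the post (hypothetical helper)."""
    # --psm 7 treats the image as a single text line;
    # --user-patterns points tesseract at the patterns file.
    return f'--psm {psm} --user-patterns "{patterns_path}"'


def read_line(image_path, patterns_path):
    """OCR one text line from an image (hypothetical helper).

    The imports are local so that build_config can be used even when
    pytesseract or Pillow is not installed.
    """
    import pytesseract
    from PIL import Image

    config = build_config(patterns_path)
    text = pytesseract.image_to_string(Image.open(image_path), config=config)
    return text.strip()
```

A call would then be something like `read_line("help.png", r"D:\PyCharmProjects\SpiralBattle\patterns.txt")`.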

The pattern worked two or three times with similar images but now it doesn't work anymore for some reason.

Does anyone know why it broke?

Thanks in advance.

Yunlong Liu

Aug 14, 2022, 8:14:52 PM
to tesseract-ocr
I also encountered the same issue with release 5.2.

Benjamin Hall

Aug 15, 2022, 12:17:56 AM
to tesser...@googlegroups.com
> I also encountered the same issue with release 5.2

Did you ever find the reason why?

> The pattern worked two or three times with similar images but now it doesn't work anymore for some reason.
> Does anyone know why it broke?

Just learning Tesseract now and have not experimented with regex parameters yet... But if I find a solution I will get it over to you.


Zdenko Podobny

Aug 26, 2022, 7:30:59 AM
to tesser...@googlegroups.com
Let me give you some general pointers:
  • If something does not work the way you expect it to, that does not mean it's broken ;-). Maybe you just misunderstood something. Or you expect something that was never promised...
  • If you mark something as broken, you should first prove that it worked and now it does not work anymore. Or you should understand the feature in detail and explain why it does not work ;-)
About the user-patterns:
  1. user-patterns "just" extend the dawg dictionary [1], [2]. So using user-patterns does not mean "interpret the OCR string as this pattern" or "find this pattern in the image".
  2. Some years ago (in the age of tesseract 3, a.k.a. the legacy engine) someone measured the influence of dictionaries on OCR results at 10-15%. It would be great if somebody ran such a test for the LSTM engine ;-). But I would not expect a big change, and definitely not a 100% result, from adding a word/pattern to the dictionary.
[1] https://github.com/tesseract-ocr/tesseract/blob/5a36943de4a39d236a9762f6971823c5b7c20404/src/dict/dict.cpp#L263-L279
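
Since user-patterns only extend the dictionary and do not constrain the output, a caller who needs a strict format has to check it themselves. The following is a minimal sketch of such a check, not something tesseract provides: the name `check_reading` is hypothetical, and the regular expression is an assumption derived from the pattern and the expected value "4.43e03" in the original post.

```python
import re

# Expected shape of the reading, e.g. "4.43e03". This is an assumption
# based on the pattern \d.\d\de\d\d from the original post.
EXPECTED = re.compile(r"\d\.\d\de\d\d")


def check_reading(text):
    """Return the stripped text if it has the expected shape, else None."""
    cleaned = text.strip()
    if EXPECTED.fullmatch(cleaned):
        return cleaned
    return None
```

With this check, the misread result "4.43003" is rejected, so the caller can retry with different preprocessing or flag the image for review instead of silently accepting a wrong value.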

Zdenko


On Mon, Aug 15, 2022 at 6:17, Benjamin Hall <codename...@gmail.com> wrote: