user patterns with tesserocr python API


Roman Seidel

Feb 29, 2024, 3:40:55 PM
to tesseract-ocr
Hi all,

I am currently trying to use user patterns with the PyTessBaseAPI from tesserocr [1].

What I've done is to initialize the API with:

with PyTessBaseAPI(path='/usr/share/tesseract-ocr/4.00/tessdata', lang=LANGUAGE, psm=int(psm), oem=int(TOEM)) as api:

setting the user patterns file with:

api.SetVariable('user_patterns_file', '/home/roman/Dev_d/playground/user_patterns/deu.patterns')

Where the user patterns file contains a pattern, e.g.:

\A\A\A

(which means three capital letters).
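
For reference, here is a rough way to sanity-check what a pattern line is meant to match, outside of Tesseract. This is only a sketch: the escape classes follow the comments in Tesseract's trie.h as far as I know them, the regex classes are approximations (Tesseract consults its unicharset instead), and `pattern_to_regex` is a hypothetical helper, not part of tesserocr.

```python
import re

# Escape classes of the user-patterns syntax, mapped to approximate regex
# classes. Assumption: this mirrors the trie.h comments; Tesseract itself
# decides class membership through its unicharset, not through regexes.
_CLASS_MAP = {
    "c": r"[^\W\d_]",  # any alphabetic character
    "d": r"\d",        # digit
    "n": r"[^\W_]",    # alphanumeric (letter or digit)
    "p": r"[^\w\s]",   # punctuation (approximation)
    "a": r"[a-z]",     # lowercase letter (ASCII only here)
    "A": r"[A-Z]",     # uppercase letter (ASCII only here)
}


def pattern_to_regex(pattern):
    """Translate one line of a .patterns file into a compiled regex."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            if nxt == "*":
                # \* means zero or more repetitions of the previous item.
                if not parts:
                    raise ValueError("\\* needs a preceding item")
                parts[-1] += "*"
            elif nxt in _CLASS_MAP:
                parts.append(_CLASS_MAP[nxt])
            else:
                # Unknown escape: treat the character literally.
                parts.append(re.escape(nxt))
            i += 2
        else:
            parts.append(re.escape(ch))
            i += 1
    return re.compile("".join(parts))
```

With this, `pattern_to_regex(r"\A\A\A").fullmatch("ABC")` gives a match, while "AbC" does not. It says nothing about what Tesseract will actually output; it only documents the intent of the pattern.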


The results are the same regardless of whether I set user_patterns_file or not. This brings me to the question: does tesserocr support user (and word) patterns?

My versions:

tesserocr 2.6.2
tesseract 5.3.3
 leptonica-1.83.1
  libpng 1.6.34 : zlib 1.2.11

Thanks a lot for your help and best wishes,
Roman







René JM Clais

Mar 1, 2024, 12:59:37 PM
to tesser...@googlegroups.com
Can you send an example input document, the output of Tesseract, and what you would expect when using the pattern file?

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/767cc60f-5325-43d7-a6ef-9cf879f82950n%40googlegroups.com.

Roman Seidel

Mar 2, 2024, 4:45:41 AM
to tesser...@googlegroups.com
Yes, sure. The input file is a snippet with a capital letter followed by 9 digits. The correct user pattern, according to [1], is:

``\A\d\d\d\d\d\d\d\d\d``

The result of Tesseract (psm 8) is fully correct. Nevertheless, user patterns are not working in the way described above.

For instance, I have tried to extract only the capital letter with user patterns (not with a whitelist), using the pattern:

\A

In this case, Tesseract returns the capital letter and all digits.
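
If the aim is only to pull the capital letter out of an already correct result, one option that does not depend on user patterns at all is to post-process the recognized text. A minimal sketch, assuming the OCR output has the letter-plus-nine-digits shape described above; `CODE_RE` and `split_code` are hypothetical names, not part of tesserocr:

```python
import re

# One capital letter followed by exactly nine digits, e.g. "A123456789".
CODE_RE = re.compile(r"([A-Z])(\d{9})")


def split_code(ocr_text):
    """Return (letter, digits) for a well-formed code, otherwise None."""
    match = CODE_RE.fullmatch(ocr_text.strip())
    if match is None:
        return None
    return match.group(1), match.group(2)
```

Here `split_code("A123456789\n")` returns the letter and the digit string separately, and anything with a different shape yields `None`, so misreads can be flagged instead of silently passed on.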

I've attached my input file and the corresponding Python snippet for reading and processing the image with tesserocr from [2].





deu.patterns
up_test.py
betriebsstaette.png

Zdenko Podobny

Mar 2, 2024, 7:08:43 AM
to tesser...@googlegroups.com
Can you please elaborate on:
"Nevertheless, user patterns are not working in the way described above."


Zdenko


On Sat, Mar 2, 2024 at 10:45, Roman Seidel <roman.s...@gmail.com> wrote:

Roman Seidel

Mar 3, 2024, 5:02:17 PM
to tesser...@googlegroups.com
To be more precise with my questions:

- Is the user-patterns functionality implemented in the tesserocr Python API of Tesseract?
- What exactly is the syntax for specifying user patterns with the tesserocr Python API? Is SetVariable() correct, and how are the path (Linux) and the attribute specified?
- Is there a default path where it looks for the *.patterns / *.user-patterns file?

With the attached code from my last message, I've tested different combinations with and without a whitelist, different attributes and path notations, none of which was successful.

If I use the following notation for user patterns, it has no effect on the results, regardless of the entries in the *.patterns file:
 
api.SetVariable('user_patterns_file', '/home/roman/Dev_d/playground/user_patterns/deu.patterns')
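
One thing that may be worth checking (it is not confirmed anywhere in this thread) is whether `user_patterns_file` is only read while Tesseract initializes. If so, a `SetVariable()` call made after the API has been constructed would come too late. The sketch below passes the variable at construction time instead. It assumes that the installed tesserocr accepts a `variables` dict in the `PyTessBaseAPI` constructor; `build_init_kwargs` and `ocr_with_patterns` are hypothetical helpers, and all paths are placeholders.

```python
def build_init_kwargs(tessdata_dir, lang, patterns_path, psm=8, oem=1):
    """Collect constructor arguments so the patterns file is known at init."""
    return {
        "path": tessdata_dir,
        "lang": lang,
        "psm": psm,
        "oem": oem,
        # Assumption: user_patterns_file is an init-time parameter, so it is
        # handed to the constructor rather than set later via SetVariable().
        "variables": {"user_patterns_file": str(patterns_path)},
    }


def ocr_with_patterns(image_path, **init_kwargs):
    """Run OCR on one image with the given init arguments."""
    # Imported lazily so build_init_kwargs() works without tesserocr installed.
    from tesserocr import PyTessBaseAPI

    with PyTessBaseAPI(**init_kwargs) as api:
        api.SetImageFile(str(image_path))
        return api.GetUTF8Text()
```

Comparing the text returned with and without the `variables` entry on the same image would show whether the patterns file is being picked up at all.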

Has anyone (successfully) used user patterns with the tesserocr Python API of Tesseract?

best wishes and thanks, Roman

Zdenko Podobny

Mar 10, 2024, 12:32:50 PM
to tesser...@googlegroups.com
Maybe I am wrong, but it looks to me like you are expecting from user-patterns something it never promises to provide.
What we know/experienced: 
  • user-patterns extends the Tesseract legacy engine dictionary.
  • putting a word/pattern into the Tesseract legacy engine dictionary never guarantees that the word is recognized correctly (see the remark at https://tesseract-ocr.github.io/tessdoc/APIExample-user_patterns.html)
  • somebody (I cannot find the details as it was a long time ago) ran tests and found that the Tesseract legacy engine dictionary has limited effect. For "non-word" text (like "codes" with mixed letters and digits), people usually turn off the dictionary
  • some users prefer to use the legacy engine for "codes" instead of LSTM
As far as I know, nobody has run tests regarding LSTM and dictionaries, e.g. an investigation of whether user-patterns also affect the LSTM engine (as for LSTM there are new dictionary components lstm-punc-dawg, lstm-word-dawg, lstm-number-dawg) ...
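
As a small illustration of "turning off the dictionary": the Tesseract config variables usually named for this are `load_system_dawg` and `load_freq_dawg`. The sketch below only builds the `-c` arguments for the tesseract command line; `no_dictionary_args` is a hypothetical helper, and whether disabling the dictionaries helps for a given kind of code is something to test, not a given.

```python
def no_dictionary_args():
    """Build tesseract CLI arguments that disable the word dictionaries."""
    # "0" is read as false for boolean Tesseract variables.
    variables = {"load_system_dawg": "0", "load_freq_dawg": "0"}
    args = []
    for name, value in variables.items():
        args += ["-c", f"{name}={value}"]
    return args
```

The returned list can be appended to a command such as `["tesseract", "input.png", "out"]` before handing it to `subprocess.run`.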


Zdenko


On Sun, Mar 3, 2024 at 23:02, Roman Seidel <roman.s...@gmail.com> wrote:

Zdenko Podobny

Mar 12, 2024, 3:39:06 PM
to tesser...@googlegroups.com
One correction:

I checked the example at the URL mentioned below with the Tesseract executable and the tessdata repository. The result is that user patterns also affect LSTM. This can easily be tested by generating output without user patterns (Arial.txt):

tesseract Arial.png Arial

And with patterns:
tesseract Arial.png Arial.pat --user-patterns my.patterns
tesseract Arial.png Arial.pat.oem0 --user-patterns my.patterns --oem 0
tesseract Arial.png Arial.pat.oem1 --user-patterns my.patterns --oem 1
tesseract Arial.png Arial.pat.oem2 --user-patterns my.patterns --oem 2
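
The same comparison can be scripted. The sketch below rebuilds the five command lines above as argument lists and diffs two recognized texts. `build_commands` and `diff_texts` are hypothetical helpers; the file names are the ones from the commands above, and the `--oem 0` and `--oem 2` runs assume traineddata that still contains the legacy model.

```python
import difflib


def build_commands(image="Arial.png", patterns="my.patterns"):
    """Return the baseline command plus the four user-patterns variants."""
    base = ["tesseract", image]
    commands = [base + ["Arial"]]  # baseline, writes Arial.txt
    commands.append(base + ["Arial.pat", "--user-patterns", patterns])
    for oem in (0, 1, 2):
        commands.append(
            base
            + [f"Arial.pat.oem{oem}", "--user-patterns", patterns]
            + ["--oem", str(oem)]
        )
    return commands


def diff_texts(baseline, candidate):
    """Unified diff between two OCR results given as strings."""
    return list(
        difflib.unified_diff(
            baseline.splitlines(),
            candidate.splitlines(),
            "baseline",
            "with-patterns",
            lineterm="",
        )
    )
```

Each command can be run with `subprocess.run(cmd, check=True)`; reading the resulting .txt files and passing their contents to `diff_texts` shows exactly which lines changed once the patterns file is in play. An empty list means the patterns made no difference for that engine mode.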

Zdenko


On Sun, Mar 10, 2024 at 17:32, Zdenko Podobny <zde...@gmail.com> wrote: