How to use user-patterns or Bazaar correctly


Rishika Rupam

Mar 18, 2021, 7:01:26 AM
to tesseract-ocr
Hello,

I understand that questions like these have been asked before, but I still haven't found a satisfactory reply. 

Our goal is to have Tesseract output results that follow a fixed user pattern. We are using Tesseract 4.1.1 with Leptonica.

I have organized this message into 3 distinct but related questions.

1. Passing --user-patterns on the command line, like so:
tesseract --dpi 300 --psm 6 --user-patterns pattern1.user-patterns Fully_Preprocessed/IMG_0632_cropped.png outputpattern1
seems to have no effect at all.

2. Is my pattern syntax correct? I have tried two pattern files, as follows:
\d\d.\d\d.\d\d
\A\A\d\d\d\d\d\d:\d\d
\d\d \d\d\d
and
\d{2}\.\d{2}\.\d{2}
\A{2}\d{6}:\d{2}
\d{2} \d{3}

Neither works.

3. I also can't get Bazaar to work. I have put the bazaar file in configs, with a reference to the user-patterns file, and put the eng.user-patterns file in the tessdata folder. Perhaps I'm not using bazaar properly. Running
sudo tesseract --dpi 300 --psm 6 Fully_Preprocessed/IMG_0632_cropped.png outputbazaar.txt bazaar

gives the error: read_params_file: Can't open /usr/share/tesseract-ocr/4.00/tessdata/configs/bazaar
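
For the "Can't open" error in question 3, one thing that can be ruled out independently of Tesseract is whether the path in the message really resolves to a readable regular file. The following is a minimal sketch using only the Python standard library; the function name describe_config is hypothetical, and the check says nothing about how Tesseract itself searches for config files.

```python
import os

def describe_config(path):
    """Return a short diagnosis of why a file at `path` might not open."""
    # A symlink whose target is gone "exists" in a directory listing
    # but cannot be opened.
    if os.path.islink(path) and not os.path.exists(path):
        return "broken symlink"
    if not os.path.exists(path):
        parent = os.path.dirname(path)
        if not os.path.isdir(parent):
            return "missing directory: " + parent
        return "missing file"
    if os.path.isdir(path):
        return "is a directory"
    if not os.access(path, os.R_OK):
        return "not readable"
    return "ok"

# Path copied verbatim from the error message in this thread.
print(describe_config("/usr/share/tesseract-ocr/4.00/tessdata/configs/bazaar"))
```

If this prints "ok", the file itself is probably fine and the cause lies elsewhere, for example in which tessdata directory the installed binary actually uses.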

I would greatly appreciate any help or advice with any of these questions. 
Thank you very much.

Regards,
Rishika

Rishika Rupam

Mar 19, 2021, 4:20:34 AM
to tesseract-ocr
Hello again,

I have read here that the user-patterns file should contain one pattern per line and it's best not to have more than one line.
In our case, however, the image itself contains 3 lines, as in the attached image.
So my user-patterns file is:
\d\d.\d\d.\d\d
\A\A\d\d\d\d\d\d:\d\d
\d\d \d\d\d
As discussed above, the user-patterns parameter doesn't seem to have any effect. I'm now wondering whether the multiple lines are the problem.
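
The pattern files in this thread can be sanity-checked before they are handed to Tesseract. The sketch below assumes the escape set described in the comments of Tesseract's trie.h as I understand it (\c, \d, \n, \p, \a, \A, \* and an escaped backslash); please verify that set against the source of your version. The names KNOWN_ESCAPES and check_pattern are hypothetical. Under that assumption, regex-style constructs such as \. and {2} from the second file are not part of the syntax.

```python
import re

# Assumed escape set, based on the comments in Tesseract's trie.h.
KNOWN_ESCAPES = set("cdnpaA*\\")

def check_pattern(line):
    """Return a list of warnings for one line of a user-patterns file."""
    warnings = []
    i = 0
    while i < len(line):
        if line[i] == "\\":
            if i + 1 >= len(line):
                warnings.append("trailing backslash")
                break
            if line[i + 1] not in KNOWN_ESCAPES:
                warnings.append("unknown escape \\" + line[i + 1])
            i += 2  # skip the backslash and the character it escapes
        else:
            i += 1  # any other character is treated as a literal
    # Regex-style counts are not among the assumed escapes.
    if re.search(r"\{\d+\}", line):
        warnings.append("brace quantifier is not a documented escape")
    return warnings

for pattern in [r"\d\d.\d\d.\d\d", r"\d{2}\.\d{2}\.\d{2}"]:
    print(pattern, check_pattern(pattern))
```

A clean result only means the file uses known escapes. It does not show that the patterns influence recognition, which may depend on the engine mode in use.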
 
IMG_0684_cropped.jpg

Shree Devi Kumar

Mar 19, 2021, 4:25:17 AM
to tesseract-ocr

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/cde221fb-1c4a-4d0e-8529-d8ae17da3813n%40googlegroups.com.

Rishika Rupam

Mar 19, 2021, 4:51:07 AM
to tesser...@googlegroups.com
Hi Shree,

Thank you for your response and the links. 
I've already gone through the thread and implemented its suggestions, namely:
1. Resized images (with mixed results)
The result I'm currently getting is:
43.09.21
AASI1802:98
2 000
I'm hoping to convert at least some characters (numbers misread as letters and vice versa) using user-patterns.

2. Tried to get Bazaar to work. I have put the bazaar file in configs, with a reference to the user-patterns file, and put the eng.user-patterns file in the tessdata folder. Perhaps I'm not using bazaar properly. Running
sudo tesseract --dpi 300 --psm 6 IMG_0632_cropped.png outputbazaar.txt bazaar

gives the error: read_params_file: Can't open /usr/share/tesseract-ocr/4.00/tessdata/configs/bazaar

I'm not sure why Tesseract is unable to open the bazaar file when I can see that it clearly exists. I have even tried it using sudo su, but I get the same error.
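
Regarding the misread characters in the result above (for example AASI1802:98, where digits are expected after the first two letters), one option that does not depend on user-patterns at all is to correct the text after OCR. The following is a minimal sketch of that post-processing idea, not a Tesseract feature. The template notation ('#' for a digit, '@' for an uppercase letter, anything else literal), the confusion tables, and the name fix_line are all hypothetical.

```python
# Hypothetical look-alike tables; extend them for your own images and fonts.
TO_DIGIT = {"O": "0", "I": "1", "l": "1", "S": "5", "B": "8", "Z": "2"}
TO_UPPER = {"0": "O", "1": "I", "5": "S", "8": "B", "2": "Z"}

def fix_line(text, template):
    """Coerce `text` to `template`; return None if it cannot be made to fit.

    In the template, '#' is a digit slot, '@' is an uppercase-letter slot,
    and every other character must match literally.
    """
    if len(text) != len(template):
        return None  # a dropped or extra character cannot be repaired here
    out = []
    for ch, slot in zip(text, template):
        if slot == "#":
            ch = TO_DIGIT.get(ch, ch)
            if not ("0" <= ch <= "9"):
                return None
        elif slot == "@":
            ch = TO_UPPER.get(ch, ch)
            if not ("A" <= ch <= "Z"):
                return None
        elif ch != slot:
            return None
        out.append(ch)
    return "".join(out)

# OCR lines quoted in this thread, with templates for the three formats.
print(fix_line("43.09.21", "##.##.##"))
print(fix_line("AASI1802:98", "@@######:##"))
print(fix_line("2 000", "## ###"))
```

This only swaps look-alike characters position by position. It cannot tell whether a digit that already fits its slot is the right digit, and it gives up on lines whose length differs from the template, such as the third line here.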


Lastly, thanks for the link, but I'm not sure what you are referring to: https://tesseract-ocr.github.io/tessdoc/Planning.html#regression_of_features_from_3.0x

Thank you very much for your help.

Regards,
R




Rishika RUPAM
Data and AI Research Engineer
Tilkal | LinkedIn | Github | Medium      

BETTER SUPPLY CHAINS. BETTER PRODUCTS. BETTER FUTURE.
