Bazaar & eng.user-patterns not doing anything?


phillip...@gmail.com

Apr 26, 2019, 5:53:44 AM
to tesseract-ocr
Hi,

I created a bazaar config file (attached) with load_system_dawg and load_freq_dawg set to F. I also want to use user patterns, so I set user_patterns_suffix as well.
load_system_dawg F
load_freq_dawg F
user_patterns_suffix user-patterns

In the same directory, I also have the user-pattern file:
\d\d\d\d\c
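For reference, here is a rough sketch of what this pattern is meant to accept, expressed as a Python regex. The class letters follow the user-pattern syntax documented in Tesseract's trie.h (\d digit, \c alphabetic character, \n digit or alphabetic, \a lower-case, \A upper-case, \p punctuation). The helper name pattern_to_regex is hypothetical, the \* repetition marker is left out, and Python's character classes only approximate Tesseract's own unicharset properties, so this is a sanity check for the pattern file, not a model of what the engine does internally.

```python
import re

# Approximate regex equivalents of Tesseract's user-pattern classes.
# Assumption: Python character classes stand in for Tesseract's
# unicharset properties, so edge cases may differ.
CLASS_MAP = {
    "d": r"\d",        # digit
    "c": r"[^\W\d_]",  # alphabetic character
    "n": r"[^\W_]",    # digit or alphabetic character
    "a": r"[a-z]",     # lower-case letter (ASCII-only approximation)
    "A": r"[A-Z]",     # upper-case letter (ASCII-only approximation)
    "p": r"[^\w\s]",   # punctuation (approximation)
}


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate one user-pattern line into a compiled regex (hypothetical helper)."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            # A backslash must be followed by a supported class letter.
            nxt = pattern[i + 1] if i + 1 < len(pattern) else ""
            if nxt not in CLASS_MAP:
                raise ValueError(f"unsupported escape: \\{nxt}")
            parts.append(CLASS_MAP[nxt])
            i += 2
        else:
            # Any other character is matched literally.
            parts.append(re.escape(ch))
            i += 1
    return re.compile("".join(parts))


rx = pattern_to_regex(r"\d\d\d\d\c")
print(bool(rx.fullmatch("1880A")))  # True: four digits followed by a letter
print(bool(rx.fullmatch("ISON")))   # False: the misread output does not fit
```

So the pattern itself describes "1880A" correctly; the question is whether Tesseract actually loads and applies it.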

So the structure looks like:
./bazaar
./eng.user-patterns
./ocr_inv.png
 
But with these settings Tesseract still fails to recognise the image correctly as "1880A". If I run tesseract without any extra options, the output is the same.

Commands used:
tesseract ocr_inv.png stdout
tesseract ocr_inv.png stdout bazaar
tesseract ocr_inv.png stdout --user-patterns eng.user-patterns bazaar

Output:
Warning. Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 1128
ISON

Can anyone tell me if this is expected behaviour?
Tesseract version:
tesseract 4.0.0-beta.1
 leptonica-1.75.3
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0

 Found AVX2
 Found AVX
 Found SSE


bazaar
eng.user-patterns
ocr_inv.png

Zdenko Podobny

Apr 26, 2019, 6:03:55 AM
to tesser...@googlegroups.com
First of all: this is a very old Tesseract version. Try the recent code, which contains a lot of bug fixes (I am not sure whether any of them affect bazaar/user patterns, too).

Zdenko


On Fri, 26 Apr 2019 at 11:53, <phillip...@gmail.com> wrote:
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/10d1ed21-d30e-4eaf-8f2a-6fdf74a6a7d1%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

JB Data31

Apr 26, 2019, 10:19:12 AM
to tesser...@googlegroups.com
Image processing can improve the result, but this typeface is very particular: the digits are "unconnected".
I tried morphological transformations to reconnect the digits. The result is better (tesseract ocr_inv-8.png ocr-8 --psm 6), but still far from correct.
In my opinion, training Tesseract on this typeface is one way to go.
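The post does not say which morphological operation or tool was used. A common choice for reconnecting broken strokes is a closing (dilation followed by erosion). Below is a minimal pure-Python sketch of that idea, assuming a binary image stored as a list of rows with 1 for ink and 0 for background; the function names and the 3x3 square neighbourhood are illustrative assumptions, and a real pipeline would more likely use an image library.

```python
def _window(img, y, x, r):
    """Yield the pixels in the (2r+1) x (2r+1) square around (y, x).

    Pixels outside the image are treated as background (0).
    """
    h, w = len(img), len(img[0])
    for yy in range(y - r, y + r + 1):
        for xx in range(x - r, x + r + 1):
            yield img[yy][xx] if 0 <= yy < h and 0 <= xx < w else 0


def dilate(img, r=1):
    """A pixel becomes ink if any pixel in its neighbourhood is ink."""
    return [[int(any(_window(img, y, x, r))) for x in range(len(img[0]))]
            for y in range(len(img))]


def erode(img, r=1):
    """A pixel stays ink only if every pixel in its neighbourhood is ink."""
    return [[int(all(_window(img, y, x, r))) for x in range(len(img[0]))]
            for y in range(len(img))]


def close_gaps(img, r=1):
    """Morphological closing: dilate to bridge small gaps, then erode back."""
    return erode(dilate(img, r), r)


# A horizontal stroke with a one-pixel gap in the middle.
broken = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
for row in close_gaps(broken):
    print(row)
```

On this toy input the gap in the middle row is filled while the stroke keeps its original thickness. On a real scan the radius has to be tuned: too small and the strokes stay broken, too large and neighbouring characters merge.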

@JBΔ

ocr-8.txt
ocr_inv-9-0.png
ocr_inv-8-1.jpg
ocr_inv-8.png

phillip...@gmail.com

Apr 28, 2019, 10:45:03 PM
to tesseract-ocr
I tried this version:
tesseract 4.0.0
 leptonica-1.75.3
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0

 Found AVX2
 Found AVX
 Found SSE

It still gives me the same outputs with or without the user-pattern and config files.

phillip...@gmail.com

Apr 28, 2019, 10:46:35 PM
to tesseract-ocr
Did you have any luck without training, but including the bazaar config file and user-patterns txt file?

phillip...@gmail.com

Apr 29, 2019, 12:37:46 AM
to tesseract-ocr
UPDATES:

I trained a traineddata file (Karton) and put it in tessdata. The output is better now, but the user-patterns file still does not seem to be working.

Command:
tesseract ocr_inv.png ocr_inv -l Karton --user-patterns Karton.user-patterns Karfig

Karfig
ocr_inv.png
ocr_inv.txt
Karton.traineddata

phillip...@gmail.com

Apr 29, 2019, 6:10:17 AM
to tesseract-ocr
UPDATES:

It works fine now. All I did was git clone the latest master branch and build from source. After that, both the user-patterns file and the config file worked.