Help processing this tiny image with a date

Juanjo Gómez Navarro

May 11, 2021, 5:19:46 AM
to tesseract-ocr
Good morning, I'm trying to use Tesseract to read dates in image files. The problem I have is that the image is rather small. This is the cropped image with the date I have to process:


test-raw.jpg

After some processing with Scikit-Image (rescaling, adding a white border, erosion and binarising) I get this image:

processed.png

To me it reads pretty well. Still, Tesseract returns "» MAY 2021"; the "5" is missing.

How can I process the image to get the desired output, i.e. "5 MAY 2021"?

I'm using tesseract 4.1.1 with pytesseract.
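
A minimal sketch of the preprocessing steps described above (rescaling, white border, erosion, binarising) with Scikit-Image might look like the following. The function name, the inversion step, the scale factor, the border width and the use of Otsu thresholding are assumptions for illustration, not the exact script behind processed.png.

```python
import numpy as np
from skimage.filters import threshold_otsu
from skimage.morphology import erosion
from skimage.transform import rescale
from skimage.util import img_as_float


def preprocess_for_ocr(gray, scale=4, border=20):
    """Return a binarised uint8 image with dark text on a white background.

    `gray` is assumed to be a 2-D grayscale array with light text on a dark
    background, like the raw crop. `scale` and `border` are illustrative
    defaults, not values taken from the original post.
    """
    img = img_as_float(gray)                 # floats in [0, 1]
    img = 1.0 - img                          # assumption: invert to dark text on light
    img = rescale(img, scale, order=3, anti_aliasing=False)  # bicubic upscaling
    img = np.clip(img, 0.0, 1.0)             # guard against interpolation overshoot
    img = np.pad(img, border, mode="constant", constant_values=1.0)  # white border
    img = erosion(img)                       # grayscale erosion thickens dark strokes
    binary = img > threshold_otsu(img)       # True where the pixel is background
    return (binary * 255).astype(np.uint8)
```

The input could be loaded with `skimage.io.imread("test-raw.jpg", as_gray=True)` and the result written out with `skimage.io.imsave` before handing it to Tesseract.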

Zdenko Podobny

May 15, 2021, 3:29:02 AM
to tesser...@googlegroups.com
> tesseract -v
tesseract 5.0.0-alpha-20210401-66-g91b2b4
 leptonica-1.81.0 (Apr 16 2021, 16:18:45) [MSC v.1928 LIB Release x64]
  libgif 5.2.1 : libjpeg 6b (libjpeg-turbo 2.0.91) : libpng 1.6.37 : libtiff 4.2.0 : zlib 1.2.11 : libwebp 1.2.0 : libopenjp2 2.4.0
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1
 Found libarchive 3.5.1 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.6 libzstd/1.4.9

> echo %TESSDATA_PREFIX%
t:\Project-Personal\tessdata_best\tessdata

> tesseract 5_may_2021.jpg - --psm 7 -l eng
5 MAY 2021


Here 5_may_2021.jpg is your first image, test-raw.jpg (white text on black).
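
Since the question was asked from Python, the command above can be reproduced from a script. The sketch below uses only the standard library to call the `tesseract` executable with the same options; the function names are hypothetical, and it assumes `tesseract` is on the PATH and that the language data is found (for example via TESSDATA_PREFIX).

```python
import subprocess


def build_tesseract_command(image_path, psm=7, lang="eng"):
    """Build the argument list for: tesseract IMAGE - --psm PSM -l LANG."""
    # "-" as the output base makes tesseract print the text to stdout.
    return ["tesseract", str(image_path), "-", "--psm", str(psm), "-l", lang]


def read_single_line(image_path, psm=7, lang="eng"):
    """Run tesseract on the image and return the recognised text."""
    result = subprocess.run(
        build_tesseract_command(image_path, psm, lang),
        capture_output=True,  # collect stdout and stderr
        text=True,            # decode output as str
        check=True,           # raise CalledProcessError on failure
    )
    return result.stdout.strip()
```

Page segmentation mode 7 tells Tesseract to treat the image as a single text line, which matches a cropped date. With pytesseract the same options would be passed as `pytesseract.image_to_string(image, lang="eng", config="--psm 7")`.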

Zdenko


On Tue, May 11, 2021 at 11:19, Juanjo Gómez Navarro <juanjo.go...@gmail.com> wrote: