Get more accurate results from specific image pattern

kacper

Feb 13, 2023, 9:56:21 AM
to tesseract-ocr
Hello. I'm trying to extract text from images that all share the same layout. I have tried everything from the "Improving the quality..." docs. It helps a little, but a mandatory piece of data is still missing. Example PNG files:

image.png, image2.png

I need to extract the name of each item (for example, "SKITTERING") and the amount, which is the first value on the left of each row. I removed transparency, resized the image (4x), and tried different page segmentation modes. I also know all possible names in advance, so I created my own dictionary.

The best result I can get is this one:
Screenshot from 2023-02-13 15-15-23.png

It looks like words from the dictionary are recognized easily, but there are still problems with the amounts and the formatting, perhaps because of the table layout. Before I try PyTesseract / OpenCV, I wanted to ask here: can I do anything more to get what I need using just Tesseract?

Zdenko Podobny

Feb 20, 2023, 1:27:35 PM
to tesser...@googlegroups.com
So you tried all the easy parts and left the difficult parts to the forum :-)

First of all: yes, this is a table problem, so you need to do the page segmentation yourself before OCR. Tesseract is an OCR engine. It can handle simple page segmentation, such as scanned book pages, but for complex layouts you need to do the layout segmentation with something else.

Next, there are plenty of graphics. You will need to get rid of them (i.e. not OCR them with Tesseract).
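One way to read "segment first, OCR only the text" is to cut each text cell out of the screenshot beforehand and run Tesseract once per crop. The sketch below only builds and runs the command line; it assumes the crops already exist as files (the file names are hypothetical), that `tesseract` is on the PATH, and that the amounts consist of digits only, which is a guess about the images and not something confirmed in this thread.

```python
import subprocess


def tesseract_cmd(image_path, psm=7, whitelist=None, lang="eng"):
    """Build a tesseract command that prints the recognized text to stdout.

    psm=7 tells Tesseract to treat the image as a single text line,
    which suits a crop that contains one table cell.
    """
    cmd = ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]
    if whitelist:
        # Restrict recognition to the given characters (e.g. digits for amounts).
        cmd += ["-c", f"tessedit_char_whitelist={whitelist}"]
    return cmd


def ocr_crop(image_path, **kwargs):
    """Run tesseract on one cropped cell and return the stripped text."""
    result = subprocess.run(
        tesseract_cmd(image_path, **kwargs),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


# Example (hypothetical crop files, one per cell):
#   name = ocr_crop("row1_name.png")
#   amount = ocr_crop("row1_amount.png", whitelist="0123456789")
```

Running one crop per cell keeps the icons and table borders out of the input entirely, at the cost of one Tesseract call per cell.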

If the text positions are stable, you can create and use a uzn file (search the forum) to OCR just the text areas.
If the text positions change, one solution is to detect the position of an expected part of the image, like the "x", and calculate the text positions from it.
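Both cases can be sketched together, under a few assumptions. A uzn file is plain text with one zone per line in the form `left top width height label`, in pixels from the top-left corner. All coordinates and labels below are made up for illustration and would have to be measured on a real screenshot. Detecting the anchor (for example by template matching) is not shown; the code only takes its position as input and shifts the zones accordingly.

```python
from typing import NamedTuple


class Zone(NamedTuple):
    """A rectangle in image pixels, origin at the top-left corner."""
    left: int
    top: int
    width: int
    height: int
    label: str  # single token without spaces


def shift_zones(zones, anchor_xy, reference_xy):
    """Move zones by the offset between the anchor found in this image
    and the anchor position in the reference image they were measured on."""
    dx = anchor_xy[0] - reference_xy[0]
    dy = anchor_xy[1] - reference_xy[1]
    return [z._replace(left=z.left + dx, top=z.top + dy) for z in zones]


def to_uzn(zones):
    """Render zones as uzn text: one 'left top width height label' per line."""
    return "".join(
        f"{z.left} {z.top} {z.width} {z.height} {z.label}\n" for z in zones
    )


# Hypothetical layout measured once on a reference screenshot.
REFERENCE_ANCHOR = (120, 40)
REFERENCE_ZONES = [
    Zone(20, 40, 80, 30, "amount"),
    Zone(150, 40, 300, 30, "name"),
]

if __name__ == "__main__":
    # Stable layout: use the reference zones as they are.
    print(to_uzn(REFERENCE_ZONES), end="")
    # Shifting layout: the anchor was found at (125, 90) in this image.
    moved = shift_zones(REFERENCE_ZONES, (125, 90), REFERENCE_ANCHOR)
    print(to_uzn(moved), end="")
```

The output would be saved next to the image with the same base name (`image.uzn` for `image.png`). As far as I know, Tesseract only picks the file up with a page segmentation mode that does not do its own column finding, such as `--psm 4`, so that is worth checking against your version.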

Or try a text detection tool like OpenCV’s EAST text detector [1] or YOLO...

Zdenko


On Mon, Feb 13, 2023 at 15:56, kacper <kacper.c...@gmail.com> wrote: