bad quality!?


Cyrus Yip

Dec 29, 2021, 12:21:38 PM
to tesseract-ocr
Here is an example of an image I would like to use OCR on:
drop8.png
I would like the results to be like:
["Naruto Uzumaki Naruto", "Mai Sakurajima Rascal Does Not Dream of Bunny Girl Senpai", "Keqing Genshin Impact"]

Right now I'm using:

from PIL import Image
import pytesseract

im = Image.open("drop8.png")
# crop the two text bands and paste them into one image before OCR
region1 = im.crop((0, 55, im.width, 110))
region2 = im.crop((0, 312, im.width, 360))
image = Image.new("RGB", (im.width, region1.height + region2.height + 20))
image.paste(region1)
image.paste(region2, (0, region1.height + 20))
results = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)



The processed image looks like:
hi.png
but I'm getting results like:
[' ', '»MaiSakurajima¥RascalDoesNotDreamofBunnyGirlSenpai', 'iGenshinImpact']

How do I optimize the image/configs so the OCR is more accurate?

Thank you.

Zdenko Podobny

Dec 29, 2021, 12:46:13 PM
to tesser...@googlegroups.com
If you properly crop the text areas, you get good output. E.g.

r_cropped.png

> tesseract r_cropped.png - --dpi 300

Rascal Does Not Dream
of Bunny Girl Senpai
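
The same idea in Python with pytesseract, for reference (just a rough sketch; the crop coordinates are the ones from your own code above and drop8.png is your example image):

from PIL import Image
import pytesseract

im = Image.open("drop8.png")
# crop one text band and OCR only that area
name_band = im.crop((0, 55, im.width, 110))
print(pytesseract.image_to_string(name_band, config="--dpi 300"))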

Zdenko


On Wed, 29 Dec 2021 at 18:21, Cyrus Yip <cyrus...@gmail.com> wrote:

Cyrus Yip

Dec 29, 2021, 12:58:15 PM
to tesseract-ocr
I played around a bit with replacing all colours except for the text colour, and it works pretty well!

The only thing is replacing colours with:
im = im.convert("RGB")
pixdata = im.load()
# turn every pixel that is not the text colour (51, 51, 51) white
for y in range(im.height):
    for x in range(im.width):
        if pixdata[x, y] != (51, 51, 51):
            pixdata[x, y] = (255, 255, 255)

is a bit slow. Do you know a better way to replace pixels in Python? I don't know if this is off topic.

Zdenko Podobny

Dec 29, 2021, 1:15:26 PM
to tesser...@googlegroups.com
IMO, if the text is always in the same area, cropping and OCRing just that area will be faster.

Zdenko


On Wed, 29 Dec 2021 at 18:58, Cyrus Yip <cyrus...@gmail.com> wrote:

Cyrus Yip

Dec 29, 2021, 3:18:37 PM
to tesseract-ocr
But won't multiple OCR passes and crops use a lot of time?

Zdenko Podobny

Dec 30, 2021, 5:19:52 AM
to tesser...@googlegroups.com
Just make your own tests ;-)

You can use tesserocr instead of pytesseract (the installation may be quite difficult if you are on Windows); e.g. you can initialize the tesseract API once and use it multiple times. But it does not provide DICT output.
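
A minimal sketch of the "initialize once, reuse" idea with tesserocr (the file names are only placeholders):

from tesserocr import PyTessBaseAPI

with PyTessBaseAPI() as api:                      # tesseract is initialized once here
    for filename in ["card1.png", "card2.png"]:   # placeholder file names
        api.SetImageFile(filename)                # ...and reused for every image
        print(api.GetUTF8Text())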


Zdenko


On Wed, 29 Dec 2021 at 21:18, Cyrus Yip <cyrus...@gmail.com> wrote:

Zdenko Podobny

Dec 30, 2021, 11:46:51 AM
to tesser...@googlegroups.com
OK. I played a little bit ;-):

I tested the speed of your code with your image:

import timeit

pil_color_replace = """
from PIL import Image

im = Image.open('mai.png').convert("RGB")

pixdata = im.load()
for y in range(im.height):
    for x in range(im.width):
        if pixdata[x, y] != (51, 51, 51):
            pixdata[x, y] = (255, 255, 255)
"""

elapsed_time = timeit.timeit(pil_color_replace, number=100)/100
print(f"duration: {elapsed_time:.4} seconds")


I got an average time of 0.08547 seconds on my computer.
On the internet I found a suggestion to use numpy for this, and I ended up with the following code:

np_color_replace_rgb = """
import numpy as np
from PIL import Image

data = np.array(Image.open('mai.png').convert("RGB"))
mask = (data == [51, 51, 51]).all(-1)
img = Image.fromarray(np.invert(mask))
"""

elapsed_time = timeit.timeit(np_color_replace_rgb, number=100)/100
print(f"duration: {elapsed_time:.4} seconds")


I got an average time of 0.01774 seconds, i.e. about 4.8× faster than the PIL code.
It is a little bit of cheating, as it does not actually replace colors; it just takes a mask of the target color and returns it as a binarized image, which is exactly what you need for OCR ;-)
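
For example, the resulting img can be passed straight to pytesseract (a sketch, reusing the variables from the snippet above):

import pytesseract

# img is already black text on a white background
results = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)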

Also, I would like to point out that the resulting OCR output is not perfect (compared to OCR of the unmodified text areas), as this kind of binarization is very simple.


Zdenko


On Thu, 30 Dec 2021 at 11:19, Zdenko Podobny <zde...@gmail.com> wrote:

Cyrus Yip

Dec 30, 2021, 12:14:40 PM
to tesseract-ocr
I also tried many things like cropping, colour changing, colour replacing, and mixing them together.

I landed on checking whether a pixel is not one of these colours:

[(51, 51, 51), (69, 69, 65), (65, 64, 60), (59, 58, 56), (67, 66, 62), (67, 67, 63), (67, 67, 62), (53, 53, 53), (54, 54, 53), (61, 61, 58), (62, 62, 60), (55, 55, 54), (59, 59, 57), (56, 56, 55)]

and, if so, replacing it with white. It is pretty accurate, but is there a way to do this with numpy arrays?

(code)
for y in range(im.height):
    for x in range(im.width):
        if pixels[x, y] not in [(51, 51, 51), (69, 69, 65), (65, 64, 60), (59, 58, 56), (67, 66, 62), (67, 67, 63), (67, 67, 62), (53, 53, 53), (54, 54, 53), (61, 61, 58), (62, 62, 60), (55, 55, 54), (59, 59, 57), (56, 56, 55)]:
            pixels[x, y] = (255, 255, 255)

Zdenko Podobny

Dec 30, 2021, 2:43:01 PM
to tesser...@googlegroups.com
try this:

import numpy as np
from PIL import Image

filter_colors = [(51, 51, 51), (69, 69, 65), (65, 64, 60), (59, 58, 56), (67, 66, 62),
                 (67, 67, 63), (67, 67, 62), (53, 53, 53), (54, 54, 53), (61, 61, 58),
                 (62, 62, 60), (55, 55, 54), (59, 59, 57), (56, 56, 55)]
image = np.array(Image.open('mai.png').convert("RGB"))
mask = np.isin(image, filter_colors, invert=True)
img = Image.fromarray(mask.any(axis=2))



Zdenko


On Thu, 30 Dec 2021 at 18:14, Cyrus Yip <cyrus...@gmail.com> wrote:

Cyrus Yip

Dec 30, 2021, 7:45:37 PM
to tesseract-ocr
For some reason, using the numpy array gives a different result from mine.

Numpy array:

hi.png
Loop through pixels:
hi.png
The second way is more accurate but way slower.

Cyrus Yip

Dec 30, 2021, 7:46:40 PM
to tesseract-ocr
(original image)
drop11.png

Zdenko Podobny

Dec 31, 2021, 6:18:18 AM
to tesser...@googlegroups.com
You are right - np.isin works differently than I expected (it does not match whole tuples, but individual values within the tuples), and by coincidence it produces results similar to your code.
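
A tiny example of what that means (illustration only, not from the code above):

import numpy as np

# this pixel is not one of the filter colors as a whole tuple...
pixel = np.array([[[51, 64, 60]]])
# ...but np.isin still reports True for every channel, because it
# flattens the color list and matches each value independently
print(np.isin(pixel, [(51, 51, 51), (65, 64, 60)]))
# -> [[[ True  True  True]]]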

Here is updated code that produces the same result as the PIL version. It is faster, but it will get slower as the number of colors in filter_colors increases.

filter_colors = [(51, 51, 51), (69, 69, 65), (65, 64, 60), (59, 58, 56), (67, 66, 62),
                 (67, 67, 63), (67, 67, 62), (53, 53, 53), (54, 54, 53), (61, 61, 58),
                 (62, 62, 60), (55, 55, 54), (59, 59, 57), (56, 56, 55)]

image = np.array(Image.open('mai.png').convert("RGB"))
mask = np.array([], dtype=bool)
for color in filter_colors:
    if mask.size == 0:
        mask = (image == color).all(-1)
    else:
        mask = mask | (image == color).all(-1)
img = Image.fromarray(~mask)



Zdenko


On Fri, 31 Dec 2021 at 01:45, Cyrus Yip <cyrus...@gmail.com> wrote:

Cyrus Yip

Dec 31, 2021, 1:27:41 PM
to tesseract-ocr
Right now I'm installing Tesseract 4 in Docker with
RUN apt-get install -y tesseract-ocr
That might be the reason why it's way slower than on my computer. How can I install Tesseract 5?

Dockerfile:
# syntax=docker/dockerfile:1

ARG TOKEN

FROM python:3.8-slim-buster

RUN apt-get update
RUN apt-get install -y software-properties-common
RUN apt-get update
RUN add-apt-repository ppa:alex-p/tesseract-ocr-devel

RUN apt-get update
RUN apt-get install -y build-essential

COPY requirements.txt requirements.txt
RUN pip3 install -r requirements.txt

COPY . .

RUN apt-get install -y tesseract

CMD ["python3", "bot.py"]


Cyrus Yip

Jan 1, 2022, 1:49:44 PM
to tesseract-ocr
I managed to install Tesseract 5, but the numpy mask doesn't work now.
It makes pictures like:
image.png
not:
image.png


Dockerfile:
# syntax=docker/dockerfile:1
ARG TOKEN
FROM ubuntu:18.04
RUN apt-get update
RUN apt-get install -y software-properties-common
RUN apt-get install -y python3.8
RUN apt-get install -y python3-pip
RUN apt-get update
RUN apt-get install -y build-essential
RUN apt-get install -y python3-pil
COPY requirements.txt requirements.txt
RUN pip3 install -r requirements.txt
RUN apt-get update
RUN add-apt-repository ppa:alex-p/tesseract-ocr5
RUN apt-get update
RUN apt-get install -y tesseract-ocr
COPY . .
CMD ["python3", "bot.py"]

Zdenko Podobny

Jan 1, 2022, 2:41:27 PM
to tesser...@googlegroups.com
What is your code? Does it work on your local computer?

BTW, here is numpy code that is proven to work:

import numpy as np
from PIL import Image

filter_colors = [(51, 51, 51), (69, 69, 65), (65, 64, 60), (59, 58, 56), (67, 66, 62),
                 (67, 67, 63), (67, 67, 62), (53, 53, 53), (54, 54, 53), (61, 61, 58),
                 (62, 62, 60), (55, 55, 54), (59, 59, 57), (56, 56, 55)]

image = np.array(Image.open('mina.png').convert("RGB"))

# compare every pixel against every filter color in one broadcasted operation
*A, B = image.shape
mask = (image.reshape((-1, B)) == np.array(filter_colors)[:, None]).all(-1).any(0).reshape(A)
img = Image.fromarray(~mask)



Zdenko


On Sat, 1 Jan 2022 at 19:49, Cyrus Yip <cyrus...@gmail.com> wrote:

Zdenko Podobny

Jan 1, 2022, 3:29:34 PM
to tesser...@googlegroups.com
And here is an OpenCV version with, IMO, better quality:


import cv2
import numpy as np
from PIL import Image

data = cv2.imread("mina.png")
# mask of pixels that exactly match the text colour
mask_text = cv2.inRange(data, (51, 51, 51), (51, 51, 51))

# Morph open to remove noise
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
morph = cv2.morphologyEx(mask_text, cv2.MORPH_OPEN, kernel, iterations=1)

# dilate the text mask so it covers whole words
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (9, 4))
dilate = cv2.dilate(morph, kernel, iterations=4)

# Otsu-threshold the whole image and keep only the masked text regions
thresh = cv2.threshold(cv2.cvtColor(data, cv2.COLOR_BGR2GRAY),
                       0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
image_final = cv2.bitwise_and(thresh, thresh, mask=dilate)
# replace background with white
mask1 = np.zeros((image_final.shape[0] + 2, image_final.shape[1] + 2), np.uint8)
cv2.floodFill(image_final, mask1, (0, 0), 255)

display(Image.fromarray(image_final))  # display() as in a Jupyter notebook


image.png


Zdenko


On Sat, 1 Jan 2022 at 20:40, Zdenko Podobny <zde...@gmail.com> wrote:

Cyrus Yip

Jan 2, 2022, 3:00:48 PM
to tesseract-ocr
I tried the OpenCV version, but it fails with images like these:
drop12.png hi.png


Zdenko Podobny

Jan 2, 2022, 4:10:45 PM
to tesser...@googlegroups.com
All images you presented have the same size and the text is always in the same regions.
So you can create a mask for these regions and apply it to the thresholded input images. This could give you extra speed as you do not need to create a mask for each image individually...
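
A rough sketch of that idea (the band coordinates are the ones used earlier in the thread; the helper name and sample file are just examples):

import cv2
import numpy as np

# all drop images share the same dimensions, so the region mask can be
# built once from any sample image and then reused for every card drop
sample = cv2.imread("drop12.png", cv2.IMREAD_GRAYSCALE)
region_mask = np.zeros(sample.shape, np.uint8)
for (x1, y1, x2, y2) in [(0, 58, sample.shape[1], 110), (0, 312, sample.shape[1], 360)]:
    cv2.rectangle(region_mask, (x1, y1), (x2, y2), 255, -1)

def extract_text_bands(path):
    # per image: threshold, then keep only the known text bands
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
    return cv2.bitwise_and(thresh, thresh, mask=region_mask)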

Zdenko


On Sun, 2 Jan 2022 at 21:01, Cyrus Yip <cyrus...@gmail.com> wrote:

Cyrus Yip

Jan 2, 2022, 6:08:42 PM
to tesseract-ocr
OK, I will look into how to do that. But do you have an idea why some of the letters go missing?

Zdenko Podobny

Jan 3, 2022, 1:50:42 PM
to tesser...@googlegroups.com
Increase the second kernel dimension in getStructuringElement from 4 to 5 when creating the mask:

kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (9, 5))


Zdenko


On Mon, 3 Jan 2022 at 00:08, Cyrus Yip <cyrus...@gmail.com> wrote:

Cyrus Yip

Jan 3, 2022, 8:07:00 PM
to tesseract-ocr
For this image
drop12.png
it still fails to get the text from the bottom-right cards:
['MasumiMushishiZokuShou', 'TamaoHino*Eyeshield21', "DiegoBrando~'sBizarreAdi:tocolBalRan", '']

Cyrus Yip

Jan 3, 2022, 9:06:29 PM
to tesseract-ocr
I tried learning some OpenCV and doing the mask thing:
boxes = [
    (45, 0, 245, im.height),
    (320, 0, 515, im.height),
    (600, 0, 785, im.height),
]
if im.width > 1000:
    boxes.append(
       (865, 0, 1065, im.height)
    )
mask = np.zeros(data.shape[:2], np.uint8)

for box in boxes:
    cv2.rectangle(mask, (box[0], box[1]), (box[2], box[3]), 255, -1)

mask2 = np.zeros(data.shape[:2], np.uint8)
boxes = [
    (0, 58, im.width, 110),
    (0, 312, im.width, 360)
]
for box in boxes:
    cv2.rectangle(mask2, (box[0], box[1]), (box[2], box[3]), 255, -1)

mask = cv2.bitwise_and(mask, mask2)

image_final = cv2.bitwise_and(data, data, mask=mask)
image_final = cv2.threshold(cv2.cvtColor(image_final, cv2.COLOR_BGR2GRAY),
                            0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

mask1 = np.zeros((image_final.shape[0] + 2, image_final.shape[1] + 2), np.uint8)

cv2.floodFill(image_final, mask1, (0, 0), 255)

The results aren't that good, and I don't know if this is a good way to make a mask.

Art Rhyno

Jan 5, 2022, 1:00:50 PM
to tesser...@googlegroups.com

It might be worth trying to go with a b&w rendering and using a PSM of 11, since your input images are of such good quality. This is less likely to miss words or letters, though other artifacts may slip through. Something like this seems to get decent results:

 

import cv2
import pytesseract

TESSERACT_CONFIG = r'--psm 11'

def showResults(region):
    results = pytesseract.image_to_data(region,
        config=TESSERACT_CONFIG,
        output_type=pytesseract.Output.DICT)

    tlen = len(results['text'])

    for i in range(tlen):
        # use conf to weed out some of the cruft
        if float(results['conf'][i]) > 0:
            print("WORD:", results['text'][i])
            print("left:", results['left'][i])
            print("top:", results['top'][i])
            print("width:", results['width'][i])
            print("height:", results['height'][i])
            print("conf:", results['conf'][i])

# read as grayscale to mute colors
gray = cv2.imread("mina.png", cv2.IMREAD_GRAYSCALE)

# convert to 2 color black & white
im = cv2.threshold(gray, 128, 255, cv2.THRESH_BINARY)[1]

_, w = im.shape

# crop and ocr top region (as per coords in email)
region1 = im[55:110, 0:w]
cv2.imwrite('region1.png', region1)
showResults(region1)

# crop and ocr bottom region
region2 = im[312:360, 0:w]
cv2.imwrite('region2.png', region2)
showResults(region2)

I think maybe you are cropping at a more granular level than in this example, but the basic approach would be the same.

 

art

Cyrus Yip

Jan 5, 2022, 4:09:12 PM
to tesseract-ocr
Art, I'm using your method plus my cropping, but there are some images it fails on:
drop13.png card.png card.png
code:
import cv2
import numpy as np

im = cv2.imread(file_name, cv2.IMREAD_GRAYSCALE)
im = cv2.threshold(im, 128, 255, cv2.THRESH_BINARY)[1]

h, w = im.shape

# vertical card columns
boxes = [
    (45, 0, 245, h),
    (320, 0, 515, h),
    (600, 0, 785, h),
]
if w > 1000:
    boxes.append(
        (865, 0, 1065, h)
    )
mask = np.zeros(im.shape[:2], np.uint8)

for box in boxes:
    cv2.rectangle(mask, (box[0], box[1]), (box[2], box[3]), 255, -1)

# horizontal text bands
mask2 = np.zeros(im.shape[:2], np.uint8)
boxes = [
    (0, 58, w, 110),
    (0, 312, w, 360)
]
for box in boxes:
    cv2.rectangle(mask2, (box[0], box[1]), (box[2], box[3]), 255, -1)

mask = cv2.bitwise_and(mask, mask2)
im = cv2.bitwise_and(im, im, mask=mask)

mask1 = np.zeros((im.shape[0] + 2, im.shape[1] + 2), np.uint8)
cv2.floodFill(im, mask1, (0, 0), 255)

Thank you.

Art Rhyno

Jan 5, 2022, 4:13:58 PM
to tesser...@googlegroups.com

Thanks, I will follow up directly so that we don’t overload the thread.

 

art

 
