bad quality!?


Cyrus Yip

Dec 29, 2021, 12:21:38 PM
to tesseract-ocr
Here is an example of an image I would like to use OCR on:
drop8.png
I would like the results to be like:
["Naruto Uzumaki Naruto", "Mai Sakurajima Rascal Does Not Dream of Bunny Girl Senpai", "Keqing Genshin Impact"]

Right now I'm using:

from PIL import Image
import pytesseract

im = Image.open("drop8.png")
# crop the two text bands and paste them into one image before OCR
region1 = im.crop((0, 55, im.width, 110))
region2 = im.crop((0, 312, im.width, 360))
image = Image.new("RGB", (im.width, region1.height + region2.height + 20))
image.paste(region1)
image.paste(region2, (0, region1.height + 20))
results = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)



The processed image looks like:
hi.png
but I'm getting results like:
[' ', '»MaiSakurajima¥RascalDoesNotDreamofBunnyGirlSenpai', 'iGenshinImpact']

How do I optimize the image/configs so the OCR is more accurate?

Thank you.

Zdenko Podobny

Dec 29, 2021, 12:46:13 PM
to tesser...@googlegroups.com
If you properly crop the text areas, you get good output. E.g.

r_cropped.png

> tesseract r_cropped.png - --dpi 300

Rascal Does Not Dream
of Bunny Girl Senpai
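
The same idea in Python with pytesseract, for reference (just a rough sketch; the crop coordinates are the ones from your own code above and drop8.png is your example image):

from PIL import Image
import pytesseract

im = Image.open("drop8.png")
# crop one text band and OCR only that area
name_band = im.crop((0, 55, im.width, 110))
print(pytesseract.image_to_string(name_band, config="--dpi 300"))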

Zdenko


On Wed, 29 Dec 2021 at 18:21, Cyrus Yip <cyrus...@gmail.com> wrote:

Cyrus Yip

Dec 29, 2021, 12:58:15 PM
to tesseract-ocr
I played around a bit with replacing all colours except for the text colour, and it works pretty well!

The only thing is replacing colours with:
im = im.convert("RGB")
pixdata = im.load()
# turn every pixel that is not the text colour (51, 51, 51) white
for y in range(im.height):
    for x in range(im.width):
        if pixdata[x, y] != (51, 51, 51):
            pixdata[x, y] = (255, 255, 255)

is a bit slow. Do you know a better way to replace pixels in Python? I don't know if this is off topic.

Zdenko Podobny

Dec 29, 2021, 1:15:26 PM
to tesser...@googlegroups.com
IMO, if the text is always in the same area, cropping and OCRing just that area will be faster.

Zdenko


On Wed, 29 Dec 2021 at 18:58, Cyrus Yip <cyrus...@gmail.com> wrote:

Cyrus Yip

Dec 29, 2021, 3:18:37 PM
to tesseract-ocr
But won't multiple OCR passes and crops use a lot of time?

Zdenko Podobny

Dec 30, 2021, 5:19:52 AM
to tesser...@googlegroups.com
Just make your own tests ;-)

You can use tesserocr instead of pytesseract (the installation may be quite difficult if you are on Windows); e.g. you can initialize the tesseract API once and use it multiple times. But it does not provide DICT output.
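
A minimal sketch of the "initialize once, reuse" idea with tesserocr (the file names are only placeholders):

from tesserocr import PyTessBaseAPI

with PyTessBaseAPI() as api:                      # tesseract is initialized once here
    for filename in ["card1.png", "card2.png"]:   # placeholder file names
        api.SetImageFile(filename)                # ...and reused for every image
        print(api.GetUTF8Text())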


Zdenko


On Wed, 29 Dec 2021 at 21:18, Cyrus Yip <cyrus...@gmail.com> wrote:

Zdenko Podobny

Dec 30, 2021, 11:46:51 AM
to tesser...@googlegroups.com
OK. I played a little bit ;-):

I tested the speed of your code with your image:

import timeit

pil_color_replace = """
from PIL import Image

im = Image.open('mai.png').convert("RGB")

pixdata = im.load()
for y in range(im.height):
    for x in range(im.width):
        if pixdata[x, y] != (51, 51, 51):
            pixdata[x, y] = (255, 255, 255)
"""

elapsed_time = timeit.timeit(pil_color_replace, number=100)/100
print(f"duration: {elapsed_time:.4} seconds")


I got an average time of 0.08547 seconds on my computer.
On the internet I found a suggestion to use numpy for this, and I ended up with the following code:

np_color_replace_rgb = """
import numpy as np
from PIL import Image

data = np.array(Image.open('mai.png').convert("RGB"))
mask = (data == [51, 51, 51]).all(-1)
img = Image.fromarray(np.invert(mask))
"""

elapsed_time = timeit.timeit(np_color_replace_rgb, number=100)/100
print(f"duration: {elapsed_time:.4} seconds")


I got an average time of 0.01774 seconds, i.e. about 4.8× faster than the PIL code.
It is a little bit of cheating, as it does not actually replace colors; it just takes a mask of the target color and returns it as a binarized image, which is exactly what you need for OCR ;-)
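
For example, the resulting img can be passed straight to pytesseract (a sketch, reusing the variables from the snippet above):

import pytesseract

# img is already black text on a white background
results = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)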

Also, I would like to point out that the resulting OCR output is not perfect (compared to OCR of the unmodified text areas), as this kind of binarization is very simple.


Zdenko


On Thu, 30 Dec 2021 at 11:19, Zdenko Podobny <zde...@gmail.com> wrote:

Cyrus Yip

Dec 30, 2021, 12:14:40 PM
to tesseract-ocr
I also tried many things like cropping, colour changing, colour replacing, and mixing them together.

I landed on checking whether a pixel is not one of these colours:

[(51, 51, 51), (69, 69, 65), (65, 64, 60), (59, 58, 56), (67, 66, 62), (67, 67, 63), (67, 67, 62), (53, 53, 53), (54, 54, 53), (61, 61, 58), (62, 62, 60), (55, 55, 54), (59, 59, 57), (56, 56, 55)]

and, if so, replacing it with white. It is pretty accurate, but is there a way to do this with numpy arrays?

(code)
for y in range(im.height):
    for x in range(im.width):
        if pixels[x, y] not in [(51, 51, 51), (69, 69, 65), (65, 64, 60), (59, 58, 56), (67, 66, 62), (67, 67, 63), (67, 67, 62), (53, 53, 53), (54, 54, 53), (61, 61, 58), (62, 62, 60), (55, 55, 54), (59, 59, 57), (56, 56, 55)]:
            pixels[x, y] = (255, 255, 255)

Zdenko Podobny

Dec 30, 2021, 2:43:01 PM
to tesser...@googlegroups.com
try this:

import numpy as np
from PIL import Image

filter_colors = [(51, 51, 51), (69, 69, 65), (65, 64, 60), (59, 58, 56), (67, 66, 62),
                 (67, 67, 63), (67, 67, 62), (53, 53, 53), (54, 54, 53), (61, 61, 58),
                 (62, 62, 60), (55, 55, 54), (59, 59, 57), (56, 56, 55)]
image = np.array(Image.open('mai.png').convert("RGB"))
mask = np.isin(image, filter_colors, invert=True)
img = Image.fromarray(mask.any(axis=2))



Zdenko


On Thu, 30 Dec 2021 at 18:14, Cyrus Yip <cyrus...@gmail.com> wrote:

Cyrus Yip

Dec 30, 2021, 7:45:37 PM
to tesseract-ocr
For some reason, using the numpy array gives a different result from mine.

Numpy array:

hi.png
Loop through pixels:
hi.png
The second way is more accurate but way slower.

Cyrus Yip

Dec 30, 2021, 7:46:40 PM
to tesseract-ocr
(original image)
drop11.png

Zdenko Podobny

Dec 31, 2021, 6:18:18 AM
to tesser...@googlegroups.com
You are right - np.isin works differently than I expected (it does not match whole tuples, but individual values within the tuples), and by coincidence it produces results similar to your code.
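
A tiny example of what that means (illustration only, not from the code above):

import numpy as np

# this pixel is not one of the filter colors as a whole tuple...
pixel = np.array([[[51, 64, 60]]])
# ...but np.isin still reports True for every channel, because it
# flattens the color list and matches each value independently
print(np.isin(pixel, [(51, 51, 51), (65, 64, 60)]))
# -> [[[ True  True  True]]]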

Here is updated code that produces the same result as the PIL version. It is faster, but it will get slower as the number of colors in filter_colors increases.

filter_colors = [(51, 51, 51), (69, 69, 65), (65, 64, 60), (59, 58, 56), (67, 66, 62),
                 (67, 67, 63), (67, 67, 62), (53, 53, 53), (54, 54, 53), (61, 61, 58),
                 (62, 62, 60), (55, 55, 54), (59, 59, 57), (56, 56, 55)]

image = np.array(Image.open('mai.png').convert("RGB"))
mask = np.array([], dtype=bool)
for color in filter_colors:
    if mask.size == 0:
        mask = (image == color).all(-1)
    else:
        mask = mask | (image == color).all(-1)
img = Image.fromarray(~mask)



Zdenko


On Fri, 31 Dec 2021 at 01:45, Cyrus Yip <cyrus...@gmail.com> wrote:

Cyrus Yip

Dec 31, 2021, 1:27:41 PM
to tesseract-ocr
Right now I'm installing Tesseract 4 in Docker with
RUN apt-get install -y tesseract-ocr
That might be the reason why it's way slower than on my computer. How can I install Tesseract 5?

Dockerfile:
# syntax=docker/dockerfile:1

ARG TOKEN

FROM python:3.8-slim-buster

RUN apt-get update
RUN apt-get install -y software-properties-common
RUN apt-get update
RUN add-apt-repository ppa:alex-p/tesseract-ocr-devel

RUN apt-get update
RUN apt-get install -y build-essential

COPY requirements.txt requirements.txt
RUN pip3 install -r requirements.txt

COPY . .

RUN apt-get install -y tesseract

CMD ["python3", "bot.py"]


Cyrus Yip

Jan 1, 2022, 1:49:44 PM
to tesseract-ocr
I managed to install Tesseract 5, but the numpy mask doesn't work now.
It makes pictures like:
image.png
not:
image.png


Dockerfile:
# syntax=docker/dockerfile:1
ARG TOKEN
FROM ubuntu:18.04
RUN apt-get update
RUN apt-get install -y software-properties-common
RUN apt-get install -y python3.8
RUN apt-get install -y python3-pip
RUN apt-get update
RUN apt-get install -y build-essential
RUN apt-get install -y python3-pil
COPY requirements.txt requirements.txt
RUN pip3 install -r requirements.txt
RUN apt-get update
RUN add-apt-repository ppa:alex-p/tesseract-ocr5
RUN apt-get update
RUN apt-get install -y tesseract-ocr
COPY . .
CMD ["python3", "bot.py"]

Zdenko Podobny

Jan 1, 2022, 2:41:27 PM
to tesser...@googlegroups.com
What is your code? Does it work on your local computer?

BTW, here is numpy code that is proven to work:

import numpy as np
from PIL import Image

filter_colors = [(51, 51, 51), (69, 69, 65), (65, 64, 60), (59, 58, 56), (67, 66, 62),
                 (67, 67, 63), (67, 67, 62), (53, 53, 53), (54, 54, 53), (61, 61, 58),
                 (62, 62, 60), (55, 55, 54), (59, 59, 57), (56, 56, 55)]

image = np.array(Image.open('mina.png').convert("RGB"))

# compare every pixel against every filter color in one broadcasted operation
*A, B = image.shape
mask = (image.reshape((-1, B)) == np.array(filter_colors)[:, None]).all(-1).any(0).reshape(A)
img = Image.fromarray(~mask)



Zdenko


On Sat, 1 Jan 2022 at 19:49, Cyrus Yip <cyrus...@gmail.com> wrote:

Zdenko Podobny

Jan 1, 2022, 3:29:34 PM
to tesser...@googlegroups.com
And here is an OpenCV version with, IMO, better quality:


import cv2
import numpy as np
from PIL import Image

data = cv2.imread("mina.png")
# mask of pixels that exactly match the text colour
mask_text = cv2.inRange(data, (51, 51, 51), (51, 51, 51))

# Morph open to remove noise
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
morph = cv2.morphologyEx(mask_text, cv2.MORPH_OPEN, kernel, iterations=1)

# dilate the text mask so it covers whole words
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (9, 4))
dilate = cv2.dilate(morph, kernel, iterations=4)

# Otsu-threshold the whole image and keep only the masked text regions
thresh = cv2.threshold(cv2.cvtColor(data, cv2.COLOR_BGR2GRAY),
                       0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
image_final = cv2.bitwise_and(thresh, thresh, mask=dilate)
# replace background with white
mask1 = np.zeros((image_final.shape[0] + 2, image_final.shape[1] + 2), np.uint8)
cv2.floodFill(image_final, mask1, (0, 0), 255)

display(Image.fromarray(image_final))  # display() as in a Jupyter notebook


image.png


Zdenko


On Sat, 1 Jan 2022 at 20:40, Zdenko Podobny <zde...@gmail.com> wrote:

Cyrus Yip

Jan 2, 2022, 3:00:48 PM
to tesseract-ocr
I tried the OpenCV version, but it fails with images like these:
drop12.png hi.png


Zdenko Podobny

Jan 2, 2022, 4:10:45 PM
to tesser...@googlegroups.com
All images you presented have the same size and the text is always in the same regions.
So you can create a mask for these regions and apply it to the thresholded input images. This could give you extra speed as you do not need to create a mask for each image individually...
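
A rough sketch of that idea (the band coordinates are the ones used earlier in the thread; the helper name and sample file are just examples):

import cv2
import numpy as np

# all drop images share the same dimensions, so the region mask can be
# built once from any sample image and then reused for every card drop
sample = cv2.imread("drop12.png", cv2.IMREAD_GRAYSCALE)
region_mask = np.zeros(sample.shape, np.uint8)
for (x1, y1, x2, y2) in [(0, 58, sample.shape[1], 110), (0, 312, sample.shape[1], 360)]:
    cv2.rectangle(region_mask, (x1, y1), (x2, y2), 255, -1)

def extract_text_bands(path):
    # per image: threshold, then keep only the known text bands
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
    return cv2.bitwise_and(thresh, thresh, mask=region_mask)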

Zdenko


On Sun, 2 Jan 2022 at 21:01, Cyrus Yip <cyrus...@gmail.com> wrote:

Cyrus Yip

Jan 2, 2022, 6:08:42 PM
to tesseract-ocr
OK, I will look into how to do that. But do you have an idea why some of the letters go missing?

Zdenko Podobny

Jan 3, 2022, 1:50:42 PM
to tesser...@googlegroups.com
Increase the second kernel dimension in getStructuringElement from 4 to 5 when creating the mask:

kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (9, 5))


Zdenko


On Mon, 3 Jan 2022 at 00:08, Cyrus Yip <cyrus...@gmail.com> wrote:

Cyrus Yip

Jan 3, 2022, 8:07:00 PM
to tesseract-ocr
For this image
drop12.png
it still fails to get the text from the bottom-right cards:
['MasumiMushishiZokuShou', 'TamaoHino*Eyeshield21', "DiegoBrando~'sBizarreAdi:tocolBalRan", '']

Cyrus Yip

Jan 3, 2022, 9:06:29 PM
to tesseract-ocr
I tried learning some OpenCV and doing the mask thing:
boxes = [
    (45, 0, 245, im.height),
    (320, 0, 515, im.height),
    (600, 0, 785, im.height),
]
if im.width > 1000:
    boxes.append(
       (865, 0, 1065, im.height)
    )
mask = np.zeros(data.shape[:2], np.uint8)

for box in boxes:
    cv2.rectangle(mask, (box[0], box[1]), (box[2], box[3]), 255, -1)

mask2 = np.zeros(data.shape[:2], np.uint8)
boxes = [
    (0, 58, im.width, 110),
    (0, 312, im.width, 360)
]
for box in boxes:
    cv2.rectangle(mask2, (box[0], box[1]), (box[2], box[3]), 255, -1)

mask = cv2.bitwise_and(mask, mask2)

image_final = cv2.bitwise_and(data, data, mask=mask)
image_final = cv2.threshold(cv2.cvtColor(image_final, cv2.COLOR_BGR2GRAY),
                            0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

mask1 = np.zeros((image_final.shape[0] + 2, image_final.shape[1] + 2), np.uint8)

cv2.floodFill(image_final, mask1, (0, 0), 255)

The results aren't that good, and I don't know if this is a good way to make a mask.

Art Rhyno

Jan 5, 2022, 1:00:50 PM
to tesser...@googlegroups.com

It might be worth trying to go with a b&w rendering and using a PSM of 11, since your input images are of such good quality. This is less likely to miss words or letters, though other artifacts may slip through. Something like this seems to get decent results:

 

import cv2
import pytesseract

TESSERACT_CONFIG = r'--psm 11'

def showResults(region):
    results = pytesseract.image_to_data(region,
        config=TESSERACT_CONFIG,
        output_type=pytesseract.Output.DICT)

    tlen = len(results['text'])

    for i in range(tlen):
        # use conf to weed out some of the cruft
        if float(results['conf'][i]) > 0:
            print("WORD:", results['text'][i])
            print("left:", results['left'][i])
            print("top:", results['top'][i])
            print("width:", results['width'][i])
            print("height:", results['height'][i])
            print("conf:", results['conf'][i])

# read as grayscale to mute colors
gray = cv2.imread("mina.png", cv2.IMREAD_GRAYSCALE)

# convert to 2 color black & white
im = cv2.threshold(gray, 128, 255, cv2.THRESH_BINARY)[1]

_, w = im.shape

# crop and ocr top region (as per coords in email)
region1 = im[55:110, 0:w]
cv2.imwrite('region1.png', region1)
showResults(region1)

# crop and ocr bottom region
region2 = im[312:360, 0:w]
cv2.imwrite('region2.png', region2)
showResults(region2)

I think maybe you are cropping at a more granular level than in this example, but the basic approach would be the same.

 

art

Cyrus Yip

Jan 5, 2022, 4:09:12 PM
to tesseract-ocr
Art, I'm using your method plus my cropping, but there are some images it fails on:
drop13.png card.png card.png
code:
import cv2
import numpy as np

im = cv2.imread(file_name, cv2.IMREAD_GRAYSCALE)
im = cv2.threshold(im, 128, 255, cv2.THRESH_BINARY)[1]

h, w = im.shape

# vertical card columns
boxes = [
    (45, 0, 245, h),
    (320, 0, 515, h),
    (600, 0, 785, h),
]
if w > 1000:
    boxes.append(
        (865, 0, 1065, h)
    )
mask = np.zeros(im.shape[:2], np.uint8)

for box in boxes:
    cv2.rectangle(mask, (box[0], box[1]), (box[2], box[3]), 255, -1)

# horizontal text bands
mask2 = np.zeros(im.shape[:2], np.uint8)
boxes = [
    (0, 58, w, 110),
    (0, 312, w, 360)
]
for box in boxes:
    cv2.rectangle(mask2, (box[0], box[1]), (box[2], box[3]), 255, -1)

mask = cv2.bitwise_and(mask, mask2)
im = cv2.bitwise_and(im, im, mask=mask)

mask1 = np.zeros((im.shape[0] + 2, im.shape[1] + 2), np.uint8)
cv2.floodFill(im, mask1, (0, 0), 255)

Thank you.

Art Rhyno

Jan 5, 2022, 4:13:58 PM
to tesser...@googlegroups.com

Thanks, I will follow up directly so that we don’t overload the thread.

 

art

 
