problems with upper-case character

Sandra M.

Sep 18, 2019, 11:19:22 AM

to tesseract-ocr

I'm using Tesseract with Python. I have an image with 1-6 words in it and need to read the text. Sometimes the character "C", which looks the same in upper and lower case, is detected as lower-case c instead of upper-case C. I see why this is hard, but given the letters that follow it should be possible to detect the right case. Is there any configuration option or something else that improves this?

I had a look at the configuration option config='-psm x' with different values for x, but none of them fits my problem.

Timothy Snyder

Sep 18, 2019, 11:33:48 AM

to tesser...@googlegroups.com

No configs that I know of, but I have similar functionality implemented in a text post-processing step in my OCR pipeline.


--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/ef0e07cc-5f7c-4ff3-bb07-ffdda4c68321%40googlegroups.com.

Zdenko Podobny

Sep 18, 2019, 11:55:35 AM

to tesser...@googlegroups.com

IMO the only solution is to send a longer text for OCR (e.g. a paragraph).

Zdenko


Sandra M.

Sep 19, 2019, 3:50:58 AM

to tesseract-ocr

Thanks for your responses.

@Timothy Snyder: I think I cannot do this in post-processing, as both spellings can occur and I have to differentiate between them. Or what exactly did you do?

@zdenop: Unfortunately it is not possible for me to send a longer text.

Does anyone else have any ideas?

Lorenzo Bolzani

Sep 19, 2019, 4:03:17 AM

to tesser...@googlegroups.com

You say that both letters look the same (same height too?) and that it is not possible to do it in post-processing as both spellings are possible. How is Tesseract, or a human, supposed to tell them apart?

Can you please share a sample? Maybe using a smaller/bigger image is enough. Or maybe the image is very noisy or colored, or there is something else making things more difficult.

About the longer text: you could try to repeat the same image twice, vertically or horizontally, and see if it helps.

Lorenzo


Sandra M.

Sep 19, 2019, 4:43:30 AM

to tesseract-ocr

@Lorenzo Blz: This is an example image. The output of my code is "calibrations". The height of the letters is not the same. Of course the case cannot be recognized if there is only a single "c", but given the other letters as context, Tesseract should be able to detect whether it is a small or a capital letter, I think. This image has no noise or anything else, so I don't understand the problem.

Nevertheless, your suggestion to change the size helped! If I resize the image to 150% or 75%, for example, it works. I just don't know how to handle this later on, when I don't have a reference value: how do I decide which spelling is right, the one from the 100% image or the one from the 150% image? Or is it fair to say that resizing the image in preprocessing always gives a more reliable result?
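One possible way to handle the "no reference value" case is to run OCR at several sizes and keep the answer that most sizes agree on. This is only a sketch of that idea: the voting heuristic is an assumption and was not suggested by anyone in this thread, and the names majority_vote and ocr_at_scales are made up for illustration.

```python
from collections import Counter


def majority_vote(candidates):
    """Return the most frequent non-empty string; ties go to the earliest one."""
    cleaned = [c.strip() for c in candidates if c and c.strip()]
    if not cleaned:
        return ""
    counts = Counter(cleaned)
    best = max(counts.values())
    # Keep input order so a tie is resolved in favour of the first scale tried.
    return next(c for c in cleaned if counts[c] == best)


def ocr_at_scales(image_path, scales=(0.75, 1.0, 1.5)):
    """Run Tesseract on rescaled copies of one image and vote on the result."""
    # Imported here so majority_vote() can be used without OpenCV installed.
    import cv2
    import pytesseract

    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        raise FileNotFoundError(image_path)

    results = []
    for scale in scales:
        # Always rescale from the original image, never from a rescaled copy.
        interp = cv2.INTER_CUBIC if scale > 1 else cv2.INTER_AREA
        resized = cv2.resize(gray, None, fx=scale, fy=scale, interpolation=interp)
        results.append(pytesseract.image_to_string(resized))
    return majority_vote(results)
```

Whether voting really is more reliable than a single fixed size would have to be checked on a set of images with known text.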


Zdenko Podobny

Sep 19, 2019, 5:23:50 AM

to tesser...@googlegroups.com

Please provide more information (version info, how you run the OCR; it seems you are using some code).

I just tried Tesseract (tesseract 5.0.0-alpha-416-g408d6) on the command line with tessdata_best and it works for me:

tesseract unnamed.png -
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 497
Calibrations

Zdenko


Sandra M.

Sep 19, 2019, 5:55:30 AM

to tesseract-ocr

I use Tesseract 3.02 with leptonica-1.68. What do you mean by tessdata_best? I'm new to this field and only know how to call Tesseract with the code below. How can the resolution be 0 dpi?

I'm using this Python code:

import argparse

import cv2
import pytesseract

# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-i", "--image", required=True,
    help="path to input image to be OCR'd")
args = vars(ap.parse_args())

# load the example image and convert it to grayscale
image = cv2.imread(args["image"])
if image is None:
    raise SystemExit("could not read image: " + args["image"])
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# pass the grayscale array straight to Tesseract; pytesseract accepts
# NumPy arrays, so no temporary file has to be written or deleted
text = pytesseract.image_to_string(gray)
print("Output: " + text)


Lorenzo Bolzani

Sep 19, 2019, 6:37:14 AM

to tesser...@googlegroups.com

I tried to upscale, downscale, with and without the white border and I always get Calibrations. I even tried a few psm modes.

I'm using:

tesseract 4.0.0
leptonica-1.76.0
libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11

What I would do is this:

- prepare a test set with some data so that you can check what gives you an improvement and what not on average

- remove the white border (see here)

- now rescale the text so that it is about 35/55px, try a few values and see what works best. I would also try a few completely different values (75, 100) while I'm there (just make sure you always start from the original images when you rescale, so as not to degrade the images too much; I would use find+imagemagick).
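The cropping and rescaling steps above can be sketched in Python with OpenCV instead of find+imagemagick. This is a rough sketch under stated assumptions: dark text on a light background, a single line of text (so the height of the ink bounding box is the text height), and a target of 45 px picked arbitrarily from the suggested range. The function names are made up for illustration.

```python
def scale_factor(text_height_px, target_height_px=45):
    """Factor that brings text of the given pixel height to the target height."""
    if text_height_px <= 0:
        raise ValueError("text height must be positive")
    return target_height_px / text_height_px


def crop_and_rescale(gray, target_height_px=45):
    """Crop a grayscale image to its ink and rescale it to the target height."""
    # Imported here so scale_factor() can be used without OpenCV installed.
    import cv2

    # Assumes dark text on a light background: after the inverted Otsu
    # threshold, text pixels are non-zero and the border is zero.
    _, ink = cv2.threshold(gray, 0, 255,
                           cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    points = cv2.findNonZero(ink)
    if points is None:
        return gray  # blank image, nothing to crop

    x, y, w, h = cv2.boundingRect(points)
    cropped = gray[y:y + h, x:x + w]

    factor = scale_factor(h, target_height_px)
    interp = cv2.INTER_CUBIC if factor > 1 else cv2.INTER_AREA
    # Rescale the crop of the original image, never an already rescaled copy.
    return cv2.resize(cropped, None, fx=factor, fy=factor, interpolation=interp)
```

Running this with a few different target heights over the test set from the first step should show which value works best on average.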

If this doesn't work, you could look at the character boxes size. If the text height is fixed you should be able to tell immediately what is what.
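The box-size check could look roughly like the sketch below. It parses the text returned by pytesseract.image_to_boxes, where each line is "char left bottom right top page" with the origin at the bottom-left corner, and forces the case of letters whose upper-case and lower-case shapes are alike. The known capital height cap_height_px, the 0.85 ratio, the letter set, and the function names are all assumptions for illustration.

```python
# Letters whose upper-case and lower-case shapes differ mainly in size.
AMBIGUOUS = set("cosvwxz")


def parse_boxes(box_text):
    """Parse Tesseract box lines into (char, height_in_px) pairs."""
    boxes = []
    for line in box_text.splitlines():
        parts = line.split()
        if len(parts) < 5:
            continue  # skip empty or malformed lines
        left, bottom, right, top = map(int, parts[1:5])
        boxes.append((parts[0], top - bottom))
    return boxes


def fix_case_by_height(boxes, cap_height_px, ratio=0.85):
    """Set the case of ambiguous letters from their box height.

    A letter at least ratio * cap_height_px tall is treated as a capital.
    Only meaningful when the text height is fixed and known in advance.
    """
    out = []
    for char, height in boxes:
        if char.lower() in AMBIGUOUS:
            is_capital = height >= ratio * cap_height_px
            char = char.upper() if is_capital else char.lower()
        out.append(char)
    return "".join(out)
```

The box output contains no spaces, so this works per word; for example, fix_case_by_height(parse_boxes(pytesseract.image_to_boxes(gray)), cap_height_px=40) would rebuild a single word.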

If this doesn't work and if you have some data, you could consider doing some fine tuning (for example with ocrd-train) but if your text is so clear and standard you should not need it.

I just saw that you are using version 3.x, this is the old version and does not use neural networks. Current stable version is 4.1.

Lorenzo


Zdenko Podobny

Sep 19, 2019, 6:49:43 AM

to tesser...@googlegroups.com

Your Tesseract version is old. The current version is 4.1 (the dev version is 5.0).

For 4.x and above you can use different tessdata: best, fast, or the one with the 3.x model.
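From Python, one way to pick a particular tessdata set is to pass Tesseract's --tessdata-dir option through the config argument of pytesseract. A minimal sketch; the directory path is a placeholder and tessdata_config is a made-up helper name.

```python
import shlex


def tessdata_config(tessdata_dir, psm=None):
    """Build a pytesseract config string that selects a tessdata directory."""
    # Quote the path so directories containing spaces survive splitting.
    parts = ["--tessdata-dir", shlex.quote(tessdata_dir)]
    if psm is not None:
        parts += ["--psm", str(psm)]
    return " ".join(parts)


if __name__ == "__main__":
    # Placeholder path: point it at a local copy of tessdata_best.
    print(tessdata_config("/path/to/tessdata_best", psm=7))
```

The result would then be used as pytesseract.image_to_string(gray, lang="eng", config=tessdata_config("/path/to/tessdata_best")), assuming eng.traineddata exists in that directory.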

Zdenko


Sandra M.

Sep 19, 2019, 11:30:34 AM

to tesseract-ocr

You were both right - updating to version 5 fixed the problem more or less! Only in one case there is still a problem with lower and upper case letters, but for the other cases it's working now!

Sandra M.

Sep 19, 2019, 12:06:46 PM

to tesseract-ocr

But now I get empty strings instead, because a symbol occurs that Tesseract does not know. I had this problem before as well, but for whatever reason I could fix it with config='--psm 7'. This no longer works. Do you have an idea for this as well? I don't need to detect the symbol; I just don't want the rest of the string to be "thrown away".

Zdenko Podobny

Sep 19, 2019, 1:36:32 PM

to tesser...@googlegroups.com

Please provide an image for testing.

Zdenko


Sandra M.

Sep 20, 2019, 4:54:13 AM

to tesseract-ocr

I realized that it also happens for strings without the symbol. The image below, for example, returns an empty string as well, although it is recognized correctly with config='--psm 7'. Unfortunately, I cannot assume in general that the text is only one line. Maybe the problem is that the text is not a word in the dictionary? I found out that it is possible to disable the dictionary and get back the single letters with the highest accuracy, but I did not understand how to do this. I tried it with this config:

text = pytesseract.image_to_string(gray, config='load_system_dawg=0')

but it didn't improve anything, and I'm not even sure I applied it correctly...
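On the "applied correctly" question: the tesseract command line sets variables with -c NAME=VALUE, and a bare word in the config string is treated as the name of a config file, so the string above most likely had no effect. A sketch of the -c form follows; the helper name is made up, and whether switching the dictionaries off actually helps with this image is not confirmed anywhere in the thread.

```python
def no_dictionary_config(psm=None):
    """Build a pytesseract config string that turns off the word lists."""
    parts = []
    if psm is not None:
        parts += ["--psm", str(psm)]
    # load_system_dawg: main dictionary; load_freq_dawg: frequent-words list.
    parts += ["-c", "load_system_dawg=0", "-c", "load_freq_dawg=0"]
    return " ".join(parts)


if __name__ == "__main__":
    print(no_dictionary_config(psm=6))
```

It would be passed as pytesseract.image_to_string(gray, config=no_dictionary_config()).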

Lorenzo Bolzani

Sep 21, 2019, 7:24:45 AM

to tesser...@googlegroups.com

If you are not sure if you have a single line or a single block use psm 6.

See tesseract --help-extra

Psm 6 generally works fine for single lines too.

If you have full pages and single lines mixed, you need a preprocessing step (threshold, morphology, etc.) to work out which psm is the correct one.
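A very simple version of such a preprocessing step is to threshold the image and count the bands of rows that contain ink. The sketch below uses a threshold only (no morphology), assumes dark text on a light background, and uses a made-up min_gap of 3 blank rows to separate lines; the function names are illustrative.

```python
def count_text_lines(row_has_ink, min_gap=3):
    """Count runs of inked rows separated by at least min_gap blank rows."""
    lines = 0
    gap = min_gap  # treat the top edge as a gap so the first inked row counts
    for has_ink in row_has_ink:
        if has_ink:
            if gap >= min_gap:
                lines += 1
            gap = 0
        else:
            gap += 1
    return lines


def choose_psm(gray):
    """Return 7 (single line) or 6 (single uniform block) for an image."""
    # Imported here so count_text_lines() can be used without OpenCV installed.
    import cv2

    # Assumes dark text on a light background; text pixels become non-zero.
    _, ink = cv2.threshold(gray, 0, 255,
                           cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    row_has_ink = (ink.max(axis=1) > 0).tolist()
    return 7 if count_text_lines(row_has_ink) <= 1 else 6
```

The chosen value would then go into the config string, for example config="--psm {}".format(choose_psm(gray)).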
