Any way to disable internal image preprocessing? (internal operations produce really BAD results)

xian

Jul 2, 2020, 5:54:42 AM

to tesseract-ocr

For Chinese text, I found that tesseract's binarization gives really bad results.

I used -c tessedit_write_image=1 to get the image that tesseract's binarization produces.

The attachments are:

original -> the source image (original.png)

tess_bin -> tesseract's binarization of original.png

my_bin -> my own preprocessing of original.png

tess_my_bin -> tesseract's binarization of my_bin.png

You can see that some characters disappear.

I want to run my own preprocessing function on all the images before I pass them to tesseract.

But tesseract's binarization makes the result worse.

I want to handle the image preprocessing part myself.

How can I disable tesseract's image preprocessing? Or is modifying the source code the only way to do this?

Thanks!!

my_bin.png

original.png

tess_bin.tif

tess_my_bin.tif

xian

Jul 2, 2020, 5:58:09 AM

to tesseract-ocr

After several tests, I think "line removal" is the cause rather than the binarization.

On Thursday, July 2, 2020 at 5:54:42 PM UTC+8, xian wrote:

xian

Jul 2, 2020, 6:11:53 AM

to tesseract-ocr

Now I doubt whether the image I got from tessedit_write_image=1 is really what the OCR runs on.

Some characters that have "totally disappeared" in tess_bin.tif are still recognized...

On Thursday, July 2, 2020 at 5:54:42 PM UTC+8, xian wrote:

Zdenko Podobny

Jul 3, 2020, 2:50:55 PM

to tesser...@googlegroups.com

First of all: you do not mention any important information like which tesseract version you use, which language model etc.

Next: "-c tessedit_write_image=1" produces "Could not set option: tessedit_write_image=1" ;-)

Next: If you want to avoid tesseract's binarization (Otsu), you must provide a really binarized image [1] as input. Your my_bin.png is a 256-color / 8 bits-per-pixel image.
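One way to check whether a PNG is really 1 bit per pixel before handing it to tesseract is to read the bit depth straight from the file header. The sketch below uses only the Python standard library. The helper names `png_bit_depth` and `looks_binary` are hypothetical, and treating "bit depth 1" as the condition for skipping Otsu is an assumption based on the thresholder code linked as [1], not something verified against every tesseract version.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_bit_depth(data):
    """Return (bit_depth, color_type) read from the IHDR chunk of PNG bytes.

    Only the first 26 bytes of the file are needed.
    """
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    # After the 8-byte signature: 4-byte chunk length, then 4-byte chunk type.
    length, chunk_type = struct.unpack(">I4s", data[8:16])
    if chunk_type != b"IHDR" or length != 13:
        raise ValueError("IHDR chunk not found where expected")
    # IHDR starts with width, height (4 bytes each), bit depth, color type.
    _width, _height, bit_depth, color_type = struct.unpack(">IIBB", data[16:26])
    return bit_depth, color_type


def looks_binary(data):
    """Assumption: an image counts as 'really binarized' when its depth is 1."""
    bit_depth, _color_type = png_bit_depth(data)
    return bit_depth == 1

# Hypothetical usage with the attachment from this thread:
#   with open("my_bin.png", "rb") as f:
#       print(png_bit_depth(f.read(26)))
```

For a 256-color palette image like the one described above, this should report a bit depth of 8 and color type 3 (palette), which would explain why tesseract still thresholds it.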

And last: I am not able to reproduce your problem with the latest tesseract code:

tesseract real_bin.png real_bin2 -c tessedit_write_images=1 -l chi_tra

See the attached tessinput.tif: it is different from your tess_my_bin.tif.

[1] https://github.com/tesseract-ocr/tesseract/blob/e910b3c20b831017b3152378bdaa4c567e62c65a/src/ccmain/thresholder.cpp#L185-L199

Zdenko

On Thu, Jul 2, 2020 at 11:54, xian <chen...@gmail.com> wrote:


tessinput.tif

xian

Jul 5, 2020, 10:30:48 PM

to tesseract-ocr

Hi zdenop:

Thank you for the reply. I will check my program and fix the image depth problem.
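A minimal sketch of what fixing the image depth could look like, assuming the preprocessed image is available as rows of 8-bit grayscale values. It encodes a true 1-bit-per-pixel grayscale PNG using only the Python standard library. The function name `encode_1bpp_png` is hypothetical, and the fixed threshold of 128 is only a placeholder for whatever preprocessing is actually used; it is not tesseract's or leptonica's method.

```python
import struct
import zlib


def _chunk(kind, payload):
    """Build one PNG chunk: length, type, data, CRC over type + data."""
    crc = zlib.crc32(kind + payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + kind + payload + struct.pack(">I", crc)


def encode_1bpp_png(gray_rows, threshold=128):
    """Encode rows of 0..255 gray values as a 1-bit grayscale PNG (bytes).

    Pixels at or above `threshold` become white (bit 1), others black (bit 0).
    """
    height = len(gray_rows)
    width = len(gray_rows[0])
    raw = bytearray()
    for row in gray_rows:
        raw.append(0)  # filter type 0 (None) for this scanline
        byte, nbits = 0, 0
        for value in row:
            # Pack 8 pixels per byte, leftmost pixel in the highest bit.
            byte = (byte << 1) | (1 if value >= threshold else 0)
            nbits += 1
            if nbits == 8:
                raw.append(byte)
                byte, nbits = 0, 0
        if nbits:
            raw.append(byte << (8 - nbits))  # zero-pad the last partial byte
    # IHDR: width, height, bit depth 1, color type 0 (grayscale),
    # compression 0, filter 0, interlace 0.
    ihdr = struct.pack(">IIBBBBB", width, height, 1, 0, 0, 0, 0)
    return (b"\x89PNG\r\n\x1a\n"
            + _chunk(b"IHDR", ihdr)
            + _chunk(b"IDAT", zlib.compress(bytes(raw)))
            + _chunk(b"IEND", b""))

# Hypothetical usage:
#   with open("real_bin.png", "wb") as f:
#       f.write(encode_1bpp_png(rows))
```

If an imaging library is already in use, converting to its 1-bit mode before saving should have the same effect; the point is only that the file on disk ends up with a bit depth of 1 rather than 8.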

But the "missing characters" problem is still there...

Here is my tesseract version:

tesseract 4.1.1

leptonica-1.79.0

libgif 5.2.1 : libjpeg 6b (libjpeg-turbo 1.2.1) : libpng 1.2.49 : libtiff 3.9.4 : zlib 1.2.3 : libwebp 0.4.3

Found AVX

Found SSE

The model I use is tessdata_best

And the full command is tesseract original.png stdout -l chi_tra+eng --oem 1 --psm 1 -c tessedit_write_images=1

As the attachments show, some strokes of the characters have disappeared!

Is this a bug in tesseract 4.1.1?

Thank you!

On Saturday, July 4, 2020 at 2:50:55 AM UTC+8, zdenop wrote: