Any way to disable internal image preprocessing? (internal operations produce really BAD results)


xian

Jul 2, 2020, 5:54:42 AM
to tesseract-ocr
For Chinese text, I found that Tesseract's binarization produces really bad results.
I used -c tessedit_write_image=1 to dump the image that Tesseract's binarization produces.

See the attachments:
original -> the original image
tess_bin -> Tesseract's binarization of original.png
my_bin -> my own preprocessing of original.png
tess_my_bin -> Tesseract's binarization of my_bin.png

You can see that some characters disappear.
I want to run my own preprocessing function on all the images before passing them to Tesseract,
but Tesseract's binarization makes the result worse.


I want to handle the image preprocessing myself.
How can I disable Tesseract's image preprocessing? Or is modifying the source code the only way to do this?
Thanks!!
my_bin.png
original.png
tess_bin.tif
tess_my_bin.tif

xian

Jul 2, 2020, 5:58:09 AM
to tesseract-ocr
After several tests, I think "line removal" is the cause rather than the binarization.

On Thursday, July 2, 2020 at 5:54:42 PM UTC+8, xian wrote:

xian

Jul 2, 2020, 6:11:53 AM
to tesseract-ocr
Now I doubt whether the image I get from tessedit_write_image=1 is really what the OCR runs on.
Some characters that "totally disappear" in tess_bin.tif are still recognized...

On Thursday, July 2, 2020 at 5:54:42 PM UTC+8, xian wrote:

Zdenko Podobny

Jul 3, 2020, 2:50:55 PM
to tesser...@googlegroups.com
First of all: you did not mention important information such as which Tesseract version you use, which language model, etc.

Next: "-c tessedit_write_image=1" produces "Could not set option: tessedit_write_image=1" ;-)

Next: if you want to avoid Tesseract's binarization (Otsu), you must provide a really binarized image [1] as input. Your my_bin.png uses a 256-color / 8 bits-per-pixel format.
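
A quick way to check which pixel format a PNG actually has is to read its IHDR header. The following is a minimal sketch using only the Python standard library; the function name `png_pixel_format` is made up for illustration. It only reports the bit depth and color type and does not claim to reproduce how Tesseract or Leptonica decide whether to binarize.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

# Color type codes defined by the PNG specification.
COLOR_TYPES = {
    0: "grayscale",
    2: "truecolor",
    3: "indexed (palette)",
    4: "grayscale+alpha",
    6: "truecolor+alpha",
}

def png_pixel_format(data):
    """Return (bit_depth, color_type_name) read from the IHDR chunk.

    `data` must hold at least the first 26 bytes of a PNG file:
    8-byte signature, 4-byte chunk length, 4-byte chunk type "IHDR",
    4-byte width, 4-byte height, 1-byte bit depth, 1-byte color type.
    """
    if data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    bit_depth, color_type = struct.unpack(">BB", data[24:26])
    return bit_depth, COLOR_TYPES.get(color_type, "unknown")

# Usage (file name taken from this thread):
#   with open("my_bin.png", "rb") as f:
#       print(png_pixel_format(f.read(26)))
```

A 256-color, 8 bits-per-pixel file would report a bit depth of 8 here, whereas a truly binarized image would report a bit depth of 1.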

And last: I am not able to reproduce your problem with the latest tesseract code:
tesseract real_bin.png real_bin2 -c tessedit_write_images=1 -l chi_tra
See the attached tessinput.tif; it is different from your tess_my_bin.tif.


Zdenko


On Thu, Jul 2, 2020 at 11:54, xian <chen...@gmail.com> wrote:
tessinput.tif

xian

Jul 5, 2020, 10:30:48 PM
to tesseract-ocr
Hi zdenop:

Thank you for the reply. I will check my program to fix the image depth problem.
But the "missing characters" problem is still there...
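
For the image depth issue, the sketch below shows what a 1-bit-per-pixel PNG looks like at the file level: it thresholds 8-bit grayscale values and packs them into a 1-bit grayscale PNG using only the Python standard library. The function name `encode_1bpp_png` and the fixed threshold of 128 are assumptions for illustration; in practice an imaging library would normally do this step, and the threshold would come from your own preprocessing.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def _chunk(kind, data):
    """Build one PNG chunk: length, type, data, CRC over type + data."""
    crc = zlib.crc32(kind + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + kind + data + struct.pack(">I", crc)

def encode_1bpp_png(gray_rows, threshold=128):
    """Threshold 8-bit grayscale rows and encode them as a 1-bit PNG.

    gray_rows: list of equal-length lists of ints in 0..255.
    Pixels >= threshold become white (bit 1), the rest black (bit 0).
    Returns the PNG file contents as bytes.
    """
    height = len(gray_rows)
    width = len(gray_rows[0])
    raw = bytearray()
    for row in gray_rows:
        raw.append(0)  # filter type 0 (None) for this scanline
        byte, nbits = 0, 0
        for value in row:
            # Pixels are packed most-significant bit first.
            byte = (byte << 1) | (1 if value >= threshold else 0)
            nbits += 1
            if nbits == 8:
                raw.append(byte)
                byte, nbits = 0, 0
        if nbits:  # pad the last byte of the row with zero bits
            raw.append(byte << (8 - nbits))
    # IHDR: width, height, bit depth 1, color type 0 (grayscale),
    # compression 0, filter 0, no interlace.
    ihdr = struct.pack(">IIBBBBB", width, height, 1, 0, 0, 0, 0)
    return (PNG_SIGNATURE
            + _chunk(b"IHDR", ihdr)
            + _chunk(b"IDAT", zlib.compress(bytes(raw)))
            + _chunk(b"IEND", b""))

# Usage: write the bytes to a file and pass that file to tesseract.
#   with open("real_bin.png", "wb") as f:
#       f.write(encode_1bpp_png(rows))
```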

Here is my tesseract version:
tesseract 4.1.1
 leptonica-1.79.0
  libgif 5.2.1 : libjpeg 6b (libjpeg-turbo 1.2.1) : libpng 1.2.49 : libtiff 3.9.4 : zlib 1.2.3 : libwebp 0.4.3
 Found AVX
 Found SSE

The model I use is from tessdata_best.
The full command is: tesseract original.png stdout -l chi_tra+eng --oem 1 --psm 1 -c tessedit_write_images=1
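
For running the same command over many images, a small wrapper can build the argument list. This is a minimal sketch; `tesseract_command` is a hypothetical helper, and its defaults simply mirror the options in the command above. It assumes the `tesseract` executable is on the PATH when the command is actually run.

```python
def tesseract_command(image, output_base="stdout", langs=("chi_tra", "eng"),
                      oem=1, psm=1, config=None):
    """Build the argument list for one tesseract CLI invocation.

    `config` is a dict of config variables; each entry is passed
    as `-c key=value`.
    """
    cmd = ["tesseract", image, output_base,
           "-l", "+".join(langs),
           "--oem", str(oem),
           "--psm", str(psm)]
    for key, value in (config or {}).items():
        cmd += ["-c", f"{key}={value}"]
    return cmd

# Usage (not executed here; requires tesseract to be installed):
#   import subprocess
#   cmd = tesseract_command("original.png",
#                           config={"tessedit_write_images": 1})
#   subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names. Judging by the attachments in this thread, the debug image written by tessedit_write_images=1 is named tessinput.tif.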

As you can see in the attachments, some of the characters' strokes have disappeared!
Is this a bug in Tesseract 4.1.1?
Thank you!

On Saturday, July 4, 2020 at 2:50:55 AM UTC+8, zdenop wrote:
original.png
tessinput.tif