Does Tesseract Actually Deskew the Image?


Pedro Correia

Jan 22, 2017, 6:40:20 PM
to tesseract-ocr
Hi there,
I've been reading Tesseract's code and I've realized that there are several skew estimation functions around. However, I am really confused: does Tesseract actually deskew the image?
Here [1] we find that "The line finding algorithm is designed so that a skewed page can be recognized without having to de-skew, thus saving loss of image quality." Does it mean that no deskewing is done?
On the other hand, I've seen some posts around here referring to Leptonica's deskewing algorithm, even though I couldn't find any call to its functions anywhere in Tesseract's code.
So, can anyone explain the real deal to me?
Thanks in advance!


Pedro Correia

Feb 3, 2017, 7:55:41 AM
to tesseract-ocr
up! please help meeeeeeeee

James R Barlow

Feb 7, 2017, 2:43:30 AM
to tesseract-ocr
Tesseract doesn't deskew the output image. It makes no changes to the output image.

What it does try to do is find a local baseline for each text line to account for text that is skewed. I believe it may also be capable of finding low-order polynomial baselines to account for certain distortions. This is strictly a tool to improve OCR results. You can see the results in the output of the "hocr" renderer.

Example hocr output for a straight line with no skew:
  
<span class='ocr_line' id='line_1_1' title="bbox 882 131 1656 217; baseline 0 -17; x_size 87; x_descenders 17; x_ascenders 21">

Interpretation in hocr spec of "baseline":
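
As far as I understand the hOCR spec, the numbers after "baseline" are polynomial coefficients, highest order first, measured relative to the bottom-left corner of the line's bbox. So "baseline 0 -17" would be a line with slope 0 sitting 17 pixels from the bottom edge of the box. A minimal Python sketch of reading that value, assuming this interpretation (the function names are my own, not part of Tesseract or any hOCR library):

```python
import math
import re


def parse_baseline(title):
    """Return the baseline coefficients (highest order first) from an hOCR title string."""
    match = re.search(r"baseline((?:\s+-?\d+(?:\.\d+)?)+)", title)
    if match is None:
        return None
    return [float(token) for token in match.group(1).split()]


def line_skew_degrees(title):
    """Angle of the baseline at the left edge of the bbox, in degrees.

    Assumes the second-to-last coefficient is the linear slope term.
    The sign convention depends on the image coordinate system.
    """
    coeffs = parse_baseline(title)
    if coeffs is None or len(coeffs) < 2:
        return None
    return math.degrees(math.atan(coeffs[-2]))


title = ("bbox 882 131 1656 217; baseline 0 -17; x_size 87; "
         "x_descenders 17; x_ascenders 21")
print(parse_baseline(title))     # [0.0, -17.0]
print(line_skew_degrees(title))  # 0.0
```

A non-zero slope here only says that Tesseract fitted a tilted baseline for that line; the image itself is left untouched.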


You will get better results if you globally deskew and clean up the image using other methods, such as Leptonica's deskew function.

OCRmyPDF is a tool I develop that wraps tesseract. It can perform deskew using Leptonica before it delegates OCR to tesseract.
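
To give an idea of what a global skew estimate involves, here is a rough Python sketch of a generic projection-profile search. It is not Leptonica's actual implementation and not what OCRmyPDF runs; the function name, the angle range and the step size are all assumptions made for illustration. The idea: shear the foreground pixels by each candidate angle and keep the angle at which the row histogram is most sharply peaked.

```python
import math
from collections import Counter


def estimate_skew_degrees(pixels, max_angle=5.0, step=0.1):
    """Estimate the skew of a set of (x, y) foreground pixels.

    Tries angles in [-max_angle, +max_angle] and returns the one whose
    shear concentrates the pixels into the fewest rows (highest sum of
    squared row counts). Sign follows the (x, y) convention of the input.
    """
    best_angle, best_score = 0.0, -1.0
    steps = int(round(2 * max_angle / step))
    for i in range(steps + 1):
        angle = -max_angle + i * step
        t = math.tan(math.radians(angle))
        # Undo the candidate tilt, then count pixels per row.
        rows = Counter(round(y - x * t) for x, y in pixels)
        score = sum(count * count for count in rows.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle


# Synthetic "page": three one-pixel-thick text lines tilted by 2 degrees.
slope = math.tan(math.radians(2.0))
pixels = [(x, round(y0 + x * slope)) for y0 in (50, 100, 150) for x in range(400)]
angle = estimate_skew_degrees(pixels)
print(round(angle, 1))
```

Once the angle is known, the page would be rotated by the opposite amount before being handed to tesseract. On real scans a library routine such as Leptonica's is the better choice, since it also reports how confident the estimate is.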

Pedro Correia

Feb 7, 2017, 1:26:57 PM
to tesseract-ocr
Thanks a lot, James, that's what I needed.
I'll check your tool!

Akira Hayakawa

Jun 5, 2017, 2:06:36 AM
to tesseract-ocr
> Tesseract doesn't deskew the output image

I think what we are discussing is preprocessing, which is done before segmentation in tesseract.
So which output image are you talking about?

Btw, I found a similar GitHub issue in the tesseract repo, and in my experiment no deskewing is done in preprocessing.

Do you agree with this experimental result?
And do we need to deskew the input image before passing it to any of tesseract's pipelines?

zdenop

Oct 15, 2019, 2:13:11 AM
to tesseract-ocr
https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality#rotation--deskewing

On Monday, June 5, 2017 at 8:06:36 UTC+2, Akira Hayakawa wrote: