Issue 1496 in tesseract-ocr: Access Violation - reading outside image buffer during line detection


tesser...@googlecode.com

Jul 8, 2015, 6:53:43 PM
to tesserac...@googlegroups.com
Status: New
Owner: ----

New issue 1496 by rtaylor...@gmail.com: Access Violation - reading outside
image buffer during line detection
https://code.google.com/p/tesseract-ocr/issues/detail?id=1496

What steps will reproduce the problem?
1. Tesseract 3.02+ command line
2. "tesseract -l eng Image_crop.png Image pdf"

What is the expected output? What do you see instead?
> I expect tesseract to run and produce output

> Instead, Tesseract crashes with "ACCESS VIOLATION (0xC0000005)"-type
> error.

What version of the product are you using? On what operating system?
Seen in Tesseract 3.02.02 and code from SVN around March 2015.
Windows 7
Win32-bit Tesseract builds.

Please provide any additional information below.
- Doesn't happen in 64-bit Windows build (lucky?)

- Attached image has non-white pixels at image edges - this seems to
trigger this crash bug.

- Access violation occurs in TextlineProjection::MeanPixelsInLineSegment()
when it calls GET_DATA_BYTE() (~line 550). This can break when
start_pt/end_pt Y values = 0 and offset is a negative value. This can also
break when start_pt/end_pt Y value = bottom of image and offset is a
positive value. These conditions lead to attempted reads of data either
before or after the image buffer.

- Other problems would occur horizontally (i.e. X value = 0 or right edge
of image). In these cases there is less chance of stepping outside the
image buffer (unless at a corner), but a good chance that the algorithm
will not read the intended data, because the read wraps around to the other
side of the image.
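
To illustrate the two failure modes described above, here is a minimal
sketch. It assumes a simplified flat, row-major 8-bit buffer with no row
padding, which is not Leptonica's actual Pix layout. FlatImage, FlatIndex,
and ClassifyRead are hypothetical names, not Tesseract or Leptonica
functions.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical 8-bit image stored row by row in one flat buffer
// (an assumption for illustration; real Pix rows are word-padded).
struct FlatImage {
  int width;
  int height;
  std::vector<std::uint8_t> data;  // width * height bytes
};

// Index that an unchecked (x, y) lookup would compute. Kept signed so
// that negative results stay visible instead of wrapping.
long long FlatIndex(const FlatImage& img, int x, int y) {
  return static_cast<long long>(y) * img.width + x;
}

enum class ReadKind { kValid, kWrapsToOtherRow, kOutsideBuffer };

// Classify what an unchecked read at (x, y) would touch.
ReadKind ClassifyRead(const FlatImage& img, int x, int y) {
  const bool in_bounds =
      x >= 0 && x < img.width && y >= 0 && y < img.height;
  if (in_bounds) return ReadKind::kValid;
  const long long idx = FlatIndex(img, x, y);
  const long long size = static_cast<long long>(img.width) * img.height;
  // Still inside the allocation, but on a different row than intended.
  if (idx >= 0 && idx < size) return ReadKind::kWrapsToOtherRow;
  // Before the first byte or past the last one.
  return ReadKind::kOutsideBuffer;
}
```

For a 4x3 image under these assumptions, (0, -2) and (0, 4) fall outside
the buffer, while (-1, 1) silently lands on the last pixel of row 0.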


Attachments:
Image_crop.png 1.5 MB

--
You received this message because this project is configured to send all
issue notifications to this address.
You may adjust your notification preferences at:
https://code.google.com/hosting/settings

tesser...@googlecode.com

Jul 22, 2015, 4:26:49 AM
to tesserac...@googlegroups.com
Updates:
Status: Invalid

Comment #1 on issue 1496 by zde...@gmail.com: Access Violation - reading
outside image buffer during line detection
https://code.google.com/p/tesseract-ocr/issues/detail?id=1496

Your report is strange:
1. You wrote that you use tesseract 3.02, but that version does not provide
pdf output.
2. You wrote that you use code from SVN around March 2015, but this project
switched from svn to git in 2014/08 (see [1]), and a few weeks ago we moved
to github.com (see the announcement on the main page [2]).

So it looks like you need to use the correct code from the correct place...

[1]
https://code.google.com/p/tesseract-ocr/source/list?r=736d32747333a5ff68162975c04054bc30792572&r=298e31465a445e54defedd076217ff24b1af3fc2
[2] https://code.google.com/p/tesseract-ocr/

tesser...@googlecode.com

Jul 22, 2015, 2:39:04 PM
to tesserac...@googlegroups.com

Comment #2 on issue 1496 by rtaylor...@gmail.com: Access Violation -
reading outside image buffer during line detection
https://code.google.com/p/tesseract-ocr/issues/detail?id=1496

It looks like I copied a command line from calling a recent (3.03-ish)
build that could generate PDF. That, however, is just a distraction.

The attached image causes an access violation during the internal
segmentation code - i.e. before recognition, and long before any actual
output (hocr, pdf, text, etc.)


As far as I could tell the Google Code source for Tesseract started getting
mirrored to GitHub sometime around July 2014. However, I see no
announcements that it was clearly moved to GitHub until your "Tesseract
moved to github" posting on June 14, 2015. Based on comparisons, the
source from Google Code (SVN) and from GitHub (GIT) was exactly the same
until recently.


I will also test with the posted 3.02.02 binaries - that may remove any
issue of which source was used.


Finally - this bug is apparent through code inspection. The logic in
TextlineProjection works on a line of pixels. The calling code selects a
line segment to analyze and calls MeanPixelsInLineSegment() with the
current line and then with other adjacent lines chosen by offsetting x or y
by +2/-2 +1/-3. When analyzing a horizontal line at y=0, the adjacent line
where y=-2 will be trying to read pixel data outside the image buffer -
which causes an access violation unless the memory happens (lucky?) to be
readable.
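
The guard that such offset reads need can be sketched as follows. This is
a minimal illustration, not Tesseract's code: Point and OffsetAndClamp are
hypothetical names, and the offset is applied to y only, as for a
horizontal line.

```cpp
#include <algorithm>

// Hypothetical point type; not Tesseract's TPOINT.
struct Point {
  int x;
  int y;
};

// Shift a point vertically by `offset` rows, then clamp it into
// [0, width - 1] x [0, height - 1] so a later pixel read stays inside
// the image.
Point OffsetAndClamp(Point pt, int offset, int width, int height) {
  pt.y += offset;
  pt.x = std::max(0, std::min(pt.x, width - 1));
  pt.y = std::max(0, std::min(pt.y, height - 1));
  return pt;
}
```

With a 10x8 image, a point on row 0 shifted by -2 is clamped back to row 0
rather than addressing row -2.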

tesser...@googlecode.com

Aug 6, 2015, 11:44:13 AM
to tesserac...@googlegroups.com

Comment #3 on issue 1496 by rtaylor...@gmail.com: Access Violation -
reading outside image buffer during line detection
https://code.google.com/p/tesseract-ocr/issues/detail?id=1496

Ok, after some work I traced this to a Microsoft Visual Studio 2013
Community compiler optimization bug that was skipping the calls to
TruncateToImageBounds() inside TextlineProjection.cpp. This allowed reads
from memory outside of the image's buffer and, usually, caused an access
violation.

This was hard to find because the problem didn't happen in debug mode (no
optimization) and would cease happening if I added any kind of program
logic before or after the calls to the Truncate...() method. It also only
occurred in 32-bit code - no problem in 64-bit build. Also, code would
work for some images alone but not when those images were being processed
among multiple threads running Tesseract. An annoying problem to isolate
(as are many compiler-optimization problems).

For our build (still using VS2013) we used pragmas to disable optimization
for the TruncateToImageBounds() method - that seems to work based on our
testing.

I think that the VS2015 Community Edition compiler fixes this - they claim
to have fixed "500 compiler bugs" (but no specifics that I can find yet).
Tests, so far, aren't showing this problem.

This issue can be closed.

tesser...@googlecode.com

Aug 6, 2015, 4:00:09 PM
to tesserac...@googlegroups.com
Updates:
Status: Look-here-for-help

Comment #4 on issue 1496 by zde...@gmail.com: Access Violation - reading
outside image buffer during line detection
https://code.google.com/p/tesseract-ocr/issues/detail?id=1496

Thanks for the info. Do you plan to test VS2015? Will you also post the
pragmas that you used to solve this problem?

tesser...@googlecode.com

Aug 11, 2015, 2:10:33 PM
to tesserac...@googlegroups.com

Comment #5 on issue 1496 by rtaylor...@gmail.com: Access Violation -
reading outside image buffer during line detection
https://code.google.com/p/tesseract-ocr/issues/detail?id=1496

YW

I did some minimal testing with VS2015, but it was using current code
instead of 3.02.02 code where I isolated the original problem. For the
moment we are still building with VS2013, so thorough VS2015 testing may
not happen soon.

One of our customers reported access violation crashes in an earlier
version we built with VS2008, but we haven't gotten enough feedback from
them to certify that they were encountering the same problems.

In .../textord/textlineprojection.cpp I added VS pragma statements like
this (starting @ line 752)

#pragma optimize("g", off)
// Helper truncates the TPOINT to be within the pix_.
void TextlineProjection::TruncateToImageBounds(TPOINT* pt) const {
  pt->x = ClipToRange<int>(pt->x, 0, pixGetWidth(pix_) - 1);
  pt->y = ClipToRange<int>(pt->y, 0, pixGetHeight(pix_) - 1);
}
#pragma optimize("", on)

This turns global (g) optimization off for the TruncateToImageBounds()
method only. I tried disabling optimization at a lower level (i.e. for
ClipToRange() function), but that didn't eliminate the problem.

To make these changes cross-platform you'd want to add some #ifdef brackets
around each pragma so that it is used only when building in the Visual
Studio tool chain.
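
A guarded version might look roughly like the sketch below. It relies on
the _MSC_VER macro that the Visual Studio compiler predefines. Point,
ClipToRangeSketch, and TruncateToBoundsSketch are stand-ins invented here
so that the example compiles on its own; they are not the Tesseract
definitions.

```cpp
#include <algorithm>

// Stand-in types and helpers (hypothetical). The real TPOINT, ClipToRange,
// and image accessors live in Tesseract and Leptonica.
struct Point {
  int x;
  int y;
};

template <typename T>
T ClipToRangeSketch(T value, T lower, T upper) {
  return std::max(lower, std::min(value, upper));
}

// Only the Visual Studio tool chain understands this pragma, so it is
// emitted only when _MSC_VER is defined.
#ifdef _MSC_VER
#pragma optimize("g", off)
#endif
void TruncateToBoundsSketch(Point* pt, int width, int height) {
  pt->x = ClipToRangeSketch<int>(pt->x, 0, width - 1);
  pt->y = ClipToRangeSketch<int>(pt->y, 0, height - 1);
}
#ifdef _MSC_VER
#pragma optimize("", on)  // restore the optimization settings
#endif
```

Other compilers skip both pragmas entirely, so the function builds there
with their normal optimization settings.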

-= Rich

tesser...@googlecode.com

Aug 16, 2015, 3:43:32 PM
to tesserac...@googlegroups.com

Comment #6 on issue 1496 by zde...@gmail.com: Access Violation - reading
outside image buffer during line detection
https://code.google.com/p/tesseract-ocr/issues/detail?id=1496

Thanks, I committed this to github.com:
https://github.com/tesseract-ocr/tesseract/commit/9d359cf58a920ad068a3a4b159e6c3e3b0511f8b