I wrote potential algorithms (outlined hypothetical ways) to deskew text in a picture taken with a camera phone

R Block

May 21, 2016, 2:23:35 PM
to tesseract-dev

Hi, OCR Developers,
  I tried to understand the Tesseract code but was faced with the hurdle of not having worked in C/C++ for years.  My primary language is Java, and my test code is in Java.  It is sloppy, but I can clean it up if someone wants to see it.
I read that rotating the images might help, so I wrote out algorithms to rotate text in an image of a page taken with a camera phone.  I thought this might be of interest to someone.

The short summary is that I assumed the picture was largely in the correct orientation, found blocks of letters, and rotated the blocks individually, since that was my problem.  Many blocks of letters were skewed at different angles depending on the angle the picture was taken at and how the page lay underneath.
Unfortunately, doing this has made my Tesseract output worse than with the original image.

Attached: original on the left, demonstrative result of the process on the right.  In my final image, no red box or white baseline appears.  The extreme white was converted to mauve.

Top here means toward where the title bar is; bottom means the opposite, the same as when you talk about a letter.

The algorithm works like this.
My code assumes the page is only slightly skewed and that the text runs in rows from left to right.

Go through the page and find the bounding boxes for individual letters.  In the picture these are in red.
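Roughly, in Java, this step could look like the sketch below.  This is not my actual code; the class and method names are made up, and it assumes a binarized photo where letters are dark on a light background, with a hard-coded darkness threshold.

import java.awt.Rectangle;
import java.awt.image.BufferedImage;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class DeskewSketch {
    // Collect bounding boxes of dark connected components (assumed to be letters).
    static List<Rectangle> findLetterBoxes(BufferedImage img, int threshold) {
        int w = img.getWidth(), h = img.getHeight();
        boolean[][] seen = new boolean[h][w];
        List<Rectangle> boxes = new ArrayList<>();
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                if (seen[y][x] || !isDark(img, x, y, threshold)) continue;
                // Flood fill this component, tracking its extent.
                int minX = x, maxX = x, minY = y, maxY = y;
                Deque<int[]> stack = new ArrayDeque<>();
                stack.push(new int[]{x, y});
                seen[y][x] = true;
                while (!stack.isEmpty()) {
                    int[] p = stack.pop();
                    minX = Math.min(minX, p[0]); maxX = Math.max(maxX, p[0]);
                    minY = Math.min(minY, p[1]); maxY = Math.max(maxY, p[1]);
                    for (int dy = -1; dy <= 1; dy++) {
                        for (int dx = -1; dx <= 1; dx++) {
                            int nx = p[0] + dx, ny = p[1] + dy;
                            if (nx < 0 || ny < 0 || nx >= w || ny >= h) continue;
                            if (seen[ny][nx] || !isDark(img, nx, ny, threshold)) continue;
                            seen[ny][nx] = true;
                            stack.push(new int[]{nx, ny});
                        }
                    }
                }
                boxes.add(new Rectangle(minX, minY, maxX - minX + 1, maxY - minY + 1));
            }
        }
        return boxes;
    }

    static boolean isDark(BufferedImage img, int x, int y, int threshold) {
        int rgb = img.getRGB(x, y);
        int gray = ((rgb >> 16 & 0xFF) + (rgb >> 8 & 0xFF) + (rgb & 0xFF)) / 3;
        return gray < threshold;
    }
}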
Create a histogram of the Y coordinates of the bounding box bottoms.  Each Y pixel itself cannot be a bucket; group Y pixels together.  You can use the average height of the bounding boxes to make an educated guess.  For now I'll say the bucket size should be 1/4 of the average bounding box height.  If it's too big, you won't be able to see anything; too small and it won't be useful.
This results in a rolling hills kind of data setup with peaks and valleys.
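As a sketch (same made-up class as above, using the List<Rectangle> from the previous step), the histogram could be built like this:

static int[] bottomHistogram(List<Rectangle> boxes, int imageHeight) {
    // Bucket size is 1/4 of the average bounding box height, as described above.
    double avgHeight = boxes.stream().mapToInt(b -> b.height).average().orElse(1);
    int bucketSize = Math.max(1, (int) Math.round(avgHeight / 4));
    int[] counts = new int[imageHeight / bucketSize + 1];
    for (Rectangle b : boxes) {
        int bottomY = b.y + b.height;   // bottom edge of the box
        counts[bottomY / bucketSize]++;
    }
    return counts;
}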

Find the average count across the buckets of the histogram.  This is used to separate the peaks from the valleys.  Go through the Y-pixel histogram from top to bottom.  When the current bucket is less than the average count, you are in a valley.  When the current bucket is greater than or equal to the average count, you have entered a hilltop, and you need to record the peak of each hilltop.  Continue until you reach the bottom.
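A sketch of that walk over the histogram (same made-up class; it returns the bucket index of each hilltop's peak):

static List<Integer> findHilltopPeaks(int[] counts) {
    double mean = java.util.Arrays.stream(counts).average().orElse(0);
    List<Integer> peaks = new ArrayList<>();
    int peak = -1;   // index of the tallest bucket in the current hilltop, -1 while in a valley
    for (int i = 0; i < counts.length; i++) {
        if (counts[i] >= mean) {
            if (peak < 0 || counts[i] > counts[peak]) peak = i;
        } else if (peak >= 0) {
            peaks.add(peak);   // we just left a hilltop; record its peak
            peak = -1;
        }
    }
    if (peak >= 0) peaks.add(peak);   // hilltop running into the bottom of the page
    return peaks;
}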

Now you have a list of hilltops.  Determine the mean distance between hilltops.  You may have to create another histogram.  This mean distance tells you the expected distance between baselines.  This should be greater than the average bounding box height.
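As a simple sketch (skipping the extra histogram), the mean distance could just be:

static double meanPeakDistance(List<Integer> peaks) {
    // Average gap between consecutive hilltop peaks, in buckets; multiply by the
    // bucket size to get the expected baseline-to-baseline distance in pixels.
    if (peaks.size() < 2) return 0;
    double total = 0;
    for (int i = 1; i < peaks.size(); i++) total += peaks.get(i) - peaks.get(i - 1);
    return total / (peaks.size() - 1);
}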

Create buckets to group the bounding boxes into rows.  You will need one bucket for each row that could fit on the page.
Go through the bounding boxes and put each one in the bucket whose row is the shortest Y-distance away.  Any bounding boxes in a row whose X coordinates overlap may be joined together, like the dots of i and j, or letters that got split into multiple chunks.
Postcondition - bounding boxes are now grouped according to row.
There is potential for flaws here: the title row may put the expected rows halfway between the main rows.
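A sketch of the grouping (same made-up class; it assumes the hilltop peaks have already been converted back to pixel Y values, and it leaves out joining overlapping boxes):

static List<List<Rectangle>> groupIntoRows(List<Rectangle> boxes, List<Integer> rowBaselineYs) {
    List<List<Rectangle>> rows = new ArrayList<>();
    for (int i = 0; i < rowBaselineYs.size(); i++) rows.add(new ArrayList<>());
    for (Rectangle b : boxes) {
        int bottomY = b.y + b.height;
        int best = 0;   // index of the row with the shortest Y-distance to this box
        for (int i = 1; i < rowBaselineYs.size(); i++) {
            if (Math.abs(rowBaselineYs.get(i) - bottomY) < Math.abs(rowBaselineYs.get(best) - bottomY)) {
                best = i;
            }
        }
        rows.get(best).add(b);
    }
    return rows;
}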

Hint: You could write images to disk along the way to see if you are getting it right.  Finding bounding boxes is a heavy process, so I wrote the box coordinates to disk and used them as a starting point for the rest of the processing.

Calculate the kern distance, the average distance between letters.
You could do this many ways, such as by row or by page.  I'm currently using a hard-coded value.
Initially I calculated it using a histogram of X distances between bounding boxes.  For this histogram, I did put every pixel in its own bucket.
Then I said word breaks occurred where the distance is 2 times the kern space or greater.
You could do a loop to join bounding boxes whose gap is a few pixels less than the average kern space.
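A sketch of the word-break detection for one row (this version computes the kern space from the row itself rather than using a hard-coded value):

static List<Integer> wordBreakIndices(List<Rectangle> row) {
    row.sort(java.util.Comparator.comparingInt((Rectangle b) -> b.x));
    List<Integer> gaps = new ArrayList<>();
    for (int i = 1; i < row.size(); i++) {
        Rectangle prev = row.get(i - 1), cur = row.get(i);
        gaps.add(Math.max(0, cur.x - (prev.x + prev.width)));
    }
    // The kern distance is the average gap; a gap of 2x or more is a word break.
    double kern = gaps.stream().mapToInt(Integer::intValue).average().orElse(0);
    List<Integer> breaks = new ArrayList<>();
    for (int i = 0; i < gaps.size(); i++) {
        if (gaps.get(i) >= 2 * kern) breaks.add(i + 1);   // break starts before box i+1
    }
    return breaks;
}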

Optional
For each row, you could determine the average bottom Y of the bounding boxes.  I drew this in the image during processing as a thin white line, for tracing progress.

For each row,
  You need to create an image to hold the row, rowImage.  It should be at least as tall as the average row height, maybe more.  You need to determine how far the baseline is from the bottom of the row image, i.e. which Y coordinate is the baseline.  I hard-coded this.
   Go through bounding boxes left to right.
   I started a group by collecting 10 bounding boxes together, and added to it until I reached the end of the line or a word break.  Word breaks give me space to play with.
   I calculated the relative slope of the group by creating a list of slopes along the bottom Y (change in Y divided by change in X) from the first three characters to the last five characters.  Each slope is also tied to its left coordinate (x, y) to represent the baseline.  As letters that fall below the baseline are uncommon, those slopes should be extremes and not the norm.
     Here I have some bugs evident in the image, but the idea is good.  For example, the badly rotated line took the bottom of a set that started and ended with quotes, throwing things off.  This could be fixed by measuring the average distance to the slope line.
   I now have my group, a slope, and a baseline coordinate.  I calculated the edges of my collected bounding boxes into a joined rectangle.  I pushed the edges of the rectangle out some to account for weakness in my bounding box identification.
   Start a new group and repeat until all bounding boxes for the row are processed.
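My exact pairing of slopes (first three characters against last five) is fiddly, so as a sketch here is a plain least-squares fit of the bottom edges instead; it gives the later steps the same two things they need, a slope and a baseline point:

static double[] fitBaseline(List<Rectangle> group) {
    // Least-squares line through the bottom-left corners of the group's boxes.
    int n = group.size();
    double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
    for (Rectangle b : group) {
        double x = b.x, y = b.y + b.height;
        sumX += x; sumY += y; sumXY += x * y; sumXX += x * x;
    }
    double denom = n * sumXX - sumX * sumX;
    double slope = denom == 0 ? 0 : (n * sumXY - sumX * sumY) / denom;
    double intercept = (sumY - slope * sumX) / n;
    // Return the slope plus the baseline point at the left edge of the group.
    double leftX = group.stream().mapToInt(b -> b.x).min().orElse(0);
    return new double[]{slope, leftX, slope * leftX + intercept};
}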

Note: A smarter person could write an algorithm to track significant deviations in average slope, group better, and deal with the letter g along the way.

Hard part - Image processing
   Extract the group from the current image into its own isolated image.  The center of the joined rectangle should be the center of the image.  These were difficult calculations to think through because everything is relative to the center.  The new image should be big enough to handle rotation; here it is the width plus some fluff.  At higher angles, you would want to use the distance from the bottom left to the upper right of the joined rectangle.
   Use arctan on the slope to determine the angle of rotation.
   Rotate the isolated image.
   Calculate the baseline's coordinates relative to the center and where they should have rotated to:
   baselineRotatedX = centerRelativeX * cos(angle) - centerRelativeY * sin(angle)
   baselineRotatedY = centerRelativeX * sin(angle) + centerRelativeY * cos(angle)
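A sketch of the rotation using AffineTransform (the sign of the angle may need flipping depending on how the slope was measured, since image Y grows downward, and it assumes the isolated image already has the fluff padding described above):

// needs: import java.awt.geom.AffineTransform; import java.awt.image.AffineTransformOp;
static BufferedImage rotateGroup(BufferedImage isolated, double slope, double[] baselineRel) {
    double angle = -Math.atan(slope);   // undo the skew of the fitted baseline
    java.awt.geom.AffineTransform at = java.awt.geom.AffineTransform.getRotateInstance(
            angle, isolated.getWidth() / 2.0, isolated.getHeight() / 2.0);
    BufferedImage rotated = new java.awt.image.AffineTransformOp(
            at, java.awt.image.AffineTransformOp.TYPE_BILINEAR).filter(isolated, null);
    // Rotate the center-relative baseline point with the same formulas as above.
    double cos = Math.cos(angle), sin = Math.sin(angle);
    double rx = baselineRel[0] * cos - baselineRel[1] * sin;
    double ry = baselineRel[0] * sin + baselineRel[1] * cos;
    baselineRel[0] = rx;   // updated in place: where the baseline point ended up
    baselineRel[1] = ry;
    return rotated;
}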
 Extremely hard part
   You need to transfer the rotated image into the rowImage, positioned relative to the baseline.  I placed these rotated images in sequence along X, simply one after the other, separated by the width of a word break.  This is why my tabs have fallen to the left.
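A sketch of that placement, assuming baselinePoint holds the baseline point's (x, y) inside the rotated group image (not relative to its center):

static int placeGroup(BufferedImage rowImage, BufferedImage rotatedGroup,
                      double[] baselinePoint, int cursorX, int rowBaselineY, int wordBreak) {
    java.awt.Graphics2D g = rowImage.createGraphics();
    // Shift the group vertically so its baseline lands on the row's baseline.
    int destY = rowBaselineY - (int) Math.round(baselinePoint[1]);
    g.drawImage(rotatedGroup, cursorX, destY, null);
    g.dispose();
    // Next group starts one word break to the right (this is why tabs fall left).
    return cursorX + rotatedGroup.getWidth() + wordBreak;
}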

Assemble all the row images together into one page image.  Write to disk.
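A sketch of the final assembly (it needs javax.imageio.ImageIO and java.io.File):

static void writePage(List<BufferedImage> rowImages, java.io.File out) throws java.io.IOException {
    int width = rowImages.stream().mapToInt(BufferedImage::getWidth).max().orElse(1);
    int height = Math.max(1, rowImages.stream().mapToInt(BufferedImage::getHeight).sum());
    BufferedImage page = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
    java.awt.Graphics2D g = page.createGraphics();
    g.setColor(java.awt.Color.WHITE);
    g.fillRect(0, 0, width, height);
    int y = 0;
    for (BufferedImage row : rowImages) {   // stack the rows top to bottom
        g.drawImage(row, 0, y, null);
        y += row.getHeight();
    }
    g.dispose();
    javax.imageio.ImageIO.write(page, "png", out);
}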