Thanks,
-Dave
So the first method is to devise a special config file and pass it to
Tesseract on the command line. The following values need to be in this
config file:
tessedit_pageseg_mode 1 or 3 (I recommend 3)
textord_tabfind_find_tables T
textord_tablefind_recognize_tables T
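As a sketch (the filename is hypothetical, and exactly how config files are located varies by Tesseract version - they normally live under tessdata/configs), such a config file could look like this:

```
# configs/tables  (hypothetical name)
tessedit_pageseg_mode 3
textord_tabfind_find_tables T
textord_tablefind_recognize_tables T
```

and would be passed as the trailing config argument, e.g. `tesseract input.tif output tables`.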
You can experiment with the last parameter, trying both T and F. I
give no guarantee that the whole method works; I only found these
clues by studying the code. I suspect the corresponding pieces of code
may not work perfectly, or that there are more parameters that
influence table recognition. Please try this yourself. It would be
nice if you shared your results with the community. Sample images are
also appreciated.
The second method is to pre-process your images: remove the lines and
borders and pass the cleaned image to Tesseract. Many issues can arise
in this process, but there's no need to go into them now unless you
express some interest.
Warm regards,
Dmitry Silaev
Yeah, I was thinking of preprocessing to remove all the straight
lines/borders too, but I haven't found a good approach yet. I can
clean up the margins, headers, and footers, but I haven't found a good
way to remove the table row lines. If you or others have any
suggestions I would love to hear them.
I will also experiment with the config file.
Thanks much!
-Dave
There are a number of methods you can use to remove straight lines or
borders, either individually or in combination. The simplest are: the
Hough line detector (http://en.wikipedia.org/wiki/Hough_transform);
the vertical/horizontal projection profile method (X and Y histograms
of foreground pixel counts - detect lines where the bin counts are
highest, or table cell margins where they are lowest); connected
component analysis (detect nested CCs - the outer ones serve as
borders); and methods based on alignment analysis. If your documents
may be skewed, some of these methods require deskewing first.
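To illustrate the projection-profile idea, here is a minimal pure-Python sketch on a toy binary image (all names are illustrative; in practice you would run this on a real bitmap obtained via Leptonica, OpenCV, or similar):

```python
# Toy binary image: 1 = foreground (ink), 0 = background.
# Rows 3 and 7 are solid horizontal rules; the rest is sparse "text".
image = [
    [0, 1, 0, 0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],   # horizontal rule
    [0, 0, 1, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],   # horizontal rule
]

def horizontal_profile(img):
    """Y histogram: count of foreground pixels in each row."""
    return [sum(row) for row in img]

def find_line_rows(img, fill_ratio=0.9):
    """Rows whose fill exceeds fill_ratio of the width are ruling lines."""
    width = len(img[0])
    return [y for y, count in enumerate(horizontal_profile(img))
            if count >= fill_ratio * width]

def remove_rows(img, rows):
    """Blank out the detected ruling rows."""
    return [[0] * len(row) if y in set(rows) else list(row)
            for y, row in enumerate(img)]

lines = find_line_rows(image)
print(lines)  # -> [3, 7]
cleaned = remove_rows(image, lines)
```

The same transposed logic (summing columns instead of rows) finds vertical rules; the "least bin count" variant finds cell margins instead of lines.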
After you detect the table borders, you can get the bounding boxes of
individual cells and then pass them to Tesseract. I think small
single-row portions of text - as long as they still allow the baseline
and x-height to be determined - are often much easier for Tesseract to
recognize than full-sized pages, even pages with no tables in them.
This is because of Tesseract's native layout analysis. To disable it
(or avoid it as much as possible) you would set "pageseg_mode" to
PSM_SINGLE_BLOCK, PSM_SINGLE_LINE, PSM_SINGLE_WORD, or even
PSM_SINGLE_CHAR. In my experience, PSM_SINGLE_WORD and PSM_SINGLE_CHAR
work best, as they almost entirely bypass Tesseract's layout analysis;
then come PSM_SINGLE_LINE and PSM_SINGLE_BLOCK. However, for
PSM_SINGLE_WORD or PSM_SINGLE_CHAR you'd need to do your own
segmentation. I don't know if you are ready to dive into such serious
development.
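For reference, here are the numeric values behind those PSM names (from Tesseract's PageSegMode enum) and a small helper that builds the matching command line. Note the flag is spelled `-psm` in Tesseract 3.x and `--psm` in 4.x/5.x, and the filenames below are made up:

```python
import shlex

# Numeric page segmentation modes from Tesseract's PageSegMode enum.
PSM = {
    "PSM_SINGLE_BLOCK": 6,   # a single uniform block of text
    "PSM_SINGLE_LINE": 7,    # a single text line
    "PSM_SINGLE_WORD": 8,    # a single word
    "PSM_SINGLE_CHAR": 10,   # a single character
}

def tesseract_command(image, outbase, mode):
    """Build a Tesseract 4+/5 command line for the given segmentation
    mode (Tesseract 3.x spelled the flag `-psm` with a single dash)."""
    return "tesseract {} {} --psm {}".format(
        shlex.quote(image), shlex.quote(outbase), PSM[mode])

print(tesseract_command("cell_0_3.tif", "cell_0_3", "PSM_SINGLE_WORD"))
# -> tesseract cell_0_3.tif cell_0_3 --psm 8
```

From Java you could build the same string and run it through ProcessBuilder, one invocation per segmented cell.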
HTH
Warm regards,
Dmitry Silaev
How about this technique mentioned in the Leptonica documentation
(it's even easier if you can use binary morphology): "Removing dark
lines from a light pencil drawing" at
http://tpgit.github.com/UnOfficialLeptDocs/leptonica/line-removal.html
.
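The core idea there - isolate long dark lines with a morphological opening using a wide horizontal structuring element, then subtract them - can be sketched in pure Python on a toy bitmap (illustrative only; the article itself uses Leptonica's optimized routines):

```python
def erode_h(img, k):
    """Erosion with a 1 x k horizontal structuring element: a pixel
    survives only if its whole k-wide horizontal neighbourhood is ink."""
    h, w, r = len(img), len(img[0]), k // 2
    return [[1 if all(0 <= x + d < w and img[y][x + d]
                      for d in range(-r, r + 1)) else 0
             for x in range(w)] for y in range(h)]

def dilate_h(img, k):
    """Dilation with the same element: a pixel turns on if any pixel
    in its k-wide horizontal window is ink."""
    h, w, r = len(img), len(img[0]), k // 2
    return [[1 if any(0 <= x + d < w and img[y][x + d]
                      for d in range(-r, r + 1)) else 0
             for x in range(w)] for y in range(h)]

image = [
    [0, 1, 1, 1, 0, 0, 1, 1, 0, 0],  # short "text" strokes
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],  # a long dark rule
    [0, 0, 1, 1, 1, 0, 0, 1, 0, 0],  # more "text"
]

# Opening = erosion then dilation; with k=5 only horizontal runs of
# at least 5 pixels survive, so the rule is isolated and the text is not.
opened = dilate_h(erode_h(image, 5), 5)
# Subtract the detected line pixels from the original.
cleaned = [[int(p and not o) for p, o in zip(prow, orow)]
           for prow, orow in zip(image, opened)]
```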
-- TP
I think this article is more an example of what can be done with
Leptonica from a user's, not a developer's, point of view. It's like
taking one concrete image into Photoshop and trying to achieve what
you have in your head: you try various filters, apply transformations,
effects, etc. However, none of this can be applied automatically -
every time you need to choose parameters manually and make decisions
specifically for that very image.
Imho this is the reason the author chose morphology - "oh, great! that
worked!". It's easier to use in one function call, but in the
overwhelming majority of cases an "algorithmic" approach gives much
more precise results. In real situations morphology requires you to do
a great deal of cleanup after it has done its work, and that cleanup
can be far more complex and less mathematically elegant than the
morphology algorithms themselves. Another reason I try to stay away
from morphology is that it is inherently slow compared to other
methods, despite the recent emergence of some fast techniques. By the
way, the article advertises a processing speed of 1 Mpix/sec, which I
think is relatively slow for the intended goal, even on yesterday's
P4s.
The moral is: you can use this article as a guideline or maybe just
for several specific images. However it's not well suited for
automatic processing.
P.S.: This is my own opinion, and it does not necessarily coincide
with the views of other document image processing people.
Warm regards,
Dmitry Silaev
I used this paper (for pre-processing):
Lee and Ryu, "Parameter-Free Geometric Document Layout Analysis",
IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23,
no. 11, Nov. 2001, pp. 1240-1256.
Best Regards,
Vicky
Hello,
Can you tell me more about this paper? It looks like this is not a
free document so I can't just read it to see if it would solve the
problem I have.
My problem is that I have grey-scale image data (tif/jpg/etc.) that
contains text within a table format, i.e. cells on the page. The
documents were originally faxed and then converted to PDF, so the
image quality varies from poor to good. I don't want the table
formatting; I'm looking for a way to remove it and get to just the
image text, which I then want to convert to text using OCR - Tesseract
or otherwise.
My programming environment is Java, but I can shell out to other
programs if I need to.
Would the approach in the paper solve this problem space? How
practical is the software solution for a one man effort?
Thanks,
-Dave
I always say the same thing: send your sample images and the community
will try to help.
Warm regards,
Dmitry Silaev
Yep, the quality is relatively poor, so don't expect high accuracy
from Tess. Do you need every table cell's contents? Or is getting the
numbers enough, so that in a next step you can restore the
[predefined] item names?
Warm regards,
Dmitry Silaev
On Mon, Mar 14, 2011 at 4:19 PM, David Hoffer <dhof...@gmail.com> wrote:
> Dmitry,
>
> That would be great thanks for the offer, I'll attach two samples.
>
> These two are good examples of the range of quality. What I need to
> do is extract cell data for processing. I can generate these in any
> image format, tiff, jpeg if one should be preferred.
>
> Best regards,
> -Dave
On Mon, Mar 14, 2011 at 11:07 AM, Dmitry Silaev <daemo...@gmail.com> wrote:
1. Deskew
2. Cut out excess whitespace using hor/ver projection profile
3. Determine aspect ratio (AR)
4. Based on AR determine location of significant areas (columns with
numbers, much the same method for other areas in the header)
5. Do the connected component (CC) labeling
(http://en.wikipedia.org/wiki/Connected_Component_Labeling)
6. Remove speckle noise
7. Apply approximate predefined cell bounding boxes to locate cell contents
8. In each cell locate potential table borders using hor projection profile
9. Remove the table borders. There might be pixels shared between a
table border segment and significant CCs (digits or letters). For
every such suspicious case, run recognition repeatedly and choose the
most probable separation based on Tesseract's highest confidence.
10. Recognize the unsuspicious CCs in the usual way, selectively
applying whitelists based on each cell's semantics to increase
accuracy.
Something like that. Again, there can be other ways to do what you
want, but I'd do it this way.
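Steps 5 and 6 (connected component labeling and speckle removal) can be sketched in pure Python with a BFS flood fill - a toy stand-in for a real labeling routine from e.g. Leptonica or OpenCV:

```python
from collections import deque

def label_components(img):
    """4-connected component labeling by BFS (step 5).
    Returns a label image and a dict of label -> pixel count."""
    h, w = len(img), len(img[0])
    labels = [[0] * w for _ in range(h)]
    sizes = {}
    next_label = 0
    for y in range(h):
        for x in range(w):
            if img[y][x] and not labels[y][x]:
                next_label += 1
                labels[y][x] = next_label
                queue, count = deque([(y, x)]), 1
                while queue:
                    cy, cx = queue.popleft()
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and img[ny][nx] and not labels[ny][nx]):
                            labels[ny][nx] = next_label
                            queue.append((ny, nx))
                            count += 1
                sizes[next_label] = count
    return labels, sizes

def remove_speckle(img, labels, sizes, min_size=3):
    """Step 6: drop components smaller than min_size pixels."""
    return [[1 if p and sizes[labels[y][x]] >= min_size else 0
             for x, p in enumerate(row)] for y, row in enumerate(img)]

# Toy bitmap: a 2x2 blob, a lone speckle pixel, and an L-shaped stroke.
image = [
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 1],
]
labels, sizes = label_components(image)
cleaned = remove_speckle(image, labels, sizes, min_size=3)
```

The component bounding boxes (min/max x and y per label) then feed directly into steps 7-10.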
Warm regards,
Dmitry Silaev
On Mon, Mar 14, 2011 at 4:42 PM, David Hoffer <dhof...@gmail.com> wrote:
> Dmitry,
>
> I just need to get the numbers and know what 'item' the numbers go
> with...so I don't even have to rebuild the actual item name. However
> in the header there is some text...names, addresses, etc that I had to
> remove for privacy reasons...but it's similar I need to get the
> data...not the item text...as long as I can figure out what item the
> data goes with I am good to go.
>
> Best regards,
> -Dave
Would using a lossless format like TIFF be preferred?
(I'm going to give this a try but some of these steps might be a bit
more than I can handle...I'm not an image processing guru.)
-Dave
In what format and at what resolution do you initially get your
images? At such poor quality, every conversion makes an image even
worse...
Warm regards,
Dmitry Silaev
On Tue, Mar 15, 2011 at 8:31 AM, David Hoffer <dhof...@gmail.com> wrote:
> Dmitry,
>
> Originally the documents are PDFs with the images CCITTFax-encoded;
> I decoded them using iText. At this point I have a BufferedImage,
> which I can save in any format supported by Java. I assume TIFF
> would be one of the best.
>
> Best regards,
> -Dave