Thanks,
-Dave
So the first method is to devise a special config file and pass it to
Tesseract on the command line. The following values need to be in this
config file:
tessedit_pageseg_mode 1 or 3 (I recommend 3)
textord_tabfind_find_tables T
textord_tablefind_recognize_tables T
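As a sketch (the filename is hypothetical, and exactly how config files are located varies by Tesseract version - they normally live under tessdata/configs), such a config file could look like this:

```
# configs/tables  (hypothetical name)
tessedit_pageseg_mode 3
textord_tabfind_find_tables T
textord_tablefind_recognize_tables T
```

and would be passed as the trailing config argument, e.g. `tesseract input.tif output tables`.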
You can experiment with the last parameter, trying both T and F. I
give no guarantee that the whole method works; I only found these
clues by studying the code. I suspect the corresponding pieces of code
may not work perfectly, or that there are more parameters that
influence table recognition. Please try this yourself. It would be
nice if you shared your results with the community. Sample images are
also appreciated.
The second method is to pre-process your images: remove the lines and
borders and pass the cleaned image to Tesseract. Many issues can arise
in this process, but there's no need to go into them now unless you
express some interest.
Warm regards,
Dmitry Silaev
Yeah, I was thinking of preprocessing to remove all the straight
lines/borders too, but I haven't found a good approach yet. I can
clean up the margins, headers, and footers, but I haven't found a good
way to remove the table row lines. If you or others have any
suggestions I would love to hear them.
I will also experiment with the config file.
Thanks much!
-Dave
There are a number of methods you can use to remove straight lines or
borders, either individually or in combination. The simplest are: the
Hough line detector (http://en.wikipedia.org/wiki/Hough_transform);
the vertical/horizontal projection profile method (X and Y histograms
of foreground pixel counts - detect lines where the bin counts are
highest, or table cell margins where they are lowest); connected
component analysis (detect nested CCs - the outer ones serve as
borders); and methods based on alignment analysis. If your documents
may be skewed, some of these methods require deskewing first.
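To illustrate the projection-profile idea, here is a minimal pure-Python sketch on a toy binary image (all names are illustrative; in practice you would run this on a real bitmap obtained via Leptonica, OpenCV, or similar):

```python
# Toy binary image: 1 = foreground (ink), 0 = background.
# Rows 3 and 7 are solid horizontal rules; the rest is sparse "text".
image = [
    [0, 1, 0, 0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],   # horizontal rule
    [0, 0, 1, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],   # horizontal rule
]

def horizontal_profile(img):
    """Y histogram: count of foreground pixels in each row."""
    return [sum(row) for row in img]

def find_line_rows(img, fill_ratio=0.9):
    """Rows whose fill exceeds fill_ratio of the width are ruling lines."""
    width = len(img[0])
    return [y for y, count in enumerate(horizontal_profile(img))
            if count >= fill_ratio * width]

def remove_rows(img, rows):
    """Blank out the detected ruling rows."""
    return [[0] * len(row) if y in set(rows) else list(row)
            for y, row in enumerate(img)]

lines = find_line_rows(image)
print(lines)  # -> [3, 7]
cleaned = remove_rows(image, lines)
```

The same transposed logic (summing columns instead of rows) finds vertical rules; the "least bin count" variant finds cell margins instead of lines.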
After you detect the table borders, you can get the bounding boxes of
individual cells and then pass them to Tesseract. I think small
single-row portions of text - as long as they still allow the baseline
and x-height to be determined - are often much easier for Tesseract to
recognize than full-sized pages, even pages with no tables in them.
This is because of Tesseract's native layout analysis. To disable it
(or avoid it as much as possible) you would set "pageseg_mode" to
PSM_SINGLE_BLOCK, PSM_SINGLE_LINE, PSM_SINGLE_WORD, or even
PSM_SINGLE_CHAR. In my experience, PSM_SINGLE_WORD and PSM_SINGLE_CHAR
work best, as they almost entirely bypass Tesseract's layout analysis;
then come PSM_SINGLE_LINE and PSM_SINGLE_BLOCK. However, for
PSM_SINGLE_WORD or PSM_SINGLE_CHAR you'd need to do your own
segmentation. I don't know if you are ready to dive into such serious
development.
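For reference, here are the numeric values behind those PSM names (from Tesseract's PageSegMode enum) and a small helper that builds the matching command line. Note the flag is spelled `-psm` in Tesseract 3.x and `--psm` in 4.x/5.x, and the filenames below are made up:

```python
import shlex

# Numeric page segmentation modes from Tesseract's PageSegMode enum.
PSM = {
    "PSM_SINGLE_BLOCK": 6,   # a single uniform block of text
    "PSM_SINGLE_LINE": 7,    # a single text line
    "PSM_SINGLE_WORD": 8,    # a single word
    "PSM_SINGLE_CHAR": 10,   # a single character
}

def tesseract_command(image, outbase, mode):
    """Build a Tesseract 4+/5 command line for the given segmentation
    mode (Tesseract 3.x spelled the flag `-psm` with a single dash)."""
    return "tesseract {} {} --psm {}".format(
        shlex.quote(image), shlex.quote(outbase), PSM[mode])

print(tesseract_command("cell_0_3.tif", "cell_0_3", "PSM_SINGLE_WORD"))
# -> tesseract cell_0_3.tif cell_0_3 --psm 8
```

From Java you could build the same string and run it through ProcessBuilder, one invocation per segmented cell.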
HTH
Warm regards,
Dmitry Silaev
How about this technique mentioned in the Leptonica documentation
(it's even easier if you can use binary morphology): "Removing dark
lines from a light pencil drawing" at
http://tpgit.github.com/UnOfficialLeptDocs/leptonica/line-removal.html
.
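The core idea there - isolate long dark lines with a morphological opening using a wide horizontal structuring element, then subtract them - can be sketched in pure Python on a toy bitmap (illustrative only; the article itself uses Leptonica's optimized routines):

```python
def erode_h(img, k):
    """Erosion with a 1 x k horizontal structuring element: a pixel
    survives only if its whole k-wide horizontal neighbourhood is ink."""
    h, w, r = len(img), len(img[0]), k // 2
    return [[1 if all(0 <= x + d < w and img[y][x + d]
                      for d in range(-r, r + 1)) else 0
             for x in range(w)] for y in range(h)]

def dilate_h(img, k):
    """Dilation with the same element: a pixel turns on if any pixel
    in its k-wide horizontal window is ink."""
    h, w, r = len(img), len(img[0]), k // 2
    return [[1 if any(0 <= x + d < w and img[y][x + d]
                      for d in range(-r, r + 1)) else 0
             for x in range(w)] for y in range(h)]

image = [
    [0, 1, 1, 1, 0, 0, 1, 1, 0, 0],  # short "text" strokes
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],  # a long dark rule
    [0, 0, 1, 1, 1, 0, 0, 1, 0, 0],  # more "text"
]

# Opening = erosion then dilation; with k=5 only horizontal runs of
# at least 5 pixels survive, so the rule is isolated and the text is not.
opened = dilate_h(erode_h(image, 5), 5)
# Subtract the detected line pixels from the original.
cleaned = [[int(p and not o) for p, o in zip(prow, orow)]
           for prow, orow in zip(image, opened)]
```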
-- TP
I think this article is more an example of what can be done with
Leptonica from a user's, not a developer's, point of view. It's like
taking one concrete image into Photoshop and trying to achieve what
you have in your head: you try various filters, apply transformations,
effects, etc. However, none of this can be applied automatically -
every time you need to choose parameters manually and make decisions
specifically for that very image.
Imho this is the reason the author chose morphology - "oh, great! that
worked!". It's easier to use in one function call, but in the
overwhelming majority of cases an "algorithmic" approach gives much
more precise results. In real situations morphology requires you to do
a great deal of cleanup after it has done its work, and that cleanup
can be far more complex and less mathematically elegant than the
morphology algorithms themselves. Another reason I try to stay away
from morphology is that it is inherently slow compared to other
methods, despite the recent emergence of some fast techniques. By the
way, the article advertises a processing speed of 1 Mpix/sec, which I
think is relatively slow for the intended goal, even on yesterday's
P4s.
The moral is: you can use this article as a guideline or maybe just
for several specific images. However it's not well suited for
automatic processing.
P.S.: This is my own opinion, and it does not necessarily coincide
with the views of other document image processing people.
Warm regards,
Dmitry Silaev
I used this paper (for pre-processing):
Lee and Ryu, "Parameter-Free Geometric Document Layout Analysis",
IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23,
no. 11, Nov. 2001, pp. 1240-1256.
Best Regards,
Vicky
Hello,
Can you tell me more about this paper? It looks like this is not a
free document so I can't just read it to see if it would solve the
problem I have.
My problem is that I have grey-scale image data (tif/jpg/etc.) that
contains text within a table format, i.e. cells on the page. The
documents were originally faxed and then converted to PDF, so the
image quality varies from poor to good. I don't want the table
formatting; I'm looking for a way to remove it and get to just the
image text, which I then want to convert to text using OCR - Tesseract
or otherwise.
My programming environment is Java, but I can shell out to other
programs if I need to.
Would the approach in the paper solve this problem space? How
practical is the software solution for a one man effort?
Thanks,
-Dave
I always say the same thing: send your sample images and the community
will try to help.
Warm regards,
Dmitry Silaev
Yep, the quality is relatively poor, so don't expect high accuracy
from Tess. Do you need every table cell's contents? Or is getting the
numbers enough, so that in a next step you can restore the
[predefined] item names?
Warm regards,
Dmitry Silaev
On Mon, Mar 14, 2011 at 4:19 PM, David Hoffer <dhof...@gmail.com> wrote:
> Dmitry,
>
> That would be great thanks for the offer, I'll attach two samples.
>
> These two are good examples of the range of quality. What I need to
> do is extract cell data for processing. I can generate these in any
> image format, tiff, jpeg if one should be preferred.
>
> Best regards,
> -Dave
On Mon, Mar 14, 2011 at 11:07 AM, Dmitry Silaev <daemo...@gmail.com> wrote:
1. Deskew
2. Cut out excess whitespace using hor/ver projection profile
3. Determine aspect ratio (AR)
4. Based on AR determine location of significant areas (columns with
numbers, much the same method for other areas in the header)
5. Do the connected component (CC) labeling
(http://en.wikipedia.org/wiki/Connected_Component_Labeling)
6. Remove speckle noise
7. Apply approximate predefined cell bounding boxes to locate cell contents
8. In each cell locate potential table borders using hor projection profile
9. Remove the table borders. There might be pixels shared between a
table border segment and significant CCs (digits or letters). For
every such suspicious case, run recognition repeatedly and choose the
most probable separation based on Tesseract's highest confidence.
10. Recognize the unsuspicious CCs in the usual way, selectively
applying whitelists based on each cell's semantics to increase
accuracy.
Something like that. Again, there can be other ways to do what you
want, but I'd do it this way.
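Steps 5 and 6 (connected component labeling and speckle removal) can be sketched in pure Python with a BFS flood fill - a toy stand-in for a real labeling routine from e.g. Leptonica or OpenCV:

```python
from collections import deque

def label_components(img):
    """4-connected component labeling by BFS (step 5).
    Returns a label image and a dict of label -> pixel count."""
    h, w = len(img), len(img[0])
    labels = [[0] * w for _ in range(h)]
    sizes = {}
    next_label = 0
    for y in range(h):
        for x in range(w):
            if img[y][x] and not labels[y][x]:
                next_label += 1
                labels[y][x] = next_label
                queue, count = deque([(y, x)]), 1
                while queue:
                    cy, cx = queue.popleft()
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and img[ny][nx] and not labels[ny][nx]):
                            labels[ny][nx] = next_label
                            queue.append((ny, nx))
                            count += 1
                sizes[next_label] = count
    return labels, sizes

def remove_speckle(img, labels, sizes, min_size=3):
    """Step 6: drop components smaller than min_size pixels."""
    return [[1 if p and sizes[labels[y][x]] >= min_size else 0
             for x, p in enumerate(row)] for y, row in enumerate(img)]

# Toy bitmap: a 2x2 blob, a lone speckle pixel, and an L-shaped stroke.
image = [
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 1],
]
labels, sizes = label_components(image)
cleaned = remove_speckle(image, labels, sizes, min_size=3)
```

The component bounding boxes (min/max x and y per label) then feed directly into steps 7-10.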
Warm regards,
Dmitry Silaev
On Mon, Mar 14, 2011 at 4:42 PM, David Hoffer <dhof...@gmail.com> wrote:
> Dmitry,
>
> I just need to get the numbers and know what 'item' the numbers go
> with...so I don't even have to rebuild the actual item name. However
> in the header there is some text...names, addresses, etc that I had to
> remove for privacy reasons...but it's similar I need to get the
> data...not the item text...as long as I can figure out what item the
> data goes with I am good to go.
>
> Best regards,
> -Dave
Would using a lossless format like TIFF be preferred?
(I'm going to give this a try but some of these steps might be a bit
more than I can handle...I'm not an image processing guru.)
-Dave
In what format and at what resolution do you initially get your
images? At such poor quality, every conversion makes an image even
worse...
Warm regards,
Dmitry Silaev
On Tue, Mar 15, 2011 at 8:31 AM, David Hoffer <dhof...@gmail.com> wrote:
> Dmitry,
>
> Originally the documents are PDFs with the images CCITTFax-encoded;
> I decoded them using iText. At this point I have a BufferedImage,
> which I can save in any format supported by Java. I assume TIFF
> would be one of the best.
>
> Best regards,
> -Dave