How to recognize only parts (zones) of a document?

Reimmann

Oct 12, 2007, 3:35:59 AM

to tesseract-ocr

Hi,

I'm trying out Tesseract 2.01. I have a document with two columns of
text. The quality of Tesseract's recognition is very good, but the
columns get mixed up because Tesseract recognizes the characters line by
line. So I would like to define two separate zones that are recognized
one after the other. I tried a TIFF image together with a "zone file"
that I found on the UNLV site, but this does not work. My command line
looks like this:

tesseract in.tif out.txt -l deu in.zone

in.tif is not compressed.

When I debug this, the program exits at line 234 of variables.cpp while
calling read_variables.

Can anyone help?

Does anyone have a working pair of TIFF file and configuration file for
recognizing parts of a document?

Thanks in advance,

Chris from Aachen, Germany

Scan...@gmail.com

Oct 12, 2007, 9:28:13 AM

to tesseract-ocr

Tesseract does not support multiple columns at this point. You can write
your own zoning software and then use the DLL interface to recognize
each zone separately.

Ray Smith

Oct 12, 2007, 12:47:49 PM

to tesser...@googlegroups.com

If you have made a correctly formatted UNLV zone file, then you should name it in.uzn and use this command line:

tesseract in.tif out.txt -l deu

The in.uzn file will be found based on the name of the input tif file.
Ray.
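
As an illustration of the naming rule described above, here is a minimal Python sketch. The helper names zone_file_for and tesseract_command are hypothetical and not part of Tesseract. The sketch only assumes what the post states: the zone file sits next to the image, shares its base name, and is not passed on the command line.

```python
from pathlib import Path


def zone_file_for(image_path):
    """Return the .uzn path that shares the image's directory and base name.

    Hypothetical helper: it mirrors the naming rule from the post
    (in.tif -> in.uzn) and does not call Tesseract.
    """
    return Path(image_path).with_suffix(".uzn")


def tesseract_command(image_path, output, lang):
    """Build the argument list used in the post.

    The zone file is deliberately absent: per the post it is located
    from the name of the input image, not given as an argument.
    """
    return ["tesseract", image_path, output, "-l", lang]


if __name__ == "__main__":
    print(zone_file_for("in.tif"))
    print(" ".join(tesseract_command("in.tif", "out.txt", "deu")))
```

Checking that the derived .uzn path exists before starting the run is a cheap way to catch a misnamed zone file such as in.zone.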

Christoph Reimmann

Oct 14, 2007, 1:38:29 PM

to tesseract-ocr

Hi Ray,

Thanks for your answer.

I tried OCR with in.uzn and it worked very well. Thanks.

But what makes a zone file correctly formatted? I can't find any
documentation. Do you know whether there is any?

Thanks again in advance, Chris

Glen Rubin

May 24, 2014, 12:54:43 AM

to tesser...@googlegroups.com

I would also like more information on how to make a UZN file that fits my image. Thanks!

zdenko podobny

May 25, 2014, 11:58:59 AM

to tesser...@googlegroups.com

A uzn file is a simple text file with one area per line. Each area needs to have this structure [1]:

x y width height description

x, y, width, and height are integers separated by spaces.
description is text that is not used by Tesseract, but it can help you label the area (e.g. header, footer, body...).
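
A minimal Python sketch of the structure described above, applied to the two-column case from the start of this thread. The function names, the page size, the gutter width, and the labels are all assumptions made for illustration. The sketch also assumes pixel coordinates with the origin at the top-left corner of the image and a one-word description. Neither is stated in this thread, so check against the linked source [1] before relying on it.

```python
def format_zone(x, y, width, height, label):
    """Format one zone as an 'x y width height description' line."""
    for value in (x, y, width, height):
        if not isinstance(value, int) or value < 0:
            raise ValueError("x, y, width and height must be non-negative integers")
    # Assumption: keep the description to a single word so that it cannot
    # be confused with the numeric fields of a following zone.
    if not label or any(ch.isspace() for ch in label):
        raise ValueError("label must be one word without whitespace")
    return f"{x} {y} {width} {height} {label}"


def two_column_zones(page_width, page_height, gutter):
    """Split a page into a left and a right zone separated by a gutter.

    Assumption: coordinates are pixels measured from the top-left corner.
    """
    column_width = (page_width - gutter) // 2
    right_x = page_width - column_width
    return [
        (0, 0, column_width, page_height, "left_column"),
        (right_x, 0, column_width, page_height, "right_column"),
    ]


def write_uzn(path, zones):
    """Write one zone per line to the given .uzn path."""
    with open(path, "w", encoding="ascii") as handle:
        for zone in zones:
            handle.write(format_zone(*zone) + "\n")


if __name__ == "__main__":
    # Hypothetical 1000 x 1400 pixel page with a 40 pixel gutter.
    for zone in two_column_zones(1000, 1400, 40):
        print(format_zone(*zone))
```

Calling write_uzn("in.uzn", two_column_zones(1000, 1400, 40)) would place the file next to in.tif. The left column is listed first on the assumption that zones are read in file order.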

For examples, see some of the files in isri-ocr-evaluation-tools [2].

[1] https://code.google.com/p/tesseract-ocr/source/browse/trunk/ccstruct/blread.cpp?r=1064#54

[2] https://code.google.com/p/isri-ocr-evaluation-tools/downloads/detail?name=zset.4B.tar.gz&can=2&q=

Zdenko
