File Uzn using tesseract 3

Di Perna Francesco

Jul 4, 2012, 7:16:27 AM
to tesseract-ocr, daniele....@gmail.com
Hi, we use Tesseract in a web application to recognize some numbers in
documents acquired with a scanner.
With Tesseract 2 we used a "uzn" file to indicate which areas
of the TIFF file contain the numbers to be recognized (the uzn file should
have the same name as the TIFF file, with the "uzn" extension).
We have now installed Tesseract 3. My mistake was to assume that the uzn
file works as in the previous version, but it doesn't.
Can anyone explain how to recognize specific areas of the file in Tesseract
3?
Regards

Di Perna Francesco

Jul 5, 2012, 11:00:10 AM
to tesseract-ocr
Ok. No one can help me.
I have found the solution anyway....:-)
Calling tesseract with the parameter "-psm 4" and renaming the uzn file
so that it has the same name as the image seems to work.
Bye


Sven Pedersen

Jul 6, 2012, 11:45:49 AM
to tesser...@googlegroups.com
Thanks for sharing your solution!
--Sven




llozano

Jun 19, 2013, 9:20:04 AM
to tesser...@googlegroups.com
Francesco,

Would you mind posting what this uzn file looks like and what the full command should be?
I'm starting to research this area for a project and I'm a bit puzzled. All I know is that I need to specify the areas of a document to extract text from. The document is laid out in tables. Do I need to remove the table lines if I specify areas?

Thanks

zdenko podobny

Jun 19, 2013, 2:18:40 PM
to tesser...@googlegroups.com
On Wed, Jun 19, 2013 at 3:20 PM, llozano <fix...@gmail.com> wrote:
> Francesco,
>
> Would you mind posting what this uzn file looks like


> and what the full command should be?

As far as I remember, if you use a psm value greater than 3, Tesseract will look for a uzn file (based on the image name). If you are on Linux, you can easily check this with strace.

So you can try something like this:
tesseract 8309_016.2B.tif 8309_016.2B_psm4 -psm 4
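
The naming rule and the command above can be sketched in Python. This is only an illustration: the zone coordinates and labels are made up, and the line layout (left, top, width, height, label, in pixels) is an assumption about the uzn format that should be checked against a zone file known to work. The helper names are hypothetical, and the script only prints the command rather than running Tesseract.

```python
import os

# Hypothetical zones in pixels: (left, top, width, height, label).
# These values are invented for illustration only.
ZONES = [
    (100, 50, 400, 60, "InvoiceNumber"),
    (100, 200, 400, 60, "Total"),
]


def uzn_path_for(image_path):
    """Return the zone-file path: the image name with its last extension replaced by .uzn."""
    base, _ext = os.path.splitext(image_path)
    return base + ".uzn"


def format_uzn(zones):
    """Render one zone per line as 'left top width height label' (assumed layout)."""
    return "".join(
        f"{left} {top} {width} {height} {label}\n"
        for left, top, width, height, label in zones
    )


def tesseract_command(image_path, output_base, psm=4):
    """Build the Tesseract 3 command line from the thread; it is not executed here."""
    return ["tesseract", image_path, output_base, "-psm", str(psm)]


image = "8309_016.2B.tif"
print(uzn_path_for(image))          # where the zone file would have to live
print(format_uzn(ZONES), end="")    # what the zone file might contain
print(" ".join(tesseract_command(image, "8309_016.2B_psm4")))
```

Writing the text from format_uzn to the path from uzn_path_for, next to the image, and then running the printed command would reproduce the setup described in this thread, assuming the layout above is right.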
 
> I'm starting to research this area for a project and I'm a bit puzzled. All I know is that I need to specify the areas of a document to extract text from. The document is laid out in tables. Do I need to remove the table lines if I specify areas?

The best way is to run your own tests and share your findings.


llozano

Jun 19, 2013, 2:45:02 PM
to tesser...@googlegroups.com
This is awesome. Thanks for your reply. One more question, just to clarify.
In your command example, should 8309_016.2B_psm4 really carry the _psm4 suffix? Is that intentional or just a typo?
Do I need to pass the TIFF file through some filter to remove colors or something like that? The examples you shared in your tar.gz file, which are awesome, are in grayscale, and I'm not sure about their resolution. Is there some preparation of the image that would improve the output?

Thanks again!

zdenko podobny

Jun 20, 2013, 2:33:26 AM
to tesser...@googlegroups.com
On Wed, Jun 19, 2013 at 8:45 PM, llozano <fix...@gmail.com> wrote:
> This is awesome. Thanks for your reply. One more question, just to clarify.
> In your command example, should 8309_016.2B_psm4 really carry the _psm4 suffix? Is that intentional or just a typo?

As the second argument (the output basename, in this case 8309_016.2B_psm4) you can use any text you like.
I prefer to use the image name (to easily identify the source image) plus something that identifies how I ran Tesseract (_psm4). This can be useful if you plan to test different page segmentation modes on the same image. Then you can use a tool such as kdiff3 (or WinMerge, or "Compare by content" in Total Commander if you are on Windows) to see the differences between the outputs of the different psm values.
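
If no graphical diff tool is at hand, the same comparison can be sketched with Python's standard difflib module. The sample texts and file names below are invented stand-ins for two Tesseract output files produced with different psm values; the helper name diff_outputs is hypothetical.

```python
import difflib


def diff_outputs(text_a, text_b, name_a="image_psm4.txt", name_b="image_psm6.txt"):
    """Return a unified diff of two OCR outputs as a list of lines (empty if identical)."""
    return list(
        difflib.unified_diff(
            text_a.splitlines(),
            text_b.splitlines(),
            fromfile=name_a,
            tofile=name_b,
            lineterm="",
        )
    )


# Invented sample outputs; in practice these would be read from the
# .txt files that Tesseract wrote for each run.
psm4_text = "Invoice 12345\nTotal 99.00\n"
psm6_text = "Invoice 12845\nTotal 99.00\n"

for line in diff_outputs(psm4_text, psm6_text):
    print(line)
```

Lines starting with "-" come from the first run and lines starting with "+" from the second, so a differing digit shows up as a removed and an added line.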
 
> Do I need to pass the TIFF file through some filter to remove colors or something like that? The examples you shared in your tar.gz file, which are awesome, are in grayscale, and I'm not sure about their resolution. Is there some preparation of the image that would improve the output?

The images you saw are part of the UNLV tests (see [1]). There are many more files, with different DPI.
Tesseract binarizes the input image itself (see e.g. [2] for the parameter that makes Tesseract output the binarized image). If you are not satisfied with the result, you can binarize the images yourself in advance (e.g. to use a different algorithm). Search the Tesseract forum if you need more details about the binarization algorithm used.
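
As a rough idea of what binarizing in advance involves, here is a minimal pure-Python sketch of a global threshold chosen with Otsu's method. Otsu is used only as a familiar example of the thresholding step; this thread does not say which algorithm Tesseract uses, and any other method could be substituted. The input is assumed to be a flat list of 8-bit grayscale values; loading and saving the image file is left out, and the function names are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the 8-bit threshold t that maximizes between-class variance.

    Pixels with value <= t are treated as dark, the rest as light.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(value * count for value, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark = 0      # number of pixels at or below the candidate threshold
    sum_dark = 0    # sum of their gray values
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue            # no dark pixels yet
        w_light = total - w_dark
        if w_light == 0:
            break               # every pixel is already in the dark class
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        var_between = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels, threshold):
    """Map each pixel to 0 (dark) or 255 (light) using the given threshold."""
    return [0 if p <= threshold else 255 for p in pixels]


# Invented sample: five dark pixels (ink) and five light pixels (paper).
sample = [10] * 5 + [200] * 5
t = otsu_threshold(sample)
print(t, binarize(sample, t))
```

Whether a pre-binarized image actually gives better OCR results than Tesseract's own binarization has to be tested on your own scans.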
