Is there a minimum number of letters?

Adlerfalke

Oct 24, 2011, 9:29:49 AM
to tesseract-ocr
Hello,

I can't find anything about how many letters or numbers a TIF must
contain for Tesseract to detect them.

For example, I have a picture containing only the number 3, but Tesseract
doesn't detect it. If I put 3 3 3 3 in the picture, Tesseract
detects the numbers. So my question is: what is the minimum number of
letters Tesseract needs in order to work without training?

Thx for answer :-)

Ps, sorry for my english ;-)

Giuseppe Menga

Oct 24, 2011, 10:09:27 AM
to tesser...@googlegroups.com
That is interesting. I'm recognizing expiration dates on medicines, and I
found it convenient to repeat the date 3 or 4 times; it improves recognition.
Can someone explain the reason?
Giuseppe

patrickq

Oct 24, 2011, 12:36:12 PM
to tesseract-ocr
The basic reason repeating text helps Tesseract is that Tesseract
makes an initial assumption about what kind of letters it is
looking at: tall (digits, uppercase letters, tall lowercase letters) or
short lowercase letters. Only after it makes that guess will it
try to match the letters against the corresponding subset of letters in
the training set.

Consider these texts submitted on their own:
aroma
usa

In the first example Tesseract is fairly likely to get it wrong and
interpret the word as an all-uppercase word. The reason: long words
whose letters are all the same height are likely to be uppercase words,
because lowercase words tend to have taller letters in the mix, as in
"lunch", "party" or "obscure". In the case of "usa" it may get it
right because the word is shorter, so it could be either lowercase or
uppercase.

In the case of digits, submitting "32 32 32" may yield better results
than just "32" because in the first case Tesseract gets 6 characters of
the same height, which increases the likelihood that they are treated as tall.
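
If you go the repetition route, the OCR output will contain several copies
of the value, and they may not all agree. Here is a minimal Python sketch of
one way to collapse them into a single answer by majority vote. It assumes
the copies come back separated by whitespace; majority_token is a
hypothetical helper for illustration, not part of Tesseract.

```python
from collections import Counter


def majority_token(ocr_text):
    """Return the most frequent whitespace-separated token, or None if empty.

    Hypothetical post-processing helper: assumes the same value was repeated
    several times in the image, so the OCR text holds one token per copy.
    """
    tokens = ocr_text.split()
    if not tokens:
        return None
    # most_common(1) yields [(token, count)] for the top-ranked token.
    best, _count = Counter(tokens).most_common(1)[0]
    return best
```

For example, majority_token("32 32 82") returns "32", so one misread copy
is outvoted by the other two.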

One would hope that Tesseract had a feedback loop whereby a height
estimate is revisited and reversed if it produced suspicious results,
but I have not seen strong evidence that Tesseract has any such check.

Patrick

Quan Nguyen

Oct 24, 2011, 1:35:31 PM
to tesseract-ocr
Try with PSM 8 or 10.

patrickq

Oct 24, 2011, 2:41:50 PM
to tesseract-ocr
What's PSM? Alternative spelling for PMS :-)?

merve t

Oct 25, 2011, 1:57:14 AM
to tesser...@googlegroups.com
Yes, can you explain PSM 8?
Is it something like PSM_AUTO? Should I change pagesegmode to PSM_8?
On this mailing list I was advised to give Tesseract characters one by one,
so I need to learn how to make Tesseract recognize single characters in images.
Thanks in advance.

zdenko podobny

Oct 25, 2011, 2:36:18 AM
to tesser...@googlegroups.com
On Mon, Oct 24, 2011 at 8:41 PM, patrickq <patrick.q...@gmail.com> wrote:
> What's PSM? Alternative spelling for PMS :-)?

 See:
$ tesseract
Usage:tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfile...]
pagesegmode values are:
0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR
3 = Fully automatic page segmentation, but no OSD. (Default)
4 = Assume a single column of text of variable sizes.
5 = Assume a single uniform block of vertically aligned text.
6 = Assume a single uniform block of text.
7 = Treat the image as a single text line.
8 = Treat the image as a single word.
9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.
-l lang and/or -psm pagesegmode must occur before anyconfigfile.
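
As a rough sketch of how the usage text above could be driven from a script,
here is a small Python wrapper that asks for mode 10 (single character).
The function names and the file names in the usage note are hypothetical.
The -psm spelling is taken from the usage text above; later Tesseract
releases spell the flag --psm, so check what your installed version prints.

```python
import subprocess


def build_tesseract_command(image_path, output_base, psm=None, lang=None):
    """Build an argv list following the usage text quoted above.

    Options go after imagename and outputbase; no config files are appended.
    """
    cmd = ["tesseract", image_path, output_base]
    if lang is not None:
        cmd += ["-l", lang]
    if psm is not None:
        # The usage text above lists modes 0 through 10.
        if not 0 <= psm <= 10:
            raise ValueError("pagesegmode must be between 0 and 10")
        cmd += ["-psm", str(psm)]
    return cmd


def recognize_single_char(image_path, output_base):
    """Run Tesseract in single-character mode and return the recognized text.

    Assumes the tesseract binary is on PATH and writes output_base + '.txt'.
    """
    subprocess.run(build_tesseract_command(image_path, output_base, psm=10),
                   check=True)
    with open(output_base + ".txt", encoding="utf-8") as f:
        return f.read().strip()
```

Calling recognize_single_char("three.tif", "out") would run
"tesseract three.tif out -psm 10" and read the result back from out.txt.
For a short word or number rather than one character, psm=8 is the
corresponding mode in the list above.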

Adlerfalke

Oct 25, 2011, 2:42:40 AM
to tesseract-ocr
That is what I meant: is there any chance that Tesseract can recognize
single characters?

If I understand Patrick correctly, it is not easy for Tesseract to
recognize single characters, because it needs more characters to produce a
result.

Thanks for the answers.

merve t

Oct 25, 2011, 7:33:14 AM
to tesser...@googlegroups.com
Zdenko, thanks very much for the answer.

Adlerfalke

Nov 3, 2011, 5:05:52 AM
to tesseract-ocr
Sorry for my late answer. After installing 3.01 I can use PSM, so I can
give Tesseract a single number. Thanks for your answers :-)
