multiple languages?


Guido Milanese

Sep 14, 2007, 12:48:14 PM
to tesseract-ocr
I am trying Tesseract and I find it very interesting. A quick
question: is it possible to use Tesseract for documents containing
more than one language? If not, is this feature planned for the
future? Can we (normal users) be of any help?

Thanks,
guido, italy

--
http://docenti.unicatt.it/milanese_guido
http://www.arsantiqua.org

Scan...@gmail.com

Sep 14, 2007, 4:42:44 PM
to tesseract-ocr
You could train two languages as one. It may even be possible to merge
them with the tools. Read the Wiki.

One thing: the more characters you ask it to recognize, the slower the
recognition will be and the lower your recognition rate.


Guido Milanese

Sep 16, 2007, 3:53:45 PM
to tesseract-ocr
On Sep 14, 10:42 pm, "g...@jetsoftdev.com" <ScanH...@gmail.com> wrote:
> You could train two languages in one. It may even be possible through
> the tools to merge them. Read the Wiki.

Yes, thank you, I had understood this approach of two languages in
one, but my question was a bit different (probably the phrasing of my
question was not clear). In some OCR programs, such as OmniPage, you can
select a certain number of languages *at the moment* you need them,
not building "couples" or "groups" in advance. Being a linguist, I
need sometimes e.g. English + German + French, sometimes German +
Italian + Latin + French, and so on.
I gather this is not possible with Tesseract?

Thanks again for this excellent program!
guido, italy


Scan...@gmail.com

Sep 17, 2007, 3:06:56 PM
to tesseract-ocr
Not currently. You would need to merge languages for every
combination.

Ray Smith

Sep 19, 2007, 3:26:24 PM
to tesser...@googlegroups.com
Although it is a desirable feature, enabling multiple languages at once is not currently supported. It might happen in a future release...
Ray.

Speedy

Aug 16, 2012, 9:15:12 AM
to tesser...@googlegroups.com
Is it possible to get the language that matched from the result? In other words, is it possible to use tesseract to recognize the font? Is this per character, per word or per page? How much slower is recognition when multiple languages are combined?

On Thursday, August 9, 2012 9:35:00 AM UTC+2, Simion Zafiu wrote:
Multi-language recognition works in tesseract-ocr 3.02.
The command line is:

tesseract imagename outbase [-l lang1+lang2+lang3]

(e.g. tesseract image1.tif imageOCR -l eng+fra+deu)

All the best !
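
For anyone who wants to pick the language set at the moment it is needed, as asked earlier in this thread, here is a minimal Python sketch that builds the command line quoted above. The function names build_tesseract_command and run_ocr are made up for illustration, and it assumes a tesseract binary on the PATH with the traineddata for each requested language installed.

```python
import shutil
import subprocess
from typing import List, Sequence


def build_tesseract_command(image: str, outbase: str,
                            langs: Sequence[str]) -> List[str]:
    """Build the argument list for one Tesseract run over several languages.

    Hypothetical helper: it only mirrors the command line quoted above.
    """
    if not langs:
        raise ValueError("at least one language code is required")
    # Language codes are joined with '+', e.g. eng+fra+deu.
    return ["tesseract", image, outbase, "-l", "+".join(langs)]


def run_ocr(image: str, outbase: str, langs: Sequence[str]) -> str:
    """Run Tesseract and return the path of the text file it writes."""
    # Assumption: the executable is called 'tesseract' and is on the PATH.
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(build_tesseract_command(image, outbase, langs), check=True)
    # Tesseract appends '.txt' to the output base name for plain text output.
    return outbase + ".txt"
```

A call such as run_ocr("image1.tif", "imageOCR", ["eng", "fra", "deu"]) would then run the same command as the example above, and the language list can differ from one document to the next.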

Sven Pedersen

Aug 16, 2012, 10:56:49 AM
to tesser...@googlegroups.com
Tesseract cannot tell you which font is in use, but it does work with
various fonts. To tell which language was matched, I think you would
have to try the languages in succession (not together), although with
some coding you might be able to get at that. I believe running the
languages in combination takes quite a bit less time than trying them
individually, but you'd have to test it. Not all languages are
supported in combination.
--Sven
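
One rough way to do the "try languages in succession" idea is to run each language on its own and compare the average word confidence. The sketch below assumes you already have the TSV output of each run (newer Tesseract releases can produce it with the tsv config; as far as I know the 3.02 release discussed here cannot) and that it has the usual header row with conf and text columns. The names mean_word_confidence and best_language are hypothetical, and mean confidence is only a heuristic, not a real language detector.

```python
import csv
import io
from typing import Dict, Optional


def mean_word_confidence(tsv_text: str) -> Optional[float]:
    """Average the 'conf' column over rows that carry recognized text."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    confs = []
    for row in reader:
        conf = float(row["conf"])
        text = (row.get("text") or "").strip()
        # Layout rows (page, block, line) carry conf = -1 and no text.
        if conf >= 0 and text:
            confs.append(conf)
    return sum(confs) / len(confs) if confs else None


def best_language(tsv_by_lang: Dict[str, str]) -> Optional[str]:
    """Pick the language whose single-language run scored highest."""
    scores = {lang: mean_word_confidence(tsv)
              for lang, tsv in tsv_by_lang.items()}
    scored = {lang: s for lang, s in scores.items() if s is not None}
    return max(scored, key=scored.get) if scored else None
```

This works per page as written; grouping the rows by line or block before averaging would give a finer-grained guess, at the cost of noisier scores on short lines.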



--
``All that is gold does not glitter,
not all those who wander are lost;
the old that is strong does not wither,
deep roots are not reached by the frost.
From the ashes a fire shall be woken,
a light from the shadows shall spring;
renewed shall be blade that was broken,
the crownless again shall be king.”