Is there a way to combine languages?


Falke

Mar 7, 2012, 5:51:25 PM
to tesseract-ocr
I searched this group but found only old posts about multiple
languages (regarding 2.0), so I am looking forward to the new features in
3.01...

I assume it is still impossible, even in 3.01, to recognize a
mixture of languages (distinct alphabets) in a single scan. If my assumption
is correct, then the next best thing would be to combine
multiple traineddata files into one superset...

But is that even feasible??

Any other solutions for multilingual (multi-alphabetic) documents?

(ABBYY does it -- why can't we?? :-))

TIA

zdenko podobny

Mar 8, 2012, 2:53:44 AM
to tesser...@googlegroups.com
On Wed, Mar 7, 2012 at 11:51 PM, Falke <haw...@flight.us> wrote:
> I searched this group but found only old posts about multiple
> languages (regarding 2.0), so I am looking forward to the new features in
> 3.01...
>
> I assume it is still impossible, even in 3.01, to recognize a
> mixture of languages (distinct alphabets) in a single scan. If my assumption
> is correct, then the next best thing would be to combine
> multiple traineddata files into one superset...

This feature will be available in version 3.02[1]; it is already in svn.
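
For readers trying this out, here is a minimal sketch of how such a call might be assembled from Python. It assumes the 3.02 command-line syntax in which several language codes are joined with "+" after the -l option, and that the matching traineddata files (here eng and rus) are installed. The file names and the helper build_tesseract_command are hypothetical, chosen only for illustration.

```python
import shlex


def build_tesseract_command(image_path, output_base, languages):
    """Build an argument list for the tesseract CLI with one or more languages.

    Assumption: multiple language codes are passed as a single '-l' value
    joined with '+', e.g. 'eng+rus'.
    """
    if not languages:
        raise ValueError("at least one language code is required")
    return ["tesseract", image_path, output_base, "-l", "+".join(languages)]


# Hypothetical file names; replace them with your own scan and output base.
cmd = build_tesseract_command("scan.tif", "scan_out", ["eng", "rus"])

# Print the command as it would be typed in a shell.
print(" ".join(shlex.quote(part) for part in cmd))
```

To actually run it, the list can be handed to subprocess.run(cmd, check=True), provided tesseract is on the PATH.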

 
> But is that even feasible??
>
> Any other solutions for multilingual (multi-alphabetic) documents?
>
> (ABBYY does it -- why can't we?? :-))
>
> TIA


Speedy

Mar 12, 2012, 7:55:54 AM
to tesseract-ocr
Can you provide any information on how this works?
At what level can languages mingle? For example, could each word be in
a different language? Or is it on a sentence level or a paragraph
level? Is there a way to influence this? For example, if I know that a
document is in a single language but I don't know which one, is
there a way to specify that?
Does the result contain information on which language matched?

Best regards,
Marcus


Sriranga(78yrsold)

Mar 12, 2012, 9:16:02 AM
to tesser...@googlegroups.com
Please upload a sample TIFF and any two traineddata files you choose. I will test them and report back to you.
Cheers.

Falke

Mar 13, 2012, 9:12:09 AM
to tesseract-ocr


On Mar 8, 3:53 am, zdenko podobny <zde...@gmail.com> wrote:
Wonderful; a great start.

Just some feedback on a tiny issue for now, suggesting an algorithmic tweak:

If an apostrophe has no space on either side of it, it's probably part of a
contraction rather than a quotation mark. So the two letters on
either side of the apostrophe should, most likely, belong to the same alphabet.
As it is now, the svn version allows for something like:

п's (Cyrillic "п" and Latin "s")

or

l'я (Latin "l" and Cyrillic "я")


There may be exceptions, but I think it's safe to assume that, in
practice, they occur less than 5% of the time...
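
The check suggested above could be sketched as a post-processing step on the recognized text. This is only an illustration, not how Tesseract works internally: the helper names script_of and mixed_script_apostrophes are hypothetical, and the script of a letter is guessed from the first word of its Unicode character name, which is a rough approximation.

```python
import unicodedata

# Straight and typographic apostrophes.
APOSTROPHES = {"'", "\u2019"}


def script_of(ch):
    """Return a rough script label such as 'LATIN' or 'CYRILLIC', or None for non-letters."""
    if not ch.isalpha():
        return None
    # Unicode names start with the script, e.g. 'CYRILLIC SMALL LETTER PE'.
    return unicodedata.name(ch, "").split(" ")[0] or None


def mixed_script_apostrophes(word):
    """Return the indexes of apostrophes whose neighbouring letters differ in script."""
    hits = []
    for i in range(1, len(word) - 1):
        if word[i] in APOSTROPHES:
            left = script_of(word[i - 1])
            right = script_of(word[i + 1])
            # Flag only when both neighbours are letters from different scripts.
            if left and right and left != right:
                hits.append(i)
    return hits


for sample in ["п's", "l'я", "don't"]:
    print(sample, mixed_script_apostrophes(sample))
```

A word flagged this way could then be re-recognized with a single language, or simply marked for manual review.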

Thanks for your hard work; it's an amazing product.

To Speedy: Looks like it's word-level (the above "bug"
notwithstanding :-))

Speedy

Mar 16, 2012, 8:25:02 AM
to tesseract-ocr
This is a very generous offer, thank you very much! I have prepared
some data and two trained engines and packed them in a zip file. Where
do you want me to upload them?

Best regards,
Marcus
