Training for Swedish, Danish, Norwegian, old spelling, fraktur


Lars Aronsson

Apr 23, 2010, 10:28:08 AM
to tesser...@googlegroups.com
I'm the founder of Project Runeberg, the Scandinavian
volunteer book scanning project, http://runeberg.org/
where we have mainly been using Abbyy Finereader,
with subsequent manual, online proofreading.
I'm also involved in Wikisource, the book scanning
and proofreading project of the Wikimedia Foundation.

Is anybody training Tesseract to read Swedish and
other Scandinavian languages? Is there a tutorial
for how to train new languages in Tesseract?

I'm running Ubuntu Linux 9.10. The included package
for Tesseract 2.03 contains man pages that are next
to useless. There seem to be some programs: mftraining,
cntraining, unicharset_extractor, but they talk about
"box files" and I have no clue what these are.
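For anyone else hitting the same wall: a "box file" is just a plain-text companion to a training image, one line per character, giving the glyph and its pixel bounding box (five fields in 2.0x; 3.x appends a page number). A minimal parsing sketch, with invented coordinates:

```python
# Minimal sketch of Tesseract "box file" parsing.
# Each line: <glyph> <left> <bottom> <right> <top> [<page>]
# Coordinates are pixels measured from the bottom-left of the image.
# The sample lines below are invented for illustration.

sample = """\
F 61 93 78 123 0
ö 80 92 95 124 0
r 98 93 107 115 0
"""

def parse_box_line(line):
    parts = line.split()
    glyph = parts[0]
    left, bottom, right, top = map(int, parts[1:5])
    page = int(parts[5]) if len(parts) > 5 else 0  # 2.0x files omit the page
    return glyph, (left, bottom, right, top), page

boxes = [parse_box_line(l) for l in sample.splitlines()]
print(boxes[0])  # ('F', (61, 93, 78, 123), 0)
```

The files are normally generated by Tesseract itself from a training image and then corrected by hand before the actual training run, as the TrainingTesseract wiki page describes.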

In Project Runeberg, we already have 186,000 pages
that are fully proofread, mostly in Swedish and
Danish, in various fonts and from different years,
meaning different spelling standards. Could these
be used for training Tesseract? How do I start?


--
Lars Aronsson (la...@aronsson.se)
Aronsson Datateknik - http://aronsson.se

Project Runeberg - free Nordic literature - http://runeberg.org/


--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To post to this group, send email to tesser...@googlegroups.com.
To unsubscribe from this group, send email to tesseract-oc...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en.

Sven Pedersen

Apr 23, 2010, 1:44:19 PM
to tesser...@googlegroups.com
Hi Lars,
The current development version of Tesseract 3.0 does have some
support for Swedish and Norwegian:
http://tesseract-ocr.googlecode.com/svn/trunk/tessdata/

You may want to update the link on your website, since the current
project site is:
http://code.google.com/p/tesseract-ocr/

and not the one on SourceForge. You may need to do further training
for older texts, and I would guess you could produce a Danish version
if needed from the Norwegian. The community here is currently planning
a fork of the code to continue development, since Google has not shown
any activity in several months and nobody else has write access to the
source code.
Good luck with your project!
--Sven
--
“All that is gold does not glitter,
not all those who wander are lost;
the old that is strong does not wither,
deep roots are not reached by the frost.
From the ashes a fire shall be woken,
a light from the shadows shall spring;
renewed shall be blade that was broken,
the crownless again shall be king.”

Zdenko Podobný

Apr 23, 2010, 1:47:51 PM
to tesser...@googlegroups.com
Hello,

please read the wiki pages (http://code.google.com/p/tesseract-ocr/wiki), especially http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract, where the training process for Tesseract 2.04 is described.

In SVN (http://code.google.com/p/tesseract-ocr/source/checkout) there is already a (pre?)release of version 3.00, with language data for your language as well (see http://code.google.com/p/tesseract-ocr/source/browse/#svn/trunk/tessdata%3Fstate%3Dclosed).

Based on some remarks on the wiki pages, the training process should be different; see also the postings in this forum. There is no information on when 3.00 will be released.

Zd.

Lars Aronsson

Apr 24, 2010, 7:55:55 AM
to tesser...@googlegroups.com
Sven Pedersen wrote:
> Hi Lars,
> The current development version of Tesseract 3.0 does have some
> support for Swedish and Norwegian:
> http://tesseract-ocr.googlecode.com/svn/trunk/tessdata/

I downloaded (from SVN) Tesseract 3.0, compiled it, and
ran it with "-l swe" (Swedish language) on some pages.

On this page, http://runeberg.org/strindbg/diktvers/0046.html
it interpreted some » (right-angled quotation marks) as ">>"
and failed to recognize the "H" in this font. Other than that,
it was quite successful. But that's a very easy page.
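That »-as-">>" confusion is also cheap to patch up after the fact while the engine is being retrained. A hedged sketch of a post-OCR substitution pass (the context rules here are illustrative only, not a complete solution):

```python
# Sketch of a post-OCR cleanup pass for a known confusion:
# Tesseract reading the Swedish quotation mark » as ">>".
# The replacement is only safe when a literal ">>" is unlikely,
# so it is applied only next to letters or sentence punctuation.
import re

def fix_guillemets(text):
    # ">>" directly before a letter or digit -> opening »
    text = re.sub(r'>>(?=\w)', '»', text)
    # ">>" directly after a letter or punctuation -> closing »
    text = re.sub(r'(?<=[\w.!?,])>>', '»', text)
    return text

print(fix_guillemets('>>Hvad nu?>> sade han.'))
# »Hvad nu?» sade han.
```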

I also found some instructions for how to train new
languages in Tesseract 2.0x. I don't know if these
instructions are still valid for 3.0, but it seems very
strange that I should start to generate a TIFF with
"a b c d e f" in order to train the Swedish language, since
these letters are already used in English and German.
The uppercase "H" in this particular font does need
to be trained, but that is not specific to Swedish,
but common to all languages that use the Latin script.
Swedish uses all German letters plus a-ring (å), only
with somewhat different probability weights. (Some
would say u-umlaut is not used in Swedish, but it
does appear in some personal names and OCR is much
worse without it.)
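The "different probability weights" point is easy to make concrete: the alphabets nearly coincide, and only the letter frequencies differ. A toy frequency count (the sample sentence is invented):

```python
from collections import Counter

def letter_frequencies(text):
    # Count alphabetic characters only, case-folded, so that
    # å/ä/ö are tallied like any other letter.
    letters = [c for c in text.lower() if c.isalpha()]
    counts = Counter(letters)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

# Invented sample sentence:
freq = letter_frequencies("Flickan går över ängen")
print(sorted(freq, key=freq.get, reverse=True)[:3])
```

A real language model would of course be estimated from a large corpus, such as the proofread Runeberg pages, not a single sentence.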

Maybe the instructions I found are useful for training
entirely new scripts (Cyrillic, Hebrew, Hindi, ...) and
not for new languages that use an already supported
script. This should be clarified in the text.

It would seem rational to make an OCR program for
Latin script recognize all letters and accents
in the most common 8-bit codes (ISO 8859-1, -2, -3, ...)
and only vary the probability weights and dictionaries
to add new languages.
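As an illustration of that idea (not of Tesseract's actual implementation, which stores its dictionaries as DAWG files inside the language data), here is a toy rescoring step in which the same classifier output is resolved differently by two wordlists; the words and weights are made up:

```python
# Illustration of an idea, not of Tesseract's implementation:
# the same shape classifier could serve several languages if only
# the wordlist (and per-word weights) changed. Both wordlists
# below are tiny invented samples.
SWEDISH = {"hästen": 0.9, "hasten": 0.1}
ENGLISH = {"hasten": 0.9}

def pick(candidates, wordlist):
    # Score = classifier confidence * dictionary weight; unknown
    # words get a small floor so they are not ruled out entirely.
    return max(candidates, key=lambda c: c[1] * wordlist.get(c[0], 0.01))[0]

# The classifier is torn between "hasten" and "hästen":
candidates = [("hasten", 0.5), ("hästen", 0.5)]
print(pick(candidates, SWEDISH))  # hästen
print(pick(candidates, ENGLISH))  # hasten
```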

In Tesseract 3.0, the language data is in a single
file, e.g. tessdata/swe.traineddata
Is this file format documented? Could I edit it
manually and get something useful?

> The community here is currently planning
> a fork of the code to continue development, since Google has not shown
> any activity in several months and nobody else has write access to the
> source code.

Oh, really. Is anybody taking the lead,
and do you have any funding for this?

Sven Pedersen

Apr 24, 2010, 4:32:09 PM
to tesser...@googlegroups.com
Hi Lars,
Yes, I think putting the language and the script in a single unit is not
the best idea. But Tesseract does support other scripts, like Chinese
and the Indic scripts, where some success has been reported. Actually, I think
the way the algorithm works makes the combination of dictionary and
orthography more effective.

For training details, it would be good to look through the archives of
this mailing list, since people have recently asked these questions:
http://groups.google.com/group/tesseract-ocr?pli=1

In particular, you'll find that you can just train starting from the
existing Swedish with some of the texts and fonts you'll be working
with. People have made programs to help generate the right kind of
training data.

The fork of the project is not funded per se, but the developer who is
taking the lead has funding for his part of the work, and some of us
hope to get organizations we're involved with to participate.
--Sven

Lars Aronsson

Apr 25, 2010, 2:29:59 PM
to tesser...@googlegroups.com
Sven, the more you describe the situation,
the more I realize that my needs are not the
same as yours or others who are here.
Has anybody, now that you are discussing
a fork, tried to draw a map of what kinds of
needs the user community has?

I'm scanning old books and newspapers, and
want to make really good OCR that can be
manually proofread with as little effort as
possible. This means lots of old typefaces,
lots of old spelling, lots of strange names,
often different languages on the same page,
often bad print quality, often complex page
layout. When I discover an error, I want to
fix the OCR engine, continuously training it
to become more and more perfect. If I find
a new kind of upper-case "H", it would be
insane to apply this new experience only to
the interpretation of Swedish, since it will
soon appear in texts in other languages.
It would also be insane if I was the only one
to benefit from such an improvement. It
should go back into the engine, so all users
can benefit.

The way language training is described in
Tesseract, it clearly can't meet these needs.
The software never was designed with
these goals in mind, or it would look very
different. Just one example: If I want to
train "fraktur" (black letter), there's no
easy way I can generate a pattern page because
I don't have fraktur fonts installed on my
computer. I never write fraktur, I only read
it in old books.

The internal needs of Google Book Search
should be very similar to my needs, and if
that's where the previous lead developer
works, I can understand if he has abandoned
Tesseract for some other design. I can also
understand if Google wants to keep that new
design to themselves. It would most probably
be based on statistics from the many million
books that Google has already scanned.

Does anybody know of an open source OCR
project that is based on statistics from
scanned books? Could parts of the Tesseract
software library be used to cut out letters
from scanned pages, so some other software
could group them statistically?
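The "group them statistically" step is essentially glyph clustering. A toy sketch, with invented 3x3 bitmaps standing in for cut-out letters and plain Hamming distance standing in for a real shape metric (a real system would normalize size and slant first):

```python
# Toy sketch of grouping cut-out letters statistically: each glyph
# is a tiny 3x3 binary bitmap (flattened to 9 bits), and glyphs are
# greedily grouped by Hamming distance to the first member of each
# group. The bitmaps below are invented for illustration.

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def cluster(glyphs, max_dist=1):
    groups = []  # each group is a list of glyph indices
    for i, g in enumerate(glyphs):
        for group in groups:
            if hamming(glyphs[group[0]], g) <= max_dist:
                group.append(i)
                break
        else:
            groups.append([i])
    return groups

glyphs = [
    (1, 1, 1, 1, 0, 1, 1, 1, 1),  # an "O"-like shape
    (1, 1, 1, 1, 0, 1, 1, 1, 0),  # same shape with one noisy pixel
    (0, 1, 0, 0, 1, 0, 0, 1, 0),  # an "l"-like stroke
]
print(cluster(glyphs))  # [[0, 1], [2]]
```

Each group could then be labeled once by a human (or matched against the proofread text), which is roughly how pages with unusual typefaces such as fraktur could be bootstrapped without installed fonts.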


--
Lars Aronsson (la...@aronsson.se)
Aronsson Datateknik - http://aronsson.se

