Outreach from the Wikisource community

David Cuenca

Aug 11, 2014, 8:59:34 AM
to tesser...@googlegroups.com, discussion list for Wikisource, the free library
Dear Tesseracters,

At Wikisource, the free digital library and sister project of Wikipedia, we have founded a user group [1] to promote international coordination and partnerships with fellow organizations. We have thousands of high-quality pages proofread by volunteers [2] and matched with scans, in about 50 different languages [3]. The editing interface for a single page looks like this [4]; a work can also be viewed as an "index" [5] or as text with all pages together [6]. There are several verification levels. The most important are "yellow", which means that one contributor has proofread the page, and "green", which means that a second person has verified the proofread text.

This past weekend at Wikimania '14 in London we had a meeting where we discussed technical and social issues from several Wikisource language communities. One of the most serious issues was raised by the Belarusian community which uses 2 different scripts with no commercial OCR support. This means that the volunteers have to type each word manually. We wondered if it would be possible to train Tesseract to recognize these old texts using the text that has been already typed.

We would like to know if you would be interested in exploring collaboration possibilities. I imagine that with your guidance we could prepare training data not only in different languages, but also from different time periods, scripts, etc. At the moment it is not very clear how to achieve this.

Please let us know if you would like to have a Hangout/Skype conversation any day next week.

Cheers,
Micru

Nick White

Aug 12, 2014, 12:25:43 PM
to tesser...@googlegroups.com, discussion list for Wikisource, the free library, David Cuenca
Dear Wikisourcerers,

It's good to hear from you. Wikisource is awesome, as far as I am
concerned.

> One
> of the most serious issues was raised by the Belarusian community which uses 2
> different scripts with no commercial OCR support. This means that the
> volunteers have to type each word manually. We wondered if it would be possible
> to train Tesseract to recognize these old texts using the text that has been
> already typed.

Actually, Tesseract should already have support for Russian and
Belarusian "out of the box"; see the 'rus' and 'bel' training data.

> We would like to know if you would be interested in exploring collaboration
> possibilities. I imagine that with your guidance we could prepare training data

The first thing to do would be to take a look at the results you get
from Tesseract with the rus and bel training sets already available,
and let us know if they aren't appropriate.
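
A minimal sketch of such a first trial, assuming the tesseract binary
is on the PATH and the 'bel' training data is installed. The helper
names and file names are hypothetical; only the
"tesseract IMAGE OUTPUTBASE -l LANG" invocation is Tesseract's own.

```python
import subprocess


def tesseract_command(image_path, output_base, lang="bel"):
    """Build the argument list for: tesseract IMAGE OUTPUTBASE -l LANG."""
    return ["tesseract", image_path, output_base, "-l", lang]


def run_ocr(image_path, output_base, lang="bel"):
    """Run Tesseract on one page image and return the recognised text.

    Tesseract writes its result to OUTPUTBASE.txt; check=True raises
    CalledProcessError if the run fails (e.g. missing training data).
    """
    subprocess.run(tesseract_command(image_path, output_base, lang), check=True)
    with open(output_base + ".txt", encoding="utf-8") as f:
        return f.read()
```

Comparing the text returned by run_ocr("page.png", "page") with the
proofread Wikisource text for the same page would give a rough idea of
how well the existing training data fits.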

> not only in different languages, but also from different time
> periods, scripts, etc.

As to training for specific scripts, time periods, etc., in theory
that is super cool, but in practice one training set should probably
be able to cover more or less everything (except very different
scripts, like Fraktur). That has been my experience with training
Ancient Greek (for which I have been interested in recognising
printing from a variety of time periods).

So give Tesseract a whirl, and if it isn't appropriate, or doesn't
work for specific scripts, let us know and we can try to figure out
a plan.

> At the moment it is not very clear how to achieve this.

My plan is to rewrite the training documentation very soon, so
things should hopefully become clearer on that front.

One thing that Wikisource could potentially do for us would be to
provide loads of proofread, freely reusable "ground truth" data to
test Tesseract with. Are there programmatic ways of getting at the
data, for example downloading all page images and corresponding text
that is marked as green, for a specific language / script?

Thanks for getting in touch!

Nick

Jim O'Regan

Aug 12, 2014, 5:12:31 PM
to tesser...@googlegroups.com, discussion list for Wikisource, the free library, David Cuenca
On 12 August 2014 17:25, Nick White <nick....@durham.ac.uk> wrote:
> Dear Wikisourcerers,
>
> It's good to hear from you. Wikisource is awesome, as far as I am
> concerned.
>
>> One
>> of the most serious issues was raised by the Belarusian community which uses 2
>> different scripts with no commercial OCR support. This means that the
>> volunteers have to type each word manually. We wondered if it would be possible
>> to train Tesseract to recognize these old texts using the text that has been
>> already typed.
>
> Actually, Tesseract should already have support for Russian and
> Belarusian "out of the box"; see the 'rus' and 'bel' training data.
>

'bel' contains Cyrillic; there is also a Latin script ('Łacinka') for
Belarusian. (Russian is widely spoken in Belarus, but Russian texts
would be added to the Russian Wikisource).

The question I'd have for the Belarusian Wikisourcers is: can the two
scripts be treated as having an exact mapping? (It doesn't need to be
1:1, I'm aware that, e.g., 'нь' maps to 'ń'). I ask because, as I
remember it, there's very little text in Łacinka, and adapting
Cyrillic material could be useful.
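
To illustrate what a mapping that is exact but not 1:1 could look
like, here is a toy sketch. Only the 'нь' to 'ń' pair comes from the
paragraph above; the other table entries and the function name are
illustrative assumptions, and a real Cyrillic-to-Łacinka conversion
would need context-dependent rules that this table does not capture.

```python
# Toy table: multi-character keys must win over their single-letter prefixes.
TOY_MAPPING = {
    "нь": "ń",  # two Cyrillic letters map to one Latin letter
    "к": "k",
    "о": "o",
    "н": "n",
    "а": "a",
}


def transliterate(text, mapping=TOY_MAPPING):
    """Replace Cyrillic sequences using longest-match-first lookup."""
    keys = sorted(mapping, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(mapping[key])
                i += len(key)
                break
        else:
            # Characters missing from the table pass through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)
```

With this table, transliterate("конь") gives "koń", while "кон" gives
"kon"; the soft sign is consumed together with the preceding 'н'.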

> One thing that Wikisource could potentially do for us would be to
> provide loads of proofread, freely reusable "ground truth" data to
> test Tesseract with. Are there programmatic ways of getting at the
> data, for example downloading all page images and corresponding text
> that is marked as green, for a specific language / script?

They're all added to a category, so that part should be pretty easy.
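
As a rough sketch, the members of such a category can be listed
through the standard MediaWiki API (action=query with
list=categorymembers). The helper name is hypothetical, and the
category title used below is an assumption: each language's
Wikisource names its "validated" category differently.

```python
from urllib.parse import urlencode


def category_members_url(api_base, category, limit=500, cmcontinue=None):
    """Build a MediaWiki API URL listing the pages in one category."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmlimit": str(limit),
        "format": "json",
    }
    if cmcontinue:
        # Continuation token returned by the previous response, if any.
        params["cmcontinue"] = cmcontinue
    return api_base + "?" + urlencode(params)


# Assumed endpoint and category title, for illustration only.
url = category_members_url("https://be.wikisource.org/w/api.php",
                           "Category:Validated")
```

Fetching that URL returns a JSON list of page titles; the text and
scan image for each page would then need further requests.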

--
<Sefam> Are any of the mentors around?
<jimregan> yes, they're the ones trolling you

David Cuenca

Aug 14, 2014, 2:31:12 AM
to Jim O'Regan, tesser...@googlegroups.com, discussion list for Wikisource, the free library
PS: I forwarded Jim's message to one of the Belarusian Wikisourcers.
--
Etiamsi omnes, ego non