---------- Forwarded message ----------
From: theraysmith <notifi...@github.com>
Date: Tue, Jan 24, 2017 at 12:00 AM
Subject: Re: [tesseract-ocr/tesseract] LSTM: Indic - length of the compressed codes (#654)
To: tesseract-ocr/tesseract <tess...@noreply.github.com>
Cc: Shreeshrii <shree...@gmail.com>, Author <aut...@noreply.github.com>
The text corpus is from *all* the www, taken several years ago, plus more
recent data from wiki-something.
The text is divided by language automatically, so there is a separate
stream for each of the Devanagari-based languages (as there is for the
Latin-based languages) and clipped to 1GB for each language.
For each language, the text is frequency counted and cleaned by multiple
methods. Sometimes this automatic cleaning is too stringent, or not
stringent enough, so forbidden_characters and desired_characters are used
as a guide in the cleanup process. There are other language-specific
parameters, such as a 1-in-n discard ratio applied to the frequency counts.
For some languages, the amount of data produced at the end is very thin.
The unicharset is extracted from what remains, as is the wordlist that is
published in langdata.
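To make that cleanup step concrete, here is a minimal sketch in Python; it
is illustrative only (the real pipeline is internal), and the thresholds
and the way the 1-in-n discard is applied are my assumptions.

# Minimal sketch of the cleanup described above, not the real pipeline:
# frequency-count the words, then use forbidden_characters as a hard filter
# and desired_characters as a guide. How the 1-in-n discard ratio is really
# applied is not specified; here it thins out singleton words.
from collections import Counter

def clean_wordlist(lines, forbidden_chars, desired_chars, discard_n=10):
    counts = Counter(word for line in lines for word in line.split())
    kept = {}
    for rank, (word, freq) in enumerate(counts.most_common()):
        if any(c in forbidden_chars for c in word):
            continue                  # forbidden_characters: hard filter
        rare = freq == 1 and rank % discard_n == 0
        rescued = any(c in desired_chars for c in word)
        if rare and not rescued:
            continue                  # discard 1 in n rare words unless they
                                      # contain a desired character (assumption)
        kept[word] = freq
    return kept

# The surviving words feed the unicharset extraction and the published wordlist.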
For the LSTM training, I resorted to using Google's parallel infrastructure
to render enough text in all the languages.
However much or little corpus text there is, the rendering process makes
50000 chunks of 50 words, each rendered in a different combination of font
and random degradation, which results in 400000-800000 rendered textlines.
The words are chosen to approximately echo the real frequency of conjunct
clusters (characters in most languages) in the source text, while also
using the most frequent words.
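A rough sketch of that chunking step, under stated assumptions: the
frequency-weighted sampling, the font cycling, and the degradation labels
are my guesses at the idea, not the actual rendering pipeline.

# Sketch of building 50000 chunks of 50 words, each tagged with a font and
# a random degradation. Illustrative only.
import random

def make_chunks(word_freqs, fonts, n_chunks=50000, words_per_chunk=50, seed=0):
    rng = random.Random(seed)
    words = list(word_freqs)
    weights = [word_freqs[w] for w in words]   # echo real word frequencies
    chunks = []
    for i in range(n_chunks):
        text = " ".join(rng.choices(words, weights=weights, k=words_per_chunk))
        font = fonts[i % len(fonts)]                          # vary the font
        degradation = rng.choice(["none", "blur", "noise"])   # random degradation
        chunks.append((text, font, degradation))
    return chunks

# At roughly 8-16 rendered lines per chunk, 50000 chunks gives the
# 400000-800000 textlines mentioned above.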
All of this is done without significant manual intervention, but the count
of generated textlines indicates when it has gone badly, usually due to a
lack of fonts or a lack of corpus text.
I recently stopped training chr, iku, khm, mya after discovering that I
have no rendered textlines that contain anything other than digits and
punctuation.
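A check of that kind might look like the following; the function and the
threshold are hypothetical, but they capture how a run that produced only
digits and punctuation (or far too few textlines) would be flagged.

# Hypothetical sanity check: flag a language whose rendered textlines
# contain nothing but digits, punctuation, and whitespace, or too few lines.
import unicodedata

def rendering_looks_broken(textlines, min_lines=400000):
    def digits_and_punct_only(line):
        return all(unicodedata.category(c)[0] in ("N", "P", "Z") for c in line)
    return len(textlines) < min_lines or all(map(digits_and_punct_only, textlines))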
Community input is therefore extremely useful, and usually results in edits
to forbidden_characters and desired_characters, which in turn guide the
filtration process.
Community-provided corpus text would be useful for languages that have very
little or no training data, given appropriate copyright/licensing clearance.
The languages with very little corpus text are:
bih
chr
dzo
iku
snd
syr
tgk
tir
so these are likely to have poor recognition accuracy.
On Sat, Jan 21, 2017 at 7:46 AM, Shreeshrii <notifi...@github.com> wrote:
> Ray,
>
> Thank you for the explanation regarding unicharset compression and your new
> strategy for Indic graphemes.
>
> Since the unicharset is being used as a filter, it will be important to
> include the most common conjunct clusters in it, which may differ from
> language to language.
>
> Some more questions:
>
> Are the desired_characters and forbidden_characters used in the process of
> creating the text corpus for different languages?
>
> How many text lines are you using for training the Devanagari-based
> languages, e.g. Sanskrit, Hindi, Marathi, etc.? Is it all/only from Wikipedia?
>
>
>
> - excuse the brevity, sent from mobile
>
> On 21-Jan-2017 3:34 AM, "theraysmith" <notifi...@github.com> wrote:
>
> > The LSTM recognizer is currently trained to recognize the sequence of
> > *unicodes* for Indic languages. This reduces the size of the output
> > softmax of the network from the 5000+ elements in the unicharset to ~140.
> > (There is an analogous process for Chinese, Japanese, and Korean that
> > doesn't use the unicode encoding, but it is a similar idea, and the codes
> > are strictly limited in length.)
> > The unicharset is used as a *filter* in the beam search to allow only
> > sensible grapheme/syllable combinations of unicodes, so it doesn't output
> > complete garbage text.
> >
> > The consequence of this recoding is that it runs a lot faster, but it has
> > to learn to output a long sequence for each grapheme/syllable.
> > The recoding system that maps from unicharset elements to the sequence of
> > unicodes currently only allows a maximum of 9 unicodes per
> > grapheme/syllable, including any viramas.
> >
> > I'm running a new training experiment this weekend to try a new coding
> > scheme, in which pairs are mapped to a single code, allowing a long
> > CVCVCVC string to be encoded using just CCCC, cutting down from 7 codes
> > to 4. This will probably increase the size of the output softmax to
> > ~170, but reduce the length of the average code sequence by about 1/3,
> > which might be easier for it to learn, without slowing it down much.
> >
> > It will take a couple of weeks to tell if it works, but if it does I will
> > check in the code, upload new traineddatas, and close this issue. If it
> > doesn't work, I will have to think again...
> >
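To make the quoted recoding discussion concrete, here is a toy example; the
pairing rule and the sample syllable are my own illustration, not
Tesseract's actual recoder or its code tables.

# Toy illustration of the two encodings discussed above. Example grapheme:
# Devanagari "क्त्व" (ka, virama, ta, virama, va), i.e. a C V C V C string
# in the notation above (C = consonant, V = virama).
VIRAMA = "\u094D"
syllable = "\u0915\u094D\u0924\u094D\u0935"   # क्त्व

# Current scheme: one code per unicode, capped at 9 codes per grapheme.
unicode_codes = list(syllable)                 # 5 codes

# Proposed scheme: fold each consonant+virama pair into a single code, so
# C V C V C collapses to (CV)(CV)C, i.e. 5 codes down to 3, and a longer
# C V C V C V C string drops from 7 codes to 4.
def pair_encode(text, virama=VIRAMA):
    codes, i = [], 0
    while i < len(text):
        if i + 1 < len(text) and text[i + 1] == virama:
            codes.append(text[i] + virama)     # consonant+virama as one code
            i += 2
        else:
            codes.append(text[i])
            i += 1
    return codes

print(len(unicode_codes), len(pair_encode(syllable)))   # 5 3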
--
Ray.