inconsistent languages


Jeff Breidenbach

Jul 12, 2015, 1:36:26 AM
to tesser...@googlegroups.com
Here are the inconsistent languages.

only found in tessdata
deu_frak
dan_frak
slk_frak
equ
osd

only found in VALID_LANGUAGE_CODES in language-specific.sh
fil
hye
lat_lid
snd

only found in langdata
gle_uncial
zlm

missing from langdata
grc 

missing from tessdata
bih
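
For reference, a rough way to reproduce this comparison, assuming local clones of the tessdata, langdata and tesseract repos (the paths below are placeholders, and the VALID_LANGUAGE_CODES extraction assumes the variable is assigned on a single line in language-specific.sh):

TESSDATA=./tessdata      # clone of the tessdata repo
LANGDATA=./langdata      # clone of the langdata repo
SCRIPT=./tesseract/training/language-specific.sh

# codes shipped as *.traineddata
ls "$TESSDATA"/*.traineddata | xargs -n1 basename | sed 's/\.traineddata$//' | sort > /tmp/tess.txt
# codes that have a langdata directory
find "$LANGDATA" -mindepth 1 -maxdepth 1 -type d -printf '%f\n' | sort > /tmp/lang.txt
# codes listed in VALID_LANGUAGE_CODES (crude; assumes a one-line assignment)
grep -o 'VALID_LANGUAGE_CODES="[^"]*"' "$SCRIPT" | cut -d'"' -f2 | tr ' ' '\n' | sort > /tmp/valid.txt

comm -23 /tmp/tess.txt /tmp/lang.txt    # only found in tessdata
comm -13 /tmp/tess.txt /tmp/lang.txt    # only found in langdata
comm -23 /tmp/valid.txt /tmp/tess.txt   # in VALID_LANGUAGE_CODES but missing from tessdata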

Zdenko Podobný

Jul 12, 2015, 2:19:28 AM
to tesser...@googlegroups.com
*_frak is the Fraktur variant of a language
equ is the math / equation detection module
osd is the orientation and script detection module
grc is from http://ancientgreekocr.org/
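
For example, osd.traineddata is what the standalone orientation/script detection mode uses; a minimal sketch with 3.04's command-line syntax (the image name is a placeholder):

# -psm 0 = orientation and script detection only (3.04 syntax);
# requires osd.traineddata in the tessdata directory.
tesseract page.png page -psm 0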

Zdenko


ShreeDevi Kumar

Jul 12, 2015, 4:59:48 AM
to tesser...@googlegroups.com
language-specific.sh 
lines 21-26

# Codes for which we have webtext but no fonts:
# armenian, dhivehi, mongolian (we support mongolian cyrillic as in the webtext,
# but not mongolian script with vertical writing direction), sindhi (for which
# we have persian script webtext, but real sindhi text can be in persian OR
# devanagari script)
UNUSABLE_LANGUAGE_CODES="hye div mon snd"

--------------
bih
The Bihari tessdata may have been pulled because there is a pending issue with the training text.
Please see:

------------------------
grc
Ancient Greek - work done by Nick White

The code

The code to generate and test the Ancient Greek OCR training data is in several small git repositories. It is all free software under the Apache License 2.0.

git clone http://ancientgreekocr.org/grc.git
    The final training process, hopefully soon to be part of the main Tesseract codebase.
git clone http://ancientgreekocr.org/grctraining.git
    Rules and tools to deterministically generate all prerequisites for the final training process.
git clone http://ancientgreekocr.org/grctestfodder.git
    Ancient Greek page scans and ground truth text for testing OCR accuracy.
git clone http://ancientgreekocr.org/ocr-evaluation-tools.git
    Tools to test OCR accuracy.
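
A rough sketch of how such an accuracy test could be run (the accuracy program name and argument order here are assumptions based on the ISRI-style tools; check the ocr-evaluation-tools README for the exact commands):

# page.png is a test scan and ground.txt its hand-corrected transcription
# (e.g. from the grctestfodder repo above).
tesseract page.png out -l grc              # OCR output lands in out.txt
accuracy ground.txt out.txt report.txt     # assumed ISRI-style character-accuracy report
cat report.txt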


ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


Jim O'Regan

Jul 12, 2015, 6:01:51 AM
to tesser...@googlegroups.com
On 12 July 2015 at 06:36, Jeff Breidenbach <breid...@gmail.com> wrote:
> only found in langdata
> gle_uncial

gle_uncial is Irish (Gaeilge) in Uncial script, traineddata is here:
https://github.com/jimregan/tesseract-gle-uncial/releases/download/v0.1beta1/gle_uncial.traineddata
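
For anyone wanting to try it, a minimal sketch of installing it by hand, assuming a 3.04-style setup where TESSDATA_PREFIX points at the directory that contains tessdata/ (paths and the image name are placeholders):

wget https://github.com/jimregan/tesseract-gle-uncial/releases/download/v0.1beta1/gle_uncial.traineddata
export TESSDATA_PREFIX=$HOME/tesseract/          # assumes $TESSDATA_PREFIX/tessdata/ exists
cp gle_uncial.traineddata "$TESSDATA_PREFIX"/tessdata/
tesseract page.png page -l gle_uncial            # recognised text goes to page.txt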

--
<Sefam> Are any of the mentors around?
<jimregan> yes, they're the ones trolling you

Ray Smith

Jul 12, 2015, 10:16:13 PM
to tesser...@googlegroups.com
All these responses look correct to me.

The actual errors are:
grc and fil shouldn't be in the valid language codes list, as they are the wrong variant of ISO 639.
That would also reduce the risk of accidentally overwriting Nick White's grc.traineddata in the future.

Shree is correct that I pulled bih from the traineddata because the training text is garbage.

Another caveat worth noting is that I only tested a small fraction of these languages - maybe 25?
I suspect, for instance, that all the Arabic-based languages except ara don't work very well.
I would be interested in more feedback on how bad it is in any of them, and will take suggestions into account for the next version after 3.04.




ShreeDevi Kumar

Jul 13, 2015, 12:09:28 AM
to tesser...@googlegroups.com
Ray,

1. I will be happy to test the Devanagari-based languages as well as other Indic ones, if there is some objective way of measuring the accuracy. Is there a test suite or recommended method for this?

2. Also, I noticed that there is a directory for Persian langdata but no traineddata for it.

3. It would be helpful if we could have a page which links to external (non-Google) traineddata files, e.g. grc, per, gle_uncial etc.

4. Is there a recommended method for naming language-script combinations? E.g. Sindhi can be written in the Devanagari and Persian scripts - so should the traineddata be snd_deva and snd_per?

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Ray Smith

Jul 13, 2015, 1:02:14 AM
to tesser...@googlegroups.com
On Sun, Jul 12, 2015 at 9:08 PM, ShreeDevi Kumar <shree...@gmail.com> wrote:
> Ray,
>
> 1. I will be happy to test the Devanagari-based languages as well as other Indic ones, if there is some objective way of measuring the accuracy. Is there a test suite or recommended method for this?

We have an internal tool, but I don't know what, if any, open source tools there are for complex scripts.
I have test data for Hindi, Kannada, Telugu and Tamil, but they are the only Indic languages that I have data for.

> 2. Also, I noticed that there is a directory for Persian langdata but no traineddata for it.

The file exists. I will have to add it tomorrow.

> 3. It would be helpful if we could have a page which links to external (non-Google) traineddata files, e.g. grc, per, gle_uncial etc.

Good idea.

> 4. Is there a recommended method for naming language-script combinations? E.g. Sindhi can be written in the Devanagari and Persian scripts - so should the traineddata be snd_deva and snd_per?

So far I have used the convention that if the language has a "usual" script then that goes in lang.traineddata, and the "other" script goes in lang_other.traineddata (e.g. srp, srp_latn and uzb, uzb_cyrl). If there are 2 "usual" scripts, e.g. in different countries, then I suppose the best thing to do is to put a script identifier on both, as you suggest.
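
(Whichever naming is chosen, note that the variants can already be combined at run time with Tesseract's -l lang1+lang2 syntax, e.g.:)

# Loads both srp.traineddata and srp_latn.traineddata from the tessdata
# directory and lets Tesseract use both; the image name is a placeholder.
tesseract scan.png out -l srp+srp_latn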

ShreeDevi Kumar

Jul 13, 2015, 2:48:40 AM
to tesser...@googlegroups.com

If you can share the format for the test data, I can try to provide you with files for other Indian languages, especially Devanagari-based ones.

Alternatively, you could let me know if there is a way to get access to Google's internal tool for this.

- sent from my phone. excuse the brevity.

Helmut Wollmersdorfer

Jul 13, 2015, 10:24:13 AM
to tesser...@googlegroups.com


On Monday, 13 July 2015 04:16:13 UTC+2, Ray wrote:
> All these responses look correct to me.
>
> The actual errors are:
> grc and fil shouldn't be in the valid language codes list, as they are the wrong variant of ISO 639.

grc is the valid ISO code (see http://www.loc.gov/standards/iso639-2/php/code_list.php) for Ancient Greek.

> That would also reduce the risk of accidentally overwriting Nick White's grc.traineddata in the future.

If Nick White's grc.traineddata is not for Ancient Greek, then it should be renamed.

Helmut Wollmersdorfer

Jul 13, 2015, 10:24:50 AM
to tesser...@googlegroups.com


On Sunday, 12 July 2015 08:19:28 UTC+2, Zdenko Podobný wrote:
> *_frak is the Fraktur variant of a language
> equ is the math / equation detection module
> osd is the orientation and script detection module

To avoid naming conflicts with (future) ISO language codes, equ and osd should maybe be renamed as variants of the language code

  zxx    No linguistic content; Not applicable

i.e.

  zxx_equ
  zxx_osd

Jeff Breidenbach

Jul 16, 2015, 4:39:39 PM
to tesser...@googlegroups.com
Should I be shipping any languages besides the ones found in
tessdata on github? The only candidates I currently know of are gle_unical,
mentioned above, and the Ancient Greek traineddata at http://ancientgreekocr.org/.

If so, I need to know the copyright owner and license for each. (And 
I really, really hope the license is Apache 2.0 to match everything else).

Jim O'Regan

Jul 16, 2015, 6:45:21 PM
to tesser...@googlegroups.com
Uncial :) Unical makes me think of INTERCAL.

The language pack is:
Copyright 2009-2015 Jim O'Regan <jor...@gmail.com>
Copyright 2009-2015 Kevin Scannell <ksc...@gmail.com>

(The training images and scripts are a little more complicated, and I
should really get around to doing those credits).

The licence is Apache 2.0
(https://github.com/jimregan/tesseract-gle-uncial/blob/master/LICENSE)

Jim O'Regan

Jul 16, 2015, 6:50:27 PM
to tesser...@googlegroups.com
On 16 July 2015 at 23:45, Jim O'Regan <jor...@gmail.com> wrote:
> On 16 July 2015 at 21:39, Jeff Breidenbach <breid...@gmail.com> wrote:
>> Should I be shipping any languages besides the ones found in
>> tessdata on github? The only candidates I currently know of are gle_unical,
>> mentioned above, and the Ancient Greek traineddata at http://ancientgreekocr.org/.
>>
>> If so, I need to know the copyright owner and license for each. (And
>> I really, really hope the license is Apache 2.0 to match everything else).
>
> Uncial :) Unical makes me think of INTERCAL.
>
> The language pack is:
> Copyright 2009-2015 Jim O'Regan <jor...@gmail.com>
> Copyright 2009-2015 Kevin Scannell <ksc...@gmail.com>
>
> (The training images and scripts are a little more complicated, and I
> should really get around to doing those credits).

I could have left that out; the training data that was used, to the
extent that it can have copyright, was my own work.

Nick White

Jul 20, 2015, 8:07:50 AM
to tesser...@googlegroups.com
Hi all, I'm catching up on the discussion on the list.

On Mon, Jul 13, 2015 at 04:46:43AM -0700, Helmut Wollmersdorfer wrote:
> On Monday, 13 July 2015 04:16:13 UTC+2, Ray wrote:
>> grc and fil shouldn't be in the valid language codes list, as they are the
>> wrong variant of ISO 639.
>
> grc is the valid ISO code (see
> http://www.loc.gov/standards/iso639-2/php/code_list.php) for Ancient Greek.
>
> If Nick White's grc.traineddata is not for Ancient Greek, then it should be
> renamed.

It is for Ancient Greek (different from modern Greek in both
dictionary and diacritics), and it is the correct ISO 639 code, yes.

I will look into producing langdata files using the main Tesseract
training tools soon (they weren't publicly available when I
created the training originally). I'm interested to see how the
accuracy compares with the traineddata produced from my own tools,
and obviously it would be easier for packaging if things were done
that way.
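
For reference, a rough sketch of the kind of invocation involved, assuming 3.04's training/tesstrain.sh (the flag names here are from memory, so treat them as assumptions and check the script's usage message):

# Assumes local clones of langdata and tessdata, plus suitable Greek fonts installed.
./training/tesstrain.sh \
    --lang grc \
    --langdata_dir ../langdata \
    --tessdata_dir ../tessdata \
    --output_dir ./grc-output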

Nick