inconsistent languages


Jeff Breidenbach

Jul 12, 2015, 1:36:26 AM
to tesser...@googlegroups.com
Here are the inconsistent languages.

only found in tessdata
deu_frak
dan_frak
slk_frak
equ
osd

only found in VALID_LANGUAGE_CODES in language-specific.sh
fil
hye
lat_lid
snd

only found in langdata
gle_uncial
zlm

missing from langdata
grc 

missing from tessdata
bih
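
For reference, a rough way to reproduce this comparison, assuming local clones of the tessdata, langdata and tesseract repos (the paths below are placeholders, and the VALID_LANGUAGE_CODES extraction assumes the variable is assigned on a single line in language-specific.sh):

TESSDATA=./tessdata      # clone of the tessdata repo
LANGDATA=./langdata      # clone of the langdata repo
SCRIPT=./tesseract/training/language-specific.sh

# codes shipped as *.traineddata
ls "$TESSDATA"/*.traineddata | xargs -n1 basename | sed 's/\.traineddata$//' | sort > /tmp/tess.txt
# codes that have a langdata directory
find "$LANGDATA" -mindepth 1 -maxdepth 1 -type d -printf '%f\n' | sort > /tmp/lang.txt
# codes listed in VALID_LANGUAGE_CODES (crude; assumes a one-line assignment)
grep -o 'VALID_LANGUAGE_CODES="[^"]*"' "$SCRIPT" | cut -d'"' -f2 | tr ' ' '\n' | sort > /tmp/valid.txt

comm -23 /tmp/tess.txt /tmp/lang.txt    # only found in tessdata
comm -13 /tmp/tess.txt /tmp/lang.txt    # only found in langdata
comm -23 /tmp/valid.txt /tmp/tess.txt   # in VALID_LANGUAGE_CODES but missing from tessdata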

Zdenko Podobný

Jul 12, 2015, 2:19:28 AM
to tesser...@googlegroups.com
*_frak is the Fraktur variant of a language
equ is the math / equation detection module
osd is the orientation and script detection module
grc is from http://ancientgreekocr.org/
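
For example, osd.traineddata is what the standalone orientation/script detection mode uses; a minimal sketch with 3.04's command-line syntax (the image name is a placeholder):

# -psm 0 = orientation and script detection only (3.04 syntax);
# requires osd.traineddata in the tessdata directory.
tesseract page.png page -psm 0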

Zdenko


ShreeDevi Kumar

Jul 12, 2015, 4:59:48 AM
to tesser...@googlegroups.com
language-specific.sh 
lines 21-26

# Codes for which we have webtext but no fonts:
# armenian, dhivehi, mongolian (we support mongolian cyrillic as in the webtext,
# but not mongolian script with vertical writing direction), sindhi (for which
# we have persian script webtext, but real sindhi text can be in persian OR
# devanagari script)
UNUSABLE_LANGUAGE_CODES="hye div mon snd"

--------------
bih
The Bihari tessdata may have been pulled because there is a pending issue with the training text.
Please see:

------------------------
grc
Ancient Greek - work done by Nick White

The code

The code to generate and test the Ancient Greek OCR training data is in several small git repositories. It is all free software under the Apache License 2.0.

git clone http://ancientgreekocr.org/grc.git
    The final training process, hopefully soon to be part of the main Tesseract codebase.
git clone http://ancientgreekocr.org/grctraining.git
    Rules and tools to deterministically generate all prerequisites for the final training process.
git clone http://ancientgreekocr.org/grctestfodder.git
    Ancient Greek page scans and ground truth text for testing OCR accuracy.
git clone http://ancientgreekocr.org/ocr-evaluation-tools.git
    Tools to test OCR accuracy.
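
A rough sketch of how such an accuracy test could be run (the accuracy program name and argument order here are assumptions based on the ISRI-style tools; check the ocr-evaluation-tools README for the exact commands):

# page.png is a test scan and ground.txt its hand-corrected transcription
# (e.g. from the grctestfodder repo above).
tesseract page.png out -l grc              # OCR output lands in out.txt
accuracy ground.txt out.txt report.txt     # assumed ISRI-style character-accuracy report
cat report.txt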


ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


Jim O'Regan

Jul 12, 2015, 6:01:51 AM
to tesser...@googlegroups.com
On 12 July 2015 at 06:36, Jeff Breidenbach <breid...@gmail.com> wrote:
> only found in langdata
> gle_uncial

gle_uncial is Irish (Gaeilge) in Uncial script, traineddata is here:
https://github.com/jimregan/tesseract-gle-uncial/releases/download/v0.1beta1/gle_uncial.traineddata
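
For anyone wanting to try it, a minimal sketch of installing it by hand, assuming a 3.04-style setup where TESSDATA_PREFIX points at the directory that contains tessdata/ (paths and the image name are placeholders):

wget https://github.com/jimregan/tesseract-gle-uncial/releases/download/v0.1beta1/gle_uncial.traineddata
export TESSDATA_PREFIX=$HOME/tesseract/          # assumes $TESSDATA_PREFIX/tessdata/ exists
cp gle_uncial.traineddata "$TESSDATA_PREFIX"/tessdata/
tesseract page.png page -l gle_uncial            # recognised text goes to page.txt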

--
<Sefam> Are any of the mentors around?
<jimregan> yes, they're the ones trolling you

Ray Smith

Jul 12, 2015, 10:16:13 PM
to tesser...@googlegroups.com
All these responses look correct to me.

The actual errors are:
grc and fil shouldn't be in the valid language codes list, as they are the wrong variant of ISO 639.
That would also reduce the risk of accidentally overwriting Nick White's grc.traineddata in the future.

Shree is correct that I pulled bih from the traineddata because the training text is garbage.

Another caveat worth noting is that I only tested a small fraction of these languages - maybe 25?
I suspect, for instance, that all the Arabic-based languages except ara don't work very well.
I would be interested in more feedback on how bad it is in any of them, and will take suggestions into account for the next version after 3.04.




ShreeDevi Kumar

Jul 13, 2015, 12:09:28 AM
to tesser...@googlegroups.com
Ray,

1. I will be happy to test the Devanagari-based languages as well as other Indic ones, if there is some objective way of measuring the accuracy. Is there a test suite or recommended method for this?

2. Also, I noticed that there is a directory for Persian langdata but no traineddata for it.

3. It would be helpful if we could have a page which links to external (non-Google) traineddata files, e.g. grc, per, gle_uncial etc.

4. Is there a recommended method for naming language-script combinations? E.g. Sindhi can be written in the Devanagari and Persian scripts - so should the traineddata be snd_deva and snd_per?

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Ray Smith

Jul 13, 2015, 1:02:14 AM
to tesser...@googlegroups.com
On Sun, Jul 12, 2015 at 9:08 PM, ShreeDevi Kumar <shree...@gmail.com> wrote:
> Ray,
>
> 1. I will be happy to test the Devanagari-based languages as well as other Indic ones, if there is some objective way of measuring the accuracy. Is there a test suite or recommended method for this?

We have an internal tool, but I don't know what, if any, open source tools there are for complex scripts.
I have test data for Hindi, Kannada, Telugu and Tamil, but they are the only Indic languages that I have data for.

> 2. Also, I noticed that there is a directory for Persian langdata but no traineddata for it.

The file exists. I will have to add it tomorrow.

> 3. It would be helpful if we could have a page which links to external (non-Google) traineddata files, e.g. grc, per, gle_uncial etc.

Good idea.

> 4. Is there a recommended method for naming language-script combinations? E.g. Sindhi can be written in the Devanagari and Persian scripts - so should the traineddata be snd_deva and snd_per?

So far I have used the convention that if the language has a "usual" script then that goes in lang.traineddata, and the "other" script goes in lang_other.traineddata (e.g. srp, srp_latn and uzb, uzb_cyrl). If there are 2 "usual" scripts, e.g. in different countries, then I suppose the best thing to do is to put a script identifier on both, as you suggest.
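
(Whichever naming is chosen, note that the variants can already be combined at run time with Tesseract's -l lang1+lang2 syntax, e.g.:)

# Loads both srp.traineddata and srp_latn.traineddata from the tessdata
# directory and lets Tesseract use both; the image name is a placeholder.
tesseract scan.png out -l srp+srp_latn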

ShreeDevi Kumar

Jul 13, 2015, 2:48:40 AM
to tesser...@googlegroups.com

If you can share the format for the test data, I can try to provide you with files for other Indian languages, especially Devanagari-based ones.

Alternatively, you could let me know if there is a way to get access to Google's internal tool for this.

- sent from my phone. excuse the brevity.

Helmut Wollmersdorfer

Jul 13, 2015, 10:24:13 AM
to tesser...@googlegroups.com


On Monday, 13 July 2015 04:16:13 UTC+2, Ray wrote:
> All these responses look correct to me.
>
> The actual errors are:
> grc and fil shouldn't be in the valid language codes list, as they are the wrong variant of ISO 639.

grc is the valid ISO code (see http://www.loc.gov/standards/iso639-2/php/code_list.php) for Ancient Greek.

> That would also reduce the risk of accidentally overwriting Nick White's grc.traineddata in the future.

If Nick White's grc.traineddata is not for Ancient Greek, then it should be renamed.

Helmut Wollmersdorfer

Jul 13, 2015, 10:24:50 AM
to tesser...@googlegroups.com


On Sunday, 12 July 2015 08:19:28 UTC+2, Zdenko Podobný wrote:
> *_frak is the Fraktur variant of a language
> equ is the math / equation detection module
> osd is the orientation and script detection module

To avoid naming conflicts with (future) ISO language codes, equ and osd should maybe be renamed as variants of the language code

  zxx    No linguistic content; Not applicable

i.e.

  zxx_equ
  zxx_osd

Jeff Breidenbach

Jul 16, 2015, 4:39:39 PM
to tesser...@googlegroups.com
Should I be shipping any languages besides the ones found in
tessdata on github? The only candidates I currently know of are gle_unical,
mentioned above, and the Ancient Greek traineddata at http://ancientgreekocr.org/.

If so, I need to know the copyright owner and license for each. (And 
I really, really hope the license is Apache 2.0 to match everything else).

Jim O'Regan

Jul 16, 2015, 6:45:21 PM
to tesser...@googlegroups.com
Uncial :) Unical makes me think of INTERCAL.

The language pack is:
Copyright 2009-2015 Jim O'Regan <jor...@gmail.com>
Copyright 2009-2015 Kevin Scannell <ksc...@gmail.com>

(The training images and scripts are a little more complicated, and I
should really get around to doing those credits).

The licence is Apache 2.0
(https://github.com/jimregan/tesseract-gle-uncial/blob/master/LICENSE)

Jim O'Regan

Jul 16, 2015, 6:50:27 PM
to tesser...@googlegroups.com
On 16 July 2015 at 23:45, Jim O'Regan <jor...@gmail.com> wrote:
> On 16 July 2015 at 21:39, Jeff Breidenbach <breid...@gmail.com> wrote:
>> Should I be shipping any languages besides the ones found in
>> tessdata on github? The only candidates I currently know of are gle_unical,
>> mentioned above, and the Ancient Greek traineddata at http://ancientgreekocr.org/.
>>
>> If so, I need to know the copyright owner and license for each. (And
>> I really, really hope the license is Apache 2.0 to match everything else).
>
> Uncial :) Unical makes me think of INTERCAL.
>
> The language pack is:
> Copyright 2009-2015 Jim O'Regan <jor...@gmail.com>
> Copyright 2009-2015 Kevin Scannell <ksc...@gmail.com>
>
> (The training images and scripts are a little more complicated, and I
> should really get around to doing those credits).

I could have left that out; the training data that was used, to the
extent that it can have copyright, was my own work.

Nick White

Jul 20, 2015, 8:07:50 AM
to tesser...@googlegroups.com
Hi all, I'm catching up on the discussion on the list.

On Mon, Jul 13, 2015 at 04:46:43AM -0700, Helmut Wollmersdorfer wrote:
> On Monday, 13 July 2015 04:16:13 UTC+2, Ray wrote:
>> grc and fil shouldn't be in the valid language codes list, as they are the
>> wrong variant of ISO 639.
>
> grc is the valid ISO code (see
> http://www.loc.gov/standards/iso639-2/php/code_list.php) for Ancient Greek.
>
> If Nick White's grc.traineddata is not for Ancient Greek, then it should be
> renamed.

It is for Ancient Greek (different from modern Greek in both
dictionary and diacritics), and it is the correct ISO 639 code, yes.

I will look into producing langdata files using the main Tesseract
training tools soon (they weren't publicly available when I
created the training originally). I'm interested to see how the
accuracy compares with the traineddata produced from my own tools,
and obviously it would be easier for packaging if things were done
that way.
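
For reference, a rough sketch of the kind of invocation involved, assuming 3.04's training/tesstrain.sh (the flag names here are from memory, so treat them as assumptions and check the script's usage message):

# Assumes local clones of langdata and tessdata, plus suitable Greek fonts installed.
./training/tesstrain.sh \
    --lang grc \
    --langdata_dir ../langdata \
    --tessdata_dir ../tessdata \
    --output_dir ./grc-output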

Nick