Tesseract not recognizing ancient language's code

aby tesh

Mar 5, 2020, 11:37:51 PM

to tesseract-ocr

Hey,

I have been trying to train Tesseract 4 for an ancient language, but it does not seem to recognize the language code 'xsa' (Sabaean).

[user@laptop tesstraining]$ tesstrain.sh --fonts_dir ./sabaean_fonts --lang xsa --linedata_only --noextract_font_properties --langdata_dir ./tesslang --tessdata_dir ./tessdata --output_dir .xsatrain

Creating new directory .xsatrain

=== Starting training for language 'xsa'

ERROR: Error: xsa is not a valid language code

Is this a common problem, or does Tesseract need an update to recognize the language?

Shree Devi Kumar

Mar 6, 2020, 4:24:12 AM

to tesseract-ocr

Language codes recognized for tesseract training are listed in https://github.com/tesseract-ocr/tesseract/blob/master/src/training/language-specific.sh#L21

I suggest that you train under the code of a language similar to your ancient language. You can rename the resulting file with your proper language code at the end.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/96a4bac7-ee2c-46ab-95f9-a0313099d778%40googlegroups.com.

--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

aby tesh

Mar 6, 2020, 12:35:04 PM

to tesseract-ocr

The character set of the language is new and is not in any way similar to the already supported languages.

Should I just pick, for example, Arabic to proceed? What is the language code used for in the training process, and does it affect training?

Thanks

Shree Devi Kumar

Mar 6, 2020, 12:46:49 PM

to tesseract-ocr

Is the language RTL like Arabic?

The language code is used for picking up related files from langdata or langdata_lstm repo. RTL languages have slightly different processing.
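
Whether a script is RTL can be checked against the Unicode character database instead of guessing. The following is a minimal sketch, assuming Sabaean text is encoded with the letters of the Unicode Old South Arabian block (U+10A60 to U+10A7C); the function name `is_rtl_block` is made up for illustration.

```python
import unicodedata

# Assumption: Sabaean (xsa) uses the letters of the Unicode
# "Old South Arabian" block, U+10A60..U+10A7C.
OLD_SOUTH_ARABIAN_LETTERS = range(0x10A60, 0x10A7D)

def is_rtl_block(code_points):
    """Return True if every code point has a strong right-to-left bidi class."""
    # 'R' and 'AL' are the strong right-to-left classes in Unicode.
    return all(unicodedata.bidirectional(chr(cp)) in ("R", "AL")
               for cp in code_points)

print(is_rtl_block(OLD_SOUTH_ARABIAN_LETTERS))  # the script in question
print(is_rtl_block(range(0x41, 0x5B)))          # Latin A-Z, for contrast
```

If the first check reports a right-to-left script, borrowing the code of an RTL language such as ara is the closer match for the "slightly different processing" mentioned above.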

saman ukh

Mar 6, 2020, 12:52:44 PM

to tesser...@googlegroups.com

Thank you for getting in touch. Yes, the language is RTL. Is there a proper solution?

What is the first step I should take?

Shree Devi Kumar

Mar 6, 2020, 12:57:53 PM

to tesseract-ocr

If you plan to use ara as the language code, you should replace the files in --langdata_dir ./tesslang/ara with the files for your language, e.g. the training text, wordlist, etc.
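
The file swap could be sketched roughly as follows. This is only an illustration: the helper `install_langdata` and the source file name are hypothetical. The `<code>/<code>.training_text` layout follows the `./tesslang/eng/eng.training_text` path that tesstrain.sh reads in the log further down this thread, and `<code>.wordlist` is an assumed name for the word list.

```python
import shutil
import tempfile
from pathlib import Path

def install_langdata(langdata_dir, code, training_text, wordlist=None):
    """Copy your own files into langdata_dir/<code>/ under the borrowed code.

    '<code>.training_text' matches the path seen in this thread's log;
    '<code>.wordlist' is an assumed name for the word list file.
    """
    target = Path(langdata_dir) / code
    target.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(training_text, target / f"{code}.training_text")
    if wordlist is not None:
        shutil.copyfile(wordlist, target / f"{code}.wordlist")
    return target

# Demo in a throwaway directory so nothing real is overwritten.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "my_sabaean_text.txt"   # hypothetical source file
    src.write_text("\U00010A68\U00010A6C\U00010A7A\n", encoding="utf-8")
    out = install_langdata(Path(tmp) / "tesslang", "ara", src)
    print(sorted(p.name for p in out.iterdir()))
```

Keeping a backup of the original ara files before overwriting them is a sensible precaution.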

saman ukh

Mar 6, 2020, 1:01:12 PM

to tesser...@googlegroups.com

Yes, this is my plan. Can you help me with the steps to change these files and train the system?

Shree Devi Kumar

Mar 6, 2020, 1:03:47 PM

to tesseract-ocr

See https://github.com/Shreeshrii/tessdata_brahmi/blob/master/brah.sh

I used eng as the starting language and trained for the Brahmi script.

You can look at the repo as an example for training.

saman ukh

Mar 6, 2020, 1:07:55 PM

to tesser...@googlegroups.com

Thank you very much. I will have a look at the repo and get back to you if I run into difficulties.

aby tesh

Mar 6, 2020, 4:28:12 PM

to tesseract-ocr

I think it is most likely right-to-left. Using eng gets past that error, since eng is the only language I have traineddata for. The other issue I am now encountering is:

=== Starting training for language 'eng'

[Sat 07 Mar 2020 12:26:06 AM EAT] /usr/bin/text2image --fonts_dir=./sabaean_fonts/ --ptsize 12 --font=Sabaean --outputbase=/tmp/fc-cache/sample_text.txt --text=/tmp/fc-cache/sample_text.txt --fontconfig_tmpdir=/tmp/fc-cache

Fontconfig warning: "/tmp/fc-cache/fonts.conf", line 4: Use of ambiguous path in <dir> element. please add prefix="cwd" if current behavior is desired.

Stripped 1 unrenderable words

Rendered page 0 to file /tmp/fc-cache/sample_text.txt.tif

=== Phase I: Generating training images ===

Rendering using Sabaean

[Sat 07 Mar 2020 12:26:08 AM EAT] /usr/bin/text2image --fontconfig_tmpdir=/tmp/fc-cache --fonts_dir=./sabaean_fonts/ --strip_unrenderable_words --leading=32 --xsize=3600 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/eng-2020-03-07.lif/eng.Sabaean.exp0 --max_pages=0 --font=Sabaean --ptsize 12 --text=./tesslang/eng/eng.training_text

Fontconfig warning: "/tmp/fc-cache/fonts.conf", line 4: Use of ambiguous path in <dir> element. please add prefix="cwd" if current behavior is desired.

Stripped 2 unrenderable words

Rendered page 0 to file /tmp/eng-2020-03-07.lif/eng.Sabaean.exp0.tif

=== Phase UP: Generating unicharset and unichar properties files ===

[Sat 07 Mar 2020 12:26:08 AM EAT] /usr/bin/unicharset_extractor --output_unicharset /tmp/eng-2020-03-07.lif/eng.unicharset --norm_mode 1 /tmp/eng-2020-03-07.lif/eng.Sabaean.exp0.box

Failed to read data from: /tmp/eng-2020-03-07.lif/eng.Sabaean.exp0.box

Wrote unicharset file /tmp/eng-2020-03-07.lif/eng.unicharset

[Sat 07 Mar 2020 12:26:08 AM EAT] /usr/bin/set_unicharset_properties -U /tmp/eng-2020-03-07.lif/eng.unicharset -O /tmp/eng-2020-03-07.lif/eng.unicharset -X /tmp/eng-2020-03-07.lif/eng.xheights --script_dir=./langdata

Loaded unicharset of size 3 from file /tmp/eng-2020-03-07.lif/eng.unicharset

Setting unichar properties

Setting script properties

Failed to load script unicharset from:./langdata/Latin.unicharset

Writing unicharset to file /tmp/eng-2020-03-07.lif/eng.unicharset

=== Phase E: Generating lstmf files ===

Using TESSDATA_PREFIX=./tessdata/

[Sat 07 Mar 2020 12:26:08 AM EAT] /usr/bin/tesseract /tmp/eng-2020-03-07.lif/eng.Sabaean.exp0.tif /tmp/eng-2020-03-07.lif/eng.Sabaean.exp0 --psm 6 lstm.train

read_params_file: Can't open lstm.train

Tesseract Open Source OCR Engine v4.1.1 with Leptonica

Page 1

ERROR: /tmp/eng-2020-03-07.lif/eng.Sabaean.exp0.lstmf does not exist or is not readable

Shree Devi Kumar

Mar 7, 2020, 8:24:50 AM

to tesseract-ocr

I have created an example traineddata for xsa. I will upload later today. You can then modify with a larger training text and run training.

Shree Devi Kumar

Mar 7, 2020, 12:59:18 PM

to tesseract-ocr

Please see https://github.com/Shreeshrii/tesstrain-xsa

aby tesh

Mar 9, 2020, 11:55:28 AM

to tesseract-ocr

Oh, a very nice repo. I will check it out and get back to you.

Thanks!

aby tesh

Mar 9, 2020, 6:11:52 PM

to tesseract-ocr

Hey,

I followed the steps in the README file and started lstmtraining, but it seems my current computer's processor can't sustain the training for long.

What can I do about it? At what point can I stop the training and still get a good traineddata file? Or is there an accurate one that you can share?

Thanks

Shree Devi Kumar

Mar 9, 2020, 10:32:17 PM

to tesseract-ocr

https://github.com/tesseract-ocr/tessdoc/blob/master/TrainingTesseract-4.00.md#hardware-software-requirements

Shree Devi Kumar

Mar 9, 2020, 10:35:26 PM

to tesseract-ocr

If you can share a large enough training text and fonts, I can rerun the training.

aby tesh

Mar 12, 2020, 2:31:20 PM

to tesseract-ocr

I can't get any special fonts; the ones I already have were downloaded from the web.

aby tesh

Mar 14, 2020, 1:07:35 PM

to tesseract-ocr

Hey shree, I have compiled all the relevant fonts and attached them below. I am not sure how I can generate text data with them.

xsa_fonts.zip

Shree Devi Kumar

Mar 14, 2020, 1:45:46 PM

to tesseract-ocr

Are all these Unicode fonts?

What about training text in utf-8 Unicode encoding?
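
One way to approach the second question is to inspect the code points in a candidate training text. Text typed through a font that remaps Latin letters shows up as ordinary Latin code points, while genuine Unicode text for this script should sit in the Old South Arabian block. A rough sketch, assuming that block (U+10A60 to U+10A7F) is the target; `script_coverage` is an illustrative name, not part of any Tesseract tool.

```python
def script_coverage(text, first=0x10A60, last=0x10A7F):
    """Fraction of non-whitespace characters whose code point is in [first, last]."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    inside = sum(1 for c in chars if first <= ord(c) <= last)
    return inside / len(chars)

unicode_sample = "\U00010A68\U00010A6C\U00010A7A"  # three Old South Arabian letters
latin_sample = "bny"                                # Latin letters only
print(script_coverage(unicode_sample))
print(script_coverage(latin_sample))
```

For a real file, the text can be read with `open(path, encoding="utf-8").read()`; a decoding error at that point would already indicate that the file is not UTF-8.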

aby tesh

Mar 14, 2020, 6:31:01 PM

to tesseract-ocr

That is the part I don't understand. I don't think they are all Unicode fonts, and I couldn't find one that is. Some render on my machine (Linux), some don't.

aby tesh

Mar 14, 2020, 6:34:00 PM

to tesseract-ocr

Even Google's Noto font doesn't show glyphs when I open it with GNOME Fonts. Does that mean it is not a Unicode font?

Shree Devi Kumar

Mar 14, 2020, 9:32:08 PM

to tesseract-ocr

I used the --find_fonts feature of text2image and found only two fonts that rendered the xsa text: Quivira and Segoe UI Historic. I will check the fonts that you sent. What about training text? Unless you have some more text, it will be difficult to do training.

aby tesh

Mar 15, 2020, 8:27:25 AM

to tesseract-ocr

Where can I get training text, or can I create a new one? I have trouble typing with the fonts, some of which are included in the attachment I sent you.

Lorenzo Bolzani

Mar 15, 2020, 8:51:17 AM

to tesser...@googlegroups.com

Common fonts do not cover every Unicode character (there are more than 100,000).

If one font works and another does not, the text is correct and you just need to find fonts covering that language.

Lorenzo

aby tesh

Mar 15, 2020, 8:58:35 AM

to tesseract-ocr

Most of the fonts I have found cover the language's Unicode range, although they aren't recognized by the system (Linux).

Shree Devi Kumar

Mar 15, 2020, 11:56:00 AM

to tesseract-ocr

There is no online corpus for xsa that I could find.

Two of the fonts you sent are legacy fonts; that is, they map English letters to the ancient South Arabian characters instead of using the Unicode code points.

Are there any converters that convert from the legacy mapping to Unicode?

If there is existing text in legacy fonts, it can be converted to Unicode and that can be used for training.

Wincent Balin

Mar 15, 2020, 5:02:15 PM

to tesser...@googlegroups.com

Maybe http://dasi.cnr.it has something usable?

Shree Devi Kumar

Mar 15, 2020, 10:12:50 PM

to tesseract-ocr

Hi Wincent,

Thanks for the link.

I had checked that site earlier. It has text transcriptions in Latin transliteration, e.g. http://dasi.cnr.it/index.php?id=79&prjId=1&corId=5&colId=0&navId=522207406&recId=2149 (sample below). I haven't found any tool to convert these to Unicode.

   1  Yʿly w-ʾḏmr bny Whbʾl[ ... ...] ʾḏmr[ ... ... by]—
   2  t-(s¹m) Yġl b-rdʾ mrʾ-s¹[m ... ...]
   3  [... ... ]w-(b)-(rd)ʾ mrʾ-s¹m [... ...]
   4  [... ...] ʾḏ(mr) w-b-rd(ʾ)[ ... ...]

Maybe you can add a tool to https://github.com/wincentbalin/pytesstrain that creates randomly generated training text from a range of characters or a word list, similar to the existing language_metrics tool, which is described as follows:

"The tool language_metrics runs Tesseract OCR over images of random word sequences, which are created out of the supplied wordlist, ..."
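
A generator of that kind might look roughly like the sketch below. It is not the pytesstrain tool, only an illustration under stated assumptions: the 29 letters at U+10A60 to U+10A7C serve as the alphabet, "words" are 2 to 6 letters long and separated by plain spaces, and the function name and parameters are invented here. Random strings carry no real word statistics, so this is at best a stopgap for a missing corpus.

```python
import random

# Assumption: the letters of the Unicode Old South Arabian block.
ALPHABET = [chr(cp) for cp in range(0x10A60, 0x10A7D)]

def random_training_text(n_lines, words_per_line=8, min_len=2, max_len=6, seed=0):
    """Build n_lines of space-separated random 'words' drawn from ALPHABET."""
    rng = random.Random(seed)  # seeded so repeated runs give the same text
    lines = []
    for _ in range(n_lines):
        words = [
            "".join(rng.choice(ALPHABET)
                    for _ in range(rng.randint(min_len, max_len)))
            for _ in range(words_per_line)
        ]
        lines.append(" ".join(words))
    return "\n".join(lines) + "\n"

text = random_training_text(3)
print(text)
```

The result can be written to a UTF-8 file and used in place of a training text; a real word list, if one becomes available, would be a better source than uniform random letters.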

Wincent Balin

Mar 22, 2020, 4:28:39 PM

to tesser...@googlegroups.com

Hi Shree,

I will add a tool to create random text within a given Unicode range soon.

@aby tesh: Do you know anything about a converter from transliterated text to [xsa] Unicode text?

Shree Devi Kumar

Mar 24, 2020, 4:12:51 AM

to tesseract-ocr

Please see https://github.com/Shreeshrii/tesstrain-xsa/blob/master/langdata/latin2unicode.sh

It contains sed substitution commands that convert the transliteration to Unicode for xsa, based on the mappings shown on Wikipedia and other web pages.
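
For readers who prefer Python over sed, the same idea can be sketched as a lookup table. This is not the script from the repo. The table below assumes that the transliteration symbols, listed in the order of the Unicode Old South Arabian block starting at U+10A60, match commonly published tables, so it should be checked against your own reference before use. The function name is illustrative.

```python
import unicodedata

# Transliteration symbols in the assumed order of the Unicode
# Old South Arabian block (U+10A60 onward). Verify before relying on it.
TRANSLIT = ["h", "l", "ḥ", "m", "q", "w", "s²", "r", "b", "t", "s¹", "k",
            "n", "ḫ", "ṣ", "s³", "f", "ʾ", "ʿ", "ḍ", "g", "d", "ġ", "ṭ",
            "z", "ḏ", "y", "ṯ", "ẓ"]
MAPPING = {unicodedata.normalize("NFC", t): chr(0x10A60 + i)
           for i, t in enumerate(TRANSLIT)}

def to_old_south_arabian(text):
    """Replace known transliteration symbols; leave everything else unchanged."""
    text = unicodedata.normalize("NFC", text).lower()
    out, i = [], 0
    while i < len(text):
        # Try two-character symbols (s¹, s², s³) before single characters.
        for size in (2, 1):
            chunk = text[i:i + size]
            if chunk in MAPPING:
                out.append(MAPPING[chunk])
                i += len(chunk)
                break
        else:
            out.append(text[i])  # brackets, hyphens, spaces, unknown symbols
            i += 1
    return "".join(out)

print(to_old_south_arabian("bny"))
```

Editorial marks in the transliteration, such as brackets, parentheses and hyphens, pass through untouched in this sketch and would need to be stripped or handled separately before the output is used as training text.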
