Tesseract not recognizing ancient language's code


aby tesh

Mar 5, 2020, 11:37:51 PM
to tesseract-ocr
Hey, 

I have been trying to train Tesseract 4 for an ancient language, but it does not seem to recognize its code 'xsa', which is the Sabaean language.


[user@laptop tesstraining]$ tesstrain.sh --fonts_dir ./sabaean_fonts --lang xsa --linedata_only   --noextract_font_properties --langdata_dir ./tesslang   --tessdata_dir ./tessdata --output_dir .xsatrain
Creating new directory .xsatrain

=== Starting training for language 'xsa'
ERROR: Error: xsa is not a valid language code



Is this a common problem, or does Tesseract need an update to recognize the language?

Shree Devi Kumar

Mar 6, 2020, 4:24:12 AM
to tesseract-ocr
Language codes recognized for tesseract training are listed in https://github.com/tesseract-ocr/tesseract/blob/master/src/training/language-specific.sh#L21

I suggest that you train using the code of a language similar to your ancient language. You can rename the resulting file with your proper language code at the end.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/96a4bac7-ee2c-46ab-95f9-a0313099d778%40googlegroups.com.



aby tesh

Mar 6, 2020, 12:35:04 PM
to tesseract-ocr
The character set of the language is new and is not in any way similar to those of the already supported languages.

Should I just pick, for example, Arabic to proceed? What is the language code used for in the training process, and does it affect the training?

Thanks 

Shree Devi Kumar

Mar 6, 2020, 12:46:49 PM
to tesseract-ocr
Is the language RTL like Arabic?

The language code is used for picking up related files from langdata or langdata_lstm repo. RTL languages have slightly different processing.
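
Whether a script is right-to-left can be checked against the Unicode character database instead of guessing. Below is a minimal Python sketch, assuming the Sabaean text is encoded in the Unicode Old South Arabian block (U+10A60 to U+10A7F). The function name is made up for illustration and is not part of Tesseract.

```python
import unicodedata

# Assumption: Sabaean is encoded in the Unicode "Old South Arabian" block.
OLD_SOUTH_ARABIAN = range(0x10A60, 0x10A80)


def is_rtl_char(ch: str) -> bool:
    """Return True if the character has a right-to-left bidi class."""
    # "R" = right-to-left, "AL" = right-to-left Arabic letter.
    return unicodedata.bidirectional(ch) in ("R", "AL")


first_letter = chr(OLD_SOUTH_ARABIAN.start)
print(unicodedata.name(first_letter), is_rtl_char(first_letter))
```

This only reports how Unicode classifies the characters; how the training scripts treat a given language code is a separate matter.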




saman ukh

Mar 6, 2020, 12:52:44 PM
to tesser...@googlegroups.com
Thank you for being in contact with me. Yes, the language is RTL. Is there a proper solution?

What is the first step I should take?

Shree Devi Kumar

Mar 6, 2020, 12:57:53 PM
to tesseract-ocr
If you plan to use ara as the language code, you should replace the files in ./tesslang/ara (under your --langdata_dir ./tesslang) with the files for your language, e.g. the training text, wordlist, etc.
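
A rough Python sketch of that file swap follows. It assumes the langdata layout `<langdata_dir>/<code>/<code>.<suffix>` and the suffixes `training_text` and `wordlist`; the helper name and the source file names (`xsa.training_text`, etc.) are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def install_langdata(src_dir, langdata_dir, src_code, dst_code,
                     suffixes=("training_text", "wordlist")):
    """Copy <src_code>.<suffix> to <langdata_dir>/<dst_code>/<dst_code>.<suffix>."""
    target = Path(langdata_dir) / dst_code
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for suffix in suffixes:
        src = Path(src_dir) / f"{src_code}.{suffix}"
        if src.is_file():  # skip files that have not been prepared
            dst = target / f"{dst_code}.{suffix}"
            shutil.copyfile(src, dst)
            copied.append(dst)
    return copied


# Self-contained demo in a temporary directory with a placeholder file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "xsa.training_text").write_text("placeholder\n", encoding="utf-8")
    for path in install_langdata(root, root / "tesslang", "xsa", "ara"):
        print(path.relative_to(root))
```

Copying rather than renaming keeps the original files intact, so the same text can be reinstalled under a different code later.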

saman ukh

Mar 6, 2020, 1:01:12 PM
to tesser...@googlegroups.com
Yes, this is my plan. Can you help me with the steps to change these files and train the system?

Shree Devi Kumar

Mar 6, 2020, 1:03:47 PM
to tesseract-ocr
See https://github.com/Shreeshrii/tessdata_brahmi/blob/master/brah.sh

I used eng as the starting language and trained for the Brahmi script.

You can look at the repo as an example for training.

saman ukh

Mar 6, 2020, 1:07:55 PM
to tesser...@googlegroups.com
Thank you very much. I will have a look at the repo and get back to you if I run into difficulties.

aby tesh

Mar 6, 2020, 4:28:12 PM
to tesseract-ocr
I think it is most likely right-to-left. I got past that error by using eng, since I only have the traineddata for it. The other issue I am now encountering is:

=== Starting training for language 'eng'
[Sat 07 Mar 2020 12:26:06 AM EAT] /usr/bin/text2image --fonts_dir=./sabaean_fonts/ --ptsize 12 --font=Sabaean --outputbase=/tmp/fc-cache/sample_text.txt --text=/tmp/fc-cache/sample_text.txt --fontconfig_tmpdir=/tmp/fc-cache
Fontconfig warning: "/tmp/fc-cache/fonts.conf", line 4: Use of ambiguous path in <dir> element. please add prefix="cwd" if current behavior is desired.
Stripped 1 unrenderable words
Rendered page 0 to file /tmp/fc-cache/sample_text.txt.tif

=== Phase I: Generating training images ===
Rendering using Sabaean
[Sat 07 Mar 2020 12:26:08 AM EAT] /usr/bin/text2image --fontconfig_tmpdir=/tmp/fc-cache --fonts_dir=./sabaean_fonts/ --strip_unrenderable_words --leading=32 --xsize=3600 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/eng-2020-03-07.lif/eng.Sabaean.exp0 --max_pages=0 --font=Sabaean --ptsize 12 --text=./tesslang/eng/eng.training_text
Fontconfig warning: "/tmp/fc-cache/fonts.conf", line 4: Use of ambiguous path in <dir> element. please add prefix="cwd" if current behavior is desired.
Stripped 2 unrenderable words
Rendered page 0 to file /tmp/eng-2020-03-07.lif/eng.Sabaean.exp0.tif

=== Phase UP: Generating unicharset and unichar properties files ===
[Sat 07 Mar 2020 12:26:08 AM EAT] /usr/bin/unicharset_extractor --output_unicharset /tmp/eng-2020-03-07.lif/eng.unicharset --norm_mode 1 /tmp/eng-2020-03-07.lif/eng.Sabaean.exp0.box
Failed to read data from: /tmp/eng-2020-03-07.lif/eng.Sabaean.exp0.box
Wrote unicharset file /tmp/eng-2020-03-07.lif/eng.unicharset
[Sat 07 Mar 2020 12:26:08 AM EAT] /usr/bin/set_unicharset_properties -U /tmp/eng-2020-03-07.lif/eng.unicharset -O /tmp/eng-2020-03-07.lif/eng.unicharset -X /tmp/eng-2020-03-07.lif/eng.xheights --script_dir=./langdata
Loaded unicharset of size 3 from file /tmp/eng-2020-03-07.lif/eng.unicharset
Setting unichar properties
Setting script properties
Failed to load script unicharset from:./langdata/Latin.unicharset
Writing unicharset to file /tmp/eng-2020-03-07.lif/eng.unicharset

=== Phase E: Generating lstmf files ===
Using TESSDATA_PREFIX=./tessdata/
[Sat 07 Mar 2020 12:26:08 AM EAT] /usr/bin/tesseract /tmp/eng-2020-03-07.lif/eng.Sabaean.exp0.tif /tmp/eng-2020-03-07.lif/eng.Sabaean.exp0 --psm 6 lstm.train
read_params_file: Can't open lstm.train
Tesseract Open Source OCR Engine v4.1.1 with Leptonica
Page 1
ERROR: /tmp/eng-2020-03-07.lif/eng.Sabaean.exp0.lstmf does not exist or is not readable
 

Shree Devi Kumar

Mar 7, 2020, 8:24:50 AM
to tesseract-ocr
I have created an example traineddata for xsa. I will upload it later today. You can then modify it with a larger training text and run the training.


Shree Devi Kumar

Mar 7, 2020, 12:59:18 PM
to tesseract-ocr

aby tesh

Mar 9, 2020, 11:55:28 AM
to tesseract-ocr
Oh, a very nice repo. I will check it out and get back to you.


Thanks!

aby tesh

Mar 9, 2020, 6:11:52 PM
to tesseract-ocr
Hey,

I followed the steps in the readme file and started lstmtraining, but it seems my current computer's processor can't handle training for an extended period.

What can I do about it? When should I stop the training to get a good traineddata file? Or is there an accurate one that you can share?

Thanks

Shree Devi Kumar

Mar 9, 2020, 10:32:17 PM
to tesseract-ocr


Shree Devi Kumar

Mar 9, 2020, 10:35:26 PM
to tesseract-ocr
If you can share a large enough training text and fonts, I can rerun the training.


aby tesh

Mar 12, 2020, 2:31:20 PM
to tesseract-ocr
I can't get any special fonts; the ones I already have were downloaded from the web.



aby tesh

Mar 14, 2020, 1:07:35 PM
to tesseract-ocr
Hey Shree, I have compiled all the relevant fonts and attached them below. I am not sure how I can generate text data with them.


xsa_fonts.zip

Shree Devi Kumar

Mar 14, 2020, 1:45:46 PM
to tesseract-ocr
Are all these Unicode fonts?

What about training text in utf-8 Unicode encoding?


aby tesh

Mar 14, 2020, 6:31:01 PM
to tesseract-ocr
That is what I am not getting. I don't think they are all Unicode fonts; I couldn't find one. Some render on my machine (Linux), some don't.



aby tesh

Mar 14, 2020, 6:34:00 PM
to tesseract-ocr
Even Google's Noto font doesn't show glyphs when I open it with GNOME Fonts. Does that mean it is not a Unicode font?



Shree Devi Kumar

Mar 14, 2020, 9:32:08 PM
to tesseract-ocr
I used the find_fonts feature of text2image and found only two fonts that rendered the xsa text (listed below). I will check the fonts that you sent. What about training text? Unless you have some more text, it will be difficult to do training.

Quivira
Segoe UI Historic


aby tesh

Mar 15, 2020, 8:27:25 AM
to tesseract-ocr
Where can I get the training text, or can I create a new one? I have a problem typing with some of the fonts included in the attachment I sent you.

Lorenzo Bolzani

Mar 15, 2020, 8:51:17 AM
to tesser...@googlegroups.com
Common fonts do not cover every Unicode character (there are more than 100,000).

If one font works and another does not, the text is correct and you just need to find fonts covering that language.

Lorenzo 



aby tesh

Mar 15, 2020, 8:58:35 AM
to tesseract-ocr
Most of the fonts I have found cover the language's Unicode range, although they aren't recognized by the system (Linux).

Shree Devi Kumar

Mar 15, 2020, 11:56:00 AM
to tesseract-ocr
There is no online corpus for xsa that I could find. 

Two of the fonts you sent are legacy fonts, that is they map English letters to ancient Arabic characters.

Are there any converters that convert from the legacy mapping to Unicode?

If there is existing text in legacy fonts, it can be converted to Unicode and that can be used for training.


Wincent Balin

Mar 15, 2020, 5:02:15 PM
to tesser...@googlegroups.com

Shree Devi Kumar

Mar 15, 2020, 10:12:50 PM
to tesseract-ocr
Hi Wincent,
Thanks for the link. 

I had checked that site earlier. It has text transcriptions in Latin transliteration, e.g. http://dasi.cnr.it/index.php?id=79&prjId=1&corId=5&colId=0&navId=522207406&recId=2149 but I haven't found any tool that converts them to Unicode.

   1  Yʿly w-ʾḏmr bny Whbʾl[ ... ...] ʾḏmr[ ... ... by]—
   2  t-(s¹m) Yġl b-rdʾ mrʾ-s¹[m ... ...]
   3  [... ... ]w-(b)-(rd)ʾ mrʾ-s¹m [... ...]
   4  [... ...] ʾḏ(mr) w-b-rd(ʾ)[ ... ...] 

Maybe you can add a tool to https://github.com/wincentbalin/pytesstrain to create randomly generated training text from a range of characters or a word list, similar to the existing language_metrics tool:

The tool language_metrics runs Tesseract OCR over images of random word sequences, which are created out of the supplied wordlist,   

Wincent Balin

Mar 22, 2020, 4:28:39 PM
to tesser...@googlegroups.com
Hi Shree,

I will add a tool to create random text within Unicode range soon.
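
For illustration only, random text within a Unicode range could be generated along these lines. This is a minimal Python sketch, not the pytesstrain tool; it assumes the 29 letters of the Old South Arabian block (U+10A60 to U+10A7C) and uses plain spaces between words, which may not match real Sabaean word division.

```python
import random

# Assumption: the 29 letters of the Unicode Old South Arabian block
# (U+10A60..U+10A7C); the numeric signs U+10A7D..U+10A7F are left out.
LETTERS = [chr(cp) for cp in range(0x10A60, 0x10A7D)]


def random_training_text(lines=5, words_per_line=6, max_word_len=5, seed=0):
    """Return lines of space-separated pseudo-words made of random letters."""
    rng = random.Random(seed)  # fixed seed keeps the output reproducible
    text = []
    for _ in range(lines):
        words = [
            "".join(rng.choice(LETTERS) for _ in range(rng.randint(1, max_word_len)))
            for _ in range(words_per_line)
        ]
        text.append(" ".join(words))
    return "\n".join(text)


print(random_training_text(lines=2))
```

Uniformly random letters ignore real letter frequencies and word shapes, so text like this is at best a stopgap until genuine Sabaean text is available.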

@aby tesh: Do you know anything about a converter from transliterated text to [xsa] Unicode text?

Shree Devi Kumar

Mar 24, 2020, 4:12:51 AM
to tesseract-ocr
Please see https://github.com/Shreeshrii/tesstrain-xsa/blob/master/langdata/latin2unicode.sh

It has sed substitution commands for going from transliteration to Unicode for xsa, based on mapping shown in Wikipedia and other web pages.
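
The same substitution idea can be sketched in Python. The mapping below is a small, hypothetical subset chosen for illustration and is not copied from latin2unicode.sh; the characters are looked up by their Unicode names, so no code points are hard-coded.

```python
import unicodedata

# Hypothetical, partial mapping for illustration only; a complete table
# would cover the whole alphabet, including digraphs such as s¹.
NAMES = {
    "b": "BETH", "h": "HE", "l": "LAMEDH", "m": "MEM",
    "n": "NUN", "r": "RESH", "w": "WAW", "y": "YODH",
}
MAPPING = {
    latin: unicodedata.lookup(f"OLD SOUTH ARABIAN LETTER {name}")
    for latin, name in NAMES.items()
}


def transliterate(text: str) -> str:
    """Replace mapped Latin letters; leave everything else untouched."""
    out = text.lower()
    # Longest keys first, so multi-character keys win once they are added.
    for latin in sorted(MAPPING, key=len, reverse=True):
        out = out.replace(latin, MAPPING[latin])
    return out


print(transliterate("bny"))
```

Characters without an entry, such as brackets and hyphens in the transcriptions, pass through unchanged and would need to be handled separately before training.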

