Advice on training for Old Amharic texts


Menelik Berhan

Jan 13, 2024, 3:21:29 AM
to tesseract-ocr
Background
I'm trying to use tesseract 5.3.3 on scanned old books written in Amharic (which uses Ethiopic script).

Major Shortcomings of amh.traineddata from tesseract

Difference in character set: old Amharic texts contain Ethiopic script characters that are not present in the unicharset of amh.traineddata.

Difference in punctuation styles: the old texts use some punctuation marks that modern Amharic does not, and for the marks used in both, the spacing pattern differs. The old texts always put a space between a punctuation character and both the preceding and the following word, whereas modern texts put no space between the punctuation character and the preceding word.
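
To make modern text follow the old spacing convention when preparing a training_text, a small normalisation step could look like the sketch below. It assumes the marks concerned are the Ethiopic punctuation characters U+1362 to U+1368 and leaves the wordspace U+1361 alone; both the character range and the function name `old_style_spacing` are illustrative choices, not something established by the old books themselves.

```python
import re

# Assumption: the punctuation affected is U+1362..U+1368 (full stop, comma,
# semicolon, colon, preface colon, question mark, paragraph separator).
# The Ethiopic wordspace U+1361 is deliberately left out.
ETHIOPIC_PUNCT = "\u1362-\u1368"


def old_style_spacing(line: str) -> str:
    """Put exactly one space on each side of every Ethiopic punctuation mark."""
    # Swallow any whitespace around the mark, then re-insert one space per side.
    padded = re.sub(f"\\s*([{ETHIOPIC_PUNCT}])\\s*", r" \1 ", line)
    # Adjacent marks produce doubled spaces; collapse them and trim the ends.
    return re.sub(r" {2,}", " ", padded).strip()
```

Applied line by line to a modern corpus, this would give text whose spacing is closer to the old pattern before it is rendered or added to a wordlist.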

Very narrow training_text & wordlist (based on tesseract/langdata_lstm)
The amh.training_text and amh.wordlist files used by tesseract (the ones from langdata_lstm) are very small. To give you an idea: tir.training_text from langdata_lstm (for tir.traineddata, another language that uses Ethiopic script) has more than 400,000 lines, while amh.training_text has only around 400 lines.

Other challenges
  • The old Amharic books use a font that is no longer in use (or available).
  • The old Amharic books contain many Ge'ez words (Ge'ez is a liturgical language, like Latin, that uses Ethiopic script).
  • The old Amharic books mostly use Ge'ez numerals, while modern Amharic texts use Arabic numerals.
WHAT I'VE DONE SO FAR
As an experiment, I fine-tuned amh.traineddata_best (using `make training`) for 10,000 iterations with close to 300 line images and their transcriptions (from sample pages of some old Amharic books), together with files from langdata_lstm.
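
Before a run like this, it can help to confirm that every line image has a transcription next to it. The sketch below assumes the tesstrain convention of pairing `name.tif` or `name.png` with `name.gt.txt` in one ground-truth directory; the helper name `unpaired_lines` and the restriction to those two image suffixes are assumptions made for illustration.

```python
from pathlib import Path

# Assumption: only plain .tif and .png line images are present.
IMAGE_SUFFIXES = (".tif", ".png")


def unpaired_lines(gt_dir: str) -> list[str]:
    """Return names of line images that lack a non-empty matching .gt.txt file."""
    problems = []
    for image in sorted(Path(gt_dir).iterdir()):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        # line_001.tif is expected to sit beside line_001.gt.txt
        transcript = image.with_suffix(".gt.txt")
        if not transcript.is_file() or not transcript.read_text(encoding="utf-8").strip():
            problems.append(image.name)
    return problems
```

An empty result means every image in the directory has a transcription with some text in it; it says nothing about whether the transcription is correct.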

The resulting traineddata has a very satisfactory improvement in addressing some of the challenges mentioned above, especially those regarding punctuation chars.

But it still fails to solve the problems I have with some characters (the ones not present in the unicharset of amh.traineddata) and fails for almost all Ge'ez numerals (even though the training sample pages contain many of them).
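
One way to list exactly which characters are affected is to compare the ground-truth text against the model's unicharset. The sketch below assumes the unicharset has already been extracted to a text file (for example with `combine_tessdata -u`), that its first line is the entry count, and that each later line starts with the symbol followed by a space. It compares single code points only, so multi-code-point entries are not handled; the function names are illustrative.

```python
from collections import Counter


def unicharset_symbols(unicharset_text: str) -> set[str]:
    """Collect the first field of every entry line of a unicharset file."""
    lines = unicharset_text.splitlines()[1:]  # the first line holds the entry count
    return {line.split(" ", 1)[0] for line in lines if line.strip()}


def uncovered_chars(ground_truth: str, symbols: set[str]) -> Counter:
    """Count non-space characters of the ground truth that the symbol set lacks."""
    return Counter(
        ch for ch in ground_truth if not ch.isspace() and ch not in symbols
    )
```

Running `uncovered_chars` over all `.gt.txt` files would show, with frequencies, whether the Ge'ez numerals (U+1369 to U+137C) and the older syllables are missing from the starting model.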

WHAT I'M PLANNING TO DO
First, I want to train tesseract with large training_text and wordlist files and a complete unicharset file.
Then I want to fine-tune the resulting traineddata on sample line images from the old books.

QUESTIONS (for now. I'll definitely add more questions later)
Is there another path I should take that would get me to where I want?

Regarding training tesseract with large training_text & wordlist files, and also a complete unicharset file:
  • How should I prepare the training_text and wordlist files? (What should the text files contain?)
  • How should I prepare the unicharset file, and how do I pass it to the `make training` command?

Regarding generating a text, image(tif) and box file from training_text:

I've looked up Python scripts to do this job, but I have questions about the proper values for these parameters in text2image:
--font (what criteria should I use to select the list of fonts),
--leading, --xsize, --ysize, --char_spacing, --exposure, --unicharset_file and --margin. 
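
For reference, the sketch below shows how these flags fit together in one text2image call assembled from Python. The numeric values are placeholders that I believe match the tool's defaults (worth checking against `text2image --help`), not recommendations for old Amharic books; the function name and its arguments are illustrative, and `--unicharset_file` is left out because its proper value is one of the open questions.

```python
def text2image_command(text_file: str, output_base: str,
                       font: str, fonts_dir: str) -> list[str]:
    """Assemble a text2image call; every numeric value is a placeholder to tune."""
    return [
        "text2image",
        "--text", text_file,          # UTF-8 training text to render
        "--outputbase", output_base,  # prefix for the generated .tif and .box files
        "--font", font,               # one font per call; loop over a font list
        "--fonts_dir", fonts_dir,     # directory where the font files live
        "--ptsize", "12",
        "--resolution", "300",
        "--leading", "12",            # inter-line space, in pixels
        "--char_spacing", "0.0",      # extra inter-character space, in ems
        "--exposure", "0",            # negative lightens, positive darkens the glyphs
        "--xsize", "3600",            # page width, in pixels
        "--ysize", "4800",            # page height, in pixels
        "--margin", "100",            # page margin, in pixels
    ]
```

The returned list can be passed to `subprocess.run(..., check=True)` once per font, for example with an Ethiopic font such as Abyssinica SIL if it is installed.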

I've noticed in the tesstrain repo for tesseract 5 that the line images are tightly cropped (with minimal margin around the text line). Is the same property (minimal margins) required or desirable for the line images generated with text2image from the training_text?

THANKS FOR YOUR TIME !!!

Dellu Bw

Jan 13, 2024, 4:49:36 AM
to tesser...@googlegroups.com
I spent some time trying to improve the default Amharic model. The default model has a couple of characters missing. As I have noted in many posts in this forum, training by removing the top layer is the best method to introduce new characters.

But I really struggled because the training was degrading the base (default) model. I am also short of processing power.
Tesseract 5.3 also has some flaws that make it hard to use in developing countries (electric blackouts).

Dear Menelik, we might need to join hands on this.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/9bda9bc4-b07a-491b-b8fc-fbb25b54c368n%40googlegroups.com.

Menelik Berhan

Jan 13, 2024, 8:07:57 AM
to tesseract-ocr
Thanks for your swift reply. It would be my pleasure to collaborate with you.

I've noticed that there are extensive guides and tutorials on training tesseract 4.x, so I was considering switching to version 4.x.
What would be the trade-off if I used tesseract 4.x instead of 5.x?

Thanks for your time!!!

Dellu Bw

Jan 14, 2024, 7:14:28 AM
to tesser...@googlegroups.com
Most of the guides written for version 4 actually work for version 5. The changes are minimal. It is better to keep version 5 because it seems to perform better. Are you using Linux?

Dellu Bw

Jan 14, 2024, 7:22:31 AM
to tesser...@googlegroups.com
Hi Menelik, are you in Addis?
I have figured out most of the workings of Tesseract. I really fell into a trap because of the electric blackouts and the underpowered PC. I feel that we can train everything Ethiopic (Ge'ez, Amharic, Tigrinya and every other language) in one sweep. I have about 8 GB of data to train Amharic, but my PC just cannot handle it. We can meet in person and generate (collect) more data to include the other Ethiopic languages and train it.
(Sorry, I am writing on my phone.)

Menelik Berhan

Jan 14, 2024, 8:06:41 AM
to tesser...@googlegroups.com
Yes, I'm in Addis.
My PC is not that powerful either, but I could find a couple of good desktop PCs for the training.

It would be my pleasure to meet in person. I have some questions about the training process that I'll ask when we meet.

I'm free almost all day after 10 a.m. EAT (ketewatu arat seat, that is, four o'clock in the morning by the local Ethiopian clock). Let me know a time and place that is convenient for you.

Thanks


Menelik Berhan

Jan 14, 2024, 8:07:16 AM
to tesser...@googlegroups.com
And yes, I'm using Ubuntu 20.04 on Windows with WSL.