Help for training tesseract to recognize a new (dead) language

ramas...@gmail.com

May 29, 2018, 2:39:16 PM
to tesseract-ocr
Hi,
I belong to a group that studies an old Egyptian writing system called "Coptic".
It's based mostly on Greek (with some variations).

The great majority of books written in Coptic were printed during the last century, mostly in the same [typewriter] font.
Here is a sample picture:
And sample book:

We need to add Coptic to the languages supported by Tesseract but are not sure how.
I tried following this document https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00 but it is very difficult to understand.

We need someone to help us with the initial setup so that we can dedicate our manpower to training the system.
We are a non-profit group, so we are hoping for free help, but we would also consider paid help, since the alternative is hundreds of hours of manual labor to digitize just a few books.

Thanks, everyone, for contributing to this awesome project.

ShreeDevi Kumar

May 29, 2018, 2:52:44 PM
to tesser...@googlegroups.com

You can use it with image files and matching ground-truth text in UTF-8.
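
For illustration, here is a minimal sketch of checking that every line image has a matching UTF-8 ground-truth file. The `.tif`/`.png` and `.gt.txt` naming and the `check_pairs` helper are assumptions made for this example, not something stated in the thread; adjust them to whatever your training tool expects.

```python
from pathlib import Path

# Assumed naming convention (not stated in the thread): each line image
# "NAME.tif" or "NAME.png" is paired with a transcription "NAME.gt.txt".
IMAGE_SUFFIXES = (".tif", ".png")
GT_SUFFIX = ".gt.txt"


def check_pairs(folder):
    """Return (paired, missing, unreadable) lists of image base names."""
    paired, missing, unreadable = [], [], []
    for image in sorted(Path(folder).iterdir()):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip the text files and anything else
        gt = image.with_name(image.stem + GT_SUFFIX)
        if not gt.exists():
            missing.append(image.stem)
            continue
        try:
            text = gt.read_text(encoding="utf-8").strip()
        except UnicodeDecodeError:
            # The transcription exists but is not valid UTF-8.
            unreadable.append(image.stem)
            continue
        # An empty transcription is treated like a missing one.
        (paired if text else missing).append(image.stem)
    return paired, missing, unreadable
```

Running it over the training folder before a long training run is a cheap way to catch images without text, or text saved in the wrong encoding.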

ShreeDevi
____________________________________________________________
Bhajan - Kirtan - Aarti @ http://bhajans.ramparivar.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/08869d08-8b3a-4390-be79-fa811c78c0ca%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

ShreeDevi Kumar

May 30, 2018, 12:32:44 AM
to tesser...@googlegroups.com

It provides a traineddata file for Coptic for use with tesseract version 3.

ShreeDevi

Ramast Magdy

May 30, 2018, 6:39:44 AM
to tesser...@googlegroups.com, ShreeDevi Kumar
Thank you ShreeDevi for both moheb's link and the one below.
The current one uses Tesseract 3 and according to the author:
"Recognition quality of Coptic texts containing old fonts will be very poor, depending on the trained data."

I will get in contact with him to see if we can use the other link you provided, https://github.com/OCR-D/ocrd-train, to train Tesseract 4.00.

Thank you very much

ShreeDevi Kumar

May 30, 2018, 6:45:44 AM
to Ramast Magdy, tesser...@googlegroups.com
I am trying a test training for Coptic for Tesseract 4 and will let you know where to access the traineddata.

You can train using UTF-8 text and Unicode Coptic fonts.

1. Collect UTF-8 text in Coptic.
2. Find Coptic Unicode fonts. If you can find one similar to the typewriter font used in the books, it will make training easier.
3. Train a model with these, and then fine-tune it with line images and matching ground truth.
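
As a rough aid for steps 1 and 2, the sketch below lists every code point that occurs in the collected text, so you can check whether a candidate font covers all of them, including the combining marks. The helper names are made up for this example, and the sample string is taken from the transcription later in this thread.

```python
import unicodedata
from collections import Counter


def character_inventory(text):
    """Count every code point in the text, ignoring whitespace."""
    return Counter(ch for ch in text if not ch.isspace())


def describe(inventory):
    """Yield (character, code point, Unicode name, count), rarest first."""
    for ch, count in sorted(inventory.items(), key=lambda item: (item[1], item[0])):
        yield ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNNAMED"), count


sample = "ⲁⲩⲱ ϩⲛ̄ ⲟⲩ ⲡⲗⲁⲛⲏ"
for row in describe(character_inventory(sample)):
    print(*row)
```

Characters that appear only once or twice in the whole corpus are worth a second look: they are either typos in the text or rare glyphs that the model will see too seldom to learn.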


ShreeDevi

ShreeDevi Kumar

May 30, 2018, 6:47:05 AM
to Ramast Magdy, tesser...@googlegroups.com
> The current one uses Tesseract 3

Tesseract 3.0x has different traineddata formats depending on the version used (3.02 vs. 3.04, etc.).

ShreeDevi

Ramast Magdy

May 30, 2018, 6:57:24 AM
to ShreeDevi Kumar, tesser...@googlegroups.com
> 1. Collect UTF-8 text in Coptic.
Done.

> 2. Find Coptic Unicode fonts. If you can find one similar to the typewriter font used in the books, it will make training easier.
I tried but couldn't find such a font. There are not that many Coptic fonts to begin with.
Can't I just extract a few samples of each letter from the old books?

> 3. Train a model with these, and then fine-tune it with line images and matching ground truth.
I think I got this one.
After extracting the sample letters, I arrange them randomly into separate lines (one image per line) and provide the text in a file with a matching name.

That's a good idea, but since I am training it to read old books, how can I account for things like a slight page tilt during scanning?
Also, while at it, is there a tool I could use to split book pages into separate lines, so that I can use them as part of the training (along with their text, of course)?

ShreeDevi Kumar

May 30, 2018, 7:01:05 AM
to Ramast Magdy, tesser...@googlegroups.com

You can use the utilities listed there for creating line-level images from page images. Make matching ground-truth text files, and train.
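
To give an idea of what such line-splitting utilities do at their core, here is a simplified sketch based on a horizontal projection profile. It is not the code of any particular tool. It assumes the page has already been binarized into rows of 0 (background) and 1 (ink) and is roughly level, so a tilted scan would need to be deskewed first. The function name and the `min_ink` and `min_height` parameters are invented for this example.

```python
def split_lines(page, min_ink=1, min_height=2):
    """Return (top, bottom) row ranges of text lines in a binarized page.

    `page` is a list of rows, each a list of 0 (background) / 1 (ink).
    `bottom` is exclusive, so page[top:bottom] is the cropped line.
    """
    bands, start = [], None
    for y, row in enumerate(page):
        has_ink = sum(row) >= min_ink
        if has_ink and start is None:
            start = y                      # a text line begins
        elif not has_ink and start is not None:
            if y - start >= min_height:    # drop specks thinner than a line
                bands.append((start, y))
            start = None
    # Close a line that runs to the bottom edge of the page.
    if start is not None and len(page) - start >= min_height:
        bands.append((start, len(page)))
    return bands
```

Each returned band can be cropped out and saved as one line image, with the matching line of the transcription saved next to it. Real tools also deal with noise, skew, and touching lines, which this sketch ignores.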

ShreeDevi

Ramast Magdy

May 30, 2018, 7:02:03 AM
to ShreeDevi Kumar, tesser...@googlegroups.com
Perfect, that is really helpful.
Hope you are having an awesome day :)

ShreeDevi Kumar

May 31, 2018, 12:43:41 AM
to Ramast Magdy, tesser...@googlegroups.com
I am attaching the recognition result for the one page image you gave, produced by the test model for Coptic that I have built. If you can send me the correct Unicode transcription for that page, I can fine-tune it further. You can then modify it further as per your needs.
coptic-cop-plus.txt
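
Once a corrected transcription exists, one common way to put a number on how far the recognition result is from it is the character error rate. The sketch below is a plain Levenshtein implementation written for this example, not part of Tesseract. It compares code points, so a letter and its combining overline count as two characters.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, by code point."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(recognized, truth):
    """Edit distance divided by the length of the ground truth."""
    if not truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(recognized, truth) / len(truth)
```

Computing this per line, before and after fine-tuning, shows whether the extra training actually helped on the book pages.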

Ramast Magdy

May 31, 2018, 8:13:41 AM
to ShreeDevi Kumar, tesser...@googlegroups.com
Impressive! I thought we would need to do a lot of work in order to reach that stage.

The "??" in the text corresponds to a character unknown to me; I also can't find it among the available Unicode characters.
It is certainly not part of the text; it is probably an indicator of a new chapter.
Maybe we could use the paragraph sign § for it?
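
If the § stand-in is adopted, the substitution can be applied to every ground-truth line in one pass. This is a small sketch under that assumption; the `clean_line` name is made up, and the NFC normalization step is an addition of this example, included only so that combining marks are stored consistently across files.

```python
import unicodedata

PLACEHOLDER = "??"   # how the unknown mark appears in the transcription
STAND_IN = "\u00a7"  # SECTION SIGN, the substitute suggested above


def clean_line(line):
    """Swap the placeholder for the stand-in and normalize the line to NFC."""
    return unicodedata.normalize("NFC", line.replace(PLACEHOLDER, STAND_IN))
```

Whatever symbol is chosen, it has to be used the same way in every transcription, otherwise the model sees conflicting labels for the same mark.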

This is very exciting. How do we access the work you have done so far and add to it?
Thanks a lot.


ⲁⲩⲱ ⲟⲛ ⲁⲓ̈ⲧⲣⲉⲩ ⲣ̄ ⲥⲟⲟⲩ ⲛ̄ ⲉⲃⲟⲧ ⲉⲩⲕⲏⲧ ⲉ ϩⲃⲟⲩⲣ
ⲉⲩⲉⲓⲣⲉ ⲛ̄ ⲛⲉ ϩⲃⲏⲩⲉ ⲛ̄ ⲛⲉⲩⲁⲡⲟⲧⲉⲗⲉⲥⲙⲁ ⲙⲛ̄ ⲛⲉⲩ–
ⲥⲭⲏⲙⲁ ⲧⲏⲣⲟⲩ· ϫⲉ ⲕⲁⲥ ϩⲛ̄ ⲟⲩ ϩⲃⲁ ⲉⲩⲉⲣ̄ ϩⲃⲁ·
ⲁⲩⲱ ϩⲛ̄ ⲟⲩ ⲡⲗⲁⲛⲏ ⲉⲩⲉⲡⲗⲁⲛⲁ ⲛ̄ϭⲓ ⲛ ⲁⲣⲭⲱ̄ ⲉⲧ
ϣⲟⲟⲡ ϩⲛ̄ ⲛ ⲁⲓⲱ̄ ⲁⲩⲱ ϩⲛ̄ ⲛⲉⲩⲥⲫⲁⲓⲣⲁ ⲁⲩⲱ ϩⲛ̄  5
ⲛⲉⲩⲙ̄ⲡⲏⲩⲉ· ⲁⲩⲱ ϩⲛ̄ ⲛⲉⲩⲧⲟⲡⲟⲥ ⲧⲏⲣⲟⲩ· ϫⲉ ⲕⲁⲥ ⲛ̄
ⲛⲉⲩⲛⲟⲓ̈ ⲛ̄ ⲧⲉⲩϭⲓⲛⲙⲟⲟϣⲉ ⲙ̄ⲙⲓⲛ ⲙ̄ⲙⲟ–
?? ⲟⲩ: ⲁⲥϣⲱⲡⲉ ϭⲉ ⲛ̄ⲧⲉⲣⲉ ⲓ̄ⲥ̄ ⲟⲩⲱ ⲉϥϫⲱ ⲛ̄
ⲡⲉⲓ̈ ϣⲁϫⲉ ⲉⲣⲉ ⲫⲓⲗⲓⲡⲡⲟⲥ ϩⲙⲟⲟⲥ ⲉϥⲥϩⲁⲓ̈ ⲛ̄ ϣⲁϫⲉ
ⲗ̄ⲁ̄ ⲁ. ⲛⲓⲙ ⲉⲧ ⲉⲣⲉ ⲓ̄ⲥ̄ ϫⲱ ⲙ̄ⲙⲟⲟⲩ; ⲁⲥϣⲱⲡⲉ ϭⲉ ⲙⲛ̄ⲛ̄ⲥⲁ 10




Ramast

Jun 1, 2018, 1:15:51 PM
to ShreeDevi Kumar, tesser...@googlegroups.com
I am so sorry for the late reply. I sent it yesterday, but for some reason it was still in my drafts folder.
The original email is the one above, with the transcription.

shree

Jun 1, 2018, 5:19:41 PM
to tesseract-ocr
Please see https://github.com/Shreeshrii/tessdata_coptic for the traineddata files.