Tesseract training for New font/language

Ali Abedian

Mar 31, 2023, 1:03:05 AM

to tesseract-ocr

Hey everyone! I'm currently working on a personal project where I'm training a new font for the English language using Tesseract. The font is called Aurebesh and it's from the Star Wars universe. Basically, each letter in Aurebesh corresponds to a letter in English. I've collected close to 100,000 images and their corresponding translations, but I'm not sure how many iterations I should run for a file of this size. I've tried training with only 100 images, but it didn't work out. Can anyone advise me on how many iterations I should run and whether it's even possible to train a new font like this?

Zdenko Podobny

Apr 1, 2023, 3:05:36 AM

to tesser...@googlegroups.com

Please have a look at https://github.com/tesseract-ocr/tesstrain (especially https://github.com/tesseract-ocr/tesstrain/blob/main/ocrd-testset.zip)

Zdenko
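
As a rough sanity check before training, the sketch below verifies that every line image has a matching transcription. It assumes the tesstrain convention of pairing each image (`.tif` or `.png`) with a `.gt.txt` file of the same base name in one directory; the function name `check_ground_truth` is illustrative and not part of tesstrain.

```python
from pathlib import Path

# Assumed tesstrain-style naming: "line1.tif" pairs with "line1.gt.txt".
IMAGE_SUFFIXES = {".tif", ".png"}
TEXT_SUFFIX = ".gt.txt"


def check_ground_truth(directory):
    """Return (images without text, texts without image, empty texts) as sorted name lists."""
    root = Path(directory)
    images = set()
    texts = {}
    for path in root.iterdir():
        if path.name.endswith(TEXT_SUFFIX):
            # Strip the full ".gt.txt" suffix to get the base name.
            texts[path.name[: -len(TEXT_SUFFIX)]] = path
        elif path.suffix.lower() in IMAGE_SUFFIXES:
            images.add(path.stem)
    missing_text = sorted(images - set(texts))
    missing_image = sorted(set(texts) - images)
    # A transcription with only whitespace gives the trainer nothing to learn from.
    empty_text = sorted(
        name for name, path in texts.items()
        if not path.read_text(encoding="utf-8").strip()
    )
    return missing_text, missing_image, empty_text
```

Calling `check_ground_truth("data/aurebesh-ground-truth")` (a hypothetical directory name) should return three empty lists if the data set is consistently paired.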

Ali Abedian

Apr 1, 2023, 10:47:46 AM

to tesseract-ocr

Hello,

Thank you for providing the references, but I'm still a bit confused. I trained Tesseract using the method described in https://github.com/tesseract-ocr/tesstrain/blob/main/ocrd-testset.zip, with 100,000 sentences and a maximum of 10,000 iterations. However, it still cannot recognize a 6-letter word from a TIF file that uses the same font and settings. I have tried fewer iterations (1,000) as well as more (20,000 and 100,000), but the result is the same. Additionally, the BCER (Character Error Rate) barely changes with more iterations, remaining at 3.56%. I'm unsure what I'm doing wrong or what to try next, so any help would be appreciated.

Thank you.
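
To put the iteration counts above in perspective, here is a small back-of-the-envelope sketch. It assumes that one training iteration processes one line image and that 90% of the lines go to the training split (a `ratio_train` of 0.9); both are assumptions about the setup, not facts stated in this thread, and the helper names are illustrative.

```python
import math


def epochs_covered(iterations, total_lines, ratio_train=0.9):
    """Approximate number of passes over the training split (assumes one line per iteration)."""
    train_lines = round(total_lines * ratio_train)
    if train_lines <= 0:
        raise ValueError("training split is empty")
    return iterations / train_lines


def iterations_for_epochs(epochs, total_lines, ratio_train=0.9):
    """Iterations needed to see every training line `epochs` times, under the same assumption."""
    train_lines = round(total_lines * ratio_train)
    return math.ceil(epochs * train_lines)


if __name__ == "__main__":
    # Iteration counts mentioned in the post, against 100,000 lines.
    for n in (1_000, 10_000, 20_000, 100_000):
        print(f"{n:>7} iterations ~ {epochs_covered(n, 100_000):.2f} epochs")
```

Under these assumptions, 10,000 iterations cover only about a ninth of the training lines once, and even 100,000 iterations amount to little more than a single pass.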

Shree Devi Kumar

Apr 1, 2023, 10:54:30 AM

to tesseract-ocr

Aurebesh seems to be different symbols mapped to the English alphabet rather than a new font for English, hence training would need to be for a new language rather than just fine-tuning.
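
The difference between the two approaches can be sketched as a tesstrain invocation. The helper below only builds the `make training` command line; it assumes the tesstrain Makefile variables `MODEL_NAME`, `START_MODEL`, `MAX_ITERATIONS` and `GROUND_TRUTH_DIR`, and that leaving `START_MODEL` unset trains from scratch. The model name `aurebesh` and the function name are hypothetical.

```python
def tesstrain_command(model_name, max_iterations, start_model=None, ground_truth_dir=None):
    """Build the argument list for a tesstrain `make training` run (not executed here)."""
    cmd = [
        "make",
        "training",
        f"MODEL_NAME={model_name}",
        f"MAX_ITERATIONS={max_iterations}",
    ]
    if start_model:
        # Fine-tuning: continue from an existing model such as "eng".
        cmd.append(f"START_MODEL={start_model}")
    if ground_truth_dir:
        cmd.append(f"GROUND_TRUTH_DIR={ground_truth_dir}")
    return cmd


# New language: no START_MODEL, so the network is assumed to train from scratch.
from_scratch = tesstrain_command("aurebesh", 100000)

# Fine-tuning an existing English model, for comparison.
fine_tune = tesstrain_command("aurebesh", 10000, start_model="eng")
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from inside a tesstrain checkout. Training from scratch generally needs far more iterations than fine-tuning, so the counts shown are placeholders.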

Ali Abedian

Apr 1, 2023, 10:56:58 AM

to tesseract-ocr

Is it best to train a new language?

Fish Money

Oct 2, 2023, 6:43:00 AM

to tesseract-ocr

Please share a sample of the image you're trying to recognize.
