Tesseract training with Custom Dataset

Ishak DÖLEK

Apr 18, 2025, 4:12:21 AM
to tesser...@googlegroups.com, tesser...@googlegroups.com
Hello,

I am writing to inquire about the possibility of training a Tesseract model using my custom dataset. This dataset consists of Arabic image lines paired with corresponding Latin-based text lines.

Specifically, I have the following questions:

Is it possible to train Tesseract with a dataset where the images contain right-to-left (RTL) Arabic script and the corresponding text lines are left-to-right (LTR) Latin-based text? I am sharing the attached example.

If training with such a dataset is possible, are there any specific documents or tutorials available that outline the process? Any guidance on how to structure the training data and the training commands would be greatly appreciated.

Thank you for your time and assistance. I look forward to your guidance on this matter.



make LANG_TYPE=RTL MODEL_NAME=ara GROUND_TRUTH_DIR=data/ara-ground-truth PSM=13 TESSDATA=/tessdata EPOCHS=20 training


Sincerely,
Ishak Dölek

--
example-01.gt.txt
example-01.png

TheComplete BookOfMormon

Apr 20, 2025, 12:28:37 PM
to tesser...@googlegroups.com
Yes, you can. This video is very good.

You should use the most recent ara.traineddata file from the Tesseract "best" repository (tessdata_best) as your starting point.
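
In case it helps, here is a rough Python sketch of how I would check the ground-truth folder and assemble the tesstrain call. It assumes the usual tesstrain layout where each line image sits next to a `.gt.txt` file with the same base name (like your example-01.png / example-01.gt.txt), and it adds tesstrain's `START_MODEL` variable so training continues from the existing model. The output name `ara_custom` and both helper functions are my own placeholders, not anything from tesstrain itself; the other variables are copied from your command.

```python
from pathlib import Path

# Assumption: line images are plain .png or .tif files.
IMAGE_SUFFIXES = (".png", ".tif")
TEXT_SUFFIX = ".gt.txt"


def find_unpaired(ground_truth_dir):
    """Return (images lacking a .gt.txt, .gt.txt files lacking an image)."""
    names = [p.name for p in Path(ground_truth_dir).iterdir() if p.is_file()]
    image_stems = {
        n[: -len(s)] for n in names for s in IMAGE_SUFFIXES if n.endswith(s)
    }
    text_stems = {n[: -len(TEXT_SUFFIX)] for n in names if n.endswith(TEXT_SUFFIX)}
    return sorted(image_stems - text_stems), sorted(text_stems - image_stems)


def build_make_command(model_name, start_model, ground_truth_dir, tessdata,
                       epochs, lang_type="RTL", psm=13):
    """Build the argument list for tesstrain's `make training` target."""
    return [
        "make",
        "training",
        f"MODEL_NAME={model_name}",      # name of the new model (hypothetical)
        f"START_MODEL={start_model}",    # existing model to fine-tune from
        f"LANG_TYPE={lang_type}",
        f"GROUND_TRUTH_DIR={ground_truth_dir}",
        f"PSM={psm}",
        f"TESSDATA={tessdata}",
        f"EPOCHS={epochs}",
    ]


if __name__ == "__main__":
    gt_dir = "data/ara-ground-truth"
    if Path(gt_dir).is_dir():
        no_text, no_image = find_unpaired(gt_dir)
        print("images without text:", no_text)
        print("text without images:", no_image)
    print(" ".join(build_make_command("ara_custom", "ara", gt_dir, "/tessdata", 20)))
```

Run the printed command from the tesstrain checkout once both lists come back empty. Whether `LANG_TYPE=RTL` is the right choice when the text lines are Latin is something I have not tested.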

I found that training it further actually made the results worse. The existing ara.traineddata file will probably already do what you need.
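
One way to check whether further training helps or hurts on your own data is to compare the character error rate (CER) of the stock model and the fine-tuned model on a few held-out lines. The sketch below is only an illustration: it assumes you already have the ground-truth text and each model's OCR output as plain strings, and the function names are mine.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute = 1 each)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def cer(pairs):
    """Character error rate over (reference, hypothesis) string pairs."""
    pairs = list(pairs)
    total_chars = sum(len(ref) for ref, _ in pairs)
    if total_chars == 0:
        return 0.0
    total_edits = sum(levenshtein(ref, hyp) for ref, hyp in pairs)
    return total_edits / total_chars
```

Compute `cer` once with the output of the stock model and once with the output of your trained model on the same lines; the lower value is the better model for that data.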

