Can't encode transcription
Samee Arif
Aug 16, 2021, 5:18:22 PM
to tesseract-ocr
Hello, I am trying to train from scratch or fine-tune Tesseract for the "Jameel Noori Nastaleeq" font for Urdu. These are the steps I followed for training from scratch:
1. Create a unicharset from all ground-truth files:
```
unicharset_extractor --output_unicharset file.unicharset --norm_mode 3 file
```
2. Create the starter traineddata using the above unicharset:
```
combine_lang_model --input_unicharset file.unicharset --script_dir "langdata/" --output_dir "output/" --lang JNUrd
```
3. Create a wordstrbox file for each image:
```
tesseract file1.png file1 --psm 6 wordstrbox
```
4. Manually correct the wordstrbox files using the ground truth.
5. Create an lstmf file from each PNG and its corresponding box file:
```
tesseract file.png file --psm 6 lstm.train
```
6. Create a list of the lstmf files to use for training:
```
ls -1 *.lstmf > mylang.trainingfiles_text
```
When I run the training step with this unicharset and these .lstmf files, I get this error:
```
Encoding of string failed! Failure bytes: ffffffd9 ffffff8a ffffffd9 ffffff94 ffffffdb ffffff92 20 ffffffd9 ffffff88 ffffffd8 ffffffb2 ffffffdb ffffff8c ffffffd8 ffffffb1 20 ffffffd8 ffffffae ffffffd8 ffffffa7 ffffffd8 ffffffb1 ffffffd8 ffffffac ffffffdb ffffff81 20 ffffffd8 ffffffb4 ffffffd8 ffffffa7 ffffffdb ffffff81 20 ffffffd9 ffffff85 ffffffd8 ffffffad ffffffd9 ffffff85 ffffffd9 ffffff88 ffffffd8 ffffffaf 20 ffffffd9 ffffff82 ffffffd8 ffffffb1 ffffffdb ffffff8c ffffffd8 ffffffb4 ffffffdb ffffff8c 20 ffffffd9 ffffff86 ffffffdb ffffff92 20 ffffffd8 ffffffa8 ffffffd8 ffffffaa ffffffd8 ffffffa7 ffffffdb ffffff8c ffffffd8 ffffffa7 20 ffffffda ffffffa9 ffffffdb ffffff81 20 ffffffd9 ffffff85 ffffffd9 ffffff84 ffffffd8 ffffffa7 ffffffd9 ffffff82 ffffffd8 ffffffa7 ffffffd8 ffffffaa
Can't encode transcription: 'بعد نجی ٹی وی سے گفتگو کرتے ہوئے وزیر خارجہ شاہ محمود قریشی نے بتایا کہ ملاقات' in language ''
```
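The "Failure bytes" in that message can be turned back into readable code points, which shows where in the line the encoder gave up. The following is a minimal sketch, assuming each token is one UTF-8 byte printed as a sign-extended integer (so `ffffffd9` stands for `0xd9`); the function names are made up for illustration and are not part of Tesseract.

```python
import unicodedata


def decode_failure_bytes(dump):
    """Turn a 'Failure bytes:' hex dump back into text.

    Assumption: every token is a single byte printed as a sign-extended
    int, so only the lowest 8 bits carry information.
    """
    raw = bytes(int(token, 16) & 0xFF for token in dump.split())
    return raw.decode("utf-8", errors="replace")


def describe(text):
    """List each code point with its Unicode name."""
    return [
        "U+{:04X} {}".format(ord(ch), unicodedata.name(ch, "UNKNOWN"))
        for ch in text
    ]


# First few tokens of the dump in the error message above.
sample = "ffffffd9 ffffff8a ffffffd9 ffffff94 ffffffdb ffffff92 20"
for line in describe(decode_failure_bytes(sample)):
    print(line)
```

Run on the start of the dump, this lists ARABIC LETTER YEH (U+064A), ARABIC HAMZA ABOVE (U+0654), ARABIC LETTER YEH BARREE (U+06D2) and a space, so the failing span seems to begin inside a word rather than at the start of the line. Those code points are the first ones worth looking up in the unicharset.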
I have tried normalizing the text using the normalize.py file.
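A quick way to check whether normalization covered everything is to compare the code points in a transcription with the symbols listed in the unicharset. This is only a sketch under stated assumptions: the unicharset is taken to be a text file whose first line is an entry count, with each later line starting with the symbol followed by a space, and with `NULL` standing for the space character. It also assumes one code point per entry, which may not hold for other `--norm_mode` values. The helper names and the tiny sample unicharset are hypothetical.

```python
def unicharset_symbols(unicharset_text):
    """Collect the first field of every entry line (assumed layout)."""
    symbols = set()
    for line in unicharset_text.split("\n")[1:]:  # skip the count line
        if not line.strip():
            continue
        symbol = line.split(" ", 1)[0]
        # Assumption: the space character is stored as the word NULL.
        symbols.add(" " if symbol == "NULL" else symbol)
    return symbols


def missing_chars(transcription, symbols):
    """Return the code points of the transcription that have no entry."""
    return sorted({ch for ch in transcription if ch not in symbols})


# Made-up unicharset with simplified property columns, for illustration only.
sample_unicharset = "3\nNULL 0\n\u0648 1\n\u0632 1\n"
word = "\u0648\u0632\u06cc\u0631"  # the word after the space in the failure bytes

for ch in missing_chars(word, unicharset_symbols(sample_unicharset)):
    print("U+{:04X}".format(ord(ch)))
```

With real data, the unicharset text would come from reading `file.unicharset`, and the transcription from the ground-truth line quoted in the error. Any code point this reports is a candidate cause for the encoding failure.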
红狮子
Oct 21, 2021, 12:40:38 PM
to tesseract-ocr
Which version do you use?