Hi,
My use case is Arabic documents. The pretrained ara.traineddata is good but not perfect, so I would like to fine-tune ara.traineddata and, if the results are still not satisfactory, train my own custom model.
Please advise on the following.
Below are my trials:
For Arabic Numbers:
-> I tried custom training on Arabic numbers only.
-> I wrote a script that writes 100,000 numbers into multiple gt.txt files, with hundreds of characters in each gt.txt file.
-> Then I used another script to convert the text to images (text2image); the images should look like scanned pages.
-> The parameters I used, in this order:
text2image --text test.gt.txt --outputbase /home/user/output --fonts_dir /usr/share/fonts/truetype/msttcorefonts/ --font 'Arial' --degrade_image false --rotate_image --exposure 2 --resolution 300
If possible, please guide me through the procedure for preparing datasets.
For testing, I tried 50,000 English numbers, with each number in its own gt.txt file (e.g. "2500" written to 2500.gt.txt), and 20,000 iterations, but it fails.
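The one-number-per-file layout described above can be sketched in plain shell. This is only an illustration: out_dir, first and last are placeholder values that do not come from the thread, and the zero-padded file names simply mirror the `seq -f "%06g"` naming used in the text2image loop further down. Each file holds a single line, which is the form tesstrain expects for a .gt.txt transcription.

```shell
#!/bin/sh
# Sketch: write one ground-truth file per number, e.g. "2500" into 002500.gt.txt.
# out_dir, first and last are placeholder values, not taken from the thread.
out_dir=$(mktemp -d)
first=2498
last=2502

n=$first
while [ "$n" -le "$last" ]; do
  name=$(printf '%06d' "$n")                     # zero-padded base name
  printf '%s\n' "$n" > "$out_dir/$name.gt.txt"   # single-line transcription
  n=$((n + 1))
done

ls "$out_dir"
```

Each generated file can then be rendered with text2image, or paired with a real scanned line image that has the same base name.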
For Arabic Text:
-> Prepared around 23k gt.txt files, each containing one sentence.
-> Generated .box files and small .tif files for all gt.txt files using one font (Traditional Arabic).
-> Used the tesstrain repository and trained for 20,000 iterations.
-> Training produced foo.traineddata with a 0.03 error rate.
-> Ran prediction on real data. It works perfectly for the particular character on which the pretrained model (ara.traineddata) fails, but in overall accuracy the pretrained model (ara.traineddata) performs better, except for that one character.
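Since the stated goal is to fine-tune ara.traineddata, here is a hedged sketch of what such a tesstrain invocation might look like. It assumes the START_MODEL, TESSDATA and MAX_ITERATIONS variables documented in the tesstrain README; the TESSDATA path is a placeholder, and the script only prints the command (a dry run) rather than running the training.

```shell
#!/bin/sh
# Dry run: build and print the tesstrain command instead of executing it.
MODEL_NAME=foo                    # output model name used in the post
START_MODEL=ara                   # continue from the pretrained Arabic model
TESSDATA=/path/to/tessdata_best   # placeholder: directory containing ara.traineddata
MAX_ITERATIONS=20000              # iteration count used in the post

cmd="make training MODEL_NAME=$MODEL_NAME START_MODEL=$START_MODEL TESSDATA=$TESSDATA MAX_ITERATIONS=$MAX_ITERATIONS"
echo "$cmd"
```

The printed command would be run from the tesstrain checkout, with the line images and .gt.txt files placed in data/foo-ground-truth. Starting from the existing model rather than from scratch may help keep the overall accuracy of ara.traineddata while correcting the problem character, though that is not guaranteed.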
Summary:
GitHub link used for custom training Arabic text and numbers: https://github.com/tesseract-ocr/tesstrain
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/09cff705-838f-4ccb-b6e9-06326fea1cdbo%40googlegroups.com.
# Render each ground-truth file (006601 to 006798) to an image with text2image.
for i in $(seq -f "%06g" 006601 006798)
do
  echo "$i"
  text2image --xsize 3600 --ysize 300 --text "$i.gt.txt" --outputbase "/home/user/Desktop/$i" --font 'Traditional Arabic' --fonts_dir /home/user/.local/share/fonts/
done
The letter "لا" is always predicted as "ال".
Not sure how relevant this is in the context of training models, but لا is not a letter! It is a ligature ("Arabic Ligature Lam with Alef") formed by combining ل ("Arabic Letter Lam") with ا ("Arabic Letter Alef"), whereas ال is ا followed by ل (so, the exact opposite order; no ligature). Both are incredibly common in Arabic texts, and although I have no clue about machine learning, I'm surprised that the training could miss the difference between them.
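To make the difference concrete, the two strings can be compared at the byte level. The sketch below writes the UTF-8 encodings of Lam (U+0644) and Alef (U+0627) as octal escapes, so it does not depend on how a terminal or editor renders right-to-left text; hex_of is a helper name made up for this illustration.

```shell
#!/bin/sh
# UTF-8 bytes as octal escapes for printf.
lam='\331\204'    # U+0644 ARABIC LETTER LAM  (UTF-8: d9 84)
alef='\330\247'   # U+0627 ARABIC LETTER ALEF (UTF-8: d8 a7)

# hex_of (hypothetical helper): print a printf-style string as lowercase hex bytes.
hex_of() {
  printf "$1" | od -An -tx1 | tr -d ' \n'
}

lam_alef=$(hex_of "${lam}${alef}")   # Lam then Alef: drawn as the ligature
alef_lam=$(hex_of "${alef}${lam}")   # Alef then Lam: no ligature

echo "lam+alef: $lam_alef"
echo "alef+lam: $alef_lam"
```

Both strings contain the same two code points and differ only in their order, so running the same check on the OCR output and on the ground truth shows whether the two characters really come out swapped.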
See https://github.com/tesseract-ocr/tesseract/issues/758 and other similar issues
On Sun, Jul 12, 2020 at 6:52 PM Shree Devi Kumar <shree...@gmail.com> wrote:
@Eliyaz What version of tesseract are you using? Which traineddata?
> Always the letter "لا" is predicted as "ال".
I think this was fixed by Ray Smith in 2017 and should be OK in the traineddata files in the tessdata_fast and tessdata_best repos.
On Sun, Jul 12, 2020 at 6:45 PM Rainer Verteidiger <materialde...@gmail.com> wrote: