Persian in tesseract-ocr

Ali Tabasi

Jul 5, 2015, 8:44:10 AM
to tesser...@googlegroups.com
Hi,
I am working on a project that detects license plates on cars, and I need to read text from images.
I am using Tesseract and I need the Persian language.
I am using per.traineddata, but I have several problems with it. I think this file is broken.
Could someone please fix it and send it to my email:
ali....@yahoo.com
Download link:
https://drive.google.com/file/d/0B4_07XmFOBleMF8taUFUaEd2S2c/view?usp=sharing

Jeff Breidenbach

Jul 17, 2015, 11:44:07 PM
to tesser...@googlegroups.com
I think 'fas' is the language code for Persian.
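
For anyone trying this, here is a minimal sketch of calling Tesseract with the 'fas' code from Python. It assumes the standard command-line form `tesseract <image> <outputbase> -l <lang>`; the function names and the file names `plate.png` and `plate_out` are placeholders for illustration, not anything from this thread.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, lang="fas"):
    """Return the argument list for: tesseract <image> <output_base> -l <lang>."""
    return ["tesseract", image_path, output_base, "-l", lang]


def run_ocr(image_path, output_base, lang="fas"):
    """Run Tesseract if it is on PATH; the text is written to <output_base>.txt."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    # check=True raises CalledProcessError if Tesseract exits with an error,
    # for example when fas.traineddata is missing from the tessdata directory.
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    return output_base + ".txt"
```

Calling `run_ocr("plate.png", "plate_out")` would then leave the recognized text in `plate_out.txt`, provided `fas.traineddata` is installed.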

Hossein Razizadeh

Aug 16, 2015, 9:47:15 AM
to tesseract-ocr
It seems 'fas' is the code for Persian, but there are no cube files for it, which results in poor output. The Arabic language files work much better on Persian images. There is another folder, 'per', for Persian, but it doesn't even contain a '.traineddata' file. Does anyone know whether Google Docs uses Tesseract as its OCR engine? Google Docs performs OCR on Persian images with good accuracy!
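
A quick way to confirm which language codes actually have a '.traineddata' file is to list the tessdata directory. The following is only a sketch: the helper names are made up for illustration, and the location of the tessdata directory depends on the installation.

```python
import os

SUFFIX = ".traineddata"


def languages_from_filenames(filenames):
    """Return the sorted language codes for names of the form <code>.traineddata."""
    return sorted(
        name[: -len(SUFFIX)]
        for name in filenames
        if name.endswith(SUFFIX) and len(name) > len(SUFFIX)
    )


def installed_languages(tessdata_dir):
    """List the language codes that have a traineddata file in tessdata_dir."""
    return languages_from_filenames(os.listdir(tessdata_dir))


def missing_languages(filenames, wanted):
    """Return the codes in `wanted` that have no matching traineddata file."""
    have = set(languages_from_filenames(filenames))
    return [code for code in wanted if code not in have]
```

With a directory that holds `fas.traineddata` and `ara.traineddata` plus a bare `per` folder, `missing_languages` reports `per` as the only code without data, which matches the situation described above.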

ShreeDevi Kumar

Aug 17, 2015, 12:08:38 AM
to tesser...@googlegroups.com, Ray Smith
Ray was looking for comparative feedback regarding the new traineddata for RTL languages, so this will be useful.

As far as I know, Google Docs does not use the Tesseract OCR engine to recognize text. Its OCR accuracy is also better than Tesseract's for some Indian languages. However, it doesn't seem to handle TIFFs, and it processes only the first 10 pages of a PDF.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/edd64e28-9e52-4b44-80cc-0aaa442caa85%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

zdenko podobny

Aug 17, 2015, 3:08:36 AM
to tesser...@googlegroups.com
On Mon, Aug 17, 2015 at 6:07 AM, ShreeDevi Kumar <shree...@gmail.com> wrote:
As far as I know, Google Docs does not use tesseract OCR engine for recognizing the text.

Interesting. Can you please clarify the source of your knowledge?
 

ShreeDevi Kumar

Aug 17, 2015, 7:55:23 AM
to tesser...@googlegroups.com

On Mon, Aug 17, 2015 at 6:07 AM, ShreeDevi Kumar <shree...@gmail.com> wrote:
Ray was looking for comparative feedback regarding the new traineddata for RTL languages, so this will be useful.


Another caveat worth noting is that I only tested a small fraction of these languages, maybe 25.
I suspect, for instance, that all the Arabic-based languages except ara don't work very well.
I would be interested in any more feedback on how bad it is in any of them, and will take suggestions into account for the next version after 3.04.



Hossein Razizadeh

Aug 18, 2015, 1:41:50 AM
to tesseract-ocr
I think the problem is the lack of cube files for Persian. Does anyone know how to add cube files so that Tesseract uses them? There is a 'fas' folder in 'langdata' that contains some cube-related data, but I don't know how to use it with Tesseract.
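
To see which languages currently have cube data installed, one option is to group the tessdata file names by language. This sketch assumes that cube data sits next to the traineddata as separate files named `<lang>.cube.<part>` (for example `ara.cube.lm`), as it appears to for Arabic in 3.x releases; the function name is hypothetical, and the sketch only reports what is present, it does not make Tesseract use the 'langdata' files.

```python
def cube_files_by_language(filenames):
    """Group names of the form '<lang>.cube.<part>' by language code."""
    groups = {}
    for name in filenames:
        # partition() returns an empty separator when '.cube.' is absent.
        lang, sep, part = name.partition(".cube.")
        if sep and lang and part:
            groups.setdefault(lang, []).append(name)
    return {lang: sorted(names) for lang, names in groups.items()}
```

If 'fas' is absent from the result while 'ara' is present, that would be consistent with the missing cube files suspected here.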

buyi wen

Sep 17, 2015, 11:22:22 PM
to tesseract-ocr
If you like Tesseract OCR, you may like this free online OCR tool, which uses Tesseract OCR 3.02.
