How to use Tesseract Arabic OCR.

RJ

Mar 21, 2016, 4:19:37 AM

to tesseract-ocr

Hello All,

I am using Tesseract 3.02 for Arabic and running it from the command line to read the image:

tesseract.exe "D:\Peace.png" D:\output.txt -l ara -psm 7

But the output I get ( النللا ثم ) does not match the text in the input image. Is any additional configuration required?

Thanks in advance

RJ

Younes REGAIEG

Apr 21, 2016, 6:58:32 AM

to tesseract-ocr

Hello there,

I am considering setting up Tesseract as an OCR server for Arabic script. Did you have any luck configuring or training your instance, or is it just not production-ready yet?

Regards,

Abdulbaki Aybakan

Sep 13, 2017, 11:03:07 AM

to tesseract-ocr

Did you make any improvements to it? I want to use it for the Ottoman language, which uses the Arabic alphabet.

ShreeDevi Kumar

Sep 13, 2017, 11:47:01 AM

to tesser...@googlegroups.com

Try Tesseract 4.00.00alpha, built from source from GitHub, with tessdata/best/Arabic.traineddata.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
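
A minimal sketch of how the 3.02 command from the first post and a 4.x command might differ, written as a Python helper that only builds the argument list and does not run anything. Assumptions, not taken from the thread: 4.x spells the page-segmentation option `--psm` where 3.02 used `-psm`; the downloaded Arabic.traineddata sits in a directory passed via `--tessdata-dir`; `build_command` and the `D:\tessdata_best` path are hypothetical names. Mode 7 treats the image as a single text line, and Tesseract appends `.txt` to the output base itself.

```python
def build_command(image, output_base, lang, psm, major_version, tessdata_dir=None):
    """Build a tesseract argument list without running it.

    Assumption: 3.x takes "-psm" and 4.x takes "--psm".
    """
    psm_flag = "--psm" if major_version >= 4 else "-psm"
    cmd = ["tesseract", image, output_base]
    if tessdata_dir is not None:
        cmd += ["--tessdata-dir", tessdata_dir]
    cmd += ["-l", lang, psm_flag, str(psm)]
    return cmd


# Output base is given without an extension because tesseract adds ".txt".
old = build_command(r"D:\Peace.png", r"D:\output", "ara", 7, major_version=3)
# Hypothetical folder holding Arabic.traineddata.
new = build_command(r"D:\Peace.png", r"D:\output", "Arabic", 7,
                    major_version=4, tessdata_dir=r"D:\tessdata_best")

print(" ".join(old))
print(" ".join(new))
```

To actually run either command, the list could be passed to `subprocess.run(cmd, check=True)` on a machine where the matching Tesseract version is installed.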

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/60176871-9c44-42a9-9135-9a7c3f92a8d2%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Ibr

Sep 26, 2017, 4:09:46 AM

to tesseract-ocr

Hi. As Shree advised, use Tesseract 4 alpha to recognize Arabic writing. If you want to use it for Ottoman text, though, you have to consider two things.

First, if the font is uncommon, you will need to tune the Arabic model (ara.traineddata) for that font. It takes several steps, and I can walk you through them.

Second, does Ottoman text follow the same rules as Arabic text? As you already know, letters in the Arabic writing system take different shapes depending on their position in the word. Is that the same in Ottoman? I know Arabic, but unfortunately I don't know the Ottoman writing system. If you have any questions, don't hesitate to ask.

Dan9er

Oct 9, 2017, 11:00:53 AM

to tesseract-ocr

To quote the Tesseract Wiki:

Don't try to train Tesseract versions earlier than 4.0 for Arabic (same for Persian, Urdu, etc.). It's hopeless. For 4.0 only train with the LSTM method.

Serkan Taş

Dec 14, 2019, 5:49:34 AM

to tesseract-ocr

Hi Ibr,

Are you still working on this subject?

Ibr

Dec 16, 2019, 2:56:03 AM

to tesseract-ocr

Hi Serkan

Actually, I haven't worked on this subject for a long time, so I'm not very up to date on Tesseract. But if you have any questions, share them; if I know the answer, I will help you.

Serkan Taş

Dec 16, 2019, 3:13:21 PM

to tesser...@googlegroups.com

Hi Ibr,

First, I was looking into machine-learning approaches to understand whether the subject is worth working on, and while searching the list for Ottoman text transliteration I found your discussion.

I am just trying to understand whether I am in the right place for my subject.

The language is originally Turkish, but the alphabet is a kind of mixture of the Arabic and Farsi alphabets, and the vocabulary contains words from these two languages and from some others, such as English and, mostly, French.

Here are the steps:

Step 1: Use OCR or some other ML algorithm to retrieve text from images written in various styles, including handwriting, with historical roots going back over 600 years.

Step 2: Transliterate the successfully extracted text into the Latin alphabet.

Step 3: Because of the age of the language, use some algorithm to modernize it into everyday modern Turkish.

Do you think Tesseract is the right tool for these purposes?

If yes, how?

What should the iterations be, and where can I find guides?

Thanks in advance,

Serkan

Please find some sample documents attached.


ödev.JPG

sample01.jpg

Ibr

Dec 17, 2019, 4:04:26 AM

to tesseract-ocr

Hi Serkan,

Tesseract works roughly like this. Each language or writing system has a model that the engine relies on to recognize the characters in an image. I guess it depends on something called stroke width transformation, which essentially detects shapes. While scanning an image, when Tesseract detects a shape (a letter) that it already knows, it assigns the letter with the matching shape and writes it to the output text, then moves on to the next shape, and so on. Every language has its own model (in ML, a model is like the brain that decides the result based on the input).

Why am I telling you all of this? To give you an idea of how it works, and to let you know that you can't be certain about the results: even with high accuracy you may still get some errors. That is how machine learning works in general, and it is why people usually train the model to improve its accuracy.

About the Ottoman writing system: you said "The language is Turkish originally", but Tesseract doesn't care about the meaning of the text, only the shapes. You also said the "alphabet is some kind of mixture of Arabic and Farsi alphabet". I'm a native Arabic speaker, and I can read the first image you shared, "ödev", without knowing what it means (except for a few words I already know in Turkish). I can read Farsi as well. The problem is that the Farsi alphabet contains extra letters that don't exist in Arabic; they are very close but slightly different. For example, (چ) is the same as (ج) but with three dots. Both letters exist in Farsi, but only the second one exists in Arabic. So if you run the two letters through the Farsi model it will work fine, but with the Arabic model I think both will be recognized as the one-dot letter.

Arabic has 28 letters and Farsi has 32, I guess. So if the Ottoman alphabet contains letters from Farsi, the Arabic model won't be enough, since Farsi contains the Arabic letters plus some extra ones. If the Ottoman and Farsi alphabets are the same, the Farsi model (I think it's fas.traineddata) will surely work fine. But if some Ottoman letters don't exist in Farsi, those letters won't be recognized, or will be recognized wrongly.
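
To make the ج / چ distinction concrete: the two letters are separate Unicode characters, so an engine can only emit چ if that character is part of the model's character set (stated here as an assumption about how the models behave, not something verified against a specific traineddata file). A small Python check using only the standard library; the helper name `describe` is illustrative:

```python
import unicodedata


def describe(ch):
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# \u062c is the one-dot letter (Arabic and Farsi),
# \u0686 is the three-dot letter (Farsi, not in the 28-letter Arabic alphabet).
for letter in ("\u062c", "\u0686"):
    print(letter, describe(letter))
```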

About the font: I'm not sure which font is used in the two pictures, but the font in the first picture definitely exists in the Arabic model. Tesseract 4 (at least when I last used it, almost a year ago) contains, I think, 5000 Arabic fonts, which covers almost all fonts, so I don't think you would need any training on different fonts.

One last thing: when I used Tesseract it gave excellent results for Arabic, and for Japanese as well, on formal documents, but accuracy on handwritten documents was really low. I don't know whether that is still the case, but if it is, handwriting won't give good results. For example, I assure you the second image you shared, "sample01", won't be recognized even if you have an Ottoman model. I'm not sure about the first one. I think it would be recognized, but any word that contains a small gap because the document is old will come out split in two.

To be honest, you won't know for sure until you try it. Tesseract has been easy to use since version 4, especially because it is not necessary to train the model on new fonts. So in my opinion, open a question on this Google group or on GitHub asking whether an Ottoman model exists. Or, since you seem to know this material, decide whether the Farsi model will do and try it.

I hope this was helpful. I went into a lot of detail, but only to give you the full picture of what's going on so you can decide whether it fits, since I don't know enough about the Ottoman writing system. If you still have any questions, I'm here to help :)

teşekkürler :)

Ibrahim

Serkan Taş

Dec 17, 2019, 4:14:16 PM

to tesser...@googlegroups.com

Hi Ibrahim,

Thank you for this very detailed and descriptive reply. Here are my comments:

On how Tesseract works: I completely understand what you mean. I wonder whether the existing language models for Arabic and/or Farsi may be suitable for my case, or at least a starting point for a new language model.

On the alphabet: I know that the language is not important for Tesseract, but I wanted to give an idea of what the Ottoman language is like.

The Ottoman alphabet has 34 letters: 29 that mostly come from the Quran (http://www.osmanlicadersi.com/turkce-dersler/1-osmanlicada-harfler/#1-harfler) plus 5 additional letters (http://www.osmanlicadersi.com/turkce-dersler/1-osmanlicada-harfler/#2-farkli-harfler).

The Farsi model is not perfect, but it works to some extent. Please find attached the picture and the Tesseract OCR result produced with the Farsi model. I wonder whether the training set of the Arabic or the Farsi model can be modified and used to create an Ottoman language model, or whether I should collect training data for an Ottoman model from scratch.

On fonts: I guess the fonts used for the Arabic or Farsi model do not contain all 34 letters, and this may be a problem.

On handwriting: as far as I know, handwriting recognition is a separate phase of OCR. It is also very hard, but it may be getting easier with ML techniques such as deep learning. The hard part with Ottoman writing is that very few people can read it and even fewer can understand it. So if I can follow the steps correctly, it will be very important work to get these documents, which are mostly handwritten (though in formal styles, not all free handwriting), transliterated and translated into modern Turkish with less human interaction.

I think this is a good start.

Teşekkürler :)


altun_çiftlik.txt

altun_çiftlik.png

Ibr

Dec 18, 2019, 5:09:37 AM

to tesseract-ocr

Hi Serkan,

** "I wonder if the existing language models generated for Arabic and/or Farsi": Yes, there is one for Arabic and one for Farsi. The files are named lang-name.traineddata, such as ara.traineddata and eng.traineddata, and you can download them from GitHub. I tried the Arabic and Japanese models and they were really good, and the good thing is that the Tesseract developers keep improving the engine and the models. I'm sure the Farsi model is good as well: I once used it on a small Farsi text to answer a question on GitHub, and it gave a good result.

** "Ottoman alphabet has 34 letters": From the links you shared, some letters of the Ottoman alphabet, such as "Nef", don't exist in Arabic, and I think not in Farsi either. I can confirm that several letters are not in Arabic. I don't know about Farsi, but I think it doesn't contain all the Ottoman letters. So your best bet would be Farsi, though the extra letters will be recognized wrongly.
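
The comparison described here can be sketched as a set difference. The two letter sets below are assumptions for illustration, not taken from any traineddata file: the four letters Farsi adds to Arabic, and five candidate additional Ottoman letters, the last of which is the "Nef" mentioned above. An authoritative check would compare against the character set of the actual model, which this sketch does not read.

```python
import unicodedata

# Assumption: the letters Farsi adds to the 28 Arabic ones (peh, tcheh, jeh, gaf).
FARSI_EXTRA = {"\u067e", "\u0686", "\u0698", "\u06af"}
# Assumption: the five additional Ottoman letters; \u06ad is "Nef".
OTTOMAN_EXTRA = {"\u067e", "\u0686", "\u0698", "\u06af", "\u06ad"}

# Letters an Ottoman text may need that a Farsi-only alphabet does not cover.
missing = OTTOMAN_EXTRA - FARSI_EXTRA

for ch in sorted(missing):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

Under these assumptions only one letter is left uncovered, which matches the suggestion that Farsi is the closer starting point.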

** "Please find the picture and tesseract result attached": The image is in good condition, and I would put the accuracy of the result text at 90% or above. But notice that at the far right of the image the page is slightly bent, like the inner edge of a book, which affects accuracy for words such as "چنتلک" and "په ده". I believe Tesseract would detect them easily if they were in the middle of the page, for example. The problem with other words, like "اومایان", should be solved for Arabic: see the issue I opened earlier on GitHub, which was fixed in Tesseract version 4. I don't know whether it is fixed, or even exists, in the Farsi model.
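
An accuracy figure like the 90% estimate above can be measured rather than eyeballed when a hand-typed reference transcription is available. A minimal sketch, assuming accuracy is defined as one minus the character-level edit distance divided by the reference length (one common definition, not something this thread specifies); the function names are illustrative.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_accuracy(reference, ocr_output):
    """1 - (edit distance / reference length); can go below 0 for very bad output."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return 1.0 - levenshtein(reference, ocr_output) / len(reference)


# Placeholder strings; real use would compare the OCR text file with a typed reference.
print(f"{char_accuracy('tesseract', 'tesserakt'):.2%}")
```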

** "the ocr using Farsi model . I wonder if the training set for the model of any one from Arabic or Farsi may be modified and used to create Ottoman language model or should I work to collect training data for the ottoman language model from scratch ?": Here I'm just spitballing. As far as I know, every alphabet on a computer has a range of codes that represent it, as in ASCII, so any Ottoman letter that has no such code could not be written as output. Does the Ottoman writing system exist in Microsoft Word, for example? Training a model in Tesseract improves the engine's shape detection by introducing other fonts and other possible shapes; you can find this in the Tesseract articles under tuning or training. But I don't think adding letters to a model is available as an option on the user side. I recommend consulting the people who work on Tesseract about this, such as Shree or Ray Smith (I believe he is the one responsible for the Tesseract algorithm).

** "The fonts for the Arabic or Farsi model ...": In that paragraph I meant the typeface, like "Calibri" or "Arial", not the alphabet. In Arabic, as in English, some letters change shape depending on the font. It's all explained in the GitHub issue I mentioned above.

** "Afak, hand writing is separate phase of OCR which is also very hard but may be getting easier using ML technics like DL": It's probably doable. I'm not that well informed about machine learning, and I never used Tesseract on handwritten documents. I remember finding a question in this group from someone asking whether Tesseract is suitable for handwriting, and people answered "no, not really".

Serkan Taş

Dec 18, 2019, 1:04:58 PM

to tesser...@googlegroups.com

Hi Ibrahim,

You helped me so much, and I have new questions :)

1. Do I need a new model for Ottoman? What do you think?

2. From your comment I understand that a letter with no ASCII equivalent cannot be recognized and converted to text. Is that right?
If so, does that mean such letters can never be handled by OCR?


Shree Devi Kumar

Dec 19, 2019, 12:30:54 AM

to tesseract-ocr

You can try to finetune tessdata_best/script/Arabic.traineddata for Ottoman.

If you have line images and their groundtruth transcription, you can use makefile process from tesstrain.

See https://github.com/tesseract-ocr/tesstrain/issues/128
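
A minimal sketch for checking a ground-truth folder before training. It assumes the tesstrain convention of one line image plus a transcription with the same base name and the extension `.gt.txt`, and it handles only plain `.tif` and `.png` names; the function name and the demo folder contents are hypothetical.

```python
import tempfile
from pathlib import Path

# Assumption: only plain .tif/.png line images; names like x.bin.png are not handled.
IMAGE_SUFFIXES = {".tif", ".png"}


def find_unpaired(ground_truth_dir):
    """Return names of line images that lack a matching .gt.txt transcription."""
    unpaired = []
    for image in sorted(Path(ground_truth_dir).iterdir()):
        if image.suffix.lower() in IMAGE_SUFFIXES:
            transcript = image.with_suffix(".gt.txt")
            if not transcript.is_file():
                unpaired.append(image.name)
    return unpaired


# Demo on a throwaway folder: b.png has no transcription next to it.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "a.tif").touch()
    (folder / "a.gt.txt").write_text("placeholder line text", encoding="utf-8")
    (folder / "b.png").touch()
    print(find_unpaired(folder))
```

Once every image has its transcription, fine-tuning is started through the tesstrain Makefile; as far as I know the starting model is selected with a `START_MODEL` variable, but the tesstrain README is the place to confirm the current variable names.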

Tesseract recognizes images to Unicode code points (UTF8 text). If all required Ottoman characters do not have a Unicode codepoint, then you may have to assign some random letter instead.
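
To illustrate the point about Unicode output: ASCII covers only 128 characters and has no Arabic-script letters, whereas UTF-8 can encode any Unicode code point. A minimal Python sketch (the helper name `encodings` is illustrative):

```python
def encodings(ch):
    """Return (code point label, UTF-8 bytes, whether ASCII can encode it)."""
    try:
        ch.encode("ascii")
        ascii_ok = True
    except UnicodeEncodeError:
        ascii_ok = False
    return f"U+{ord(ch):04X}", ch.encode("utf-8"), ascii_ok


print(encodings("A"))       # a Latin letter, representable in ASCII
print(encodings("\u0686"))  # the three-dot letter tcheh, outside ASCII
```

So the question for a given Ottoman letter is whether it has a Unicode code point, not whether it fits in ASCII.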

Ibr

Dec 19, 2019, 3:10:03 AM

to tesseract-ocr

Hi Serkan,

My pleasure brother, any time :)

"Do I need a new model for ottoman, what you think ?": Of course, I think it would help you a lot, but honestly I have no idea how to create traineddata for Ottoman or any other language. That's why your best shot may be the Farsi traineddata, unless of course you know how to create Ottoman traineddata.

"I understand that if any letter that does not have ASCII correspondence can not be recognized and converted to text. Right ? if yes can we say that that letters can never be contained in OCR ?": Theoretically yes, if I understand the matter correctly. Why did I mention Unicode and ASCII in the first place? Because I ran into this before and opened an issue about it; in that issue you can see how each character has its own code. That's why I asked whether the Ottoman writing system is supported by editors such as MS Office. According to Shree's comment, "If all required Ottoman characters do not have a Unicode codepoint, then you may have to assign some random letter instead", so it seems that any Ottoman letter without a code point won't be recognized.

Again, I think that if you look more closely at the Farsi alphabet and compare it with the Ottoman alphabet, you may conclude that Farsi will do, since Tesseract works on characters, not meaning. Unfortunately I can't help with this, since I know only a little Farsi. You need a Farsi specialist or a native reader, such as an Iranian or an Azerbaijani.

It's good that Shree is here. Shree is an expert in this area and helpful as well, and now that Unicode and ASCII representation and creating traineddata have come up, Shree knows this material better than I do.

Again, pay attention to the quality of the images. Some images may give poor results only because of imperfections in the image itself, such as old lines or dots, so some image enhancement will give better results.
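
One common enhancement is binarization: turning a greyscale scan into pure black and white so that faint stains drop out. The sketch below is my own choice of technique, Otsu's threshold, written in plain Python over a flat list of grey levels (0 to 255); loading a real image into such a list is left out. Tesseract also applies its own thresholding internally, so whether an extra step like this helps on a given scan has to be tested.

```python
def otsu_threshold(pixels):
    """Return the grey level t that maximizes between-class variance.

    Pixels with value <= t are treated as the dark class.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    n_dark = 0    # pixels in the dark class so far
    sum_dark = 0  # sum of their grey levels
    for t in range(256):
        n_dark += hist[t]
        sum_dark += t * hist[t]
        if n_dark == 0:
            continue
        n_light = total - n_dark
        if n_light == 0:
            break
        mean_dark = sum_dark / n_dark
        mean_light = (sum_all - sum_dark) / n_light
        between_var = n_dark * n_light * (mean_dark - mean_light) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(pixels, threshold):
    """Map each pixel to 0 (ink) or 255 (paper)."""
    return [0 if p <= threshold else 255 for p in pixels]


# Tiny made-up sample: three dark ink pixels and three light paper pixels.
sample = [10, 10, 12, 200, 210, 205]
t = otsu_threshold(sample)
print(t, binarize(sample, t))
```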

Serkan Taş

Dec 19, 2019, 5:00:39 PM

to tesser...@googlegroups.com

Hi Shree,

I checked the GitHub page you referred to. I need some time to prepare line images and their ground-truth transcriptions; I think I can do it, but it will take a while.

"Tesseract recognizes images to Unicode code points (UTF8 text). If all required Ottoman characters do not have a Unicode codepoint, then you may have to assign some random letter instead."

Unfortunately, I do not think that all Ottoman letters have a Unicode code point.

Just to be clear: do you mean mapping these characters to some arbitrary letter, so that they are recognized consistently but cannot be converted to the exact text? How can this method work?


Serkan Taş

Dec 19, 2019, 5:06:56 PM

to tesser...@googlegroups.com

Hi Ibrahim,

Following Shree's advice, I am going to work on training for some time. Before that, of course, I am going to work out which letters and other symbols in the Arabic and Farsi datasets are shared with Ottoman. I am still not sure how to fine-tune an existing model, but I am going to try to understand it.

As for MS Word: when I install a TTF font made for the Ottoman alphabet, yes, I can see all 34 Ottoman letters in a document.


Shree Devi Kumar

Dec 20, 2019, 7:24:49 AM

to tesseract-ocr

Check https://github.com/OpenITI/OCR_GS_Data/tree/master/AzTurkish/kulliyati


Ibr

Dec 24, 2019, 10:19:54 AM

to tesseract-ocr

Hi Serkan,

If the Ottoman letters have codes to represent them, then yes, it's doable.


Serkan Taş

Dec 28, 2019, 3:34:13 PM

to tesser...@googlegroups.com

Hi İbrahim,

I worked on the subject and found some work and documents confirming that all Ottoman letters have Unicode equivalents. I am going to fine-tune trained models as Shree advises.

Here are samples of five of the letters and their Unicode values:

ﭖ : %uFB56

ﭺ : %uFB7A

ﮊ : %uFB8A

ﮒ : %uFB92

ﯓ : %uFBD3
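
The five values listed are in the Arabic Presentation Forms-A block, which holds positional (here, isolated) forms of letters. Running text is normally stored with the base letters from the main Arabic block, so ground-truth transcriptions would probably use those instead; that is an assumption about the training text, not something stated in this thread. A minimal sketch that maps each listed value to its base letter with NFKC normalization from the Python standard library:

```python
import unicodedata

# The five code points listed above.
PRESENTATION_FORMS = ["\ufb56", "\ufb7a", "\ufb8a", "\ufb92", "\ufbd3"]


def to_base_letter(ch):
    """Map a presentation form to its base letter via compatibility normalization."""
    return unicodedata.normalize("NFKC", ch)


for form in PRESENTATION_FORMS:
    base = to_base_letter(form)
    print(f"U+{ord(form):04X} -> U+{ord(base):04X} {unicodedata.name(base)}")
```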

Selamlar


Ibr

Dec 31, 2019, 3:00:03 AM

to tesseract-ocr

Hi Serkan,

Well, that's great. With the corresponding Unicode values you can definitely find a workaround for the issue at hand. Which traineddata are you planning to fine-tune?

selamlar sana :)

Serkan Taş

Jan 1, 2020, 2:51:50 PM

to tesser...@googlegroups.com

I am not sure, Ibrahim. Shree advises Arabic, but I wonder whether Farsi may also be a good candidate. What do you think?


Ibr

Jan 2, 2020, 2:38:20 AM

to tesseract-ocr

In my opinion, theoretically, since Farsi has more letters than Arabic and those letters also exist in Ottoman, Farsi should work better. But Shree is better informed than I am on this matter.

I remember that fine-tuning for Arabic fonts took a very long time, more than a week for the tuning, but that was with Tesseract 3; I don't know whether that is still the case with Tesseract 4. Overall I think Farsi should do better. If it doesn't give good enough results, then try Arabic. I hope it works out well for you :)

Serkan Taş

Jan 3, 2020, 2:36:54 PM

to tesser...@googlegroups.com

Thank you so much İbrahim,

I am going to work on the fine-tuning process.

Selam...


Ibr

Jan 5, 2020, 3:00:23 AM

to tesseract-ocr

Not at all bro :)

Tell me if you get good results; I'm interested to know.

Selam
