--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/60176871-9c44-42a9-9135-9a7c3f92a8d2%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Don't try to train Tesseract versions earlier than 4.0 for Arabic (same for Persian, Urdu, etc.). It's hopeless. For 4.0 only train with the LSTM method.
Hi Ibr,
My first thought was to look into machine learning to see whether this subject is worth pursuing, and while searching the list for Ottoman text transliteration I found your discussions.
I am just trying to understand whether I am in the right place for my subject.
The language is originally Turkish, but the alphabet is a mixture of the Arabic and Farsi alphabets, and the texts contain words from those two languages and from some western languages such as English and, mostly, French.
Here are the steps:
1st step: Use an OCR or other ML algorithm to retrieve text from images written in various styles, including handwriting, with historical roots going back over 600 years.
2nd step: Transliterate the successfully recognized text into the Roman (Latin) alphabet.
3rd step: Because of the age of the language, use some algorithm to modernize it into present-day Turkish.
Do you think Tesseract is the right tool for these purposes?
If yes, how?
What should the steps be, and where can I find guides?
Thanks in advance,
Serkan
Please find some sample documents attached.
Sent from Mail for Windows 10
From: Ibr
Sent: Monday, 16 December 2019 10:56
To: tesseract-ocr
Subject: [tesseract-ocr] Re: How to use Tesseract Arabic OCR.
Hi Serkan,
Tesseract works roughly like this. Each language or writing system has its own model, which Tesseract relies on to recognize the characters in an image. I guess it is based on something called the stroke width transform, which detects shapes: when Tesseract scans an image and finds a shape (a letter) it already knows, it assigns the matching letter, writes it to the output text, and moves on to the next shape. (A model in ML is something like a brain that decides the result based on the input.)
Why am I telling you all this? To give you an idea of how it works, and so you know that the results are never guaranteed: even with high accuracy you may still get some errors. That is how machine learning works in general, and it is why people usually train the model further to improve its accuracy.
About the Ottoman writing system: you said "The language is Turkish originally", but Tesseract doesn't care about the meaning of the text, only the shapes. You also said the "alphabet is some kind of mixture of Arabic and Farsi alphabet". I'm a native Arabic speaker, and I can read the first image you shared, "ödev", without knowing what it means (except for a few words I already know in Turkish). I can read Farsi as well. The problem is that the Farsi alphabet contains extra letters that don't exist in Arabic, very close but slightly different. For example, (چ) is the same as (ج) but with three dots. Both letters exist in Farsi, but only the second exists in Arabic. So running the two letters through the Farsi model should work fine, but with the Arabic model I think both will be recognized as the letter with one dot.
Arabic has 28 letters and Farsi has 32, I guess. So if the Ottoman alphabet contains letters from Farsi, the Arabic model won't be enough, since Farsi contains the Arabic letters plus some extra ones. If the Ottoman and Farsi alphabets are the same, the Farsi model (I think it's fas.traineddata) will surely work fine. But if the Ottoman alphabet has letters that don't exist in Farsi, those letters won't be recognized, or will be recognized incorrectly.
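The coverage argument above can be checked mechanically before running any OCR. The following is only a minimal sketch: the names `ARABIC_LETTERS`, `PERSIAN_EXTRA` and `uncovered_letters` are made up for illustration, and the alphabets are simplified (real Persian text usually encodes kaf and yeh with their own code points, U+06A9 and U+06CC, which is ignored here). It lists the letters of a sample text that a given alphabet does not contain.

```python
# The 28 basic Arabic letters, built from their Unicode code points.
ARABIC_LETTERS = (
    {chr(0x0627), chr(0x0628)}                  # alef, beh
    | {chr(c) for c in range(0x062A, 0x063B)}   # teh .. ghain
    | {chr(c) for c in range(0x0641, 0x0649)}   # feh .. waw
    | {chr(0x064A)}                             # yeh
)

# The four letters Persian adds: peh, tcheh, jeh, gaf.
PERSIAN_EXTRA = {"\u067E", "\u0686", "\u0698", "\u06AF"}
PERSIAN_LETTERS = ARABIC_LETTERS | PERSIAN_EXTRA


def uncovered_letters(text, alphabet):
    """Return the sorted letters of `text` that are missing from `alphabet`."""
    return sorted({ch for ch in text if ch.isalpha() and ch not in alphabet})


if __name__ == "__main__":
    sample = "\u0686\u0627\u062C"  # tcheh, alef, jeem
    print(uncovered_letters(sample, ARABIC_LETTERS))   # the tcheh is reported
    print(uncovered_letters(sample, PERSIAN_LETTERS))  # nothing is reported
```

Running the same check with a full list of Ottoman letters against the Persian set would show which letters, if any, the Farsi model has no output symbol for.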
About the font: I'm not sure which font is used in the two pictures, but the one in the first picture is definitely covered by the Arabic model. Tesseract 4 (at least when I last used it, almost a year ago) includes, I think, about 5000 Arabic fonts, which covers almost all fonts, so I don't think you would need any training on different fonts.
One last thing: when I used Tesseract it gave excellent results for Arabic, and for Japanese as well, on formal documents, but accuracy on handwritten documents was really low. I don't know whether that is still the case, but if it is, handwriting won't give good results. For example, I assure you the second image you shared, "sample01", won't be recognized even if you have an Ottoman model. The first one I'm not sure about: I think it would be recognized, but any word with a small gap in it, because the document is old, will come out split.
To be honest, you won't know for sure until you try it in Tesseract. Since version 4 Tesseract is easy to use, especially because it's not necessary to train the model on new fonts. So in my opinion, either open a question on this Google group or on GitHub asking whether there is an Ottoman model, or, since you seem to know this material, decide for yourself whether the Farsi model will do and try it.
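Trying the Farsi model on a sample page takes one command. Here is a minimal sketch that wraps the `tesseract` command-line tool from Python. It assumes Tesseract 4 with `fas.traineddata` installed; the file names `page.png` and `out` and the helper names are hypothetical, and the page segmentation mode (`--psm 6`, a single uniform block of text) is a guess that may need changing for real scans.

```python
import subprocess


def build_tesseract_command(image_path, output_base, lang="fas", psm=6):
    """Build the argument list for one OCR run with the LSTM engine (--oem 1)."""
    return [
        "tesseract", image_path, output_base,
        "-l", lang,          # model to use, e.g. "fas", "ara" or "fas+ara"
        "--oem", "1",        # LSTM engine only
        "--psm", str(psm),   # page segmentation mode
    ]


def run_ocr(image_path, output_base, lang="fas"):
    """Run Tesseract and return the recognized text (written to <output_base>.txt)."""
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    with open(output_base + ".txt", encoding="utf-8") as handle:
        return handle.read()


if __name__ == "__main__":
    # Hypothetical file names; compare the Farsi and Arabic models on the same page.
    for model in ("fas", "ara"):
        print(model, run_ocr("page.png", "out_" + model, lang=model)[:200])
```

Comparing the two outputs side by side on the printed sample is a quick way to see whether the extra Farsi letters make a difference.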
I hope this was helpful. I went into a lot of detail, but only to give you the full picture of what's going on so you can decide whether it fits, since I don't know enough about the Ottoman writing system. If you still have any questions, I'm here to help :)
Thanks :)
Ibrahim
Hi Ibrahim,
Following Shree's advice, I am going to work on training for some time. Before that, of course, I am going to work through the alphabet and other symbols in the Arabic and Farsi datasets that are shared with Ottoman. I am still not sure how to fine-tune an existing dataset, but I am going to try to understand it.
As for MS Word: when I install the TTF prepared for the Ottoman alphabet, yes, I can see all 34 letters of Ottoman in a document,
On Thu, Dec 19, 2019 at 11:10 AM Ibr <ibr....@gmail.com> wrote:
Hi Serkan,
My pleasure brother, any time :)
"Do I need a new model for ottoman, what you think ?" Of course, I think it would help you a lot, but honestly I have no clue how to create trained data for Ottoman or any other language. That's why your best shot may be the Farsi trained data, unless of course you know how to create Ottoman trained data.
"I understand that if any letter that does not have ASCII correspondence can not be recognized and converted to text. Right ? if yes can we say that that letters can never be contained in OCR ?" Theoretically yes, if I understand the matter correctly. Why did I mention Unicode and ASCII in the first place? Because I faced this issue before and opened an issue about it; refer to this issue and you can see how each character has its own corresponding code. That's why I asked whether the Ottoman writing system is recognized by other editors such as MS Office. According to Shree's comment, "If all required Ottoman characters do not have a Unicode codepoint, then you may have to assign some random letter instead", it seems that any Ottoman letter without a code won't be recognized. Again, I think if you look deeper into the Farsi alphabet and compare it with the Ottoman alphabet, you might conclude that Farsi should do, since Tesseract works on characters, not meaning. Unfortunately I can't help you with this, since I know only a little Farsi; you need a Farsi specialist or a native speaker, such as an Iranian or Azerbaijani.
It's a good thing Shree is here. He is an expert in this matter and helpful as well; now that Unicode and ASCII representation and creating trained data have been brought to the table, he knows this material better than I do.
Again, pay attention to the quality of the images. Some images may give poor results because of imperfections in the image itself, such as old lines or dots, so some image enhancement will give better results.
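To see the "corresponding code" of each character, Python's standard `unicodedata` module is enough. This is a small sketch with a made-up helper name, `describe_characters`; it reports the Unicode code point and official name of each character, and whether it fits in ASCII (code point below 128).

```python
import unicodedata


def describe_characters(text):
    """Return (character, code point, Unicode name, is_ascii) for each character."""
    rows = []
    for ch in text:
        rows.append((
            ch,
            f"U+{ord(ch):04X}",                 # code point in the usual notation
            unicodedata.name(ch, "UNNAMED"),    # official name, or a placeholder
            ord(ch) < 128,                      # True only for ASCII characters
        ))
    return rows


if __name__ == "__main__":
    # jeem and tcheh: both have Unicode code points, neither is ASCII.
    for row in describe_characters("\u062C\u0686"):
        print(row)
```

Any letter typed with the Ottoman TTF in MS Word can be pasted into such a check: if it comes back with a code point and a name, it has a Unicode representation that a trained model could output.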
In my opinion, "theoretically", since Farsi has more letters than Arabic and those letters also exist in Ottoman, Farsi should work better, but Shree is better informed than I am on this.
I remember fine-tuning for Arabic fonts took too much time, more than a week, but that was in Tesseract 3. I don't know whether this is still the case in Tesseract 4. Overall I think Farsi should do better; if it doesn't give good enough results, then try Arabic. I hope it works out well for you :)
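For reference, fine-tuning in Tesseract 4 starts from the LSTM component of an existing model. The sketch below only builds the two command lines involved and does not run them. It assumes the Tesseract training tools (`combine_tessdata`, `lstmtraining`) are installed and that the starting model is a float "best" model rather than a fast integer one; the helper name, the file names `train_list.txt` and `ottoman_tuned`, and the iteration count are hypothetical.

```python
def build_finetune_commands(base_lang, train_list, output_prefix, max_iterations=400):
    """Build the commands to extract a model's LSTM and continue training from it."""
    traineddata = f"{base_lang}.traineddata"
    lstm_file = f"{base_lang}.lstm"
    # Step 1: extract the LSTM component from the existing traineddata file.
    extract = ["combine_tessdata", "-e", traineddata, lstm_file]
    # Step 2: continue training from that LSTM on the new training lines.
    train = [
        "lstmtraining",
        "--continue_from", lstm_file,
        "--traineddata", traineddata,
        "--train_listfile", train_list,     # text file listing the .lstmf training files
        "--model_output", output_prefix,    # prefix for the checkpoint files
        "--max_iterations", str(max_iterations),
    ]
    return extract, train


if __name__ == "__main__":
    for command in build_finetune_commands("fas", "train_list.txt", "ottoman_tuned"):
        print(" ".join(command))
```

Starting from `fas` rather than `ara` follows the reasoning above that the Farsi model already covers more of the needed letters; whether that holds in practice can only be judged from the results.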