--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/a6c02172-1fb1-4f00-b096-44332cd81075%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Google has not provided the images and box files for the san.traineddata released for 3.04.
I tried training using text2image with a combination of different fonts and training text. Results are at https://github.com/Shreeshrii/imagessan/tree/master/tessdata
You can give these a try to see if recognition is any better.
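For anyone wanting to repeat the text2image step, here is a minimal sketch of how the per-font command lines could be assembled. It is not the exact setup used for the files linked above: the font names, the training text file and the fonts directory are placeholders, and the lang.fontname.expN output base is just the usual Tesseract 3.x naming convention.

```python
import shlex
from typing import List


def text2image_command(lang: str, font: str, training_text: str,
                       fonts_dir: str, exp: int = 0) -> List[str]:
    """Build one text2image call that renders training_text in one font.

    text2image writes <outputbase>.tif and <outputbase>.box.
    """
    font_tag = font.replace(" ", "")  # "Lohit Devanagari" -> "LohitDevanagari"
    outputbase = f"{lang}.{font_tag}.exp{exp}"
    return [
        "text2image",
        f"--text={training_text}",
        f"--outputbase={outputbase}",
        f"--font={font}",
        f"--fonts_dir={fonts_dir}",
    ]


if __name__ == "__main__":
    # Placeholder font names and paths: substitute the ones you actually have.
    for font in ["Lohit Devanagari", "Sanskrit 2003"]:
        cmd = text2image_command("san", font, "san.training_text",
                                 "/usr/share/fonts")
        print(shlex.join(cmd))  # or pass cmd to subprocess.run(cmd, check=True)
```

Building the command as a list keeps font names with spaces intact when it is handed to subprocess.run.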
You can unpack any traineddata file using the -u option of combine_tessdata to see the config files etc.
http://manpages.ubuntu.com/manpages/trusty/man1/combine_tessdata.1.html
Use dawg2wordlist to look at the various dictionary word lists used.
http://manpages.ubuntu.com/manpages/trusty/man1/dawg2wordlist.1.html
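A rough sketch of those two steps, following the argument order given in the man pages above. The paths and the output directory are made up for illustration, and the word-dawg component name assumes the usual 3.0x layout, so check what actually gets unpacked.

```python
import shlex
from pathlib import PurePosixPath
from typing import List


def unpack_command(traineddata: str, out_dir: str) -> List[str]:
    """Build the combine_tessdata -u call that unpacks every component.

    The last argument is an output path prefix (note the trailing dot),
    so components are written as e.g. <out_dir>/san.unicharset.
    """
    lang = PurePosixPath(traineddata).stem  # "san" for "san.traineddata"
    prefix = f"{out_dir.rstrip('/')}/{lang}."
    return ["combine_tessdata", "-u", traineddata, prefix]


def dawg_to_wordlist_command(prefix: str,
                             dawg_kind: str = "word-dawg") -> List[str]:
    """Build the dawg2wordlist call: UNICHARSET, DAWG, then output WORDLIST."""
    return [
        "dawg2wordlist",
        prefix + "unicharset",
        prefix + dawg_kind,
        prefix + dawg_kind + ".txt",  # plain-text word list to inspect
    ]


if __name__ == "__main__":
    # Hypothetical paths: adjust to where your traineddata lives.
    unpack = unpack_command("tessdata/san.traineddata", "unpacked")
    print(shlex.join(unpack))
    print(shlex.join(dawg_to_wordlist_command(unpack[-1])))
```

The output directory has to exist before the unpack command is run.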
- sent from my phone. excuse the brevity.
Yes, hin traineddata with cube gives better results than san.
I did some rudimentary testing with the new traineddata I made. It does not use cube. Look at the config file; it has some options for Devanagari processing.
You could try to unpack the hin traineddata, remake the DAWG files using Sanskrit word lists, and then combine everything again as an experiment.
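The experiment could look roughly like this once the hin components have been unpacked with combine_tessdata -u. This is only a sketch: the unpacked/hin. prefix and the sanskrit_words.txt file name are assumptions, and only the main word DAWG is rebuilt here.

```python
import shlex
from typing import List


def rebuild_commands(prefix: str, wordlist: str) -> List[List[str]]:
    """Commands to swap a new word list into already-unpacked components.

    prefix   -- path prefix of the unpacked files, e.g. "unpacked/hin."
    wordlist -- UTF-8 text file with one word per line (hypothetical name)
    """
    unicharset = prefix + "unicharset"
    word_dawg = prefix + "word-dawg"
    return [
        # Argument order: wordlist2dawg WORDLIST DAWG UNICHARSET
        ["wordlist2dawg", wordlist, word_dawg, unicharset],
        # combine_tessdata PREFIX merges the PREFIX* components into
        # PREFIX + "traineddata", e.g. unpacked/hin.traineddata
        ["combine_tessdata", prefix],
    ]


if __name__ == "__main__":
    for cmd in rebuild_commands("unpacked/hin.", "sanskrit_words.txt"):
        print(shlex.join(cmd))  # or subprocess.run(cmd, check=True)
```

The word list should only contain characters that are present in the hin unicharset, since the DAWG is built against it.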
If you have a Unicode version of the font used in the documents you want to OCR, then train using that.
- sent from my phone. excuse the brevity.
If you look at the readme files in the different subdirectories starting with OCR under https://github.com/Shreeshrii/imagessan/tree/master you will see the character and word level accuracy results. Depending on the font, character level accuracy is around 80% and word level accuracy around 60%.
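For reference, one common way to compute such figures is 1 minus the edit distance divided by the length of the ground truth, taken over characters or over whitespace-separated words. The sketch below assumes that definition; the numbers in the readme files may have been produced with a different tool and a different definition.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def accuracy(ref: Sequence, hyp: Sequence) -> float:
    """1 - errors / reference length, clamped at 0 (assumed definition)."""
    if not ref:
        raise ValueError("reference must not be empty")
    return max(0.0, 1.0 - edit_distance(ref, hyp) / len(ref))


def char_accuracy(ref_text: str, ocr_text: str) -> float:
    # Compares Unicode code points, so a Devanagari conjunct counts as
    # several characters rather than one visual cluster.
    return accuracy(ref_text, ocr_text)


def word_accuracy(ref_text: str, ocr_text: str) -> float:
    return accuracy(ref_text.split(), ocr_text.split())
```

A single wrong character makes the whole word wrong, which is why word level accuracy comes out lower than character level accuracy on the same page.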
I have not used it for actual OCR of any text because the sanskritocr software by Dr. Oliver Hellwig gives better results.
See https://sites.google.com/site/sanskritcode/ocr/1-ocr-ing
- sent from my phone. excuse the brevity.
You can check out the older version of sanskritocr from http://learnsanskrit.org/tools/ocr
The new version is commercial software, available as a demo for free, but requires payment for use.
- sent from my phone. excuse the brevity.
Oliver released the first version of sanskritocr for free; the new version is commercial, with a demo, and is sold by indsenz. I assume the newer one may be better, and it also allows training for particular fonts.
- sent from my phone. excuse the brevity.