Recognition of macrons in Tesseract (āĀēĒīĪōŌūŪ)


alter...@gmail.com

Jan 20, 2017, 2:21:14 AM
to tesseract-ocr
Dear all,
I frequently use Tesseract (3.04) and it’s great.
Still, I can’t find a way to get Tesseract to recognize macrons (āĀēĒīĪōŌūŪ).
There was a discussion here about it 5 years ago, but at the time there wasn’t much of a solution.
Things may have changed since then, and I’m wondering whether anybody has some hints.
Macrons are used, among other things, in Japanese transcribed in the Latin alphabet (rōmaji).
Thanks in advance for all possible ideas.
For now, using fra or deu as one of the languages, I get ô or ö instead…
Best,
Nicolas

ShreeDevi Kumar

Jan 20, 2017, 5:22:49 AM
to tesser...@googlegroups.com, tesser...@googlegroups.com, Ray Smith
In addition to macrons, I would also like to request the addition of other accented letters used for Indic text transliteration.

Ray, 

Will using -l eng+<new training> be the best way to handle these?

I tried add-layer training, but the recognition got worse, since I did not use many fonts for the test training. I am attaching the training sample I used. Thanks.
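
Before training on a sample like this, it can help to check how often each target letter actually occurs in it. Below is a minimal sketch of such a check. The helper name `coverage` is hypothetical, and the default character set is only the macron vowels from the thread title; the additional letters needed for Indic transliteration would have to be passed in explicitly.

```python
import unicodedata
from collections import Counter

# Macron vowels from the thread title; extend this for other schemes.
MACRONS = "āĀēĒīĪōŌūŪ"


def coverage(text, required=MACRONS):
    """Count how often each required character appears in `text`.

    The text is NFC-normalized first, so a base letter followed by a
    combining macron (U+0304) counts as the precomposed letter.
    """
    counts = Counter(unicodedata.normalize("NFC", text))
    return {ch: counts[ch] for ch in required}


if __name__ == "__main__":
    sample = "rāma sītā"
    for ch, n in coverage(sample).items():
        print(ch, n)
```

To run it on a real file, read the text with `open(path, encoding="utf-8").read()` and pass the result to `coverage`. Characters with a count of zero or very few occurrences are unlikely to be learned well.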

Please see the following links for the transliteration schemes showing the letters to be included.




Here are various sites that have Sanskrit corpus in transliteration.



You can see

e.g. pages 87-91, showing the use of both Devanagari script and transliterated Sanskrit as part of mainly English text



ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


san_latn.training_text

LaSo

Nov 14, 2018, 11:10:19 AM
to tesseract-ocr

Did you find a solution yet? I have the same problem. Well, it's a bit worse: I also need German letters (ÄäÖöÜü)... :|