to tesseract-ocr
Hi, everyone.
I'm looking into Filipino support in Tesseract OCR. It appears that at least Ñ/ñ is not supported. They should be, as you can see here.
I'm being told that other Latin characters are also used, like those in Spanish. Is this true?
Thanks in advance,
C.D.
Tom Morris
Sep 22, 2024, 2:29:13 AM
to tesseract-ocr
On Friday, September 20, 2024 at 12:03:47 PM UTC-4 cdok...@gmail.com wrote:
> I'm looking into Filipino support in Tesseract OCR. It appears that at least Ñ/ñ is not supported. They should be, as you can see here.
> I'm being told that other Latin characters are also used, like those in Spanish. Is this true?
The Filipino support definitely looks incomplete. Neither fil.unicharset [1] nor the training text [2] includes Ñ/ñ. Since it sounds like they are principally used for Spanish loan words, one solution might be to use both languages (i.e. fil+spa). You could also try the generic Latin script data.
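For anyone who wants to repeat the check, here is a rough Python sketch. It assumes the usual unicharset text layout (first line is the entry count, and each later line starts with the character itself followed by its properties). The helper names `missing_from_unicharset` and `tesseract_command`, the sample data, and the file name `page.png` are all made up for illustration; only the `-l` option with `+`-joined language codes is the regular Tesseract command-line usage.

```python
from typing import Iterable, List


def missing_from_unicharset(unicharset_text: str, wanted: Iterable[str]) -> List[str]:
    """Return the characters in `wanted` that have no entry in the unicharset."""
    lines = unicharset_text.splitlines()
    # Assumption: line 1 holds the entry count; every later line starts with
    # the unichar itself, followed by whitespace-separated properties.
    entries = {line.split()[0] for line in lines[1:] if line.strip()}
    return [ch for ch in wanted if ch not in entries]


def tesseract_command(image_path: str, langs: List[str]) -> List[str]:
    """Build a Tesseract CLI call that loads several models joined with '+'."""
    return ["tesseract", image_path, "stdout", "-l", "+".join(langs)]


# Tiny simplified sample for illustration, NOT the real fil.unicharset.
sample = "4\nNULL 0\nN 5\nn 3\na 3\n"

print(missing_from_unicharset(sample, ["Ñ", "ñ", "n"]))
print(tesseract_command("page.png", ["fil", "spa"]))
```

To run it against the real data, read fil.unicharset as UTF-8 text and pass its contents in place of `sample`; the command list can be handed to `subprocess.run` once Tesseract and both language models are installed.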
to tesseract-ocr
Thanks for the feedback.
I've already tried with "fil+spa" with no success :(
One thing that worries me is that I cannot find a single sample Filipino text image with Ñ/ñ on it, just to have an independently produced sample. All I have is a couple of small snippets of text which produce only the plain N/n characters.