Filipino character set (alphabet) support

Constantine Dokolas

Sep 20, 2024, 12:03:47 PM

to tesseract-ocr

Hi, everyone.

I'm looking into Filipino support in Tesseract OCR. It appears that at least Ñ/ñ is not supported. They should be, as you can see here.

I'm being told that other Latin characters are also used, like those in Spanish. Is this true?

Thanks in advance,

C.D.

Tom Morris

Sep 22, 2024, 2:29:13 AM

to tesseract-ocr

On Friday, September 20, 2024 at 12:03:47 PM UTC-4 cdok...@gmail.com wrote:

I'm looking into Filipino support by Tesseract OCR. It appears that at least Ñ/ñ is not supported. They should as you can see here.

I'm being told that other latin characters are also used, like those in Spanish. Is this true?

The Filipino support definitely looks incomplete. Neither fil.unicharset [1] nor the training text [2] includes Ñ/ñ. Since it sounds like they are principally used for Spanish loanwords, one solution might be to use both languages (i.e. fil+spa). You could also try the generic Latin script data.
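If you want to repeat the unicharset check yourself, something like the rough sketch below should do. It assumes the usual unicharset layout (first line is the entry count, then one entry per line with the symbol as the first whitespace-separated field). The helper names and the tiny sample are made up for illustration; the sample is not the real fil.unicharset, so paste in the actual file contents from [1] to test it.

```python
def unicharset_symbols(text):
    """Return the set of symbols listed in unicharset-style text."""
    symbols = set()
    # Assumption: the first line holds the entry count, so skip it.
    for line in text.splitlines()[1:]:
        fields = line.split()
        if fields:
            # Assumption: the symbol is the first field on each entry line.
            symbols.add(fields[0])
    return symbols


def missing_chars(text, wanted="Ññ"):
    """List the characters from `wanted` that the unicharset does not contain."""
    symbols = unicharset_symbols(text)
    return [ch for ch in wanted if ch not in symbols]


# Made-up four-entry sample, only to show the shape of the input.
sample = "4\nNULL 0 Common 0\na 3 Latin 1\nn 3 Latin 2\nN 5 Latin 3\n"
print(missing_chars(sample))
```

An empty list would mean every character you asked about is present. Note that this compares exact code points, so a decomposed n plus combining tilde would not match the precomposed ñ.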

Tom

[1] https://github.com/tesseract-ocr/langdata_lstm/blob/main/fil/fil.unicharset

[2] https://github.com/tesseract-ocr/langdata_lstm/blob/main/fil/fil.training_text

Constantine Dokolas

Sep 22, 2024, 9:00:26 AM

to tesseract-ocr

Thanks for the feedback.

I've already tried with "fil+spa" with no success :(
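For anyone who wants to compare the variants mentioned in this thread side by side, here is a minimal sketch that only builds the command lines and does not run anything. `sample.png` is a placeholder file name, `tesseract_command` is a helper made up for this post, and `script/Latin` assumes the script traineddata is installed under your tessdata directory.

```python
import shlex


def tesseract_command(image_path, languages, output_base="stdout"):
    """Build a tesseract CLI call; several languages are joined with '+'."""
    lang_arg = "+".join(languages)
    return ["tesseract", image_path, output_base, "-l", lang_arg]


# The three setups discussed above: Filipino only, Filipino plus Spanish,
# and the generic Latin script model.
for langs in (["fil"], ["fil", "spa"], ["script/Latin"]):
    print(shlex.join(tesseract_command("sample.png", langs)))
```

Each list can be passed to `subprocess.run` as is if you want to execute it and compare the recognized text.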

One thing that worries me is that I cannot find a single sample image of Filipino text with Ñ/ñ in it, just to have an independently produced sample. All I have is a couple of small snippets of text, which produce the plain characters only.

C.D.
