How to create a dictionary and spell checker for my local language (Malay)?

Robbi Nespu

Jun 23, 2021, 12:00:04 AM
Hello Debian!

TL;DR: How can I generate my own dictionary and spell checker files?

I am from Malaysia, where we use Malay (Bahasa Melayu) as our primary
language, but we don't mind using English for software user interfaces
and communication. I think most of us are comfortable with English
because direct Malay translations can be a bit awkward, but of course I
encourage Malay (ms) translation (I do translation too, btw).

That is OK for user interfaces, but when it comes to documents such as
dissertations, paperwork, and reports, which we write in word processors
such as LibreOffice, we really want a dictionary and spell checker to
validate what we type and fix typos right away.

Checking around, I found a few contributions from people who distribute
dictionary and affix files, but they have not been updated for quite a
long time.

I wonder how they build and test the files. They must be using some
tools. I contacted them to ask but got no response; they may no longer
use those email addresses, or may have already left this world.

From my research, most people used Myspell a long time ago and most now
use Hunspell, and there is a new tool called Nuspell too, but I don't
understand how to use it. Where do I put my new words? Can anyone guide
me? Correct me if this is not the right tool for what I am looking for.

Looking at the English hunspell package in Debian, the files seem to
follow some pattern. At first I did not understand it, but thanks to the
Nuspell wiki manual I am now able to make sense of it:

$ head /usr/share/hunspell/en_US.aff
SET UTF-8
TRY esianrtolcdugmphbyfvkwzESIANRTOLCDUGMPHBYFVKWZ'
ICONV 1
ICONV ’ '
NOSUGGEST !

# ordinal numbers
COMPOUNDMIN 1
# only in compounds: 1th, 2th, 3th
ONLYINCOMPOUND c

$ head /usr/share/hunspell/en_US.dic
78975
0/nm
0th/pt
1/n1
1st/p
1th/tc
2/nm
2nd/p
2th/tc
3/nm

As an end user, nobody cares much about this, but now I care because I
want to add Malay words and generate the dictionary and affix files.
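
For illustration, the .dic layout shown above can be read with a few lines of Python. This is only a sketch under stated assumptions: the first line is taken as the (approximate) entry count, each following line as `word` or `word/flags`, and every flag as a single character, which is the Hunspell default when the .aff file has no FLAG directive. Escaped slashes and morphological fields are ignored, and `parse_dic` and `SAMPLE_DIC` are names made up for this example.

```python
from typing import List, Tuple

# First lines of en_US.dic as shown in the post above.
SAMPLE_DIC = """\
78975
0/nm
0th/pt
1/n1
1st/p
"""


def parse_dic(text: str) -> Tuple[int, List[Tuple[str, str]]]:
    """Split .dic text into the declared entry count and (word, flags) pairs.

    Assumption: single-character flags, no escaped '/' inside words.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    declared_count = int(lines[0])  # first line: approximate number of entries
    entries = []
    for ln in lines[1:]:
        # Everything before the first '/' is the word, the rest are flags.
        word, _, flags = ln.partition("/")
        entries.append((word, flags))
    return declared_count, entries


declared, entries = parse_dic(SAMPLE_DIC)
print("declared entries:", declared)
for word, flags in entries:
    # Each flag letter refers to a rule or option defined in the .aff file.
    print(word, list(flags))
```

Each flag letter points at something defined in the matching .aff file; for example, the `c` on `1th/tc` corresponds to the `ONLYINCOMPOUND c` line in the .aff excerpt above.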

I know it takes time to build a file large enough to be useful, but I
can spend a few hours on it when I am free, and maybe someone can
continue my work.

I plan to put it in a Debian package too, since I have experience with
Debian packaging. So I took a look at the hunspell packages in Debian.

$ ls /usr/share/hunspell/ -la
total 860
drwxr-xr-x 2 root root 4096 Feb 26 00:00 .
drwxr-xr-x 381 root root 12288 Jun 14 22:09 ..
-rw-r--r-- 1 root root 3090 Mar 1 2020 en_US.aff
-rw-r--r-- 1 root root 859956 Mar 1 2020 en_US.dic

I don't have hunspell installed, but I have hunspell-en-us:

$ apt-cache policy hunspell hunspell-en-us
hunspell:
Installed: (none)
Candidate: 1.7.0-3
Version table:
1.7.0-3 500
500 http://ftp.jp.debian.org/debian bullseye/main amd64 Packages
hunspell-en-us:
Installed: 1:2019.10.06-1
Candidate: 1:2019.10.06-1
Version table:
*** 1:2019.10.06-1 500
500 http://ftp.jp.debian.org/debian bullseye/main amd64 Packages
500 http://ftp.jp.debian.org/debian bullseye/main i386 Packages
100 /var/lib/dpkg/status

which means I can just check the hunspell-en-us package, but
https://tracker.debian.org/pkg/hunspell-en-us and
https://packages.debian.org/search?searchon=sourcenames&keywords=hunspell-en-us
say version 20070829-*, while I have 1:2019.10.06-1 installed. I am not
sure why they differ.

Anyway, I can still see the source dump at
https://sources.debian.org/src/hunspell-en-us/20070829-7/ (it would be
nice if I could see it on Salsa), and I was right: it is quite simple to
package, and the upstream source only needs the aff and dic files. I see
some light for the packaging part.

The only thing is, I don't see how I should generate these files. Or are
they really just plain text, with no tool needed to generate them?
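
As far as the Hunspell format goes, both files are plain UTF-8 text, so a minimal pair can be produced from a word list with a short script. The sketch below is only one possible approach, not how the existing Malay files were built: the function name `write_hunspell_pair`, the base name `ms_MY`, and the four sample words are assumptions for illustration, and the generated .aff holds only `SET` and `TRY` lines with no affix rules.

```python
import tempfile
from collections import Counter
from pathlib import Path
from typing import Iterable


def write_hunspell_pair(words: Iterable[str], out_dir: Path, name: str) -> None:
    """Write a minimal <name>.dic and <name>.aff into out_dir (hypothetical helper)."""
    # Deduplicate and sort so the output is stable between runs.
    unique = sorted({w.strip() for w in words if w.strip()})

    # .dic: first line is the entry count, then one word per line (no flags here).
    dic_text = "\n".join([str(len(unique)), *unique]) + "\n"
    (out_dir / f"{name}.dic").write_text(dic_text, encoding="utf-8")

    # .aff: declare the encoding, plus a TRY line listing letters by
    # descending frequency, used when generating suggestions.
    freq = Counter(ch for w in unique for ch in w.lower() if ch.isalpha())
    try_chars = "".join(ch for ch, _ in freq.most_common())
    aff_text = f"SET UTF-8\nTRY {try_chars}\n"
    (out_dir / f"{name}.aff").write_text(aff_text, encoding="utf-8")


if __name__ == "__main__":
    # Sample Malay words, for illustration only.
    sample_words = ["makan", "minum", "buku", "rumah", "makan"]
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp)
        write_hunspell_pair(sample_words, out, "ms_MY")
        print((out / "ms_MY.dic").read_text(encoding="utf-8"))
        print((out / "ms_MY.aff").read_text(encoding="utf-8"))
```

With the hunspell command-line tool installed, a pair written to the current directory could then be tried with something like `hunspell -d ./ms_MY -l < sample.txt`, which should list the words it does not recognize. Affix rules (prefixes and suffixes) would still have to be written by hand in the .aff file.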

To be honest, I might lose interest if I stay this clueless, but I am
posting here hoping to get some information that may also be useful for
someone else with the same goal.

--
Robbi Nespu <robbinespu AT SPAMFREE gmail DOT com>
D311 B5FF EEE6 0BE8 9C91 FA9E 0C81 FA30 3B3A 80BA
https://robbinespu.gitlab.io | https://mstdn.social/@robbinespu

Andrei POPESCU

Jun 23, 2021, 12:20:05 AM
On Wed, 23 Jun 21, 11:53:47, Robbi Nespu wrote:
> Hello Debian!
>
> TL;DR: How can I generate my own dictionary and spell checker files?

Try asking on debian-i18n.

> I am from Malaysia, where we use Malay (Bahasa Melayu) as our primary
> language, but we don't mind using English for software user interfaces and
> communication. I think most of us are comfortable with English because
> direct Malay translations can be a bit awkward, but of course I encourage
> Malay (ms) translation (I do translation too, btw).

Yes, this is a challenge for other languages too (speaking as a
translator myself).

> That is OK for user interfaces, but when it comes to documents such as
> dissertations, paperwork, and reports, which we write in word processors
> such as LibreOffice, we really want a dictionary and spell checker to
> validate what we type and fix typos right away.
>
> Checking around, I found a few contributions from people who distribute
> dictionary and affix files, but they have not been updated for quite a long
> time.
>
> I wonder how they build and test the files. They must be using some tools.
> I contacted them to ask but got no response; they may no longer use those
> email addresses, or may have already left this world.

Once you figure out how this works you might want to take over upstream
development, i.e. put the sources on Gitlab or similar and invite other
Malay speakers / translators to submit words to the dictionary.

> From my research, most people used Myspell a long time ago and most now
> use Hunspell, and there is a new tool called Nuspell too, but I don't
> understand how to use it. Where do I put my new words? Can anyone guide
> me? Correct me if this is not the right tool for what I am looking for.

As far as I recall Myspell was used by OpenOffice.org, not sure what
LibreOffice is using now.

And there's also aspell, and vim's own format, and probably others I
forgot about.

https://xkcd.com/927/

As far as I recall there are tools to convert between some formats
(possibly only one way), so it should be possible to have the "source"
dictionary in one format and generate all other formats from it (if
still needed / in use).

Hope this helps,
Andrei
--
http://wiki.debian.org/FAQsFromDebianUser

Robbi Nespu

Jun 26, 2021, 4:00:04 AM
On 6/23/21 12:16 PM, Andrei POPESCU wrote:
> Try asking on debian-i18n.
Thanks for your suggestion, I will ask on debian-i18n.