Update to Belarusian data (bel)


Aleś Bułojčyk

Jan 18, 2026, 4:23:03 AM
to unim...@googlegroups.com

Hello all,

I would like to update the Belarusian data for UniMorph located at https://github.com/unimorph/bel. My improvements are based on the Grammar Database of the Belarusian Language, which is currently the largest collection of Belarusian words and their grammatical tags.

I have prepared a pull request (https://github.com/unimorph/bel/pull/2), but I have a few questions:

  1. I have compressed (zipped) the data file because its size exceeded GitHub's 100 MB limit. Is this an acceptable format for UniMorph?

  2. All lemmas and word forms include stress marks (accents). Is this correct according to UniMorph standards, or should they be removed?

  3. What else needs to be done to have these changes merged into the master branch?

WBR, Alex.

Kyle Gorman

Jan 18, 2026, 9:47:55 AM
to Aleś Bułojčyk, unim...@googlegroups.com
Hi Aleś,

(I am not a maintainer of UniMorph, just an enthusiastic user, but here are some opinions.)

On stress marking: I believe (correct me if I’m wrong) that Belarusian is written with stress marks essentially only in lexicographic or pedagogical sources. It seems clear to me that the headwords should match how the words are actually written the vast majority of the time, but you could include a second file with stress if needed. (Many UniMorph repos now contain this sort of thing: extra data files with additional grammatical information.) I’d say the same about Latin (macrons and the occasional diaeresis), Russian (acute accents for stress), or Hebrew (niqqud), for instance.

On compression: Other languages (Polish is one example) have a file so large that it exceeds GitHub’s limits. Compression is highly effective and is used there, but the compression scheme is xz (.xz), not Zip (.zip). xz makes more sense to me from an engineering perspective: Zip is a “container” in the same way TAR is, with compression added on top, and we don’t need containerization, just compression.
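
As a rough illustration of compression without a container, here is a minimal sketch using the `lzma` module from the Python standard library. The function names and the `bel` / `bel.xz` file names in the usage note are hypothetical and not taken from any UniMorph tooling.

```python
import lzma
import shutil
from pathlib import Path


def compress_xz(src: Path, dst: Path) -> None:
    """Stream-compress src into dst as a single .xz stream (no container)."""
    with open(src, "rb") as fin, lzma.open(dst, "wb", preset=9) as fout:
        shutil.copyfileobj(fin, fout)


def read_xz_lines(path: Path):
    """Yield decoded lines from an .xz file without unpacking it to disk."""
    with lzma.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n")
```

For example, `compress_xz(Path("bel"), Path("bel.xz"))` would write the compressed file, and `read_xz_lines(Path("bel.xz"))` would iterate over its rows.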

K

Kat Vylomova

Jan 19, 2026, 9:50:53 AM
to Kyle Gorman, Aleś Bułojčyk, unim...@googlegroups.com
Dear Aleś,

Thank you so much for your contribution to UniMorph. I agree with Kyle on both points. For stress marking, please submit two files (with and without stress).
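
One way the unstressed file could be derived from the stressed one is sketched below. It assumes that stress is encoded as the combining acute accent (U+0301) and that each row has three tab-separated columns (lemma, form, features). Both are assumptions about the data, and the function names are hypothetical.

```python
COMBINING_ACUTE = "\u0301"  # assumption: stress is encoded with this character


def strip_stress(text: str) -> str:
    """Remove stress marks but leave letters such as ў, й, ё untouched."""
    return text.replace(COMBINING_ACUTE, "")


def unstressed_row(row: str) -> str:
    """Strip stress from the lemma and form columns of one TSV row."""
    lemma, form, features = row.split("\t")
    return "\t".join([strip_stress(lemma), strip_stress(form), features])
```

Removing only U+0301, rather than decomposing the text and dropping every combining mark, keeps the breve of ў and й and the diaeresis of ё intact.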

Warm regards,
Kat


Aleś Bułojčyk

Jan 20, 2026, 12:58:54 PM
to unimorph
Thank you, Kyle, Kat.

What naming convention should I use for files with and without stress marks?
Regarding compression, I am currently using the .xz extension, but this prevents Git from showing differences in the commit history.
Is it a good idea to split large files into smaller ones based on part of speech? If so, which naming convention should I use?
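
On the diff problem: a compressed file can still be compared outside Git by decompressing both versions in memory. The following is a minimal sketch using only the Python standard library, and the function names are hypothetical. Git's own textconv diff drivers are another route and are not shown here.

```python
import difflib
import lzma


def read_xz(path: str) -> list[str]:
    """Return the decoded lines of an .xz file, without trailing newlines."""
    with lzma.open(path, "rt", encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def diff_xz(old_path: str, new_path: str) -> list[str]:
    """Return a unified diff between the contents of two .xz files."""
    return list(
        difflib.unified_diff(
            read_xz(old_path),
            read_xz(new_path),
            fromfile=old_path,
            tofile=new_path,
            lineterm="",
        )
    )
```

This loads both files fully into memory, so it is only a sketch for files that fit comfortably in RAM.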

Alex.


Kat Vylomova

Jan 21, 2026, 12:15:15 AM
to Aleś Bułojčyk, unimorph
Dear Alex, 

You may use the language code and then add something like _voc for the version with stress marks (e.g. bel_voc). Please make sure you provide a short description of each file (here is an example: https://github.com/unimorph/ara). Using .xz for large files should be fine.

In terms of splitting by POS: typically, we don't do this, but in some cases, when the data for each part of speech comes from a different source, it might be appropriate; here is an example: https://github.com/unimorph/kaz.
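
For cases where a split by part of speech is appropriate, the grouping step could look roughly like the sketch below. It assumes three tab-separated columns per row and that the POS tag is the first item in the semicolon-separated feature bundle. Both are assumptions about the file, and `split_by_pos` is a hypothetical name.

```python
from collections import defaultdict


def split_by_pos(rows):
    """Group UniMorph-style rows by the first tag in the feature bundle."""
    groups = defaultdict(list)
    for row in rows:
        lemma, form, features = row.split("\t")
        pos = features.split(";")[0]  # assumption: POS tag comes first
        groups[pos].append(row)
    return dict(groups)
```

Each resulting group could then be written to its own file, named according to whatever convention the maintainers prefer.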

Warm regards,
Kat


Aleś Bułojčyk

Jan 28, 2026, 5:07:47 PM
to unimorph
I have fixed the files and the description in https://github.com/unimorph/bel/pull/2.

What are the next steps to merge it into the main branch?

Kat Vylomova

Jan 28, 2026, 5:27:09 PM
to Aleś Bułojčyk, unimorph
Dear Alex, 

I apologise for the delay. It looks good to me, but I would suggest not deleting the 'bel' file unless there's a strong reason to do so. Is the file broken or misleading? (It was extracted from Wiktionary.)

Warm regards,
Kat
