Update to Belarusian data (bel)

Aleś Bułojčyk

Jan 18, 2026, 4:23:03 AM
to unim...@googlegroups.com

Hello all,

I would like to update the Belarusian data for UniMorph located at https://github.com/unimorph/bel. My improvements are based on the Grammar Database of the Belarusian Language, which is currently the largest collection of Belarusian words and their grammatical tags.

I have prepared a pull request (https://github.com/unimorph/bel/pull/2), but I have a few questions:

  1. I have compressed (zipped) the data file because its size exceeded GitHub's 100MB limit. Is this an acceptable format for UniMorph?

  2. All lemmas and word forms include stress marks (accents). Is this correct according to UniMorph standards, or should they be removed?

  3. What else needs to be done to have these changes merged into the master branch?

WBR, Alex.

Kyle Gorman

Jan 18, 2026, 9:47:55 AM
to Aleś Bułojčyk, unim...@googlegroups.com
Hi Aleś,

(I am not a maintainer of UniMorph, just an enthusiastic user, but here are some opinions.)

On the stress-marking question: I believe (correct me if I'm wrong) that Belarusian is written with stress marks essentially only in lexicographic or pedagogical sources. It seems clear to me that the headwords should match how the words are actually written the vast majority of the time, but you could include a second file with stress if needed. (Many UniMorph repos now contain this sort of thing: extra data files with additional grammatical information.) I'd say the same about Latin (macrons and the occasional diaeresis), Russian (acute accents for stress), or Hebrew (niqqud), for instance.
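
A minimal sketch of how the unstressed file could be derived from the stressed one. It assumes that stress is encoded as the combining acute accent (U+0301) and that each line is tab-separated as lemma, form, features; the function names and the sample entry are illustrative only. It removes that one code point rather than all combining marks, so letters such as ў, й and ё survive even if the text is in decomposed form.

```python
# Assumption: stress is marked with U+0301 (combining acute accent).
COMBINING_ACUTE = "\u0301"


def strip_stress(text: str) -> str:
    """Remove stress marks, leaving every other character untouched."""
    return text.replace(COMBINING_ACUTE, "")


def strip_stress_line(line: str) -> str:
    """Strip stress from the lemma and form columns of one data line.

    Assumes three tab-separated columns: lemma, form, features.
    The features column is passed through unchanged.
    """
    lemma, form, features = line.rstrip("\n").split("\t")
    return "\t".join([strip_stress(lemma), strip_stress(form), features])


if __name__ == "__main__":
    # Illustrative entry, not taken from the actual data file.
    print(strip_stress_line("вада\u0301\tвады\u0301\tN;GEN;SG"))
```

Forms that differ only in stress placement collapse to the same line after stripping, so the unstressed file may need deduplication afterwards.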

On the compression question: other languages (Polish is one example) have a file so large that it exceeds GitHub's size restrictions. Compression is highly effective and is used there, but the compression scheme is xz (.xz), not Zip (.zip). A plain stream compressor such as xz or gzip makes more sense to me from an engineering perspective: Zip is a "container" in the same way TAR is, with compression added, and we don't need containerization, just compression.
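
For illustration, a single file can be compressed to .xz and read back with Python's standard `lzma` module, with no archive container involved. This is only a sketch: the file names `bel` and `bel.xz` are assumed, and the helper names are hypothetical.

```python
import lzma
import shutil
from pathlib import Path
from typing import Iterator


def compress_xz(src: Path, dst: Path) -> None:
    """Stream src into an xz-compressed dst without loading it all in memory."""
    with open(src, "rb") as fin, lzma.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)


def read_xz_lines(path: Path) -> Iterator[str]:
    """Yield decoded lines from an xz-compressed UTF-8 text file."""
    with lzma.open(path, "rt", encoding="utf-8") as fin:
        for line in fin:
            yield line.rstrip("\n")


if __name__ == "__main__":
    # Assumed file names; adjust to the actual data file.
    source = Path("bel")
    if source.exists():
        compress_xz(source, Path("bel.xz"))
```

Consumers can then iterate over `read_xz_lines(Path("bel.xz"))` directly instead of unpacking an archive first.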

K

Kat Vylomova

9:50 AM
to Kyle Gorman, Aleś Bułojčyk, unim...@googlegroups.com
Dear Aleś,

Thank you so much for your contribution to UniMorph. I agree with Kyle on both points. For stress marking, please submit two files (with and without stress).

Warm regards,
Kat
