Lexicography and Bantu morphology


Otlogetswe

Aug 25, 2016, 9:47:29 AM
to FLEx list
Dear

I am considering enriching my Setswana monolingual dictionary with morphological information. My point of departure is an electronic dictionary rather than a print one, though the dictionary may be printed at some point. I am thinking of accounting for all forms of a word in the dictionary. E.g., the verb "to go" is "TSAMAYA", but it has a number of inflected forms:

tsamaalana
tsamaalanang
tsamaang
tsamaega
tsamaege
tsamaegeng
tsamaela
tsamaelana
tsamaelanang
tsamaelane
tsamaelaneng
tsamaelang
tsamaelano
tsamaelanong
tsamaele
tsamaelelana
tsamaelelang
tsamaeleng
tsamaelwa
tsamaeng
tsamaetse
tsamaile
tsamaileng
tsamailwe
tsamaisa
tsamaisana
tsamaisanang
tsamaisane
tsamaisaneng
tsamaisang
tsamaisano
tsamaisanya
tsamaise
tsamaisege
tsamaiseng
tsamaisetsa
tsamaisetsang
tsamaisitse
tsamaisitseng
tsamaisitswe
tsamaisitsweng
tsamaisiwa
tsamaisiwang
tsamaisiwe
tsamaiso
tsamaisong
....
etc

I would like all wordforms to be findable through search. Additionally, I would like every word with complex morphology to be broken down into its morphemes. For instance, I would like TSAMAELELANA to be analysed this way (or something similar):

tsamaelelana verb. (tsamaya + -elela + -ana).

For plural nouns I would like something like this:

dikgomo n. di- + kgomo. see kgomo

The word kgomo means cow.
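
The two entry layouts above could be produced from stored components along these lines. This is only a minimal sketch: the `MinorEntry` record and both function names are hypothetical illustrations, not FLEx structures, and the data is limited to the two examples given here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MinorEntry:
    """Hypothetical record for an analysed wordform (not a FLEx structure)."""
    wordform: str
    pos: str               # part-of-speech label as printed, e.g. "verb." or "n."
    components: List[str]  # morphemes in order; affixes carry a hyphen
    headword: str          # citation form the reader is referred to


def format_verb(entry: MinorEntry) -> str:
    # Layout: wordform pos (component + component + ...).
    return f"{entry.wordform} {entry.pos} ({' + '.join(entry.components)})."


def format_noun(entry: MinorEntry) -> str:
    # Layout: wordform pos component + component. see headword
    parts = " + ".join(entry.components)
    return f"{entry.wordform} {entry.pos} {parts}. see {entry.headword}"


verb = MinorEntry("tsamaelelana", "verb.", ["tsamaya", "-elela", "-ana"], "tsamaya")
noun = MinorEntry("dikgomo", "n.", ["di-", "kgomo"], "kgomo")
print(format_verb(verb))
print(format_noun(noun))
```

Keeping the components as a list rather than as a finished string means the same record can drive both the printed analysis and a search index.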

I am not sure whether this line of thinking would work. I would like to find an elegant way of importing a wordlist of such words and of analysing it in bulk, and I would appreciate ideas on how to carry out this kind of study.

Verbs such as those above would usually not be entered as headwords in a dictionary because they are predictable. To facilitate search, however, would they have to be entered as headwords or not?

Regards
Thapelo

Jonathan Dailey

Aug 25, 2016, 10:01:26 AM
to FLEx List
The inflected forms could be added as minor entries, which could either be listed under the word in a paradigm format (root-based or hybrid) or kept separate with links, as in your example.
  1. Import the wordlist as a text.
  2. Analyze the words by breaking them into their morphemes.
  3. Add the morphemes to the lexicon.
  4. Delete the analyses in Bulk Edit Wordforms.
  5. Reanalyze all of the words, but add them as they are to the lexicon.  Make sure at this point to mark them as complex forms of a type such as inflected, plural, or past.
  6. Then add the component parts to the entries.

Something like this would be the process.
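
As a rough picture of what steps 5 and 6 leave you with, here is a minimal sketch outside FLEx. The tuple layout, the type labels, and the rule "the component without a hyphen is the main entry" are all assumptions made for illustration; FLEx stores this differently.

```python
from collections import defaultdict

# (minor entry, complex-form type, components): toy data from this thread.
MINOR_ENTRIES = [
    ("tsamaelelana", "inflected", ["tsamaya", "-elela", "-ana"]),
    ("dikgomo", "plural", ["di-", "kgomo"]),
]


def build_indexes(minor_entries):
    """Return (main_of, forms_under) lookup tables.

    main_of maps each minor entry to its main entry, so a search for the
    inflected form can land on the headword.
    forms_under maps each main entry to its minor entries, which is what a
    paradigm-style listing under the headword needs.
    """
    main_of = {}
    forms_under = defaultdict(list)
    for wordform, form_type, components in minor_entries:
        # Assumption: exactly one component has no hyphen; treat it as main.
        main = next(c for c in components
                    if not c.startswith("-") and not c.endswith("-"))
        main_of[wordform] = main
        forms_under[main].append((wordform, form_type))
    return main_of, dict(forms_under)


main_of, forms_under = build_indexes(MINOR_ENTRIES)
print(main_of["tsamaelelana"])
print(forms_under["kgomo"])
```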

Jonathan


--
You are subscribed to the publicly accessible group "FLEx list".
Only members can post but anyone can view messages on the website.
---
You received this message because you are subscribed to the Google Groups "FLEx list" group.
To unsubscribe from this group and stop receiving emails from it, send an email to flex-list+unsubscribe@googlegroups.com.
To post to this group, send email to flex...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/flex-list/34befdc8-c6c2-4b11-b1fc-6a6ff061c60f%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
SIL International
Language Technology Consultant

maxwell

Aug 25, 2016, 2:44:30 PM
to flex...@googlegroups.com, Otlogetswe
On 2016-08-25 09:47, Otlogetswe wrote:
> I am considering enriching my Setswana monolingual dictionary with
> morphological information. My point of departure is to think electronic
> dictionary, and not print, though the dictionary may be printed at some
> point. I am thinking of accounting for all forms of a word in the
> dictionary. E.g the verb "to go" is "TSAMAYA" but it has a number of
> inflected forms:
>
> ...
>
> I would like all wordform to be found through search. Additionally I
> would
> like every word with complex morphology to be broken down into its
> morphemes.
> ...
> I am not sure if this line of thinking would work and I would like to
> find
> an elegant way of importing a wordlist with such words and how to best
> (bulk) analyse it. I would appreciate ideas of how to execute this kind
> of
> study

I know people who have worked on Bantu languages, but I myself have not.
So I may be wrong here.

That said: Bantu languages are agglutinating, meaning that a word can
combine with lots of affixes (mostly prefixes) at the same time. Given
the way noun classifiers work in these languages, and the fact that a
transitive verb agrees with both its subject and its object, and that
there are lots of other affixes as well, the average verb may have
hundreds (or more?) of forms. Nouns don't take quite as many affixes,
but they still take lots, and therefore have many forms. I don't think
it would be wise to import a list of all inflected forms of all words
into a dictionary, for three reasons:

1) It would be hard to create an accurate list of all possible forms of
every word, unless you have a computational way of doing so. (And if
you have a computational way to create all the forms, there's a better
approach than listing them in the dictionary, see below.)

2) If you did import all the possible forms of every word, the
dictionary would be huge--millions of entries, I suspect. A print
dictionary containing every word would either need microscopic print, or
it wouldn't be portable. Even dictionaries of English, with its minimal
morphology, don't usually list all four forms of regular verbs (walk,
walks, walked, walking); and for Bantu languages, you'd have hundreds or
maybe even thousands of forms. An electronic dictionary that listed
every inflected form would also be huge.

3) If you had a print dictionary of all possible forms, looking up an
inflected word would be difficult, because each sequence of prefixes
would occur once for every verb. So if you alphabetized the dictionary,
and if you had a thousand transitive verbs, there would be a thousand
dictionary entries that all started with the same sequence of prefixes,
and differed only in their last few letters (the verb stem).

Rather, if you're thinking of an electronic dictionary, and you think
that users will have trouble looking up inflected forms, what you want
is a morphological analyzer: a tool that can take any possible inflected
form of a word that the dictionary user types in, analyze the
morphology, and display both the affixes (or their meaning) and the
dictionary entry for a citation form of the word.

Bill Poser (who works on Athabaskan languages, which resemble Bantu in
certain ways) and I gave a short paper on this issue, advocating the use
of morphological analyzers as front ends to electronic dictionaries for
languages with complex morphology:
http://dl.acm.org/citation.cfm?id=1610058
Morphological analyzers are often built using finite state transducers.
Such a transducer can also be used to create all the inflected forms of
a single dictionary word, which could in principle be presented to a
user as a paradigm of the word, which might also be useful.
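
To make the front-end idea concrete, here is a minimal sketch. It is not a finite state transducer, only a plain strip-and-look-up loop, and it covers nothing beyond the noun affixes that come up in this thread (plural di- and the suffix -ng). The lexicon, the affix tables, and the function name are assumptions for illustration; a real analyzer would also need the sound changes this ignores.

```python
# Toy lexicon: only the two nouns glossed in this thread.
LEXICON = {"kgomo": "cow", "kgosi": "chief"}

# Toy affix tables (assumption: a real grammar has many more affixes).
PREFIXES = ["di-"]
SUFFIXES = ["-ng"]


def analyze(wordform):
    """Return every (headword, affixes) reading found by stripping affixes."""
    word = wordform.lower()
    readings = []
    for prefix in [None] + PREFIXES:
        p = prefix.rstrip("-") if prefix else ""
        if not word.startswith(p):
            continue
        rest = word[len(p):]
        for suffix in [None] + SUFFIXES:
            s = suffix.lstrip("-") if suffix else ""
            if not rest.endswith(s):
                continue
            stem = rest[:len(rest) - len(s)]
            if stem in LEXICON:
                affixes = [a for a in (prefix, suffix) if a]
                readings.append((stem, affixes))
    return readings


for typed in ["dikgomo", "Kgosing", "kgomo"]:
    for headword, affixes in analyze(typed):
        print(typed, "->", headword, affixes, "=", LEXICON[headword])
```

The user types any form, and the dictionary shows the headword entry plus the affixes found, so the inflected forms never have to be stored as entries.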

FLEx has two morphological parsers built into it, but as far as I know
neither one is really set up to serve as part of a user interface to the
finished dictionary. (I think it would be possible to use either one in
that way, I just don't think it has been done. Perhaps someone here can
correct me if they've done that.)

Other researchers have worked on Setswana morphology, and might be able
to share their morphological analyzers:
http://www.aclweb.org/anthology/W/W09/W09-0705.pdf
http://www.aclweb.org/anthology/W/W09/W09-0710.pdf
http://link.springer.com/article/10.1007/s10579-014-9292-1

Mike Maxwell
University of Maryland

Jeff Shrum

Aug 25, 2016, 4:07:43 PM
to flex...@googlegroups.com

Thapelo,

 

I would suggest entering all of the affixes and verb extensions in the lexicon, then clicking on the Components field for the lexeme and populating it with all the morphemes that make up the word.  Below is a sample from a Bantu language. If you look at the Components line, it lists the morphemes, and they appear within the parentheses in the dictionary entry at the top.

 

 

Jeff Shrum

SIL International

Language Technology Consultant


Otlogetswe

Aug 25, 2016, 10:54:22 PM
to FLEx list
Dear

I have a Setswana corpus of about 20 million tokens. This makes it possible for me to generate a frequency list, which can also be sorted alphabetically. It also makes it possible for me to study language as used by speakers and not what is potentially possible based on morphological rules. I am aware that a common lexicographic approach is to enter only the canonical form of verbs in dictionaries, and this is perfectly understandable. This is the approach we adopted in producing the largest monolingual Setswana dictionary: "Tlhalosi ya Medi ya Setswana". By our analysis, the whole dictionary contains no more than 5000 verbs. Not all of the verbs can take all Setswana suffixes and their various combinations. There are verbs which attract many suffixes and there are those which attract a few. I am yet to see any that generates a hundred wordforms. In my earlier email I gave an example of the highly productive TSAMAYA.
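
For readers who want to reproduce such a list, a minimal sketch follows. The tokenisation rule (runs of letters, lower-cased) is an assumption and will need adjusting for apostrophes, hyphens, and other orthographic conventions; the function names are illustrative only.

```python
import re
from collections import Counter

# Assumption: a token is any run of Unicode letters.
TOKEN = re.compile(r"[^\W\d_]+")


def frequency_list(lines):
    """Count wordforms over an iterable of text lines (e.g. an open file)."""
    counts = Counter()
    for line in lines:
        counts.update(TOKEN.findall(line.lower()))
    return counts


def by_frequency(counts):
    # Most frequent first; ties broken alphabetically.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def alphabetical(counts):
    return sorted(counts.items())


sample = ["Kgosi dikgosi kgosi", "kgosing"]
counts = frequency_list(sample)
print(by_frequency(counts))
print(alphabetical(counts))
```

Passing an open file object as `lines` reads the corpus one line at a time, so the whole text never has to sit in memory.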

For nouns, we mainly add the plural prefix to mark plurality and the -ng to change the noun into an adverb.
E.g.

Kgosi (chief)
Dikgosi (plural)
Kgosing (adverb)

The principal desire is to provide users/learners with the analysis of wordforms in the dictionary. I am largely not interested in indicating the suffixes that may attach to a canonical verb, since they attach in various complex ways. The verbs would be the most challenging ones to deal with, and I am wondering whether everyone who has had to write a Bantu language dictionary has entered the canonical verbs only.

Thapelo

Otlogetswe

Aug 25, 2016, 10:56:05 PM
to FLEx list
Thanks.... listing inflected forms as minor entries might just work right.

Otlogetswe

Aug 25, 2016, 11:00:40 PM
to FLEx list
Jeff

I take it that your suggested approach doesn't enter all the wordforms but only the basic headword with the possible suffixes.

Jeff Shrum

Aug 26, 2016, 12:32:09 PM
to flex...@googlegroups.com

Thapelo,

 

Not necessarily.  If you want all of the fully inflected forms as head words, you can do that.  If you have the list of surface forms, there are various ways to import them into your database. If you have the Setswana words with their glosses in a Standard Format file, you can import a word list.  If you have just the list of Setswana words, then pasting the list in as a new text in the Texts & Words area would be the simplest way to enter the words.
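
Either input file can be prepared with a few lines of script. In this minimal sketch the markers `\lx` and `\ge` are assumptions borrowed from the common MDF convention, and the function names are illustrative; check which markers your import dialog expects before relying on them.

```python
def to_standard_format(pairs, word_marker="\\lx", gloss_marker="\\ge"):
    """Build Standard Format text from (word, gloss) pairs.

    Each record is a word line followed by a gloss line, with a blank
    line between records. The marker names are assumptions.
    """
    records = [f"{word_marker} {word}\n{gloss_marker} {gloss}"
               for word, gloss in pairs]
    return "\n\n".join(records) + "\n"


def to_plain_text(words):
    """Build a one-word-per-line text suitable for pasting in as a new text."""
    return "\n".join(words) + "\n"


print(to_standard_format([("kgomo", "cow"), ("kgosi", "chief")]))
print(to_plain_text(["tsamaela", "tsamaisa", "tsamaile"]))
```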

 

 

Then click on the gloss tab and begin glossing the words.  To get each surface form added as a head word to your database, go to the Tools menu and check the “Add words to the lexicon” option.

 

I suggest that you create a new project to do this, in case you are not happy with the results. When you have it working as you expect, the project can be merged with your existing project to join the two.

 

Jeff Shrum

SIL International

Language Technology Consultant

Dallas, TX, USA

 

 

