List of Semantic Domains by Ronald Moe - available to download?

IanS

Sep 11, 2014, 8:56:08 PM
to flex...@googlegroups.com
Is there anywhere I can download a current version of Ronald Moe's "List of Semantic Domains", as used for the Dictionary Development Process?

I'm hoping there is a PDF version that would be suitable for browsing through and thinking about.

Jonathan Dailey

Sep 11, 2014, 9:06:04 PM
to flex...@googlegroups.com
Try http://rapidwords.net/resources and scroll down.

On 9/11/2014 7:56 PM, IanS wrote:
Is there anywhere I can download a current version of Ronald Moe's "List of Semantic Domains", as used for the Dictionary Development Process?

I'm hoping there is a pdf version, that would be suitable for browsing through and thinking about.

--
You are subscribed to the publicly accessible group "FLEx list".
Only members can post but anyone can view messages on the website.
---
You received this message because you are subscribed to the Google Groups "FLEx list" group.
To unsubscribe from this group and stop receiving emails from it, send an email to flex-list+...@googlegroups.com.
To post to this group, send email to flex...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/flex-list/3e569941-a5f5-40e3-a93b-450bce121712%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

IanS

Sep 11, 2014, 9:51:20 PM
to flex...@googlegroups.com
Jonathan, I have found it, as you indicated, at http://rapidwords.net/resources/files/list-domains

That is exactly what I was looking for. Thanks!

Kevin Warfel

Sep 12, 2014, 9:17:23 AM
to flex...@googlegroups.com

Ian,

 

I’m not sure whether you’re wanting the list of semantic domain names or the list of semantic domain templates. Either is available on the RapidWords.net website by clicking on these links, respectively: list of domains or questionnaire. This website is dedicated to the first step in Ron Moe’s Dictionary Development Process: the collection of the words that constitute the corpus of data on which the subsequent steps focus. The files available for download at the links I’ve provided are in English and in Microsoft Word format. If you prefer them in PDF, contact me directly and I’ll convert what you want for you. If you want them in a different language, look through the list of what’s available on the Resources page of the website.

 

Best wishes,

Kevin

 

Kevin Warfel

Associate Dictionary and Lexicography Services Coordinator

a.k.a. Dictionary Development Coordinator

SIL International

 

Current technology makes it possible to provide those translating into just about any language with both a dictionary and a thesaurus in the target language, the standard tools of the trade for professional translators. So why are mother-tongue translators in minority languages still expected to do their work without these tools? Ask me about Rapid Word Collection after reading about it at rapidwords.net.

--

IanS

Sep 12, 2014, 10:18:44 AM
to flex...@googlegroups.com
Hi Kevin,

Thanks for your response. I had started looking through the list of domains, but since you have alerted me to the questionnaire, I've been looking through that too. I have a lexicon of 4000+ words (mostly collected 10 to 20 years ago), but I am reviewing the list of semantic domains to find things that I may have missed.

I wonder whether, if I classify my lexicon using the list (by typing the domain numbers into each of my entries), I can then sort the lexicon in FLEx into semantic domain order, so I can get a better view of what might still be missing?

Do you know how many 'branch-end' terms (the terms at the ends of each set of divisions) there are in the hierarchical list of semantic domains? I'm guessing about 1,000?

Ian

Ron Moe

Sep 12, 2014, 6:05:38 PM
to flex...@googlegroups.com
Hi Ian,
There are a total of about 1,800 domains. Each one is a legitimate domain except perhaps for a couple of the nine initial domains. They are all organized in a hierarchy with the more general domains toward the top. But you would need to look at all 1,800 domains to see where your dictionary is lacking words. You will find that some languages have no words in some domains. But I've found that almost all of the domains are valid for most languages. I classified the 20,000 most frequent words of English. In total there are 60,000 example words. But many of these are found in more than one domain. So the total number of unique words is quite a bit less than 60,000.

There are ways of automatically classifying the words in a dictionary. But it would take someone with a bit of skill using CC tables or regular expressions. Basically the procedure matches the English glosses in your dictionary with the example words in the list of domains. It doesn't work perfectly. So you would have to check the results. FLEx also has a tool that works on the same basis. You will find it in the dialog box for the Semantic Domains field. It is called "Suggest". But it only works one word at a time and would be fairly time consuming. It would be better to use the Rapid Word Collection method or simply have a speaker of the language use the Collect Words tool in FLEx.
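The gloss-matching approach Ron describes can be sketched in a few lines: suggest domains for an entry by intersecting its English gloss words with each domain's example words. The domain names and example words below are invented for illustration, not taken from the actual list:

```python
# Toy sketch of gloss-based domain suggestion. Real tools (CC tables,
# FLEx's "Suggest") work on the same principle but with the full list
# of ~1,800 domains and 60,000 example words.

domain_examples = {
    "1.1.3.3 Rain": {"rain", "drizzle", "downpour", "shower"},
    "5.2.1.2 Steps in food preparation": {"peel", "slice", "grate", "chop"},
}

def suggest_domains(gloss):
    """Return every domain whose example words share a word with the gloss."""
    words = set(gloss.lower().replace(",", " ").split())
    return sorted(d for d, examples in domain_examples.items()
                  if words & examples)

print(suggest_domains("to slice, cut into strips"))
# One gloss word ("slice") matches the food-preparation domain.
```

As Ron notes, such matching is approximate (glosses are often phrases, and many words belong in several domains), so the suggestions still need human review.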

Once you have classified all the words, you can sort the database on the Semantic Domain field. This would show you where your dictionary has gaps in its coverage. It also brings together all the words in a domain, which enables you to contrast and compare close synonyms.
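One practical wrinkle when sorting on domain numbers outside FLEx: a plain string sort puts "1.10" before "1.2", so the number needs to be compared part by part. A minimal sketch, with invented entries:

```python
# Sort entries into semantic-domain order by splitting the hierarchical
# domain number into integer parts (so 1.2.2 < 1.2.10 < 1.10).
# Entries here are invented for illustration.

entries = [
    ("1.10", "word-c"),
    ("1.2.2", "word-a"),
    ("1.2.10", "word-b"),
]

def domain_key(entry):
    number, _headword = entry
    return tuple(int(part) for part in number.split("."))

for number, headword in sorted(entries, key=domain_key):
    print(number, headword)
```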

I hope this answers some of your questions.
Ron Moe


IanS

Sep 13, 2014, 7:29:26 PM
to flex...@googlegroups.com
Hi Ron

On reading your post carefully, it looks like a fairly daunting task to retrospectively classify an existing extensive lexicon in terms of the List of Semantic Domains. Perhaps I can 'nibble off' the task by focusing on one general domain at a time over the next couple of years.

Another possibility is simply starting afresh with the RapidWords approach, as though the project were new. That option would be particularly attractive if, as the RapidWords homepage suggests, I am likely to add approximately 10,000 new words to my existing 5,000 or so.

I am still working in Shoebox, with a view to converting to FLEx after a big cleanup of the database. One thing that occurs to me is that the semantic domain field in FLEx might be more stringent than the \sd field is in Shoebox.

I had been using the \sd field in Shoebox as a convenient way to filter the database by various tags, many for housekeeping purposes. For example: 'unconfirmed', 'needexample', 'keyconcept', etc.

What is the best way in FLEx to deal with these 'intruders' to a more purely-structured semantic domain field? Can they just remain there? Or should they be moved to another field (and what would that be)?

One more question: can FLEx filter a database by NOT logic, i.e., list every entry but NOT those tagged with (TagA, TagB, TagC, etc.)?

Thanks
Ian

Kevin Warfel

Sep 14, 2014, 7:09:32 PM
to flex...@googlegroups.com

Hi Ian,

 

As caretaker of the RapidWords.net website, I’d like to clarify a detail that has often been misunderstood. The figure of 15,000 that is given on the home page is what one can expect to collect in terms of “raw” words. Based on the experiences we’ve had to date, some words will be collected in multiple domains, especially when a word has multiple senses (and will therefore be counted multiple times in the total tally), some phrases will be phrases that don’t belong in a dictionary (often because the individuals doing the word collection did not clearly understand what is desirable and what is not), some words may be the product of an imaginative individual determined to find words for a semantic domain that is largely irrelevant to the language community, and so on. So 15,000 “raw” words, when “refined,” yield how many entries (actually senses)? I would love to know the answer to that question for a number of languages that have used the Rapid Words approach, but the “refining” process takes much longer than the data-gathering, so I don’t have information from the languages where I’ve personally helped guide the word-collection process. I did hear from one person recently, though, that 13,000-14,000 raw items produced an end result of some 7000 words and 3000 phrases. However, some of those were still not items worthy of inclusion in a dictionary.

 

So while I think it is fair to believe that your database will grow as a result of using the Rapid Words approach, I think it’s unlikely that you’ll find as many as 10,000 legitimate words and phrases that you didn’t have before you started the process. One important thing to realize is that most people judge the extent of their lexical data by the number of lexical *entries* they have, but counting the number of items collected when using the semantic domains questionnaire corresponds more closely to taking a census of the *senses*. If you have 5000 entries at present, you might have 6000 or 7000 senses in your current database. This is the number you should really be comparing to the 15,000 on the RapidWords website, in addition to taking into account the fact that the 15,000 are “raw” items, and the ones in your existing database are probably quite “refined.”

 

I hope that this will help set more realistic expectations of the results that you are likely to experience after using the semantic domains questionnaire, so that you won’t be disappointed in the end.

 

Blessings,

Kevin


IanS

Sep 14, 2014, 8:13:45 PM
to flex...@googlegroups.com
Hi Kevin

Thanks for your clarification; I find that very helpful, and actually rather reassuring. While I have no real way of knowing how many gaps my lexicon has (it will certainly need a systematic 'semantic domain' approach across various subject areas that I have never properly gathered together), I did notice that gradually, over time, 'new discoveries' became few and far between, and the lexical work became more a process of refining what was already there: better definitions, more affix verb forms, more metaphorical uses, a process that is still going on today. But I believe that a semantic domain elicitation approach will still throw up more. The question, of course, is how much. I am keen to eventually convert from Shoebox to FLEx, where, I believe, subentries can be sorted independently of headwords, and I guess that means I'll see for the first time a count of 'entries' (including subentries, but still not 'senses') as opposed to 'headwords' (of which I have about 3000+). I imagine that the 7000 'refined' entries (including senses) one of your correspondents counted is probably about where I will end up too, with further work.

In much elicitation from English, I have found that the result is a phrase (which is no bad thing of course, and many of these can become subentries too, unless, as you say, they are so artificial that nobody actually says them), whereas many of the unique words emerge only from mapping a domain (and then sub-domains) within the vernacular, and that throws up words sometimes barely imaginable in English, or at least a domain that maps quite differently from English. For example: mapping, sorting, and remapping into finer domains all the ways that food from the garden is broken down for cooking, by skinning, stripping, grating, cutting, slicing, snapping, etc. The language is Oceanic, and I think to myself that someday someone might make an Oceanic 'list of semantic domains', though it would probably take years to do. I myself will have to be satisfied with just getting a bit more done on my one language, as the sun is going down, so to speak. In the meantime, though, the RapidWords list covers a lot of ground that might otherwise be missed.

Take all that as a comment, which I make only because I couldn't resist responding to the mightily interesting points you raise.

Ian

Beth Bryson

Sep 15, 2014, 9:38:59 AM
to flex...@googlegroups.com
This probably isn't just what Ian is looking for, but may be useful for others.

From within FLEx, it is also possible to export the semantic domains in the form of a worksheet to be used in a Rapid Word Collection workshop. This format includes the domain name, the description, and the questions used for eliciting words in the domain, including example words. Typically this would be exported in the language the workshop is being conducted in. English is an option, as are any of the languages that have the domains translated and shipped with FLEx (such as French or Indonesian).

To do this, use File/Export, and then scroll down to find "Semantic Domains". 

-Beth

On Sep 11, 2014, at 7:56 PM, IanS wrote:

Is there anywhere I can download a current version of Ronald Moe's "List of Semantic Domains", as used for the Dictionary Development Process?

I'm hoping there is a pdf version, that would be suitable for browsing through and thinking about.



Jon C

Sep 15, 2014, 1:39:44 PM
to flex...@googlegroups.com

On 9/12/2014 5:05 PM, Ron Moe wrote:
> FLEx also has a tool that works on the same basis. You will find it in
> the dialog box for the Semantic Domains field. It is called "Suggest".
> But it only works one word at a time and would be fairly time consuming.

One thing I just recently noticed is that FLEx now has this in Bulk Edit
as well. Go to the List Choice tab and target Semantic Domains, then
click Suggest to get a preview. It can do the entire lexicon all at once
(nice!), though I imagine it would be a lot safer to manually approve
each row. Personally, I would filter first for, say, all words beginning
with "a", to break that manual task into chunks.

Jon

Jon C

Sep 15, 2014, 1:48:39 PM
to flex...@googlegroups.com
Kevin, thank you so much for clarifying that! I'm a big fan of RWC, but I cringe at the phrase "10,000 to 15,000 words" because it's better for people to have attainable expectations going in. And we're really trying to get *lexemes* (memorized words/phrases whose meanings are not 100% predictable from their component morphemes). More inline below...


On 9/14/2014 6:09 PM, Kevin Warfel wrote:

Hi Ian,

 

As caretaker of the RapidWords.net website, I’d like to clarify a detail that has often been misunderstood. The figure of 15,000 that is given on the home page is what one can expect to collect in terms of “raw” words. Based on the experiences we’ve had to date, some words will be collected in multiple domains, especially when a word has multiple senses (and will therefore be counted multiple times in the total tally),

And even the same sense may show up multiple times: e.g. "potato" under cooking and daily life domains, and under farming. Note that FLEx or WeSay can auto-merge those, so typing directly into those is better than using paper. (Although they won't know that "potato" and "potatoes" are the same lexeme.)
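The auto-merge Jon mentions boils down to keying raw collected items by their written form, so the same form collected in several domains becomes one entry carrying all of those domains. A toy sketch of the principle (not how FLEx or WeSay actually implement it; the data is invented):

```python
# Merge raw Rapid Word Collection items by written form: the same word
# collected under several domains becomes one entry with several domain
# tags. Note that inflected variants like "potatoes" are NOT merged,
# which is exactly the limitation Jon points out.

raw = [
    ("potato", "5.2 Food"),
    ("yam", "6.2 Agriculture"),
    ("potato", "6.2 Agriculture"),
    ("potatoes", "5.2 Food"),   # inflected form: stays a separate entry
]

merged = {}
for word, domain in raw:
    merged.setdefault(word, []).append(domain)

for word, domains in sorted(merged.items()):
    print(word, domains)
```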

some phrases will be phrases that don’t belong in a dictionary (often because the individuals doing the word collection did not clearly understand what is desirable and what is not), some words may be the product of an imaginative individual determined to find words for a semantic domain that is largely irrelevant to the language community, and so on.

And many languages will have inflected forms (potatoes) or forms with clitics on them (thepotato), etc.

So 15,000 “raw” words, when “refined,” yield how many entries (actually senses)? I would love to know the answer to that question for a number of languages that have used the Rapid Words approach, but the “refining” process takes much longer than the data-gathering, so I don’t have information from the languages where I’ve personally helped guide the word-collection process. I did hear from one person recently, though, that 13,000-14,000 raw items produced an end result of some 7000 words and 3000 phrases. However, some of those were still not items worthy of inclusion in a dictionary.

 

So while I think it is fair to believe that your database will grow as a result of using the Rapid Words approach, I think it’s unlikely that you’ll find as many as 10,000 legitimate words and phrases that you didn’t have before you started the process. One important thing to realize is that most people judge the extent of their lexical data by the number of lexical *entries* they have, but counting the number of items collected when using the semantic domains questionnaire corresponds more closely to taking a census of the *senses*. If you have 5000 entries at present, you might have 6000 or 7000 senses in your current database. This is the number you should really be comparing to the 15,000 on the RapidWords website, in addition to taking into account the fact that the 15,000 are “raw” items, and the ones in your existing database are probably quite “refined.”

This is a *very* important point. Before merging a bunch of raw RWC data into a polished existing lexicon, you'll probably want to mark the latter in some way, probably in a status field. It may very well still be worth doing RWC, but you'd probably want to invest more time in training the team to do quality control *as* they are collecting the new data.  -Jon

Kevin Warfel

Sep 15, 2014, 2:05:45 PM
to flex...@googlegroups.com

If I’m not mistaken, two of the things that Ron included in the materials he made available to those who sought to implement his Dictionary Development Process were (1) the need to address the question of “citation form,” and (2) the need to train the participants to recognize the difference between multi-word expressions that are of interest in the publication of a dictionary and those that are not (e.g., “hit the ball” vs. “hit the hay” or “The White House” vs. “the white house”). In any case, I highlight these two points in the approach that I use when leading a Rapid Word Collection workshop. To the degree to which I succeed in helping the participants understand these two concepts, the variation between “potato,” “potatoes,” “thepotato,” etc. is eliminated, as is the proliferation of phrases that have no place in a dictionary. Success on these fronts translates into a significant reduction in the need to edit the database after the word collection is completed. (Don’t misunderstand me: There will still be a significant amount of editing to be done, but the quantity will be greatly reduced if all words are entered in their citation form and if uninteresting phrases are not included.)

 

Kevin

Jon C

Sep 15, 2014, 2:07:37 PM
to flex...@googlegroups.com
Hi Ian,

On 9/13/2014 6:29 PM, IanS wrote:
> I had been using the \sd field in Shoebox as a convenient way to
> filter the database by various tags, many for housekeeping purposes.
> For example: 'unconfirmed', 'needexample', 'keyconcept', etc.
>
> What is the best way in FLEx to deal with these 'intruders' to a more
> purely-structured semantic domain field? Can they just remain there?
> Or should they be moved to another field (and what would that be)?
They can, and FLEx import will create new domains for those, but it
would be better to move them into a status field (\st is the MDF field
for that). And unless you intend to do extensive (emic) customizing of
the standard (etic) domains, it's better to just use the standard ones,
unmodified. (That makes it easier to auto-migrate to the name/numbering
system of any future domain set Ron releases.)
>
> One more question: can FLEx filter a database by NOT logic, i.e., list
> every entry but NOT those tagged with (TagA, TagB, TagC, etc).
Yes, but this is easiest with list fields. It sounds to me like you'll
want to create a custom list in FLEx (in the Lists area) for those tags,
then create a custom field that uses that list and allows multiple
values. (Actually, it's usually an import specialist who would set this
up for you. Cleaning up an SFM for interpretation by the computer--e.g.
for FLEx import, or for electronic publication with functioning
links--is almost never a simple task.)

BTW, with non-list fields, it can be done but it's much harder. You can
use a regular expression to do a NOT filter, but it's easiest to do this
with single-letter codes (e.g. a field indicating dialect(s) with
one-letter codes). This filter finds all entries that do not contain any
a's or d's:
^[^ad]*$

And this finds all entries that do not contain the text "ada".
Literally, every character must be a "not a" or an "a not followed by da".
^([^a]|a(?!da))*$
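Both patterns can be tried outside FLEx as well, for example in Python's re module (FLEx has its own regex engine, but these patterns use only widely shared syntax):

```python
import re

# Jon's two NOT-filter patterns, tested in Python's re module.
no_a_or_d = re.compile(r"^[^ad]*$")           # no 'a' and no 'd' anywhere
no_ada = re.compile(r"^([^a]|a(?!da))*$")     # does not contain "ada":
                                              # every char is a non-'a',
                                              # or an 'a' not followed by "da"

for field in ["bce", "abc", "adam", "brad"]:
    print(field, bool(no_a_or_d.match(field)), bool(no_ada.match(field)))
```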

Jon

Jon C

Sep 15, 2014, 2:28:10 PM
to flex...@googlegroups.com
Yes, we definitely tried hard to prevent all of that through up-front training of the group. But it was difficult for those who understood that well (typically younger people) to feel confident enough to correct those who knew the language best (typically older people). And there were a lot of people there, 30-40, and 'only' three expat linguists. We later wished that we had done more up-front training so that every table would have a confident quality-control expert--ideally the one doing the writing/typing too. (If we'd had FLEx back then, and reliable electricity, we could have saved work by providing each table with a computer, to avoid messy handwriting. And auto-merge could have reduced false homographs, assuming we'd been doing very frequent send/receive.)

There are also certain inherent difficulties. For verbs we picked "active voice, realis tense, no clitics" as our citation form, but certain verbs almost never occur in that combination. Nearly all verbs of perception are used in a "passive/middle" voice, and many verbs tend to occur with an aspectual clitic attached, so having to deal with those later was probably unavoidable. (Singular vs. plural is much simpler and wasn't an issue for us anyway. With English, I was imagining a pre-literate person answering the question, "what do you harvest?" and a somewhat-literate person knowing to write "potatoes" down as "potato". But to not throw out words like "scissors".)

Jon