Generating a frequency list of morphemes

Kevin Parent

Mar 21, 2013, 12:35:27 AM

to corplin...@googlegroups.com

I'm trying to think how to make a frequency list of morphemes based on a regular corpus in which words are written normally in full words ('unimpressively' and not 'un-impress-ive-ly').

I don't want to supply a list of morphemes and tell R to look for each instance of '^un' or 'ly$' but, rather, I want R to sort through a word list and see that several words begin with 'un' and then differ, so it then adds 'un-' to a list of morpheme candidates which I can clean up manually later.
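A minimal sketch of that idea, written in Python rather than R purely for illustration. The function name `prefix_candidates` and the thresholds `min_len`, `max_len`, and `min_words` are assumptions, not anything from the post. It collects every word-initial string that is shared by several distinct words and is followed by at least two different characters:

```python
from collections import defaultdict

def prefix_candidates(words, min_len=2, max_len=4, min_words=3):
    """Return (prefix, word_count) pairs for prefixes shared by at least
    `min_words` distinct words whose continuations differ."""
    followers = defaultdict(set)  # prefix -> characters seen right after it
    members = defaultdict(set)    # prefix -> distinct words starting with it
    for word in set(words):
        # Stop one character short of the word so a follower always exists.
        for n in range(min_len, min(max_len, len(word) - 1) + 1):
            prefix = word[:n]
            followers[prefix].add(word[n])
            members[prefix].add(word)
    hits = {
        p: len(ws)
        for p, ws in members.items()
        if len(ws) >= min_words and len(followers[p]) >= 2
    }
    # Most productive candidates first, ties broken alphabetically.
    return sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))

words = ["unhappy", "unkind", "unfair", "under", "happy", "kind"]
print(prefix_candidates(words))
```

Note that 'under' counts toward 'un' here, which is exactly the kind of false hit the manual clean-up would have to remove. Running the same function on reversed words would give suffix candidates such as 'ly'.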

Although my example here is in English, I'm actually doing this with a Korean corpus. My workplace, Korea Maritime University, would be written in Korean more like koreamaritimeuniversity, and I'm trying to make a frequency list that accounts for this (even if imperfectly; this isn't data that I intend to publish).

So here's how I'm thinking of approaching it. First, make a list of single words that do come up, like 'university'. Next, go through the corpus again, one pass for each word, and find words that contain that word, so 'koreamaritimeuniversity', 'seouluniversity', etc. would be added under 'university'. (Alternatively, if 'seoul' is on the list and so is 'university', the corpus is reformatted with a space between the two.)
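A rough sketch of the grouping step, again in Python for illustration. The name `group_by_contained_word` and the `min_len` cutoff are assumptions. Instead of one corpus pass per word, it enumerates the substrings of each token and looks them up in the set of attested tokens, which gives the same grouping in a single pass:

```python
from collections import defaultdict

def group_by_contained_word(tokens, min_len=2):
    """Map each attested standalone word to the longer tokens containing it."""
    vocab = set(tokens)
    groups = defaultdict(set)
    for token in vocab:
        for start in range(len(token)):
            for end in range(start + min_len, len(token) + 1):
                part = token[start:end]
                # Only count proper substrings that also occur on their own.
                if part != token and part in vocab:
                    groups[part].add(token)
    return {word: sorted(found) for word, found in groups.items()}

tokens = ["university", "seoul", "seouluniversity", "koreamaritimeuniversity"]
for word, containers in group_by_contained_word(tokens).items():
    print(word, containers)
```

The substring enumeration is quadratic in token length, which should be acceptable for word-sized strings but is worth keeping in mind for very long unsegmented tokens.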

That part's pretty straightforward, but there will almost definitely be a lot of important words that don't come up as single words. For convenience' sake, let's say I limit this to morphemes that are two syllables or longer. So I go through the lists I have and break each word into two-syllable chunks, so 'unimpressively' becomes 'unim,' 'impress,' 'pressive,' and 'sively' (more easily done in Korean which is written in syllables) and search the corpus for each string, and then do the same for three-syllable chunks.
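The chunking could be sketched as follows, assuming the Korean text is stored as precomposed Hangul syllables (one character per syllable block), so that a syllable n-gram is simply a character n-gram. The function names are hypothetical. Counting the chunks while generating them avoids a separate corpus search for each string:

```python
from collections import Counter

def syllable_ngrams(word, n):
    """All overlapping n-syllable chunks of a word (one character = one syllable)."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def chunk_frequencies(tokens, sizes=(2, 3)):
    """Count every 2- and 3-syllable chunk across all tokens in one pass."""
    counts = Counter()
    for token in tokens:
        for n in sizes:
            counts.update(syllable_ngrams(token, n))
    return counts

# 한국해양대학교 = Korea Maritime University, 서울대학교 = Seoul National University
tokens = ["한국해양대학교", "서울대학교"]
counts = chunk_frequencies(tokens)
print(syllable_ngrams("한국해양대학교", 2))
print(counts.most_common(5))
```

Chunks that recur across many different tokens, such as 대학교 ('university') here, would be the candidates to inspect manually. Most chunks will be noise that straddles morpheme boundaries.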

Awfully messy and time-consuming, but I accept that it will be no matter what. Can anyone think of a better way to go about this?

--

Kevin Parent, Ph.D
Korea Maritime University
Chair, Korea Toastmasters Territorial Council

Kevin Donnelly

Mar 21, 2013, 4:45:35 AM

to corplin...@googlegroups.com

Hi Kevin

::::On Thursday 21 March 2013 Kevin Parent said::::

> I'm trying to think how to make a frequency list of morphemes based on a
> regular corpus in which words are written normally in full words
> ('unimpressively' and not 'un-impress-ive-ly').

If I were doing this, I'd actually go about it from the other end, and I'm not
sure I'd use R initially.

- Get a Korean wordlist (you should really try to get one under an open
license, even if it's not as comprehensive, to avoid blowback later).
- Arrange it in order of wordlength (to deal with the fact that, I presume,
Korean is a bit like Chinese in that two basic "words" can be stuck together
to make a word referring to a new concept).
- Take each word in order from the list, and see if it is in the text. If so,
bracket it or mark it somehow (and preferably tag it as well). For instance
(Chinese): [qiche]_car_n.
- As you go down the list, component words could be added, eg:
[[qi]_steam_n[che]_vehicle_n]_car_n
or you could also ban looking inside words already bracketed; it depends on
the level you're working to. In Chinese, the script helps to distinguish
which "qi" it is - maybe in Korean that's not the case, and you'd get a lot of
hits for each word.
- To cut down on multiple hits for each word, you could remove uncommon or
rarely-used words from the list.
- In the case that you have words still untagged, remove all tagged words from
the text, and add the remaining items to your wordlist. Repeat.
- Once you have full coverage, split out each tagged item, and subsplit that
if applicable, then (perhaps using R now, but a database will do the same)
look at frequency, so you would maybe generate an output like:
qiche:car (n) - 6
qi:steam (n) - 8
    qiche - 6
che:vehicle (n) - 10
    qiche - 6
Your morphemes should fall out of that naturally.
- Disambiguating multiple hits for each word will be difficult (getting worse
the shorter the word gets, unless the writing system helps you there), but
it's probably better to put effort in there (manually disambiguating if
necessary) and in improving your wordlist rather than in checking through n-
grams.
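The steps above could be prototyped roughly as below. This is only a sketch under stated assumptions: it swaps the iterative tagging and manual disambiguation for a plain greedy longest-match against the wordlist, it ignores glosses and part-of-speech tags, it assumes a non-empty wordlist, and all function names are made up for illustration.

```python
from collections import Counter

def longest_match_segment(text, lexicon):
    """Greedy left-to-right longest match. Returns (piece, known) pairs;
    characters not covered by any lexicon word come back singly, known=False."""
    max_len = max(map(len, lexicon))
    pieces, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + n] in lexicon:
                pieces.append((text[i:i + n], True))
                i += n
                break
        else:  # no lexicon word starts here
            pieces.append((text[i], False))
            i += 1
    return pieces

def components(word, lexicon):
    """Shorter lexicon words that exactly cover `word`, or [] if none do."""
    shorter = {w for w in lexicon if len(w) < len(word)}
    if not shorter:
        return []
    parts = longest_match_segment(word, shorter)
    return [p for p, _ in parts] if all(known for _, known in parts) else []

def frequencies(text, lexicon):
    """Count tagged words plus their component words; collect untagged leftovers."""
    counts, untagged = Counter(), []
    for piece, known in longest_match_segment(text, lexicon):
        if not known:
            untagged.append(piece)
            continue
        counts[piece] += 1
        for part in components(piece, lexicon):
            counts[part] += 1
    return counts, untagged

lexicon = {"qiche", "qi", "che"}
counts, untagged = frequencies("qicheqichecheqi", lexicon)
print(counts, untagged)
```

The `untagged` list corresponds to the leftover items that would be added to the wordlist before repeating. Greedy matching will pick the wrong reading wherever a shorter word is the right one, so it does not remove the need for disambiguation.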

Just a suggestion!

--
Pob hwyl / Best wishes

Kevin Donnelly
kevindonnelly.org.uk
bangortalk.org.uk
