Hi Kevin
::::On Thursday 21 March 2013 Kevin Parent said::::
> I'm trying to think how to make a frequency list of morphemes based on a
> regular corpus in which words are written normally in full words
> ('unimpressively' and not 'un-impress-ive-ly').
If I were doing this, I'd actually go about it from the other end, and I'm not
sure I'd use R initially.
- Get a Korean wordlist (you should really try to get one under an open
license, even if it's not as comprehensive, to avoid blowback later).
- Arrange it in order of word length, longest first (to deal with the fact that, I presume,
Korean is a bit like Chinese in that two basic "words" can be stuck together
to make a word referring to a new concept).
- Take each word in order from the list, and see if it is in the text. If so,
bracket it or mark it somehow (and preferably tag it as well). For instance
(Chinese): [qiche]_car_n.
- As you go down the list, component words could be added, e.g.:
[[qi]_steam_n[che]_vehicle_n]_car_n
or you could also ban looking inside words already bracketed; it depends on the
level you're working to. In Chinese, the script helps to distinguish between
which "qi" it is - maybe in Korean that's not the case, and you'd get a lot of
hits for each word.
- To cut down on multiple hits for each word, you could remove uncommon or
rarely-used words from the list.
- In the case that you have words still untagged, remove all tagged words from
the text, and add the remaining items to your wordlist. Repeat.
- Once you have full coverage, split out each tagged item, and subsplit that
if applicable, then (perhaps using R now, but a database will do the same)
look at frequency, so you would maybe generate an output like:
qiche: car (n) - 6
qi: steam (n) - 8
    qiche - 6
che: vehicle (n) - 10
    qiche - 6
Your morphemes should fall out of that naturally.
- Disambiguating multiple hits for each word will be difficult (getting worse
the shorter the word gets, unless the writing system helps you there), but
it's probably better to put effort in there (manually disambiguating if
necessary) and in improving your wordlist rather than in checking through
n-grams.
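
For what it's worth, the steps above could be sketched roughly as follows. This
is in Python rather than R, and it is only a toy: the three-entry WORDLIST, the
function names, and the restriction to two-part splits are all my own
assumptions for illustration, not a tested tool. It tags the longest words
first, refuses to look inside spans already bracketed, then splits tagged words
into component entries and counts everything.

```python
from collections import Counter

# Hypothetical toy wordlist: form -> (gloss, part of speech).
WORDLIST = {
    "qiche": ("car", "n"),
    "qi": ("steam", "n"),
    "che": ("vehicle", "n"),
}


def tag_spans(text, wordlist):
    """Mark occurrences of wordlist entries, trying the longest entries first.

    Returns a sorted list of non-overlapping (start, end, word) spans.
    """
    taken = [False] * len(text)  # which characters are already bracketed
    spans = []
    for word in sorted(wordlist, key=len, reverse=True):
        start = text.find(word)
        while start != -1:
            end = start + len(word)
            # Do not look inside text that a longer word already claimed.
            if not any(taken[start:end]):
                taken[start:end] = [True] * len(word)
                spans.append((start, end, word))
            start = text.find(word, start + 1)
    return sorted(spans)


def components(word, wordlist):
    """Split a tagged word into two shorter entries, if it divides cleanly.

    Assumption: only two-part splits are tried, to keep the sketch short.
    """
    for i in range(1, len(word)):
        head, tail = word[:i], word[i:]
        if head in wordlist and tail in wordlist:
            return [head, tail]
    return []


def untagged(text, spans):
    """Return the stretches of text not covered by any span."""
    covered = set()
    for start, end, _ in spans:
        covered.update(range(start, end))
    leftover = "".join(
        " " if i in covered else ch for i, ch in enumerate(text)
    )
    return leftover.split()


def morpheme_frequencies(text, wordlist):
    """Count each tagged word, plus each component found inside it."""
    counts = Counter()
    for _, _, word in tag_spans(text, wordlist):
        counts[word] += 1
        for part in components(word, wordlist):
            counts[part] += 1
    return counts


if __name__ == "__main__":
    sample = "qiche che qi qiche xyz"
    print(morpheme_frequencies(sample, WORDLIST))
    print(untagged(sample, tag_spans(sample, WORDLIST)))
```

Anything returned by untagged() is what you would add to the wordlist before
repeating the pass. Note that this does nothing about disambiguating multiple
hits for the same form; that part would still need manual work or a better
wordlist.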
Just a suggestion!
--
Pob hwyl / Best wishes
Kevin Donnelly
kevindonnelly.org.uk
bangortalk.org.uk