I've kind of got round it so far by
1) saving the lemmatised wordlists as text or Excel files
2) doctoring the text file produced in Excel so that only the word and
frequency values appear for each item (so LEMMA groups are treated
just as ordinary words, with the totals for all the lemma members
being assigned to the headword)
3) Opening the text file as a wordlist using the 'Advanced' tag in
Wordlist
4) Then saving the file...
Is there an easier/better way than this?
It seems to me that there are several possibilities here, when
contrasting in the KeyWords procedure, assuming we have a wordlist
with data like this extract:
GO (20)
WENT (10)
GOING (3)
GOES (4)
...
1) consider the headword GO as having a frequency of 37 (the total for
all its members); ignore WENT, GOING etc. as separate items but use
their frequencies in the calculation.
2) consider all words including GO (20), WENT (10), etc., basically
ignoring the fact that lemmatisation has happened
3) consider GO as having a frequency of 20 and ignore WENT and GOING
etc because they have been joined to GO.
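
To make the three possibilities concrete, here is a minimal sketch using the GO extract above. The function names and the list-of-pairs layout are assumptions made for illustration; this is not how WordSmith stores or processes lemma data.

```python
# Lemma group from the extract above, headword first.
# The data layout is hypothetical, chosen only for this illustration.
lemma_group = [("GO", 20), ("WENT", 10), ("GOING", 3), ("GOES", 4)]


def option_1(group):
    """Headword carries the summed frequency; variants are not listed."""
    headword = group[0][0]
    return {headword: sum(freq for _, freq in group)}


def option_2(group):
    """Ignore lemmatisation: every form keeps its own frequency."""
    return dict(group)


def option_3(group):
    """Headword keeps only its own frequency; variants are dropped."""
    headword, freq = group[0]
    return {headword: freq}


print(option_1(lemma_group))
print(option_2(lemma_group))
print(option_3(lemma_group))
```

Whichever dictionary results is what would then be compared against the reference wordlist, so the choice changes both which items can turn up as key and how strongly.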
Which do you think we OUGHT to do? Anyone else have an opinion on
this?
Cheers -- Mike
Thanks for your response!
My feeling is that GO is the lemma: the name of the whole group, which
is named by convention after the headword 'go'. So 'go', 'goes',
'went', 'going', 'gone' are now treated as a group called 'GO'. The
words lose their individual identity and get treated like a 'word' by
the Keywords application.
..so I would go with option 1....
At the moment the software 'supports' option 3, is that right?
I can see why 1), my preference, would be problematic in terms of
programming. For one thing, in any Keyword list GO is going to appear
as an item like any other; there will be no indication that it is a
lemma rather than a word, which could be disastrous if the person
compiling the KW list doesn't know that the WL data has been
lemmatised! ...Unless the Keyword application was able to 'read' the
lemma information from the wordlist and present this in the keywords
result somehow. Even perhaps just flag the item as a lemma rather than
a word by using a different colour or whatever...
So I can see it's not straightforward...
..and I'd be interested like you, to see what others think...!
I agree with your assessment of my Excel 'short-cut'... can you
suggest an alternative?!
I have just done some of this. You'll now find in the KeyWords section
of the main Controller's settings "full lemma processing" which by
default is checked.
It's explained under "full lemma processing" in the Help index. There
is no special colour though. I'd be interested in your feedback. I
don't think you need your Excel method now.
WS4 key codes -- just write to me and I'll see what can be done.
Cheers -- Mike
Thanks for pointing this out. Now (hopefully) corrected -- please try
the latest upload.
Cheers -- Mike
D.
This should be the underlying data, the 1247 occurrences of give being
the accumulated lemma figure.
N  Word 1  Freq.  Word 2  Freq.   Texts  Gap  Joint  MI    Z      MI3    Log L.
1  GIVE    446    YOU     13.808  2      1    90     3,97  4,32   1,03   309,31
2  GIVE    446    A       22.891  2      2    144    3,92  5,05   3,07   470,57
3  GIVE    446    ME      1.681   2      1    36     5,69  10,98  -2,94  209,60
This unfortunately is the result of relationship computing.
And here is another strange thing: if I hit compute lemma matches again
in the alphabetical display window this is what happens:
N      Word    Freq.  %     Texts  %
4.435  GIVE    2.048  0,04  1      100,00
4.436  GIVEN   376    0,03  1      100,00
4.437  GIVES   108          1      100,00
4.438  GIVING  138    0,01  1      100,00
(Again GIVE is the cumulative data, the others are shown in light
grey; the figures for gave belong there too, I just didn't find it
convenient to copy them)
Sorry I spoke too soon!
This is quite complex. I think you should find that in KeyWords and in
ordinary WordList word-lists lemmatised word-types are handled OK as
described above. In the case of an index, unfortunately, this is a bit
more complex: what you quite often see in WS4 is really only a kind of
wordlist-window into an index, and until now I have not made a way of
storing lemma info in the index. (Saving it after computing lemma
matches has no practical effect.) I have now fixed a little bug which
meant you couldn't even see the lemma-variant forms in a little window
by double-clicking as you can with a word-list, but I'm now working on
how to save additional information such as lemma info, match-list info
etc with the original index. I think I will need a new file extension
only for that, so you'd have FRED.types FRED.tokens and FRED.extras. I
need to think carefully about what the extra info should/could be so I
don't have to re-define too much later on... Watch this space!
Cheers -- Mike
A colleague and I checked out the spiffy new version (with lemmatisation
options) a few days ago. It certainly worked: the Keywords program now
uses frequency data for the whole lemma when dealing with lemmatised
wordlists. Funnily enough, her project also required the use of
keyword analysis on lemmatised lists, so I think the changes were
useful!
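
A rough sketch of why the whole-lemma figure matters for keyness. It assumes a log-likelihood statistic of the kind commonly used for keyword comparison; the thread does not say which statistic KeyWords applied here. The corpus sizes and the reference frequency are invented for the illustration; only the 37 and 20 come from the GO example earlier in the thread.

```python
import math


def log_likelihood(freq_study, freq_ref, size_study, size_ref):
    """Two-corpus log-likelihood for a single item."""
    total = freq_study + freq_ref
    expected_study = size_study * total / (size_study + size_ref)
    expected_ref = size_ref * total / (size_study + size_ref)
    ll = 0.0
    # A zero frequency contributes nothing to the sum.
    if freq_study > 0:
        ll += freq_study * math.log(freq_study / expected_study)
    if freq_ref > 0:
        ll += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * ll


# Hypothetical sizes and reference frequency, not taken from the thread.
STUDY_SIZE = 10_000
REF_SIZE = 100_000
REF_FREQ_GO = 100  # held fixed in both calls purely for illustration

ll_lemma = log_likelihood(37, REF_FREQ_GO, STUDY_SIZE, REF_SIZE)     # whole lemma
ll_headword = log_likelihood(20, REF_FREQ_GO, STUDY_SIZE, REF_SIZE)  # headword only

print(round(ll_lemma, 2), round(ll_headword, 2))
```

With these made-up numbers the lemma total scores noticeably higher than the headword alone, which is the practical difference between counting the whole group and counting only 'go'.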
Is that the default setting now? If so, all well and good IMO.
We still thought it would be nice if lemmas came up in a nice colour.
We thought green. Or pink....
Anyway, thanks, WSTools guys, for taking our concerns on board!
Cheers --- Mike
Yes, it is.
Cheers -- Mike
Hi Mike,
Yes, we thought c) would be nice, I guess a real corpus-head person
would see immediately which items were lemmas, but we thought it would
be a nice 'touch'!
Thanks,
Duncan
OK, alternative c) mentioned above is implemented now, green. But only
in WS5.
(I will be making beta versions of WS5 available soon. Basically I do
not intend to do ANY more development of WS4. WS5 will be a free
download for a while.)
Cheers -- Mike
WS5, cool! I'm really looking forward to seeing the beta version.
BTW, I think I've noticed another improvement. In the past there was
sometimes a problem using <> to indicate that the application should
skip text, if the end chevron '>' appeared at the end of the file. I
used to put in a buffer word, like 'bobby', to ensure the chevron
would be read and the preceding text excluded from the analysis.
Now it seems the app works just fine without this measure.
Can I just check whether or not that is the case?!!
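
For anyone unsure what behaviour is being described, here is a minimal sketch of skipping text between chevrons, including the case where '>' is the very last character of the file. It is only an illustration in Python with made-up function names; it says nothing about how WordSmith itself reads tags.

```python
import re


def strip_tagged(text):
    """Remove every <...> span, wherever the closing chevron falls."""
    return re.sub(r"<[^<>]*>", " ", text)


def tokens(text):
    """Split what is left into simple word tokens."""
    return re.findall(r"[A-Za-z']+", strip_tagged(text))


# Closing chevron at the very end of the text, no buffer word needed.
print(tokens("keep this <skip this>"))
# The old workaround: a buffer word after the chevron gets counted too.
print(tokens("keep this <skip this> bobby"))
```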
Keep up the good work WS guys!
Dunc
Cheers -- Mike
The files were small - 1-4 thousand words each, with sentence, paragraph
and some other structural mark-up.
I will perhaps try again as soon as I restart the system, can't do it now.
Thanks.
Przemek
Yes, programming/modifying/maintaining a whole software suite must be
a big job!
I've just been playing around with the new WS5 beta version and ran a
few of my leedle tests to see how it would do in scratching my
particular corpus itches.
I'm assuming this kind of thing is interesting and useful as feedback.
If not pls ignore!
Any hoo here goes!:
Objective: See how lemmas and words are dealt with in the new version,
with particular interest in KW and KKW lists.
The experiment
1) I took 5 texts as a test corpus. I took 13 texts as a comparison/
reference corpus.
2) I compiled a single WL file from the test corpus files, and checked
frequencies of words/lemmas. I noted some of these; one was:
lemma/word  frequency  number of files  lemmas
COMEDY      7          1                comedy [5] comedies [2]
3) I made a wl for the ref corpus
4) I then made a KWs list by comparing the test corpus wl and ref
corpus wls (sorry, pretty obvious so far!).
5) In the KW list I saw
N    Word      Freq.  %     Texts  %      Lemmas
439  COMEDIES  2      0.02  1      25.00
440  COMEDY    7      0.04  1      25.00  comedy[5] comedies[2]
So that's good: COMEDY is listed as a lemma, consisting of 'comedy'
and 'comedies', and comedies appears as a separate item.
COMMENT: Now we've got a hybrid lemma and word list, which is great
from my POV but potentially confusing for the new user 'sans' colours
(good luck with that BTW). I think to clear it up for the beginner
user, the headword 'comedy' (frequency of 5) could also appear as a
separate 'word' item, just like comedies. Just my opinion!
6) I then ran the KKW test. I got:
N  KW      Texts  %      Overall Freq.  No. Ass.
7  COMEDY  1      25.00  7              0
So in the KKW list the lemma appears with its aggregate frequency
(total frequency of constituent words/ forms whatever you want to call
them).
COMMENT: It seems a pity that we've 'lost' the information that
existed in the wordlists about the constituent words, but that might be
VERY hard to program I guess!!! I do think though that it is even more
important to mark the item here as a lemma rather than a word, as
there is no other data (like the individual word frequencies you have
in the wordlists) indicating its status...???
But my overall response to this is GREAT. I think WS5 will make even
more impact on a lot of people's attempts to solve corpus 'problems'!
4) I made a batch of word lists, one for each of the five test corpus
files, comparing against the ref corpus.
While I'm here, I wonder if I could just check something quickly...
In the test described above, there was another item, Language, which
appeared IN MORE THAN ONE FILE (unlike comedy, the example above which
appeared in only 1 file)..
In the WL for the test corpus I have:
word            lemmas*
language [120]  language[112] languages [8]
*BTW could we call this column 'words in lemma'? I think that would be
clearer for easily confused types like me...
Now my question is that in the Key-KWs file the listing is
N  KW        Texts  %      Overall Freq.  No. Ass.
5  LANGUAGE  2      50.00  99
I was briefly confused by the frequency value of 99. I guess though,
that it's 99 (less than the 120 listed in the WL) cos it is only
counted when it IS KEY IN A TEXT, is that right? and it was obviously
key in one of the two texts in which it appeared?
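
If that reading is right, the arithmetic would look something like the sketch below. The per-text breakdown is invented (the lists only give the totals 120 and 99 and the Texts count of 2), and the function name is hypothetical; it just shows how a KKW total can come out lower than the wordlist total.

```python
# Hypothetical per-text figures for LANGUAGE: only the totals 120 and 99
# and the Texts count of 2 come from the lists quoted above.
per_text = [
    {"text": "t1", "freq": 60, "is_key": True},
    {"text": "t2", "freq": 39, "is_key": True},
    {"text": "t3", "freq": 21, "is_key": False},
]


def kkw_summary(rows):
    """Count an item's frequency only in texts where it came out as key."""
    key_rows = [r for r in rows if r["is_key"]]
    return {
        "wordlist_total": sum(r["freq"] for r in rows),
        "texts_where_key": len(key_rows),
        "kkw_overall_freq": sum(r["freq"] for r in key_rows),
    }


print(kkw_summary(per_text))
```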
Also, LANGUAGE in the Key KW list is a lemma, right (language
+languages)?