How WordSmith deals with lemmatised data


diff

May 8, 2007, 10:20:23 AM
to WordSmith Tools
I have been experiencing problems with wordlists that have been
lemmatised. I have found that when the wordlists are used by other
programs, such as KeyWords, the frequency values for the headword are
read rather than the values for the whole lemma. This means that when
KeyWords looks at the lemma GO (consisting of go, went, gone, going,
goes) in a wordlist, it only uses the values for the headword 'go',
rather than the aggregate values for the whole group.


I've kind of got round it so far by
1) saving the lemmatised wordlists as text or Excel files
2) doctoring the text file produced in Excel so that only the word and
frequency values appear for each item (so LEMMA groups are treated
just as ordinary words, with the totals for all the lemma members
being assigned to the headword)
3) Opening the text file as a wordlist using the 'Advanced' tab in
Wordlist
4) Then saving the file...
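For what it's worth, the "doctoring" step (2) can be sketched in a few lines of Python, assuming a wordlist export laid out as (word, frequency, lemma members) with bracketed member frequencies like "go[20] went[10]"; the real WordSmith export format may well differ:

```python
# Sketch of step 2: collapse each lemma group into a single headword row
# carrying the aggregate frequency. The (word, freq, members) layout is an
# assumption about the export, not WordSmith's documented format.
import re

def collapse_lemmas(rows):
    """rows: list of (word, freq, members_string) tuples."""
    out = []
    for word, freq, members in rows:
        if members:
            # Sum the bracketed member frequencies, e.g. "go[20] went[10]".
            freq = sum(int(m) for m in re.findall(r"\[(\d+)\]", members))
        out.append((word, freq))
    return out

rows = [
    ("GO", 20, "go[20] went[10] going[3] goes[4]"),
    ("THE", 500, ""),  # an ordinary, un-lemmatised word
]
print(collapse_lemmas(rows))  # [('GO', 37), ('THE', 500)]
```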

Is there an easier/better way than this?

mi...@lexically.net

May 9, 2007, 12:54:33 PM
to WordSmith Tools
"Diff", hi

It seems to me that there are several possibilities here when
contrasting wordlists in the KeyWords procedure, assuming we have a
wordlist with data like this extract:
GO (20)
WENT (10)
GOING (3)
GOES (4)
...

1) consider the headword GO as having a total frequency of 37: don't
list WENT, GOING etc. separately, but use their frequencies in the calculation.
2) consider all words including GO (20), WENT (10), etc., basically
ignoring the fact that lemmatisation has happened
3) consider GO as having a frequency of 20 and ignore WENT and GOING
etc because they have been joined to GO.
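For concreteness, here is how the three options work out on the extract above; a quick illustrative sketch, not WordSmith code:

```python
# The example extract: headword GO (own frequency 20) with three joined forms.
headword_freq = 20
member_freqs = {"WENT": 10, "GOING": 3, "GOES": 4}

# Option 1: a single item GO carrying the aggregate lemma frequency.
option1 = {"GO": headword_freq + sum(member_freqs.values())}  # {'GO': 37}

# Option 2: every form counted separately, as if no lemmatisation happened.
option2 = {"GO": headword_freq, **member_freqs}

# Option 3: GO at its own frequency only; the joined members are dropped.
option3 = {"GO": headword_freq}  # {'GO': 20}
```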

Which do you think we OUGHT to do? Anyone else have an opinion on
this?

Cheers -- Mike

diff

May 11, 2007, 1:47:25 PM
to WordSmith Tools

Hi Mike,

Thanks for your response!

My feeling is that GO is the lemma: the name of the whole group,
which by convention is named after the headword 'go'. So 'go', 'goes',
'went', 'going', 'gone' are now treated as a group called 'GO'. The
words lose their individual identity and get treated like a 'word' by
the Keywords application.

..so I would go with option 1....

At the moment the software 'supports' option 3, is that right?


I can see why 1), my preference, would be problematic in terms of
programming. For one thing, in any Keyword list GO is going to appear
as an item like any other; there will be no indication that it is a
lemma rather than a word, which could be disastrous if the person
compiling the KW list doesn't know that the WL data has been
lemmatised! ...Unless the Keyword application was able to 'read' the
lemma information from the wordlist and present this in the keywords
result somehow. Even perhaps just flag the item as a lemma rather than
a word by using a different colour or whatever...

So I can see it's not straightforward...

...and I'd be interested, like you, to see what others think!


mi...@lexically.net

May 12, 2007, 4:54:12 AM
to WordSmith Tools
I think I'll let users choose by putting an option in the Controller.
And yes it might be useful to colour lemmatised GO with 37 members
differently from ordinary GO with 20 if one chooses option 1.
At the moment WS4 does option 3, I think.
BTW your route via Excel may work but is intrinsically somewhat dodgy
as the stats are greatly simplified if one reads in a wordlist from a
text file.

diff

May 12, 2007, 1:02:51 PM
to WordSmith Tools

That sounds great: a new feature for WS5?! BTW I've been demo-ing WS4
here to students & staff at the uni, who've been plodding along with
the old WS version (mainly cos somebody has lost all our WS4 key
codes!). General feeling is amazement at all the wacky functionality
(new tabs, etc.) of WS4. So the changes that appear between versions
represent real progress in meeting users' needs/expectations...

I agree with your assessment of my Excel 'short-cut'... can you
suggest an alternative?!

mi...@lexically.net

May 13, 2007, 2:19:27 PM
to WordSmith Tools
Thanks for the comments!

I have just done some of this. You'll now find "full lemma
processing" in the KeyWords section of the main Controller's settings;
by default it is checked.
It's explained under "full lemma processing" in the Help index. There
is no special colour though. I'd be interested in your feedback. I
don't think you need your Excel method now.

WS4 key codes -- just write to me and I'll see what can be done.

Cheers -- Mike

Marco

May 14, 2007, 3:45:26 AM
to WordSmith Tools
Hi Mike, Hi Board,
I have just tested the new update and generally it goes in the
direction of solving a problem I was working on; however, it does not
seem to work in the last step of the procedure I am trying to apply.
I am trying to find collocates for a lemmatized verb. If I create a
wordlist I can get what is subsumed under version 3 above (GO=37). If
I make an index from the same file, after choosing "compute lemma
matches" this also works (GO=37). When, however, computing
collocations using the "compute MI" function, the option seems to
cease working and I only get collocates for the respective word forms
of GO. This is undesirable insofar as I will not get collocates which
might collocate often enough with the whole lemma but not with any of
the verb forms alone.
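Just to make the arithmetic concrete, here is a little Python sketch using the common MI definition log2(observed/expected); the frequencies are invented, and WordSmith's exact normalisation (window span and so on) may differ, so treat it purely as an illustration of why pooling matters:

```python
# Why lemma-level collocation differs from form-level: MI should use the
# aggregate lemma frequency as the node frequency, with joint counts
# pooled across all forms. Frequencies here are invented for illustration.
import math

def mi(joint, node_freq, coll_freq, corpus_size):
    expected = node_freq * coll_freq / corpus_size
    return math.log2(joint / expected)

corpus_size = 1_000_000
coll_freq = 2_000  # frequency of the candidate collocate

# Form-level: each form of GIVE has its own frequency and joint count...
forms = {"give": (20_000, 30), "gave": (10_000, 25), "given": (5_000, 20)}

# ...lemma-level: pool frequencies and joint counts first, then score once.
node_total = sum(f for f, _ in forms.values())   # 35,000
joint_total = sum(j for _, j in forms.values())  # 75
print(mi(joint_total, node_total, coll_freq, corpus_size))
```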
Is there anything I can do about this?

mi...@lexically.net

May 14, 2007, 7:42:58 AM
to WordSmith Tools
Marco, hi

Thanks for pointing this out. Now (hopefully) corrected -- please try
the latest upload.

Cheers -- Mike

diff

May 14, 2007, 9:05:32 AM
to WordSmith Tools

This is great stuff! I'll have a look at the new upload!

D.

Marco

May 15, 2007, 5:29:34 AM
to WordSmith Tools
OK, tried it out, but it still does not work.
Here are the steps I took, and an example to rule out errors on my
side:
1. Load the lemma list.
2. Make an index.
3. Compute lemma matches (so far everything is fine).
4. Compute relationships.
And in step 4 I am back to word forms; again, clicking on compute
lemma matches does not seem to do anything.
Example: GIVE lemmatised in ICE GB
N      Word    Freq.  %     Texts  %
4,435  GIVE    1,247  0.04  1      100.00
4,436  GIVEN   376    0.03  1      100.00
4,437  GIVES   108          1      100.00
4,438  GIVING  138    0.01  1      100.00

This should be the underlying data, the 1247 occurrences of give being
the accumulated lemma figure.

N  Word 1  Freq.  Word 2  Freq.   Texts  Gap  Joint  MI    Z      MI3    Log L.
1  GIVE    446    YOU     13,808  2      1    90     3.97  4.32   1.03   309.31
2  GIVE    446    A       22,891  2      2    144    3.92  5.05   3.07   470.57
3  GIVE    446    ME      1,681   2      1    36     5.69  10.98  -2.94  209.60

This unfortunately is the result of relationship computing.
And here is another strange thing: if I hit compute lemma matches
again in the alphabetical display window, this is what happens:

N      Word    Freq.  %     Texts  %
4,435  GIVE    2,048  0.04  1      100.00
4,436  GIVEN   376    0.03  1      100.00
4,437  GIVES   108          1      100.00
4,438  GIVING  138    0.01  1      100.00
(Again GIVE is the cumulative figure and the others are shown in
light grey; figures for 'gave' belong there too, I just didn't find
it convenient to copy them.)

mi...@lexically.net

May 15, 2007, 11:31:37 AM
to WordSmith Tools
Marco, hi

Sorry I spoke too soon!
This is quite complex. I think you should find that lemmatised
word-types are handled OK, as described above, in KeyWords and in
ordinary WordList word-lists. In the case of an index, unfortunately,
things are a bit
more complex: what you quite often see in WS4 is really only a kind of
wordlist-window into an index, and until now I have not made a way of
storing lemma info in the index. (Saving it after computing lemma
matches has no practical effect.) I have now fixed a little bug which
meant you couldn't even see the lemma-variant forms in a little window
by double-clicking as you can with a word-list, but I'm now working on
how to save additional information such as lemma info, match-list info
etc with the original index. I think I will need a new file extension
only for that, so you'd have FRED.types FRED.tokens and FRED.extras. I
need to think carefully about what the extra info should/could be so I
don't have to re-define too much later on... Watch this space!
Cheers -- Mike

mi...@lexically.net

May 19, 2007, 11:58:07 AM
to WordSmith Tools
OK I have done something to make an index know about lemmatised forms,
on the lines described above. Any feedback is welcome!
Cheers -- Mike

diff

May 30, 2007, 12:08:26 PM
to WordSmith Tools
Hi all,

A colleague and I checked out the spiffy new version (with
lemmatisation options) a few days ago. It certainly worked: the
KeyWords program now uses frequency data for the whole lemma when
dealing with lemmatised wordlists. Funnily enough, her project also
required the use of keyword analysis on lemmatised lists, so I think
the changes were useful!

Is that the default setting now? If so, all well and good IMO.

We still thought it would be nice if lemmas came up in a nice colour.
We thought green. Or pink....

Anyway, thanks, WSTools guys, for taking our concerns on board!

mi...@lexically.net

May 30, 2007, 12:39:45 PM
to WordSmith Tools
Thanks for "spiffy"!
Do you mean you want a special colour for a) the Lemma column? Or b)
for the lemmas when you double-click and see them in detail? Or c) for
the Word column IFF the word was head of a lemma? I assume c) and will
see what can be done...

Cheers --- Mike

mi...@lexically.net

May 30, 2007, 12:42:40 PM
to WordSmith Tools

> Is that the default setting now? If so, all well and good IMO.

Yes, it is.
Cheers -- Mike

diff

May 31, 2007, 9:48:06 AM
to WordSmith Tools

Hi Mike,

Yes, we thought c) would be nice. I guess a real corpus-head would
see immediately which items were lemmas anyway, but we thought it
would be a nice 'touch'!


Thanks,
Duncan

mi...@lexically.net

Jun 2, 2007, 4:00:25 AM
to WordSmith Tools
Hi Duncan

OK, alternative c) mentioned above is implemented now, in green. But
only in WS5.

(I will be making beta versions of WS5 available soon. Basically I do
not intend to do ANY more development of WS4. WS5 will be a free
download for a while.)

Cheers -- Mike

diff

Jun 2, 2007, 2:20:44 PM
to WordSmith Tools

WS5: cool! I'm really looking forward to seeing the beta version.

BTW, I think I've noticed another improvement. In the past there was
sometimes a problem using <> to indicate that the application should
skip text, if the end chevron '>' appeared at the end of the file. I
used to put in a buffer word, like 'bobby', to ensure the chevron
would be read and the preceding text excluded from the analysis.

Now it seems the app works just fine without this measure.
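The skipping behaviour under discussion might look like this in miniature; this is a guess at the behaviour from the outside, not WordSmith's actual parser:

```python
# Drop everything between angle brackets, including a region whose closing
# '>' is the very last character of the file (the old trouble case that
# needed a buffer word after it).
import re

def strip_markup(text):
    return re.sub(r"<[^>]*>", "", text)

print(strip_markup("keep <skip this> keep"))    # 'keep  keep'
print(strip_markup("keep <ends at file end>"))  # 'keep '
```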

Can I just check whether or not that is the case?!!

Keep up the good work WS guys!

Dunc

mi...@lexically.net

Jun 2, 2007, 2:48:30 PM
to WordSmith Tools
Hi Duncan
I believe it is OK in that regard, yes.
"guys"?? There's only one of me actually making and altering
WordSmith! Though those who take part in this forum or otherwise send
in complaints & suggestions do help enormously with ideas. Time is
the problem...

Cheers -- Mike

Przemyslaw Kaszubski

Jun 3, 2007, 9:53:47 AM
to WordSmi...@googlegroups.com
2.3GHz, 1GB RAM + about 1GB free space on the C drive -- at the
moment I cannot afford more :(

The files were small: 1-4 thousand words each, with sentence,
paragraph and some other structural mark-up.

I will perhaps try again as soon as I restart the system; I can't do
it now.

Thanks.

Przemek

diff

Jun 3, 2007, 12:52:58 PM
to WordSmith Tools

OK, wow! I thought there was an extra programmer or two somewhere in
the background!

Yes, programming/modifying/maintaining a whole software suite must be
a big job!

diff

Jun 7, 2007, 2:59:09 PM
to WordSmith Tools
Hi Mike,

I've just been playing around with the new WS5 beta version and ran a
few of my leedle tests to see how it would do in scratching my
particular corpus itches.

I'm assuming this kind of thing is interesting and useful as
feedback. If not, please ignore!
Anyhoo, here goes:

Objective: see how lemmas and words are dealt with in the new
version, with particular interest in KW and KKW lists.

The experiment
1) I took 5 texts as a test corpus. I took 13 texts as a comparison/
reference corpus.
2) I compiled a single WL file from the test corpus files, and checked
frequencies of words/lemmas. I noted some of these; one was:

lemma/word  frequency  number of files  lemmas
COMEDY      7          1                comedy[5] comedies[2]

3) I made a wl for the ref corpus
4) I then made a KWs list by comparing the test corpus wl and ref
corpus wls (sorry-pretty obvious so far!).
5) In the KW list I saw
N    Word      Freq.  %     Texts  %      Lemmas
439  COMEDIES  2      0.02  1      25.00
440  COMEDY    7      0.04  1      25.00  comedy[5] comedies[2]

So that's good: COMEDY is listed as a lemma, consisting of 'comedy'
and 'comedies', and 'comedies' appears as a separate item.
COMMENT: Now we've got a hybrid lemma and word list, which is great
from my POV but potentially confusing for the new user 'sans' colours
(good luck with that BTW). I think to clear it up for the beginner
user, the headword 'comedy' (frequency of 5) could also appear as a
separate 'word' item, just like 'comedies'. Just my opinion!

6) I then ran the KKW test. I got:
N  KW      Texts  %      Overall Freq.  No. Ass.
7  COMEDY  1      25.00  7              0

So in the KKW list the lemma appears with its aggregate frequency
(the total frequency of its constituent words/forms, whatever you
want to call them).
COMMENT: It seems a pity that we've 'lost' the information that
existed in the wordlists about the constituent words, but that might
be VERY hard to program I guess!!! I do think, though, that it is
even more important to mark the item here as a lemma rather than a
word, as there is no other data (like the individual word frequencies
you have in the wordlists) indicating its status...???
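As a footnote to the experiment: keyness statistics such as Dunning's log-likelihood, one of the measures KeyWords can use, compare an item's frequency in the study corpus against the reference corpus. A hedged sketch with made-up corpus sizes, showing the lemma's aggregate frequency (COMEDY = 7) going into the calculation rather than the headword's alone:

```python
# Dunning's log-likelihood keyness for one item. a, b are the item's
# frequencies in the study and reference corpora; c, d are the corpus
# sizes in tokens. The corpus sizes below are invented for illustration.
import math

def log_likelihood(a, b, c, d):
    e1 = c * (a + b) / (c + d)  # expected frequency in the study corpus
    e2 = d * (a + b) / (c + d)  # expected frequency in the reference corpus
    ll = 0.0
    if a:
        ll += a * math.log(a / e1)
    if b:
        ll += b * math.log(b / e2)
    return 2 * ll

# Lemma COMEDY: aggregate 7 in a 10,000-token test corpus versus 2 in a
# 50,000-token reference corpus.
print(round(log_likelihood(7, 2, 10_000, 50_000), 2))
```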


But my overall response to this is GREAT: I think WS5 will make even
more impact on a lot of people's attempts to solve corpus 'problems'!


4) I made a batch of word lists, one for each of the five test corpus
files, comparing against the ref corpus.


diff

Jun 7, 2007, 3:12:53 PM
to WordSmith Tools
Oops, ignore number 4 at the end of the last post; I should have
deleted it...

While I'm here, I wonder if I could just check something quickly...

In the test described above, there was another item, LANGUAGE, which
appeared IN MORE THAN ONE FILE (unlike 'comedy', the example above,
which appeared in only 1 file).

In the WL for the test corpus I have:

word            lemmas*
language [120]  language[112] languages [8]

*BTW could we call this column 'words in lemma'? I think that would
be clearer for easily confused types like me...

Now my question is about the Key-KWs file, where the listing is:

N  KW        Texts  %      Overall Freq.  No. Ass.
5  LANGUAGE  2      50.00  99


I was briefly confused by the frequency value of 99. I guess, though,
that it's 99 (less than the 120 listed in the WL) cos it is only
counted when it IS KEY IN A TEXT, is that right? And it was obviously
key in one of the two texts in which it appeared?

Also, LANGUAGE in the Key KW list is a lemma, right (language
+languages)?
