comparing two sets of writing

eldivenci

Nov 24, 2009, 9:18:13 AM

to WordSmith Tools

I have two sets of approx 250-word reports written by 62 students
describing their spoken language skills. I wanted to compare the two
sets. The first set was written at the start of the term and the
second at the end of the term.
I made the first reports into one text (16368 words) and the second
reports into another text (18662 words). I compared the two texts
using the Key Words program.
Wordsmith identified 9 words that occurred more in the second set and
2 words that occurred less.
These key words are extremely interesting to me as they are words that
students could be expected to start using during the term to describe
their spoken language skills.
Are there any problems with what I have done? Is this an invalid use
of Wordsmith?
The first text does not seem to qualify as a reference corpus as it is
smaller than the study corpus, and I have read that the reference
corpus should be at least twice the size of the study corpus. Is this
a problem?

Mike

Nov 25, 2009, 5:28:06 AM

to WordSmith Tools

Eldivenci, hi

It's not invalid in itself. But you could compare the 2 using the
"Compare Two Wordlists" function in WordList. That takes each word in
EACH corpus and compares the frequencies using the same keywords
mechanism. It gets around the problem of the reference corpus being
smaller, and might give you a few more words to think about. Remember
you can select them and then choose Compute | Concordance to examine
your results easily. The size/nature of a suitable reference corpus
still seems to me to be somewhat problematic, actually, even though
the procedure is generally very robust.
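
For readers who want to see the idea of comparing every word in each wordlist, here is a minimal sketch. It assumes a simple whitespace tokeniser and one common two-term formulation of log-likelihood; the function names are hypothetical and this is not necessarily the calculation WordSmith itself performs.

```python
import math
from collections import Counter

def log_likelihood(a, b, size_a, size_b):
    """Log-likelihood score for a word seen a times in text A and b times in text B."""
    total = a + b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    score = 0.0
    if a > 0:
        score += a * math.log(a / expected_a)
    if b > 0:
        score += b * math.log(b / expected_b)
    return 2 * score

def compare_wordlists(text_a, text_b):
    """Score every word found in either text, highest score first."""
    freq_a = Counter(text_a.lower().split())  # assumption: whitespace tokens
    freq_b = Counter(text_b.lower().split())
    size_a = sum(freq_a.values())
    size_b = sum(freq_b.values())
    rows = []
    for word in set(freq_a) | set(freq_b):
        a, b = freq_a[word], freq_b[word]
        rate_a, rate_b = a / size_a, b / size_b
        if rate_b > rate_a:
            direction = "more in B"
        elif rate_a > rate_b:
            direction = "more in A"
        else:
            direction = "same"
        rows.append((word, a, b, log_likelihood(a, b, size_a, size_b), direction))
    rows.sort(key=lambda row: row[3], reverse=True)
    return rows

if __name__ == "__main__":
    first = "i speak slowly and i make mistakes " * 5
    second = "my fluency and accuracy improved and i speak more " * 5
    for row in compare_wordlists(first, second)[:5]:
        print(row)
```

Because each word is scored against the combined size of both texts, neither text has to play the role of a larger reference corpus.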

Cheers -- Mike

Tim Rowe

Nov 25, 2009, 10:16:35 AM

to wordsmi...@googlegroups.com

2009/11/24 eldivenci <philip...@akdeniz.edu.tr>:

>
> I have two sets of approx 250-word reports written by 62 students
> describing their spoken language skills. I wanted to compare the two
> sets. The first set was written at the start of the term and the
> second at the end of the term.
> I made the first reports into one text (16368 words) and the second
> reports into another text (18662 words). I compared the two texts
> using the Key Words program.
> Wordsmith identified 9 words that occurred more in the second set and
> 2 words that occurred less.

I think it's valid as far as it goes -- it seems to show increased use
of technical vocabulary, which is relevant. But an automated study
will only take you so far. For instance, it doesn't show that the
technical vocabulary is being used correctly. Another sign of improved
writing skills that I might expect is better cohesion, and although
WordSmith Tools might show increased use of cohesive phrases it would
be hard to detect improved overall structure (although tracking
changes in the type/token ratio through a text might give you some
interesting information regarding where new ideas are introduced). One
study I have read found that novice and experienced scientific writers
hedged about the same amount, but the experienced writers hedged
things that needed to be hedged whereas the novices hedged seemingly
at random -- I challenge you to automate that check!
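
As a rough illustration of tracking the type/token ratio through a text, the sketch below computes the ratio over successive overlapping windows. The window and step sizes are arbitrary assumptions, and the function name is hypothetical.

```python
def windowed_ttr(tokens, window=50, step=25):
    """Type/token ratio for successive windows of `window` tokens."""
    ratios = []
    last_start = max(len(tokens) - window, 0)
    for start in range(0, last_start + 1, step):
        chunk = tokens[start:start + window]
        if chunk:
            # Distinct word forms divided by running words in this window.
            ratios.append(len(set(chunk)) / len(chunk))
    return ratios

if __name__ == "__main__":
    report = ("i think my speaking is good " * 10 +
              "however fluency accuracy range interaction coherence matter too").split()
    for index, ratio in enumerate(windowed_ttr(report, window=20, step=10)):
        print(index, round(ratio, 2))
```

A rise in the ratio from one window to the next suggests more new word forms are appearing there, which may or may not correspond to a new idea.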

Even amongst the stuff you can automate, are the 9 words that occurred
more and the two that occurred less statistically significantly more
or less? And did you predict those changes in advance? If not, they
form the basis of a hypothesis, not a conclusion, and I think you'd
need to analyse a whole pile more before and after reports to avoid a
well known (but distressingly common) statistical blunder.

--
Tim Rowe

Mike

Nov 26, 2009, 10:39:43 AM

to WordSmith Tools

I agree with the first part of Tim's comment, that an automated study
won't tell you enough and you need to check out what it suggests
manually.
On the second, well the 9 words were identified by the software merely
on the basis of statistical significance, and IMHO that is a massive
strength. It has to do with corpus-driven (as opposed to corpus-based)
research. The automated system suggests items that one has to check
out carefully and thereby removes some of the subjectivity of
research. Yes indeed it gives you a hypothesis. You wouldn't want a
machine to give you a conclusion -- and maybe ought to question it
when another human gives you one, too!

Mike

eldivenci

Nov 28, 2009, 4:08:27 AM

to WordSmith Tools

Thank you very much for your responses. This is very interesting. Why
is the size of the reference corpus problematic?

I did the compare two wordlists operation as suggested by Mike. It
produced 13 words, of which 6 were also in the key words list. The new
list has left out 4 words that occurred 23-27 times. Is 30 the cut-off
frequency, then? Can the compare two wordlists be adjusted to include
those lower frequencies?

I am surprised by the appearance of 3 words that were not picked up by
the key words program, with frequencies of 37-92, 6-37 and 4-32 in the
first and second reports.

By the way, after the automated analysis, I went through all the
reports and used Atlas to code them into the categories of aspects of
speaking that I am interested in (range, accuracy, fluency,
interaction, coherence and others). The key words give a very
interesting perspective on what the students have started to focus on
by the end of the term, and provide me with leads to follow up on in
analysis of the reports.

Very grateful for any further comments or advice.

Mike

Nov 28, 2009, 5:16:35 AM

to WordSmith Tools

Glad you are getting what you need, eldivenci.

I suggest you look up the notion "p value" in the help. That
determines which words are considered "key". It's not absolute
frequency but comparative frequency that makes the difference.
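
To make the point about comparative frequency concrete, here is a small sketch that turns a keyness score into a p value and applies a cut-off. It assumes a two-term log-likelihood score treated as chi-square with 1 degree of freedom; the text sizes and the 4 vs 32 pair come from this thread, the 30 vs 32 pair is invented for contrast, and the function names and the 0.001 cut-off are illustrative assumptions, not WordSmith settings.

```python
import math

def log_likelihood(a, b, size_a, size_b):
    """Log-likelihood score for counts a and b in texts of the given sizes."""
    total = a + b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    score = 0.0
    if a > 0:
        score += a * math.log(a / expected_a)
    if b > 0:
        score += b * math.log(b / expected_b)
    return 2 * score

def p_value(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))

def is_key(a, b, size_a, size_b, max_p):
    """True when the p value falls at or below the chosen cut-off."""
    return p_value(log_likelihood(a, b, size_a, size_b)) <= max_p

if __name__ == "__main__":
    first_size, second_size = 16368, 18662  # text sizes given in this thread
    # Similar frequency in the second text, very different frequency in the first.
    for first_count, second_count in [(4, 32), (30, 32)]:
        p = p_value(log_likelihood(first_count, second_count, first_size, second_size))
        key = is_key(first_count, second_count, first_size, second_size, max_p=0.001)
        print(first_count, second_count, p, key)
```

Two words with almost the same count in the second text can land on opposite sides of the cut-off, which is why there is no fixed minimum frequency such as 30.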

Cheers -- Mike

Tim Rowe

Nov 28, 2009, 6:01:35 PM

to wordsmi...@googlegroups.com

2009/11/26 Mike <mi...@lexically.net>:

> I agree with the first part of Tim's comment, that an automated study
> won't tell you enough and you need to check out what it suggests
> manually.
> On the second, well the 9 words were identified by the software merely
> on the basis of statistical significance, and IMHO that is a massive
> strength. It has to do with corpus-driven (as opposed to corpus-based)
> research.

Yes, but there's a big difference between "statistically significant"
and "statistically meaningful". The corpus research has identified an
objective difference between the texts, but does it say anything about
the learning outcomes in the students? Answer: no, nothing at all.
They are different sets of texts, it would be remarkable if they
*didn't* have significant differences! And humans are pattern-seeking
creatures, so it's all too easy to find an explanation for a difference
that has no significance at all.

So all you can say is that it's interesting that there are those
differences, that they *might* be the result of the learning and they
*might* indicate a particular learning outcome. You have some
hypotheses, but at the moment they're at the same level as noticing
that you had bad luck the last three times you wore your pink socks --
that could be a statistically significant correlation too. To move
from superstition to science, to start the actual *research* you have
to *test* those hypotheses.

The fact that "the 9 words were identified by the software merely on
the basis of statistical significance" can be a massive weakness if
you don't understand statistical hypothesis forming and testing!

--
Tim Rowe

Mike

Nov 29, 2009, 4:57:37 AM

to WordSmith Tools

Humm.

> So all you can say is that it's interesting that there are those
> differences, that they *might* be the result of the learning and they
> *might* indicate a particular learning outcome. You have some
> hypotheses, but at the moment they're at the same level as noticing
> that you had bad luck the last three times you wore your pink socks --
> that could be a statistically significant correlation too. To move
> from superstition to science, to start the actual *research* you have
> to *test* those hypotheses.

The comparison eldivenci made is not really much like simply
"noticing" a match between sock-colour and luck. He used a piece of
software to trawl through the equivalent of his entire wardrobe, their
colours and maybe the type of cloth, date of manufacture, shop where
bought etc, plus his luck or ill-luck and that was what found the 13
coincidences he reports, without him somehow suspecting it had to do
with socks or pink. But I do agree a) the software tells you nothing
about learning processes [it doesn't claim to] and b) he needs now to
look further into the findings [as he indeed seems to be doing!]. The
software gave him pointers, not proof of anything. They might well be
useful pointers...

Cheers -- Mike

Tim Rowe

Nov 29, 2009, 9:57:56 AM

to wordsmi...@googlegroups.com

2009/11/29 Mike <mi...@lexically.net>:

> The comparison eldivenci made is not really much like simply
> "noticing" a match between sock-colour and luck. He used a piece of
> software to trawl through the equivalent of his entire wardrobe, their
> colours and maybe the type of cloth, date of manufacture, shop where
> bought etc, plus his luck or ill-luck and that was what found the 13
> coincidences he reports, without him somehow suspecting it had to do
> with socks or pink.

I'm not sure whether you understand the issue I'm talking about or not.

A few years ago there was a study that found that pilots in the
Israeli Air Force almost exclusively had male babies. People jumped up
and down in excitement (figuratively at least) and started researching
what was causing this effect. Was it something in their diet?
Something about the type of person who became a pilot in the Israeli
air force? An act of God? Actually, it was none of those things. It
was pure fluke. The trouble is that when you "trawl" through a whole
pile of *possible* correlations you *will* produce a lot of fluke
results like this. How many possible professions are there? How many
organisations are there in which those professions might be exercised?
With how many personal attributes might the product of those two be
correlated? If you do a hundred tests then by fluke alone you should
expect 5 results that are statistically significant at the 95% level. If you do
thousands upon thousands of tests then the flukes will be all over the
place. All of them statistically significant. All of them potential
new superstitions.
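
The arithmetic behind that claim can be sketched as follows. It assumes every test is independent and that there is no real effect anywhere, which is a simplification; the function names are hypothetical.

```python
def expected_false_positives(n_tests, alpha):
    """Average number of fluke 'significant' results when no effect is real."""
    return n_tests * alpha

def chance_of_at_least_one(n_tests, alpha):
    """Probability of one or more flukes, assuming independent tests."""
    return 1 - (1 - alpha) ** n_tests

if __name__ == "__main__":
    for n_tests in (100, 1000, 10000):
        print(n_tests,
              expected_false_positives(n_tests, 0.05),
              chance_of_at_least_one(n_tests, 0.05))
```

With 100 tests at the 0.05 level the expected number of flukes is 100 x 0.05 = 5, and it grows in proportion to the number of tests.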

That's why good practice in corpus research is to form your hypotheses
on a *sample* from the corpus, and then to test the hypotheses on the
remainder of the corpus. Fluke correlations are still possible, but
they're *way* less likely.
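
One possible reading of this advice, sketched below, is to split the material at random into an exploration part and a held-out part. It assumes the sampling unit is the student (so each student's before and after reports stay together), which the thread does not specify; the function name, the 50/50 split and the seed are all illustrative assumptions.

```python
import random

def split_reports(student_ids, explore_fraction=0.5, seed=0):
    """Randomly divide students into an exploration set and a held-out set."""
    ids = list(student_ids)
    random.Random(seed).shuffle(ids)  # fixed seed so the split is repeatable
    cut = int(len(ids) * explore_fraction)
    return ids[:cut], ids[cut:]

if __name__ == "__main__":
    explore, holdout = split_reports(range(62))  # 62 students, as in this thread
    print(len(explore), len(holdout))
```

Key words found in the exploration reports would then be treated as hypotheses, and only those specific words would be checked against the held-out reports.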

--
Tim Rowe

Mike

Nov 29, 2009, 10:41:52 AM

to WordSmith Tools

Tim, hi.

Correct about flukes. That is why the usual default which is set in
the software is conservatively put at 1 in a million. If you have a
million (or half a million, etc.) types to compare you will certainly
expect some false hits, as you say. In usual practice one might have
only say 1,000 or 2,000 types being compared against a reference set.
You might have 750 types in a text of 1,600 words and 1,500 types in a
text of just under 5,000 words.

Another reason for caution is the assumption that words occur at
random: of course they do not, otherwise why would we do linguistics?
Yet statistical procedures have always been based on assumptions such
as that the phenomena investigated are normally distributed.

Third, the KW procedure does not claim the whole set of KWs (13 in
eldivenci's study) is statistically believable, only that each one
is.

Fourth, think about the fact that words do cluster non-randomly in
text.
In 1997 I wrote this ("PC Analysis of Key Words and Key Key Words",
System, Vol 25 No 2, p. 243):
...The misgivings have to do with the skewed nature of types in a
corpus and the very high incidence of singleton items. If there were
10 occurrences of "beetroot" in a 1000 word text on gardening, and
also 10 occurrences of "beetroot" in a 1 000 000 word corpus of
general texts, then that item would be 1000 times more frequent than
expected on a chance basis and chi-square would be a reasonable way of
saying that the difference is believable. If the occurrences were
1/1000 and 1/1 000 000, respectively, the same logic applies but the
confidence in results differs, because 1/1 000 000 suggests the item
is very rare and very rare items will not be spread around all
possible corpora very very thinly (at a uniform rate of one per
million words), but will crop up occasionally in relation to some sort
of topicality or stylistic factor.
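
The beetroot figures can be worked through with a short sketch. It assumes a plain Pearson chi-square on a 2x2 table (the word versus all other words, in each of the two texts) with no continuity correction; the function names are hypothetical.

```python
def relative_frequency_ratio(a, size_a, b, size_b):
    """How many times more frequent the word is in text A than in text B."""
    return (a / size_a) / (b / size_b)

def chi_square_2x2(a, size_a, b, size_b):
    """Pearson chi-square for a word's counts a and b in two texts."""
    c, d = size_a - a, size_b - b  # all other tokens in each text
    n = size_a + size_b
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * size_a * size_b)

def expected_in_a(a, size_a, b, size_b):
    """Occurrences expected in text A if the word were spread evenly."""
    return (a + b) * size_a / (size_a + size_b)

if __name__ == "__main__":
    for count in (10, 1):
        print(count,
              relative_frequency_ratio(count, 1000, count, 1000000),
              chi_square_2x2(count, 1000, count, 1000000),
              expected_in_a(count, 1000, count, 1000000))
```

The frequency ratio is the same in both cases, but with single occurrences the statistic rests on one token per text, and the expected count in the small text is far below the usual rule-of-thumb minimum of 5.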

> correlated? If you do a hundred tests then by fluke alone you should
> expect 5 results that are statistically significant at the 95% level. If you do
> thousands upon thousands of tests then the flukes will be all over the
> place. All of them statistically significant. All of them potential
> new superstitions.

Finally, please can you explain about the *sample*, Tim? Your story
about the Israeli pilots does imply one must be cautious but how are
you to know which words to take for the sample? And how will you avoid
your own biases in choosing? If eldivenci wants to study his two sets
of 250-word reports, are you concluding that he should sample some of
the words or some of the reports? How will he decide which? If he goes
for educationally-loaded words, ones his theories suggested are likely
to relate to learning, he'd be doing some sort of corpus-based study
but he'd find it hard to be sure his choices weren't biased. If he
chose random words, will he find anything out that interests him? My
suggestion is for him to use all the words in all the reports he has,
and let the software suggest some pointers, as I said earlier.

Cheers -- Mike
