I hope you don't mind a random visitor posting here, especially so
long after the competition finished. I could have just emailed Teemu
(and I will if it suits him better), but I thought I might as well
post publicly in case my questions were of interest to anyone else.
I came to this discussion through the paper by Teemu and Tuomas
Heikkilä, 'Evaluating Methods for Computer-Assisted Stemmatology Using
Artificial Benchmark Data Sets', Literary and Linguistic Computing, 24
(2009), 417-33. (Though friends had mentioned the project to me
before--I used to work at Helsinki.) I thought it was a really
interesting and helpful article--thanks! I'm interested in a couple of
issues connected with the paper. I have no background in mathematics
though, so my capacity to talk about it is pretty limited.
1. My quite amateurish experiments so far with computer-assisted
stemmatology have used the parsimony analysis program "pars" that
comes with Phylip (http://evolution.genetics.washington.edu/phylip.html)--
basically because it was the first relevant, free, Linux-friendly
program I found. I'd be interested to see how this would fare
in the experiment; unfortunately, pars can only handle up to 8 states,
and the Nexus files provided by Teemu have a lot more than 8. I did my
own analysis of two excerpts from the Heinrichi data though, and
produced the stemma which I've uploaded in the 'files' area as
"plotfile_bothdatasets.ps". It's only based on 52 characters, and just
looking at it by eye, it seems respectably close to the real stemma
published in Teemu's paper, though it's not as good as his RHM stemma.
I'd like to score my stemma using Teemu's system, but although I think
I understand the system in theory, I haven't worked out how to
implement it in practice.
I don't know if anyone would be able to help me out? I'm not sure how
to run either the C program or the python program, and I haven't quite
been able to understand the description of the adjacency matrix at
http://www.cs.helsinki.fi/u/ttonteri/casc/submission.html. I'm willing
to learn, but I'm starting pretty much from scratch.
2. I was also wondering whether this scoring system could help me with
another problem. I'm interested in getting an idea of the minimum
amount of a given text I would need to analyse to get a fairly
reliable stemma. (I want to get quick and dirty results for large
numbers of Icelandic sagas whose stemmas we know almost nothing about,
rather than trying to get highly detailed knowledge of one tradition
where the basics are already known.) I guess, as a starting point, it
would be possible to analyse progressively larger proportions of, say,
the Heinrichi data, and score them to see at what point the score
maxes out. Does that sound plausible?
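The "progressively larger proportions" idea above could be sketched roughly as follows. This is only an illustration under assumptions: the character matrix is taken to be a plain mapping from manuscript name to a sequence of character states, and the names nested_subsets and subsample are made up for this sketch. The tree-building step (e.g. pars) and the scoring step (rankdistance) are not shown; each subsample would be fed to them separately.

```python
import random


def nested_subsets(n_chars, fractions, seed=0):
    """Yield (fraction, column indices) pairs for growing samples.

    The columns are shuffled once, so every larger sample contains
    all columns of the smaller ones. That way a change in score
    reflects the added data rather than a fresh random draw.
    """
    rng = random.Random(seed)
    order = list(range(n_chars))
    rng.shuffle(order)
    for frac in sorted(fractions):
        k = max(1, round(frac * n_chars))
        yield frac, sorted(order[:k])


def subsample(matrix, columns):
    """Keep only the chosen character columns for every manuscript."""
    return {name: [states[c] for c in columns]
            for name, states in matrix.items()}


# Toy data (hypothetical): three manuscripts, four characters.
toy = {"A": "0011", "B": "0101", "C": "0110"}
for frac, cols in nested_subsets(4, [0.25, 0.5, 1.0]):
    print(frac, subsample(toy, cols))
```

Plotting the score against the fraction for several seeds should show whether, and roughly where, the curve flattens out.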
Thanks, ja hyvää pääsiäistä!
Alaric
Thanks for your post, we certainly don't mind random visitors posting!
1. About parsimony: the results we give for PAUP, on the challenge
page and in the article you mention, include the parsimony criterion.
I would guess they are probably rather similar to what Phylip gives.
2. The scripts for evaluation are unfortunately somewhat difficult to
use since they don't accept any of the standard tree formats. I have
written some conversion scripts but at the moment I found it easier to
just write your graph (from the files section of this discussion
group) as a DOT file, compatible with the GraphViz software. I added
the DOT file in the files section as tree.dot. The resulting PDF,
obtained by saying "dot -T pdf -o tree.pdf tree.dot" is also there as
tree.pdf . I hope I didn't make many mistakes in typing in your tree
from the figure.
Now that we have a DOT file (and unfortunately, a rather specific type
of a DOT file is required; for instance, all nodes have to be listed
in the beginning and the observed manuscripts have to be explicitly
labeled), we can run the rankdistance program for which I uploaded
source code as rankdistance.c. Saying "rankdistance
correct_heinrichi.dot tree.dot" prints out a whole lot of details, but
in the end you can see the result: 64.899%.
(Above, correct_heinrichi.dot is the true structure, which you can see
as PDF in correct_heinrichi.pdf, both files to be found in the files
section.)
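For anyone wanting to generate such a DOT file rather than type it by hand, here is a minimal sketch in Python. It follows the two constraints mentioned above (all nodes listed first, observed manuscripts explicitly labeled), but the details are assumptions: the function name to_dot is made up, and whether rankdistance expects an undirected "graph", a "digraph", or some particular label convention should be checked against the uploaded tree.dot before relying on this.

```python
def to_dot(observed, hypothetical, edges, name="stemma"):
    """Return DOT text with every node declared before any edge.

    observed     -- names of surviving manuscripts (given a label)
    hypothetical -- names of inferred internal nodes (empty label)
    edges        -- (node, node) pairs
    The label convention here is an assumption, not the documented
    input format of rankdistance.
    """
    lines = [f"graph {name} {{"]
    for node in observed:
        lines.append(f'  {node} [label="{node}"];')
    for node in hypothetical:
        lines.append(f'  {node} [label=""];')
    for a, b in edges:
        lines.append(f"  {a} -- {b};")
    lines.append("}")
    return "\n".join(lines) + "\n"


# Hypothetical three-manuscript example with one inferred node.
print(to_dot(["A", "B", "C"], ["n1"],
             [("n1", "A"), ("n1", "B"), ("n1", "C")]))
```

The output can be saved as tree.dot and then passed to the dot and rankdistance commands quoted above.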
I hope this helps. If you want to evaluate other graphs, you can try
to modify the tree.dot file and re-run rankdistance. At the point
where this gets too laborious, I can try to see if I have a conversion
script to convert whatever tree format your program outputs into the
required DOT format (or the matrix format used by the Python script).
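As a general illustration of what an adjacency matrix is (not a reproduction of the exact file layout described on the submission page, which should be checked there), a tree given as a list of edges can be turned into a symmetric 0/1 matrix like this. The function name adjacency_matrix and the node names are made up for the sketch.

```python
def adjacency_matrix(nodes, edges):
    """Build a symmetric 0/1 matrix: entry [i][j] is 1 when
    nodes[i] and nodes[j] are directly connected, else 0."""
    index = {name: i for i, name in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for a, b in edges:
        i, j = index[a], index[b]
        matrix[i][j] = 1
        matrix[j][i] = 1  # undirected: mirror the entry
    return matrix


# Hypothetical example: manuscripts A and B copied from a lost exemplar X.
nodes = ["A", "B", "X"]
for name, row in zip(nodes, adjacency_matrix(nodes, [("X", "A"), ("X", "B")])):
    print(name, " ".join(str(v) for v in row))
```

Each row and column stands for one node in the same order, so the matrix carries the same information as the edge list in a DOT file.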
best,
Teemu
> been able to understand the description of the adjacency matrix at
> http://www.cs.helsinki.fi/u/ttonteri/casc/submission.html. I'm willing
It took me a while to work out that I needed to compile the C program
first (sorry, this is the level I'm working at!) but I found out how
to (via http://ubuntuforums.org/showthread.php?t=29698), and now it's
all working great!
Nice to know that my rather rough and ready analysis got a fairly
respectable score too. I'll be experimenting to see how much data I
need to process before the returns on my efforts diminish, so I'll let
you know how that turns out.
Kiitos taas,
Alaric
Your result is indeed not bad at all. I'll be looking forward to
hearing if you can improve by using more data.
ystävällisin terveisin,
Teemu
On Apr 9, 4:58 am, Alaric Hall <alarich...@gmail.com> wrote:
> WOW, thanks Teemu! That's really helpful: it's enabled me to
> understand how to form the dot files and to produce the tree.pdf file.
> Thanks for going to the trouble!
>
> It took me a while to work out that I needed to compile the c program
> first (sorry, this is the level I'm working at!) but I found out how
> to (via http://ubuntuforums.org/showthread.php?t=29698), and now it's