Phylip, and scoring


Alaric Hall

Apr 1, 2010, 8:08:47 PM
to Computer-Assisted Stemmatology Challenge
Hi!

I hope you don't mind a random visitor posting here, especially so
long after the competition finished. I could have just emailed Teemu
(and I will if it suits him better), but I thought I might as well
post publicly in case my questions were of interest to anyone else.

I came to this discussion through the paper by Teemu and Tuomas
Heikkilä, 'Evaluating Methods for Computer-Assisted Stemmatology Using
Artificial Benchmark Data Sets', Literary and Linguistic Computing, 24
(2009), 417-33. (Though friends had mentioned the project to me
before--I used to work at Helsinki.) I thought it was a really
interesting and helpful article--thanks! I'm interested in a couple of
issues connected with the paper. I have no background in mathematics
though, so my capacity to talk about it is pretty limited.

1. My quite amateurish experiments so far with computer-assisted
stemmatology have used the parsimony analysis program "pars" that
comes with Phylip (http://evolution.genetics.washington.edu/phylip.html)
--basically because it was the first relevant, free, Linux-friendly
program I found. I'd be interested to see how this would fare
in the experiment; unfortunately, pars can only handle up to 8 states,
and the Nexus files provided by Teemu have a lot more than 8. I did my
own analysis of two excerpts from the Heinrichi data though, and
produced the stemma which I've uploaded in the 'files' area as
"plotfile_bothdatasets.ps". It's only based on 52 characters, and just
looking at it by eye, it seems respectably close to the real stemma
published in Teemu's paper, though it's not as good as his RHM stemma.
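
A minimal sketch of one way to see which characters fit within the 8-state limit mentioned above, assuming the aligned readings are held as a dictionary from manuscript name to a list of state symbols. The function name `usable_characters`, the `?` symbol for a missing reading, and the toy data are all hypothetical; this is not part of Phylip or of the challenge scripts.

```python
from typing import Dict, List

MAX_STATES = 8  # limit for pars as stated in the post above
MISSING = "?"   # assumed symbol for a missing reading


def usable_characters(matrix: Dict[str, List[str]],
                      max_states: int = MAX_STATES) -> List[int]:
    """Return indices of characters with at most max_states distinct readings."""
    n_chars = len(next(iter(matrix.values())))
    keep = []
    for i in range(n_chars):
        # Collect the distinct states at position i, ignoring missing readings.
        states = {row[i] for row in matrix.values() if row[i] != MISSING}
        if len(states) <= max_states:
            keep.append(i)
    return keep


# Toy data: four hypothetical manuscripts, two characters each.
toy = {
    "A": ["0", "a"],
    "B": ["1", "b"],
    "C": ["0", "c"],
    "D": ["1", "?"],
}
# With a limit of 2 states, only the first character survives.
kept = usable_characters(toy, max_states=2)
print(kept)
```

Dropping characters that exceed the limit is only one option; recoding rare readings into a shared state would be another, with different consequences for the analysis.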

I'd like to score my stemma using Teemu's system, but although I think
I understand the system in theory, I haven't worked out how to
implement it in practice.

I don't know if anyone would be able to help me out? I'm not sure how
to run either the C program or the python program, and I haven't quite
been able to understand the description of the adjacency matrix at
http://www.cs.helsinki.fi/u/ttonteri/casc/submission.html. I'm willing
to learn, but I'm starting pretty much from scratch.

2. I was also wondering whether this scoring system could help me with
another problem. I'm interested in getting an idea of the minimum
amount of a given text I would need to analyse to get a fairly
reliable stemma. (I want to get quick and dirty results for large
numbers of Icelandic sagas whose stemmas we know almost nothing about,
rather than trying to get highly detailed knowledge of one tradition
where the basics are already known.) I guess, as a starting point, it
would be possible to analyse progressively larger proportions of, say,
the Heinrichi data, and score them to see at what point the score
maxes out. Does that sound plausible?
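
The idea of scoring progressively larger proportions could be sketched roughly as follows. This only selects the character subsets; building a stemma from each subset and scoring it would still be done with the external tools. The function name `nested_samples`, the random (rather than contiguous) choice of characters, and the fractions used are assumptions for illustration.

```python
import random
from typing import Dict, List


def nested_samples(n_chars: int, fractions: List[float],
                   seed: int = 0) -> Dict[float, List[int]]:
    """Pick nested random subsets of character indices, one per fraction."""
    rng = random.Random(seed)
    order = list(range(n_chars))
    rng.shuffle(order)  # one fixed shuffle, so smaller samples sit inside larger ones
    samples = {}
    for frac in fractions:
        k = max(1, round(frac * n_chars))
        samples[frac] = sorted(order[:k])
    return samples


samples = nested_samples(n_chars=20, fractions=[0.25, 0.5, 1.0])
for frac, idx in samples.items():
    print(f"{frac:.0%}: {len(idx)} characters")
```

Because the subsets are nested, any change in score between one fraction and the next reflects the added characters rather than a completely different sample. Taking contiguous excerpts of the text instead would be closer to how a transcriber actually works.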

Thanks, and happy Easter!

Alaric

troos

Apr 8, 2010, 12:02:08 PM
to Computer-Assisted Stemmatology Challenge
Hi Alaric,

Thanks for your post, we certainly don't mind random visitors posting!

1. About parsimony criteria: the PAUP results we give on the challenge
page and in the article you mention include parsimony criteria.
I would guess they are probably rather similar to the ones in Phylip.

2. The scripts for evaluation are unfortunately somewhat difficult to
use since they don't accept any of the standard tree formats. I have
written some conversion scripts but at the moment I found it easier to
just write your graph (from the files section of this discussion
group) as a DOT file, compatible with the GraphViz software. I added
the DOT file in the files section as tree.dot. The resulting PDF,
obtained by saying "dot -T pdf -o tree.pdf tree.dot" is also there as
tree.pdf . I hope I didn't make many mistakes in typing in your tree
from the figure.

Now that we have a DOT file (and unfortunately, a rather specific type
of a DOT file is required; for instance, all nodes have to be listed
in the beginning and the observed manuscripts have to be explicitly
labeled), we can run the rankdistance program for which I uploaded
source code as rankdistance.c. Saying "rankdistance
correct_heinrichi.dot tree.dot" prints out a whole lot of details, but
in the end you can see the result: 64.899%.

(Above, correct_heinrichi.dot is the true structure, which you can see
as PDF in correct_heinrichi.pdf, both files to be found in the files
section.)
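
As a rough illustration of the two constraints mentioned above (all nodes listed first, observed manuscripts explicitly labeled), a DOT file could be generated along these lines. This is a sketch only: the exact conventions rankdistance expects are not described in this thread and should be checked against tree.dot in the files section. The function name `stemma_to_dot`, the undirected `graph` form, and the empty label for unobserved nodes are assumptions.

```python
from typing import Iterable, List, Tuple


def stemma_to_dot(observed: Iterable[str], hidden: Iterable[str],
                  edges: Iterable[Tuple[str, str]]) -> str:
    """Write a GraphViz DOT graph with every node declared before any edge."""
    lines: List[str] = ["graph stemma {"]
    for name in observed:
        # Observed manuscripts carry an explicit label (assumed convention).
        lines.append(f'  "{name}" [label="{name}"];')
    for name in hidden:
        # Unobserved (reconstructed) nodes get an empty label (assumed convention).
        lines.append(f'  "{name}" [label=""];')
    for a, b in edges:
        lines.append(f'  "{a}" -- "{b}";')
    lines.append("}")
    return "\n".join(lines) + "\n"


# Toy stemma: two hypothetical manuscripts joined by one unobserved node.
dot_text = stemma_to_dot(observed=["A", "B"], hidden=["n1"],
                         edges=[("n1", "A"), ("n1", "B")])
print(dot_text)
```

The text can be saved to a file and rendered with the "dot -T pdf" command quoted above to check the tree by eye before scoring it.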

I hope this helps. If you want to evaluate other graphs, you can try
to modify the tree.dot file and re-run rankdistance. At the point
where this gets too laborious, I can try to see if I have a conversion
script to convert whatever tree format your program outputs into the
required DOT format (or the matrix format used by the Python script).

best,
Teemu


Alaric Hall

Apr 9, 2010, 4:58:47 AM
to Computer-Assisted Stemmatology Challenge
WOW, thanks Teemu! That's really helpful: it's enabled me to
understand how to form the dot files and to produce the tree.pdf file.
Thanks for going to the trouble!

It took me a while to work out that I needed to compile the C program
first (sorry, this is the level I'm working at!) but I found out how
to (via http://ubuntuforums.org/showthread.php?t=29698), and now it's
all working great!

Nice to know that my rather rough and ready analysis got a fairly
respectable score too. I'll be experimenting to see how much data I
need to process before the returns on my efforts diminish, so I'll let
you know how that turns out.

Thanks again,

Alaric

troos

Apr 9, 2010, 11:48:27 AM
to Computer-Assisted Stemmatology Challenge
Good to hear that you got the hang of it (and even managed to
compile).

Your result is indeed not bad at all. I'll be looking forward to
hearing if you can improve by using more data.

With kind regards,

Teemu


Alaric Hall

Mar 17, 2014, 8:11:44 PM
to computer-assisted-st...@googlegroups.com, teemu...@cs.helsinki.fi, Ludger Zeevaert
Hi Teemu!

I don't know if you remember that about four years ago (!) you helped me to get your rankdistance program working, which I've since made use of in a publication citing your work (http://digitalmedievalist.org/journal/9/hall/).

This time I'm emailing to see if you'd have the time and energy to help me with some related work: I've been using Phylip to make the first complete stemma of the famous Icelandic Njáls saga. This has worked pretty well, but it's required rather laborious manual encoding of the data. So my partner in crime Ludger (cc'd) and I would really like to try the method you used in your 'Evaluating methods for computer-assisted stemmatology using artificial benchmark data sets' article, since it would (potentially) save us a lot of time if we could just work with a spreadsheet of aligned readings. But I don't know how you actually went about running your algorithm.

If you can help us I'd be very grateful. We'll have an article ready on the subject soon and I'd be happy to include you as a co-author if we can add a section evaluating your method (though NB that the article has about 20 co-authors, as all the different people who were involved in transcribing the data we used are credited!). I could send you a spreadsheet of aligned readings, or alternatively if you can give me access to whatever program you've used to run your algorithm.

If you're curious, the current draft of the article is at


and our spreadsheet is attached--though there are a few things in the data I'd want to tidy up before actually running it.

Groove on!

Alaric

--
http://www.alarichall.org.uk

School of English, University of Leeds
http://www.leeds.ac.uk/english
final_spreadsheet.ods