Running Bucky parallelized scripts on a cluster (Troubles setting the number of 4-taxon sets)


Patricia Rivera

Sep 21, 2018, 9:43:28 PM
to BUCKy users
Hi everybody!
I'm running BUCKy on a cluster through the Perl script bucky-slurm.pl. I set all the options in the submit script bucky-slurm-submit.sh as indicated on the PhyloNetworks tutorial page. I have data for 6 markers and 203 species. However, some species are missing for some markers, so I included empty sequences for the missing taxa, as recommended in the manual.
I ran a test on a subset of 104 species (still with empty sequences for the missing taxa). When I configure the --array option using the number given by Julia's binomial(104,4), I get an error: "array specification invalid".
So I tried running the non-parallelized Perl script, bucky.pl, but it is taking too long. I then took the number of quartets found by bucky.pl and put that number in the array option (--array=1-4249575), but it still isn't running. Any thoughts on how to fix this?
Thanks!
Patricia Rivera
PhD Student 
Departamento de Botanica
Instituto de Biologia
UNAM
Mexico

Cécile Ané

Sep 21, 2018, 10:15:45 PM
to BUCKy users
Hi Patricia,

It looks like the bucky.pl script found 102 taxa, not 104, based on the number of 4-taxon sets:

julia> binomial(104,4)
4598126

julia> binomial(102,4)
4249575
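The same check can be run in reverse: given the number of 4-taxon sets that a script reports, recover how many taxa it actually saw. Below is a minimal Python sketch of that arithmetic. The helper name `taxa_from_quartet_count` is made up for illustration and is not part of BUCKy or its scripts.

```python
from math import comb
from typing import Optional


def taxa_from_quartet_count(n_sets: int) -> Optional[int]:
    """Return n such that C(n, 4) == n_sets, or None if no such n exists.

    Hypothetical helper for illustration only.
    """
    n = 4
    # C(n, 4) grows with n, so a simple upward search is enough.
    while comb(n, 4) < n_sets:
        n += 1
    return n if comb(n, 4) == n_sets else None


print(comb(104, 4))                      # 4598126
print(comb(102, 4))                      # 4249575
print(taxa_from_quartet_count(4249575))  # 102
```

A result of `None` would mean the reported count is not of the form C(n, 4) for any n, so it is probably not a count of all 4-taxon sets.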


That's a large number of 4-taxon sets. With a job scheduler like slurm, that would be asking for 4,249,575 jobs. Each one is for a single set of 4 taxa, but still.
In any case, I hope that having the correct number will help.

If your plan is to use the quartet concordance factors for downstream analyses, then I would suggest using the original alignments, without empty sequences for missing taxa. Otherwise, when reduced to a subset of 4 taxa, the alignment becomes non-informative if 1 (or more) of the 4 taxa has a missing sequence. The Perl script bucky-slurm.pl uses the subset of genes that have data for all 4 taxa, and ignores any gene that is missing 1 or more of the 4 taxa. (This set of genes will vary depending on the chosen 4-taxon set.) If an empty sequence is included, the corresponding taxon will be in the sampled gene trees, and the script will think that the taxon did have a sequence.
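To make the gene-selection rule concrete, here is a small Python sketch of the idea. It is not the code of bucky-slurm.pl. The gene names, taxon names, and the function `genes_for_quartet` are all hypothetical toy data chosen for illustration.

```python
def genes_for_quartet(gene_taxa, quartet):
    """Return the genes whose taxon set contains all 4 taxa of the quartet."""
    wanted = set(quartet)
    return sorted(g for g, taxa in gene_taxa.items() if wanted <= set(taxa))


# Hypothetical toy data: which taxa actually have a sequence for each gene.
gene_taxa = {
    "gene1": {"A", "B", "C", "D", "E"},
    "gene2": {"A", "B", "C", "D"},       # E is missing
    "gene3": {"A", "B", "C", "E"},       # D is missing
}

# Without padding, each quartet only uses genes that truly cover it.
print(genes_for_quartet(gene_taxa, ("A", "B", "C", "D")))  # ['gene1', 'gene2']
print(genes_for_quartet(gene_taxa, ("A", "B", "D", "E")))  # ['gene1']

# Padding with empty sequences makes every taxon appear in every gene,
# so genes with no real data for a taxon are no longer filtered out.
all_taxa = {"A", "B", "C", "D", "E"}
padded = {gene: set(all_taxa) for gene in gene_taxa}
print(genes_for_quartet(padded, ("A", "B", "D", "E")))  # ['gene1', 'gene2', 'gene3']
```

In the padded case, gene2 and gene3 are used for the quartet (A, B, D, E) even though each has no real sequence for one of those taxa, which is the situation described above.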

I am afraid that 6 markers won't be enough to detect small amounts of gene flow (if that was your motivation for running the bucky*.pl script). Even 50-50 hybridization events might be hard to detect with confidence if there is a high level of incomplete lineage sorting on top.
Cheers,
Cecile.