producing consensus, de novo reference assembly from catalogue?

161 views
Skip to first unread message

Plough

unread,
Nov 9, 2016, 2:41:47 PM11/9/16
to Stacks
Hi,
What would be the best or a way to generate a de novo  reference assembly from the stacks catalogue?  The goal would be to produce a fasta file with catalog locus names as the headers. The motivation (uses) would be for calling genotypes in pooled samples using e.g. GATK or for mutation or large insertion deletion detection. 

It is not obvious to me how one would get from the catalog to this reference. I don't see much information in the Stacks manual about the catalog tsv file, so it is hard to know what I am parsing?

Has anyone attempted to do this or have ideas on how to produce this reference? 

Thanks!

Eric Normandeau

unread,
Nov 10, 2016, 9:39:53 AM11/10/16
to Stacks
The following command will do the trick. You may need to change your catalog file name. Also, the resulting fasta file will contain a log of junk loci that are backed by only a few individuals. In the catalog file, column 8 contains a coma separated list of the samples that had the locus. You should probably use this information to reduce the fasta file. For example, keep only loci backed by N samples.

gunzip -c 06-stacks_rx/batch_1.catalog.tags.tsv.gz | awk '{print $3,$9}' | grep -v version | perl -pe 's/^/>locus_/; s/ /\n/' > catalog_loci.fasta


Julian Catchen

unread,
Nov 10, 2016, 2:04:08 PM11/10/16
to stacks...@googlegroups.com, lpl...@umces.edu, eric.norm...@gmail.com
As Eric mentions, you could just pull the ID/consensus sequence out of
the catalog.tags.tsv file. You could then run populations with filters,
and use the sumstats file or another file as a source to make a
whitelist and then to grep out only the catalog loci that are reliable
(and in this whitelist).

Another method is to use the --phylip_all_var output from populations
(collapsing your data down to a single population). This allows you to
apply filters directly. You can then use the batch_1.fullseq.phylip
combined with the batch_1.fullseq.phylip.log file to get the consensus
sequences (with variable sites encoded as IUPAC characters) and this
could be converted to a FASTA with a small script.

So, first method is simple, but includes all data and no variation (just
majority-rules consensus at each site), or the second method allows you
to apply filters and includes the variation, but requires a bit of
script to knit the locus ID back together with its sequence.

We could probably add a 'reference' export from populations that would
do all of this in one step.

julian

Plough

unread,
Nov 17, 2016, 1:56:29 PM11/17/16
to Stacks, lpl...@umces.edu, eric.norm...@gmail.com, jcat...@illinois.edu
Thanks for the replies Eric and Julian. Both solutions are useful and Eric's obv. easy to apply right away. I think a reference export option would be great to build in or include a small extension script to work directly on the phylip output. 

Thanks again.
Reply all
Reply to author
Forward
0 new messages