Lots of missing data in Structure output files

604 views
Skip to first unread message

Quynh Quach

unread,
Jan 5, 2016, 5:00:23 PM1/5/16
to Stacks
Dear Dr. Catchen and fellow Stacks users,

I am having some trouble understanding the results that populations in Stacks is giving me.  I have a lot of missing data (> 90-95%) in my output files (Structure, Phylip is empty, and all Fst values are 0) after running populations with some very liberal parameters (r=0.2, m=3, p=1, write single snp).  Even though it was able to call ~9000 loci, very few of them are actually useable when I tried running it in Arlequin.  It seems that although the loci was called in individuals, all the data was labeled as missing.  I am hoping that it’s not because my data is not good, since process_radtags and denovo_pl, which I ran with default values, both ran smoothly.  I am not sure how to  troubleshoot this problem, so if you have any input to why this might have happened, that would be much appreciated.  Let me know if there’s any results I can provide so that this would be more clear to you.

Thank you,

Quynh

M.S. Student

Revell Lab@UMass Boston


Julian Catchen

unread,
Jan 6, 2016, 1:18:33 PM1/6/16
to stacks...@googlegroups.com, quy...@gmail.com
Hi Quynh,

You will need to develop a deeper understanding of your dataset to
answer the questions you posed. Here are some questions to consider:

1) How many raw reads did you start with prior to any analysis?

2) How many raw reads survived the cleaning/demuliplexing process
(process_radtags)? Ideally, you will have 80% or more of your data retained.

3) How many reads do you have per sample in your analysis?
(process_radtags logfile will report this, or you can check how many
reads are in each file using UNIX commands.)

4) What is the average depth of coverage for each of your samples coming
from ustacks? (This information is printed by ustacks to the screen or
captured to the denovo_map.log file.)

For de novo data, you want a minimum of 20x coverage, and you will get
much better results with 35x coverage per individual. The software can
handle variation in coverage, but it cannot compensate for chronic low
coverage across your data set.

5) How many loci exist in your organism's genome? The catalog will
record this data and you can figure it out easily using the web
interface and filtering for the number of loci present in a certain
percentage of your individuals.

The populations program will also print a distribution of the number of
loci that are found in a certain number of individuals to the
populations.log file.

Again, the program can handle some variation in the number of samples
that contain each catalog locus. However, if your coverage is so low
that each locus is found in only one or two individuals, you will not
have enough data present across the data set to complete your analysis.

6) If you vary the primary parameters to denovo_map.pl (-m, -M, and -n)
do you see these numbers in #5 change dramatically?

After all these steps, you can settle on a final set of parameters for
the main pipeline, denovo_map.pl. If you have sufficient data, you can
choose final filtering parameters for the populations program (-r, and -p).

Bottom line, you need to account for where all your raw data ended up
and evaluate whether your bench protocol was successful and you got a
good signal in the data after sequencing. Only after that can you ask
biological questions using your data.

Best,

julian

Quynh Quach

unread,
Jan 26, 2016, 12:21:22 PM1/26/16
to Stacks, quy...@gmail.com, jcat...@illinois.edu
Hi Julian,
I really appreciate your in-depth reply and I would like to update you with what I've found with my data.
I started out with about 310 million reads from my two lanes of hi-seq and process_radtags retained about 287 million reads. 
All of my barcodes were found and retained between 600,000 to 28 million reads.
My coverage mean was 189 with SD of 3386.85
The sum of loci using default parameters in population was 8572 while if I changed m=4 it was 7215.

Given this, I think that our bench protocol was good and the sequencing was successful unless I'm not interpreting this correctly.
I will look more carefully into what happens during population script that would give me the missing data and would love more input in this if you have any.

Thank you,
Quynh
Reply all
Reply to author
Forward
0 new messages