Hi Quynh,
You will need to develop a deeper understanding of your dataset to
answer the questions you posed. Here are some questions to consider:
1) How many raw reads did you start with prior to any analysis?
2) How many raw reads survived the cleaning/demultiplexing process
(process_radtags)? Ideally, you will have 80% or more of your data retained.
3) How many reads do you have per sample in your analysis?
(The process_radtags log file will report this, or you can check how many
reads are in each file using UNIX commands.)
4) What is the average depth of coverage for each of your samples coming
from ustacks? (This information is printed by ustacks to the screen or
captured to the denovo_map.log file.)
For de novo data, you want a minimum of 20x coverage, and you will get
much better results with 35x coverage per individual. The software can
handle variation in coverage, but it cannot compensate for chronic low
coverage across your data set.
5) How many loci exist in your organism's genome? The catalog will
record this data, and you can work it out easily using the web
interface by filtering for the number of loci present in a certain
percentage of your individuals.
The populations program will also print a distribution of the number of
loci that are found in a certain number of individuals to the
populations.log file.
Again, the program can handle some variation in the number of samples
that contain each catalog locus. However, if your coverage is so low
that each locus is found in only one or two individuals, you will not
have enough data present across the data set to complete your analysis.
6) If you vary the primary parameters to denovo_map.pl (-m, -M, and -n),
do you see these numbers in #5 change dramatically?
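
To make the comparison in #6 systematic, it can help to lay out the
parameter combinations before running anything. The following is only a
rough sketch in Python: the values for -m, -M, and -n are arbitrary
illustrations, and other_args is a hypothetical placeholder for whatever
input/output options your version of denovo_map.pl requires (those are
deliberately not filled in here).

```python
from itertools import product


def parameter_grid(m_values, M_values, n_values):
    """Return every (-m, -M, -n) combination as a list of dicts."""
    return [
        {"m": m, "M": M, "n": n}
        for m, M, n in product(m_values, M_values, n_values)
    ]


def build_command(params, other_args):
    """Build the argument list for one denovo_map.pl run.

    other_args is a placeholder (an assumption of this sketch) for the
    input/output options of your Stacks version; it is passed through
    unchanged.
    """
    return [
        "denovo_map.pl",
        "-m", str(params["m"]),
        "-M", str(params["M"]),
        "-n", str(params["n"]),
        *other_args,
    ]


# Illustrative values only; choose ranges that suit your own data.
grid = parameter_grid(m_values=[3, 5], M_values=[1, 2, 3], n_values=[1, 2])
for params in grid:
    print(" ".join(build_command(params, other_args=[])))
```

Each printed line is one candidate run. After running them, you would
compare the numbers from #4 and #5 across runs rather than looking at any
single run alone.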
After all these steps, you can settle on a final set of parameters for
the main pipeline, denovo_map.pl. If you have sufficient data, you can
then choose final filtering parameters for the populations program (-r
and -p).
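
To get a feel for how the locus-sharing distribution from #5 interacts
with -r and -p, here is a simplified sketch in Python. It treats -r as the
minimum fraction of individuals within a population that must have a
locus, and -p as the minimum number of populations that must meet that
fraction. The data structures (catalog, popmap) and function names are
invented for illustration; this only approximates the idea and is not how
the populations program is implemented.

```python
from collections import Counter


def sharing_distribution(locus_samples):
    """Map 'number of samples' to 'number of loci found in that many'."""
    return Counter(len(samples) for samples in locus_samples.values())


def passes_filters(samples_with_locus, popmap, r, p):
    """Approximate the idea behind -r and -p for a single locus.

    popmap: dict of sample name -> population name
    r: minimum fraction of a population's individuals carrying the locus
    p: minimum number of populations that must satisfy r
    """
    pop_sizes = Counter(popmap.values())
    pop_hits = Counter(popmap[s] for s in samples_with_locus if s in popmap)
    pops_ok = sum(
        1 for pop, size in pop_sizes.items() if pop_hits[pop] / size >= r
    )
    return pops_ok >= p


# Toy data: four samples in two populations, four catalog loci.
popmap = {"s1": "A", "s2": "A", "s3": "B", "s4": "B"}
catalog = {
    1: {"s1", "s2", "s3", "s4"},
    2: {"s1"},
    3: {"s1", "s2", "s3"},
    4: {"s3"},
}

distribution = sharing_distribution(catalog)
kept = sorted(
    locus for locus, samples in catalog.items()
    if passes_filters(samples, popmap, r=0.75, p=2)
)
print(dict(distribution), kept)
```

In this toy example only locus 1 survives r=0.75 and p=2. If most of your
real loci look like loci 2 and 4 (present in a single individual), no
choice of -r and -p will rescue the analysis.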
Bottom line, you need to account for where all your raw data ended up
and evaluate whether your bench protocol was successful and you got a
good signal in the data after sequencing. Only after that can you ask
biological questions using your data.
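
As a starting point for that accounting (questions #1 to #3), here is a
minimal Python sketch that counts reads in a FASTQ file and computes the
fraction of raw reads retained. It assumes standard four-line FASTQ
records and treats files ending in .gz as gzip-compressed; the function
names are made up for this example, and the read counts in the last line
are invented numbers, not results from any real run.

```python
import gzip


def count_fastq_reads(path):
    """Count reads in a FASTQ file, assuming four lines per record."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4


def retained_fraction(raw_reads, retained_reads):
    """Fraction of raw reads that survived cleaning/demultiplexing."""
    if raw_reads <= 0:
        raise ValueError("raw_reads must be positive")
    return retained_reads / raw_reads


# Invented numbers for illustration: 1,000,000 raw reads, 850,000 retained.
fraction = retained_fraction(1_000_000, 850_000)
print(f"{fraction:.0%} retained; meets 80% guideline: {fraction >= 0.80}")
```

Running count_fastq_reads on each per-sample file written by
process_radtags gives the per-sample totals asked about in #3, which you
can cross-check against the process_radtags log.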
Best,
julian