Hello Stacks Community,
I have a question about an error I get when running denovo_map.pl for parameter optimization. I am using a subset of samples (12 samples from 3 populations, treated as a single theoretical population for parameter optimization). The error occurs in the populations pipeline when filtering for SNPs found in 80% (or 60%) of individuals; it does not occur when I omit the --min-samples-per-pop 0.80 (or 0.60) option. Essentially, all of the data is being filtered out. Is this an artifact of my data? A bit about my dataset...
The error message I am getting is:
Now processing...
Batch 1
Removed 45647 loci that did not pass sample/population constraints from 45647 loci.
Kept 0 loci, composed of 0 sites; 0 of those sites were filtered, 0 variant sites remained.
Error: All data has been filtered out.
Aborted.
Apologies if this is something trivial, and thank you in advance for your help!
Jenny Cocciardi
Hi Jenny,
Either you made an error in your analysis or your 12 samples do not share loci widely across them (that is, you cannot find *any* loci that appear in at least 9 of the 12 individuals).
How did you determine your levels of coverage and PCR duplicates? What was the denovo_map.pl command you ran for optimization? What did you mean by your last bullet point, “do not implement a minimum number of individuals” – did you omit the -r 0.8 flag to the populations program?
The populations.log.distribs file will have a table telling you how many loci are shared among all 12 of your samples, 11, 10, 9, etc.
Best,
julian
Hi Jenny,
You do have good coverage in your data; however, you appear to have very, very few RAD loci in most of your samples. There are a number of places in the log where you can see this, but when I extract how many loci are matched against your catalog for each sample, I get this:
18094 sample loci compared against the catalog containing 46194 loci.
1508 sample loci compared against the catalog containing 46194 loci.
4348 sample loci compared against the catalog containing 46194 loci.
20380 sample loci compared against the catalog containing 46194 loci.
4112 sample loci compared against the catalog containing 46194 loci.
2388 sample loci compared against the catalog containing 46194 loci.
958 sample loci compared against the catalog containing 46194 loci.
9965 sample loci compared against the catalog containing 46194 loci.
6699 sample loci compared against the catalog containing 46194 loci.
2179 sample loci compared against the catalog containing 46194 loci.
1573 sample loci compared against the catalog containing 46194 loci.
2085 sample loci compared against the catalog containing 46194 loci.
So, most of your 12 samples have ~2,000 loci or fewer, so it is clear why an 80% threshold might fail to find any common loci. You might consider the biology of your species versus the restriction enzyme you used. How many RAD sites do you expect to see? If you run the data without filters, populations will tell you the distribution of how many loci are shared among how many individuals (of course, the populations.log.distribs file you shared is basically empty because all data were filtered).
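For reference, the per-sample counts above can be pulled out of denovo_map.log with a short script. A minimal sketch (the embedded excerpt is just the first three quoted lines; in practice you would read the real log file):

```python
import re

# Excerpt of denovo_map.log, as quoted above; in practice, read the real file.
log = """\
18094 sample loci compared against the catalog containing 46194 loci.
1508 sample loci compared against the catalog containing 46194 loci.
4348 sample loci compared against the catalog containing 46194 loci.
"""

# Pull the per-sample locus count from each matching line.
counts = [int(m.group(1))
          for m in re.finditer(r"(\d+) sample loci compared against the catalog", log)]
print(counts)  # [18094, 1508, 4348]
```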
Hi Loïc,
Your analysis does not look the same as the thread you linked to. In particular, your samples all have good coverage and they seem to have been successfully integrated into the catalog. From the denovo_map.log file:
Depths of Coverage for Processed Samples:
S_alp_10630_IT_1: 73.23x
S_alp_PM20_068_SI_2: 56.84x
S_alp_brev_10582_IT_5: 70.44x
S_alp_brev_10614_IT_6: 43.40x
S_alp_brev_10761_AT_7: 47.86x
S_alp_brev_10787_AT_8: 57.36x
S_alp_brev_10803_AT_9: 62.68x
S_alpina_11111: 52.82x
S_alpina_11133: 39.94x
S_alpina_11148: 46.06x
S_alpina_11180: 33.83x
S_alpina_T22_4b4: 59.20x
S_brev_10865_CH_4: 72.12x
S_brev_10900_CH_3: 49.18x
S_breviserrata_10265: 29.60x
S_breviserrata_24_2014: 48.38x
S_breviserrata_EH10524: 68.27x
S_breviserrata_LP22181b: 56.43x
S_breviserrata_LP22279: 45.09x
S_breviserrata_LP22349: 65.54x
So, 40-60x coverage per sample is more than sufficient. It is not obvious to me why your samples do not appear to share loci across at least 80% of individuals (16 individuals). I would suggest you re-execute the populations program (you do not need to re-run the rest of the pipeline) without the r80 flag to see how many loci there are with no filters. I would also suggest you look at the populations.log.distribs file, which will tell you the number of loci shared between different numbers of samples, to give you an idea of how many shared loci you have and how much missing data you have.
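For reference, the number of samples implied by an r threshold can be computed directly. A quick sketch, assuming populations keeps a locus when it is present in at least the given fraction of samples in a population:

```python
import math

def min_samples(n_samples: int, r: float) -> int:
    """Smallest whole number of samples whose fraction of n_samples is >= r."""
    return math.ceil(r * n_samples)

print(min_samples(20, 0.80))  # 16, matching the 80% of 20 individuals above
```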
Hi,
It seems like there is something funky with how you are running the software. First, you will want to use a reasonable set of parameters for both the R80 and non-R80 data sets, such as M=n=3; in the logs you sent, you supplied M=n=1 for one run and M=n=3 for the other. Second, there is something strange with your underlying files. When I look at your distribs file, I see this:
BEGIN samples_per_loc_prefilters
# Distribution of valid samples matched to a catalog locus prior to filtering.
n_samples n_loci
0 1004854
END samples_per_loc_prefilters
This says that none of your loci were found in any of your samples, which is nonsensical. I would expect to see something more like this:
BEGIN samples_per_loc_prefilters
# Distribution of valid samples matched to a catalog locus prior to filtering.
n_samples n_loci
1 111
2 76
3 37
4 42
5 52
6 61
7 202
8 1835
9 48516
10 49073
END samples_per_loc_prefilters
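The effect of an r filter can be checked directly against such a table. A minimal sketch, using the example distribution above and assuming a locus passes when it appears in at least 80% of the 10 samples:

```python
# Example samples_per_loc_prefilters distribution, as shown above.
table = """\
1 111
2 76
3 37
4 42
5 52
6 61
7 202
8 1835
9 48516
10 49073
"""

n_total = 10
r = 0.80
threshold = r * n_total  # a locus needs at least this many samples

kept = 0
for line in table.splitlines():
    n_samples, n_loci = map(int, line.split())
    if n_samples >= threshold:
        kept += n_loci

print(kept)  # 99424 loci present in 8 or more of the 10 samples
```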
I would suggest running denovo_map.pl from scratch with M=n=3, no filters, and your ‘opt’ popmap from your underlying data (which was, I assume, processed with process_radtags?).
--
Stacks website: http://catchenlab.life.illinois.edu/stacks/
---
You received this message because you are subscribed to the Google Groups "Stacks" group.
To unsubscribe from this group and stop receiving emails from it, send an email to
stacks-users...@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/stacks-users/eb699b32-b9c9-498b-b645-7950ac7c5af5n%40googlegroups.com.
Hi, we can’t diagnose the problem without seeing the commands you executed. You posted coverage numbers, I believe, prior to the removal of PCR duplicates, but what was coverage after duplicate removal (gstacks removes them)? What filters did you request in populations? The snippet you posted shows everything being filtered.
Hi Noelia,
Yes, as you mention, you are losing 97% of your data as PCR duplicates. You also have a small number of RAD loci to start with (~20k per sample), but the assembly proceeds well and most of your reads form stacks properly. I am curious whether you have checked any reference genomes from these organisms to understand the number of expected RAD loci. How many more would you expect than the 20k you are assembling? If it is a lot more, this is an additional sign that your DNA extractions did not capture the nuclear DNA very well.
Try running the pipeline without filtering PCR duplicates to see what things look like with all your data. You may still have very few final loci: if there are >>20k loci in the genomes, you may have captured a different subset of ~20k loci from each sample, in which case they cannot be compared across samples by the software.
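To make the scale of that loss concrete, a quick back-of-the-envelope calculation (the 97% figure is from above; the read count is an arbitrary illustration):

```python
dup_rate = 0.97          # fraction of read pairs flagged as PCR duplicates (from above)
read_pairs = 1_000_000   # hypothetical sequencing depth, for illustration only

kept = read_pairs * (1 - dup_rate)
print(f"{kept:.0f} read pairs remain")  # 30000 read pairs remain
```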
julian