Denovo_map.pl, populations pipeline error when using r80 method


Jennifer Cocciardi

Apr 4, 2022, 2:53:08 PM
to Stacks

Hello Stacks Community,

I have a question about an error I am getting when running denovo_map.pl for parameter optimization. I am using a subset of samples (12 samples from 3 populations, treated as one theoretical population for parameter optimization). The error occurs in the populations stage when filtering for SNPs found across 80% (or 60%) of the individuals; it does not occur when I omit the --min-samples-per-pop 0.80 (or 0.60) option. Essentially, all of the data is being filtered out. Is this an artifact of my data? A bit about my dataset:

  • I'm using paired data, and 98.4% of the paired reads are aligned correctly
  • the depth of coverage is high, with an average of 42.90x coverage
  • I've already filtered out PCR reps and the effective per-sample coverage also looks fine at 35.4x
  • the mean number of sites per locus is 324.4
  • the error occurs for: -M 2, -m 3, -n 3 AND -M 3, -m 3, -n 3 AND -M 4, -m 3, -n 3
  • when I do not implement a minimum number of individuals, I get 201516 SNPs (-M 2)

The error message I am getting is: 

<Now processing...

Batch 1 

Removed 45647 loci that did not pass sample/population constraints from 45647 loci.

Kept 0 loci, composed of 0 sites; 0 of those sites were filtered, 0 variant sites remained.

Error: All data has been filtered out.

Aborted.>

Apologies if this is something trivial, and thank you in advance for your help!

Jenny Cocciardi

Catchen, Julian

Apr 6, 2022, 11:21:02 AM
to stacks...@googlegroups.com

Hi Jenny,

 

Either you made an error in your analysis or your 12 samples do not share loci widely across them (that is, you cannot find *any* loci that appear in at least 9 of the 12 individuals).

 

How did you determine your levels of coverage and PCR duplicates? What was the denovo_map.pl command you ran for optimization? What do you mean in your last bullet point, “do not implement a minimum number of individuals” – did you omit -r 0.8 from the populations program?

 

The populations.log.distribs file will have a table telling you how many loci are shared among all 12 of your samples, 11, 10, 9, etc.
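That table can be pulled out of the log with a short awk range filter; this is a sketch over a toy distribs file (the counts below are made up for illustration). The stacks-dist-extract utility bundled with Stacks serves the same purpose.

```shell
# Toy populations.log.distribs file, mimicking the section format
# (a real run writes this file into the populations output directory).
cat > populations.log.distribs <<'EOF'
BEGIN samples_per_loc_prefilters
# Distribution of valid samples matched to a catalog locus prior to filtering.
n_samples n_loci
9 48516
10 49073
11 51210
12 30417
END samples_per_loc_prefilters
EOF

# Print only the table between the BEGIN/END markers, skipping comment lines.
awk '/^BEGIN samples_per_loc_prefilters/ {f=1; next}
     /^END samples_per_loc_prefilters/   {f=0}
     f && !/^#/' populations.log.distribs
```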

 

Best,

 

julian

Jennifer Cocciardi

Apr 7, 2022, 8:26:48 AM
to Stacks
Hi Julian,

Thank you so much for your response! It seems improbable that the 12 samples do not share loci widely across them, so it is most likely an error on my part. 

To answer your questions:
 - The command I used was: 
          denovo_map.pl -M 2 -m 3 -n 3 -o ~/scratch/demux/EC/tests.denovo/stacks.M2 \
          --popmap ~/scratch/demux/EC/tests.denovo/popmap_test_samples.txt \
           --samples ~/scratch/demux/EC --paired --min-samples-per-pop 0.80
 - I obtained the levels of coverage from the denovo_map.log output of cstacks and gstacks, and I also looked at the per-sample read coverage in the sample output from process_radtags. 
 - To remove PCR reps, I used clone_filter before implementing process_radtags. 
 - And yes, what I meant in my last bullet point by 'not implementing a minimum number of individuals' is that I omitted -r 0.8 from the populations program. In that case, the run was not aborted and the populations program finished.

I've attached the denovo_map.log output in case this can offer more insight. I really appreciate your help! 

Thank you very much,

Jenny Cocciardi

denovo_map.log
populations.log.distribs

Catchen, Julian

Apr 13, 2022, 10:29:34 AM
to stacks...@googlegroups.com

Hi Jenny,

 

You do have good coverage in your data; however, you appear to have very, very few RAD loci in most of your samples. There are a number of places in the log where this is visible, but when I extract how many loci are matched against your catalog for each sample, I get this:

 

18094 sample loci compared against the catalog containing 46194 loci.

1508 sample loci compared against the catalog containing 46194 loci.

4348 sample loci compared against the catalog containing 46194 loci.

20380 sample loci compared against the catalog containing 46194 loci.

4112 sample loci compared against the catalog containing 46194 loci.

2388 sample loci compared against the catalog containing 46194 loci.

958 sample loci compared against the catalog containing 46194 loci.

9965 sample loci compared against the catalog containing 46194 loci.

6699 sample loci compared against the catalog containing 46194 loci.

2179 sample loci compared against the catalog containing 46194 loci.

1573 sample loci compared against the catalog containing 46194 loci.

2085 sample loci compared against the catalog containing 46194 loci.

 

So, most of your 12 samples have ~2,000 loci or fewer, which makes it clear why an 80% threshold might fail to find any common loci. You might consider the biology of your species versus the restriction enzyme you used: how many RAD sites do you expect to see? If you run the data without filters, populations will tell you the distribution of how many loci are shared among how many individuals (of course, the populations.log.distribs file you shared is basically empty because all data were filtered).
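The per-sample counts quoted above can be extracted from the log with a one-liner; this is a sketch against a toy log fragment that reuses the line format shown earlier (a real denovo_map.log contains one such line per sample).

```shell
# Toy fragment of a denovo_map.log, using the line format quoted above.
cat > denovo_map.log <<'EOF'
18094 sample loci compared against the catalog containing 46194 loci.
1508 sample loci compared against the catalog containing 46194 loci.
4348 sample loci compared against the catalog containing 46194 loci.
EOF

# Pull out just the per-sample locus counts (first field of each matching line).
grep 'sample loci compared against the catalog' denovo_map.log | awk '{print $1}'
```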

Loic Pittet

Dec 19, 2022, 8:45:50 AM
to Stacks
Hi,

I would like, if possible, to come back to this problem. I am trying to optimise the parameters using the R80 method on a subset of my samples, and I am experiencing the same problem. Attached is some more info. The problem seems to happen because I define only one population. Is that possible?

Thank you in advance,

Loïc
denovo_map.log
populations.log
script_denovo_M3n3m3

Catchen, Julian

Dec 19, 2022, 3:43:39 PM
to stacks...@googlegroups.com

Hi Loïc,

 

Your analysis does not look the same as the thread you linked to. In particular, your samples all have good coverage and they seem to have been successfully integrated into the catalog. From the denovo_map.log file:

Depths of Coverage for Processed Samples:

S_alp_10630_IT_1: 73.23x

S_alp_PM20_068_SI_2: 56.84x

S_alp_brev_10582_IT_5: 70.44x

S_alp_brev_10614_IT_6: 43.40x

S_alp_brev_10761_AT_7: 47.86x

S_alp_brev_10787_AT_8: 57.36x

S_alp_brev_10803_AT_9: 62.68x

S_alpina_11111: 52.82x

S_alpina_11133: 39.94x

S_alpina_11148: 46.06x

S_alpina_11180: 33.83x

S_alpina_T22_4b4: 59.20x

S_brev_10865_CH_4: 72.12x

S_brev_10900_CH_3: 49.18x

S_breviserrata_10265: 29.60x

S_breviserrata_24_2014: 48.38x

S_breviserrata_EH10524: 68.27x

S_breviserrata_LP22181b: 56.43x

S_breviserrata_LP22279: 45.09x

S_breviserrata_LP22349: 65.54x

 

So, 40-60x coverage per sample is more than sufficient. It is not obvious to me why your samples do not appear to share loci across at least 80% of individuals (16 individuals). I would suggest you re-execute the populations program (you do not need to re-run the rest of the pipeline) without the r80 flag to see how many loci there are with no filters. I would also suggest you look at the populations.log.distribs file, which will tell you the number of loci shared between different numbers of samples, to give you an idea of how many shared loci you have and how much missing data you have.
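As a quick sanity check on what an 80% cutoff would keep, you can sum the n_loci column of the distribs table over the rows with enough samples. A sketch against a toy table (made-up counts; 16 of 20 samples, matching the threshold mentioned above):

```shell
# Toy samples_per_loc_prefilters table (counts invented for illustration).
cat > distribs.txt <<'EOF'
n_samples n_loci
2 5000
10 12000
16 800
18 300
20 150
EOF

# Sum loci present in at least 16 of 20 samples -- what an r80 filter would keep.
awk 'NR > 1 && $1 >= 16 {kept += $2} END {print kept}' distribs.txt
```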

Loic Pittet

Dec 20, 2022, 3:49:10 AM
to Stacks
Hi,

There is probably something else going on. Attached are the results for populations with -r 0.80 and without. 

I don't get what is wrong. Would appreciate any help!

Loïc

populations.log.distribs
R80.populations.log
R80.populations.log.distribs
populations.log

Catchen, Julian

Dec 20, 2022, 4:45:37 PM
to stacks...@googlegroups.com

Hi,

 

It seems like there is something funky with how you are running the software. First, you will want to use a reasonable set of parameters for both the R80 and non-R80 data sets, like M=n=3; you supplied M=n=1 for one and M=n=3 for the other in the logs you sent. Second, there is something strange with your underlying files; when I look at your distribs file, I see this:

BEGIN samples_per_loc_prefilters

# Distribution of valid samples matched to a catalog locus prior to filtering.

n_samples    n_loci

0      1004854

END samples_per_loc_prefilters

 

Which is saying that none of your loci were found in any of your samples, which is nonsensical. I would expect to see something more like this:

 

BEGIN samples_per_loc_prefilters

# Distribution of valid samples matched to a catalog locus prior to filtering.

n_samples    n_loci

1      111

2      76

3      37

4      42

5      52

6      61

7      202

8      1835

9      48516

10     49073

END samples_per_loc_prefilters

 

I would suggest running denovo_map.pl from scratch with M=n=3, no filters, and your ‘opt’ popmap from your underlying data (which was, I assume, processed with process_radtags?).

--
Stacks website: http://catchenlab.life.illinois.edu/stacks/
---
You received this message because you are subscribed to the Google Groups "Stacks" group.
To unsubscribe from this group and stop receiving emails from it, send an email to stacks-users...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/stacks-users/eb699b32-b9c9-498b-b645-7950ac7c5af5n%40googlegroups.com.

Loic Pittet

Dec 30, 2022, 4:27:26 AM
to Stacks
Hi,

I've figured out what the problem was. I used data from different sequencing plates: one half of the individuals has a read length of 91 bp and the other half 106 bp. That is why the populations program removed all loci when I used the "-r 0.80" argument. I have re-run the exact same analysis on the same individuals, but with all reads truncated to 91 bp, and it worked perfectly :)
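For reference, that kind of truncation can be done at the process_radtags stage with its -t (truncate) option; the trim itself amounts to cutting the sequence and quality lines of each FASTQ record, which can be sketched in plain awk (toy single-read FASTQ here, bases and qualities invented):

```shell
# Build a toy FASTQ with one 106 bp read (contents made up for illustration).
seq106=$(printf 'A%.0s' $(seq 1 106))
qual106=$(printf 'I%.0s' $(seq 1 106))
printf '@read1\n%s\n+\n%s\n' "$seq106" "$qual106" > reads.fq

# Truncate the sequence and quality lines (2nd and 4th line of each 4-line
# record) to 91 bp; process_radtags -t 91 applies the same trim in-pipeline.
awk '{ if (NR % 4 == 2 || NR % 4 == 0) print substr($0, 1, 91); else print }' \
    reads.fq > reads.91bp.fq
```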

Cheers,

Loïc PITTET

Noelia Guzmán

Nov 21, 2023, 12:34:50 PM
to Stacks
Hi! I'm having a similar problem when I try to optimize parameters in the denovo_map.pl pipeline. Although I have a lot of PCR duplicates, the coverage of each sample looks good for M=1 and n=1: 
BEGIN cov_per_sample
# Depths of Coverage for Processed Samples
sample  depth of cov  max cov number reads incorporated % reads incorporated
5503-Leiotettix-sanguineus  47.56 309 442638  87.4
7953-Ronderosia-bergii  50.31 479 944807  84.8
6205-Dichroplus-exilis  50.76 483 870470  84.0
3059-Atrachelacris-unicolor 49.53 493 680537  84.2
7964-Baeacris-punctulatus 50.63 628 894058  78.4
1313-Dichroplus-robustulus  52.33 463 1237662 83.5
3410-Dichroplus-paraelongatus 50.55 499 918914  84.2
1925-Dichroplus-elongatus 51.16 468 783905  84.7
1482-Dichroplus-intermedius 51.79 430 784337  83.3
1974-Dichroplus-schulzi 51.16 328 480220  84.3
1706-Dichroplus-conspersus  49.57 507 793286  85.5
9367-Dichroplus-alejomesai  51.97 608 984276  79.6
8666-Dichroplus-fuscus  50.77 508 1194162 83.8
Demo-Dichroplus-democraticus  49.00 481 411150  84.7
8982-Dichroplus-silveiraguidoi  48.32 501 677629  86.8
V507-Dichroplus-vittiegerum 48.30 221 83954 79.3
END cov_per_sample

I only recovered a very low number of loci: 
Batch 1

Removed 220780 loci that did not pass sample/population constraints from 220782 loci.
Kept 2 loci, composed of 1203 sites; 24 of those sites were filtered, 10 variant sites remained.
Number of loci with PE contig: 2.00 (100.0%);
  Mean length of loci: 591.50bp (stderr 20.50);
Number of loci with SE/PE overlap: 1.00 (50.0%);
  Mean length of overlapping loci: 622.00bp (stderr -nan); mean overlap: 15.00bp (stderr -nan);
Mean genotyped sites per locus: 596.50bp (stderr 25.50).
What could be the problem? 

Thanks in advance
Noelia

Catchen, Julian

Nov 22, 2023, 3:11:05 PM
to stacks...@googlegroups.com

Hi, we can’t diagnose the problem without seeing the commands you executed. You posted coverage numbers, I believe, from prior to the removal of PCR duplicates, but what was the coverage after removal of PCR duplicates (gstacks removes them)? And what filters did you ask for in populations, as the snippet you posted shows everything being filtered?

Noelia Guzmán

Nov 23, 2023, 5:41:20 AM
to stacks...@googlegroups.com
Hi Julian, thanks for your response. Indeed, those coverage values are from before the removal of PCR duplicates; afterwards they drop a lot! I have quite divergent species, which is why I also tried the same optimization on a population of more closely related species and managed to increase the number of loci a little. These are the parameters I use; I am optimizing the M and n parameters, which is why I use this loop:

for M in {1..12}; do
  out=$work/param_opt/denovo.M${M}
  mkdir -p $out
  cd $out
  denovo_map.pl --samples $work/barcodes_2 --popmap $work/popmap_1.tsv --out-path $out \
      --paired -M $M -n $M -T 12 --min-samples-per-pop 0.8 --rm-pcr-duplicates
done


The outputs are attached for more information.
Thanks a lot for your help.

denovo_map_m2.log
populations_m2.log

Catchen, Julian

Nov 30, 2023, 3:59:04 PM
to stacks...@googlegroups.com

Hi Noelia,

 

Yes, as you mention, you are losing 97% of your data as PCR duplicates. You also have a small number of RAD loci to start with (~20k per sample), but the assembly proceeds well and most of your reads form stacks properly. I am curious if you have checked any reference genomes from these organisms to understand the number of expected RAD loci? How many more would you expect than the 20k you are assembling? If it is a lot more, this is an additional sign that your DNA extractions did not capture the nuclear DNA present very well.

 

Try running the pipeline without filtering PCR duplicates to see what things look like with all your data. You may still have very few final loci, since if there are >>20k loci in the genomes, you may have captured a different subset of 20k from each different sample, so they then could not be compared across samples by the software.

 

julian

 

Noelia Guzmán

Dec 6, 2023, 7:48:10 AM
to stacks...@googlegroups.com
Thanks, Julian, for your response. I don't have a reference genome for these organisms.
To understand what was happening, I ran the pipeline without removing PCR duplicates, using only the forward reads and a single group (one species), and again a lot of loci were removed because they did not pass sample/population constraints. What could be the most probable cause? Does this confirm that I am finding a different subset of loci in each individual? I have attached the outputs for more information.
Thanks a lot for your help!
Removed 155025 loci that did not pass sample/population constraints from 155247 loci.
Kept 222 loci, composed of 32527 sites; 257 of those sites were filtered, 1308 variant sites remained.
Mean genotyped sites per locus: 146.52bp (stderr 0.33).
I used this command:
denovo_map.pl --samples $work/barcodes_2_forward_disable_vit --popmap $work/popmapvittatus.tsv --out-path $out -M 3 -n 3 -T 12 --min-samples-per-pop 0.8


populations.log
denovo_map.log