Read coverage: what is -m

Hazel Perry

unread,

Jun 9, 2015, 2:13:16 PM6/9/15

to stacks...@googlegroups.com

Hi all,

I have run my analysis and am currently writing it up. I have got to the point of saying my average read coverage is x and come a bit stuck about finding this out. I have a reference genome so reads were processed with process_radtags then mapped then run through stacks using a -m value of 5. I had thought the -m value was the minimum number of reads required to make a stack so with a reference genome this would be the number of reads mapped to one location on the genome, making it the locus depth. However I have just seen a discussion talking about whether this parameter is allele depth/locus depth for the different points in stacks. Can someone clarify exactly what -m would be refering to in pstacks and populations?

Also I would like to say my average coverage per tag per individual was x, but I'm not sure how to go about finding this, I seem to remember it was reported in the terminal during one of the steps but I didn't save this. I know there is allele-depth in the .tags file but I'm after locus/tag depth.

Thanks
Hazel

Mark Ravinet

unread,

Jun 10, 2015, 3:20:13 AM6/10/15

to stacks...@googlegroups.com

Hi Hazel,

From the 2013 Mol Ecol paper on stacks, under the section on reference alignment:

a threshold can be set in pstacks (-m) to require a minimum number of reads before declaring a set of aligned reads a locus

My interpretation for this has always been that when using a reference, -m is the minimum number of reads at a RAD locus. This therefore has the same meaning for the populations module.

As for the coverage per tag, you can quickly calculate this using vcftools (http://vcftools.sourceforge.net/) and it's --site-mean-depth option on the vcf file produced by populations. This will give the coverage per SNP but by extension you can use that to get the per-stack coverage.

Hope that helps

Mark

Hazel Perry

unread,

Jun 10, 2015, 7:09:25 AM6/10/15

to stacks...@googlegroups.com

Hi Mark,
Thanks, thats great. Can I just clarify to get the per stack/tag coverage from the site depth I could do (site mean depth* number of sites) / number of tags, would this be right or I am on completely the wrong track?

I am after something like this, from the Hohenlohe et al 2011 trout paper:
"Reads were filtered for overall quality and presence of a barcode and then aligned with each other to create a catalogue of 98 190 putative loci, or RAD tags. A total of 20.9 million reads contributed to this alignment, for a mean depth of 8.85× per individual per tag (one WCT individual had a mean coverage of 0.7× because of low DNA concentration, while the other individuals ranged from 2.6× to 20.7×; variation among individuals is often observed in a pooled RAD sequencing library)"
Do you have any idea how I could then get mean coverage per tag from each individual? I had wondered about using the --depth option in vcftools but would this not give me the depth over all loci rather than per locus? (I'm just creating my vcf file so haven't had chance to run it yet)

In a perfect world I'd like to be able to make the plots in Hohenhole et al 2010 (stickleback paper) which show number of rad tags across the genome and sequencing depth per rad tag per individual in sliding windows, but I think this would require a purpose made script and as my scripting knowledge is almost non-existent and my time very limited I think I'll just stick with the numbers!

Thanks
Hazel

Mark Ravinet

unread,

Jun 11, 2015, 2:31:44 AM6/11/15

to stacks...@googlegroups.com

Hi Hazel,

Ideally you would want this data from the stacks output, (since that presumably counts the actual number of reads). It does write this to the standard output and for future reference you can save this to a file (and still write to the stdout) like so:

stacks command -options | tee myoutfile.txt

The vcf will contain the read depth for that specific site. Since the Stacks pipeline uses reads of equal lengths, the read depth of any variant site will be the same as the stack depth. You don't need to do anything to that value to get the coverage for your locus.

You can get vcftools to give you the mean depth per individual, mean depth per locus or even a matrix of depths at each genotype using the --depth --site-mean-depth and --geno-depth options respectively. This last option will give you the raw numbers you need to recreate the plot from the Hohenlohe et al 2010 paper. You can quite easily perform a sliding window depth estimate using the genotype-depth matrix in R with this.

Below is some basic code I have used to perform sliding window analyses on dataframes in R. The example below should work for a 100 Mb alignment (just change the value in the seq command below).

# set sliding window and step size 
window <- 100000
step <- 100000 # this means non-overlapping
# generate left coordinates
left <- seq(1, (100*10^6), window)
# generate right coordinates
right <- left + (step-1)
# use sapply to estimate mean individual depth per window
sapply(1:length(left), function(z){
  # create window
  depth_window <- myData[myData$POS >= left[z] & myData$POS <= right[z], ]
  # use apply to calculate individual means for window
  ind_depth <- apply(depth_window[, 3:ncol(depth_window)], 2, function(y), mean, na.rm = TRUE)
  # take mean of individual depths in window and return it
  mean_ind_depth <- mean(ind_depth, na.rm = TRUE)
  return(mean_ind_depth)
})