Julian,
Thanks for the reply.
Here is a quick outline of the percentage of reads containing at least
one "#" character after process_radtags for my current dataset:
I have about 52 million paired-end reads (26M each for P1 and P2).
These have been filtered by process_radtags in paired-end mode with
default settings, and all have matching or rescued barcodes. They have
not been trimmed at all; the RE site is still present on P1.
About 5.5% of the retained P1 reads still have at least one "#"
character, and about 3.6% of the P2 reads do. I'm not sure if this is
'good' or 'bad', but there are over 2 million reads like this. This
may be small in percentage terms, but it represents a lot of data to
leave on the cutting room floor if Stacks really is error tolerant.
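A count like this could be reproduced with a short script along the
following lines. This is only a sketch, not necessarily how the
numbers above were produced. It assumes unwrapped 4-line FASTQ records
(optionally gzipped), and the function names are hypothetical. Only
the quality line of each record is checked, because some Illumina read
headers also contain a "#".

```python
import gzip
from typing import Iterable, Tuple


def count_flagged_reads(lines: Iterable[str], flag: str = "#") -> Tuple[int, int]:
    """Return (reads_with_flag, total_reads) for 4-line FASTQ records.

    Assumption: every record is exactly 4 lines (header, sequence,
    separator, quality), i.e. sequences are not wrapped.
    """
    flagged = 0
    total = 0
    for i, line in enumerate(lines):
        if i % 4 == 3:  # the 4th line of each record is the quality string
            total += 1
            if flag in line.rstrip("\n"):
                flagged += 1
    return flagged, total


def count_flagged_in_file(path: str, flag: str = "#") -> Tuple[int, int]:
    """Same count, read from a plain or gzipped FASTQ file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return count_flagged_reads(handle, flag)


if __name__ == "__main__":
    # Sanity check of the totals quoted above: 26M reads per file,
    # 5.5% of P1 and 3.6% of P2 flagged.
    p1 = 26_000_000 * 0.055
    p2 = 26_000_000 * 0.036
    print(round(p1 + p2))  # roughly 2.37 million reads
```

With 26M reads in each file, 5.5% and 3.6% come to roughly 1.43M and
0.94M reads, which is where the "over 2 million" figure comes from.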
I used the above data to assemble paired-end contigs outside of
Stacks, so I don't have any cstacks results to report for that set,
but I have attached a PDF histogram of the number of SNPs per position
for another set of RAD data (plotting "SNP Column" from the cstacks
output).
You can certainly see a pattern of increasing SNP density near the
read end. Back-of-the-envelope calculations put it at about 5%
'extra' SNPs spread across the last 10 positions. I have seen similar
plots in at least 3 other independent RAD projects with different
setups and filtering steps. Does this pattern match what you or other
people are seeing? My sense is that this issue *might* be a bit more
acute when using a catalog constructed from many individuals, but I'm
not sure about that.
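For anyone who wants to check their own catalog, the per-position
tally and a rough 'extra SNPs' estimate could be sketched as below.
The helper names, the 0-based position convention, and the
`column_index` argument are all assumptions: the index of the SNP
column should be checked against the actual cstacks SNP file before
use. The 'excess' here is defined as the tail count above what the
average of the earlier positions would predict, as a fraction of all
SNPs, which may not be exactly the calculation behind the 5% figure.

```python
from collections import Counter
from typing import Iterable


def snps_per_position(rows: Iterable[str], column_index: int) -> Counter:
    """Tally SNPs by read position from tab-separated rows.

    column_index is the 0-based field holding the SNP column; it is a
    parameter because the file layout is an assumption here.
    """
    counts: Counter = Counter()
    for row in rows:
        if not row.strip() or row.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = row.rstrip("\n").split("\t")
        counts[int(fields[column_index])] += 1
    return counts


def tail_excess(counts: Counter, read_length: int, tail: int = 10) -> float:
    """Fraction of all SNPs that are 'extra' in the last `tail` positions.

    The baseline is the mean SNP count per position over the rest of
    the read; positions are assumed to run from 0 to read_length - 1.
    """
    if read_length <= tail:
        raise ValueError("read_length must be greater than tail")
    body = [counts.get(p, 0) for p in range(read_length - tail)]
    end = [counts.get(p, 0) for p in range(read_length - tail, read_length)]
    total = sum(body) + sum(end)
    if total == 0:
        return 0.0
    baseline = sum(body) / len(body)
    return (sum(end) - baseline * tail) / total
```

A flat histogram gives a value near zero, and a rise toward the read
end gives a positive value.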
The graphed catalog was constructed from 24 individuals from 6
populations and was quality filtered with a script that is more
stringent than process_radtags. The script removed all reads
containing Phred 2 scores, so I don't think they are the only cause
of this pattern. I'm trying to figure out how 'clean' your data
should be before running Stacks, and the best way to clean it.
-Ryan