Hi Eric,
Answers interleaved below.
Eric Fuchs wrote:
> Hi,
> Sorry for the lack of detail. I have been trying to solve this on my own
> and I seem to have gotten far, but have a lot of questions.
> I have 90 samples from a tropical tree (15 individuals from 6
> different populations) sequences with paired-end ddRAD on an HiSeq 4000,
> no reference genome. The sequences were demultiplexed and filtered with
> process_radtags at the sequencing facility.
>
> I started de novo mapping using the denovo pipeline with the following
> command>
>
>
denovo_map.pl -T 12-m 3-M 2-n 1-o ~/cas/stacks/StacksOutput/--samples
> ~/cas/stacks/--popmap ~/cas/stacks/popmap --paired
>
> The ustacks program showed an average coverage of approximately 11X,
> however some samples had 8X.
> Is this coverage too low???
That coverage is lower than we like to have, but not necessarily
terrible. If you have any extremely low coverage samples (<5x) you may
consider excluding them.
> Is this a problem with the sequencing facility???
Not usually. Typically this results from over multiplexing the number of
samples together, so whoever constructed the molecular library did not
estimate coverage as well as they should have.
You can fix this problem by resequencing the library.
>
> The program ran well until tsv2bam where it stopped because I ran out of
> space. I moved the samples to a larger disk on the cluster and ran
> tsv2bam with the following command:
> |
> tsv2bam -P /home/efuchs/data/cas/stacks/StacksOutput-t 12-M
> /home/efuchs/data/cas/stacks/popmap -R /home/efuchs/data/cas/stacks/
> |
>
> *Here I have a question*:
> I have the raw sequences from the facility in
> /home/efuchs/data/cas/stacks .
> In the -R option, should I have included directory
> /home/efuchs/data/cas/stacks/StacksOutput where the processed sequences are?
>
The -R option should point to the raw, but cleaned paired-end reads.
They should be in the same location as the original single-end data that
you fed to denovo_map/ustacks.
> The program completed without any errors.
>
> After that I ran gstacks with the following command:
>
> gstacks -P /home/efuchs/data/cas/stacks/StacksOutput-t 12-M
> /home/efuchs/data/cas/stacks/popmap
>
>
> gstacks completed without error.
>
> *More Questions*:
> The log file says that 66.6% of loci need phasing; is this an error?
This means the program was able to phase 2/3 of the total loci in the
data set, it is not an error. In the latest release of the pipeline we
have raised this number typically >90%.
> How many snps do I have? Where can I count that number?
First run populations. Easiest place is probably the
populations.sumstats.tsv file.
>
>
> then I ran populations with the following command:
>
> |
> populations -P /home/efuchs/data/cas/stacks/StacksOutput-M
> /home/efuchs/data/cas/stacks/popmap -O
> /home/efuchs/data/cas/stacks/StacksOutput/PopsOut/-r 0.9-p 4-t
> 12--structure --genepop &
> |
>
>
> The program finished and produced a structure file and a genepop file
> with over 40K loci.
>
> *Question and worries:*
> I am worried though, that the average cover according to ustacks was
> low, then how did I recover that many loci? Am I doing something wrong?
> Are these spurious loci?
Seems like you have a lot of data, that's good! It should not be
spurious if your set of SNPs can be found in 90% of your samples, that
is good evidence it is real.
> I will run some basic popgenetics on the data and post back to see if
> they seem ok. I will also start optimizing -M -m -n paramters on a
> reduced number of individuals (5 per pop).
>
> Thanks for any help or answer to any of my multple questions. This is my
> first time running "STACKS" or using NGS of any sort and I feel like I
> was dropped on the deep end of the pool with cement shoes on.
>
> Eric,
>
Best,
julian