Paired end reads when running scripts by hand

293 views
Skip to first unread message

Eric Fuchs

unread,
Apr 3, 2018, 1:25:21 PM4/3/18
to Stacks
Hello,

  I am analyzing 90 samples from ddRAD sequencing, and have paired-end reads.   

I ran de denovo_map.pl with the --paired option and it ran ustacks correctly on 90 samples but failed (segmentation fault) when running cstacks.

I want to run 'cstacks' and 'sstacks' by hand, but did not find any option for paired reads on those commands. 
Do I need to do anything special for paired/end reads?

I am very new to stacks.

Eric,

Nicolas Rochette

unread,
Apr 3, 2018, 2:13:10 PM4/3/18
to Stacks
Hi Eric,

Please provide us with actual commands, logs, etc. There is too much
about your analysis we can't guess.

Regards,

Nicolas

Eric Fuchs

unread,
Apr 17, 2018, 9:46:09 AM4/17/18
to Stacks
Hi,
Sorry for the lack of detail. I have been trying to solve this on my own and I seem to have gotten far, but have a lot of questions.
 
 I have 90 samples from a tropical tree (15 individuals from 6 different populations) sequences with paired-end ddRAD on an HiSeq 4000, no reference genome. The sequences were demultiplexed and filtered with process_radtags at the sequencing facility.

I started de novo mapping using the denovo pipeline with the following command>

denovo_map.pl -T 12 -m 3 -M 2 -n 1 -o ~/cas/stacks/StacksOutput/ --samples ~/cas/stacks/ --popmap ~/cas/stacks/popmap --paired

The ustacks program showed an average coverage of approximately 11X, however some samples had 8X.  
Is this coverage too low??? 
Is this a problem with the sequencing facility???


The program ran well until tsv2bam where it stopped because I ran out of space. I moved the samples to a larger disk on the cluster and ran  tsv2bam with the following command:
 
tsv2bam -P /home/efuchs/data/cas/stacks/StacksOutput -t 12 -M /home/efuchs/data/cas/stacks/popmap -R /home/efuchs/data/cas/stacks/

Here I have a question:
I have the raw sequences from the facility in /home/efuchs/data/cas/stacks . 
In the -R option, should I have included directory /home/efuchs/data/cas/stacks/StacksOutput where the processed sequences are?

The program completed without any errors.

After that I ran gstacks with the following command:

gstacks -P /home/efuchs/data/cas/stacks/StacksOutput  -t 12 -M /home/efuchs/data/cas/stacks/popmap


gstacks completed without error. 

More Questions
 The log file says that 66.6% of loci need phasing; is this an error?
 How many snps do I have? Where can I count that number?


then I ran populations with the following command:

populations -P /home/efuchs/data/cas/stacks/StacksOutput -M /home/efuchs/data/cas/stacks/popmap -O /home/efuchs/data/cas/stacks/StacksOutput/PopsOut/ -r 0.9 -p 4 -t 12 --structure --genepop  &


The program finished and produced a structure file  and a genepop file with over 40K loci. 

Question and worries:
 I am worried though, that the average cover according to ustacks was low, then how did I recover that many loci? Am I doing something wrong? Are these spurious loci?

 I will run some basic popgenetics on the data and post back to see if they seem ok.  I will also start optimizing -M -m -n paramters on a reduced number of individuals (5 per pop).

 
Thanks for any help or answer to any of my multple questions. This is my first time running "STACKS" or using NGS of any sort and I feel like I was dropped on the deep end of the pool with cement shoes on.

Eric,

Julian Catchen

unread,
Apr 19, 2018, 3:27:49 PM4/19/18
to stacks...@googlegroups.com, Eric Fuchs
Hi Eric,

Answers interleaved below.

Eric Fuchs wrote:
> Hi,
> Sorry for the lack of detail. I have been trying to solve this on my own
> and I seem to have gotten far, but have a lot of questions.
> I have 90 samples from a tropical tree (15 individuals from 6
> different populations) sequences with paired-end ddRAD on an HiSeq 4000,
> no reference genome. The sequences were demultiplexed and filtered with
> process_radtags at the sequencing facility.
>
> I started de novo mapping using the denovo pipeline with the following
> command>
>
> denovo_map.pl -T 12-m 3-M 2-n 1-o ~/cas/stacks/StacksOutput/--samples
> ~/cas/stacks/--popmap ~/cas/stacks/popmap --paired
>
> The ustacks program showed an average coverage of approximately 11X,
> however some samples had 8X.
> Is this coverage too low???

That coverage is lower than we like to have, but not necessarily
terrible. If you have any extremely low coverage samples (<5x) you may
consider excluding them.

> Is this a problem with the sequencing facility???

Not usually. Typically this results from over multiplexing the number of
samples together, so whoever constructed the molecular library did not
estimate coverage as well as they should have.

You can fix this problem by resequencing the library.

>
> The program ran well until tsv2bam where it stopped because I ran out of
> space. I moved the samples to a larger disk on the cluster and ran
> tsv2bam with the following command:
> |
> tsv2bam -P /home/efuchs/data/cas/stacks/StacksOutput-t 12-M
> /home/efuchs/data/cas/stacks/popmap -R /home/efuchs/data/cas/stacks/
> |
>
> *Here I have a question*:
> I have the raw sequences from the facility in
> /home/efuchs/data/cas/stacks .
> In the -R option, should I have included directory
> /home/efuchs/data/cas/stacks/StacksOutput where the processed sequences are?
>

The -R option should point to the raw, but cleaned paired-end reads.
They should be in the same location as the original single-end data that
you fed to denovo_map/ustacks.

> The program completed without any errors.
>
> After that I ran gstacks with the following command:
>
> gstacks -P /home/efuchs/data/cas/stacks/StacksOutput-t 12-M
> /home/efuchs/data/cas/stacks/popmap
>
>
> gstacks completed without error.
>
> *More Questions*:
> The log file says that 66.6% of loci need phasing; is this an error?

This means the program was able to phase 2/3 of the total loci in the
data set, it is not an error. In the latest release of the pipeline we
have raised this number typically >90%.

> How many snps do I have? Where can I count that number?

First run populations. Easiest place is probably the
populations.sumstats.tsv file.

>
>
> then I ran populations with the following command:
>
> |
> populations -P /home/efuchs/data/cas/stacks/StacksOutput-M
> /home/efuchs/data/cas/stacks/popmap -O
> /home/efuchs/data/cas/stacks/StacksOutput/PopsOut/-r 0.9-p 4-t
> 12--structure --genepop &
> |
>
>
> The program finished and produced a structure file and a genepop file
> with over 40K loci.
>
> *Question and worries:*
> I am worried though, that the average cover according to ustacks was
> low, then how did I recover that many loci? Am I doing something wrong?
> Are these spurious loci?

Seems like you have a lot of data, that's good! It should not be
spurious if your set of SNPs can be found in 90% of your samples, that
is good evidence it is real.

> I will run some basic popgenetics on the data and post back to see if
> they seem ok. I will also start optimizing -M -m -n paramters on a
> reduced number of individuals (5 per pop).
>
> Thanks for any help or answer to any of my multple questions. This is my
> first time running "STACKS" or using NGS of any sort and I feel like I
> was dropped on the deep end of the pool with cement shoes on.
>
> Eric,
>

Best,


julian

Reply all
Reply to author
Forward
0 new messages