ustack question

Maria BERNARD

unread,

Feb 27, 2013, 11:10:49 AM2/27/13

to stacks...@googlegroups.com

hi everyone,

I have some trouble to understand how ustack works.

I have a RADSeq data for several homozygous indiviuals. I use the stacks-0.99993 version.
I run ustack with this command:
ustacks -t fastq -f DEX35J11_1.fq -i 99 -o ustack_00 -d -r -m 3 -M 0 -N 0 --max_locus_stacks 3

first comment :
it seems that we can't avoid the adding of secondary lecture. These reads are problematic for me because they are used to detect SNP, which is not possible in my case.
I think that before using the ustack output files I need to remove all clusters that contain those SNP. What do you think?

second comment:
--max_locus_stacks 3 means (if I well understand) that each cluster must contain at most 3 stacks, a stacks being a cluster of uniq sequence.
but I observed that my 75554 cluster are composed of 76121 different stacks, and contain between 1 and 99 stacks (average =1,01 stacks per cluster ).
    grep -v -E 'secondary|consensus|model' DEX35J11_1.tags.tsv | awk '{print $3,$5}' | sort -u | awk '{print $1}' | sort | uniq -c | /home/mbernard/save/libPerl/Annotation.svn/colstat 1
    Nombre= 75554   Somme= 76121   Moyenne= 1.01   SD= 0.66   max= 99.00   min= 1.00   Mediane= 1.00
here is the distribution : DEX35J11_1.stacks_per_cluster.distrib.jpg

third comment :
-r and -d options. If I well understand those options remove stacks that are too much covered at different step of the clustering (1) by acomparing all staks or simply stacks that form a cluster/locus).
The 76121 stacks are composed of 720945 primary reads, and contain between 3 and 39021 reads (average = 9.47 primary reads per stacks)
    grep -v -E 'secondary|consensus|model' DEX35J11_1.tags.tsv | awk '{print $3,$5}' | uniq -c | awk '{print $1}' | /home/mbernard/save/libPerl/Annotation.svn/colstat 1
    Nombre= 76121   Somme= 720945   Moyenne= 9.47   SD= 168.36   max= 39021.00   min= 3.00   Mediane= 7.00
here is the distribution : DEX35J11_1.stacks_cov.distrib.jpg

Do you observe the same things? I think that I need to remove the cluster composed of more than 3 stacks or too much covered.

- forth and final comment:
I have some model lines that contain "U" character. What does they mean?
The U character is place at a position where we can observed 3 Gs and 16 Ts. It's not detected as a SNP (letter E in model line) because the variation is too poorly frequent?
0    99    1446        0    +    consensus         TGCAGGGAGAAACTTTGTTGCAGGGCAGAGGAAAAAGAGGGAGAAGCATCGGGGATAGCCACATTAGAAGGGGTGGGAGATGAGGAAATGTTGG    0    0    0
0    99    1446                model               OOOOOOOOOOOOOUOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOEOOOOOOOOOOOOOOOOOOOOOOOOOOO
0    99    1446                primary    0        TGCAGGGAGAAACGTTGTTGCAGGGCAGAGGAAAAAGAGGGAGAAGCATCGGGGATAGCCACATTAGAAGGGGTGGGAGATGAGGAAATGTTGG
0    99    1446                primary    0        TGCAGGGAGAAACGTTGTTGCAGGGCAGAGGAAAAAGAGGGAGAAGCATCGGGGATAGCCACATTAGAAGGGGTGGGAGATGAGGAAATGTTGG
0    99    1446                primary    0        TGCAGGGAGAAACGTTGTTGCAGGGCAGAGGAAAAAGAGGGAGAAGCATCGGGGATAGCCACATTAGAAGGGGTGGGAGATGAGGAAATGTTGG
0    99    1446                primary    1        TGCAGGGAGAAACTTTGTTGCAGGGCAGAGGAAAAAGAGGGAGAAGCATCGGGGATAGCCACATTAGAAGGGGTGGGAGATGAGGAAATGTTGG
0    99    1446                primary    1        TGCAGGGAGAAACTTTGTTGCAGGGCAGAGGAAAAAGAGGGAGAAGCATCGGGGATAGCCACATTAGAAGGGGTGGGAGATGAGGAAATGTTGG
0    99    1446                primary    1        TGCAGGGAGAAACTTTGTTGCAGGGCAGAGGAAAAAGAGGGAGAAGCATCGGGGATAGCCACATTAGAAGGGGTGGGAGATGAGGAAATGTTGG
0    99    1443    ....

one more question (sorry .... ) . For the cstack step, I thought to use 10 homozygous children for each of the two families I add. But if my individuals have no SNP, I will have some problems in the 2 last steps sstack and genotype no? Must I create 2 articial children representing all the individuals of each family and then generate the catalog on them? Waht is preferable?

Thanks for your advices

Maria

DEX35J11_1.stacks_cov.distrib.jpg

DEX35J11_1.stacks_per_cluster.distrib.jpg

Maria BERNARD

unread,

Feb 27, 2013, 11:12:08 AM2/27/13

to stacks...@googlegroups.com

Sorry ....

I just look at the three flags at the end of the consensus line in the DEX35J11.tags.tsv files corresponding of the options -r et -d.
So just to be sure, we need to remove those cluster manually before launching cstack?

regards

Maria

Julian Catchen

unread,

Feb 28, 2013, 12:10:50 PM2/28/13

to stacks...@googlegroups.com

No, cstacks and the other Stacks components knows about the blacklisting flags
and will not load stacks that are blacklisted.

julian

Julian Catchen

unread,

Feb 28, 2013, 12:15:35 PM2/28/13

to stacks...@googlegroups.com

Hi Maria,

Some answers below.

On 2/27/13 8:10 AM, Maria BERNARD wrote:
> I have a RADSeq data for several homozygous indiviuals. I use the stacks-0.99993
> version.
> I run ustack with this command:
> ustacks -t fastq -f DEX35J11_1.fq -i 99 -o ustack_00 -d -r -m 3 -M 0 -N 0
> --max_locus_stacks 3
>
> first comment :
> it seems that we can't avoid the adding of secondary lecture. These reads are
> problematic for me because they are used to detect SNP, which is not possible in
> my case.
> I think that before using the ustack output files I need to remove all clusters
> that contain those SNP. What do you think?
>

I'll have to look into this detail. In principle, if you set -N to 0 it should
not align secondary reads. What is the output from your ustacks command? It will
print the number of secondary reads aligned, what does it say for your data?

> - forth and final comment:
> I have some model lines that contain "U" character. What does they mean?
> The U character is place at a position where we can observed 3 Gs and 16 Ts.
> It's not detected as a SNP (letter E in model line) because the variation is too
> poorly frequent?

A "U" in the model line means "unknown", or the result of the likelihood ratio
test to decide between a homozygote and heterozygote was not statistically
significant, so the model couldn't decide.

> one more question (sorry .... ) . For the cstack step, I thought to use 10
> homozygous children for each of the two families I add. But if my individuals
> have no SNP, I will have some problems in the 2 last steps sstack and genotype
> no? Must I create 2 articial children representing all the individuals of each
> family and then generate the catalog on them? Waht is preferable?
>

You should be able to set the -n flag to cstacks. This will allow mismatches
between your homozygous samples. So it will detect sites fixed within a sample,
but varying between samples.

Best,

julian

Maria BERNARD

unread,

Mar 1, 2013, 5:32:57 AM3/1/13

to stacks...@googlegroups.com, jcat...@uoregon.edu

Hi Julian,

thanks for the answers. I checked the stacks reads coverage, and this time the distribution is more consistent with what I understand.
Nombre= 75532 Somme= 624141 Moyenne= 8.26 SD= 7.58 max= 344.00 min= 3.00 Mediane= 7.00
And for the number of stacks per cluster I exactly have 1 stacks per cluster, so what I expect with -M otpion set to 0.

For the option -N comment.
command line ustack pour homozygous individuals

ustacks -t fastq -f DEX35J11_1.fq -i 99 -o ustack_00 -d -r -m 3 -M 0 -N 0 --max_locus_stacks 3

execution output:
Min depth of coverage to create a stack: 3
Max distance allowed between stacks: 0
Max distance allowed to align secondary reads: 2
Max number of stacks allowed per de novo locus: 3
Deleveraging algorithm: enabled
Removal algorithm: enabled
Model type: SNP
Alpha significance level for model: 0.05
Parsing /work/sigenae/Project_SexyTrout/sample_filtered/SexyTrout_2/checkRAD/DEX35J11_1.fq
...
Loaded 843895 RAD-Tags; inserted 188003 elements into the RAD-Tags hash map.
1392 reads contained uncalled nucleotides that were modified.
Mean coverage depth is 9; Std Dev: 168.362 Max: 39021
Coverage mean: 9; stdev: 168.362
Deleveraging trigger: 177; Removal trigger: 346
Calculating distance for removing repetitive stacks.
Distance allowed between stacks: 1
Using a k-mer length of 45
Number of kmers per sequence: 49
Miniumum number of k-mers to define a match: 4
Removing repetitive stacks.
Removed 589 stacks.
75554 stacks remain for merging.
Calculating distance between stacks...
Number of kmers per sequence: 1
Miniumum number of k-mers to define a match: 1
Merging stacks, maximum allowed distance: 0 nucleotide(s)
75554 stacks merged into 75554 stacks; deleveraged 0 stacks; removed 0 stacks.
Mean merged coverage depth is 9.54212; Std Dev: 175.215; Max: 39646
Merging remainder radtags
122950 remainder sequences left to merge.
Distance allowed between stacks: 2
Using a k-mer length of 31
Number of kmers per sequence: 63
Miniumum number of k-mers to define a match: 1
Matched 53557 remainder reads; unable to match 69393 remainder reads.
Number of utilized reads: 774502
Writing results

It seems to not take into account the -N option, even with the 0.9994 stacks version.

Best regards

Maria

Julian Catchen

unread,

Mar 2, 2013, 10:29:47 PM3/2/13

to stacks...@googlegroups.com

Thanks, Maria. I have fixed the -N flag to ustacks to accept 0. The fix will be
in the next release.

Best,

julian

Reply all

Reply to author

Forward