Sstacks to identify shared SNPs between samples

1,094 views
Skip to first unread message

Ruth

unread,
Dec 10, 2012, 5:56:32 AM12/10/12
to stacks...@googlegroups.com

Dear all,


Sorry in advance for long post. I am trying to give as much detail as I can without writing a novel.


I am using STACKS to analyse the sequences that I got from a pilot study using SbfI. These are single end sequences generated by Ilumina HiSeq.

Since I don't have a reference genome for my organism I am running several parts of the STACKS separately  to identify SNPs. Process_radtags + ustacks + cstacks/sstacks


I am now having some trouble interpreting some of the outputs from sstacks/cstacks.


For example, if I consider two of my samples: Ire 11 and Ire16. When I run ustacks using m=5 I get:

Number SNPs Ire11= 2848

Number SNPs Ire 16 = 3427


Afterwards, when I build a catalogue using csatcks of both samples and I get a total of 5407 SNPs. 

So, my understanding is that Stacks identifies 5407 SNPs in either Ire11, Ire 16, or both (i.e. the SNPs may not be in Radtags that have been sequenced/identified in both Ire11/Ire16)


However, what I really want to know is how many of the SNPs positions are in RadTags that are shared between both samples.

This is to have an idea of 1) how many of the RadTags have been sequenced in both samples and, 2) how many of these Radtags have SNPs (herezozygote only in Ire11, heterozygote only in Ire16, or heterozygote in both) . 

By doing this, I will have an idea of how much sequencing effort ($$$$$) I will need to put in my future population genetics study 

(i.e. using ~100 samples in my study; how many reads do I need to get lets say > 1000 shared SNPs????)


So, I thought that sstacks was the answer to my trouble. I have been using the XXX.tags.tsv, XXX.alleles.tsv, and XXX.snps.tsv obtained form ustacks for one of the samples (Ire11) as "catalogue" and afterwards I have matched the 

XXX.tags.tsv, XXX.alleles.tsv, and XXX.snps.tsv files from the other sample (Ire16) against the first one (Ire11) to find shared Radtags and shared SNPs. 

I got a promising output file called "sample_Ire16.matches.tsv", that looks like this below (it has 48963 lines):


Headings: SQL ID (not using this); Batch ID (0 because not using a cstacks output and therefore batch IDs as input when running sstacks); catalogue ID (stack ID in Ire11); Sample ID (Ire16=2 in my command line); Stack ID (in Ire16); Haplotype; Stack Depth   


0       0       62       2       69      consensus       25

0       0       63       2       70      consensus       47

0       0       65       2       72      consensus       27

0       0       66       2       73      consensus       18

0       0       68       2       75      consensus       38

0       0       69       2       77      A             23

0       0       69       2       77      G                     24

0       0       87       2       81      consensus       79

0       0       88       2       84      consensus       69

0       0       79       2       88      consensus       127

0       0       81       2       94      consensus       75

0       0       86       2       97      consensus       62

0       0       13364     2       105     G                    39

0       0       66573     2       123     consensus       66

0       0       143     2       190     consensus       67


And now there come my questions:


Besides the promising output file called "sample_Ire16.matches.tsv", I get more output saying:

Parsing ./ustacks_all_reads/stacks_CAGGTG_Ire11_m10p001/sample_CAGGTG.snps.tsv

  Parsing ./ustacks_all_reads/stacks_CAGGTG_Ire11_m10p001/sample_CAGGTG.alleles.tsv

  Parsing ./ustacks_all_reads/stacks_ATTCTG_Ire16_m10p001/sample_ATTCTG.tags.tsv

  Parsing ./ustacks_all_reads/stacks_ATTCTG_Ire16_m10p001/sample_ATTCTG.snps.tsv

  Parsing ./ustacks_all_reads/stacks_ATTCTG_Ire16_m10p001/sample_ATTCTG.alleles.tsv

Searching for sequence matches...

81252 stacks matched against the catalog containing 80997 loci.

  50572 matching loci, 1827 contained no verified haplotypes.

  32 loci matched more than one catalog locus and were excluded.

  1794 loci contained SNPs unaccounted for in the catalog and were excluded. question 1: Are these shared Radtags that have 2 alleles in Ire16 (heterozygotes ) but only one allele (homozygotes) in Ire 11????

  50843 total haplotypes examined from matching loci, 48963 verified.



From the "sample_Ire16.matches.tsv" file, I have assumed that "consensus" means that this RadTag was shared between Ire 11 and Ire16 but there was not a SNP on it.

So I selected only those lines with no consensus on them (grep -v "consensus" sample_Ire16.matches.tsv > SNPs_positions ) and the file "SNP_positions" looks like this:


0 0 69 2 77       A        23

0 0 69 2 77      G        24

0 0 13364 2 105      G        39

0 0 958 2 215      A                48

0 0 902 2 229     GT        12

0 0 1016 2 248     GCCCCT        18

0 0 957 2 280     A                 15

0 0 1020 2 298      CCACCT         27

0 0 3050 2 329      C         32

0 0 1022 2 337     GCGCCTGCCC        182


And now it comes question 2: How do I interpret this???


- If a number is recorded twice in column 3 and column 5; is this a SNP (heterozygote) in both Ire 11 and Ire 16?

- If a number is recorded once in column 3 and column 5; is this a SNP (heterozygote) in Ire 11 but a homozygote in Ire 16?

- When a number is recorded >twice in either column 3 or column 5 (It happened several times). Shall we consider this an sequencing error??


Thank a lot!!!


Ruth


Ryan Waples

unread,
Dec 10, 2012, 1:34:00 PM12/10/12
to stacks...@googlegroups.com
Hello Ruth,

There is a lot going here, but I'll try my best to provide a little bit of help.

A question

1) Are you only concerned with the heterozygosity of samples Ire11 and
Ire16, or are you just using them as example of a larger data set?

And comment(s):

I'm not sure I understand why you are not following the more
'traditional' methods of Stacks, as I don't think you have such unique
questions that you need to roll your own analysis pipeline. You may
be able to get a better sense of your sequence data with the following
steps (but perhaps this won't work for you for some reason I missed).

run ustacks on each sample individually (you have already done this)
(you may also try m<5, as 5 can end up grouping diverged sequences)

run cstacks to generate a catalog. If you have <20 total samples you
could use all of them. With more than that you may what to only pick a
subset for the catalog. Building a catalog will combine all the
observed variation in one place, as well as filter out black flagged
tags. You don't necessarily want *all* the variation you observed in
your catalog, as some (much) of it can be meaningless.

run sstacks for each of your samples against the catalog above.

run populations to export a set of genotypes at SNPs. Pay attention
to the difference between SNP genotypes (many possible per tag) and
coded haplotypes (one per tag). Populations will output a genotype
call for each SNP.

You can then use the assigned genotypes and heterozygosity to evaluate
your sequences, paying attention to SNP located within the same tag.

As for your questions about sstacks, you seem to be using it to match
up two outputs of ustacks. Most folks use it to match a ustacks
output to a cstacks output. Stacks isn't really designed with such a
direct pairwise comparison method in mind. If this is your goal I
would suggest building a catalog out of one individual, then matching
against that. I think your interpretation of the sstacks results was
pretty good, but tough to evaluate given my lack of experience in
doing what you did.

As for your question 1:
I think those unmatched tags would represent alternate homozygous
calls between samples at at least one SNP in the tags. You may be
able to get them to merge with the sstacks -x flag, but not sure of
the implications. Also not sure of the implications of excluding all
tags that don't match in this way, could bias your estimates of
diversity.

question 2
"- If a number is recorded twice in column 3 and column 5; is this a
SNP (heterozygote) in both Ire 11 and Ire 16?"
- yes i think so
"- If a number is recorded once in column 3 and column 5; is this a
SNP (heterozygote) in Ire 11 but a homozygote in Ire 16?"
- yes i think so
"- When a number is recorded >twice in either column 3 or column 5 (It
happened several times). Shall we consider this an sequencing error??"
- I don't think this is necessarily the correct interpretation

If a number appears in column 3 twice (or more), but with different
col 5 values you may simply have more than a single SNP within the
tag. (most likely)
Tri allelic snps are also a possibility. (less likely)
Sequencing error (likely)
Over merged tags (possible given your high values of -m flag)
maybe more possibilities as well..

Good luck

-Ryan
> --
> For more options or to unsubscribe: http://groups.google.com/group/stacks-users
> Stacks website: http://creskolab.uoregon.edu/stacks/

Julian Catchen

unread,
Dec 10, 2012, 9:06:09 PM12/10/12
to stacks...@googlegroups.com
Hi Ruth,

Good advice from Ryan. I agree with him that you are making things more complex
than necessary to answer your question (how many SNPs are available and how many
are shared?).

The best way to run the pipeline in your case is to simply execute the
denovo_map.pl program. Submit your two samples to denovo_map.pl and it will run
ustacks on each, and then build and match against the catalog. The output can
then be interpreted as Ryan describes. (This can be done by hand, but takes more
work.)

Additionally, however, if you enable the database and web interface, the
web-based filters will immediately answer your questions regarding SNPs without
further need to parse output files and it will let you browse your data, which
is quite valuable when working on a non-model organism.

Best,

julian

Ruth

unread,
Dec 11, 2012, 3:50:36 AM12/11/12
to stacks...@googlegroups.com

Hello again,


Thanks so much for your helpful replies and for your patience reading my long post. 

Yes, I see now that running denovo_map.pl can be the answer to my trouble. Seems quite obvious now but sometimes it is difficult to see the obvious, especially when you work on your own!

Regarding Ryan questions about my experiments. I only have results for 2 samples for now. Trying to scale up the rest of the experiments (number of samples and Illumina lanes to use based on this) although I realise that having a pilot study with few more individuals would have been a clear advantage when trying to infer diversity indexes.


Cheers,


Ruth

ananta

unread,
Dec 11, 2012, 8:15:20 AM12/11/12
to stacks...@googlegroups.com
Ruth, using just two and inferring total number of RAD sites and shared sites may be very misleading. From, my experience, I have seen a lot of shared loci when we used 6 genotypes for pilot study and extended further. and, It will reduce when you increase individuals. The best bet would be having less number of unique sites within an individual. 

Thanks

Ruth

unread,
Dec 11, 2012, 9:54:18 AM12/11/12
to stacks...@googlegroups.com

Hello all,


Here I come again.


I have tried denovo_map.pl  with my two samples. 

Once using one sample as parent and the other as progeny:

 denovo_map.pl -o ./denovomap -p ./sample_CAGGTG.fq -r ./sample_ATTCTG.fq -m 5 -n 2 -T 4 -b 1 -t 

And another time using both samples as parents:


 denovo_map.pl -o ./denovomap -p ./sample_CAGGTG.fq -p ./sample_ATTCTG.fq -m 5 -n 2 -T 4 -b 100 -t 


Still don't know what would be best or what would be the difference.


And these files are generated, so it seems to work:

batch_1.catalog.alleles.tsv  batch_1.genotypes_1.tsv   batch_1.markers.tsv        sample_CAGGTG.snps.tsv    sample_ATTCTG.snps.tsv

batch_1.catalog.snps.tsv     batch_1.genotypes_1.txt   denovo_map.log             sample_CAGGTG.tags.tsv     sample_ATTCTG.tags.tsv

batch_1.catalog.tags.tsv     batch_1.haplotypes_1.tsv  sample_CAGGTG.alleles.tsv sample_ATTCTG.alleles.tsv


The output from ustacks and stacks is OK (these are XXX.alleles.tsv, XXX.tags.tsv, and XXX.snps.tsv) BUT the files XXX.genotypes, XXX.haplotypes and XXX.markers are empty.


This is some of the denovo_map.log report that I get:


head -20 denovo_map.log 


ref_map.pl started at 2012-12-11 11:30:22

/usr/local/bin/denovo_map.pl -m 3 -M 3 -n 2 -T 15 -b 1 -t -D RuthTest -o ./denovomap -p ./sample_CAGGTG.fq -r ./sample_ATTCTG.fq

Identifying unique stacks; file   1 of   2 [sample_CAGGTG]

/usr/local/bin//ustacks -t fastq -f ./sample_CAGGTG.fq -o ./denovomap -i  -d -r -m 3 -M 3    -p 15 2>&1

Min depth of coverage to create a stack: 3

Max distance allowed between stacks: 3

Max distance allowed to align secondary reads: 5

Deleveraging algorithm: disabled

Removal algorithm: enabled

Model type: SNP

Parsing ./sample_CAGGTG.fq

Loaded 3306171 RAD-Tags; inserted 457376 elements into the RAD-Tags hash map.

  2546 reads contained uncalled nucleotides that were modified.

  Mean coverage depth is 24; Std Dev: 179.206 Max: 46659

Coverage mean: 24; stdev: 179.206

Deleveraging trigger: 203; Removal trigger: 382

Calculating distance for removing repetitive stacks.

  Distance allowed between stacks: 1

  Using a k-mer length of 43

  Number of kmers per sequence: 46


tail denovo_map.log 


Writing SQL markers file to './denovomap/batch_1.markers.tsv'

Writing SQL genotypes file to './denovomap/batch_1.genotypes_1.txt'

Writing 0 loci to observed haplotype file, './denovomap/batch_1.haplotypes_1.tsv'

mysql --defaults-file=/usr/local/share/stacks/sql/mysql.cnf  -e "LOAD DATA LOCAL INFILE './denovomap/batch_1.markers.tsv' INTO TABLE markers "

mysql --defaults-file=/usr/local/share/stacks/sql/mysql.cnf  -e "LOAD DATA LOCAL INFILE './denovomap/batch_1.genotypes_1.txt' INTO TABLE catalog_genotypes "

/usr/local/bin//index_radtags.pl -D  -t -c 2>&1

DBI connect('-t:mysql_read_default_file=/usr/local/share/stacks/sql/mysql.cnf','',...) failed: Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (2) at /usr/local/bin//index_radtags.pl line 570

Unable to connect to the -t MySQL Database!

Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (2) at /usr/local/bin//index_radtags.pl line 570.

denovo_map.pl completed at 2012-12-11 11:37:06



Any tips please??? Could this be a problem because I am not using the MySQL database?


Thanks,


Ruth




On Monday, December 10, 2012 11:56:32 AM UTC+1, Ruth wrote:

ananta

unread,
Dec 11, 2012, 10:05:39 AM12/11/12
to stacks...@googlegroups.com
Ruth,

Try with both samples as population -s instead of -p or -r

also try -m 2, 3 so, that you can see if any loci were missed because of your read depth criteria. you have not specified -M, I guess default is 2 but you better work with your parameter. may be -M 2?



denovo_map.pl -o ./denovomap -s ./sample_CAGGTG.fq -s ./sample_ATTCTG.fq -m 5 -n 2 -M 2 -T 4 -b 1 -t 





Thanks

Ruth

unread,
Dec 11, 2012, 12:18:20 PM12/11/12
to stacks...@googlegroups.com
Hello Ananta,

Thanks a lot for your good tips.

Going along with your suggestion;
denovo_map.pl -o ./denovomap -s ./sample_ATTCTG.fq -s ./sample_CAGGTG.fq -m 5 -n 2 -M 2 -T 4 -b 1 -t 

I get the files (note that no catalogue is created):

batch_1.populations.log  sample_ATTCTG.alleles.tsv  sample_ATTCTG.tags.tsv     sample_CAGGTG.snps.tsv
denovo_map.log              sample_ATTCTG.snps.tsv     sample_CAGGTG.alleles.tsv  sample_CAGGTG.tags.tsv

And batch_1.populations.log is empty...

By specifying -s, denovo_map.pl assumes that the input data format is a Bowtie/SAM file and not a fq file (which is my case).

Thanks,

Ruth

ananta

unread,
Dec 11, 2012, 12:36:56 PM12/11/12
to stacks...@googlegroups.com
Ruth, 

This seems interesting as in previous versions -s would allow fastq files, I don't know whether they changed it or it is just typo in update. 
since you have not installed mysql, you can not see them in web browser. So, you may have to do some manual stuffs to compare. 
If I was you I would do:

1) ustacks (give your parameters here -M 2 -m 3?) 

you can use your already created stacks and go to next step 

2) cstacks (make catalog from both samples)
3) sstacks (match both files to this catalog)

then you should see *.matches files and you can do more filtering. Rather than searching with grep, I would merge two matches files with by catalog ID (not stacks ID)..and hopefully you will get your answer. 
 

Ananta

Ruth

unread,
Dec 12, 2012, 9:59:43 AM12/12/12
to stacks...@googlegroups.com
Thanks again Ananta,

This sounds like a good plan to me ;-)

Ruth

Julian Catchen

unread,
Dec 12, 2012, 8:54:20 PM12/12/12
to stacks...@googlegroups.com
Hi Ruth,

That is just a typo, denovo_map.pl is meant for data without a reference genome,
so FASTA or FASTQ. I have fixed the typo in the source and on the website.

I am guessing that cstacks is crashing due to your input data. What is the
output of denovo_map.pl on the command line, and what does the log file say in
./denovomap/denovo_map.log?

julian

On 12/11/12 9:18 AM, Ruth wrote:
> Hello Ananta,
>
> Thanks a lot for your good tips.
>
> Going along with your suggestion;
> denovo_map.pl -o ./denovomap -s ./sample_ATTCTG.fq -s ./sample_CAGGTG.fq -m 5 -n
> 2 -M 2 -T 4 -b 1 -t
>
> I get the files (note that no catalogue is created):
>
> *batch_1.populations.log* sample_ATTCTG.alleles.tsv sample_ATTCTG.tags.tsv
> sample_CAGGTG.snps.tsv
> denovo_map.log sample_ATTCTG.snps.tsv sample_CAGGTG.alleles.tsv
> sample_CAGGTG.tags.tsv
>
> And *batch_1.populations.log *is empty...
>
> By specifying -s, denovo_map.pl assumes that the input data format is a
> Bowtie/SAM file and not a fq file (which is my case).
>
> Thanks,
>
> Ruth
>
>
>
> On Tuesday, December 11, 2012 4:05:39 PM UTC+1, ananta wrote:
>
> Ruth,
>
> Try with both samples as population -s instead of -p or -r
>
> also try -m 2, 3 so, that you can see if any loci were missed because of
> your read depth criteria. /you have not specified -M, I guess default is 2
> but you better work with your parameter. may be -M 2?/
>
>
>
> /denovo_map.pl <http://denovo_map.pl/> -o ./denovomap -s .//sample_CAGGTG.fq
> /-s .//sample_ATTCTG.fq /-m 5 -n 2 -M 2 -T 4 -b 1 -t /
> /
> /
>
> /
> /
>
> Thanks
>
> On Tuesday, December 11, 2012 8:54:18 AM UTC-6, Ruth wrote:
>
> Hello all,
>
>
> Here I come again.
>
>
> I have tried denovo_map.pl <http://denovo_map.pl/> with my two samples.
>
> Once using one sample as parent and the other as progeny:
>
> /denovo_map.pl <http://denovo_map.pl> -o ./denovomap -p
> .//sample_CAGGTG.fq /-r .//sample_ATTCTG.fq /-m 5 -n 2 -T 4 -b 1 -t /
>
> And another time using both samples as parents:
>
>
> /denovo_map.pl <http://denovo_map.pl> -o ./denovomap -p
> .//sample_CAGGTG.fq/-p .//sample_ATTCTG.fq /-m 5 -n 2 -T 4 -b 100 -t /
>
>
> Still don't know what would be best or what would be the difference.
>
>
> And these files are generated, so it seems to work:
>
> /batch_1.catalog.alleles.tsv /*/batch_1.genotypes_1.tsv
> batch_1.markers.tsv/*//sample_CAGGTG/.snps.tsv /sample_ATTCTG/.snps.tsv/
>
> /batch_1.catalog.snps.tsv /*/batch_1.genotypes_1.txt/*/ denovo_map.log
> /sample_CAGGTG/.tags.tsv /sample_ATTCTG/.tags.tsv/
>
> /batch_1.catalog.tags.tsv
> /*/batch_1.haplotypes_1.tsv/*//sample_CAGGTG/.alleles.tsv
> /sample_ATTCTG/.alleles.tsv/
>
>
> The output from ustacks and stacks is OK (these are XXX.alleles.tsv,
> XXX.tags.tsv, and XXX.snps.tsv) *BUT t*he files XXX.genotypes,
> XXX.haplotypes and XXX.markers *are empty.*
>
>
> This is some of the denovo_map.log report that I get:
>
>
> *head -20 denovo_map.log *
>
>
> /ref_map.pl <http://ref_map.pl> started at 2012-12-11 11:30:22/
>
> //usr/local/bin/denovo_map.pl <http://denovo_map.pl> -m 3 -M 3 -n 2 -T
> 15 -b 1 -t -D RuthTest -o ./denovomap -p ./sample_CAGGTG.fq -r
> ./sample_ATTCTG.fq/
>
> /Identifying unique stacks; file 1 of 2 [sample_CAGGTG]/
>
> //usr/local/bin//ustacks -t fastq -f ./sample_CAGGTG.fq -o ./denovomap
> -i -d -r -m 3 -M 3 -p 15 2>&1/
>
> /Min depth of coverage to create a stack: 3/
>
> /Max distance allowed between stacks: 3/
>
> /Max distance allowed to align secondary reads: 5/
>
> /Deleveraging algorithm: disabled/
>
> /Removal algorithm: enabled/
>
> /Model type: SNP/
>
> /Parsing ./sample_CAGGTG.fq/
>
> /Loaded 3306171 RAD-Tags; inserted 457376 elements into the RAD-Tags
> hash map./
>
> / 2546 reads contained uncalled nucleotides that were modified./
>
> / Mean coverage depth is 24; Std Dev: 179.206 Max: 46659/
>
> /Coverage mean: 24; stdev: 179.206/
>
> /Deleveraging trigger: 203; Removal trigger: 382/
>
> /Calculating distance for removing repetitive stacks./
>
> / Distance allowed between stacks: 1/
>
> / Using a k-mer length of 43/
>
> / Number of kmers per sequence: 46/
>
>
> *tail denovo_map.log *
>
>
> Writing SQL markers file to './denovomap/batch_1.markers.tsv'
>
> Writing SQL genotypes file to './denovomap/batch_1.genotypes_1.txt'
>
> Writing 0 loci to observed haplotype file,
> './denovomap/batch_1.haplotypes_1.tsv'
>
> mysql --defaults-file=/usr/local/share/stacks/sql/mysql.cnf -e "LOAD
> DATA LOCAL INFILE './denovomap/batch_1.markers.tsv' INTO TABLE markers "
>
> mysql --defaults-file=/usr/local/share/stacks/sql/mysql.cnf -e "LOAD
> DATA LOCAL INFILE './denovomap/batch_1.genotypes_1.txt' INTO TABLE
> catalog_genotypes "
>
> /usr/local/bin//index_radtags.pl <http://index_radtags.pl> -D -t -c 2>&1
>
> DBI
> connect('-t:mysql_read_default_file=/usr/local/share/stacks/sql/mysql.cnf','',...)
> failed: Can't connect to local MySQL server through socket
> '/var/run/mysqld/mysqld.sock' (2) at /usr/local/bin//index_radtags.pl
> <http://index_radtags.pl> line 570
>
> Unable to connect to the -t MySQL Database!
>
> Can't connect to local MySQL server through socket
> '/var/run/mysqld/mysqld.sock' (2) at /usr/local/bin//index_radtags.pl
> <http://index_radtags.pl> line 570.
>
> denovo_map.pl <http://denovo_map.pl> completed at 2012-12-11 11:37:06
>
>
>
> Any tips please??? Could this be a problem because I am not using the
> MySQL database?
>
>
> Thanks,
>
>
> Ruth
>
>
>
>
> On Monday, December 10, 2012 11:56:32 AM UTC+1, Ruth wrote:
>
> Dear all,
>
>
> Sorry in advance for long post. I am trying to give as much detail
> as I can without writing a novel.
>
>
> I am using STACKS to analyse the sequences that I got from a pilot
> study using SbfI. These are single end sequences generated by
> Ilumina HiSeq.
>
> Since I don't have a reference genome for my organism I am running
> several parts of the STACKS separately to identify SNPs.
> Process_radtags + ustacks + cstacks/sstacks
>
>
> I am now having some trouble interpreting some of the outputs from
> sstacks/cstacks.
>
>
> For example, if I consider two of my samples: Ire 11 and Ire16. When
> I run *ustacks *using m=5 I get:
>
> Number SNPs Ire11= 2848
>
> Number SNPs Ire 16 = 3427
>
>
> Afterwards, when I build a catalogue using *csatcks* of both samples
> and I get a total of 5407 SNPs.
>
> So, my understanding is that Stacks identifies 5407 SNPs in either
> Ire11, Ire 16, or both (*i.e. the SNPs may not be in Radtags that
> have been sequenced/identified in both Ire11/Ire16*)
>
>
> However, what I really want to know is how many of the SNPs
> positions are in RadTags that are shared between both samples.
>
> This is to have an idea of 1) how many of the RadTags have been
> sequenced in both samples and, 2) how many of these Radtags have
> SNPs (herezozygote only in Ire11, heterozygote only in Ire16, or
> heterozygote in both) .
>
> By doing this, I will have an idea of how much sequencing effort
> ($$$$$) I will need to put in my future population genetics study
>
> (i.e. using ~100 samples in my study; how many reads do I need to
> get lets say > 1000 shared SNPs????)
>
>
> So, I thought that *sstacks *was the answer to my trouble. I have
> been using the XXX.tags.tsv, XXX.alleles.tsv, and XXX.snps.tsv
> obtained form ustacks for one of the samples (Ire11) as "catalogue"
> and afterwards I have matched the
>
> XXX.tags.tsv, XXX.alleles.tsv, and XXX.snps.tsv files from the other
> sample (Ire16) against the first one (Ire11) to find shared Radtags
> and shared SNPs.
>
> I got a promising output file called "sample_Ire16.matches.tsv",
> that looks like this below (it has 48963 lines):
>
>
> *Headings*: SQL ID (not using this); Batch ID (0 because not using a
> cstacks output and therefore batch IDs as input when running
> sstacks); catalogue ID (stack ID in Ire11); Sample ID (Ire16=2 in my
> command line); Stack ID (in Ire16); Haplotype; Stack Depth
>
>
> 0 0 62 2 69 consensus 25
>
> 0 0 63 2 70 consensus 47
>
> 0 0 65 2 72 consensus 27
>
> 0 0 66 2 73 consensus 18
>
> 0 0 68 2 75 consensus 38
>
> 0 0 69 2 77 A 23
>
> 0 0 69 2 77 G 24
>
> 0 0 87 2 81 consensus 79
>
> 0 0 88 2 84 consensus 69
>
> 0 0 79 2 88 consensus 127
>
> 0 0 81 2 94 consensus 75
>
> 0 0 86 2 97 consensus 62
>
> 0 0 13364 2 105 G 39
>
> 0 0 66573 2 123 consensus 66
>
> 0 0 143 2 190 consensus 67
>
>
> *And now there come my questions:*
>
> **
>
> /Besides the /promising output file called
> "sample_Ire16.matches.tsv",/I get more output saying:/
>
> /Parsing
> ./ustacks_all_reads/stacks_CAGGTG_Ire11_m10p001/sample_CAGGTG.snps.tsv/
>
> / Parsing
> ./ustacks_all_reads/stacks_CAGGTG_Ire11_m10p001/sample_CAGGTG.alleles.tsv/
>
> / Parsing
> ./ustacks_all_reads/stacks_ATTCTG_Ire16_m10p001/sample_ATTCTG.tags.tsv/
>
> / Parsing
> ./ustacks_all_reads/stacks_ATTCTG_Ire16_m10p001/sample_ATTCTG.snps.tsv/
>
> / Parsing
> ./ustacks_all_reads/stacks_ATTCTG_Ire16_m10p001/sample_ATTCTG.alleles.tsv/
>
> /Searching for sequence matches.../
>
> /81252 stacks matched against the catalog containing 80997 loci./
>
> / 50572 matching loci, 1827 contained no verified haplotypes./
>
> / 32 loci matched more than one catalog locus and were excluded./
>
> //*//**/1794 loci contained SNPs unaccounted for in the catalog and
> were excluded. /**question 1: Are these shared Radtags that have 2
> alleles in Ire16 (heterozygotes ) but only one allele (homozygotes)
> in Ire 11????*
>
> / 50843 total haplotypes examined from matching loci, 48963 verified./
>
>
>
> From the "sample_Ire16.matches.tsv" file, I have assumed that
> "consensus" means that this RadTag was shared between Ire 11 and
> Ire16 but there was not a SNP on it.
>
> So I selected only those lines with no consensus on them (grep -v
> "consensus" sample_Ire16.matches.tsv > SNPs_positions ) and the file
> "SNP_positions" looks like this:
>
>
> 0069277 A 23
>
> 0069277 G 24
>
> 00133642105 G 39
>
> 009582215 A 48
>
> 009022229 GT 12
>
> 0010162248 GCCCCT 18
>
> 009572280 A 15
>
> 0010202298 CCACCT 27
>
> 0030502329 C 32
>
> 0010222337 GCGCCTGCCC 182
>
>
> *And now it comes question 2: How do I interpret this???*
>
> **
>
> *- If a number is recorded twice in column 3 and column 5; is this a
> SNP (heterozygote) in both Ire 11 and Ire 16?*
>
> *- If a number is recorded once in column 3 and column 5; is this a
> SNP (heterozygote) in Ire 11 but a homozygote in Ire 16?*
>
> *- When a number is recorded >twice in either column 3 or column 5
> (It happened several times). Shall we consider this an sequencing
> error??*
>
> **
>
> Thank a lot!!!
>
>
> Ruth
>
>
> --
> For more options or to unsubscribe: http://groups.google.com/group/stacks-users
> Stacks website: http://creskolab.uoregon.edu/stacks/

--
Julian M Catchen, Ph.D.
Institute of Ecology and Evolution
University of Oregon
--
jcat...@uoregon.edu
http://www.uoregon.edu/~jcatchen/

Julian Catchen

unread,
Dec 12, 2012, 8:56:07 PM12/12/12
to stacks...@googlegroups.com
Hi Ruth,

Add the '-S' flag to disable interaction with the database.

julian

On 12/11/12 6:54 AM, Ruth wrote:
> Hello all,
>
>
> Here I come again.
>
>
> I have tried denovo_map.pl <http://denovo_map.pl/> with my two samples.
>
> Once using one sample as parent and the other as progeny:
>
> / denovo_map.pl -o ./denovomap -p .//sample_CAGGTG.fq /-r .//sample_ATTCTG.fq
> /-m 5 -n 2 -T 4 -b 1 -t /
>
> And another time using both samples as parents:
>
>
> /denovo_map.pl -o ./denovomap -p .//sample_CAGGTG.fq/-p .//sample_ATTCTG.fq /-m
> 5 -n 2 -T 4 -b 100 -t /
>
>
> Still don't know what would be best or what would be the difference.
>
>
> And these files are generated, so it seems to work:
>
> /batch_1.catalog.alleles.tsv /*/batch_1.genotypes_1.tsv
> batch_1.markers.tsv/*//sample_CAGGTG/.snps.tsv /sample_ATTCTG/.snps.tsv/
>
> /batch_1.catalog.snps.tsv /*/batch_1.genotypes_1.txt/*/ denovo_map.log
> /sample_CAGGTG/.tags.tsv /sample_ATTCTG/.tags.tsv/
>
> /batch_1.catalog.tags.tsv
> /*/batch_1.haplotypes_1.tsv/*//sample_CAGGTG/.alleles.tsv
> /sample_ATTCTG/.alleles.tsv/
>
>
> The output from ustacks and stacks is OK (these are XXX.alleles.tsv,
> XXX.tags.tsv, and XXX.snps.tsv) *BUT t*he files XXX.genotypes, XXX.haplotypes
> and XXX.markers *are empty.*
>
>
> This is some of the denovo_map.log report that I get:
>
>
> *head -20 denovo_map.log *
>
>
> /ref_map.pl started at 2012-12-11 11:30:22/
>
> //usr/local/bin/denovo_map.pl -m 3 -M 3 -n 2 -T 15 -b 1 -t -D RuthTest -o
> ./denovomap -p ./sample_CAGGTG.fq -r ./sample_ATTCTG.fq/
>
> /Identifying unique stacks; file 1 of 2 [sample_CAGGTG]/
>
> //usr/local/bin//ustacks -t fastq -f ./sample_CAGGTG.fq -o ./denovomap -i -d -r
> -m 3 -M 3 -p 15 2>&1/
>
> /Min depth of coverage to create a stack: 3/
>
> /Max distance allowed between stacks: 3/
>
> /Max distance allowed to align secondary reads: 5/
>
> /Deleveraging algorithm: disabled/
>
> /Removal algorithm: enabled/
>
> /Model type: SNP/
>
> /Parsing ./sample_CAGGTG.fq/
>
> /Loaded 3306171 RAD-Tags; inserted 457376 elements into the RAD-Tags hash map./
>
> / 2546 reads contained uncalled nucleotides that were modified./
>
> / Mean coverage depth is 24; Std Dev: 179.206 Max: 46659/
>
> /Coverage mean: 24; stdev: 179.206/
>
> /Deleveraging trigger: 203; Removal trigger: 382/
>
> /Calculating distance for removing repetitive stacks./
>
> / Distance allowed between stacks: 1/
>
> / Using a k-mer length of 43/
>
> / Number of kmers per sequence: 46/
>
>
> *tail denovo_map.log *
> *ustacks *using m=5 I get:
>
> Number SNPs Ire11= 2848
>
> Number SNPs Ire 16 = 3427
>
>
> Afterwards, when I build a catalogue using *csatcks* of both samples and I
> get a total of 5407 SNPs.
>
> So, my understanding is that Stacks identifies 5407 SNPs in either Ire11,
> Ire 16, or both (*i.e. the SNPs may not be in Radtags that have been
> sequenced/identified in both Ire11/Ire16*)
>
>
> However, what I really want to know is how many of the SNPs positions are in
> RadTags that are shared between both samples.
>
> This is to have an idea of 1) how many of the RadTags have been sequenced in
> both samples and, 2) how many of these Radtags have SNPs (herezozygote only
> in Ire11, heterozygote only in Ire16, or heterozygote in both) .
>
> By doing this, I will have an idea of how much sequencing effort ($$$$$) I
> will need to put in my future population genetics study
>
> (i.e. using ~100 samples in my study; how many reads do I need to get lets
> say > 1000 shared SNPs????)
>
>
> So, I thought that *sstacks *was the answer to my trouble. I have been using
> the XXX.tags.tsv, XXX.alleles.tsv, and XXX.snps.tsv obtained form ustacks
> for one of the samples (Ire11) as "catalogue" and afterwards I have matched the
>
> XXX.tags.tsv, XXX.alleles.tsv, and XXX.snps.tsv files from the other sample
> (Ire16) against the first one (Ire11) to find shared Radtags and shared SNPs.
>
> I got a promising output file called "sample_Ire16.matches.tsv", that looks
> like this below (it has 48963 lines):
>
>
> *Headings*: SQL ID (not using this); Batch ID (0 because not using a cstacks
> output and therefore batch IDs as input when running sstacks); catalogue ID
> (stack ID in Ire11); Sample ID (Ire16=2 in my command line); Stack ID (in
> Ire16); Haplotype; Stack Depth
>
>
> 0 0 62 2 69 consensus 25
>
> 0 0 63 2 70 consensus 47
>
> 0 0 65 2 72 consensus 27
>
> 0 0 66 2 73 consensus 18
>
> 0 0 68 2 75 consensus 38
>
> 0 0 69 2 77 A 23
>
> 0 0 69 2 77 G 24
>
> 0 0 87 2 81 consensus 79
>
> 0 0 88 2 84 consensus 69
>
> 0 0 79 2 88 consensus 127
>
> 0 0 81 2 94 consensus 75
>
> 0 0 86 2 97 consensus 62
>
> 0 0 13364 2 105 G 39
>
> 0 0 66573 2 123 consensus 66
>
> 0 0 143 2 190 consensus 67
>
>
> *And now there come my questions:*
>
> **
>
> /Besides the /promising output file called "sample_Ire16.matches.tsv",/I get
> more output saying:/
>
> /Parsing ./ustacks_all_reads/stacks_CAGGTG_Ire11_m10p001/sample_CAGGTG.snps.tsv/
>
> / Parsing
> ./ustacks_all_reads/stacks_CAGGTG_Ire11_m10p001/sample_CAGGTG.alleles.tsv/
>
> / Parsing
> ./ustacks_all_reads/stacks_ATTCTG_Ire16_m10p001/sample_ATTCTG.tags.tsv/
>
> / Parsing
> ./ustacks_all_reads/stacks_ATTCTG_Ire16_m10p001/sample_ATTCTG.snps.tsv/
>
> / Parsing
> ./ustacks_all_reads/stacks_ATTCTG_Ire16_m10p001/sample_ATTCTG.alleles.tsv/
>
> /Searching for sequence matches.../
>
> /81252 stacks matched against the catalog containing 80997 loci./
>
> / 50572 matching loci, 1827 contained no verified haplotypes./
>
> / 32 loci matched more than one catalog locus and were excluded./
>
> //*//**/1794 loci contained SNPs unaccounted for in the catalog and were
> excluded. /**question 1: Are these shared Radtags that have 2 alleles in
> Ire16 (heterozygotes ) but only one allele (homozygotes) in Ire 11????*
>
> / 50843 total haplotypes examined from matching loci, 48963 verified./
>
>
>
> From the "sample_Ire16.matches.tsv" file, I have assumed that "consensus"
> means that this RadTag was shared between Ire 11 and Ire16 but there was not
> a SNP on it.
>
> So I selected only those lines with no consensus on them (grep -v
> "consensus" sample_Ire16.matches.tsv > SNPs_positions ) and the file
> "SNP_positions" looks like this:
>
>

Ruth

unread,
Dec 13, 2012, 5:57:16 AM12/13/12
to stacks...@googlegroups.com, jcat...@uoregon.edu
Hello Julian,

Thanks a lot for your time and interest.

Attached here are the outputs from command line and denovo_map.log (note that I am not using MySQL)

For now, I have got an answer to my question regarding shared SNPs and shared Radtags between my 2 samples by applying Ananta's last suggestion but it would be nice to be able to run denovo_map in the future.

Cheers,

Ruth
Output_command_lineRF.rtf

Julian Catchen

unread,
Dec 13, 2012, 9:45:47 PM12/13/12
to stacks...@googlegroups.com
Hi Ruth,

After looking at your logs, you just need to disable the SQL interaction for
denovo_map.pl to work (-S). I'll add a check to the program so that it throws an
error if SQL interaction is turned on and it can't connect to the database.

Best,

julian


On 12/13/12 2:57 AM, Ruth wrote:
> Hello Julian,
>
> Thanks a lot for your time and interest.
>
> Attached here are the outputs from command line and denovo_map.log (note that I
> am not using MySQL)
>
> For now, I have got an answer to my question regarding shared SNPs and shared
> Radtags between my 2 samples by applying Ananta's last suggestion but it would
> be nice to be able to run denovo_map in the future.
>
> Cheers,
>
> Ruth
>
> On Thursday, December 13, 2012 2:54:20 AM UTC+1, Julian Catchen wrote:
>
> Hi Ruth,
>
> That is just a typo, denovo_map.pl <http://denovo_map.pl> is meant for data
> without a reference genome,
> so FASTA or FASTQ. I have fixed the typo in the source and on the website.
>
> I am guessing that cstacks is crashing due to your input data. What is the
> output of denovo_map.pl <http://denovo_map.pl> on the command line, and what
Reply all
Reply to author
Forward
0 new messages