Dear all,
Sorry in advance for long post. I am trying to give as much detail as I can without writing a novel.
I am using STACKS to analyse the sequences that I got from a pilot study using SbfI. These are single end sequences generated by Ilumina HiSeq.
Since I don't have a reference genome for my organism I am running several parts of the STACKS separately to identify SNPs. Process_radtags + ustacks + cstacks/sstacks
I am now having some trouble interpreting some of the outputs from sstacks/cstacks.
For example, if I consider two of my samples: Ire 11 and Ire16. When I run ustacks using m=5 I get:
Number SNPs Ire11= 2848
Number SNPs Ire 16 = 3427
Afterwards, when I build a catalogue using csatcks of both samples and I get a total of 5407 SNPs.
So, my understanding is that Stacks identifies 5407 SNPs in either Ire11, Ire 16, or both (i.e. the SNPs may not be in Radtags that have been sequenced/identified in both Ire11/Ire16)
However, what I really want to know is how many of the SNPs positions are in RadTags that are shared between both samples.
This is to have an idea of 1) how many of the RadTags have been sequenced in both samples and, 2) how many of these Radtags have SNPs (herezozygote only in Ire11, heterozygote only in Ire16, or heterozygote in both) .
By doing this, I will have an idea of how much sequencing effort ($$$$$) I will need to put in my future population genetics study
(i.e. using ~100 samples in my study; how many reads do I need to get lets say > 1000 shared SNPs????)
So, I thought that sstacks was the answer to my trouble. I have been using the XXX.tags.tsv, XXX.alleles.tsv, and XXX.snps.tsv obtained form ustacks for one of the samples (Ire11) as "catalogue" and afterwards I have matched the
XXX.tags.tsv, XXX.alleles.tsv, and XXX.snps.tsv files from the other sample (Ire16) against the first one (Ire11) to find shared Radtags and shared SNPs.
I got a promising output file called "sample_Ire16.matches.tsv", that looks like this below (it has 48963 lines):
Headings: SQL ID (not using this); Batch ID (0 because not using a cstacks output and therefore batch IDs as input when running sstacks); catalogue ID (stack ID in Ire11); Sample ID (Ire16=2 in my command line); Stack ID (in Ire16); Haplotype; Stack Depth
0 0 62 2 69 consensus 25
0 0 63 2 70 consensus 47
0 0 65 2 72 consensus 27
0 0 66 2 73 consensus 18
0 0 68 2 75 consensus 38
0 0 69 2 77 A 23
0 0 69 2 77 G 24
0 0 87 2 81 consensus 79
0 0 88 2 84 consensus 69
0 0 79 2 88 consensus 127
0 0 81 2 94 consensus 75
0 0 86 2 97 consensus 62
0 0 13364 2 105 G 39
0 0 66573 2 123 consensus 66
0 0 143 2 190 consensus 67
And now there come my questions:
Besides the promising output file called "sample_Ire16.matches.tsv", I get more output saying:
Parsing ./ustacks_all_reads/stacks_CAGGTG_Ire11_m10p001/sample_CAGGTG.snps.tsv
Parsing ./ustacks_all_reads/stacks_CAGGTG_Ire11_m10p001/sample_CAGGTG.alleles.tsv
Parsing ./ustacks_all_reads/stacks_ATTCTG_Ire16_m10p001/sample_ATTCTG.tags.tsv
Parsing ./ustacks_all_reads/stacks_ATTCTG_Ire16_m10p001/sample_ATTCTG.snps.tsv
Parsing ./ustacks_all_reads/stacks_ATTCTG_Ire16_m10p001/sample_ATTCTG.alleles.tsv
Searching for sequence matches...
81252 stacks matched against the catalog containing 80997 loci.
50572 matching loci, 1827 contained no verified haplotypes.
32 loci matched more than one catalog locus and were excluded.
1794 loci contained SNPs unaccounted for in the catalog and were excluded. question 1: Are these shared Radtags that have 2 alleles in Ire16 (heterozygotes ) but only one allele (homozygotes) in Ire 11????
50843 total haplotypes examined from matching loci, 48963 verified.
From the "sample_Ire16.matches.tsv" file, I have assumed that "consensus" means that this RadTag was shared between Ire 11 and Ire16 but there was not a SNP on it.
So I selected only those lines with no consensus on them (grep -v "consensus" sample_Ire16.matches.tsv > SNPs_positions ) and the file "SNP_positions" looks like this:
0 0 69 2 77 A 23
0 0 69 2 77 G 24
0 0 13364 2 105 G 39
0 0 958 2 215 A 48
0 0 902 2 229 GT 12
0 0 1016 2 248 GCCCCT 18
0 0 957 2 280 A 15
0 0 1020 2 298 CCACCT 27
0 0 3050 2 329 C 32
0 0 1022 2 337 GCGCCTGCCC 182
And now it comes question 2: How do I interpret this???
- If a number is recorded twice in column 3 and column 5; is this a SNP (heterozygote) in both Ire 11 and Ire 16?
- If a number is recorded once in column 3 and column 5; is this a SNP (heterozygote) in Ire 11 but a homozygote in Ire 16?
- When a number is recorded >twice in either column 3 or column 5 (It happened several times). Shall we consider this an sequencing error??
Thank a lot!!!
Ruth
Hello again,
Thanks so much for your helpful replies and for your patience reading my long post.
Yes, I see now that running denovo_map.pl can be the answer to my trouble. Seems quite obvious now but sometimes it is difficult to see the obvious, especially when you work on your own!
Regarding Ryan questions about my experiments. I only have results for 2 samples for now. Trying to scale up the rest of the experiments (number of samples and Illumina lanes to use based on this) although I realise that having a pilot study with few more individuals would have been a clear advantage when trying to infer diversity indexes.
Cheers,
Ruth
Hello all,
Here I come again.
I have tried denovo_map.pl with my two samples.
Once using one sample as parent and the other as progeny:
denovo_map.pl -o ./denovomap -p ./sample_CAGGTG.fq -r ./sample_ATTCTG.fq -m 5 -n 2 -T 4 -b 1 -t
And another time using both samples as parents:
denovo_map.pl -o ./denovomap -p ./sample_CAGGTG.fq -p ./sample_ATTCTG.fq -m 5 -n 2 -T 4 -b 100 -t
Still don't know what would be best or what would be the difference.
And these files are generated, so it seems to work:
batch_1.catalog.alleles.tsv batch_1.genotypes_1.tsv batch_1.markers.tsv sample_CAGGTG.snps.tsv sample_ATTCTG.snps.tsv
batch_1.catalog.snps.tsv batch_1.genotypes_1.txt denovo_map.log sample_CAGGTG.tags.tsv sample_ATTCTG.tags.tsv
batch_1.catalog.tags.tsv batch_1.haplotypes_1.tsv sample_CAGGTG.alleles.tsv sample_ATTCTG.alleles.tsv
The output from ustacks and stacks is OK (these are XXX.alleles.tsv, XXX.tags.tsv, and XXX.snps.tsv) BUT the files XXX.genotypes, XXX.haplotypes and XXX.markers are empty.
This is some of the denovo_map.log report that I get:
head -20 denovo_map.log
ref_map.pl started at 2012-12-11 11:30:22
/usr/local/bin/denovo_map.pl -m 3 -M 3 -n 2 -T 15 -b 1 -t -D RuthTest -o ./denovomap -p ./sample_CAGGTG.fq -r ./sample_ATTCTG.fq
Identifying unique stacks; file 1 of 2 [sample_CAGGTG]
/usr/local/bin//ustacks -t fastq -f ./sample_CAGGTG.fq -o ./denovomap -i -d -r -m 3 -M 3 -p 15 2>&1
Min depth of coverage to create a stack: 3
Max distance allowed between stacks: 3
Max distance allowed to align secondary reads: 5
Deleveraging algorithm: disabled
Removal algorithm: enabled
Model type: SNP
Parsing ./sample_CAGGTG.fq
Loaded 3306171 RAD-Tags; inserted 457376 elements into the RAD-Tags hash map.
2546 reads contained uncalled nucleotides that were modified.
Mean coverage depth is 24; Std Dev: 179.206 Max: 46659
Coverage mean: 24; stdev: 179.206
Deleveraging trigger: 203; Removal trigger: 382
Calculating distance for removing repetitive stacks.
Distance allowed between stacks: 1
Using a k-mer length of 43
Number of kmers per sequence: 46
tail denovo_map.log
Writing SQL markers file to './denovomap/batch_1.markers.tsv'
Writing SQL genotypes file to './denovomap/batch_1.genotypes_1.txt'
Writing 0 loci to observed haplotype file, './denovomap/batch_1.haplotypes_1.tsv'
mysql --defaults-file=/usr/local/share/stacks/sql/mysql.cnf -e "LOAD DATA LOCAL INFILE './denovomap/batch_1.markers.tsv' INTO TABLE markers "
mysql --defaults-file=/usr/local/share/stacks/sql/mysql.cnf -e "LOAD DATA LOCAL INFILE './denovomap/batch_1.genotypes_1.txt' INTO TABLE catalog_genotypes "
/usr/local/bin//index_radtags.pl -D -t -c 2>&1
DBI connect('-t:mysql_read_default_file=/usr/local/share/stacks/sql/mysql.cnf','',...) failed: Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (2) at /usr/local/bin//index_radtags.pl line 570
Unable to connect to the -t MySQL Database!
Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (2) at /usr/local/bin//index_radtags.pl line 570.
denovo_map.pl completed at 2012-12-11 11:37:06
Any tips please??? Could this be a problem because I am not using the MySQL database?
Thanks,
Ruth