Why use the mock reference genome and not the centroids file?

sethm...@gmail.com

Dec 12, 2016, 10:39:11 AM

to GBS-SNP-CROP: GBS SNP Calling Reference Optional Pipeline

Hi all,

I'm working with NGS data from several wild populations of a plant species, and I'm interested in finding SNPs that are likely to be contained within the chloroplast genome.

I've been able to use blastn to align parts of my mock reference centroids fasta to a related species's complete chloroplast genome and have used this information to pull out the centroids that match. I've then proceeded to use the resultant fasta, still containing centroid information, to continue with the GBS-SNP-CROP pipeline, which doesn't seem to have caused any issues.

Does anyone see any issues with what I've done? I'm mostly posting out of curiosity, but if anyone could comment on the pipeline's use of a poly-A bound mock reference instead of the centroid mock reference, that would be cool.

Secondly, if you have suggestions as to how I could improve what I've done then PLEASE do post them. At the moment I've pulled out 31 SNPs after filtering (step 7), compared to over 4000 when I use the full mock reference...

Many thanks in advance,

Seth

Arthur Melo

Dec 13, 2016, 9:36:15 AM

to GBS-SNP-CROP: GBS SNP Calling Reference Optional Pipeline, sethm...@gmail.com

Hi Seth, thank you for writing ...

Firstly, I'm assuming all your NGS reads come from chloroplast DNA, right?

If true, I think you can use GBS-SNP-CROP exactly as described in the User Manual. I mean, if your raw data come from chloroplast genomes, then when you build the mock reference and map the high-quality, demultiplexed reads, you are able to call SNPs specific to the chloroplast. So, I guess there is no reason to use blastn or any other analysis...

However, if your raw reads are not from chloroplast-specific DNA, I think your strategy can work. But instead of blasting the mock reference against a related species's complete chloroplast genome, it seems more interesting to me to map the high-quality demultiplexed reads (outputs from step 3) to that chloroplast genome and then pull out only the reads that mapped. With the reads that matched, you can build your mock reference and move forward normally... using your genome version of the mock reference, for example.

To answer your question, "why use the mock reference genome and not the centroids file": we saw that some SNPs were called wrongly because the end of a GBS read was mis-aligned across the boundary between two adjacent clusters in the Mock Reference. To minimize such errors, adjacent centroids are now separated from one another by a string of 20 high-quality A's, an adenine-based boundary found to enhance alignment accuracy. If N clusters are used to build the mock reference, the mock reference will contain 20*(N-1) adenine boundary bases. The coordinates of these bases (found in PosToMask.txt) should be used to mask the final list of putative SNPs in order to avoid calling false SNPs within these artificial boundaries. The number of such false SNPs depends on the data, but it is not large. You can use the centroid file instead of the genome mock reference, but unfortunately you will then not be able to find those false SNPs.
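
For illustration, the boundary arithmetic can be sketched in a few lines of Python. This is a toy sketch and not the pipeline's own code: the function name `build_mock_reference`, the toy sequences, and the choice of 1-based coordinates are all assumptions made here, and the real layout of PosToMask.txt may differ.

```python
# Illustrative sketch only; not the GBS-SNP-CROP implementation.
BOUNDARY = "A" * 20  # 20-base adenine spacer between adjacent centroids


def build_mock_reference(centroids):
    """Join centroid sequences with poly-A spacers.

    Returns the stitched sequence and the positions of every spacer
    base (1-based here, which is an assumption of this sketch).
    """
    pieces = []
    mask_positions = []
    offset = 0  # number of bases written so far
    for i, seq in enumerate(centroids):
        if i > 0:
            # The spacer occupies 1-based positions offset+1 .. offset+20.
            mask_positions.extend(range(offset + 1, offset + len(BOUNDARY) + 1))
            pieces.append(BOUNDARY)
            offset += len(BOUNDARY)
        pieces.append(seq)
        offset += len(seq)
    return "".join(pieces), mask_positions


centroids = ["ACGTACGT", "GGCCTT", "TTGACA"]  # toy sequences, N = 3
reference, mask = build_mock_reference(centroids)
print(len(mask))  # 20 * (3 - 1) = 40 boundary bases
```

With N = 3 toy centroids, the sketch yields 40 boundary positions, matching 20*(N-1).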

In order to increase your final number of called SNPs, I encourage you to relax the SAMTools flags and also the genotyping criteria in step 7. In addition, you can try to identify the reads that match the chloroplast genome before building the mock reference.

Please, let me know about your progress.

Best,

Arthur

seth musker

Dec 13, 2016, 10:51:10 AM

to Arthur Melo, GBS-SNP-CROP: GBS SNP Calling Reference Optional Pipeline

Hi Arthur,

Thanks very much for this very helpful reply.

Sorry for not being more specific about where my reads came from, but in fact they're from whole plant extractions and so contain both nuclear and chloroplast DNA. Your suggestion is a good one and hadn't occurred to me. Just to be clear though, you're suggesting that I 1) blast each of the resultant fastq files from step 3 against the related species chloroplast genome, 2) use the results to create fastq files that only contain reads that mapped to the chloroplast genome, 3) then use those chloroplast-only fastq files to build a new mock reference genome, and 4) use this genome for alignment. Is that correct?
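
Step 2 of that list (keeping only the reads that mapped) could look roughly like the Python sketch below. It assumes the IDs of the mapped reads have already been collected separately from the alignment against the chloroplast genome; `filter_fastq`, `mapped_ids`, and the toy records are hypothetical names and data, not part of GBS-SNP-CROP.

```python
def filter_fastq(lines, keep_ids):
    """Yield the lines of each 4-line FASTQ record whose read ID is in keep_ids."""
    lines = [line.rstrip("\n") for line in lines]
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = lines[i:i + 4]
        read_id = header[1:].split()[0]  # drop '@' and any trailing description
        if read_id in keep_ids:
            yield from (header, seq, plus, qual)


# Toy input: two reads, only r2 is assumed to have mapped to the chloroplast.
fastq = ["@r1", "ACGT", "+", "IIII", "@r2 extra", "GGCC", "+", "IIII"]
mapped_ids = {"r2"}
kept = list(filter_fastq(fastq, mapped_ids))
print("\n".join(kept))
```

The filtered records would then be written back out as the chloroplast-only fastq files used to build the new mock reference.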

Thanks for your explanation of the reasoning behind the poly-A's in the mock reference genome. I'm a little confused though. Why would the program align reads across a scaffold/chromosome boundary in the centroids file? Surely this doesn't make sense as one doesn't know how the scaffolds fit together, so to speak?

Many thanks again,

Seth

--
You received this message because you are subscribed to a topic in the Google Groups "GBS-SNP-CROP: GBS SNP Calling Reference Optional Pipeline" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/gbs-snp-crop/Sf2jaWn22cE/unsubscribe.
To unsubscribe from this group and all its topics, send an email to gbs-snp-crop+unsubscribe@googlegroups.com.
To post to this group, send email to gbs-sn...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/gbs-snp-crop/eafe59a6-c966-4d4d-8d8f-32a52ab0b1b2%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Arthur Melo

Dec 13, 2016, 1:40:19 PM

to GBS-SNP-CROP: GBS SNP Calling Reference Optional Pipeline, sethm...@gmail.com

Hi Seth,

Exactly. I think it can be a reliable and reasonable way to call chloroplast SNPs. I also think you could be more relaxed with the SAMTools flags (step 5) and the genotyping criteria (step 7); specifically, use the -F 4 and -f 0 flags in step 5.

seth musker

Dec 14, 2016, 12:57:40 AM

to Arthur Melo, GBS-SNP-CROP: GBS SNP Calling Reference Optional Pipeline

Hi Arthur,

Thanks, I'm looking forward to trying this out. Just to confirm, you'd relax -F 4, meaning that unmapped reads would no longer be removed? What is your reasoning behind this?

Also, about my other question: could you help explain how the misalignment issue is solved by not using the centroids fasta? Perhaps I'm misunderstanding. I don't understand how a single read could map to separate clusters. Does the centroids file contain information about where the clusters are located relative to each other?

Best,

Seth

Arthur Melo

Dec 14, 2016, 10:14:53 AM

to GBS-SNP-CROP: GBS SNP Calling Reference Optional Pipeline, sethm...@gmail.com

Hi Seth,

No, the opposite. The -F flag in SAMTools specifies what is excluded from the output, and "4" means unmapped reads. Originally we recommended -F 2308, which excludes from the output unmapped reads + not primary alignments + supplementary alignments.
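
As a quick check of the arithmetic, 2308 = 4 + 256 + 2048. The small Python sketch below decodes a flag value into the three SAM bits mentioned here and mimics how an -F filter drops a read when any of the listed bits is set. The helper names `decode` and `excluded_by_F` are made up for this example and are not SAMTools functions.

```python
# Only the three SAM flag bits discussed in this thread are listed.
SAM_FLAG_BITS = {
    4: "read unmapped",
    256: "not primary alignment",
    2048: "supplementary alignment",
}


def decode(flag_value):
    """Return the names of the listed bits that are set in flag_value."""
    return [name for bit, name in SAM_FLAG_BITS.items() if flag_value & bit]


def excluded_by_F(read_flag, F):
    """A read is dropped by -F when any bit of F is set in the read's flag."""
    return (read_flag & F) != 0


print(decode(2308))              # all three categories
print(decode(4))                 # unmapped reads only
print(excluded_by_F(256, 2308))  # a non-primary alignment is dropped by -F 2308
print(excluded_by_F(256, 4))     # ...but kept by -F 4
```

So relaxing from -F 2308 to -F 4 still removes unmapped reads, while non-primary and supplementary alignments are kept.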

To answer your second question: we have noticed that BWA-mem works better, i.e., performs more concise alignments, with the "genome" form of the mock reference than with the cluster fasta files, which means more reads pass the SAMTools flag filters. In addition, it can significantly increase the read depth available for calling SNPs, which in turn means less missing data in your final genotyping matrix, given the depth filter requirements. So, originally, we simply stitched all clusters into a single mock reference genome. However, in that mock reference genome we also found misalignments at the boundary between two clusters. Because of this, we decided to join adjacent clusters with a poly-A string and to create a file listing all of the poly-A positions. So, if a SNP in the final genotyping matrix matches a position in the PosToMask.txt file, that SNP needs to be removed, since its reference allele is an A that comes from the poly-A string. There are not many of these SNPs, and I do not think they would cause a large bias in downstream analyses, but they are clearly false SNP calls. Also, this kind of misalignment at cluster boundaries certainly does not occur with a cluster fasta file.
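
The masking step itself amounts to a set lookup. A minimal Python sketch, assuming the SNP positions and the PosToMask.txt positions have already been read into plain lists of integers in the same coordinate system (the function name `mask_snps` and the toy numbers are hypothetical):

```python
def mask_snps(snp_positions, mask_positions):
    """Keep only the SNP positions that do not fall on a poly-A boundary base."""
    masked = set(mask_positions)
    return [pos for pos in snp_positions if pos not in masked]


# Toy example: one 20-base boundary at positions 9..28 of the mock reference.
boundary = list(range(9, 29))
putative_snps = [5, 12, 30, 40]
print(mask_snps(putative_snps, boundary))  # position 12 lies in the boundary and is dropped
```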

Best,

Arthur

seth musker

Dec 15, 2016, 3:13:26 AM

to Arthur Melo, GBS-SNP-CROP: GBS SNP Calling Reference Optional Pipeline

Hi Arthur,

Ah, I see, I misunderstood. I thought you meant to use -F2304 to exclude non-primary and supplementary alignments, but I now see that you meant to use only -F4, which will only exclude unmapped reads.

Thanks for your explanation of the BWA-mem process. That makes a lot of sense. And thanks again for all your help, it's greatly appreciated.

Best,

Seth

Arthur Melo

Dec 15, 2016, 8:26:33 AM

to GBS-SNP-CROP: GBS SNP Calling Reference Optional Pipeline, sethm...@gmail.com

Hi Seth,

Please let me know about your progress in calling SNPs on the chloroplast genome.

Best,

Arthur

Irene Villa

Jul 22, 2019, 6:39:38 AM

to GBS-SNP-CROP: GBS SNP Calling Reference Optional Pipeline

Hi Arthur, Seth and users,

I'm working with diploid plant species and I am interested in recovering chloroplast SNPs from GBS data using a chloroplast reference sequence.

I'm using this reference sequence (fasta file) and applying the recommended values for diploid species in genotype calling (step 7), but the number of recovered SNPs is very low.
I would like to know what values you used (or which are the most appropriate) for genotype calling (step 7) when working with chloroplast data. Should I use the values for haploid species? In that case, would the values of these parameters be: -mnHoDepth0 0 -mnHoDepth1 0 -mnHetDepth 0?

I'd appreciate any information that could resolve my doubts.
Thank you so much in advance!
Irene