Genome alignments for SNP discovery - missing chromosomes problem!

Barnaby Wharam

unread,

Mar 27, 2013, 8:19:29 AM3/27/13

to NGS...@googlegroups.com

Hi all,

I am trying to do SNP discovery and produce a phenetic tree in the worm C.elegans. I have 20 recently wild strains that have been inbred for many generations to produce homozygous inbred lines. These were then sequenced and when I was given the data they were in a sorted.bam format.

I want to produce a consensus genome sequence for each strain which I can then use to:
1. extract fastas for specific genes to create phenetic and phylogenetic trees
2. Search for SNPs that exist between the strains
I have started producing the consensus genomes through this method:

samtools mpileup –uf c_elegans.WS220.genomic.fa GENOME.bam > GENOME.bcf
bcftools view –cg GENOME.bcf > GENOME1.bcf
perl vcfutils.pl vcf2fq GENOME1.bcf > GENOME.fq

python
>>> from Bio import SeqIO
>>> count = SeqIO.convert (“/GENOME.fq”,”fastq”,”/GENOME.fasta”,”fasta”)
>>> exit()

However, 2 out of the 11 I have done so far did not produce a correct consensus sequence. On 2 of the produced .fa files 3 of the chromosomes are missing entirely, these are the same chromosomes in both of these and if I view the .bam in IGV then there is clearly a lot of gaps especially in the intronic regions.

The C.elegans reference genome is one of the best reference genomes available and is well annotated. I understand that bad sequencing data is simply bad sequencing data - however, there does seem to be decent read depth for pretty much all the exonic regions in these .bam files. Am I wrong to try to produce a consensus sequence for the whole genome if there are a lot of gaps (I don't really mind if there are lots of NNNNNNNNNNNNN because I want to lift out fastas from specific loci, and most genic regions do have decent read depth)

Is it possible to solve this by either:
1. going back to the raw data and having less stringent filtering, but then take account of this later during SNP calling?
2. changing some parameters during the mpileup to make the assembly more tolerant of gaps?

Alternatively have I approached this all the wrong way - would I be better off producing multifastas of the loci I want to make a tree with and aligning the .bam files to these?

In terms of SNP discovery can I take the same approach as Victor's practical from the workshop?

I realise these are a complex set of questions - do you need any more information in order to answer them?

Best wishes
Barney

Barnaby Wharam

unread,

Mar 27, 2013, 2:48:02 PM3/27/13

to NGS...@googlegroups.com

Claudius asked me to post my email in the forum - his initial reply was

"Hi Barnaby,

welcome to the NGS forum !

I'm not sure whether I understand where exactly you see the problem, but 3 chromosomes completely missing sounds really dodgy to me.

I have a few questions which maybe could help us troubleshooting:

Do your sequences come from whole genome shotgun sequencing or hybrid capture or another technique? Did you prepare the libraries or the sequencing centre?

The sequencing centre obviously has already done the read mapping for you. Can you show us the exact command lines they have used to do that and also of any pre-processing/filtering before that?

For the two individuals where 3 whole chromosomes are missing, could you extract all reads which did not map successfully (no flag of "0" in the second column) into fastq format and try to map them yourself against the reference genome with stampy? Are you sure the 2 individuals are C. elegans and not another species?

> there is clearly a lot of gaps especially in the intronic regions

could this be caused by topping up genomic DNA with cDNA?

> would I be better off producing multifastas of the loci I want to make a tree with and aligning the .bam files to these?

your reads in the .bam file are already aligned, but realigning is an option. However, if you have shotgun sequencing read data, I would use the whole reference genome for read mapping not just the sequences of the loci that you want to use for phylogenetics. The reason is, that samtools incorporates mapping quality scores into Bayesian SNP quality scores. Your mapping quality scores would be overoptimistic if you mapped against a partial reference sequence and there could also be mismapping because the sequence that would have created a better match doesn't occur in the reduced reference sequence.

> In terms of SNP discovery can I take the same approach as Victor's practical from the workshop?

what you are doing generally makes sense to me, even though I haven't checked your pipeline in detail and I think you are already roughly taking the same approach as in Victor's practical for SNP calling.

claudius"

To which I say:

-no cDNA was used

-It is whole genome shotgun sequencing

- Genepool in Edinburgh prepared libraries and did the readmapping

- I do not understand quite how to extract unmapped reads, do you mean unmapped reads in the .bam? Or is it from a file later in the pipeline? The missing chromosomes have many reads aligned to them in the .bam viewed in IGV. is it possible that the sequences are being lost because the default settings are too stringent during the samtools mpileup?

-I will email the sequencing centre with regards to what they did

What i don't understand is how reads are clearly mapping to the missing chromosomes in the sorted.bam, but these chromosomes have dissappeared by the time I view the fastq. I tried to check the .bcf to see if the missing chr are present at that point but my machine is unable to load it and tends to crash when try to browse it using less. I am currently applying for access to Bristol's equivalent to Iceberg.

Best wishes

Barney

claudiuskerth

unread,

Mar 28, 2013, 9:18:46 AM3/28/13

to NGS...@googlegroups.com

Hi Barney,

ok, I got you now I think. The two chromosomes are missing in the consensus sequences that you created, but the BAM files do contain lots of reads mapping to those two chromosomes. Right?

How did you see that the two chromosomes are missing in the consensus sequences for the two individuals?

samtools mpileup –uf c_elegans.WS220.genomic.fa GENOME.bam > GENOME.bcf
bcftools view –cg GENOME.bcf > GENOME1.bcf
perl vcfutils.pl vcf2fq GENOME1.bcf > GENOME.fq

You should consider setting the "scaled mutation rate" to a higher value than the default 0.001 (which applies to humans with an effective population size of ~10,000), e. g. 0.01. I think it is likely that with the default value you are very conservative in your variant discovery and biased towards the reference genotype. The second command produces a VCF file, which is human readable, not a BCF file. So, I would indicate that in the file ending. Other than that, I cannot see major problems with your three commands, which are equivalent to an example provided in the samtools manual.

Btw, if you try to open a BCF file with less it will crash if the BCF file is large. That's because BCF is a binary format without linefeed characters which are necessary for less to read only the lines into memory that are printed out to the screen. You can view VCF files with less. Don't forget the "-S" option to less, when you do that.

if I view the .bam in IGV then there is clearly a lot of gaps especially in the intronic regions

there does seem to be decent read depth for pretty much all the exonic regions in these .bam files

I think it is likely that the reduced coverage in intronic regions (in general and in the whole genome) is due to greater sequence divergence to the reference sequence as compared to exonic sequences. I would recommend using stampy (with bwa as pre-mapper) for a more sensitive read alignment suitable for more divergent reference sequences.

Am I wrong to try to produce a consensus sequence for the whole genome if there are a lot of gaps (I don't really mind if there are lots of NNNNNNNNNNNNN because I want to lift out fastas from specific loci, and most genic regions do have decent read depth)

If you did paired-end sequencing and there are lot's of gaps or insertions or chromosomal rearrangements with respect to the reference sequence, then one of the reads in a read pair will often not map successfully, which could reduce coverage in those regions of the genome. You might also consider the "-A" switch to samtools mpileup for this reason.

Don't trust the Genepool in Edinburgh! Pester them until they give you all the information about what they did with what you send to them.

Did you do the DNA extractions yourself or did you send tissue samples?

claudius

Barney Wharam

unread,

Apr 29, 2013, 7:30:40 AM4/29/13

to NGS...@googlegroups.com

Hi Claudius/All,

I tried the suggested strategy of changing the scaled mutation rate, I used the code:

bcftools view -cgt 0.01 JU319_genome.bcf > JU319_genome.vcf

the resulting fasta at the end of my protocol still had a missing chromosome

I do not understand how to implement the -A switch you talk about in samtools mpileup. I don't think it worked when I just added -A into the parameters. In the manual it is under the section:

-6	Assume the quality is in the Illumina 1.3+ encoding. -A Do not skip anomalous read pairs in variant calling.

Is this what you meant?

Best wishes,

Apologies for taking so long to reply
Barney

--
You received this message because you are subscribed to the Google Groups "NGS Group APS Sheffield" group.
To unsubscribe from this group and stop receiving emails from it, send an email to NGSshef+u...@googlegroups.com.
To post to this group, send an email to NGS...@googlegroups.com.
Visit this group at http://groups.google.com/group/NGSshef?hl=en-GB.
For more options, visit https://groups.google.com/groups/opt_out.

Reply all

Reply to author

Forward