Hi Barney,
ok, I got you now I think. The two chromosomes are missing in the consensus sequences that you created, but the BAM files do contain lots of reads mapping to those two chromosomes. Right?
How did you see that the two chromosomes are missing in the consensus sequences for the two individuals?
samtools mpileup –uf c_elegans.WS220.genomic.fa GENOME.bam > GENOME.bcf
bcftools view –cg GENOME.bcf > GENOME1.bcf
perl vcfutils.pl vcf2fq GENOME1.bcf > GENOME.fq
You should consider setting the "scaled mutation rate" to a higher value than the default 0.001 (which applies to humans with an effective population size of ~10,000), e. g. 0.01. I think it is likely that with the default value you are very conservative in your variant discovery and biased towards the reference genotype. The second command produces a VCF file, which is human readable, not a BCF file. So, I would indicate that in the file ending. Other than that, I cannot see major problems with your three commands, which are equivalent to an example provided in the samtools manual.
Btw, if you try to open a BCF file with
less it will crash if the BCF
file is large. That's because BCF is a binary format without linefeed
characters which are necessary for
less to read only the lines into
memory that are printed out to the screen. You can view VCF files with
less. Don't forget the "-S" option to less, when you do that.
if I view the .bam in IGV then there is clearly a lot of gaps especially in the intronic regions
there does seem to be decent read depth for pretty much all the exonic regions in these .bam files
I think it is likely that the reduced coverage in intronic regions (in general and in the whole genome) is due to greater sequence divergence to the reference sequence as compared to exonic sequences. I would recommend using stampy (with bwa as pre-mapper) for a more sensitive read alignment suitable for more divergent reference sequences.
Am I wrong to try to
produce a consensus sequence for the whole genome if there are a lot of
gaps (I don't really mind if there are lots of NNNNNNNNNNNNN because I
want to lift out fastas from specific loci, and most genic regions do
have decent read depth)
If you did paired-end sequencing and there are lot's of gaps or insertions or chromosomal rearrangements with respect to the reference sequence, then one of the reads in a read pair will often not map successfully, which could reduce coverage in those regions of the genome. You might also consider the "-A" switch to samtools mpileup for this reason.
Don't trust the Genepool in Edinburgh! Pester them until they give you all the information about what they did with what you send to them.
Did you do the DNA extractions yourself or did you send tissue samples?
claudius