Re: [cortex_var] obtaining whole-genome diploid consensus sequence from .vcf files generated with process_calls.pl

846 views
Skip to first unread message

Isaac Turner

unread,
Jul 16, 2012, 6:26:26 AM7/16/12
to corte...@googlegroups.com
Hi Fernando,

Using the cortex_var '--output_supernodes <output_filename.fasta>'
option you can dump graph supernodes ('consensus sequences'). I'm not
sure how PSMC works, but I'm sure you could generate a BAM from this
FASTA file using a mapper (e.g. stampy).

best,
Isaac

On 13 July 2012 12:47, Fer <fernando...@gmail.com> wrote:
> Hi all,
>
> This is probably far beyond using cortex and starts to be a bioinformic or
> NGS question (and perhaps its not the very appropriate here). I wonder if
> there is any way to get the whole-genome diploid consensus sequence needed
> for PSMC
> from the vcf files generated by process_calls.pl.
>
> At PSMC they use a aln.bam file to get this:
>
> "where `diploid.fq.gz' is typically the whole-genome diploid consensus
> sequence
> of one human individual, which can be generated by, for example:
>
> samtools mpileup -C50 -uf ref.fa aln.bam | bcftools view -c - \
> | vcfutils.pl vcf2fq -D 100 | gzip > diploid.fq.gz"
>
> Perhaps there is a way of getting the .bam from .vcf, I've seen that the
> 1000 Genome Project have developed the "Data Slicer" tool:
> http://www.1000genomes.org/data-slicer
> Anyone has experienced with this tool or have already got whole-genome
> diploid consensus sequences from the .vcf before?
>
> Thanks in advance,
> Fernando

Isaac Turner

unread,
Sep 20, 2012, 12:56:19 PM9/20/12
to corte...@googlegroups.com
Hi Fernando,

Cortex may not call at a particular region in the genome if:

1) It has low coverage (possibly caused by sequencing error)
- By aligning your genome to the cortex binaries, you can check if
there is sequence coverage in the graphs.

2) The genome region is too repetitive / similar to other regions of
the genome (low complexity)
- If you map your reads to the genome, they should have low mapping
scores where the genome has low complexity.

3) There is a high rate of variation between samples in the region
- This is the hardest to test for. If you dump supernodes and map
them to the genome, these regions will be covered by zero or many
supernodes -- although this may not be a very reliable test.

Another approach to find homozygous regions of the genomes may be to:
1) Use multiple calling approaches, taking the union set of all calls.
2) Map all your reads to the genome

Homozygous regions of the genome would then be regions that have no
variants called nearby and have high read coverage and high read
mapping scores. This isn't something I've looked into so other people
may have better suggestions.

Isaac

On 19 September 2012 11:51, Fer <fernando...@gmail.com> wrote:
> Hi Isaac,
>
> PSMC requires to know whether a base is a confident homozygous reference or
> is not callable. Is there any way to know this using cortex?
>
> I do have 11 clean/unclean binaries for 20-26 X samplaes and did call of
> genotypes and variants (using cortex_var 5.0.3.1 without using reference
> genome).
>
> Cheers,
> Fernando

Zamin Iqbal

unread,
Sep 20, 2012, 1:20:32 PM9/20/12
to corte...@googlegroups.com
Fernando, Isaac

Does PSMC really require to know this for all bases in the genome, or at called bases?
For any called base, if you have used the population filter (which you have), Cortex outputs a site quality 
giving confidence about whether it is polymorphic or not, and (if you have genotyped, which you have)
it gives a genotype confidence and call. So you can use the genotype quality to tell the difference between things that are 
confidently called hom ref or not. Given you seem to be going through a process where you run and rerun
analyses as your consensus assembly gets updated, some of Isaac's ideas, though valid, might be a lot of work,
and might face a lot of conflation with reference errors. Anyway, my suggestion to you, is use the pop filter
to pull out a subsection of the genome assembly that is good, and work within that region. 

Zam

Zamin Iqbal

unread,
Sep 21, 2012, 6:50:06 AM9/21/12
to corte...@googlegroups.com
Post from Fernando, that I somehow deleted:

Fernando, Isaac

Does PSMC really require to know this for all bases in the genome, or at called bases?

PSMC requires dividing the entire genome in  100 bp non-overlapping bins and tell in there is at least one heterozygote in each bin (K), is homozygous (T) or is missing (>=90 filtered bases) or uncalled. This way is possible to obtain the *.psmcfa file for each chromosome, for example:

>chr10
NNNTTTKKTKTTTTKTTTNNTTTNNNNNNNNNNNNNNNNNTTTNNNNNNNNTTTKKTTKT....

 
For any called base, if you have used the population filter (which you have), Cortex outputs a site quality 
giving confidence about whether it is polymorphic or not, and (if you have genotyped, which you have)
it gives a genotype confidence and call. So you can use the genotype quality to tell the difference between things that are 
confidently called hom ref or not. Given you seem to be going through a process where you run and rerun
analyses as your consensus assembly gets updated, some of Isaac's ideas, though valid, might be a lot of work,
and might face a lot of conflation with reference errors. Anyway, my suggestion to you, is use the pop filter
to pull out a subsection of the genome assembly that is good, and work within that region. 

My idea is to generate *.psmfa myself based on the vcf calls. That will allow me to state if some bins are Heterozygous (K) of Homozygous (T), but the rest will be label as "N"...this is highly conservative and I will lose information about the bins that are homozygous across all the genome (homozygous sites in all individuals are not called by cortex_var). Heng Li is very reluctanct to using the VCF calls...as stated in this thread: http://hengli.uservoice.com/forums/152783-general/suggestions/2652360-psmc-vcfutils-pl-on-vcf-file-with-multiple-individ. Of course he is pushing to use his SAMTools pipeline but PMSC really needs that information....

It's a pity having done some good work with cortex and then realizing that is not that easy to use this info with PSMC

Cheers,
Nando

Zamin Iqbal

unread,
Sep 21, 2012, 6:52:48 AM9/21/12
to corte...@googlegroups.com

Hi Fernando

OK, yes I see what you mean. I think for that analysis, you should follow Heng's advice. 

cheers

Zam

Fernando Cruz - Gmail

unread,
Sep 21, 2012, 6:29:28 AM9/21/12
to corte...@googlegroups.com
Hi Zam,

On 20 September 2012 19:20, Zamin Iqbal <zam.iqba...@gmail.com> wrote:
Fernando, Isaac

Does PSMC really require to know this for all bases in the genome, or at called bases?
PSMC requires dividing the entire genome in  100 bp non-overlapping bins and tell in there is at least one heterozygote in each bin (K), is homozygous (T) or is missing (>=90 filtered bases) or uncalled. This way is possible to obtain the *.psmcfa file for each chromosome, for example:

>chr10
NNNTTTKKTKTTTTKTTTNNTTTNNNNNNNNNNNNNNNNNTTTNNNNNNNNTTTKKTTKT....

 
For any called base, if you have used the population filter (which you have), Cortex outputs a site quality 
giving confidence about whether it is polymorphic or not, and (if you have genotyped, which you have)
it gives a genotype confidence and call. So you can use the genotype quality to tell the difference between things that are 
confidently called hom ref or not. Given you seem to be going through a process where you run and rerun
analyses as your consensus assembly gets updated, some of Isaac's ideas, though valid, might be a lot of work,
and might face a lot of conflation with reference errors. Anyway, my suggestion to you, is use the pop filter
to pull out a subsection of the genome assembly that is good, and work within that region. 
My idea is to generate *.psmfa myself based on the vcf calls. That will allow me to state if some bins are Heterozygous (K) of Homozygous (T), but the rest will be label as "N"...this is highly conservative and I will lose information about the bins that are homozygous across all the genome (homozygous sites in all individuals are not called by cortex_var). Heng Li is very reluctanct to using the VCF calls...as stated in this thread: http://hengli.uservoice.com/forums/152783-general/suggestions/2652360-psmc-vcfutils-pl-on-vcf-file-with-multiple-individ. Of course he is pushing to use his SAMTools pipeline but PMSC really needs that information....

It's a pity having done some good work with cortex and then realizing that is not that easy to use this info with PSMC

Cheers,
Nando

Fernando Cruz - Gmail

unread,
Sep 21, 2012, 7:12:22 AM9/21/12
to corte...@googlegroups.com
hehehe...bummer that there is not a shortcut in this case!
Cheers,
Fernando

ben

unread,
Oct 16, 2013, 5:40:35 PM10/16/13
to corte...@googlegroups.com
Hi Zam and Isaac,

I would like to follow up on the question of checking samples for coverage in regions without variant calls. I have a reference genome plus raw Illumina data for seven unassembled samples. I used the run_calls.pl script with the independent workflow utilizing the reference for CoordinatesAndInCalling. Everything seems to have gone well and I end up with a nice set of variants in the output vcf files. However, I would like to determine how much, if any, coverage each sample has in regions not identified as variants in the vcf files. In this previous post, it says that I should align my genome to the cortex binaries to check for coverage in the graphs. However, I am not totally sure which binaries this means. Should I just take the binary file that I created from the raw data for each sample and align the reference sequence against each of those separately using "--align"? Is there a way to align them all simultaneously? 

Thanks for your help,
Ben

Zamin Iqbal

unread,
Oct 22, 2013, 5:06:30 PM10/22/13
to corte...@googlegroups.com
Hi Ben

Sorry for slow reply to this, have been travelling.



I would like to follow up on the question of checking samples for coverage in regions without variant calls. I have a reference genome plus raw Illumina data for seven unassembled samples. I used the run_calls.pl script with the independent workflow utilizing the reference for CoordinatesAndInCalling. Everything seems to have gone well and I end up with a nice set of variants in the output vcf files. However, I would like to determine how much, if any, coverage each sample has in regions not identified as variants in the vcf files.

OK, so I think the best thing to do here is for each sample, load its cleaned binary and dump the super nodes (=conservative reliable contigs)

cortex_var_31_c1 --kmer blah --mem_height blah --mem_width blah --multicolour_bin my_sample.clean3.ctx --output_supernodes sample.contigs --print_colour_coverages 

(This will print the contigs + coverages in all the kmers in the contigs)
and then map them to the reference.

 
In this previous post, it says that I should align my genome to the cortex binaries to check for coverage in the graphs. However, I am not totally sure which binaries this means.

In the output directory of your run_calls run, look for binaries/cleaned/sample.whatever.ctx
 
Should I just take the binary file that I created from the raw data

or the cleaned one
 
for each sample and align the reference sequence against each of those separately using "--align"?

yes, you can do this - I think mapping the contigs, and correlating what you see with the kmer coverage printed with the contigs is easier 
 
Is there a way to align them all simultaneously? 


Yes, make a multicolour graph of all samples and use --align will work. But also, loading all samples into a multicolour graph and --output_supernodes and --print_colour_coverages will give you info on all samples at the same time. Just depends on whether you have the memory for it.

cheers

Zam 

Reply all
Reply to author
Forward
0 new messages