1. Please join my meeting, Apr 2, 2015 at 1pm EDT/10am PDT.
https://global.gotomeeting.com/join/364735613
2. Use your microphone and speakers (VoIP) - a headset is recommended. Or, call in using your telephone.
Dial +1 (224) 501-3212
Access Code: 364-735-613
Audio PIN: Shown after joining the meeting
Meeting ID: 364-735-613
GoToMeeting®
Online Meetings Made Easy®
Not at your computer? Click the link to join this meeting from your iPhone®, iPad®, Android® or Windows Phone® device via the GoToMeeting app.
Hi all,
In preparation for our Analysis Group call at 1pm EDT this Thursday, the leadership team has developed a strawman analysis plan at https://drive.google.com/file/d/0B7Ao1qqJJDHQZUdVQTFlaHVhVFE/view?usp=sharing. This is intended to start discussion in the group, and we hope that you will suggest changes and additional analyses that you or others might be interested in doing. We've divided the plan into 4 main parts: assembly, mapping/small variant calling, phasing, and structural variant calling. We have also updated the list of datasets for the PGP trios and progress creating them and making them available at https://drive.google.com/file/d/0B7Ao1qqJJDHQVm1ZX05mV2k4T0E/view?usp=sharing. The folder also contains notes from our leadership team calls (https://drive.google.com/folderview?id=0B6euVUz1tpLofkdHYlBsTzc2em96aVJEWkNqbGtDNnVJVHF6TE92Z2ttRnhRS1oxQVBTaUE&usp=sharing).
Call-in details are below, and we look forward to talking with many of you tomorrow!
Cheers,
Justin
On Tuesday, March 24, 2015 at 4:28:41 PM UTC-4, Justin Zook wrote:
Hi All,
We will be holding the first call to kick off our new GIAB Analysis Group at 1pm EDT (10am PDT) on April 2. This group will be co-led by Francisco De La Vega, Chris Mason, Valerie Schneider, and Tina Graves, with assistance from Marc and me. We will be discussing plans for analyzing and integrating the short and long read data from the GIAB PGP Trios, and we hope that any of you who are interested in participating in these analyses will be able to join, or that you'll be able to have a representative on the call. If you can't make the call but are interested in performing analyses, please let us know before the call. We will be sending a strawman analysis plan to get us started in the coming days, and we'll also welcome any additional ideas for analyses. Information about joining the meeting is below. Hope to talk with many of you next week!
Cheers,
Justin
1. Please join my meeting, Apr 2, 2015 at 1pm EDT/10am PDT.
https://global.gotomeeting.com/join/364735613
2. Use your microphone and speakers (VoIP) - a headset is recommended. Or, call in using your telephone.
Dial +1 (224) 501-3212
Access Code: 364-735-613
Audio PIN: Shown after joining the meeting
Meeting ID: 364-735-613
GoToMeeting®
Online Meetings Made Easy®
Not at your computer? Click the link to join this meeting from your iPhone®, iPad®, Android® or Windows Phone® device via the GoToMeeting app.
--
You received this message because you are subscribed to the Google Groups "Genome in a Bottle" group.
To unsubscribe from this group and stop receiving emails from it, send an email to genome-in-a-bot...@googlegroups.com.
To post to this group, send email to genome-in...@googlegroups.com.
Visit this group at http://groups.google.com/group/genome-in-a-bottle.
To view this discussion on the web visit https://groups.google.com/d/msgid/genome-in-a-bottle/73ae3991-2909-4e1b-a77b-fb240a82cf70%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Wholly agree with Steve here. Is there a reason we can't generate these call sets for both 37 and 38 without lift over?
Sorry I was not in on the call. I am currently getting a kidney panel out for CAP/CLIA certification and we have used NA12878 and GIAB data from the very beginning of our test validation. From a clinical point of view I am not ready to go to 38 at this time, nor for a while. The amount of time and expense that goes into the validation of an NGS test makes introducing new content difficult. I would like to see more gene variants examined in the 37 build. I have found that in some cases genes that are supposed to be in segmental duplications, and therefore are not called in the high-confidence build, can in fact be reliably called, and that other genes that are called should not have been: for example, genes with pseudogenes, genes with high homology to other genes, and those with VNTRs. I think it would be very powerful for the analysis to really define these issues until long reads are the norm, because my take from the meetings I go to is that a lot of people in NGS do not recognize the problems or don’t know what they are. This consortium is to be commended for generating quality data, and I think more can be done with 37.
Thanks.
Marjorie
Hi all,
Thanks all for this discussion! I wanted to restart this discussion about GRCh37 vs. GRCh38 since it would be good to settle on a plan for this. My sense based on this discussion and one-on-one discussions is that it's probably best to start with GRCh37, but also encourage everyone to start experimenting with GRCh38, since generating GIAB calls for GRCh38 could help people move to using it more quickly. The main reasons for calling on GRCh37 first are:
1. Most of our customers are likely to continue using GRCh37 for some time, regardless of what we do.
2. Most of the current analysis methods have been developed for h37, so we'll probably get good calls faster for h37 and can more easily compare to previous results (e.g., comparing the high-confidence regions on the PGP genomes to NA12878) and to other projects (e.g., the 1000 Genomes SV group is still primarily using h37).
3. Some of us have already started to analyse these genomes using h37.
4. It will likely take some work to get to a reference for h38 that we all agree is good, even if we don't include the ALTs.
All of that said, I do recognise all of the advantages of h38, and that we are uniquely positioned to help develop methods to use h38 and assess the advantages of it and assess accuracy of variant calls against it. Therefore, this is what I propose as a path forward:
Thanks, Graham. Regarding chromosome sorting: if you have methods to do this and would be interested in doing it, it seems like it could be complementary to some of our other methods that provide shorter-range phasing, like Moleculo, LFR and 10X. What do others think? Would anyone else be interested in analyzing chromosome-sorted data if we had it? It seems like it may also help with mapping homologous regions if they fall on different chromosomes.
On Fri, Apr 24, 2015 at 11:27 PM Graham Taylor <graham...@unimelb.edu.au> wrote:
In my experience, the HaploSeq method from Bing Ren and colleagues [1] is a very accurate method to phase whole chromosomes and is less expensive than the long-read Moleculo methods. I do not know if this information is helpful or not. Some of us like to examine noncoding regulatory elements and the epigenome, so this may not be of relevance to the GIAB discussion.
[1] Selvaraj S, Dixon JR, Bansal V, Ren B. Whole-genome haplotype reconstruction using proximity-ligation and shotgun sequencing. Nature Biotech. 31, 1111–1118 (2013) doi:10.1038/nbt.2728
-Gerry Higgins, Ph.D., M.D.
Professor of Computational Medicine and Bioinformatics
University of Michigan Medical School, Ann Arbor, MI
---------
Vice President, Pharmacogenomic Science
AssureRx Health, Inc., Mason, OH
Justin;
Thanks for bringing this up. We've been exploring GRCh38 more and would
love to have validation materials there. We need these to
validate/improve pipelines since having GiaB for NA12878 has been
invaluable. Heng's hapdip benchmarks look really good and personally I
really want possible HLA typing out of the box:
https://github.com/lh3/bwa/blob/master/README-alt.md#preliminary-evaluation
Has anyone worked on Liftover/remap/Crossmap of the GRCh37 GiaB reference to
GRCh38? I know this is imperfect, but I've come to the conclusion we're
going to have to accept Liftover at least in the short term. This is a
summary of current work we're doing trying to move to 38:
https://github.com/chapmanb/bcbio-nextgen/issues/817
NCBI has excellent starting materials for this which includes both full
and no-alt versions, along with pre-built bowtie and bwa indices:
ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh38/seqs_for_alignment_pipelines/
I've been starting with these. Here is the set of GRCh38-noalt materials
we've put together so far:
https://github.com/chapmanb/cloudbiolinux/tree/master/ggd-recipes/hg38-noalt
Our plan is to liftover/remap/crossmap resources not likely to move soon
(including GiaB if no one has already done it). I'd love to talk with
anyone that has recommendations or experience with these. Thanks much,
Brad
Yeah, we're in that same boat.
We do use a patched 37, where we've incorporated changes that affect (directly or through mis-mapping) the specific genes we test for. As you say, changing reference genomes is not easy, and that's for a whole bunch of different reasons. Unfortunately (given that we've incorporated the changes that help us into 37) moving to 38 won't help us a whole lot more immediately. Thus it's hard to justify doing quickly.
As Deanna and others have talked about very articulately, getting a more complete exome or genome, including some medically relevant genes, requires the 38 reference -- that's no doubt correct. We will all wind up there.
I would like to second Steve's comments. Unfortunately, I was not able to call in but can back up Steve on the need for a 37 call set because it is going to be a long and arduous process for clinical labs to switch to 38. These labs have the biggest need for multiple references so meeting that need is critical for GIAB's success. There is no point in generating a reference that will not be widely used or used by only a small subset with the capacity to change from 37 to 38 more easily.
John
A few things:
1- I would recommend NCBI remap (http://www.ncbi.nlm.nih.gov/genome/tools/remap) over liftOver. It is slower but it is designed to deal with the paralogous expansion/collapse of sequences that represent a large amount of the change between 37 and 38. It also understands the assembly model better. I did volunteer to highlight some of the differences between liftOver and remap- I am working on this now. I am moving over GIAB as well as a few other data sets.
But note: dbSNP already has a build using 38. IMHO it is imperfect, but it is a reasonable start. There are also gene builds from both NCBI (RefSeq) and Ensembl on 38, so there is decent gene annotation here, and no real reason to do this again. I think Ensembl also just released a regulatory build on 38.
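For anyone who wants to look at remap vs. liftOver differences on their own data, here is a minimal sketch of one way to compare the two outputs. It assumes both tools were run on the same input BED and that each interval carries a unique name in column 4; the function names and the toy coordinates are made up for illustration and are not taken from either tool.

```python
from typing import Dict, Iterable, List, Tuple

Interval = Tuple[str, int, int]


def read_bed(lines: Iterable[str]) -> Dict[str, Interval]:
    """Index BED intervals by the name in column 4 (assumed unique)."""
    intervals: Dict[str, Interval] = {}
    for line in lines:
        # Skip blank lines and the optional BED header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end, name = line.rstrip("\n").split("\t")[:4]
        intervals[name] = (chrom, int(start), int(end))
    return intervals


def compare_conversions(first: Dict[str, Interval],
                        second: Dict[str, Interval]) -> Dict[str, List[str]]:
    """Group interval names by how the two conversions agree."""
    report: Dict[str, List[str]] = {
        "same": [], "different": [], "only_first": [], "only_second": []
    }
    for name in sorted(set(first) | set(second)):
        if name not in second:
            report["only_first"].append(name)
        elif name not in first:
            report["only_second"].append(name)
        elif first[name] == second[name]:
            report["same"].append(name)
        else:
            report["different"].append(name)
    return report


# Toy data, invented purely to exercise the functions.
liftover_out = ["chr1\t100\t200\tregionA", "chr1\t300\t400\tregionB"]
remap_out = ["chr1\t100\t200\tregionA", "chr1\t305\t400\tregionB",
             "chr2\t10\t20\tregionC"]

report = compare_conversions(read_bed(liftover_out), read_bed(remap_out))
for category, names in report.items():
    print(category, names)
```

Intervals that land in "different" or in only one output are probably the ones worth inspecting by hand, since those are where the two approaches disagree.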
Hi folks:
Apologies for the late entry into this email thread. First, to address a couple of data-related questions:
1. Does dbSNP have first assembly information?
There is a VCF info field called: ##INFO=<ID=dbSNPBuildID,Number=1,Type=Integer,Description="First dbSNP Build for RS">. I have checked with dbSNP about this field. This field should enable a user to infer the assembly (since there is an assembly per SNP build). However, it should be noted this is the first assembly in which dbSNP mapped the rsid, not necessarily the assembly in which the variant was ascertained by the submitter.
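As a rough illustration of pulling that field out of a dbSNP VCF, here is a minimal sketch in plain Python. The example record is invented, and the function name is hypothetical; mapping the returned build number to an assembly would still need a build-to-assembly lookup, which is not shown here.

```python
from typing import Optional


def dbsnp_build_id(vcf_line: str) -> Optional[int]:
    """Return the dbSNPBuildID value from one VCF data line, or None if absent."""
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("expected at least 8 tab-separated VCF columns")
    # INFO is the 8th column: semicolon-separated KEY=VALUE pairs or bare flags.
    for entry in fields[7].split(";"):
        key, _, value = entry.partition("=")
        if key == "dbSNPBuildID" and value:
            return int(value)
    return None


# Invented record for illustration only; not a real dbSNP entry.
example = "1\t12345\trs0\tA\tG\t.\t.\tRS=0;dbSNPBuildID=129;VC=SNV"
print(dbsnp_build_id(example))
```

As noted above, the value is the first build in which dbSNP mapped the rsid, so it should be treated as a hint about the assembly rather than as the submitter's ascertainment assembly.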
2. Standard fastas for GRCh38 and GRCh37:
As Brad mentioned, the GRC has produced a set of analysis sets (fasta) for GRCh38 that are available on the GenBank FTP site:
There are a few flavors, all of which also include EBV:
GRCh38_no_alt_analysis_set: This excludes the alternate loci
GRCh38_full_analysis_set: This includes the alternate loci
GRCh38_full_plus_hs38d1_analysis_set: This includes the alternate loci, plus the decoy that Heng Li built for GRCh38 (hs38d1; GCA_000786075.2)
The hs38d1 decoy was constructed to add sequence that does not have an exact 101 bp match to the GRCh38 full assembly. It does not include the HLA sequences on Heng’s BWA site b/c those HLA sequences lack INSDC accessions, and the GRC only includes accessioned sequences in the reference or any analysis sets it puts its name on.
The 1000G project is going to begin mapping reads to GRCh38 very soon. They will be using the full assembly with alts, the hs38d1 decoy and the HLA sequences in their analysis set. The fasta they will use is here: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/
This analysis set was developed in conjunction with the GRC and Heng Li. The deflines in the files should be identical to the ones on the GenBank GRCh38 FTP site (with the exception of HLA) to ensure as much compatibility as possible. Laura Clarke, who is involved in this effort for 1000G, is also involved in the GRC. If GiaB ultimately chooses to use HLA, I would recommend we use this analysis set for consistency.
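To check which flavor a downloaded FASTA actually is, one option is to tally sequence names from its `.fai` index (as written by `samtools faidx`). The sketch below assumes the UCSC-style names used in these analysis sets, i.e. `_alt` and `_decoy` suffixes, `chrEBV`, and `HLA-` prefixes for the HLA sequences in the 1000G file; if your deflines differ, the rules need adjusting. The function names and the index lines are made up for illustration.

```python
from collections import Counter
from typing import Iterable


def classify_contig(name: str) -> str:
    """Assign a sequence name to a coarse category (naming rules are assumptions)."""
    if name == "chrEBV":
        return "EBV"
    if name.startswith("HLA-"):
        return "HLA"
    if name.endswith("_decoy"):
        return "decoy"
    if name.endswith("_alt"):
        return "alt"
    # Everything else: chromosomes, chrM, unlocalized and unplaced scaffolds.
    return "primary_or_other"


def summarize_fai(fai_lines: Iterable[str]) -> Counter:
    """Count categories from .fai lines (tab-separated, name in column 1)."""
    counts: Counter = Counter()
    for line in fai_lines:
        if line.strip():
            counts[classify_contig(line.split("\t", 1)[0])] += 1
    return counts


# Toy index lines; the lengths and offsets are placeholders, not real values.
toy_fai = [
    "chr1\t1000\t6\t60\t61",
    "chr1_KI270762v1_alt\t1000\t1100\t60\t61",
    "chrUn_KN707606v1_decoy\t1000\t2200\t60\t61",
    "chrEBV\t1000\t3300\t60\t61",
]
print(summarize_fai(toy_fai))
```

Under these assumptions, a no-alt set should report zero `alt` entries, and only the 1000G file should report any `HLA` entries.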
A set of analysis files for the GRCh37.p13 assembly also exists in GenBank:
There are no-alt and full versions.
In contrast to the GRCh38 set, these do not include EBV, and the hs37d5 decoy is not included.
If we go with GRCh37, I would encourage mapping to an analysis set that includes the decoy+EBV; namely the one from 1000G: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/.
Thus, I think we are able to define an analysis set of fasta for GRCh38 that is “good”: the question is primary-only or full assembly. I also think it will be a missed opportunity if we don’t do analysis on GRCh38; there’s an inertia to moving, and it’s generally large projects that need to lead the charge. We’re talking about GRCh37 in part b/c of not using alts. Even if we can’t handle alts, I’d really hate to see us not take advantage of the improvements in the rest of the GRCh38 assembly.
I’m interested to hear how the Remap/LiftOver comparison looks for the older GiaB data.
-Valerie
From: Juan Rodriguez-Flores [mailto:jur...@med.cornell.edu]
Sent: Monday, April 27, 2015 8:14 AM
To: Michael James Clark
Cc: Deanna Church; Brad Chapman; Justin Zook; Steve Lincoln; Marjorie Beggs; John Thompson; genome-in...@googlegroups.com
Subject: Re: [GIAB] First GIAB Analysis Group Call - Just to Clarify 37 vs 38
Hi Michael,
Hi Justin and all,
I have been catching up on all the email exchanges over the weekend. Thanks for raising two important questions; I would like to share some of our thoughts and experience from a mapping point of view.
1) We have been using 37, and more recently 38 as well, as the base for generating the in silico motif maps that we compare our Irys-generated de novo maps against. I have attached a few slides that show some examples graphically. Many segmental duplications, and especially inverted repeats, that are captured by BioNano de novo consensus maps are often missing or “condensed” in GRCh37, and so were called as “SVs”; many of these were later “corrected” in GRCh38. Some N-base gaps were also corrected, although some inverted repeats remain, as we have observed. Slide 1 shows an example on chr1 near 149 Mb: when complete, intact molecules spanning a ~700 kb area were imaged, the repeat structure was preserved and strongly supported. We found GRCh38 to be more accurate in representing many of the larger structural variations (SVs). Slide 2 shows similar cases on chr7 and chr8. The in silico digested motif map using 37 or 38 as the base is always shown in green, while the actual consensus map generated from single-molecule imaging is shown in blue.
In slide 3, we found that certain N-base regions of unknown size, in both 37 and 38, could be mapped precisely. By aligning many human consensus maps generated from a population, we found these regions to be fairly conserved, in size at least, across samples from different ethnic backgrounds. Hence I think we could work together to better annotate the lengths of these unknown gaps in the genome and the reference.
2) We have tested flow-sorted DNA for genome mapping. Alex and I have often found these to be among the cleanest and best samples to work with for generating super-long genomic DNA, and, as you can see in slide 1, long, physically intact molecules provide the best phasing distance. If sorting is done at the chromosome level, I believe that if aneuploid chromosomes could be sorted by size and identified by unique sequence, it would be very interesting for cancer cells as well. This approach has become especially valuable for generating gold-standard model references for large, complex plant genomes such as wheat and barley, among others. We have some literature on chromosome sorting methods, and we hope that more core facilities with flow sorters will be interested in learning and optimizing the method and in providing this not-so-common service to the community. If flow sorting protocols were used more widely, downstream data and analysis would be much cleaner.
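For readers unfamiliar with the in silico motif maps mentioned in point 1), the basic idea can be sketched in a few lines: scan the reference for a recognition motif on both strands and record the spacing between sites. The motif used below (GCTCTTC) and the toy sequence are assumptions chosen for illustration, and the function names are hypothetical; this is not the production software used to build the maps.

```python
from typing import List

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def motif_sites(sequence: str, motif: str) -> List[int]:
    """0-based start positions where the motif occurs on either strand."""
    seq = sequence.upper()
    targets = {motif.upper(), reverse_complement(motif.upper())}
    k = len(motif)
    return [i for i in range(len(seq) - k + 1) if seq[i:i + k] in targets]


def site_spacings(sites: List[int]) -> List[int]:
    """Distances between consecutive sites; this is what gets compared to measured maps."""
    return [b - a for a, b in zip(sites, sites[1:])]


# Toy sequence with one forward-strand and one reverse-strand occurrence.
toy_reference = "AAGCTCTTCAAAGAAGAGCTT"
sites = motif_sites(toy_reference, "GCTCTTC")
print(sites, site_spacings(sites))
```

Runs of N in the reference contain no sites, so a gap of unknown size shows up as a spacing that disagrees with the measured map, which is how the gap sizes described for slide 3 can be estimated.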
Cheers,
Han Cao, Ph.D.
Chief Scientific Officer
BioNano Genomics, Inc
9640 Towne Centre Drive, Ste. 100, San Diego, CA 92121
Tel: 858.888.7614; Fax 858.430.5927
Nancy:
Let me see if I can get a GRCh38_no_alt+EBV+decoy(hs38d1) up on the GenBank FTP site quickly, so there would be no need to modify what is downloaded.
-Valerie
From: Hansen, Nancy (NIH/NHGRI) [E]
Sent: Tuesday, May 19, 2015 10:31 AM
To: Schneider, Valerie (NIH/NLM/NCBI) [E]; Juan Rodriguez-Flores; Michael James Clark
Cc: Deanna Church; Brad Chapman; Justin Zook; Steve Lincoln; Marjorie Beggs; John Thompson; genome-in...@googlegroups.com
Subject: Re: [GIAB] First GIAB Analysis Group Call - Just to Clarify 37 vs 38
Hi all,
On Sunday, April 26, 2015, Deanna Church <deanna...@personalis.com> wrote:
Hi Brad,
I can share the GIAB data- and maybe I'll use that as a demo set of remap vs. liftOver. Does that work?
I don't think dbSNP typically subsets the data in the way you want. If you ask them, they might be able to provide the data you want on GRCh38 coordinates.
best,
Deanna
April 26, 2015 at 12:13 PM
A few things:
1- I would recommend NCBI remap (http://www.ncbi.nlm.nih.gov/genome/tools/remap) over liftOver. It is slower but it is designed to deal with the paralogous expansion/collapse of sequences that represent a large amount of the change between 37 and 38. It also understands the assembly model better. I did volunteer to highlight some of the differences between liftOver and remap- I am working on this now. I am moving over GIAB as well as a few other data sets.
But note: dbSNP already has a build using 38. IMHO it is imperfect, but it is a reasonable start. There are also gene builds from both NCBI (RefSeq) and Ensembl on 38, so there is decent gene annotation here, and no real reason to do this again. I think Ensembl also just released a regulatory build on 38.
2- For those interested in exploring how to use alts: there are lots of alts on 37. There were 3 regions released with the major release, but 60 novel patches were released prior to 38, so for folks who want to play around in a space that is well known (37), there is data to play around with.
3- I think it will be a huge missed opportunity if we don't do this work on 38. Validation data sets exist for 37, but not for 38. Until such data exists for 38, it will be impossible for clinical labs to move. GIAB has an opportunity to lead this charge (though note, 1000 genomes is in the process of aligning and calling on 38 now). GIAB can provide important leadership here and we should not miss the opportunity.
best,
Deanna
April 24, 2015 at 9:35 PM
Hi all,
Thanks all for this discussion! I wanted to restart this discussion about GRCh37 vs. GRCh38 since it would be good to settle on a plan for this. My sense based on this discussion and one-on-one discussions is that it's probably best to start with GRCh37, but also encourage everyone to start experimenting with GRCh38, since generating GIAB calls for GRCh38 could help people move to using it more quickly. The main reasons for calling on GRCh37 first are:
1. Most of our customers are likely to continue using GRCh37 for some time, regardless of what we do.
2. Most of the current analysis methods have been developed for h37, so we'll probably get good calls faster for h37 and can more easily compare to previous results (e.g., comparing the high-confidence regions on the PGP genomes to NA12878) and to other projects (e.g., the 1000 Genomes SV group is still primarily using h37)
3. Some of us have already started to analyze these genomes using h37.
4. It will likely take some work to get to a reference for h38 that we all agree is good, even if we don't include the ALTs.
All of that said, I do recognize all of the advantages of h38, and that we are uniquely positioned to help develop methods to use h38 and assess the advantages of it and assess accuracy of variant calls against it. Therefore, this is what I propose as a path forward:
1. Decide on a reference for h37 - since 1000 Genomes is using GRCh37 with decoy sequences in their latest analyses, I propose we use hs37d5.fa.gz here: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/. Another option would be to use the 1000 Genomes phase 1 reference, which was GRCh37 without decoy here: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/.
2. Encourage everyone to start experimenting with h38. We might want to decide on a suggested reference for h38 if you want to try using current methods that aren't ALT-aware, and on another suggested reference if you want to develop methods that are ALT-aware. For the first, I'm not aware of any standard fastas. For the second, 1000 Genomes has a fasta that Heng Li developed at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome, so it might be a good start to have comparability with other projects. Any suggestions for a fasta that doesn't contain ALTs?
3. Is anyone interested in testing methods for converting calls and bed files from h37 to h38 and vice versa? Even if this doesn't result in the best calls, it might be really useful to our work as we develop calls for both so that we can compare them.
Does this sound like a reasonable plan forward to everyone?
Thanks!
Justin
On Thursday, April 2, 2015, Steve Lincoln <steve....@me.com> wrote:
-----
Michael James Clark, PhD
The hs38d1 decoy (for GRCh38) is composed of sequences not present in the GRCh38 full assembly. In contrast, the hs37d1 decoy (for GRCh37) was based solely on the GRCh37 primary assembly.
What this means is that the hs38d1 decoy does not contain sequence whose only occurrence in the GRCh38 assembly is on the alts. It is not incorrect to use it in the context of the GRCh38 primary assembly only and it does add “missing sequence”, but one should be aware that it does not provide any representation for sequence that is unique to the alternate loci.
We have not seen an assessment of the value added by the hs38d1 decoy to GRCh38 primary (as opposed to GRCh38 full), so we don’t know how much benefit is gained from the missing sequence it provides in an alt-free context. Because we have been unsure how much usage a GRCh38_noalt+hs38d1 decoy analysis set would get, we have not yet released one at NCBI.
-Valerie
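For anyone who wants to experiment with an alt-free reference in the meantime, a minimal sketch of stripping alternate-locus records from a FASTA stream is below. It assumes alt contigs can be recognized by an `_alt` suffix on the sequence name, as in UCSC-style GRCh38 naming; the function name and the toy records are hypothetical. It only filters records and says nothing about how an official analysis set is built.

```python
import io

def drop_alt_contigs(fasta_in, fasta_out, alt_suffix="_alt"):
    """Copy FASTA records to fasta_out, skipping names that end with alt_suffix.

    Returns the list of sequence names that were kept.
    """
    kept = []
    keep = False
    for line in fasta_in:
        if line.startswith(">"):
            # The sequence name is the first whitespace-delimited token after '>'.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            keep = not name.endswith(alt_suffix)
            if keep:
                kept.append(name)
        if keep:
            fasta_out.write(line)
    return kept

# Toy input; these contig names are made up for illustration.
toy = io.StringIO(
    ">chr1 primary\nACGT\nACGT\n"
    ">chr1_TOY0001v1_alt alternate locus\nTTTT\n"
    ">chrUn_TOY0002v1_decoy decoy\nGGGG\n"
)
out = io.StringIO()
kept_names = drop_alt_contigs(toy, out)
```

Note that decoy records pass through this filter untouched, so the result corresponds to a "noalt plus decoy" layout only if the input already contained the decoy.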
From: Hansen, Nancy (NIH/NHGRI) [E]
Sent: Tuesday, May 19, 2015 11:18 AM
To: Schneider, Valerie (NIH/NLM/NCBI) [E]; Juan Rodriguez-Flores; Michael James Clark
Cc: Deanna Church; Brad Chapman; Justin Zook; Steve Lincoln; Marjorie Beggs; John Thompson; genome-in...@googlegroups.com
Subject: Re: [GIAB] First GIAB Analysis Group Call - Just to Clarify 37 vs 38
Thanks, Valerie. That would be great. Do you (and others) agree that this is a realistic reference for GIAB GRCh38 analyses if the aligner used is not alt aware?
--
Comparative Genomics Analysis Unit, NHGRI
On Sunday, April 26, 2015, Deanna Church <deanna...@personalis.com> wrote:
Hi Brad,
I can share the GIAB data- and maybe I'll use that as a demo set of remap vs. liftOver. Does that work?
I don't think dbSNP typically subsets the data in the way you want. If you ask them, they might be able to provide the data you want on GRCh38 coordinates.
best,
Deanna
April 26, 2015 at 12:13 PM
A few things:
1- I would recommend NCBI remap (http://www.ncbi.nlm.nih.gov/genome/tools/remap) over liftOver. It is slower, but it is designed to deal with the paralogous expansion/collapse of sequences that represent a large amount of the change between 37 and 38. It also understands the assembly model better. I did volunteer to highlight some of the differences between liftOver and remap, and I am working on this now. I am moving over GIAB as well as a few other data sets.
But note: dbSNP already has a build using 38. IMHO it is imperfect, but it is a reasonable start. There are also gene builds from both NCBI (RefSeq) and Ensembl on 38, so there is decent gene annotation here and no real reason to do this again. I think Ensembl also just released a regulatory build on 38.
2- For those interested in exploring how to use alts: there are lots of alts on 37. There were 3 regions released with the major release, but 60 novel patches were released prior to 38, so for folks who want to play around in a space that is well known (37), there is data to play around with.
3- I think it will be a huge missed opportunity if we don't do this work on 38. Validation data sets exist for 37, but not for 38. Until such data exists for 38, it will be impossible for clinical labs to move. GIAB has an opportunity to lead this charge (though note, 1000 genomes is in the process of aligning and calling on 38 now). GIAB can provide important leadership here and we should not miss the opportunity.
best,
Deanna
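One simple way to highlight differences between two conversions of the same features (say, one file produced by remap and one by liftOver) is to key each record by name and compare the resulting coordinates. The sketch below assumes both outputs are tab-delimited BED with a unique name in column 4; the function names and toy coordinates are hypothetical, and the code does not call either tool.

```python
def parse_bed(lines):
    """Map each feature name (BED column 4) to its (chrom, start, end)."""
    features = {}
    for line in lines:
        # Skip blank lines and common BED header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end, name = line.rstrip("\n").split("\t")[:4]
        features[name] = (chrom, int(start), int(end))
    return features

def compare_conversions(bed_a, bed_b):
    """Classify features as same, different, or present in only one output."""
    a, b = parse_bed(bed_a), parse_bed(bed_b)
    report = {"same": [], "different": [], "only_a": [], "only_b": []}
    for name in sorted(set(a) | set(b)):
        if name not in b:
            report["only_a"].append(name)
        elif name not in a:
            report["only_b"].append(name)
        elif a[name] == b[name]:
            report["same"].append(name)
        else:
            report["different"].append(name)
    return report

# Toy records standing in for the two tools' outputs (coordinates are made up).
tool_a_out = [
    "chr1\t100\t200\tregion1\n",
    "chr1\t500\t600\tregion2\n",
    "chr2\t50\t80\tregion3\n",
]
tool_b_out = [
    "chr1\t100\t200\tregion1\n",
    "chr1\t520\t620\tregion2\n",
]
report = compare_conversions(tool_a_out, tool_b_out)
```

Features that land in "different" or in only one output are the ones worth inspecting by hand, since they are where the two approaches disagree.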
April 25, 2015 at 12:05 PM
Justin;
Thanks for bringing this up. We've been exploring GRCh38 more and would
love to have validation materials there. We need these to
validate/improve pipelines since having GiaB for NA12878 has been
invaluable. Heng's hapdip benchmarks look really good and personally I
really want possible HLA typing out of the box:
https://github.com/lh3/bwa/blob/master/README-alt.md#preliminary-evaluation
Has anyone worked on Liftover/remap/Crossmap of the GRCh37 GiaB reference to
GRCh38? I know this is imperfect, but I've come to the conclusion we're
going to have to accept Liftover at least in the short term. This is a
summary of current work we're doing trying to move to 38:
https://github.com/chapmanb/bcbio-nextgen/issues/817
NCBI has excellent starting materials for this which includes both full
and no-alt versions, along with pre-built bowtie and bwa indices:
ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh38/seqs_for_alignment_pipelines/
I've been starting with these. Here is the set of GRCh38-noalt materials
we've put together so far:
https://github.com/chapmanb/cloudbiolinux/tree/master/ggd-recipes/hg38-noalt
Our plan is to liftover/remap/crossmap resources not likely to move soon
(including GiaB if no one has already done it). I'd love to talk with
anyone that has recommendations or experience with these. Thanks much,
Brad
April 24, 2015 at 9:35 PM
Hi all,
Thanks all for this discussion! I wanted to restart this discussion about GRCh37 vs. GRCh38 since it would be good to settle on a plan for this. My sense based on this discussion and one-on-one discussions is that it's probably best to start with GRCh37, but also encourage everyone to start experimenting with GRCh38, since generating GIAB calls for GRCh38 could help people move to using it more quickly. The main reasons for calling on GRCh37 first are:
1. Most of our customers are likely to continue using GRCh37 for some time, regardless of what we do.
2. Most of the current analysis methods have been developed for h37, so we'll probably get good calls faster for h37 and can more easily compare to previous results (e.g., comparing the high-confidence regions on the PGP genomes to NA12878) and to other projects (e.g., the 1000 Genomes SV group is still primarily using h37)
3. Some of us have already started to analyze these genomes using h37.
4. It will likely take some work to get to a reference for h38 that we all agree is good, even if we don't include the ALTs.
All of that said, I do recognize all of the advantages of h38, and that we are uniquely positioned to help develop methods to use h38 and assess the advantages of it and assess accuracy of variant calls against it. Therefore, this is what I propose as a path forward:
1. Decide on a reference for h37 - since 1000 Genomes is using GRCh37 with decoy sequences in their latest analyses, I propose we use hs37d5.fa.gz here: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/. Another option would be to use the 1000 Genomes phase 1 reference, which was GRCh37 without decoy here: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/.
2. Encourage everyone to start experimenting with h38. We might want to decide on a suggested reference for h38 if you want to try using current methods that aren't ALT-aware, and on another suggested reference if you want to develop methods that are ALT-aware. For the first, I'm not aware of any standard fastas. For the second, 1000 Genomes has a fasta that Heng Li developed at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome, so it might be a good start to have comparability with other projects. Any suggestions for a fasta that doesn't contain ALTs?
3. Is anyone interested in testing methods for converting calls and bed files from h37 to h38 and vice versa? Even if this doesn't result in the best calls, it might be really useful to our work as we develop calls for both so that we can compare them.
Does this sound like a reasonable plan forward to everyone?
Thanks!
Justin
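To make the idea of converting bed intervals between assemblies concrete, here is a toy sketch of mapping coordinates through ungapped aligned blocks, which is the basic notion behind chain-style conversion. The block table, function names, and coordinates are all hypothetical. This is not how liftOver or NCBI remap is implemented, and it ignores strand changes, split intervals, and paralogous regions.

```python
from bisect import bisect_right

# Hypothetical aligned blocks on one chromosome:
# (old_start, old_end, new_start), 0-based half-open, sorted by old_start.
TOY_BLOCKS = [
    (0, 1000, 0),        # unchanged region
    (1000, 2000, 1500),  # 500 bp were inserted ahead of this block in the new assembly
    (2500, 3000, 2500),  # old 2000-2500 has no counterpart in the new assembly
]

def find_block(pos, blocks):
    """Return the index of the block containing pos, or None if pos is unaligned."""
    starts = [b[0] for b in blocks]
    i = bisect_right(starts, pos) - 1
    if i >= 0 and pos < blocks[i][1]:
        return i
    return None

def convert_interval(start, end, blocks):
    """Convert half-open [start, end) only when it lies entirely within one block."""
    if end <= start:
        return None
    i = find_block(start, blocks)
    if i is None or i != find_block(end - 1, blocks):
        return None  # unaligned, or spans a block boundary
    old_start, _, new_start = blocks[i]
    shift = new_start - old_start
    return (start + shift, end + shift)

inside_block = convert_interval(1100, 1300, TOY_BLOCKS)
spans_boundary = convert_interval(900, 1100, TOY_BLOCKS)
in_deleted_region = convert_interval(2100, 2200, TOY_BLOCKS)
```

Intervals that return None here are exactly the cases where converted high-confidence regions would need manual review or a more careful tool, which is why comparing converted calls against native h38 calls could be informative.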
On Thursday, April 2, 2015, Steve Lincoln <steve....@me.com> wrote:
April 2, 2015 at 5:00 PM
Yeah, we're in that same boat.
We do use a patched 37, where we've incorporated changes that affect (directly or through mis-mapping) the specific genes we test for. As you say, changing reference genomes is not easy, and that's for a whole bunch of different reasons. Unfortunately (given that we've incorporated the changes that help us into 37) moving to 38 won't help us a whole lot more immediately. Thus it's hard to justify doing quickly.
As Deanna and others have talked about very articulately, getting a more complete exome or genome, including some medically relevant genes, requires the 38 reference -- that's no doubt correct. We will all wind up there.