First GIAB Analysis Group Call

Justin Zook

Mar 24, 2015, 4:28:41 PM

to genome-in...@googlegroups.com

Hi All,

We will be holding the first call to kick off our new GIAB Analysis Group at 1pm EDT (10am PDT) on April 2. This group will be co-led by Francisco De La Vega, Chris Mason, Valerie Schneider, and Tina Graves, with assistance from Marc and me. We will be discussing plans for analyzing and integrating the short and long read data from the GIAB PGP Trios, and we hope that any of you who are interested in participating in these analyses will be able to join, or that you'll be able to have a representative on the call. If you can't make the call but are interested in performing analyses, please let us know before the call. We will be sending a strawman analysis plan to get us started in the coming days, and we'll also welcome any additional ideas for analyses. Information about joining the meeting is below. Hope to talk with many of you next week!

Cheers,

Justin

1. Please join my meeting, Apr 2, 2015 at 1pm EDT/10am PDT.

https://global.gotomeeting.com/join/364735613

2. Use your microphone and speakers (VoIP) - a headset is recommended. Or, call in using your telephone.

Dial +1 (224) 501-3212

Access Code: 364-735-613

Audio PIN: Shown after joining the meeting

Meeting ID: 364-735-613


Justin Zook

Apr 1, 2015, 8:09:40 AM

to genome-in...@googlegroups.com

Hi all,

In preparation for our Analysis Group call at 1pm EDT this Thursday, the leadership team has developed a strawman analysis plan at https://drive.google.com/file/d/0B7Ao1qqJJDHQZUdVQTFlaHVhVFE/view?usp=sharing. This is intended to start discussion in the group, and we hope that you will suggest changes and additional analyses that you or others might be interested in doing. We've divided the plan into 4 main parts: assembly, mapping/small variant calling, phasing, and structural variant calling. We have also updated the list of datasets for the PGP trios and progress creating them and making them available at https://drive.google.com/file/d/0B7Ao1qqJJDHQVm1ZX05mV2k4T0E/view?usp=sharing. The folder also contains notes from our leadership team calls (https://drive.google.com/folderview?id=0B6euVUz1tpLofkdHYlBsTzc2em96aVJEWkNqbGtDNnVJVHF6TE92Z2ttRnhRS1oxQVBTaUE&usp=sharing).

Call-in details are below, and we look forward to talking with many of you tomorrow!

Cheers,

Justin

Steve Lincoln

Apr 2, 2015, 2:50:54 PM

to Justin Zook, genome-in...@googlegroups.com

Just to clarify my point about 37 vs. 38

The arguments for 38 are all excellent, and as a number of folks mentioned, this group can indeed lead the way for 38. Great!

There is of course value in generating a reference resource which is immediately useful to as many people in the community as possible, clinical labs or otherwise, and [sadly] today that user community is bigger for 37. Partly for GIAB project PR, partly because these samples will indeed be super useful to those folks, a call set on 37 would indeed be a very good thing to have.

Now, if the lift-over or remap approaches Valerie described (a) work well enough, and (b) produce clear warnings in regions where it's not working well, that's probably more than adequate. I haven't seen such data but it sounds like it exists. We can know pretty quickly if that approach is not good enough.

However we create it, I'd like to suggest that a 37 call set be an explicit goal of the project, in addition to 38.

My <$0.02

--
You received this message because you are subscribed to the Google Groups "Genome in a Bottle" group.
To unsubscribe from this group and stop receiving emails from it, send an email to genome-in-a-bot...@googlegroups.com.
To post to this group, send email to genome-in...@googlegroups.com.
Visit this group at http://groups.google.com/group/genome-in-a-bottle.
To view this discussion on the web visit https://groups.google.com/d/msgid/genome-in-a-bottle/73ae3991-2909-4e1b-a77b-fb240a82cf70%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Steve Lincoln

Apr 2, 2015, 2:50:54 PM

to genome-in...@googlegroups.com

Just curious.

Steve Lincoln

Apr 2, 2015, 3:26:49 PM

to Alexander Wait Zaranek, genome-in...@googlegroups.com, Justin Zook

Is there a reason we can't generate these call sets for both 37 and 38 without lift over?

Francisco's diagram shows a lot of work that needs to happen. There may be legitimate concern about 2x-ing it, considering the largely volunteer project this is.

Steve Lincoln

steve....@me.com

mobile: +1-301-312-1725

On Apr 2, 2015, at 12:06 PM, Alexander Wait Zaranek <sa...@curoverse.com> wrote:

Wholly agree with Steve here. Is there a reason we can't generate these call sets for both 37 and 38 without lift over?


Alexander Wait Zaranek

Apr 2, 2015, 3:51:34 PM

to Steve Lincoln, genome-in...@googlegroups.com, Justin Zook

Wholly agree with Steve here. Is there a reason we can't generate these call sets for both 37 and 38 without lift over?


John Thompson

Apr 2, 2015, 3:51:35 PM

to Steve Lincoln, Justin Zook, genome-in...@googlegroups.com

I would like to second Steve's comments. Unfortunately, I was not able to call in but can back up Steve on the need for a 37 call set because it is going to be a long and arduous process for clinical labs to switch to 38. These labs have the biggest need for multiple references so meeting that need is critical for GIAB's success. There is no point in generating a reference that will not be widely used or used by only a small subset with the capacity to change from 37 to 38 more easily.

John


Marjorie Beggs

Apr 2, 2015, 4:07:09 PM

to John Thompson, Steve Lincoln, Justin Zook, genome-in...@googlegroups.com

Sorry I was not in on the call. I am currently getting a kidney panel out for CAP/CLIA certification, and we have used NA12878 and GIAB data from the very beginning of our test validation. From a clinical point of view I am not ready to go to 38 at this time, nor for a while. The amount of time and expense that goes into the validation of an NGS test makes introducing new content difficult. I would like to see more gene variants examined in the 37 build. I have found that in some cases genes that are supposed to be in segmental duplications, and therefore are not called in the high-confidence build, can in fact be reliably called, and that other genes that are called should not have been: for example, genes with pseudogenes, genes with high homology to other genes, and those with VNTRs. I think it would be very powerful for the analysis to really define these issues until long reads are the norm, because my take from the meetings I go to is that a lot of people in NGS do not recognize the problems or don't know what they are. This consortium is to be commended for generating quality data, and I think more can be done with 37.

Thanks.

Marjorie

Marjorie Beggs, Ph.D. | Genomics
10810 Executive Center Dr. Ste. 100 | Little Rock, AR 72211
T 501.492.7448 | F 501.604.2699

www.nephropath.com


Steve Lincoln

Apr 2, 2015, 8:00:34 PM

to Marjorie Beggs, John Thompson, Justin Zook, genome-in...@googlegroups.com

Yeah, we're in that same boat.

We do use a patched 37, where we've incorporated changes that affect (directly or through mis-mapping) the specific genes we test for. As you say, changing reference genomes is not easy, and that's for a whole bunch of different reasons. Unfortunately (given that we've incorporated the changes that help us into 37) moving to 38 won't help us a whole lot more immediately. Thus it's hard to justify doing quickly.

As Deanna and others have talked about very articulately, getting a more complete exome or genome, including some medically relevant genes, requires the 38 reference -- that's no doubt correct. We will all wind up there.

Justin Zook

Apr 25, 2015, 12:35:11 AM

to Steve Lincoln, Marjorie Beggs, John Thompson, genome-in...@googlegroups.com

Hi all,

Thanks all for this discussion! I wanted to restart this discussion about GRCh37 vs. GRCh38 since it would be good to settle on a plan for this. My sense based on this discussion and one-on-one discussions is that it's probably best to start with GRCh37, but also encourage everyone to start experimenting with GRCh38, since generating GIAB calls for GRCh38 could help people move to using it more quickly. The main reasons for calling on GRCh37 first are:

1. Most of our customers are likely to continue using GRCh37 for some time, regardless of what we do.

2. Most of the current analysis methods have been developed for h37, so we'll probably get good calls faster for h37 and can more easily compare to previous results (e.g., comparing the high-confidence regions on the PGP genomes to NA12878) and to other projects (e.g., the 1000 Genomes SV group is still primarily using h37).

3. Some of us have already started to analyze these genomes using h37.

4. It will likely take some work to get to a reference for h38 that we all agree is good, even if we don't include the ALTs.

All of that said, I do recognize all of the advantages of h38, and that we are uniquely positioned to help develop methods to use h38 and assess the advantages of it and assess accuracy of variant calls against it. Therefore, this is what I propose as a path forward:

1. Decide on a reference for h37 - since 1000 Genomes is using GRCh37 with decoy sequences in their latest analyses, I propose we use hs37d5.fa.gz here: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/. Another option would be to use the 1000 Genomes phase 1 reference, which was GRCh37 without decoy here: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/.

2. Encourage everyone to start experimenting with h38. We might want to decide on a suggested reference for h38 if you want to try using current methods that aren't ALT-aware, and on another suggested reference if you want to develop methods that are ALT-aware. For the first, I'm not aware of any standard fastas. For the second, 1000 Genomes has a fasta that Heng Li developed at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome, so it might be a good start to have comparability with other projects. Any suggestions for a fasta that doesn't contain ALTs?

3. Is anyone interested in testing methods for converting calls and bed files from h37 to h38 and vice versa? Even if this doesn't result in the best calls, it might be really useful to our work as we develop calls for both so that we can compare them.

Does this sound like a reasonable plan forward to everyone?

Thanks!

Justin
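
One way to approach item 3 above is a round-trip check: convert intervals from h37 to h38 and back again, then flag every interval that does not return to its starting coordinates. The Python sketch below is only an illustration under stated assumptions. It assumes the conversion has already been run with whichever tool the group settles on, and it only compares the results. The function name, the interval IDs, and the tuple layout are hypothetical, and the coordinates in the usage lines are made up.

```python
from typing import Dict, Optional, Tuple

# (chrom, start, end) in BED-style coordinates; layout is an assumption
Interval = Tuple[str, int, int]


def round_trip_report(
    original: Dict[str, Interval],
    round_tripped: Dict[str, Optional[Interval]],
) -> Dict[str, str]:
    """Classify each interval ID as 'ok', 'moved', or 'lost'.

    original:      intervals on h37 before conversion
    round_tripped: the same IDs after h37 -> h38 -> h37; None or a
                   missing key means the interval failed to convert
    """
    report = {}
    for interval_id, source in original.items():
        returned = round_tripped.get(interval_id)
        if returned is None:
            report[interval_id] = "lost"
        elif returned == source:
            report[interval_id] = "ok"
        else:
            report[interval_id] = "moved"
    return report


# Made-up example intervals, not real GIAB regions
original = {"r1": ("1", 100, 200), "r2": ("1", 500, 900), "r3": ("2", 50, 80)}
back = {"r1": ("1", 100, 200), "r2": ("1", 510, 900), "r3": None}
print(round_trip_report(original, back))
```

Intervals reported as "moved" or "lost" would be natural candidates for the kind of warning regions Steve asked about earlier in the thread.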


Graham Taylor

Apr 25, 2015, 2:27:30 AM

to Justin Zook, Steve Lincoln, Marjorie Beggs, John Thompson, genome-in...@googlegroups.com

Realistically, until the annotation and reference data is all moved to 38, diagnostic labs are going to need to work with 37, although EBI seem to be well on the way. It took a couple of years to move from 36 to 37, although maybe there are better tools available for lift-over now. But to be rigorous, reads will need to be re-mapped using 38 rather than alignments and calls lifted over. So I think we are stuck with 37 for now. We are getting cleaner calls using b37 with decoys, so our stats look better. Is that a good reason to use b37 with decoys?

Would there be value in chromosome sorting NA12878 cell lines to get phase? We are doing this for some HLA haplotyping, but we could broaden it to all chromosomes.

Graham


Justin Zook

Apr 25, 2015, 2:47:21 AM

to Graham Taylor, Steve Lincoln, Marjorie Beggs, John Thompson, genome-in...@googlegroups.com

Thanks, Graham. Regarding chromosome sorting, if you have methods to do this and would be interested in doing it, it seems like it could be complementary to some of our other methods that provide shorter-range phasing, like Moleculo, LFR and 10X. What do others think? Would anyone else be interested in analyzing chromosome-sorted data if we had it? It seems like it may also help with mapping homologous regions if they fall on different chromosomes.

Christopher Mason

Apr 25, 2015, 4:05:35 AM

to Justin Zook, Graham Taylor, Steve Lincoln, Marjorie Beggs, John Thompson, genome-in...@googlegroups.com

Yes indeed, having chromosome-sorted cells would be great for phasing and resolving complex repeat structures. We would certainly use it. As for the GRCh38/37 debate, I think we should support both, while driving everyone towards 38. As shown at the meetings, the assembly is simply better, and even though it will be painful for some people to migrate, they will eventually need to do so and it will help them in the long run, especially if we make it easier and lead as a group.

Cheers,
Chris
---------------------------------------
Christopher E. Mason, Ph.D.
Associate Professor, Weill Cornell Medical College

WorldQuant Foundation Scholar,

Assoc. Professor of Physiology and Biophysics, Neuroscience, & Computational Genomics

Affiliate Fellow of Genomics, Ethics, and Law, ISP, Yale Law School
(Dry Lab) 1305 York Ave., 13th floor, Rm. Y13-04, Box 140
(Wet Lab) 413 East 69th St, Belfer Research Bldg, 10th floor, Rm. 1062
New York, NY 10021
(m)203-668-1448
-----------------------


James Hadfield

Apr 25, 2015, 9:55:19 AM

to Christopher Mason, Justin Zook, Graham Taylor, Steve Lincoln, Marjorie Beggs, John Thompson, genome-in...@googlegroups.com

Sorting chromosomes is not trivial (IMHO). The Illumina Platinum Genomes NA12877 and NA12878 have phasing information; is GIAB looking at the 10X Genomics technology? Maybe start with a phased exome and compare to a phased genome?

James.


Brad Chapman

Apr 25, 2015, 3:05:41 PM

to Justin Zook, Steve Lincoln, Marjorie Beggs, John Thompson, genome-in...@googlegroups.com

Justin;
Thanks for bringing this up. We've been exploring GRCh38 more and would
love to have validation materials there. We need these to
validate/improve pipelines since having GiaB for NA12878 has been
invaluable. Heng's hapdip benchmarks look really good and personally I
really want possible HLA typing out of the box:

https://github.com/lh3/bwa/blob/master/README-alt.md#preliminary-evaluation

Has anyone worked on Liftover/remap/Crossmap of the GRCh37 GiaB reference to
GRCh38? I know this is imperfect, but I've come to the conclusion we're
going to have to accept Liftover at least in the short term. This is a
summary of current work we're doing trying to move to 38:

https://github.com/chapmanb/bcbio-nextgen/issues/817

> 2. Encourage everyone to start experimenting with h38. We might want to
> decide on a suggested reference for h38 if you want to try using current
> methods that aren't ALT-aware, and on another suggested reference if you
> want to develop methods that are ALT-aware. For the first, I'm not aware
> of any standard fasta's. For the second, 1000 Genomes has a fasta that
> Heng Li develop at
> ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome,
> so it might be a good start to have comparability with other projects. Any
> suggestions for a fasta that doesn't contain ALTs?

NCBI has excellent starting materials for this, which include both full
and no-alt versions, along with pre-built bowtie and bwa indices:

ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh38/seqs_for_alignment_pipelines/

I've been starting with these. Here is the set of GRCh38-noalt materials
we've put together so far:

https://github.com/chapmanb/cloudbiolinux/tree/master/ggd-recipes/hg38-noalt

Our plan is to liftover/remap/crossmap resources not likely to move soon
(including GiaB if no one has already done it). I'd love to talk with
anyone that has recommendations or experience with these. Thanks much,
Brad

Gerry Higgins

Apr 26, 2015, 9:25:40 AM

to James Hadfield, Christopher Mason, Justin Zook, Graham Taylor, Steve Lincoln, Marjorie Beggs, John Thompson, genome-in...@googlegroups.com

In my experience, the HaploSeq method from Bing Ren and colleagues [1] is a very accurate method to phase whole chromosomes and is less expensive than the long read Moleculo methods. I do not know if this information is helpful or not. Some of us like to examine noncoding regulatory elements and the epigenome, so this may not be of relevance to the GIAB discussion.

[1] Selvaraj S, Dixon JR, Bansal V, Ren B. Whole-genome haplotype reconstruction using proximity-ligation and shotgun sequencing. Nature Biotech. 31, 1111–1118 (2013) doi:10.1038/nbt.2728

-Gerry Higgins, Ph.D., M.D.

1-734-545-2731

Professor of Computational Medicine and Bioinformatics

University of Michigan Medical School, Ann Arbor, MI

---------

Vice President, Pharmacogenomic Science

AssureRx Health, Inc., Mason, OH

http://genesight.com


Deanna Church

Apr 26, 2015, 3:14:04 PM

to Brad Chapman, Justin Zook, Steve Lincoln, Marjorie Beggs, John Thompson, genome-in...@googlegroups.com

A few things:

1- I would recommend NCBI remap (http://www.ncbi.nlm.nih.gov/genome/tools/remap) over liftOver. It is slower but it is designed to deal with the paralogous expansion/collapse of sequences that represent a large amount of the change between 37 and 38. It also understands the assembly model better. I did volunteer to highlight some of the differences between liftOver and remap- I am working on this now. I am moving over GIAB as well as a few other data sets.

But note: dbSNP already has a build using 38- IMHO it is imperfect, but it is a reasonable start. There are also gene builds from both NCBI (RefSeq) and Ensembl on 38, so there is decent gene annotation here- and no real reason to do this again. I think Ensembl also just released a regulatory build on 38.

2- For those interested in exploring how to use alts- there are lots of alts on 37- there were 3 regions released with the major release- but 60 novel patches released prior to 38- so for folks who want to play around in a space that is well known (37), there is data to play around with.

3- I think it will be a huge missed opportunity if we don't do this work on 38. Validation data sets exist for 37, but not for 38. Until such data exists for 38, it will be impossible for clinical labs to move. GIAB has an opportunity to lead this charge (though note, 1000 genomes is in the process of aligning and calling on 38 now). GIAB can provide important leadership here and we should not miss the opportunity.

best,
Deanna
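
A comparison of liftOver and remap output like the one described in point 1 could start from something as simple as the Python sketch below. It is a hedged illustration, not the method anyone in the group is using: it assumes both tools have already been run on the same GRCh37 variants and that each result has been loaded into a dictionary keyed by a shared variant ID. The function name, the IDs, and the positions are all hypothetical.

```python
from typing import Dict, List, Tuple

# (chrom, 1-based position) on GRCh38; layout is an assumption
Position = Tuple[str, int]


def compare_conversions(
    liftover: Dict[str, Position],
    remap: Dict[str, Position],
) -> Dict[str, List[str]]:
    """Group variant IDs by how the two converted call sets relate."""
    groups: Dict[str, List[str]] = {
        "concordant": [],     # both tools placed the variant identically
        "discordant": [],     # both placed it, but at different positions
        "liftover_only": [],  # only liftOver produced a position
        "remap_only": [],     # only remap produced a position
    }
    for variant_id in sorted(set(liftover) | set(remap)):
        in_liftover = variant_id in liftover
        in_remap = variant_id in remap
        if in_liftover and in_remap:
            same = liftover[variant_id] == remap[variant_id]
            key = "concordant" if same else "discordant"
        elif in_liftover:
            key = "liftover_only"
        else:
            key = "remap_only"
        groups[key].append(variant_id)
    return groups


# Made-up positions for illustration only
lift = {"v1": ("chr1", 1000), "v2": ("chr1", 2000), "v3": ("chr2", 300)}
rmp = {"v1": ("chr1", 1000), "v2": ("chr1", 2050), "v4": ("chr3", 40)}
print(compare_conversions(lift, rmp))
```

The discordant and single-tool groups are where one would expect to see the paralogous expansion/collapse regions mentioned above, so those lists are the ones worth inspecting by hand.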


Brad Chapman

Apr 26, 2015, 3:55:06 PM

to Deanna Church, Justin Zook, Steve Lincoln, Marjorie Beggs, John Thompson, genome-in...@googlegroups.com

Deanna;

> 1- I would recommend NCBI remap
> (http://www.ncbi.nlm.nih.gov/genome/tools/remap) over liftOver. It is
> slower but it is designed to deal with the paralogous expansion/collapse
> of sequences that represent a large amount of the change between 37 and
> 38. It also understands the assembly model better. I did volunteer to
> highlight some of the differences between liftOver and remap- I am
> working on this now. I am moving over GIAB as well as a few other data
> sets.

This is awesome, I'm glad you are working on this. Would you be able to
share these when they're ready? I'd love to add these to the data sources
for CloudBioLinux/bcbio/GEMINI/whatever else wants them and start doing
some initial validations.

> But note: dbSNP already has a build using 38- IMHO it is imperfect, but
> it is a reasonable start. There is also gene builds from both NCBI
> (RefSeq) and Ensembl on 38, so there is decent gene annotation here- and
> no real reason to do this again. I think Ensembl also just released a
> regulatory build on 38.

From our side, we're using the NCBI dbSNP on 38:

https://github.com/chapmanb/cloudbiolinux/blob/master/ggd-recipes/hg38-noalt/dbsnp-142.yaml

Do you know if there are plans to remap some of the older versions that
GATK relies on, like 138 pre-1000 genomes:

https://github.com/chapmanb/bcbio-nextgen/issues/817#issuecomment-95137831

Thanks again for this, we're trying to expose and make use of as many
shared resources for this as possible and looking forward to using 38,
Brad

Deanna Church

unread,

Apr 26, 2015, 4:04:55 PM4/26/15

to Brad Chapman, Justin Zook, Steve Lincoln, Marjorie Beggs, John Thompson, genome-in...@googlegroups.com

Hi Brad,
I can share the GIAB data- and maybe I'll use that as a demo set of remap vs. liftOver. Does that work?

I don't think dbSNP typically subsets the data in the way you want. If you ask them, they might be able to provide the data you want on GRCh38 coordinates.

best,
Deanna


Michael James Clark

unread,

Apr 26, 2015, 5:45:34 PM4/26/15

to Deanna Church, Brad Chapman, Justin Zook, Steve Lincoln, Marjorie Beggs, John Thompson, genome-in...@googlegroups.com

dbSNP does have an annotation on each variant telling which version the variant was added in. Could that potentially satisfy GATK's needs?


Michael James Clark, PhD

Juan Rodriguez-Flores

unread,

Apr 27, 2015, 8:15:27 AM4/27/15

to Michael James Clark, Deanna Church, Brad Chapman, Justin Zook, Steve Lincoln, Marjorie Beggs, John Thompson, genome-in...@googlegroups.com

Hi Michael,

I believe dbSNP v142 does have this information. I've seen it in the VCF INFO column.

Sincerely,

Juan L Rodriguez-Flores, Ph.D

Department of Genetic Medicine

Weill Cornell Medical College

1305 York Ave. New York NY


Schneider, Valerie (NIH/NLM/NCBI) [E]

unread,

Apr 27, 2015, 5:25:10 PM4/27/15

to Juan Rodriguez-Flores, Michael James Clark, Deanna Church, Brad Chapman, Justin Zook, Steve Lincoln, Marjorie Beggs, John Thompson, genome-in...@googlegroups.com

Hi folks:

Apologies for the late entry into this email thread. First, to address a couple of data-related questions:

1. Does dbSNP have first assembly information?

There is a VCF info field called: ##INFO=<ID=dbSNPBuildID,Number=1,Type=Integer,Description="First dbSNP Build for RS">. I have checked with dbSNP about this field. This field should enable a user to infer the assembly (since there is an assembly per SNP build). However, it should be noted this is the first assembly in which dbSNP mapped the rsid, not necessarily the assembly in which the variant was ascertained by the submitter.

2. Standard fastas for GRCh38 and GRCh37:

As Brad mentioned, the GRC has produced a set of analysis sets (fasta) for GRCh38 that are available on the GenBank FTP site:

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/

There are a few flavors, all of which also include EBV:

GRCh38_no_alt_analysis_set: This excludes the alternate loci

GRCh38_full_analysis_set: This includes the alternate loci

GRCh38_full_plus_hs38d1_analysis_set: This includes the alternate loci, plus the decoy that Heng Li built for GRCh38 (hs38d1; GCA_000786075.2)

The hs38d1 decoy was constructed to add sequence that does not have an exact 101 bp match to the GRCh38 full assembly. It does not include the HLA sequences on Heng’s BWA site b/c those HLA sequences lack INSDC accessions, and the GRC only includes accessioned sequences in the reference or any analysis sets it puts its name on.
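As a rough way to check which flavor of analysis set a given fasta is, the deflines can be tallied by category. The sketch below is not part of the GRC distribution; it assumes the UCSC-style naming used in the `seqs_for_alignment_pipelines.ucsc_ids` files (alternate loci ending in `_alt`, decoys ending in `_decoy`, EBV named `chrEBV`, HLA entries starting with `HLA-`), and the function names are hypothetical.

```python
from collections import Counter


def classify_contig(name):
    """Assign a sequence name to a coarse category (naming scheme is an assumption)."""
    if name == "chrEBV":
        return "EBV"
    if name.startswith("HLA-"):
        return "HLA"
    if name.endswith("_decoy"):
        return "decoy"
    if name.endswith("_alt"):
        return "alt"
    # Chromosomes, chrM, unlocalized and unplaced scaffolds all land here.
    return "primary"


def summarize_deflines(lines):
    """Count FASTA records per category, reading only the '>' header lines."""
    counts = Counter()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            counts[classify_contig(name)] += 1
    return counts


if __name__ == "__main__":
    demo = [">chr1\n", "ACGT\n", ">chr1_KI270762v1_alt\n", ">chrEBV\n"]
    print(dict(summarize_deflines(demo)))
```

A no-alt set should report zero `alt` records, and a set without the decoy should report zero `decoy` records, which makes the differences between the flavors listed above easy to confirm before mapping.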

The 1000G project is going to begin mapping reads to GRCh38 very soon. They will be using the full assembly with alts, the hs38d1 decoy and the HLA sequences in their analysis set. The fasta they will use is here: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/

This analysis set was developed in conjunction with the GRC and Heng Li. The deflines in the files should be identical to the ones on the GenBank GRCh38 FTP site (with the exception of HLA) to ensure as much compatibility as possible. Laura Clarke, who is involved in this effort for 1000G, is also involved in the GRC. If GiaB ultimately chooses to use HLA, I would recommend we use this analysis set for consistency.

A set of analysis files for the GRCh37.p13 assembly also exists in GenBank:

ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37.p13/seqs_for_alignment_pipelines/

There are no-alt and full versions.

In contrast to the GRCh38 set, these do not include EBV, and the hs37d5 decoy is not included.

If we go with GRCh37, I would encourage mapping to an analysis set that includes the decoy+EBV; namely the one from 1000G: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/.

Thus, I think we are able to define an analysis set of fasta for GRCh38 that is “good”: the question is primary-only or full assembly. I also think it will be a missed opportunity if we don’t do analysis on GRCh38; there’s an inertia to moving, and it’s generally large projects that need to lead the charge. We’re talking about GRCh37 in part b/c of not using alts. Even if we can’t handle alts, I’d really hate to see us not take advantage of the improvements in the rest of the GRCh38 assembly.

I’m interested to hear how the Remap/LiftOver comparison looks for the older GiaB data.

-Valerie

From: Juan Rodriguez-Flores [mailto:jur...@med.cornell.edu]
Sent: Monday, April 27, 2015 8:14 AM
To: Michael James Clark
Cc: Deanna Church; Brad Chapman; Justin Zook; Steve Lincoln; Marjorie Beggs; John Thompson; genome-in...@googlegroups.com
Subject: Re: [GIAB] First GIAB Analysis Group Call - Just to Clarify 37 vs 38


Han Cao

unread,

Apr 28, 2015, 12:41:31 PM4/28/15

to Deanna Church, Brad Chapman, Justin Zook, Steve Lincoln, Marjorie Beggs, John Thompson, genome-in...@googlegroups.com

Hi Justin and all,

I have been catching up on all the email exchanges over the weekend. Thanks for raising two important questions; I would like to share some of our thoughts and experience from a mapping point of view.

1) We have been using 37, and more recently 38 as well, as the base for generating the in silico motif maps that we compare our Irys system de novo maps against. I attached a few slides to show some examples graphically. Many SegDups, and especially inverted repeats, that are captured by the Bionano de novo consensus maps are missing or “condensed” in GRCh37, and so were called as an “SV”; many of these were later “corrected” in GRCh38. Some N-base gaps were also corrected, although some inverted repeats remain, as we have observed. Slide 1 shows an example on chr1 near 149 Mb: when complete intact molecules spanning a ~700 kb area were imaged, the repeat structure was preserved and strongly supported. We found GRCh38 to be more accurate in representing many of the larger structural variations (SVs). Slide 2 shows similar cases on chr7 and chr8. The in silico digested motif map using 37 or 38 as the base is always shown in green, while the actual consensus map generated from single-molecule imaging is shown in blue.

In slide 3, we show that certain N-base regions of unknown size, in both 37 and 38, can be mapped precisely. By aligning many human consensus maps generated from a population, we have found these regions to be fairly conserved, in size at least, across samples from different ethnic backgrounds. Hence I think we could work together to better annotate the sizes of these unknown gaps in the reference.

2) We have tested flow-sorted DNA for genome mapping. Alex and I have often found these to be among the cleanest and best samples to work with for generating super-long genomic DNA, and, as you see in slide 1, long physically intact molecules provide the best phasing distance. If sorting is done at the chromosome level, I believe aneuploid chromosomes could be sorted by size and identified by unique sequence, which would be very interesting for cancer cells as well. This approach has become especially valuable for generating gold-standard reference models for large, complex plant genomes such as wheat and barley, among others. We have some literature on chromosome sorting methods; we just hope more core facilities with flow sorters will be interested in learning and optimizing the method and providing this not-so-common service to the community. If flow sorting protocols were used more widely, downstream data and analysis would be much cleaner.
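The in silico motif map mentioned in point 1 can be illustrated with a minimal sketch: scan a reference sequence for a recognition motif on both strands and record the site positions and the distances between them. The message does not say which motif was used, so the 7-bp motif `GCTCTTC` below is only a placeholder assumption, and the function names are hypothetical rather than part of any Bionano software.

```python
def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def in_silico_map(sequence, motif="GCTCTTC"):
    """Return sorted 0-based start positions of the motif on either strand."""
    sequence = sequence.upper()
    sites = set()
    # Searching for the reverse complement finds sites on the opposite strand.
    for pattern in {motif, reverse_complement(motif)}:
        start = sequence.find(pattern)
        while start != -1:
            sites.add(start)
            start = sequence.find(pattern, start + 1)
    return sorted(sites)


def site_intervals(sites):
    """Distances between consecutive sites; these spacings are what gets compared."""
    return [b - a for a, b in zip(sites, sites[1:])]


if __name__ == "__main__":
    toy = "AA" + "GCTCTTC" + "AAAA" + "GAAGAGC" + "TT"
    sites = in_silico_map(toy)
    print(sites, site_intervals(sites))
```

Because a run of Ns contains no motif sites, an N-base gap shows up only as the spacing between its flanking sites; this is why comparing that spacing with the measured consensus map can, in principle, size the gap.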

Cheers,

Han Cao, Ph.D.

Chief Scientific Officer

BioNano Genomics, Inc

9640 Towne Centre Drive, Ste. 100, San Diego, CA 92121

Tel: 858.888.7614; Fax 858.430.5927

h...@bionanogenomics.com

www.bionanogenomics.com

-------------------------------------------------------

This e-mail message and its contents are intended only for the confidential and authorized use of the recipient(s) named above. If you are not the intended recipient, please notify me immediately by e-mail and delete the original message. Thank you.


GRCh37 vs 38 Bionano-Dec-2014-Han.pdf

Brad Chapman

unread,

Apr 28, 2015, 1:20:46 PM4/28/15

to Deanna Church, Justin Zook, Steve Lincoln, Marjorie Beggs, John Thompson, genome-in...@googlegroups.com

Deanna, Michael, Valerie and all;
Thanks for all the additional information. This is incredibly helpful.

> I can share the GIAB data- and maybe I'll use that as a demo set of
> remap vs. liftOver. Does that work?

Deanna, that would be perfect. I'm happy to do Platinum genome
comparisons of GRCh38 versus the two GiaB approaches (remap and
liftOver). I can also do this on noalt versus alt references. When you
have the VCF and access BED files available happy to move forward with
this. I'll continue to put together resources to do this.
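A comparison of the two conversions could start from something as simple as the sketch below. It assumes each tool's output has already been reduced to a mapping from a variant identifier to its GRCh38 (chromosome, position); the identifiers and the function name are hypothetical, and neither Remap nor liftOver is invoked here.

```python
def compare_conversions(remap, liftover):
    """Bucket variant IDs by how two coordinate conversions agree.

    Both arguments map a variant ID to a (chromosome, position) tuple.
    """
    report = {"concordant": [], "discordant": [],
              "remap_only": [], "liftover_only": []}
    for var_id in sorted(set(remap) | set(liftover)):
        in_remap = var_id in remap
        in_liftover = var_id in liftover
        if in_remap and in_liftover:
            same = remap[var_id] == liftover[var_id]
            key = "concordant" if same else "discordant"
        elif in_remap:
            key = "remap_only"
        else:
            key = "liftover_only"
        report[key].append(var_id)
    return report


if __name__ == "__main__":
    # Toy, made-up coordinates purely for illustration.
    remap = {"var1": ("chr1", 100), "var2": ("chr1", 200), "var3": ("chr2", 50)}
    liftover = {"var1": ("chr1", 100), "var2": ("chr1", 250), "var4": ("chr3", 10)}
    for bucket, ids in compare_conversions(remap, liftover).items():
        print(bucket, ids)
```

The discordant and single-tool buckets are the ones worth inspecting by hand, since they are where paralogous expansion or collapse between 37 and 38 would be expected to show up.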

> dbSNP does have an annotation on each variant telling which version the
> variant was added in. Could that potentially satisfy GATK's needs?

Michael, Thanks much for this pointer, I'd totally missed that. It looks
like with a combo of KGPhase1, KGPhase3 and dbSNPBuildID we should be
able to produce something from the NCBI dbSNP release. Thanks again.
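One possible reading of that combination is sketched below: keep records whose `dbSNPBuildID` is at or below a chosen build and that do not carry the `KGPhase1` or `KGPhase3` flags. This is an assumption about how the subset might be defined, not a description of how GATK's resource files were built; the function names are hypothetical and the example records are made up.

```python
def parse_info(info):
    """Parse a VCF INFO string into a dict; flag-type keys map to True."""
    fields = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def keep_record(vcf_line, max_build, exclude_flags=("KGPhase1", "KGPhase3")):
    """True if the record was first seen at or before max_build and has no excluded flag."""
    info = parse_info(vcf_line.rstrip("\n").split("\t")[7])
    build = info.get("dbSNPBuildID")
    if build is None or int(build) > max_build:
        return False
    return not any(flag in info for flag in exclude_flags)


def filter_vcf(lines, max_build):
    """Yield header lines unchanged, plus the data lines that pass keep_record."""
    for line in lines:
        if line.startswith("#") or keep_record(line, max_build):
            yield line


if __name__ == "__main__":
    # Made-up records for illustration only.
    example = [
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
        "chr1\t1000\trsA\tA\tG\t.\t.\tRS=1;dbSNPBuildID=129;VC=SNV\n",
        "chr1\t2000\trsB\tC\tT\t.\t.\tRS=2;dbSNPBuildID=142;VC=SNV\n",
        "chr1\t3000\trsC\tG\tA\t.\t.\tRS=3;dbSNPBuildID=135;KGPhase1;VC=SNV\n",
    ]
    for kept in filter_vcf(example, max_build=138):
        print(kept, end="")
```

As Valerie notes, `dbSNPBuildID` records the first build in which dbSNP mapped the rsID, so a filter like this approximates "known before build N" rather than reconstructing an older release exactly.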

> The 1000G project is going to begin mapping reads to GRCh38 very
> soon. They will be using the full assembly with alts, the hs38d1 decoy
> and the HLA sequences in their analysis set. The fasta they will use
> is here:
> ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/
>
> This analysis set was developed in conjunction with the GRC and Heng
> Li. The deflines in the files should be identical to the ones on the
> GenBank GRCh38 FTP site (with the exception of HLA) to ensure as much
> compatibility as possible. Laura Clarke, who is involved in this
> effort for 1000G, is also involved in the GRC. If GiaB ultimately
> chooses to use HLA, I would recommend we use this analysis set for
> consistency.

Valerie, thanks for the heads up on this. We'd definitely like to get
HLA calling in place as part of the move to GRCh38, using Heng's
approach so I'll incorporate the 1000 genomes target into bcbio:

https://github.com/lh3/bwa/blob/master/README-alt.md#hla-typing

Thanks again for all the help,
Brad

Hansen, Nancy (NIH/NHGRI) [E]

unread,

May 19, 2015, 10:30:45 AM5/19/15

to Schneider, Valerie (NIH/NLM/NCBI) [E], Juan Rodriguez-Flores, Michael James Clark, Deanna Church, Brad Chapman, Justin Zook, Steve Lincoln, Marjorie Beggs, John Thompson, genome-in...@googlegroups.com

Hi all,

Was a decision made about "official" references for our GRCh37/38 analyses? I am beginning a novoalign alignment of the 300x paired-end Illumina data for the trio. My plan was to use references that are essentially what Valerie recommends below (in a nutshell, primary+decoy+EBV, but no alts because novoalign is not alt-aware—is this true?):

GRCh37:
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz
("GRCh37 primary assembly (chromosomal plus unlocalized and unplaced contigs), the rCRS mitochondrial sequence (AC:NC_012920), Human herpesvirus 4 type 1 (AC:NC_007605) and the concatenated decoy sequences (hs37d5cs.fa.gz).", with chrY PARs already masked.)

GRCh38:
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_full_plus_hs38d1_analysis_set.fna
(GRCh38 "sequences, in FASTA format, of the chromosomes, mitochondrial genome, unlocalized scaffolds, and unplaced scaffolds.", Y PARs and other duplicate regions masked, alternate loci, plus the human decoy sequences from hs38d1)
***I would remove the alternate loci entries from this file because I don't think novoalign is alternate-aware yet***

Please, someone let me know if this sounds good before I fire off the analysis.
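Removing the alternate loci from the full-plus-decoy file could be done with a small streaming filter like the sketch below. It assumes that alternate loci in the UCSC-id analysis sets are the records whose names end in `_alt`; that suffix and the function name are assumptions for illustration, and the result should be checked against the no-alt analysis set before use.

```python
import io


def drop_alt_records(fasta_in, fasta_out, suffix="_alt"):
    """Copy FASTA records to fasta_out, skipping records whose name ends with suffix.

    Returns the number of records skipped.
    """
    skipping = False
    skipped = 0
    for line in fasta_in:
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            skipping = name.endswith(suffix)
            if skipping:
                skipped += 1
        if not skipping:
            fasta_out.write(line)
    return skipped


if __name__ == "__main__":
    source = io.StringIO(
        ">chr1\nACGT\n"
        ">chr1_KI270762v1_alt\nGGGG\n"
        ">chrUn_KN707606v1_decoy\nTTTT\n"
    )
    result = io.StringIO()
    n = drop_alt_records(source, result)
    print(n, "alt record(s) removed")
    print(result.getvalue(), end="")
```

Working line by line keeps memory use flat, which matters for a multi-gigabyte reference; the decoy and EBV records pass through untouched because only the `_alt` suffix is tested.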

Thanks!
--Nancy

--
*************************************
Nancy F. Hansen, PhD nha...@nhgri.nih.gov
Comparative Genomics Analysis Unit, NHGRI
5625 Fishers Lane
Rockville, MD 20852
Phone: (301) 435-1560 Fax: (301) 435-6170

Deanna Church

April 26, 2015 at 12:13 PM
A few things:

1- I would recommend NCBI remap (http://www.ncbi.nlm.nih.gov/genome/tools/remap) over liftOver. It is slower but it is designed to deal with the paralogous expansion/collapse of sequences that represent a large amount of the change between 37 and 38. It also understands the assembly model better. I did volunteer to highlight some of the differences between liftOver and remap- I am working on this now. I am moving over GIAB as well as a few other data sets.

But note: dbSNP already has a build using 38- IMHO it is imperfect, but it is a reasonable start. There are also gene builds from both NCBI (RefSeq) and Ensembl on 38, so there is decent gene annotation here- and no real reason to do this again. I think Ensembl also just released a regulatory build on 38.

2- For those interested in exploring how to use alts- there are lots of alts on 37- there were 3 regions released with the major release- but 60 novel patches released prior to 38- so for folks who want to play around in a space that is well known (37), there is data to play around with.

3- I think it will be a huge missed opportunity if we don't do this work on 38. Validation data sets exist for 37, but not for 38. Until such data exists for 38, it will be impossible for clinical labs to move. GIAB has an opportunity to lead this charge (though note, 1000 genomes is in the process of aligning and calling on 38 now). GIAB can provide important leadership here and we should not miss the opportunity.

best,
Deanna

Brad Chapman

April 25, 2015 at 12:05 PM
Justin;
Thanks for bringing this up. We've been exploring GRCh38 more and would
love to have validation materials there. We need these to
validate/improve pipelines since having GiaB for NA12878 has been
invaluable. Heng's hapdip benchmarks look really good and personally I
really want possible HLA typing out of the box:

https://github.com/lh3/bwa/blob/master/README-alt.md#preliminary-evaluation

Has anyone worked on Liftover/remap/Crossmap of the GRCh37 GiaB reference to
GRCh38? I know this is imperfect, but I've come to the conclusion we're
going to have to accept Liftover at least in the short term. This is a
summary of current work we're doing trying to move to 38:

https://github.com/chapmanb/bcbio-nextgen/issues/817

NCBI has excellent starting materials for this which includes both full
and no-alt versions, along with pre-built bowtie and bwa indices:

ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh38/seqs_for_alignment_pipelines/

I've been starting with these. Here is the set of GRCh38-noalt materials
we've put together so far:

https://github.com/chapmanb/cloudbiolinux/tree/master/ggd-recipes/hg38-noalt

Our plan is to liftover/remap/crossmap resources not likely to move soon
(including GiaB if no one has already done it). I'd love to talk with
anyone that has recommendations or experience with these. Thanks much,
Brad

Justin Zook

April 24, 2015 at 9:35 PM
Hi all,

Thanks all for this discussion! I wanted to restart this discussion about GRCh37 vs. GRCh38 since it would be good to settle on a plan for this. My sense based on this discussion and one-on-one discussions is that it's probably best to start with GRCh37, but also encourage everyone to start experimenting with GRCh38, since generating GIAB calls for GRCh38 could help people move to using it more quickly. The main reasons for calling on GRCh37 first are:
1. Most of our customers are likely to continue using GRCh37 for some time, regardless of what we do.
2. Most of the current analysis methods have been developed for h37, so we'll probably get good calls faster for h37 and can more easily compare to previous results (e.g., comparing the high-confidence regions on the PGP genomes to NA12878) and to other projects (e.g., the 1000 Genomes SV group is still primarily using h37)
3. Some of us have already started to analyze these genomes using h37.
4. It will likely take some work to get to a reference for h38 that we all agree is good, even if we don't include the ALTs.

All of that said, I do recognize all of the advantages of h38, and that we are uniquely positioned to help develop methods to use h38 and assess the advantages of it and assess accuracy of variant calls against it. Therefore, this is what I propose as a path forward:

1. Decide on a reference for h37 - since 1000 Genomes is using GRCh37 with decoy sequences in their latest analyses, I propose we use hs37d5.fa.gz here: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/. Another option would be to use the 1000 Genomes phase 1 reference, which was GRCh37 without decoy here: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/.
2. Encourage everyone to start experimenting with h38. We might want to decide on a suggested reference for h38 if you want to try using current methods that aren't ALT-aware, and on another suggested reference if you want to develop methods that are ALT-aware. For the first, I'm not aware of any standard fastas. For the second, 1000 Genomes has a fasta that Heng Li developed at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome, so it might be a good start for comparability with other projects. Any suggestions for a fasta that doesn't contain ALTs?

3. Is anyone interested in testing methods for converting calls and bed files from h37 to h38 and vice versa? Even if this doesn't result in the best calls, it might be really useful to our work as we develop calls for both so that we can compare them.

Does this sound like a reasonable plan forward to everyone?

Thanks!
Justin


Steve Lincoln

April 2, 2015 at 5:00 PM

Yeah, we're in that same boat.

We do use a patched 37, where we've incorporated changes that affect (directly or through mis-mapping) the specific genes we test for. As you say, changing reference genomes is not easy, and that's for a whole bunch of different reasons. Unfortunately (given that we've incorporated the changes that help us into 37) moving to 38 won't help us a whole lot more immediately. Thus it's hard to justify doing quickly.

As Deanna and others have talked about very articulately, getting a more complete exome or genome, including some medically relevant genes, requires the 38 reference -- that's no doubt correct. We will all wind up there.


Toufik

unread,

May 19, 2015, 11:14:13 AM5/19/15

to Adam Novak, Schneider, Valerie (NIH/NLM/NCBI) [E], Juan Rodriguez-Flores, Michael James Clark, Deanna Church, Brad Chapman, Justin Zook, Steve Lincoln, Marjorie Beggs, John Thompson, genome-in...@googlegroups.com

Adam

please have a look on that I

Thank you

Toufik


Schneider, Valerie (NIH/NLM/NCBI) [E]

unread,

May 19, 2015, 11:18:45 AM5/19/15

to Hansen, Nancy (NIH/NHGRI) [E], Juan Rodriguez-Flores, Michael James Clark, Deanna Church, Brad Chapman, Justin Zook, Steve Lincoln, Marjorie Beggs, John Thompson, genome-in...@googlegroups.com

Nancy:

Let me see if I can get a GRCh38_no_alt+EBV+decoy(hs38d1) up on the GenBank FTP site quickly, so there would be no need to modify what is downloaded.

-Valerie

From: Hansen, Nancy (NIH/NHGRI) [E]
Sent: Tuesday, May 19, 2015 10:31 AM
To: Schneider, Valerie (NIH/NLM/NCBI) [E]; Juan Rodriguez-Flores; Michael James Clark
Cc: Deanna Church; Brad Chapman; Justin Zook; Steve Lincoln; Marjorie Beggs; John Thompson; genome-in...@googlegroups.com
Subject: Re: [GIAB] First GIAB Analysis Group Call - Just to Clarify 37 vs 38

Hi all,

On Sunday, April 26, 2015, Deanna Church <deanna...@personalis.com> wrote:

Hi Brad,
I can share the GIAB data- and maybe I'll use that as a demo set of remap vs. liftOver. Does that work?

I don't think dbSNP typically subsets the data in the way you want. If you ask them, they might be able to provide the data you want on GRCh38 coordinates.

best,
Deanna

Brad Chapman

Deanna Church

April 26, 2015 at 12:13 PM

A few things:

1- I would recommend NCBI remap (http://www.ncbi.nlm.nih.gov/genome/tools/remap) over liftOver. It is slower but it is designed to deal with the paralogous expansion/collapse of sequences that represent a large amount of the change between 37 and 38. It also understands the assembly model better. I did volunteer to highlight some of the differences between liftOver and remap- I am working on this now. I am moving over GIAB as well as a few other data sets.

But note: dbSNP already has a build using 38- IMHO it is imperfect, but it is a reasonable start. There are also gene builds from both NCBI (RefSeq) and Ensembl on 38, so there is decent gene annotation here- and no real reason to do this again. I think Ensembl also just released a regulatory build on 38.

2- For those interested in exploring how to use alts- there are lots of alts on 37- there were 3 regions released with the major release- but 60 novel patches released prior to 38- so for folks who want to play around in a space that is well known (37) there is data to play around with.

3- I think it will be a huge missed opportunity if we don't do this work on 38. Validation data sets exist for 37, but not for 38. Until such data exists for 38, it will be impossible for clinical labs to move. GIAB has an opportunity to lead this charge (though note, 1000 genomes is in the process of aligning and calling on 38 now). GIAB can provide important leadership here and we should not miss the opportunity.
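One way to summarize a remap vs. liftOver comparison is to tabulate, per variant, whether the two tools agree. The snippet below is only a minimal sketch under stated assumptions: the function name and the input shape (a dict from variant ID to a (chrom, pos) tuple, or None when a tool could not place the variant) are hypothetical and are not the output format of either tool.

```python
from collections import Counter


def compare_conversions(remap_calls, liftover_calls):
    """Classify each variant ID by how two conversion tools placed it.

    Both arguments are hypothetical dicts mapping a variant ID to a
    (chrom, pos) tuple on the target build, or to None when the tool
    could not place the variant. A missing ID is treated like None.
    """
    summary = Counter()
    discordant = []
    for vid in sorted(set(remap_calls) | set(liftover_calls)):
        a = remap_calls.get(vid)
        b = liftover_calls.get(vid)
        if a is None and b is None:
            summary["unmapped_by_both"] += 1
        elif a is None:
            summary["liftover_only"] += 1
        elif b is None:
            summary["remap_only"] += 1
        elif a == b:
            summary["concordant"] += 1
        else:
            # Both tools placed the variant, but at different coordinates.
            summary["discordant"] += 1
            discordant.append((vid, a, b))
    return summary, discordant
```

The discordant list is the interesting part for a demo set, since those are the sites where the choice of tool changes the answer.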

best,
Deanna

Brad Chapman

April 25, 2015 at 12:05 PM

Justin;
Thanks for bringing this up. We've been exploring GRCh38 more and would
love to have validation materials there. We need these to
validate/improve pipelines since having GiaB for NA12878 has been
invaluable. Heng's hapdip benchmarks look really good and personally I
really want possible HLA typing out of the box:

https://github.com/lh3/bwa/blob/master/README-alt.md#preliminary-evaluation

Has anyone worked on Liftover/remap/Crossmap of the GRCh37 GiaB reference to
GRCh38? I know this is imperfect, but I've come to the conclusion we're
going to have to accept Liftover at least in the short term. This is a
summary of current work we're doing trying to move to 38:

https://github.com/chapmanb/bcbio-nextgen/issues/817

NCBI has excellent starting materials for this which includes both full
and no-alt versions, along with pre-built bowtie and bwa indices:

ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh38/seqs_for_alignment_pipelines/

I've been starting with these. Here is the set of GRCh38-noalt materials
we've put together so far:

https://github.com/chapmanb/cloudbiolinux/tree/master/ggd-recipes/hg38-noalt

Our plan is to liftover/remap/crossmap resources not likely to move soon
(including GiaB if no one has already done it). I'd love to talk with
anyone that has recommendations or experience with these. Thanks much,
Brad

Justin Zook

April 24, 2015 at 9:35 PM

Hi all,

Thanks all for this discussion! I wanted to restart this discussion about GRCh37 vs. GRCh38 since it would be good to settle on a plan for this. My sense based on this discussion and one-on-one discussions is that it's probably best to start with GRCh37, but also encourage everyone to start experimenting with GRCh38, since generating GIAB calls for GRCh38 could help people move to using it more quickly. The main reasons for calling on GRCh37 first are:

1. Most of our customers are likely to continue using GRCh37 for some time, regardless of what we do.

2. Most of the current analysis methods have been developed for h37, so we'll probably get good calls faster for h37 and can more easily compare to previous results (e.g., comparing the high-confidence regions on the PGP genomes to NA12878) and to other projects (e.g., the 1000 Genomes SV group is still primarily using h37).

3. Some of us have already started to analyze these genomes using h37.

4. It will likely take some work to get to a reference for h38 that we all agree is good, even if we don't include the ALTs.

All of that said, I do recognize all of the advantages of h38, and that we are uniquely positioned to help develop methods to use h38 and assess the advantages of it and assess accuracy of variant calls against it. Therefore, this is what I propose as a path forward:

1. Decide on a reference for h37 - since 1000 Genomes is using GRCh37 with decoy sequences in their latest analyses, I propose we use hs37d5.fa.gz here: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/. Another option would be to use the 1000 Genomes phase 1 reference, which was GRCh37 without decoy here: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/.

2. Encourage everyone to start experimenting with h38. We might want to decide on a suggested reference for h38 if you want to try using current methods that aren't ALT-aware, and on another suggested reference if you want to develop methods that are ALT-aware. For the first, I'm not aware of any standard fastas. For the second, 1000 Genomes has a fasta that Heng Li developed at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome, so it might be a good start to have comparability with other projects. Any suggestions for a fasta that doesn't contain ALTs?

3. Is anyone interested in testing methods for converting calls and bed files from h37 to h38 and vice versa? Even if this doesn't result in the best calls, it might be really useful to our work as we develop calls for both so that we can compare them.
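To make the conversion question in item 3 concrete, here is a toy sketch of what converting a BED interval between builds involves. It assumes a simplified, hypothetical list of ungapped same-strand aligned blocks rather than a real chain file, and it is not how liftOver, remap, or CrossMap are implemented; those tools also handle gaps, strand flips, and partial overlaps, which this sketch simply rejects.

```python
def lift_interval(chrom, start, end, blocks):
    """Convert a 0-based half-open interval through ungapped aligned blocks.

    blocks is a hypothetical list of
    (src_chrom, src_start, src_end, dst_chrom, dst_start) tuples, each
    describing a same-strand, ungapped stretch shared by both builds.
    Returns the converted (chrom, start, end), or None when the interval
    is not fully contained in a single block.
    """
    for src_chrom, src_start, src_end, dst_chrom, dst_start in blocks:
        if chrom == src_chrom and src_start <= start and end <= src_end:
            # Inside one ungapped block, conversion is a constant shift.
            offset = dst_start - src_start
            return (dst_chrom, start + offset, end + offset)
    return None
```

Intervals that return None here are the ones worth inspecting by hand when comparing calls across builds, because they span a region where the two assemblies differ.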

Does this sound like a reasonable plan forward to everyone?

Thanks!

Justin

On Thursday, April 2, 2015, Steve Lincoln <steve....@me.com> wrote:

Steve Lincoln

April 2, 2015 at 5:00 PM

Yeah, we're in that same boat.

We do use a patched 37, where we've incorporated changes that affect (directly or through mis-mapping) the specific genes we test for. As you say, changing reference genomes is not easy, and that's for a whole bunch of different reasons. Unfortunately (given that we've incorporated the changes that help us into 37) moving to 38 won't help us a whole lot more immediately. Thus it's hard to justify doing quickly.

As Deanna and others have talked about very articulately, getting a more complete exome or genome, including some medically relevant genes, requires the 38 reference -- that's no doubt correct. We will all wind up there.

Hansen, Nancy (NIH/NHGRI) [E]

May 19, 2015, 11:23:00 AM

to Schneider, Valerie (NIH/NLM/NCBI) [E], Juan Rodriguez-Flores, Michael James Clark, Deanna Church, Brad Chapman, Justin Zook, Steve Lincoln, Marjorie Beggs, John Thompson, genome-in...@googlegroups.com

Thanks, Valerie. That would be great. Do you (and others) agree that this is a realistic reference for GIAB GRCh38 analyses if the aligner used is not alt aware?

Best,
--Nancy

Nancy F. Hansen, PhD nha...@nhgri.nih.gov

Comparative Genomics Analysis Unit, NHGRI
5625 Fishers Lane
Rockville, MD 20852
Phone: (301) 435-1560 Fax: (301) 435-6170

From: "Schneider, Valerie (NIH/NLM/NCBI) [E]" <schn...@ncbi.nlm.nih.gov>
Date: Mon, 27 Apr 2015 21:25:05 +0000

To: Juan Rodriguez-Flores <jur...@med.cornell.edu>, Michael James Clark <michael.j...@gmail.com>

On Apr 26, 2015, at 17:45, Michael James Clark <michael.j...@gmail.com> wrote:
dbSNP does have an annotation on each variant telling which version the variant was added in. Could that potentially satisfy GATK's needs?

Schneider, Valerie (NIH/NLM/NCBI) [E]

May 19, 2015, 11:41:13 AM

to Hansen, Nancy (NIH/NHGRI) [E], Juan Rodriguez-Flores, Michael James Clark, Deanna Church, Brad Chapman, Justin Zook, Steve Lincoln, Marjorie Beggs, John Thompson, genome-in...@googlegroups.com

The hs38d1 decoy (for GRCh38) is composed of sequences not present in the GRCh38 full assembly. In contrast, the hs37d1 decoy (for GRCh37) was based solely on the GRCh37 primary assembly.

What this means is that the hs38d1 decoy does not contain sequence whose only occurrence in the GRCh38 assembly is on the alts. It is not incorrect to use it in the context of the GRCh38 primary assembly only and it does add “missing sequence”, but one should be aware that it does not provide any representation for sequence that is unique to the alternate loci.

We have not seen an assessment of the value added by the hs38d1 decoy to GRCh38 primary (as opposed to GRCh38 full), so we don’t know how much benefit is gained from the missing sequence it provides in an alt-free context. B/c we have been unsure how much usage a GRCh38_noalt+hs38d1 decoy analysis set would get, we’ve not yet put it out at NCBI.
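For anyone who wants to try an alt-free set before an official file is posted, a rough sketch of stripping alternate loci from a full FASTA might look like the following. It assumes, as a labeled assumption, that alt contig names end in "_alt"; the function name and that naming test are hypothetical and should be checked against the headers of the file actually being filtered. Decoy records, if present in the input, pass through unchanged.

```python
def drop_alt_contigs(lines, is_alt=lambda name: name.endswith("_alt")):
    """Yield FASTA lines, skipping every record whose name is_alt accepts.

    lines is any iterable of text lines (for example an open file).
    The default is_alt test is an assumption about contig naming.
    """
    keep = True
    for line in lines:
        if line.startswith(">"):
            # The record name is the first whitespace-delimited header token.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            keep = not is_alt(name)
        if keep:
            yield line
```

Because it works line by line, it can be pointed at an open file handle and written straight back out without loading the whole assembly into memory.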

-Valerie

From: Hansen, Nancy (NIH/NHGRI) [E]
Sent: Tuesday, May 19, 2015 11:18 AM
To: Schneider, Valerie (NIH/NLM/NCBI) [E]; Juan Rodriguez-Flores; Michael James Clark
Cc: Deanna Church; Brad Chapman; Justin Zook; Steve Lincoln; Marjorie Beggs; John Thompson; genome-in...@googlegroups.com
Subject: Re: [GIAB] First GIAB Analysis Group Call - Just to Clarify 37 vs 38

Thanks, Valerie. That would be great. Do you (and others) agree that this is a realistic reference for GIAB GRCh38 analyses if the aligner used is not alt aware?

--

Comparative Genomics Analysis Unit, NHGRI

On Sunday, April 26, 2015, Deanna Church <deanna...@personalis.com> wrote:

Hi Brad,
I can share the GIAB data- and maybe I'll use that as a demo set of remap vs. liftOver. Does that work?

I don't think dbSNP typically subsets the data in the way you want. If you ask them, they might be able to provide the data you want on GRCh38 coordinates.

best,
Deanna

Brad Chapman

Deanna Church

April 26, 2015 at 12:13 PM

A few things:

1- I would recommend NCBI remap (http://www.ncbi.nlm.nih.gov/genome/tools/remap) over liftOver. It is slower but it is designed to deal with the paralogous expansion/collapse of sequences that represent a large amount of the change between 37 and 38. It also understands the assembly model better. I did volunteer to highlight some of the differences between liftOver and remap- I am working on this now. I am moving over GIAB as well as a few other data sets.

But note: dbSNP already has a build using 38- IMHO it is imperfect, but it is a reasonable start. There is also gene builds from both NCBI (RefSeq) and Ensembl on 38, so there is decent gene annotation here- and no real reason to do this again. I think Ensembl also just released a regulatory build on 38.

2- For those interested in exploring how to use alts- there are lots of alts on 37- there were 3 regions released with the major release- but 60 novel patches released prior to 38- so for folks want to play around in a space that is well know (37) there is data to play around with.

3- I think it will be a huge missed opportunity if we don't do this work on 38. Validation data sets exist for 37, but not for 38. Until such data exists for 38, it will be impossible for clinical labs to move. GIAB has an opportunity to lead this charge (though note, 1000 genomes is in the process of aligning and calling on 38 now). GIAB can provide important leadership here and we should not miss the opportunity.

best,
Deanna

Brad Chapman

April 25, 2015 at 12:05 PM

Justin;
Thanks for bringing this up. We've been exploring GRCh38 more and would
love to have validation materials there. We need these to
validate/improve pipelines since having GiaB for NA12878 has been
invaluable. Heng's hapdip benchmarks looks really good and personally I
really want possible HLA typing out of the box:

https://github.com/lh3/bwa/blob/master/README-alt.md#preliminary-evaluation

Has anyone worked on Liftover/remap/Crossmap of the GRCh37 GiaB reference to
GRCh38? I know this is imperfect, but I've come to the conclusion we're
going to have to accept Liftover at least in the short term. This is a
summary of current work we're doing trying to move to 38:

https://github.com/chapmanb/bcbio-nextgen/issues/817

NCBI has excellent starting materials for this which includes both full
and no-alt versions, along with pre-built bowtie and bwa indices:

ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh38/seqs_for_alignment_pipelines/

I've been starting with these. Here is the set of GRCh38-noalt materials
we've put together so far:

https://github.com/chapmanb/cloudbiolinux/tree/master/ggd-recipes/hg38-noalt

Our plan is to liftover/remap/crossmap resources not likely to move soon
(including GiaB if no one has already done it). I'd love to talk with
anyone that has recommendations or experience with these. Thanks much,
Brad

Justin Zook

April 24, 2015 at 9:35 PM

Hi all,

Thanks all for this discussion! I wanted to restart this discussion about GRCh37 vs. GRCh38 since it would be good to settle on a plan for this. My sense based on this discussion and one-on-one discussions is that it's probably best to start with GRCh37, but also encourage everyone to start experimenting with GRCh38, since generating GIAB calls for GRCh38 could help people move to using it more quickly. The main reasons for calling on GRCh37 first are:

1. Most of our customers are likely to continuing using GRCh37 for some time, regardless of what we do.

2. Most of the current analysis methods have been developed for h37, so we'll probably get good calls faster for h37 and can more easily compare to previous results (e.g., comparing the high-confidence regions on the PGP genomes to NA12878) and to other projects (e.g., the 1000 Genomes SV group is still primarily using h37)

3. Some of us have already started to analyze these genomes using h37.

4. It will likely take some work to get to a reference for h38 that we all agree is good, even if we don't include the ALTs.

All of that said, I do recognize all of the advantages of h38, and that we are uniquely positioned to help develop methods to use h38 and assess the advantages of it and assess accuracy of variant calls against it. Therefore, this is what I propose as a path forward:

1. Decide on a reference for h37 - since 1000 Genomes is using GRCh37 with decoy sequences in their latest analyses, I propose we use hs37d5.fa.gz here: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/. Another option would be to use the 1000 Genomes phase 1 reference, which was GRCh37 without decoy here: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/.

2. Encourage everyone to start experimenting with h38. We might want to decide on a suggested reference for h38 if you want to try using current methods that aren't ALT-aware, and on another suggested reference if you want to develop methods that are ALT-aware. For the first, I'm not aware of any standard fasta's. For the second, 1000 Genomes has a fasta that Heng Li develop at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome, so it might be a good start to have comparability with other projects. Any suggestions for a fasta that doesn't contain ALTs?

3. Is anyone interested in testing methods for converting calls and bed files from h37 to h38 and vice versa? Even if this doesn't result in the best calls, it might be really useful to our work as we develop calls for both so that we can compare them.

Does this sound like a reasonable plan forward to everyone?

Thanks!

Justin

On Thursday, April 2, 2015, Steve Lincoln <steve....@me.com> wrote:

--
You received this message because you are subscribed to the Google Groups "Genome in a Bottle" group.

To unsubscribe from this group and stop receiving emails from it, send an email to genome-in-a-bot...@googlegroups.com.
To post to this group, send email to genome-in...@googlegroups.com.

Visit this group at http://groups.google.com/group/genome-in-a-bottle.

To view this discussion on the web visit https://groups.google.com/d/msgid/genome-in-a-bottle/CAAqJ9KA0guTg7t44koMSHxfU%2BRF-rk_BmchRhXFazDRXCj2ZbA%40mail.gmail.com.

For more options, visit https://groups.google.com/d/optout.

Steve Lincoln

April 2, 2015 at 5:00 PM

Yeah, we're in that same boat.

We do use a patched 37, where we've incorporated changes that affect (directly or through mis-mapping) the specific genes we test for. As you say, changing reference genomes is not easy, and that's for a whole bunch of different reasons. Unfortunately (given that we've incorporated the changes that help us into 37) moving to 38 won't help us a whole lot more immediately. Thus it's hard to justify doing quickly.

As Deanna and others have talked about very articulately, getting a more complete exome or genome, including some medically relevant genes, requires the 38 reference -- that's no doubt correct. We will all wind up there.




Michael James Clark, PhD


Hansen, Nancy (NIH/NHGRI) [E]

unread,

May 19, 2015, 1:59:06 PM5/19/15

to Schneider, Valerie (NIH/NLM/NCBI) [E], Juan Rodriguez-Flores, Michael James Clark, Deanna Church, Brad Chapman, Justin Zook, Steve Lincoln, Marjorie Beggs, John Thompson, genome-in...@googlegroups.com

Thanks again, Valerie,

I can understand NCBI being hesitant to put out too many variations of the references, but since I'm working with non-alt-aware software, and am guessing that more decoys will probably be more accurate than fewer, I'm going to align to GRCh38_noalt+EBV+decoys (created by me from ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_full_plus_hs38d1_analysis_set.fna by removing "alt" entries). This way, as I see it, I'll be consistent with Bert's analysis and other folks who've used the full assembly, with the exception of having removed alt and HLA regions.

Anyone should feel free to email me, off-list or on, if that sounds wrong.
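
For anyone who wants to reproduce a no-alt FASTA like the one described above, here is a minimal Python sketch of the filtering step. It is not the exact procedure used for the file above. It assumes that alt loci carry an `_alt` suffix and that any HLA entries carry an `HLA-` prefix, so please check those assumptions against the headers of the file you actually download. The function names and the tiny in-memory FASTA are hypothetical and only there for illustration.

```python
import io


def drop_contigs(fasta_in, fasta_out, should_drop):
    """Copy FASTA records from fasta_in to fasta_out, skipping every record
    whose name (first word of the header) makes should_drop() return True.
    Returns the kept and dropped record names, in file order."""
    kept, dropped = [], []
    writing = False
    for line in fasta_in:
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            writing = not should_drop(name)
            (kept if writing else dropped).append(name)
        if writing:
            fasta_out.write(line)
    return kept, dropped


def is_alt_or_hla(name):
    # ASSUMPTION: naming convention, verify against the real FASTA headers.
    return name.endswith("_alt") or name.startswith("HLA-")


# Illustrative records only; the sequences are placeholders, not real data.
demo = io.StringIO(
    ">chr1  AC:CM000663.2\nACGT\nTTGA\n"
    ">chr1_KI270762v1_alt\nGGGG\n"
    ">chrEBV\nCCCC\n"
    ">chrUn_KN707606v1_decoy\nAAAA\n"
    ">HLA-A*01:01:01:01\nTTTT\n"
)
out = io.StringIO()
kept, dropped = drop_contigs(demo, out, is_alt_or_hla)
print("kept:", kept)
print("dropped:", dropped)
```

For a real file, the same function can be given handles from `open(...)` (or `gzip.open(..., "rt")` for compressed input). Printing the dropped names before indexing is a cheap way to confirm that only the intended entries were removed.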

Best,

--Nancy

--
*************************************
Nancy F. Hansen, PhD nha...@nhgri.nih.gov
Comparative Genomics Analysis Unit, NHGRI
5625 Fishers Lane
Rockville, MD 20852
Phone: (301) 435-1560 Fax: (301) 435-6170

From: "Schneider, Valerie (NIH/NLM/NCBI) [E]" <schn...@ncbi.nlm.nih.gov>
Date: Tue, 19 May 2015 11:40:54 -0400
To: Nancy Hansen <nha...@mail.nih.gov>, Juan Rodriguez-Flores <jur...@med.cornell.edu>, Michael James Clark <michael.j...@gmail.com>
Cc: Deanna Church <deanna...@personalis.com>, Brad Chapman <chap...@50mail.com>, Justin Zook <justi...@gmail.com>, Steve Lincoln <steve....@me.com>, Marjorie Beggs <marjori...@nephropath.com>, John Thompson <john.t...@claritasgenomics.com>, genome-in...@googlegroups.com
Subject: RE: [GIAB] First GIAB Analysis Group Call - Just to Clarify 37 vs 38

The hs38d1 decoy (for GRCh38) is comprised of sequences not present in the GRCh38 full assembly. In contrast, the hs37d1 decoy (for GRCh37) was based solely on the GRCh37 primary assembly.

What this means is that the hs38d1 decoy does not contain sequence whose only occurrence in the GRCh38 assembly is on the alts. It is not incorrect to use it in the context of the GRCh38 primary assembly only and it does add “missing sequence”, but one should be aware that it does not provide any representation for sequence that is unique to the alternate loci.

We have not seen an assessment of the value added by the hs38d1 decoy to GRCh38 primary (as opposed to GRCh38 full), so we don’t know how much benefit is gained from the missing sequence it provides in an alt-free context. Because we have been unsure how much usage a GRCh38_noalt+hs38d1 decoy analysis set would get, we’ve not yet put it out at NCBI.

-Valerie

From: Hansen, Nancy (NIH/NHGRI) [E]
Sent: Tuesday, May 19, 2015 11:18 AM
To: Schneider, Valerie (NIH/NLM/NCBI) [E]; Juan Rodriguez-Flores; Michael James Clark
Cc: Deanna Church; Brad Chapman; Justin Zook; Steve Lincoln; Marjorie Beggs; John Thompson; genome-in...@googlegroups.com
Subject: Re: [GIAB] First GIAB Analysis Group Call - Just to Clarify 37 vs 38

Thanks, Valerie. That would be great. Do you (and others) agree that this is a realistic reference for GIAB GRCh38 analyses if the aligner used is not alt aware?

Best,
--Nancy

From: "Schneider, Valerie (NIH/NLM/NCBI) [E]" <schn...@ncbi.nlm.nih.gov>
Date: Tue, 19 May 2015 11:15:45 -0400
To: Nancy Hansen <nha...@mail.nih.gov>, Juan Rodriguez-Flores <jur...@med.cornell.edu>, Michael James Clark <michael.j...@gmail.com>

Nancy F. Hansen, PhD nha...@nhgri.nih.gov

Comparative Genomics Analysis Unit, NHGRI
5625 Fishers Lane
Rockville, MD 20852
Phone: (301) 435-1560 Fax: (301) 435-6170

On Apr 26, 2015, at 17:45, Michael James Clark <michael.j...@gmail.com> wrote:
dbSNP does have an annotation on each variant telling which version the variant was added in. Could that potentially satisfy GATK's needs?

On Sunday, April 26, 2015, Deanna Church <deanna...@personalis.com> wrote:
Hi Brad,
I can share the GIAB data- and maybe I'll use that as a demo set of remap vs. liftOver. Does that work?

I don't think dbSNP typically subsets the data in the way you want. If you ask them, they might be able to provide the data you want on GRCh38 coordinates.

best,
Deanna

Brad Chapman <chap...@50mail.com>

Deanna Church <deanna...@personalis.com>

April 26, 2015 at 12:13 PM
A few things:

1- I would recommend NCBI remap (http://www.ncbi.nlm.nih.gov/genome/tools/remap) over liftOver. It is slower but it is designed to deal with the paralogous expansion/collapse of sequences that represent a large amount of the change between 37 and 38. It also understands the assembly model better. I did volunteer to highlight some of the differences between liftOver and remap- I am working on this now. I am moving over GIAB as well as a few other data sets.

But note: dbSNP already has a build using 38- IMHO it is imperfect, but it is a reasonable start. There are also gene builds from both NCBI (RefSeq) and Ensembl on 38, so there is decent gene annotation here- and no real reason to do this again. I think Ensembl also just released a regulatory build on 38.

2- For those interested in exploring how to use alts- there are lots of alts on 37- there were 3 regions released with the major release- but 60 novel patches released prior to 38- so for folks who want to play around in a space that is well known (37) there is data to play around with.

3- I think it will be a huge missed opportunity if we don't do this work on 38. Validation data sets exist for 37, but not for 38. Until such data exists for 38, it will be impossible for clinical labs to move. GIAB has an opportunity to lead this charge (though note, 1000 genomes is in the process of aligning and calling on 38 now). GIAB can provide important leadership here and we should not miss the opportunity.
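
One way to summarize differences between two coordinate-conversion runs (for example liftOver output versus remap output) is to compare them interval by interval. The Python sketch below does not call either tool. It assumes each tool's result has already been exported as tab-separated BED with the original interval name preserved in column 4, and the function names and coordinates are invented purely for illustration.

```python
import io


def read_bed(handle):
    """Parse tab-separated BED lines into {name: (chrom, start, end)}.
    ASSUMPTION: column 4 holds a unique interval name."""
    intervals = {}
    for line in handle:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end, name = line.rstrip("\n").split("\t")[:4]
        intervals[name] = (chrom, int(start), int(end))
    return intervals


def compare_conversions(source, conv_a, conv_b):
    """Classify every source interval by how the two conversions treated it."""
    summary = {"same": [], "different": [], "only_a": [], "only_b": [], "neither": []}
    for name in source:
        a, b = conv_a.get(name), conv_b.get(name)
        if a is None and b is None:
            key = "neither"      # dropped by both conversions
        elif b is None:
            key = "only_a"       # converted by the first tool only
        elif a is None:
            key = "only_b"       # converted by the second tool only
        elif a == b:
            key = "same"         # identical target coordinates
        else:
            key = "different"    # both converted, coordinates disagree
        summary[key].append(name)
    return summary


# Invented example coordinates, not real conversion results.
source = read_bed(io.StringIO(
    "1\t100\t200\tr1\n1\t300\t400\tr2\n2\t50\t80\tr3\n2\t500\t600\tr4\n"))
conv_a = read_bed(io.StringIO(
    "chr1\t110\t210\tr1\nchr1\t310\t410\tr2\n"))
conv_b = read_bed(io.StringIO(
    "chr1\t110\t210\tr1\nchr1\t315\t410\tr2\nchr2\t60\t90\tr3\n"))

summary = compare_conversions(source, conv_a, conv_b)
for category, names in summary.items():
    print(category, names)
```

The "different" and "only_b" categories are the ones worth inspecting by hand, since they would include regions where paralogous expansion or collapse between the assemblies changes the answer.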

best,
Deanna

Brad Chapman <chap...@50mail.com>

April 25, 2015 at 12:05 PM
Justin;
Thanks for bringing this up. We've been exploring GRCh38 more and would
love to have validation materials there. We need these to
validate/improve pipelines since having GiaB for NA12878 has been
invaluable. Heng's hapdip benchmarks look really good and personally I
really want possible HLA typing out of the box:

https://github.com/lh3/bwa/blob/master/README-alt.md#preliminary-evaluation

Has anyone worked on Liftover/remap/Crossmap of the GRCh37 GiaB reference to
GRCh38? I know this is imperfect, but I've come to the conclusion we're
going to have to accept Liftover at least in the short term. This is a
summary of current work we're doing trying to move to 38:

https://github.com/chapmanb/bcbio-nextgen/issues/817

NCBI has excellent starting materials for this which includes both full
and no-alt versions, along with pre-built bowtie and bwa indices:

ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh38/seqs_for_alignment_pipelines/

I've been starting with these. Here is the set of GRCh38-noalt materials
we've put together so far:

https://github.com/chapmanb/cloudbiolinux/tree/master/ggd-recipes/hg38-noalt

Our plan is to liftover/remap/crossmap resources not likely to move soon
(including GiaB if no one has already done it). I'd love to talk with
anyone that has recommendations or experience with these. Thanks much,
Brad


