REF/ALT v Allele1/Allele2

Matthew Maher

Aug 29, 2021, 12:12:38 PM

to locuszoom

As discussed in the FAQ 'Why is LD information not being shown for my data?' (point 2), LocusZoom needs variants specified as REF and ALT in order to look up LD information. This is problematic with output from meta-analysis programs like METAL, which don't seem to track REF/ALT alleles, just effect/other alleles (aka 'Allele1'/'Allele2').

Questions:
1. Should I expect to see warnings/errors in the ingest log when I upload a METAL results file if the column I specified as the 'REF' (I gave Allele2, which is obviously a lie!) contains bases which do not match the specified genome reference? I don't believe I received any warnings/errors.

2. How would the lost functionality manifest itself? Is it just the lack of accurate coloration (for LD ranges) for SNPs when zoomed in? Would that show up as an excess of SNPs colored gray ("no r2 data")? Is anything else affected?

3. I could reprocess/correct the METAL output to get alleles specified as REF/ALT, which I believe would also involve selectively flipping/recalculating Allele Freq (1-x) and Beta (-x). Does LocusZoom do any further manipulation/calculation using the Allele Freq and/or Beta, or is it limited to simply displaying the values when zoomed in?
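
To make the flipping in question 3 concrete, here is a minimal Python sketch of what I mean for a single row. It assumes (my assumption, not something LocusZoom specifies) that the frequency and beta in the input both describe Allele1, that the reference base at the position has already been looked up elsewhere, and that the output should report frequency and beta for the ALT allele. The function name `orient_to_ref_alt` is hypothetical.

```python
def orient_to_ref_alt(allele1, allele2, freq1, effect, ref_base):
    """Return (ref, alt, alt_freq, alt_effect), or None if neither allele is the reference.

    Assumption: freq1 and effect both describe allele1 in the input row.
    """
    a1, a2, ref = allele1.upper(), allele2.upper(), ref_base.upper()
    if a2 == ref:
        # allele1 is already the ALT allele, so the values are kept as-is
        return ref, a1, freq1, effect
    if a1 == ref:
        # allele1 is the REF allele: report values for allele2 instead
        return ref, a2, 1.0 - freq1, -effect
    # Neither allele matches the reference base (wrong build, strand issue, etc.)
    return None
```

Rows that come back as `None` would need to be dropped or inspected by hand rather than silently passed through.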

----------------------------

FYI: you have a very minor typo in the FAQ "How should I prepare my data for uploading?": the underscore and colon are flipped in the first portion of:

"By marker (chrom_pos:ref/alt): 9:22125503_G/C"

Matthew Maher

Aug 29, 2021, 12:32:41 PM

to locuszoom

And I should have added to Q3: Does LocusZoom implicitly assume anything about the direction of the BETA (i.e. is the REF or the ALT the 'effect' allele)? Or does it ONLY display the value given?

Andy Boughton

Aug 29, 2021, 8:05:27 PM

to locu...@googlegroups.com

Thanks for the questions.

There are two basic answers to the question:

1. At present, the file parsing options are trusted as given and only used to draw plots. We validate data type (numeric/text) and ranges, but don’t check the genome build. We do try to handle common variations between file formats, but usually those options are explicitly spelled out in the “choose options” UI (eg, allele freq lets a user specify which allele is involved). Because we handle so many file formats, we have tried to avoid being overly magical about detecting and handling highly program-specific behaviors.

2. We do not presently perform any validation of, eg, ref vs alt allele. This is indeed limiting for programs that do not always output the same allele to the same column (eg when allele2 could be ref or alt, depending on the row). I’d like to see that improve, but it's something we’d want to evaluate carefully before adding, in particular to find a highly performant implementation that keeps the upload process snappy for all our users. If we add new features for calculations in the future, I agree that stronger allele/build validation would be key, to ensure that people were running meaningful analyses.

If you are viewing a lot of GWAS phenotypes, our PheWeb tool does perform build validation. (That tool was designed to be run on in-house datasets, not as a hosted service.)

If an allele is mis-specified, it would affect parts of the plot that reference a specific and exact allele. LD would be by far the most noticeable, but tooltip links (eg gwas catalog) and “show my variant in UKBB phewas” might also go awry. Admittedly these are more niche features. That said, if you find a procedure for transforming the METAL file… or have thoughts on a good fast validator… I’d be happy to add that info to our FAQ to help other users, or even slate good options into our work queue sometime for future implementation. We try to handle a lot of data formats, but that does come with some downsides!

-Andy Boughton

abo...@umich.edu

Matthew Maher

Sep 29, 2021, 10:34:06 AM

to locuszoom

Thanks for the input, Andy. I'm responding a bit belatedly since I only just wrestled with the question...

>> if you find a procedure for transforming the METAL file… or have thoughts on a good fast validator… I’d be happy to add that info to our FAQ to help other users,

Here's my best understanding (which you may want to add to the FAQ if you agree):

In order to get the output of a METAL meta-analysis to play well in LocusZoom, there are three problems that need to be resolved:

1. Row Order: LZ needs the input to be in chrom/position order. METAL produces output that is in absolutely no order whatsoever.

2. Allele Order: LZ needs you to supply columns with the genomic REF and ALT alleles. METAL produces output with two alleles in no particular order - it has no concept of a 'genome reference'. METAL could just as happily output alleles A/B with a positive BETA as alleles B/A with a negative BETA.

3. CHR:POS: LZ needs you to supply CHR and POS as distinct columns. METAL has no such columns - METAL combines multiple input files in a meta-analysis by matching on a single "MARKER" column in each input. The contents of that MARKER column are totally user/study dependent. The column could be filled with any values in any format whatsoever - but most likely something like "3:12345", or "chr3:12345" or "chr3:12345_G/T" or "rs987654321" or "my_private_id_1234" or literally whatever...

Note that solving #1 requires first solving #3. And solving #2 will require use of a genome reference dataset.
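
For illustration, problems #3 and #1 could be handled roughly as in the sketch below. This is not the script I mention further down, just a minimal example under stated assumptions: rows are already loaded as dicts, the marker column is called `MarkerName` (adjust to your file), and markers look like "3:12345", "chr3:12345" or "chr3:12345_G/T". The names `parse_marker` and `sort_rows` are hypothetical. Rows whose marker cannot be parsed (e.g. rsIDs) are skipped here.

```python
import re

# Matches "3:12345", "chr3:12345", "chr3:12345_G/T"; rsIDs and private IDs do not match.
MARKER_RE = re.compile(r"^(?:chr)?([0-9]{1,2}|X|Y|MT?):([0-9]+)(?:_.*)?$", re.IGNORECASE)

# Sort order for chromosomes: 1..22, then X, Y, mitochondrial.
CHROM_ORDER = {str(i): i for i in range(1, 23)}
CHROM_ORDER.update({"X": 23, "Y": 24, "M": 25, "MT": 25})


def parse_marker(marker):
    """Return (chrom, pos) for a 'CHR:POS...' marker, or None if it cannot be parsed."""
    m = MARKER_RE.match(marker)
    if m is None:
        return None
    chrom = m.group(1).upper()
    if chrom not in CHROM_ORDER:
        return None
    return chrom, int(m.group(2))


def sort_rows(rows, marker_key="MarkerName"):
    """Add 'chrom' and 'pos' to each parseable row and return rows in chrom/pos order."""
    parsed = []
    for row in rows:
        loc = parse_marker(row[marker_key])
        if loc is None:
            continue  # unparseable marker: skipped in this sketch
        chrom, pos = loc
        parsed.append(dict(row, chrom=chrom, pos=pos))
    parsed.sort(key=lambda r: (CHROM_ORDER[r["chrom"]], r["pos"]))
    return parsed
```

Sorting on a numeric chromosome rank rather than the raw string avoids chromosome 10 landing before chromosome 2.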

I have a Python script that solves these three problems for the case of MARKER values in the form "CHR:POS_...". I'd be happy to donate it, FWIW. This forum does not appear to allow me to attach files, but I'd be happy to email it. A minor change would be required if one's MARKER names contain <chr> and <pos> values in a slightly different format. More serious changes would be required if MARKER names are "rs####" - in that case, the code would need to be enhanced to look up all the rs#s in the dbSNP database (which one would need to download).

-Matt
