Missing RSIDs?


Matthew Maher

Feb 21, 2024, 10:04:28 AM
to locuszoom
Are there sometimes sporadic outages of the RSID mapping function within my.locuszoom? 

I'm investigating a problem with a GWAS result (157370, uploaded Jan 12) where the nice TopHits table shown below the primary Manhattan view is missing RS#s. Yet when you drill into a region view of any of those top hits, you DO get full LD visualization, despite nearly all the variants lacking an assigned RS# (a sporadic few do show an RS#).

We've now uploaded the same file again and get the same Manhattan plot and the same TopHits, but now the RS#s ARE populated.

Am I correct that the LD highlighting stems from the 1000G reference panel? I thought that the LD highlighting might require RS# identification, so I'm surprised that any highlighting occurs for variants that do NOT show an RS#.

Thanks for any thoughts/info.   And thanks for LZ. 

Andrew Boughton

Feb 21, 2024, 10:21:49 AM
to locu...@googlegroups.com
Thanks for the question. Due to the messy nature of the data we receive from diverse users, LD is actually matched only on chr:pos (but requires chr:pos_ref/alt to specify the reference variant). The only public reference panel we support is 1000G, due to sensitive data questions, though you can use your own local LD via "add tabix file".

rsIDs are matched more rigorously, and require the file to specify all of chrom, pos, ref, and alt. (We could be stricter for minor annotations, because a few missing points would be less obvious.) I can't think of any reason why the same exact file would yield different results, but I can see some variants being missed due to outdated data. I moved on from the LZ team a few years back, and am not sure how often the rsID annotation dataset has been updated since I left.

Because rsIDs are matched more carefully, any unusual formats in variant specifiers could also be an issue. (Some groups append information to existing fields, for example.)
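
A minimal sketch of the two matching rules described above, assuming a `chr:pos_ref/alt` specifier format. The function names, the regular expression, and the example values are illustrative assumptions, not taken from the LocusZoom codebase.

```python
import re
from typing import NamedTuple, Optional

# Assumed specifier shape: chrom:pos with an optional _ref/alt suffix,
# e.g. "1:12345_A/G" (hypothetical example values).
_SPEC = re.compile(
    r"^(?:chr)?([0-9]{1,2}|X|Y|MT?):([0-9]+)(?:_([ACGT]+)/([ACGT]+))?$",
    re.IGNORECASE,
)


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: Optional[str]
    alt: Optional[str]


def parse_variant(spec: str) -> Optional[Variant]:
    """Parse a specifier; return None when the format is not recognized."""
    m = _SPEC.match(spec.strip())
    if m is None:
        return None
    chrom, pos, ref, alt = m.groups()
    return Variant(
        chrom.upper(),
        int(pos),
        ref.upper() if ref else None,
        alt.upper() if alt else None,
    )


def ld_key(v: Variant) -> tuple:
    """Looser key: position only, as described for LD matching."""
    return (v.chrom, v.pos)


def rsid_key(v: Variant) -> Optional[tuple]:
    """Stricter key: all four fields required, as described for rsID matching."""
    if v.ref is None or v.alt is None:
        return None
    return (v.chrom, v.pos, v.ref, v.alt)
```

In this sketch, `1:12345` still yields an LD key but no rsID key, while a specifier with appended text such as `1:12345_A/G;extra` fails to parse at all.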

-Andy Boughton


Andy Boughton

Feb 21, 2024, 10:56:45 AM
to locu...@googlegroups.com
One other possibility: the missing rsID matches (and LD) might indicate that someone selected a different genome build on initial ingest. We don’t try to auto-detect the genome build, because it would make the ingest pipeline much slower in a global web service.

rsID matching is done using a custom internal matcher (implemented using LMDB). It doesn’t depend on an external service or even data volume; it would be quite strange and worth looking into if that feature gave different results for the same exact file and same exact set of options. 
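
To illustrate why a deterministic lookup can still give spotty matches under the wrong build, here is a toy sketch. It uses plain dictionaries as a stand-in for the LMDB store; the table contents, the rsIDs, and the function name are all hypothetical, and this is not the actual matcher.

```python
from typing import Dict, Optional, Tuple

VariantKey = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)

# Hypothetical per-build tables with made-up coordinates and rsIDs.
RSID_TABLES: Dict[str, Dict[VariantKey, str]] = {
    "GRCh37": {
        ("1", 1000, "A", "G"): "rs0000001",
        ("1", 2000, "C", "T"): "rs0000002",
    },
    # The same variants sit at other coordinates in the other build, and an
    # unrelated variant happens to occupy one of the old coordinates.
    "GRCh38": {
        ("1", 1500, "A", "G"): "rs0000001",
        ("1", 2500, "C", "T"): "rs0000002",
        ("1", 2000, "C", "T"): "rs0000003",
    },
}


def lookup_rsid(build: str, key: VariantKey) -> Optional[str]:
    """Pure function of (build, key): same inputs always give the same output."""
    return RSID_TABLES[build].get(key)
```

Looking up GRCh37 coordinates against the GRCh38 table mostly returns nothing, with an occasional coincidental hit, which is one plausible reading of "mostly missing, with a sporadic few" rsIDs.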

(alas, I can’t promise you direct bugfixes, as I work for a different team now. But I can at least help direct a bug report as courtesy)

-Andy Boughton




Matthew Maher

Feb 21, 2024, 6:17:19 PM
to locuszoom
The genome build does seem to be at the root of the problem.  But something is odd:

As I mentioned, my query stems from the same file (GRCh37) being uploaded twice. The first time, most RSIDs were missing, and I noticed slight differences in the gene names mapped to the same CHR:POS:REF:ALT as well. The second time, all the RSIDs and gene names came out as expected.

I've just now uploaded it again, but this time I intentionally MIS-selected GRCh38 as the build, and the result was exactly the same as the suspect first run (in terms of spotty RSIDs and gene name assignments). Of course, this newer submission says "Build: GRCh38" at the top of the Manhattan plot, since that is what I selected. But the original problematic run (which, again, contains exactly the same suspect RSIDs/gene names) says "Build: GRCh37" at the top. That is, it says GRCh37 on the Manhattan plot, but the data were clearly prepped/annotated using GRCh38.

Does the "Build: " label at the top of the Manhattan plot definitely reflect what the user selected on the upload page?
Is there some way that a job's processing (which build to reference for RSID/gene mapping) could get mismatched from the user selection?
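
The comparison described above (suspect run versus runs with a known build selection) could be sketched roughly as follows. It assumes the TopHits annotations from each run have been exported by hand into dictionaries; the type alias and function names are hypothetical.

```python
from typing import Dict, Optional

# variant specifier -> rsID shown in TopHits (None when missing)
Annotations = Dict[str, Optional[str]]


def agreement(run_a: Annotations, run_b: Annotations) -> float:
    """Fraction of shared variants whose rsID annotation is identical in both runs."""
    shared = run_a.keys() & run_b.keys()
    if not shared:
        return 0.0
    same = sum(1 for v in shared if run_a[v] == run_b[v])
    return same / len(shared)


def closest_run(suspect: Annotations, references: Dict[str, Annotations]) -> str:
    """Name of the reference run whose annotations best match the suspect run."""
    return max(references, key=lambda name: agreement(suspect, references[name]))
```

Here `references` would hold one run uploaded with GRCh37 selected and one with GRCh38 selected; if the suspect run agrees almost perfectly with the GRCh38 run, that supports the mismatch hypothesis.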


Again, the suspect run is GWAS 157370 from Jan 12. The ingest log looks unremarkable:

[ingest][2024-01-12T15:46:54+00:00] Performing upload step: Calculate SHA256
[success][2024-01-12T15:46:57+00:00] Step completed
[success] The GWAS file passed validation. Read the logs carefully, in case any specific lines failed to parse.
[ingest][2024-01-12T15:46:57+00:00] Performing upload step: Normalize GWAS file format
[success][2024-01-12T15:55:49+00:00] Step completed
[ingest][2024-01-12T15:55:50+00:00] Performing upload step: QQ plots and top hit detection
[success][2024-01-12T16:03:50+00:00] Step completed
[ingest][2024-01-12T16:03:50+00:00] Performing upload step: Prepare a manhattan plot
[success][2024-01-12T16:08:38+00:00] Step completed
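
For anyone who wants to check a log like this programmatically, here is a small sketch that pairs each `[ingest]` line with the next timestamped `[success]` line and reports the elapsed seconds. The line format is inferred only from the excerpt above, and the function name is hypothetical.

```python
import re
from datetime import datetime
from typing import List, Tuple

# Matches "[ingest][<ISO timestamp>] message" and "[success][<ISO timestamp>] message".
_LINE = re.compile(r"^\[(ingest|success)\]\[([^\]]+)\]\s*(.*)$")


def step_durations(log_text: str) -> List[Tuple[str, float]]:
    """Return (step name, seconds elapsed) for each completed upload step."""
    durations = []
    current = None  # (step name, start time) of the step in progress
    for line in log_text.splitlines():
        m = _LINE.match(line.strip())
        if m is None:
            continue  # e.g. the untimestamped validation message
        status, stamp, message = m.groups()
        when = datetime.fromisoformat(stamp)
        if status == "ingest":
            name = message.replace("Performing upload step:", "").strip()
            current = (name, when)
        elif status == "success" and current is not None:
            durations.append((current[0], (when - current[1]).total_seconds()))
            current = None
    return durations
```

Note that the log excerpt records neither the genome build nor an rsID annotation step, so by itself it cannot confirm which build was used.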


Thanks for any info/thoughts.

Sorry to hear that my.locuszoom is being an orphan. 

Andy Boughton

Feb 22, 2024, 2:49:15 PM
to locu...@googlegroups.com
It's been a while, but as far as I recall, the UI should always show what the user selected on input. (The database field for "genome build" is the source of truth passed to all UI and annotators.)

Certainly it should be deterministic (use same build, get same result every time)!

I will note that we ignore original rsIDs already in the file, and annotate only from chr/pos/ref/alt. This was mostly a response to really messy user data in original samples; our output files omit any columns that were too ambiguous to present reliably across different input file formats.

There was an old UI bug where the search box used the wrong coordinates, but that's very different from annotations. If you do find evidence of something slipping through the cracks (like a default value that is being set when it shouldn't be), then I encourage you to file a bug report at the project repo. No one wants to hand out bad rsID information by default, and that would merit some follow-up. As an optional annotation that few people check directly, it's vaguely possible that something has been slipping through the cracks unreported since the feature was released in (checks)... 2020. User bug reports really do matter!


As for current project status? I genuinely don't know; I lend a hand on the mailing list as a courtesy, but can't speak for any official discussions after I moved on. If nothing else, it's open source (server - just the graph part - web UI without the server). 

-Andy Boughton
abo...@umich.edu

Applications Programmer/Analyst, Lead
Center for Statistical Genetics
University of Michigan


