"variant from marker" leads to error of "could not parse column contents"


jie huang

Jan 1, 2021, 9:39:41 AM
to locuszoom

Hi all, Happy New Year!


I uploaded a GWAS file to my.locuszoom.org (https://my.locuszoom.org/gwas/892028/), and I have the following two questions. I would deeply appreciate it if someone could clarify.


1. When I was uploading my GWAS file, the “select file options” step offered two choices: “variants from columns” and “variant from marker”. When I select “variant from marker”, I get a red error message, “could not parse column contents”. What does this mean? It seems that I can only use the first option, “variants from columns”. However, this leaves the rsID empty for the majority of SNPs in the “Top Loci” table. Maybe LocusZoom can only match CHR:POS to a SNP ID for SNPs in HapMap? Is there a way to fix this? My input GWAS file does have an rsID for every SNP.


2. In the “Top Loci” table under the Manhattan plot, sorting by the first column, “Marker”, is not numerical. After sorting, the first marker is 1:109,817,590 and the second is 1:160,373,299. Is there a way to sort by CHR:POS numerically? Also, through PLINK clumping analysis, I found many genome-wide significant SNPs that are independent of 1:109,817,590, so there are multiple independent signals in that locus. However, it seems that LocusZoom does not use LD information to identify independent loci; instead, the “Top Loci” list is based only on physical distance. Can you please confirm this? I also wonder why the second marker, 1:160,373,299, is included in “Top Loci”, since its -logP is only 6.096.


Thank you very much & best regards,

Jie


Andy Boughton

Jan 1, 2021, 12:29:18 PM
to locu...@googlegroups.com
Thanks for the question.

  1. I would need to see a sample marker value to comment on why it could not be parsed. (We are working on improving UI error messages, but other projects are in progress, so this may take a while to go live.) See our documentation for the expected file format of each field. We try to support several dozen file formats, but people are always finding new ways to represent `chr:pos_ref/alt`. (One common issue: appending extraneous information.)
  2. As for rsID information, we generally require a full variant specification (chrom, pos, ref, and alt) in order to match to rsIDs. Ref and alt should be relative to the reference genome, as opposed to other conventions that are harder to standardize across datasets (major/minor, effect/non-effect, etc.). This is because, especially in newer dbSNP versions, rsIDs can be ambiguous (more than one ref/alt possible). Long term, we'd like to add support for a user-provided rsID field, so long as we can harmonize the value reliably across the most common upload file formats (which are, again, surprisingly heterogeneous).
  3. Per your question about the top loci table, you are correct that the top loci algorithm is fairly simple and does not try to identify independent signals in the same region. This is meant to be a concise summary of interesting regions, rather than an exhaustive list of top independent signals. It typically picks the top hit in each area (so that one significant region doesn't crowd out all other hits genome-wide), but other nearby hits would of course still be shown in the LZ plot for that region.
  4. Thanks for the bug report! The table sort order is confirmed to be a bug, so I've created a ticket to follow up here: https://github.com/statgen/locuszoom-hosted/issues/18
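
To illustrate the sort-order issue from item 4: plain string sorting places chromosome 10 before chromosome 2, while a numeric key restores genomic order. The sketch below is only an illustration, not the code used by my.locuszoom.org; the helper name `marker_sort_key` and the ranks assigned to X, Y, and MT are assumptions.

```python
import re

# Assumed ranks for non-numeric chromosomes; adjust to your own convention.
_CHROM_RANK = {"X": 23, "Y": 24, "MT": 25, "M": 25}


def marker_sort_key(marker):
    """Return (chromosome rank, position) for a 'chrom:pos' style marker."""
    chrom, rest = marker.strip().split(":", 1)
    if chrom.lower().startswith("chr"):
        chrom = chrom[3:]
    if chrom.isdigit():
        rank = int(chrom)
    else:
        # Unknown contigs sort last.
        rank = _CHROM_RANK.get(chrom.upper(), 99)
    # Keep only the leading digits of the position, ignoring thousands
    # separators and any trailing allele information such as "_A/G".
    match = re.match(r"[\d,]+", rest)
    if match is None:
        raise ValueError(f"no position found in marker: {marker!r}")
    return rank, int(match.group().replace(",", ""))


markers = ["10:5,000", "2:300", "1:160,373,299", "1:109,817,590", "X:100"]
print(sorted(markers))                       # lexicographic order
print(sorted(markers, key=marker_sort_key))  # genomic order
```

Sorting on a `(chromosome, position)` tuple keeps the comparison numeric on both parts, which is the behaviour the question asks for.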
-Andy Boughton
abo...@umich.edu

Senior Applications Programmer/Analyst
Center for Statistical Genetics
University of Michigan




jie huang

Jan 2, 2021, 11:10:37 PM
to locuszoom

Dear Andy:

Thank you very much. I have now inserted REF and ALT columns into my GWAS.gz. It originally had A1 and A2, since I ran the GWAS using PLINK2.
Now most rsIDs do show up. This is great. Since my input GWAS file does have a "SNP" column containing the rsID, it would be great if LocusZoom could simply use the rsID from that column, instead of looking it up by matching CHR:POS:REF:ALT.

Somehow, the "Could not parse column contents" error message still persists when I select "variant from marker". I have already uploaded my input GWAS.gz file to the LocusZoom server; I don't know whether you can get a copy of my file from there. Otherwise, I put a copy of the file at https://drive.google.com/file/d/15LWwsL41WmLis1wcP0RpOuRGajm0qRVJ/view?usp=sharing

Since the data upload window asks for information about allele frequency, I am curious: does LocusZoom have a way to show variants of a certain frequency (such as MAF<0.01) in different shapes? Otherwise, what is the allele frequency column for?

Finally, I am still puzzled by the fact that the top loci table lists variants that are not genome-wide significant (P<5E-08).

Andy Boughton

Jan 3, 2021, 12:13:51 AM
to locu...@googlegroups.com

> Thank you very much. I have now inserted REF and ALT columns into my GWAS.gz. It originally had A1 and A2, since I ran the GWAS using PLINK2.
> Now most rsIDs do show up. This is great. Since my input GWAS file does have a "SNP" column containing the rsID, it would be great if LocusZoom could simply use the rsID from that column, instead of looking it up by matching CHR:POS:REF:ALT.

In some multiallelic cases, the same rsID can refer to more than one alt allele, so filling in information from the rsID alone doesn’t guarantee that we would always pick the right alt allele. (This is unintuitive, but it is an intentional design choice on the part of dbSNP.) Rather than guess in a way that would force us to silently drop some rows, we ask for a full and exact variant specification.

(This is the downside of trying to support a very large number of file formats and data sources: when we do make assumptions, they tend to be more rigid)

Incidentally, you shouldn’t need to rename columns in your input file if the information is there but under a different name. The upload UI tries to suggest columns, but these are only a guess; you can always override that guess based on your own knowledge of the file contents (so long as the required information is actually there).


> Somehow, the "Could not parse column contents" error message still persists when I select "variant from marker". I have already uploaded my input GWAS.gz file to the LocusZoom server; I don't know whether you can get a copy of my file from there. Otherwise, I put a copy of the file at https://drive.google.com/file/d/15LWwsL41WmLis1wcP0RpOuRGajm0qRVJ/view?usp=sharing


The “marker” field is intended to be a single field that contains all chr:pos_ref/alt information (this is to support some programs that concatenate separate fields). The sample file you provide does not have such a column, so it does not surprise me that it would fail to parse. Regrettably, the rsID field cannot be used for this purpose, per above.
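
As a rough sketch of what a single-column marker looks like, the snippet below builds a `chr:pos_ref/alt` string from separate fields and parses it back. It is an assumption-laden illustration, not the parser used by my.locuszoom.org: the regular expression accepts only two spellings (`chr:pos_ref/alt` and `chr:pos:ref:alt`), the function names are hypothetical, and the alleles in the example are made up.

```python
import re

# Hypothetical pattern: optional "chr" prefix, then chrom:pos, then ref and alt
# separated by "_" and "/" (or by ":" throughout). SNVs and short indels only.
MARKER_RE = re.compile(
    r"^(?:chr)?(?P<chrom>[0-9A-Za-z]+):(?P<pos>\d+)"
    r"[_:](?P<ref>[ACGT]+)[/:](?P<alt>[ACGT]+)$",
    re.IGNORECASE,
)


def make_marker(chrom, pos, ref, alt):
    """Concatenate separate columns into one chr:pos_ref/alt marker."""
    return f"{chrom}:{pos}_{ref}/{alt}"


def parse_marker(marker):
    """Return (chrom, pos, ref, alt), or None if the marker does not match."""
    m = MARKER_RE.match(marker.strip())
    if m is None:
        return None
    return (
        m.group("chrom"),
        int(m.group("pos")),
        m.group("ref").upper(),
        m.group("alt").upper(),
    )


marker = make_marker("1", 109817590, "G", "T")  # illustrative alleles
print(marker)
print(parse_marker(marker))
print(parse_marker("rs123"))  # an rsID carries no chr/pos/ref/alt, so: None
```

A bare rsID fails this kind of pattern because it contains none of the four required pieces.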

> Since the data upload window asks for information about allele frequency, I am curious: does LocusZoom have a way to show variants of a certain frequency (such as MAF<0.01) in different shapes? Otherwise, what is the allele frequency column for?

Currently, allele frequencies are only displayed in the tooltip when you move your mouse over a variant on the scatter plot. A lot of files don’t have that information in a consistent format (e.g., alt vs. minor allele frequency), so we don’t currently make the UI depend on it. The underlying plotting library (LocusZoom.js) could certainly be used to customize how things are displayed, but those options aren’t exposed in the my.locuszoom.org site UI.
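
Because uploads may report either the alt allele frequency or the minor allele frequency, a frequency-based filter first has to fold the value onto the 0 to 0.5 range. A minimal sketch, assuming the column holds a frequency between 0 and 1; the helper names and the 0.01 default threshold are illustrative and not part of LocusZoom:

```python
def minor_allele_freq(freq):
    """Fold an allele frequency onto [0, 0.5] so alt and minor agree."""
    if not 0.0 <= freq <= 1.0:
        raise ValueError(f"frequency out of range: {freq}")
    return min(freq, 1.0 - freq)


def is_rare(freq, threshold=0.01):
    """True when the folded frequency falls below the chosen threshold."""
    return minor_allele_freq(freq) < threshold


# Works whether the file reports the common or the rare allele's frequency.
for freq in (0.995, 0.004, 0.2):
    print(freq, is_rare(freq))
```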

I’d like to provide advanced UI controls in the future to expose more of the power in LocusZoom. This isn’t a short term goal, but filtering the plot by allele frequency would definitely be a useful first option. Thanks for the suggestion!


> Finally, I am still puzzled by the fact that the top loci table lists variants that are not genome-wide significant (P<5E-08).

my.locuszoom.org is meant for early exploration of new analyses, and in the past users have requested the ability to see “suggestive” regions in the list of possibly interesting loci if there are no more significant hits present. In the individual region plots, we still show the line of GWAS significance at p = 5e-8, so it would still be obvious if the hits were not truly significant. Additionally, the “batch mode” feature goes through top hits starting with the most significant.

Again, we’d love to see all of our UI on my.locuszoom.org become more flexible. A common theme in the above answers is that in trying to support so many file formats, it becomes harder to write display features tailored to one individual file: every filter, color, or shape option has to defensively handle the case of 100 files that lack the required field. If you’re interested in shareable web-based exploration with full control of your own data, and if you are comfortable writing your own code, then LocusZoom.js (the underlying plotting library) can be fully customized to your needs and dataset. There is support for tabix files as well, which makes it easier to use without needing to write a web server backend that would feed the data to the plot.

-Andy Boughton





jie huang

Jan 3, 2021, 3:48:49 AM
to locuszoom

Dear  Andy:

Thank you very much again for your reply! Now it all makes sense!
I still have a few follow-up questions.

1. It is no problem for the "top loci" table to show SNPs with P>5E-08. Just wondering, what is the cutoff then? All 1 Mb loci whose smallest P is below 1E-06?

2. I understand that it would be too difficult to plot with too many different shapes and colors. But highlighting variants with MAF <1% would be really cool and necessary, since rare variants with big effect sizes are critical. So, if I could request one feature, it would be an option for "plotting variants with MAF < XXX in shape YYY", where users could enter a number for "XXX" and pick a shape for "YYY".

3. I used the LocusZoom standalone version almost 10 years ago, when I was generating hundreds of LocusZoom plots on my laptop. Just curious, is LocusZoom.js the new face of the standalone version? The syntax of .js is a bit hard to learn compared with bash or R :-)

Best regards,
Jie

Andy Boughton

Jan 4, 2021, 11:18:14 AM
to locu...@googlegroups.com

> 1. It is no problem for the "top loci" table to show SNPs with P>5E-08. Just wondering, what is the cutoff then? All 1 Mb loci whose smallest P is below 1E-06?


Correct. The current cutoff is p < 1e-6. This is similar to the default value chosen by PheWeb, a tool for browsing large scale PheWAS results.
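
For readers who want to reproduce a similar summary locally, a distance-based pass with this cutoff might look like the sketch below. It is not the implementation used by my.locuszoom.org or PheWeb: the greedy strategy, the function name `top_loci`, and the 500 kb exclusion distance on either side of a chosen hit are all assumptions, and the p-values are illustrative.

```python
def top_loci(variants, p_cutoff=1e-6, window=500_000):
    """Greedy, distance-only peak picking; LD is deliberately ignored.

    variants: iterable of (chrom, pos, pvalue) tuples.
    Returns the chosen hits, most significant first.
    """
    hits = sorted((v for v in variants if v[2] < p_cutoff), key=lambda v: v[2])
    chosen = []
    for chrom, pos, p in hits:
        # Keep a hit only if no stronger hit on the same chromosome is nearby.
        if all(c != chrom or abs(pos - q) > window for c, q, _ in chosen):
            chosen.append((chrom, pos, p))
    return chosen


variants = [
    ("1", 109_817_590, 1e-30),
    ("1", 109_900_000, 1e-12),  # near the first hit, so it is absorbed
    ("1", 160_373_299, 8e-7),   # suggestive only, but below the 1e-6 cutoff
    ("2", 5_000, 1e-5),         # above the cutoff, dropped
]
for locus in top_loci(variants):
    print(locus)
```

Because only physical distance is used, a second independent signal close to the lead variant is absorbed into the same locus, which matches the behaviour described earlier in the thread.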


> 2. I understand that it would be too difficult to plot with too many different shapes and colors. But highlighting variants with MAF <1% would be really cool and necessary, since rare variants with big effect sizes are critical. So, if I could request one feature, it would be an option for "plotting variants with MAF < XXX in shape YYY", where users could enter a number for "XXX" and pick a shape for "YYY".


I'd love to some day add UI around this feature exposing a variety of options; it's not a small task but would be very powerful. If it does get added, we'll post an announcement to the mailing list and/or website at that point.

> 3. I used the LocusZoom standalone version almost 10 years ago, when I was generating hundreds of LocusZoom plots on my laptop. Just curious, is LocusZoom.js the new face of the standalone version? The syntax of .js is a bit hard to learn compared with bash or R :-)


The short answer is that LocusZoom.js is the version we are focused on developing going forward; the Python version is no longer officially supported with new updates or annotations. There is a longer discussion of what each tool does ("comparison of tools and features") in the new LocusZoom.js preprint: