Kaviar vs. gnomAD allele frequencies


Ariel Balter

May 10, 2021, 1:59:27 PM
to Kaviar-discuss
This is cross-posted at Biostars but hasn't gotten any answers.

Below are some plots comparing allele frequencies in Kaviar and gnomAD for variants present in both. The two databases agree for many variants, but there are some odd features:

  1. There are almost no variants for which the allele frequency is higher in Kaviar than in gnomAD, but many for which it is higher in gnomAD.
  2. There are many variants for which the allele frequency in gnomAD is roughly 4X that in Kaviar.
  3. There are many variants for which the allele frequency is close to 0 in Kaviar but ranges anywhere from 0 to 1 in gnomAD.
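To put numbers on these three features, here is a minimal base-R sketch that sorts each variant into a region of the scatter plot. It runs on a made-up toy table (`q_toy`) rather than the real join, and the cutoffs `tol` and `near_zero` are hypothetical values chosen only for illustration, so they would need tuning on the actual data.

```r
# Toy stand-in for the joined table; the values are made up for illustration.
q_toy <- data.frame(
  AF_KV = c(0.30, 0.001, 0.002, 0.10, 0.50, 0.0005),
  AF_GN = c(0.31, 0.450, 0.900, 0.40, 0.45, 0.0004)
)

tol       <- 0.02  # hypothetical tolerance for "roughly equal"
near_zero <- 0.01  # hypothetical cutoff for "close to 0 in Kaviar"

# Assign each variant to one region, checking the near-zero case first.
region <- with(q_toy, ifelse(
  AF_KV < near_zero & AF_GN > AF_KV + tol, "kaviar_near_zero",
  ifelse(abs(AF_GN - AF_KV) <= tol, "agree",
  ifelse(AF_GN > AF_KV, "gnomad_higher", "kaviar_higher"))))

region_counts <- table(region)
print(region_counts)
```

Running the same classification on the real joined table would show how much of the data each odd feature actually accounts for.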

One possible explanation for the 4X ratio is zygosity: perhaps a factor of 2 for each allele?
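One way to check whether the 4X band is a clean multiplicative factor is to look at the distribution of log2(gnomAD / Kaviar): an exact factor of 4 would pile up at 2, and a factor of 2 at 1. The sketch below uses base R on a made-up toy table (`q_toy`), and the half-unit bin width is an arbitrary choice of mine, not something from either database.

```r
# Toy stand-in for the joined table; the values are made up for illustration.
q_toy <- data.frame(
  AF_KV = c(0.10, 0.05, 0.20, 0.01, 0.25),
  AF_GN = c(0.10, 0.20, 0.80, 0.04, 0.26)
)

# The ratio is undefined where the Kaviar frequency is exactly 0, so skip those rows.
nonzero    <- q_toy$AF_KV > 0
log2_ratio <- log2(q_toy$AF_GN[nonzero] / q_toy$AF_KV[nonzero])

# Bin to the nearest 0.5 and count; a spike at 2 corresponds to a 4X ratio.
binned       <- round(log2_ratio * 2) / 2
ratio_counts <- table(binned)
print(ratio_counts)
```

On the real data, a sharp spike at exactly 2 would point to a systematic scaling difference, while a broad smear around 2 would suggest something else, such as differences in cohort composition.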

I couldn't find anything in the documentation for either database that would explain why they differ in this way, in particular why so many Kaviar frequencies are near 0.

In the plots, each dot represents an individual variant (matched by coordinate and alleles), and the axes are allele frequencies. The 2nd plot shows a random sample of 1e6 variants.

NOTE: I would love to provide an MRE, but given the size of the databases it's impossible. I'm using the publicly available gnomAD dataset on BigQuery; I downloaded Kaviar and uploaded it to BigQuery myself. I was unable to work with the full files in RStudio or even in an AI notebook on GCP, but the inner join (performed on BigQuery) reduces the dataset enough to download it.

```
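library(dplyr)    # inner_join, mutate, collect, sample_n
library(ggplot2)  # geom_hex also needs the hexbin package installed

# Join on variant identity in BigQuery, then pull the (much smaller) result into R.
q = inner_join(
  kaviar_bq %>% mutate(AF_KV = AF),
  gnomad_bq %>% mutate(AF_GN = AF),
  by=c("chromosome", "position", "reference_allele", "alternate_allele")
) %>% collect()

# Plot 1: all variants, with reference lines at slope 1 and slope 4.
q %>%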
  ggplot(aes(x=AF_KV, y=AF_GN, color=chromosome)) +
  geom_point() +
  ggtitle("Allele Frequencies: Kaviar vs. gnomAD") +
  xlab("Kaviar") +
  ylab("gnomAD") +
  geom_abline(intercept = 0, slope = 1) +
  geom_abline(intercept = 0, slope = 4)

# Plot 2: random subsample of 1e6 variants.
q %>%
  sample_n(1e6) %>%
  ggplot(aes(x=AF_KV, y=AF_GN, color=chromosome)) +
  geom_point() +
  ggtitle("Allele Frequencies: Kaviar vs. gnomAD") +
  xlab("Kaviar") +
  ylab("gnomAD") +
  geom_abline(intercept = 0, slope = 1) +
  geom_abline(intercept = 0, slope = 4)

# Plot 3: hexbin density on a log10 fill scale.
q %>%
  ggplot(aes(x=AF_KV, y=AF_GN)) +
  geom_hex(bins=100, aes(fill=after_stat(density))) +
  ggtitle("Allele Frequencies: Kaviar vs. gnomAD") +
  xlab("Kaviar") +
  ylab("gnomAD") +
  geom_abline(intercept = 0, slope = 1) +
  geom_abline(intercept = 0, slope = 4) +
  scale_fill_distiller(palette = "Spectral", trans="log10")
```


Attached plots: kaviar_vs_gnomad_all.png, kaviar_vs_gnomad_1e6.png, kaviar_vs_gnomad_density.png
