cleaning genptype/allele probability prior to scan1

50 views
Skip to first unread message

Billy Jiang

unread,
Mar 28, 2024, 2:00:30 AMMar 28
to R/qtl2 discussion

Dear Karl, 


I hope this finds you well. Thanks so much for the previous responses.


I've been trying to do QTL mapping in a DO population (n = ~550), which was genotyped using GigaMUGA (markers = ~ 110,000). I've come across some questions regarding how the genotype/allele probability matrix works in scan1() and would love to have some help with my understanding and further application.


After a few attempts on QTL mapping, I noticed this function called clean_genoprob(), which cleans up genotype/allele probabilities by setting extremely small values to 0. I understand that this function was initially developed to help with the problem of unstable estimates of genotype effects in scan1coef() and fit1() when there's a largely absent genotype. I imagined this would not affect the results much and may give me a less noisy scan result. So, I then started to apply this function with the later analysis before scan1()


Surprisingly, as you may see in the attached screenshots, with the same cross2 file (same covariates and phenotype), cleaning the allele probability greatly altered the haplotype-based QTL results. My understanding is that scan1() fits the linear mixed model using allele probability to obtain RSS values, which are then used to calculate LOD scores. Thus, these small probabilities will also affect LOD scores. But it's surprising how much it has changed. 


So, I looked into some of the source codes for scan1() and noticed that scan1() subsets the probabilities, dropping columns with all 0s. This may be the case in our data; many of the markers have one of their allele dosages cleaned to all 0s. 


May I ask how function may affect the result with the allele additive model? Do you happen to have any thoughts on using this clean_genoprob() function before running scan1()


Here is the code I used:


cross2 <- readRDS("cross2_clean.rds")

geno.prob <- calc_genoprob(cross2, error_prob=0.002, map_function="c-f")

allele.prob <- genoprob_to_alleleprob(geno.prob)

allele.prob.clean <- clean_genoprob(allele.prob)


kinship <- calc_kinship(allele.prob, "loco")

kinship.clean <- calc_kinship(allele.prob.clean, "loco")


out <- scan1(allele.prob, phenotype, kinship = kinship, addcovar = covar, Xcovar=sex)

out.clean <- scan1(allele.prob.clean, phenotype, kinship = kinship.clean, addcovar = covar, Xcovar=sex)


plot_scan1(out, cross2$pmap, altcol="green4", bgcolor = "white", altbgcolor = "gray90", ylim = c(0,14))

plot_scan1(out.clean, cross2$pmap, altcol="green4", bgcolor = "white", altbgcolor = "gray90", ylim = c(0,14))


Thank you so much for your help!


Cheers,

Billy

cleaned vs uncleaned alleleProb during scan1.png

Karl Broman

unread,
Mar 28, 2024, 8:38:24 AMMar 28
to R/qtl2 discussion
I'm not sure why there are such big differences. Are there lots of regions where one of the 8 founder alleles is missing?

karl

Billy Jiang

unread,
Mar 31, 2024, 8:48:37 AMMar 31
to R/qtl2 discussion
Hi Karl,

Thanks a lot for your reply. I think that may be the case. I've just written a little function to count the number of markers where one of the 8 alleles has all 0 in its dosage:

check_cleaned_Prob <- function(cleaned_prob) {
  markers_with_0_alleleProb <- 0
 
  # Iterate over each chromosome
  for (chr in 1:20) {
    if (chr < 20){
      prob_chr <- cleaned_prob[[as.character(chr)]]
    } else {
      prob_chr <- cleaned_prob[["X"]]
    }
   
    # Iterate over each marker within the chromosome
    for (marker_num in 1:(dim(prob_chr)[3])) {
      allele_prob_marker <- as.data.frame(prob_chr[,,marker_num])
     
      # Check if the entire allele probability column is zero
      if (any(colSums(allele_prob_marker == 0) == nrow(allele_prob_marker))) {
        markers_with_0_alleleProb <- markers_with_0_alleleProb + 1
      }
    }
  }
 
  return(markers_with_0_alleleProb)
}

# run the function
n_markers_missing_alleleProb <- check_cleaned_Prob(allele.prob.clean)
print(n_markers_missing_alleleProb)
> [1] 65518
print(check_cleaned_Prob(allele.prob))

> [1] 0

Of the ~110,000 markers in the cross2 file, 65518 markers have one of the 8 alleles missing (with all 0 in its probability). In this case, do you have any further recommendations for me using the clean_genoprob() function

Thank you so much for your help again! :)

Cheers,
Billy

Karl Broman

unread,
Mar 31, 2024, 10:02:30 AMMar 31
to R/qtl2 discussion
This seems strange. With a sample size of 500, I wouldn't expect that an allele would be missing at half of the loci in the genome. I would try to figure out what's going on with the allele frequencies.

karl

Dan Gatti

unread,
Apr 1, 2024, 8:32:08 AMApr 1
to rqtl2...@googlegroups.com

That does seem high. We’ve lost some alleles at some locations in the DO. In a set of ~350 DO mice from G46 – G48, I get ~840 markers at which one founder has a frequency of < 1e-4 in these mice.

If I plot it, I get this. 1 means at least one allele was < 1e-4 at that marker:

 

 

What generation are your mice from? It does seem odd that you’d have half of the genome showing zero frequency for a founder.

 

Dan

--
You received this message because you are subscribed to the Google Groups "R/qtl2 discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rqtl2-disc+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/rqtl2-disc/c83f5789-548c-4c80-a411-7141a0e366acn%40googlegroups.com.

---

The information in this email, including attachments, may be confidential and is intended solely for the addressee(s). If you believe you received this email by mistake, please notify the sender by return email as soon as possible.

Billy Jiang

unread,
Apr 1, 2024, 6:12:14 PMApr 1
to R/qtl2 discussion
Hi Dan,

Thanks a lot for your response and for sharing your experience. 

The mice are from generations 27 - 37:

> table(cross2[["cross_info"]])
 27  28  29  30  31  32  33  34  35  36  37
 34  20  34  32  34  26  33 112  33 108  79 


Would this be the reason for the loss?

Thanks much,
Billy

Dan Gatti

unread,
Apr 2, 2024, 8:31:15 AMApr 2
to rqtl2...@googlegroups.com

I don’t think so. The situation should be better in earlier generations.  I tried the counting myself both before and after cleaning and I’m not seeing the same thing that you’re seeing. Maybe I’m making a mistake, but I’m not seeing it. The plots below show the markers at which the overall founder frequency goes below 0.01.

 

setwd('~/projects/superovulation/data/')

library(qtl2)

 

# Read in genoprobs.

probs = readRDS('sodo_alleleprobs_grcm39_137K.rds')

 

# Count the number of low allele frequency columns.

count_zero = function(pr) {

 

  # Sum up each founder at each marker.

  zero = lapply(probs, colSums)

  # Tally whether the sum is less than the threshold.

  zero = lapply(zero,  '<', 0.01)

  # Sum up the number of low founders at each marker.

  zero = sapply(zero, colSums)

  unlist(zero)

 

} # count_zero()

 

zero = count_zero(probs)

 

# Clean the genoprobs object by setting low probs == 0.

probs2 = clean_genoprob(object           = probs,

                        value_threshold  = 1e-06,

                        column_threshold = 0.01)

 

zero2 = count_zero(probs2)

 

layout(matrix(1:2, ncol = 1))

plot(zero,  pch = 16, main = 'Unclean')

plot(zero2, pch = 16, main = 'Cleaned')

 

 

Dan

 

From: rqtl2...@googlegroups.com <rqtl2...@googlegroups.com> On Behalf Of Billy Jiang
Sent: Monday, April 1, 2024 6:12 PM
To: R/qtl2 discussion <rqtl2...@googlegroups.com>
Subject: [SOCIAL NETWORK] Re: [SOCIAL NETWORK] [Rqtl2-disc] Re: cleaning genptype/allele probability prior to scan1

 

Hi Dan,

 

Thanks a lot for your response and for sharing your experience. 

The mice are from generations 27 - 37:

 

> table(cross2[["cross_info"]])
 27  28  29  30  31  32  33  34  35  36  37
 34  20  34  32  34  26  33 112  33 108  79 

 

Would this be the reason for the loss?

 

Thanks much,

Billy

 

On Monday 1 April 2024 at 11:32:08 pm UTC+11 Dan Gatti wrote:

That does seem high. We’ve lost some alleles at some locations in the DO. In a set of ~350 DO mice from G46 – G48, I get ~840 markers at which one founder has a frequency of < 1e-4 in these mice.

If I plot it, I get this. 1 means at least one allele was < 1e-4 at that marker:

 

Image removed by sender.

Reply all
Reply to author
Forward
0 new messages