Generate sequence fasta for each DO mouse

Michelle Lee

Mar 25, 2025, 6:32:18 PM

to R/qtl2 discussion

Hi there,

Thank you so much for the qtl2 package - it has been extremely helpful for my research.

I'm working with Diversity Outbred (DO) mice and I'd like to convert the founder strain genotype information back to actual sequences in FASTA format. Specifically, I'm using the genotype probability files (prob.8state.allele.qtl2_200131.Rdata) which contain the probability of each variant coming from each founder strain.

My goal is to create individual FASTA files for each DO mouse described here by substituting these variants into the mm10 reference genome. I'd like to handle all variant types (SNPs, indels, and structural variants) properly.

I'm wondering:
1. Is this achievable using the qtl2 package directly?
2. Do you have any recommendations for other packages or approaches that might help with this task?
3. Are there any existing functions or workflows you're aware of for converting genotype probabilities back to sequence data?

I've started implementing a solution by combining BSgenome.Mmusculus.UCSC.mm10 with the cc_variants.sqlite database, but I wanted to check if there might be a more established approach before proceeding further.

Any pointers or suggestions would be greatly appreciated! Thank you!

Best regards,
Michelle

Dan Gatti

Mar 27, 2025, 9:32:06 AM

to rqtl2...@googlegroups.com

I don’t think that qtl2 provides this in a manner that’s exposed to the user. For association mapping, qtl2 imputes the founder SNPs onto the DO diplotypes and then fits the mapping model at each SNP, but the full sequence of each DO genome is never computed. There is also considerable uncertainty, on the order of kilobases, about where crossovers occur, so if you are trying to impute full sequences, you’ll have to accept some heuristic compromise in crossover regions. You might be able to look in the scan1snps.R code and see how the SNP probs are imputed. Look for this code:

# snpinfo -> add index
snpinfo <- index_snps(map, snpinfo)

# genoprob -> snpprob
snp_pr <- genoprob_to_snpprob(genoprobs, snpinfo)

You may have to access these functions using qtl2:::index_snps(). But then you’ll have SNP positions and you could insert them into the reference genome. It’s not perfect since it excludes indels and SVs, but it might work. You’d probably be using Biostrings from the Bioconductor suite as well.
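The substitution step could look roughly like the base R sketch below. Everything in it is a toy assumption for illustration: `ref_seq`, the `snps` table, the `alt_prob` matrix, the `min_prob` threshold, and the helper `substitute_snps()` are hypothetical names, not part of qtl2. In practice the probabilities would be derived from the SNP probs and the sequence would come from a reference genome object, probably handled with Biostrings. It also produces a single haploid sequence per mouse and simply keeps the reference allele wherever the call is uncertain, which is one possible heuristic for crossover regions, not the only one.

```r
# Toy reference sequence standing in for one chromosome (hypothetical data).
ref_seq <- "ACGTACGTAC"

# Hypothetical SNP table: 1-based position, reference allele, alternate allele.
snps <- data.frame(
  snp_id = c("snp1", "snp2", "snp3"),
  pos    = c(2L, 5L, 9L),
  ref    = c("C", "A", "A"),
  alt    = c("T", "G", "C"),
  stringsAsFactors = FALSE
)

# Hypothetical probability that each mouse carries the alternate allele
# at each SNP (rows = mice, columns = SNPs).
alt_prob <- matrix(
  c(0.98, 0.01, 0.55,
    0.02, 0.97, 0.99),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("mouse1", "mouse2"), snps$snp_id)
)

# Substitute confidently called alternate alleles into the reference.
# SNPs below min_prob (e.g. near a crossover) keep the reference base.
substitute_snps <- function(ref_seq, snps, alt_prob_row, min_prob = 0.9) {
  bases <- strsplit(ref_seq, "")[[1]]
  # Sanity check: the SNP table must agree with the reference sequence.
  stopifnot(all(bases[snps$pos] == snps$ref))
  call_alt <- alt_prob_row >= min_prob
  bases[snps$pos[call_alt]] <- snps$alt[call_alt]
  paste(bases, collapse = "")
}

# One imputed sequence per mouse, returned as a named character vector.
seqs <- vapply(
  rownames(alt_prob),
  function(id) substitute_snps(ref_seq, snps, alt_prob[id, ]),
  character(1)
)

# Write one minimal FASTA file per mouse into a temporary directory.
out_dir <- tempdir()
for (id in names(seqs)) {
  writeLines(c(paste0(">", id), seqs[[id]]),
             file.path(out_dir, paste0(id, ".fasta")))
}
```

Because only single-base substitutions are made, the coordinates never shift, which is why this works for SNPs but would not carry over directly to indels or SVs.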

I don’t know of a package that performs this task, but I’ve never needed to do it before, so I haven’t looked. I think that you’ll need to use the founder assemblies, which may not contain accurate structural variants, to do what you want to do. And you’ll need enough disk space. The Ensembl GRCm39 genome is ~770 MB zipped, so you’d multiply that by the number of DO mice that you have.
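As a rough back-of-the-envelope check on the disk space, the multiplication works out as follows. The mouse count `n_mice` is a made-up number for illustration, and the estimate assumes one compressed genome per mouse at about the size quoted above.

```r
genome_size_mb <- 770   # approximate zipped GRCm39 size quoted above
n_mice <- 500           # hypothetical number of DO mice

# Total storage in gigabytes, assuming one zipped genome per mouse.
total_gb <- genome_size_mb * n_mice / 1000
total_gb
```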

Also, I’d strongly suggest working with the current reference genome, which is GRCm39. mm10 is over a decade old.

Dan
