SNP extraction and comparison - EPIC

550 views
Skip to first unread message

Eliza Walaszczyk

unread,
Jan 12, 2018, 10:22:29 AM1/12/18
to Epigenomics forum
Hi, 
Does anyone know how to extract the SNPs from Illumina EPIC array? I have also genotype data for the same set of samples and I was wondering if I could compare them as a QC step. Is that even possible? 
Thanks , 
E. 

Victor Yuan

unread,
Jan 12, 2018, 11:40:34 AM1/12/18
to epigenom...@googlegroups.com
You can use this function from Minfi to get snp probe measurements:

snps <- minfi::getSnpBeta(rgset) 

They may or may not overlap with your set of genotyping data, you'll just have to compare them yourself.

--
You received this message because you are subscribed to the Google Groups "Epigenomics forum" group.
To unsubscribe from this group and stop receiving emails from it, send an email to epigenomicsforum+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Tim Triche, Jr.

unread,
Jan 14, 2018, 11:44:58 AM1/14/18
to Epigenomics forum


re: how to extract: use minfi 

I got bored running the same old QC pipeline after about 10,000 TCGA samples.  So I wrote a single-function patch to minfi that grabs the SNP probes and stuffs them into metadata (one could also consider using probes with known high-MAF SNPs to do the same thing).  If it's useful to enough people I will turn it into a Bioconductor package and actually maintain it.  Its rationale is documented here:


The function can be sourced as such:

library(minfi) 
# until/unless this is merged into minfi or packaged separately by yours truly

Then when processing your IDATs/rgSet, it extracts the SNPs for an rgSet into the metadata of the resulting grSet. 
For example, all of the FlowSorted.*.450k packages contain an RGChannelSet (rgSet) with the raw intensities, so:

library(FlowSorted.CordBloodNorway.450k)
FlowSorted.CordBloodNorway.450k
# class: RGChannelSet 
# dim: 622399 77 
# metadata(0):
# assays(2): Green Red
# rownames(622399): 10600313 10600322 ... 74810490 74810492
# rowData names(0):
# colnames(77): CD4+_4 CD4+_5 ... WB_22 WB_23
# colData names(8): Sample_Name SampleID ... Slide Basename
# Annotation
#   array: IlluminaHumanMethylation450k
#   annotation: ilmn12.hg19

FSCB.Norway <- tcgaPipeline(FlowSorted.CordBloodNorway.450k)
# Loading required package: IlluminaHumanMethylation450kmanifest
# Running TCGA-style (noob) pipeline on 77 samples...
# Loading required package: IlluminaHumanMethylation450kanno.ilmn12.hg19
# [preprocessNoob] Applying R/G ratio flip to fix dye bias...
# Masking probes with detection p-value > 0.05...
# Placing SNP probe betas in metadata(grSet)$SNPs...
# Done.

The SNP probe "beta" values then end up in metadata(grSet), eg. for a quick sanity check:

mdsPlot(metadata(FSCB.Norway)$SNPs, sampGroups=FSCB.Norway$SampleID, 
        main="MDS projection on SNP genotypes, colored by subject")


That doesn't really tell you anything about the location or genotypes of the SNP probes, however.  It turns out that some comparison with gold-standard GM12878 genotypes from the Genome-in-a-Bottle (GIAB) consortium can address that: (cf. https://github.com/ttriche/infiniumSnps):

library(GenomicRanges) 
GM12878 <- makeGRangesFromDataFrame(
             row.names=1), 
  keep=TRUE)

This will then give you the GRCh37 positions and M/U genotypes of all the SNP probes that I could decipher using the GM12878 GIAB calls.

# for simpler display
mcols(GM12878) <- mcols(GM12878)[, c("M","U")]
show(GM12878)

# GRanges object with 58 ranges and 2 metadata columns:
#              seqnames                 ranges strand |           M           U
#                 <Rle>              <IRanges>  <Rle> | <character> <character>
#    rs3936238     chr1   [ 4031586,  4031586]      * |           A           G
#     rs877309     chr1   [11412265, 11412265]      * |           G           A
#     rs213028     chr1   [21524764, 21524764]      * |           T           C
#   rs11249206     chr1   [25150569, 25150569]      * |           C           T
#     rs654498     chr1   [81945636, 81945636]      * |           T           C
#          ...      ...                    ...    ... .         ...         ...
#     rs798149     chrX [ 15795352,  15795352]      * |           G           A
#    rs5926356     chrX [ 28157252,  28157252]      * |        <NA>        <NA>
#    rs5987737     chrX [114616977, 114616977]      * |        <NA>        <NA>
#    rs1416770     chrX [145026857, 145026857]      * |        <NA>        <NA>
#    rs6626309     chrX [147012065, 147012065]      * |        <NA>        <NA>
#   -------
#   seqinfo: 18 sequences from an unspecified genome; no seqlengths

You could also bolt this onto the side of a SNP matrix for easier subsetByOverlaps(methSNPs, gwasSNPs) usage:

library(minfiDataEPIC)

# only keep the annotated ones:
GM12878_EPIC_SNPs <- metadata(tcgaPipeline(RGsetEPIC))$SNPs[names(GM12878), ] 

# bolt the two together:
mcols(GM12878) <- cbind(mcols(GM12878), GM12878_EPIC_SNPs)

show(GM12878)
#
# GRanges object with 58 ranges and 5 metadata columns:
#              seqnames                 ranges strand |           M           U
#                 <Rle>              <IRanges>  <Rle> | <character> <character>
#    rs3936238     chr1   [ 4031586,  4031586]      * |           A           G
#     rs877309     chr1   [11412265, 11412265]      * |           G           A
#     rs213028     chr1   [21524764, 21524764]      * |           T           C
#   rs11249206     chr1   [25150569, 25150569]      * |           C           T
#     rs654498     chr1   [81945636, 81945636]      * |           T           C
#          ...      ...                    ...    ... .         ...         ...
#     rs798149     chrX [ 15795352,  15795352]      * |           G           A
#    rs5926356     chrX [ 28157252,  28157252]      * |        <NA>        <NA>
#    rs5987737     chrX [114616977, 114616977]      * |        <NA>        <NA>
#    rs1416770     chrX [145026857, 145026857]      * |        <NA>        <NA>
#    rs6626309     chrX [147012065, 147012065]      * |        <NA>        <NA>
#              X200144450018_R04C01 X200144450019_R07C01 X200144450021_R05C01
#                         <numeric>            <numeric>            <numeric>
#    rs3936238         0.0534766974         0.0504425326         0.0607401706
#     rs877309         0.5237327350         0.5314140191         0.5298953202
#     rs213028         0.4849094567         0.4557347337         0.4852366218
#   rs11249206         0.0545081280         0.0412852676         0.0557573090
#     rs654498         0.8275482094         0.8964896490         0.8939673746
#          ...                  ...                  ...                  ...
#     rs798149         0.9403501791         0.9421596963         0.9535634744
#    rs5926356         0.4604924134         0.4667771792         0.4188118812
#    rs5987737         0.9309938387         0.9302616391         0.9290115189
#    rs1416770         0.5306600249         0.5507222106         0.5456836799
#    rs6626309         0.0573575949         0.0289008335         0.0223078301
#   -------
#   seqinfo: 18 sequences from an unspecified genome; no seqlengths

That's the state of it, as best I know, for now.  Please link to https://github.com/ttriche/infiniumSnps in your methods section, if you use this data in a paper or to develop your own methods.  I hope it helps with what you're doing. 
Reply all
Reply to author
Forward
0 new messages