re: how to extract: use minfi
I got bored running the same old QC pipeline after about 10,000 TCGA samples. So I wrote a single-function patch to minfi that grabs the SNP probes and stuffs them into metadata (one could also consider using probes with known high-MAF SNPs to do the same thing). If it's useful to enough people I will turn it into a Bioconductor package and actually maintain it. Its rationale is documented here:
The function can be sourced as such:
library(minfi)
# until/unless this is merged into minfi or packaged separately by yours truly
Then when processing your IDATs/rgSet, it extracts the SNPs for an rgSet into the metadata of the resulting grSet.
For example, all of the FlowSorted.*.450k packages contain an RGChannelSet (rgSet) with the raw intensities, so:
library(FlowSorted.CordBloodNorway.450k)
FlowSorted.CordBloodNorway.450k
# class: RGChannelSet
# dim: 622399 77
# metadata(0):
# assays(2): Green Red
# rownames(622399): 10600313 10600322 ... 74810490 74810492
# rowData names(0):
# colnames(77): CD4+_4 CD4+_5 ... WB_22 WB_23
# colData names(8): Sample_Name SampleID ... Slide Basename
# Annotation
# array: IlluminaHumanMethylation450k
# annotation: ilmn12.hg19
FSCB.Norway <- tcgaPipeline(FlowSorted.CordBloodNorway.450k)
# Loading required package: IlluminaHumanMethylation450kmanifest
# Running TCGA-style (noob) pipeline on 77 samples...
# Loading required package: IlluminaHumanMethylation450kanno.ilmn12.hg19
# [preprocessNoob] Applying R/G ratio flip to fix dye bias...
# Masking probes with detection p-value > 0.05...
# Placing SNP probe betas in metadata(grSet)$SNPs...
# Done.
The SNP probe "beta" values then end up in metadata(grSet), eg. for a quick sanity check:
mdsPlot(metadata(FSCB.Norway)$SNPs, sampGroups=FSCB.Norway$SampleID,
main="MDS projection on SNP genotypes, colored by subject")

That doesn't really tell you anything about the location or genotypes of the SNP probes, however. It turns out that some comparison with gold-standard GM12878 genotypes from the Genome-in-a-Bottle (GIAB) consortium can address that: (cf.
https://github.com/ttriche/infiniumSnps):
library(GenomicRanges)
GM12878 <- makeGRangesFromDataFrame(
row.names=1),
keep=TRUE)
This will then give you the GRCh37 positions and M/U genotypes of all the SNP probes that I could decipher using the GM12878 GIAB calls.
# for simpler display
mcols(GM12878) <- mcols(GM12878)[, c("M","U")]
show(GM12878)
# GRanges object with 58 ranges and 2 metadata columns:
# seqnames ranges strand | M U
# <Rle> <IRanges> <Rle> | <character> <character>
# rs3936238 chr1 [ 4031586, 4031586] * | A G
# rs877309 chr1 [11412265, 11412265] * | G A
# rs213028 chr1 [21524764, 21524764] * | T C
# rs11249206 chr1 [25150569, 25150569] * | C T
# rs654498 chr1 [81945636, 81945636] * | T C
# ... ... ... ... . ... ...
# rs798149 chrX [ 15795352, 15795352] * | G A
# rs5926356 chrX [ 28157252, 28157252] * | <NA> <NA>
# rs5987737 chrX [114616977, 114616977] * | <NA> <NA>
# rs1416770 chrX [145026857, 145026857] * | <NA> <NA>
# rs6626309 chrX [147012065, 147012065] * | <NA> <NA>
# -------
# seqinfo: 18 sequences from an unspecified genome; no seqlengths
You could also bolt this onto the side of a SNP matrix for easier subsetByOverlaps(methSNPs, gwasSNPs) usage:
library(minfiDataEPIC)
# only keep the annotated ones:
GM12878_EPIC_SNPs <- metadata(tcgaPipeline(RGsetEPIC))$SNPs[names(GM12878), ]
# bolt the two together:
mcols(GM12878) <- cbind(mcols(GM12878), GM12878_EPIC_SNPs)
show(GM12878)
#
# GRanges object with 58 ranges and 5 metadata columns:
# seqnames ranges strand | M U
# <Rle> <IRanges> <Rle> | <character> <character>
# rs3936238 chr1 [ 4031586, 4031586] * | A G
# rs877309 chr1 [11412265, 11412265] * | G A
# rs213028 chr1 [21524764, 21524764] * | T C
# rs11249206 chr1 [25150569, 25150569] * | C T
# rs654498 chr1 [81945636, 81945636] * | T C
# ... ... ... ... . ... ...
# rs798149 chrX [ 15795352, 15795352] * | G A
# rs5926356 chrX [ 28157252, 28157252] * | <NA> <NA>
# rs5987737 chrX [114616977, 114616977] * | <NA> <NA>
# rs1416770 chrX [145026857, 145026857] * | <NA> <NA>
# rs6626309 chrX [147012065, 147012065] * | <NA> <NA>
# X200144450018_R04C01 X200144450019_R07C01 X200144450021_R05C01
# <numeric> <numeric> <numeric>
# rs3936238 0.0534766974 0.0504425326 0.0607401706
# rs11249206 0.0545081280 0.0412852676 0.0557573090
# rs654498 0.8275482094 0.8964896490 0.8939673746
# ... ... ... ...
# rs5926356 0.4604924134 0.4667771792 0.
4188118812# rs1416770 0.
5306600249 0.5507222106 0.5456836799
# rs6626309 0.0573575949 0.0289008335 0.0223078301
# -------
# seqinfo: 18 sequences from an unspecified genome; no seqlengths
That's the state of it, as best I know, for now. Please link to
https://github.com/ttriche/infiniumSnps in your methods section, if you use this data in a paper or to develop your own methods. I hope it helps with what you're doing.