Gene ST array summarisation by probeset?

17 views

Skip to first unread message

Sophie Marion de Procé

unread,

Apr 7, 2017, 11:29:39 AM4/7/17

to aroma.affymetrix

Dear all,

I'm analysing a Rat Gene ST 2.1 array. I would like to filter the dataset using thresholds of expression for a minimum proportion of samples in each group.

I've been following the paper by Rodrigo-Domingo et al. (2014) Reproducible probe-level analysis of the Affymetrix Exon 1.0 ST array with R/Bioconductor.

They have two steps to the filtering step, first they filter probesets and then the transcripts. I'm getting stuck at the filtering step (code chunk number 7: intensityFiltering in the R code provided in this paper), as I can't find a way to summarise the Gene St array at the probeset level in order to do the first step. I have used plm <- RmaPlm(can) to summarise my data at the transcript level, but the plmPs <- RmaPlm(csN, mergeGroups = FALSE) seems to summarise by transcripts as well.

So my questions are:

1) Is it possible to summarise a Gene ST array at the probeset level? If yes, how?

2) Less specific to the aroma-affymetrix package, but is it necessary to have the probeset-level dataset in order to filter present/absent probes/transcripts? What would be an appropriate workflow for this?

Thank you very much for your help,

Best wishes,

Sophie.

Here is the code chunk from that paper:

###################################################
### code chunk number 7: intensityFiltering
###################################################

# ** user-defined groups; default: groups defined by the treatment column in SampleInformation.txt
sample.info$number <- seq(1, nrow(sample.info))
groups <- split(sample.info$number, sample.info$treatment)

# check whether the filtering is already performed, perform it otherwise
# at probeset level
if(file.exists(file = paste(output.folder, "/PresentProbesets", ds, ".Rdata", sep = ""))){
 load(file = paste(output.folder, "/PresentProbesets", ds, ".Rdata", sep = ""))
} else {
 # remove cross-hybridising probesets
 non.crosshyb.probesets <- probesets.NetAffx.32$probeset_id[probesets.NetAffx.32$crosshyb_type == 1] 
 if(!exists("exFit")){
  load(file= paste(output.folder, "CoreExonIntensities.Rdata", sep = "/"))
 }
 exon.intensities <- exFit[exFit$groupName %in% non.crosshyb.probesets, -c(3:5)]
 rm("exFit")
 gc()

 # ** user-defined criteria for absent/present probesets
 # ** criterion 1: probeset absent/present in a group of samples: present in at least half of the samples
 present.exons <- lapply(groups, FUN = function(group){
   if(length(group) > 1){
     apply(log2(exon.intensities[, group + 2]) < 3, 1, sum)/length(group) < 0.5
   } else {
     log2(exon.intensities[, group + 2]) > 3
   }
 })
 present.exons <- t(do.call(rbind, present.exons))  # convert to dataframe
 rownames(present.exons) <- exon.intensities$groupName  # use probeset identities

 # ** criterion 2: probeset absent/present in the dataset
 # remove probesets not present in any of the groups
 absent.exons <- apply(present.exons, 1, AllFalse)
 probesets.to.keep <- absent.exons[absent.exons == FALSE]
 probesets.to.keep <- as.factor(names(probesets.to.keep))
 n.present.exons <- length(probesets.to.keep)
 rm(list = c("absent.exons"))
 save(probesets.to.keep, file = paste(output.folder, "/PresentProbesets", ds, ".Rdata", sep = ""))
}
# check whether transcript filtering has been performed; perform it if not
if(file.exists(file = paste(output.folder, "/PresentTranscripts", ds, ".Rdata", sep = ""))){
 load(file = paste(output.folder, "/PresentTranscripts", ds, ".Rdata", sep = ""))
} else {
 if(!exists("exon.intensities")){
  load(file= paste(output.folder, "CoreExonIntensities.Rdata", sep = "/"))
  exon.intensities <- exFit[exFit$groupName %in% non.crosshyb.probesets, -c(3:5)]
  rm("exFit")
  gc()
 }

 # ** user-defined criteria for absent transcripts:
 # criterion 1: half or more of probesets of transcript present in sample
 # --> transcript present in sample
 core.transcripts <- unique(exon.intensities$unitName)
 # create a list of transcript clusters present/absent in each sample
 present.genes    <- lapply(core.transcripts, FUN = function(gene){
   apply(log2(exon.intensities[exon.intensities$unitName == gene, -c(1:2)]) < 3, 2, sum)/
   length(exon.intensities[exon.intensities$unitName == gene,]$groupName) < 0.5
 })# FALSE: gene not present in sample
 names(present.genes) <- core.transcripts 
 present.genes <- do.call(rbind, present.genes) # convert to logical matrix 

 # criterion 2: transcript present in at least half of the samples of a group 
 # --> transcript present in the group
 present.genes.in.group <- lapply(groups, FUN = function(group){
   if(length(group) > 1){
     apply(present.genes[ , group], 1, sum)/length(group) >= 0.5
   } else {
     present.genes[ , group]
   }
 }) 
 present.genes.in.group <- do.call(rbind, present.genes.in.group) # logical matrix
 present.genes.in.group <- t(present.genes.in.group)
 # keep genes only present in at least two groups
 transcripts.to.keep <- apply(present.genes.in.group, 1, TwoOrMoreTrue)
 transcripts.to.keep <- names(transcripts.to.keep[transcripts.to.keep == TRUE])
 save(transcripts.to.keep, file = paste(output.folder, "/PresentTranscripts", ds, ".Rdata", sep = ""))
 rm("exon.intensities")
 gc()
}
 n.present.transcript.clusters <- length(transcripts.to.keep)

Reply all

Reply to author

Forward

0 new messages