Questions on extracting probeset summaries


Qingzhou Zhang

Jan 22, 2015, 8:44:38 AM
to aroma-af...@googlegroups.com
Hi Henrik,

I am processing HG-U133_Plus_2 datasets. When extracting probeset summaries (chip effects) as a data frame, I only get 27604 obs. of n variables.
I was hoping to get a data frame of 54675 obs., which equals the number of units on the HG-U133_Plus_2 chip. Am I missing some steps, or processing the wrong CEL files?

Thanks a lot!

Henrik Bengtsson

Jan 22, 2015, 12:36:00 PM
to aroma-affymetrix
Hard to say. Can you share your code (from beginning to end) showing
what you're doing?

Henrik

> --
> --
> When reporting problems on aroma.affymetrix, make sure 1) to run the latest
> version of the package, 2) to report the output of sessionInfo() and
> traceback(), and 3) to post a complete code example.
>
>
> You received this message because you are subscribed to the Google Groups
> "aroma.affymetrix" group with website http://www.aroma-project.org/.
> To post to this group, send email to aroma-af...@googlegroups.com
> To unsubscribe and other options, go to http://www.aroma-project.org/forum/
>
> ---
> You received this message because you are subscribed to the Google Groups
> "aroma.affymetrix" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to aroma-affymetr...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

Qingzhou Zhang

Jan 22, 2015, 8:00:36 PM
to aroma-af...@googlegroups.com, h...@biostat.ucsf.edu
Hi, Henrik,

Thanks for the reply!

Here is my code:

library("aroma.affymetrix")
RawName = "Project1"
RawChipType = "HG-U133_Plus_2"

ces <- doGCRMA(RawName, chipType = RawChipType)
data <- extractDataFrame(ces, units = NULL, addNames = TRUE)

Here is the sessionInfo()

R version 3.1.1 (2014-07-10)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                  LC_ADDRESS=C
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] aroma.light_2.2.1       aroma.affymetrix_2.13.0 aroma.core_2.13.0       R.devices_2.12.0
[5] R.filesets_2.6.0        R.utils_1.34.0          R.oo_1.18.2             affxparser_1.38.0
[9] R.methodsS3_1.6.2

loaded via a namespace (and not attached):
 [1] aroma.apd_0.5.0    base64enc_0.1-2    Cairo_1.5-7        digest_0.6.8       DNAcopy_1.40.0
 [6] matrixStats_0.13.0 PSCBS_0.43.0       R.cache_0.11.0     R.huge_0.8.0       R.rsp_0.19.7
[11] tools_3.1.1

Here is the traceback()

1: extractDataFrame(ces, units = NULL, addNames = TRUE)


I tried several times, but always got a data frame containing 27604 obs. :-(

Thanks

Henrik Bengtsson

Jan 22, 2015, 10:23:12 PM
to aroma-affymetrix
Thanks. I can *not* reproduce this, e.g.

> ces
ChipEffectSet:
Name: GSE9890
Tags: GRBC,QN,RMA,oligo
Path: plmData/GSE9890,GRBC,QN,RMA,oligo/HG-U133_Plus_2
Platform: Affymetrix
Chip type: HG-U133_Plus_2,monocell
Number of arrays: 10
Names: GSM249671, GSM249672, GSM249673, ..., GSM249680 [10]
Time period: 2015-01-17 09:43:28 -- 2015-01-17 09:43:35
Total file size: 5.75MB
RAM: 0.02MB
Parameters: {}

> ces[[1]]
ChipEffectFile:
Name: GSM249671
Tags: chipEffects
Full name: GSM249671,chipEffects
Pathname: plmData/GSE9890,GRBC,QN,RMA,oligo/HG-U133_Plus_2/GSM249671,chipEffects.CEL
File size: 589.25 kB (603394 bytes)
RAM: 0.02 MB
File format: v4 (binary; XDA)
Platform: Affymetrix
Chip type: HG-U133_Plus_2,monocell
Timestamp: 2015-01-17 09:43:28
Parameters: {probeModel: chr "pm"}

> data <- extractDataFrame(ces, units=NULL, addNames=TRUE)
> str(data)
'data.frame': 54675 obs. of 15 variables:
$ unitName : chr "AFFX-BioB-5_at" "AFFX-BioB-M_at" "AFFX-BioB-3_at" "AFFX-BioC-5_at" ...
$ groupName: chr "" "" "" "" ...
$ unit : int 1 2 3 4 5 6 7 8 9 10 ...
$ group : int 1 1 1 1 1 1 1 1 1 1 ...
$ cell : int 1 2 3 4 5 6 7 8 9 10 ...
$ GSM249671: num 1614 2691 2120 3904 2238 ...
$ GSM249672: num 2612 4060 3301 5686 3280 ...
$ GSM249673: num 2876 5178 4014 6861 4050 ...
$ GSM249674: num 3328 5704 4350 7617 4505 ...
$ GSM249675: num 3101 5455 4131 7735 4560 ...
$ GSM249676: num 5081 8883 7173 10997 7188 ...
$ GSM249677: num 2329 4186 3209 5853 3482 ...
$ GSM249678: num 1723 3177 2353 5537 3141 ...
$ GSM249679: num 1442 2458 2114 4285 2370 ...
$ GSM249680: num 1469 2641 2154 4583 2582 ...

So, let's start troubleshooting. First, you should see exactly the same
output as I do for:

> cdf <- getCdf(ces)
> cdf
AffymetrixCdfFile:
Path: annotationData/chipTypes/HG-U133_Plus_2
Filename: HG-U133_Plus_2,monocell.CDF
File size: 9.63 MB (10098009 bytes)
Chip type: HG-U133_Plus_2,monocell
RAM: 3.34MB
File format: v4 (binary; XDA)
Dimension: 246x245
Number of cells: 60270
Number of units: 54675
Cells per unit: 1.10
Number of QC units: 9

If not, that's where the problem is. If ok, then check this output:

> map <- getUnitGroupCellMap(cdf)
> str(map)
Classes 'UnitGroupCellMap' and 'data.frame': 54675 obs. of 3 variables:
$ unit : int 1 2 3 4 5 6 7 8 9 10 ...
$ group: int 1 1 1 1 1 1 1 1 1 1 ...
$ cell : int 1 2 3 4 5 6 7 8 9 10 ...

This "map" is essential in what information gets pulled out and
returned. The number of rows/observations in this data frame should
match the number of units in the 'cdf', i.e. 54,675 units.

Let's start with that.

Henrik
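
The two checks above can be folded into one small script. This is only a sketch: unitCountMatches() is a hypothetical helper (not part of aroma.affymetrix), and the guarded part assumes that nbrOfUnits() is available for CDF objects and that annotationData/ is set up as in this thread. It compares distinct units rather than rows, which gives the same answer when every unit has a single group, as in the output above.

```r
# Hypothetical helper (not part of aroma.affymetrix): TRUE if the
# unit-group-cell map covers exactly 'nUnits' distinct units.
unitCountMatches <- function(map, nUnits) {
  length(unique(map$unit)) == nUnits
}

# Self-contained demonstration on a mock map with five units.
mock <- data.frame(unit = 1:5, group = 1L, cell = 1:5)
print(unitCountMatches(mock, 5))
print(unitCountMatches(mock, 6))

# The same check against real annotation data. It only runs if the
# package and the annotationData/ directory from this thread exist.
if (requireNamespace("aroma.affymetrix", quietly = TRUE) &&
    file.exists("annotationData/chipTypes/HG-U133_Plus_2")) {
  library("aroma.affymetrix")
  cdf  <- AffymetrixCdfFile$byChipType("HG-U133_Plus_2")
  cdfM <- getMonocellCdf(cdf)
  map  <- getUnitGroupCellMap(cdfM)
  # The monocell CDF should have as many units as the main CDF.
  stopifnot(nbrOfUnits(cdfM) == nbrOfUnits(cdf))
  stopifnot(unitCountMatches(map, nbrOfUnits(cdfM)))
}
```

If either stopifnot() fails, the monocell CDF on disk is the first thing to suspect.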

Qingzhou Zhang

Jan 23, 2015, 10:23:45 AM
to aroma-af...@googlegroups.com, h...@biostat.ucsf.edu
Thanks, Henrik,

Troubleshooting shows that something is wrong with the monocell CDF file:


> cdf
AffymetrixCdfFile:
Path: annotationData/chipTypes/HG-U133_Plus_2
Filename: HG-U133_Plus_2,monocell.CDF
File size: 4.88 MB (5116945 bytes)
Chip type: HG-U133_Plus_2,monocell
RAM: 0.46MB
File format: v4 (binary; XDA)
Dimension: 182x182
Number of cells: 33124
Number of units: 27604
Cells per unit: 1.20
Number of QC units: 9



So I deleted the previous monocell CDF file in annotationData/chipTypes/HG-U133_Plus_2 and tried to re-create it with the following:

cdf <- AffymetrixCdfFile$byChipType("HG-U133_Plus_2")
cdfM <- getMonocellCdf(cdf, verbose = Arguments$getVerbose(-8, timestamp = TRUE))

However, this also failed. Here is the output:

> cdfM <- getMonocellCdf(cdf, verbose = Arguments$getVerbose(-8, timestamp = TRUE))
20150123 21:47:53|Retrieving monocell CDF...
20150123 21:47:53| Monocell chip type: HG-U133_Plus_2,monocell
20150123 21:47:53| Locating monocell CDF...
20150123 21:47:53|  Pathname:
20150123 21:47:53| Locating monocell CDF...done
20150123 21:47:53| Could not locate monocell CDF. Will create one for chip type...
20150123 21:47:53|  Creating monocell CDF...
20150123 21:47:53|   Chip type: HG-U133_Plus_2
20150123 21:47:53|   Validate (main) CDF...
20150123 21:47:54|   Validate (main) CDF...done
20150123 21:47:55|   Adding temporary suffix from file...
20150123 21:47:55|    Pathname: annotationData/chipTypes/HG-U133_Plus_2/HG-U133_Plus_2,monocell.CDF
20150123 21:47:55|    Suffix: .tmp
20150123 21:47:55|    Rename existing file?: FALSE
20150123 21:47:55|    Temporary pathname: annotationData/chipTypes/HG-U133_Plus_2/HG-U133_Plus_2,monocell.CDF.tmp
20150123 21:47:55|   Adding temporary suffix from file...done
20150123 21:47:55|   Number of cells per group field: 1
20150123 21:47:55|   Reading CDF group names...
20150123 21:47:55|   Reading CDF group names...done
             used (Mb) gc trigger (Mb) max used (Mb)
   Ncells  603933 32.3     899071 48.1   741108 39.6
   Vcells 1027587  7.9    1757946 13.5  1424724 10.9
            used (Mb) gc trigger (Mb) max used (Mb)
   Ncells 549349 29.4     899071 48.1   899071 48.1
   Vcells 945722  7.3    1757946 13.5  1424724 10.9
20150123 21:47:56|   Number of cells per unit:
      Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
         1       1       1       1       1       1
20150123 21:47:56|   Reading CDF QC units...
20150123 21:47:56|   Reading CDF QC units...done
20150123 21:47:56|   Number of QC cells: 5385 in 9 QC units (0.1MB)
20150123 21:47:56|   Total number of cells: 60060
20150123 21:47:56|   Best array dimension: 246x245 (=60270 cells, i.e. 210 left-over cells)
20150123 21:47:56|   Creating CDF header with source CDF as template...
20150123 21:47:56|    Setting up header...
20150123 21:47:56|     Reading CDF header...
20150123 21:47:56|     Reading CDF header...done
20150123 21:47:56|     Reading CDF unit names...
20150123 21:47:56|     Reading CDF unit names...done
20150123 21:47:56|    Setting up header...done
20150123 21:47:56|    Writing...
20150123 21:47:56|     destHeader:
     List of 12
      $ ncols      : int 245
      $ nrows      : int 246
      $ nunits     : int 54675
      $ nqcunits   : int 9
      $ refseq     : chr ""
      $ chiptype   : chr "HG-U133_Plus_2"
      $ filename   : chr "annotationData/chipTypes/HG-U133_Plus_2/HG-U133_Plus_2.cdf"
      $ rows       : int 1164
      $ cols       : int 1164
      $ probesets  : int 54675
      $ qcprobesets: int 9
      $ reference  : chr ""
20150123 21:47:56|     unitNames:
      chr [1:54675] "AFFX-BioB-5_at" "AFFX-BioB-M_at" "AFFX-BioB-3_at" "AFFX-BioC-5_at" ...
20150123 21:47:56|     qcUnitLengths:
      num [1:9] 15966 174 230 1658 69 ...
20150123 21:47:56|     unitLengths:
      num [1:54675] 116 116 116 116 116 116 116 116 116 116 ...
               used (Mb) gc trigger (Mb) max used (Mb)
     Ncells  561416 30.0     984024 52.6   899071 48.1
     Vcells 1120064  8.6    1925843 14.7  1515846 11.6
               used (Mb) gc trigger (Mb) max used (Mb)
     Ncells  562232 30.1     984024 52.6   899071 48.1
     Vcells 1010995  7.8    5484388 41.9  6516658 49.8
20150123 21:47:57|    Writing...done
20150123 21:47:57|   Creating CDF header with source CDF as template...done
20150123 21:47:57|   Writing QC units...
20150123 21:47:57|    Rearranging QC unit cell indices...
20150123 21:47:57|     Units: 20150123 21:47:57|    
20150123 21:47:57|    Rearranging QC unit cell indices...done
             used (Mb) gc trigger (Mb) max used (Mb)
    Ncells 562529 30.1     984024 52.6   984024 52.6
    Vcells 994748  7.6    3510008 26.8  6516658 49.8
20150123 21:47:57|   Writing QC units...done
20150123 21:47:57|   Number of units: 54675
20150123 21:47:57|   Argument 'ram': 1.000000
20150123 21:47:57|   Average unit length: 116.000000 bytes
20150123 21:47:57|   Number of chunks: 2 (34482 units/chunk)
20150123 21:47:57|   Reading, extracting, and writing units...
20150123 21:47:57|    Chunk #1 of 2 (34482 units)
20150123 21:47:57|    Reading CDF list structure...
20150123 21:47:59|    Reading CDF list structure...done
 => RAM: 132MB
Error in (...) : 3 arguments passed to '(' which requires 1
20150123 21:48:01| Could not locate monocell CDF. Will create one for chip type...done
20150123 21:48:01|Retrieving monocell CDF...done


When I tried other chip types, the same error was returned as well.
I have even cleared the ~/.Rcache/aroma.affymetrix/ folder, but that did not solve the problem either.
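
The cell counts reported in the log above can be cross-checked with a few lines of arithmetic. The near-square layout rule used here (rows = ceiling of the square root, columns = just enough to fit) is an assumption that happens to reproduce the logged numbers; the rule aroma.affymetrix actually uses is not shown in this thread.

```r
# Numbers taken from the log: 54675 units with one cell each, plus
# 5385 cells in the 9 QC units.
nUnits       <- 54675
cellsPerUnit <- 1
nQcCells     <- 5385

total <- nUnits * cellsPerUnit + nQcCells

# Assumed layout rule (not confirmed by the thread): smallest
# near-square grid that holds all cells.
nrows <- ceiling(sqrt(total))
ncols <- ceiling(total / nrows)
leftover <- nrows * ncols - total

cat(sprintf("Total cells: %d\n", total))
cat(sprintf("Dimension: %dx%d (=%d cells, %d left over)\n",
            nrows, ncols, nrows * ncols, leftover))
```

By the same arithmetic, a correct monocell CDF for this chip type cannot be 182x182 (33124 cells), since it would not hold 54675 units.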

Henrik Bengtsson

Jan 23, 2015, 10:36:26 AM
to aroma-affymetrix

This is odd for several reasons, e.g. I'm puzzled how you ended up with a monocell CDF previously but now it gives an error.  Let's troubleshoot more...

What does troubleshoot() output directly after you get that error?

Henrik


Henrik Bengtsson

Jan 23, 2015, 10:37:08 AM
to aroma-affymetrix


On Jan 23, 2015 7:36 AM, "Henrik Bengtsson" <h...@biostat.ucsf.edu> wrote:
>
> This is odd for several reasons, e.g. I'm puzzled how you ended up with a monocell CDF previously but now it gives an error.  Let's troubleshoot more...
>
> What does troubleshoot() output directly after you get that error?

I meant traceback()

Henrik Bengtsson

Jan 23, 2015, 11:50:05 AM
to aroma-affymetrix
I managed to reproduce this now:

Error in (...) : 3 arguments passed to '(' which requires 1
20150123 08:48:49| Could not locate monocell CDF. Will create one for chip type...done
20150123 08:48:49|Retrieving monocell CDF...done
> traceback()
5: .writeCdfUnits(con = con, srcUnits, verbose = verbose2)
4: createMonocellCdf.AffymetrixCdfFile(this, ..., verbose = less(verbose))
3: createMonocellCdf(this, ..., verbose = less(verbose))
2: getMonocellCdf.AffymetrixCdfFile(cdf, verbose = Arguments$getVerbose(-8, timestamp = TRUE))
1: getMonocellCdf(cdf, verbose = Arguments$getVerbose(-8, timestamp = TRUE))

I'll investigate and fix this asap.

/Henrik

Henrik Bengtsson

Jan 23, 2015, 2:52:43 PM
to aroma-affymetrix
Solved. Before I finalize a release, would you mind making sure it
works on your end? Install aroma.affymetrix 2.13.0-9001 by running
the following in a fresh R session:

source('http://callr.org/install#HenrikBengtsson/aroma.af...@2.13.0-9001')

Then retry with

library("aroma.affymetrix")
cdf <- AffymetrixCdfFile$byChipType("HG-U133_Plus_2")
cdfM <- getMonocellCdf(cdf, verbose=TRUE)
print(cdfM)

If it complains about a pre-existing *.tmp file, remove that file and retry.

As soon as you confirm it works, I'll make aroma.affymetrix 2.13.1
available, because this was a critical bug(*).

Thanks for the report

/Henrik

(*) DETAILS: Turns out to be due to a single stray newline. It should have been

affxparser::writeCdfUnits(...)

but it was:

affxparser::writeCdfUnits
(...)

Despite running 24 hours of regular package testing, this piece of
code was never tested. I've now added an explicit test on creating
and re-creating monocell CDF.
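
The effect of such a stray newline can be reproduced in plain R. This is a minimal sketch: write_units() is a hypothetical stand-in for the real writer function, and good() and bad() are made up for illustration. Inside braces, a newline ends any expression that is already complete, so the function name and the parenthesized `...` are parsed as two separate expressions.

```r
# Hypothetical stand-in for the real writer: counts its arguments.
write_units <- function(...) length(list(...))

# Intended form: one call expression.
good <- function(...) {
  write_units(...)
}

# Buggy form: the bare name is evaluated and discarded, then '(' is
# called with all of '...' as its arguments.
bad <- function(...) {
  write_units
  (...)
}

print(good(1, 2, 3))

# With three arguments, '(' fails because it accepts exactly one.
msg <- tryCatch(bad(1, 2, 3), error = function(e) conditionMessage(e))
print(msg)

# With a single argument there is no error at all: the argument is
# returned as-is and write_units() is never called.
print(bad(42))
```

The single-argument case shows why this kind of bug can go unnoticed until the code path is hit with several arguments.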

Qingzhou Zhang

Jan 23, 2015, 5:57:30 PM
to aroma-af...@googlegroups.com, h...@biostat.ucsf.edu
Thanks!

It seems to be working well.

> cdf <- AffymetrixCdfFile$byChipType("HG-U133_Plus_2") 
> cdfM <- getMonocellCdf(cdf, verbose=TRUE) 
Retrieving monocell CDF...
 Monocell chip type: HG-U133_Plus_2,monocell
 Locating monocell CDF...
  Pathname: 
 Locating monocell CDF...done
 Could not locate monocell CDF. Will create one for chip type...
 Could not locate monocell CDF. Will create one for chip type...done
Retrieving monocell CDF...done
> print(cdfM)
AffymetrixCdfFile:
Path: annotationData/chipTypes/HG-U133_Plus_2
Filename: HG-U133_Plus_2,monocell.CDF
File size: 9.63 MB (10098009 bytes)
Chip type: HG-U133_Plus_2,monocell
RAM: 0.00MB
File format: v4 (binary; XDA)
Dimension: 246x245
Number of cells: 60270
Number of units: 54675
Cells per unit: 1.10
Number of QC units: 9



Henrik Bengtsson

Jan 23, 2015, 7:17:30 PM
to aroma-affymetrix
Great. I've now made aroma.affymetrix 2.13.1 available, which is
installed the usual way:

source('http://callr.org/install#aroma.affymetrix')

Anyone who reads this should update this way.

/Henrik

Qingzhou Zhang

Jan 23, 2015, 7:55:16 PM
to aroma-af...@googlegroups.com, h...@biostat.ucsf.edu
Many, many thanks, Henrik! Well done!