Best way to get the most subjects with CSF data?


Toni Saari

Nov 23, 2023, 1:54:59 AM
to Alzheimer's Disease Neuroimaging Initiative (ADNI) Data

Hi,

I have been getting acquainted with the ADNI data files and have found answers to most of my questions in the documentation and in this group. However, I want to make sure I understand the use of the CSF data correctly. Before getting access to the data, I asked how many participants are cognitively normal and have CSF amyloid beta data available at baseline. I was kindly told that the number is 511, but now I am unable to reproduce this figure and am wondering whether I am doing something wrong that is costing me n.

Here is what I have tried:

1. Using the adnimerge library and data set in R as follows:

library(ADNIMERGE)
library(dplyr)

dd <- adnimerge
abeta <- dd[complete.cases(dd[, "ABETA"]), ]
cnbl <- abeta[which(abeta$DX == 'CN' & abeta$VISCODE == 'bl'), ]

And I am left with a sample of 369 observations.
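For reference, here is a toy illustration (made-up rows, not real ADNI values) of one way this kind of filter can quietly lose n: rows where ABETA is present but DX is missing at baseline are dropped by which(), since NA comparisons are excluded.

```r
# Toy data frame mimicking the adnimerge columns used above
dd_toy <- data.frame(
  RID     = 1:4,
  VISCODE = "bl",
  DX      = c("CN", NA, "CN", "MCI"),
  ABETA   = c(980, 1200, NA, 650)
)

# Same two-step filter as in my script
abeta_toy <- dd_toy[complete.cases(dd_toy[, "ABETA", drop = FALSE]), ]
cnbl_toy  <- abeta_toy[which(abeta_toy$DX == "CN" & abeta_toy$VISCODE == "bl"), ]

nrow(abeta_toy)  # 3 rows have an ABETA value
nrow(cnbl_toy)   # only 1 survives: the row with DX == NA is silently dropped
```

So subjects with a missing DX at baseline never show up in the CN count, even if they have CSF data.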

2. Using the UPENNBIOMK_MASTER_FINAL_08Nov2023.csv file and following the directions in a recent thread (https://groups.google.com/g/adni-data/c/58y848JRNQQ):

csf <- read.csv("UPENNBIOMK_MASTER_FINAL_08Nov2023.csv") #n=7120; multiple rows per subject due to different batches

csfElecsys <- csf[which(csf$VISCODE == 'bl' & csf$BATCH %in% c('UPENNBIOMK9', 'UPENNBIOMK10', 'UPENNBIOMK12', 'UPENNBIOMK13')), ] #n=954

csfElecsys <- csfElecsys %>%
  distinct(EXAMDATE, .keep_all = TRUE) #n=620; keeps first row of multiple baseline batches

ab_dd_upenn <- merge(csfElecsys, dd, by = c("RID", "VISCODE")) #n=620; merge because CSF file does not have diagnosis

ab_cn_upenn <- ab_dd_upenn[which(ab_dd_upenn$DX == 'CN'), ] #n=224

And I get 224 observations.

I am not sure how I should be able to get over 500 individuals with CSF Abeta data and normal cognition at baseline. Is merging with ADNIMERGE for diagnosis a mistake?

Thanks in advance!

Toni

Dave E

Nov 23, 2023, 9:37:52 AM
to Alzheimer's Disease Neuroimaging Initiative (ADNI) Data
Hi Toni,
Just as an interested data user (and not a particular expert, in CSF or otherwise), a couple of thoughts:
  1. What is the rationale for taking distinct() exam dates? I'm not quite sure I follow the 'multiple baseline batches' comment. This actually (I think) has the effect of removing different subjects assessed on the same date. That may, of course, be what your research question requires. Skipping this constraint keeps ~900 subjects at this step rather than your 620. After merging, I then get 361 DX == 'CN' subjects rather than your 224.
  2. What was the basis or source of the 511 CN with Abeta data? It does not appear to be the ADNIMERGE dataset, as that has only 270 CN subjects with an ABETA value at VISCODE == "bl". I have not yet compared the values from dd$ABETA against your csf$BATCH selections.
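To illustrate the first point on toy rows (made-up values, not ADNI data): an ungrouped distinct() keeps one row per exam date across all subjects, whereas grouping by RID first keeps one row per subject per date.

```r
library(dplyr)

# Toy data: two different subjects sampled on the same date, one of whom
# has two batch rows for that date
csf_toy <- data.frame(
  RID      = c(1, 2, 2),
  EXAMDATE = c("2011-05-01", "2011-05-01", "2011-05-01"),
  BATCH    = c("UPENNBIOMK9", "UPENNBIOMK9", "UPENNBIOMK10")
)

# Ungrouped distinct() keeps one row per date overall, losing subject 2
ungrouped <- csf_toy %>%
  distinct(EXAMDATE, .keep_all = TRUE)

# Grouping by RID first keeps one row per subject per date
grouped <- csf_toy %>%
  group_by(RID) %>%
  distinct(EXAMDATE, .keep_all = TRUE) %>%
  ungroup()

nrow(ungrouped)  # 1
nrow(grouped)    # 2
```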
Apologies if I have misunderstood your approach, but feel free to get back.
Regards,
Dave

Toni Saari

Nov 24, 2023, 3:10:21 AM
to Alzheimer's Disease Neuroimaging Initiative (ADNI) Data
Hey Dave,

Thanks for your interest and comments; they were helpful in sorting out my errors! Regarding your first point, you are indeed right. There were multiple batches for the same exam date in the CSF data frame (as Danielle explained in the thread I linked in the original post), but once I narrowed it down to the selected batches, that no longer seems to be as big of a problem. My approach with the distinct exam dates was also wrong due to the issue you mentioned; when grouping by RID first and then requiring distinct exam dates, the n drops by only a few hundred. Below is the updated approach, which gets me to 332 at baseline.

#Load UPENN MASTER data
csf <- read.csv("UPENNBIOMK_MASTER_FINAL_08Nov2023.csv") #n=7120; multiple rows due to different batches
csfElecsys <- csf[which(csf$BATCH %in% c('UPENNBIOMK9', 'UPENNBIOMK10', 'UPENNBIOMK12', 'UPENNBIOMK13')), ] #n=3418

csfElecsys <- csfElecsys %>%
  group_by(RID) %>%
  distinct(EXAMDATE, .keep_all = TRUE) %>%
  ungroup() #n=3173

ab_dd_upenn <- merge(csfElecsys, dd, by = c("RID", "VISCODE")) #n=1500; merge because CSF file does not have diagnosis
ab_cn_upenn <- ab_dd_upenn[which(ab_dd_upenn$DX == 'CN' & ab_dd_upenn$VISCODE == 'bl'), ] #n=332; now also requiring just baseline visits
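As a sanity check on where the merge loses rows (3173 to 1500 above), dplyr's anti_join() can list the CSF rows that find no matching RID/VISCODE pair in ADNIMERGE. A toy sketch with made-up IDs:

```r
library(dplyr)

# Made-up rows standing in for the CSF file and ADNIMERGE
csf_toy <- data.frame(RID = c(1, 2, 3), VISCODE = c("bl", "bl", "m12"))
dd_toy  <- data.frame(RID = c(1, 2),    VISCODE = c("bl", "sc"))

# Rows of csf_toy with no RID/VISCODE match in dd_toy
unmatched <- anti_join(csf_toy, dd_toy, by = c("RID", "VISCODE"))
nrow(unmatched)  # 2: RID 2 has no 'bl' row in dd_toy, RID 3 no 'm12' row
```

On the real files, inspecting the unmatched rows (e.g., their VISCODE values) should show which visits are being dropped by the inner merge.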

I also tried using the diagnostic summary file to see if it performed better, but I get 331 cases when doing so.

dxsum <- read.csv("DXSUM_PDXCONV_ADNIALL_08Nov2023.csv") #Different phases of ADNI use different DX variables
dxsum <- dxsum %>%
  mutate(CN = case_when(Phase == 'ADNI1' & DXCURREN == 1 ~ 1,
                        Phase %in% c('ADNI2', 'ADNIGO') & DXCHANGE == 1 ~ 1,
                        Phase == 'ADNI3' & DIAGNOSIS == 1 ~ 1,
                        TRUE ~ 0)) #simple indicator of normal cognition for testing whether the DXSUM file performs better

csf <- read.csv("UPENNBIOMK_MASTER_FINAL_08Nov2023.csv") #n=7120; multiple rows due to different batches
csfElecsys <- csf[which(csf$BATCH %in% c('UPENNBIOMK9', 'UPENNBIOMK10', 'UPENNBIOMK12', 'UPENNBIOMK13')), ] #n=3418

csfElecsys <- csfElecsys %>%
  group_by(RID) %>%
  distinct(EXAMDATE, .keep_all = TRUE) %>%
  ungroup() #n=3173

ab_dxsum_upenn <- merge(csfElecsys, dxsum, by = c("RID", "VISCODE")) #n=3157; merge because CSF file does not have diagnosis
ab_cn_dxsum_upenn <- ab_dxsum_upenn[which(ab_dxsum_upenn$CN == 1 & ab_dxsum_upenn$VISCODE == 'bl'), ] #n=331
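The phase-specific CN flag can be sanity-checked on made-up rows before applying it to the real DXSUM file:

```r
library(dplyr)

# Toy rows, one per phase convention plus a non-CN case (made-up values)
dx_toy <- data.frame(
  Phase     = c("ADNI1", "ADNI2", "ADNI3", "ADNI1"),
  DXCURREN  = c(1, NA, NA, 3),
  DXCHANGE  = c(NA, 1, NA, NA),
  DIAGNOSIS = c(NA, NA, 1, NA)
)

# Same case_when logic as above; NA comparisons fall through to 0
dx_toy <- dx_toy %>%
  mutate(CN = case_when(Phase == "ADNI1" & DXCURREN == 1 ~ 1,
                        Phase %in% c("ADNI2", "ADNIGO") & DXCHANGE == 1 ~ 1,
                        Phase == "ADNI3" & DIAGNOSIS == 1 ~ 1,
                        TRUE ~ 0))

dx_toy$CN  # 1 1 1 0
```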

Skipping the participant-wise distinct exam dates gets me to 361 and 360 observations for the ADNIMERGE and DXSUM approaches, respectively.

As for your second point, I am not quite sure how that number came about, as I received only a brief reply. It could be that a CN diagnosis at screening, not just at baseline, was also accepted as meeting the criteria. But I am curious to know whether there is something I am missing in my approach, because I will need to narrow the data down even further and every observation counts. :)
 
Best,
Toni
