Missingness

Ramiro Magno

Apr 25, 2025, 4:45:17 PM

to Alzheimer's Disease Neuroimaging Initiative (ADNI) Data

Hello ADNIers,

After reviewing the inst_about_data.pdf document, I still have a few uncertainties regarding how missing values are represented.

According to the documentation, missingness is most often indicated by the code -4, but other codes (e.g., -1) can also signify missing data. Additionally, blank fields may represent missingness as well.

I would like to confirm: do these rules apply simultaneously to the same variable, or are they variable-specific?

For example, in the DXSUM dataset, many predicate variables such as DXNORM and DXAD are described in the data dictionary with only:

1 = Yes

There is no explicit mention of 0 = No for these variables (whereas for other variables, such as DXPARK, a 0 is defined).

Upon inspecting the data, for variables like DXNORM, I observe values of 1, -4, and blanks.

My question is:

Are both -4 and blanks intended to represent missing data in these variables?

Or does a blank carry a different meaning (e.g., potentially indicating "No") in some cases?

In my current processing, I convert both blanks and -4 to NA in R. However, I am concerned that I might be conflating two distinct types of information (i.e., true missingness vs. an implicit "No").
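One way to avoid conflating the two cases is to convert to NA while keeping a record of how each value was originally represented. Below is a minimal sketch in base R, assuming a toy data frame in place of the real DXSUM table. The helper name `recode_missing` and the `_MISS` suffix are hypothetical and not part of any ADNI tooling.

```r
# Toy stand-in for a few DXSUM rows; the values are illustrative only.
dxsum_toy <- data.frame(
  PHASE  = c("ADNI1", "ADNI1", "ADNI2", "ADNI3"),
  DXNORM = c(1L, -4L, NA, NA)
)

# Hypothetical helper: replace the sentinel code with NA, but keep a
# companion column recording how the value was originally represented.
recode_missing <- function(df, var, sentinel = -4L) {
  x <- df[[var]]
  is_sentinel <- !is.na(x) & x == sentinel

  reason <- rep("observed", length(x))
  reason[is.na(x)] <- "blank"        # field was empty in the raw file
  reason[is_sentinel] <- "coded_-4"  # field held the explicit -4 code

  x[is_sentinel] <- NA
  df[[var]] <- x
  df[[paste0(var, "_MISS")]] <- factor(
    reason, levels = c("observed", "coded_-4", "blank")
  )
  df
}

dxsum_clean <- recode_missing(dxsum_toy, "DXNORM")
print(dxsum_clean)
```

With the companion column in place, analyses can use the NA values as usual, and the original distinction can still be recovered if a blank later turns out to mean something other than -4.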

For reference, here is a summary of DXNORM and DIAGNOSIS counts from the raw DXSUM data:

> dx_sum() |> dplyr::count(DXNORM, DIAGNOSIS)
# A tibble: 7 × 3
  DXNORM DIAGNOSIS     n
   <int>     <int> <int>
1     -4         2  1606
2     -4         3  1132
3      1         1  1130
4     NA         1  4564
5     NA         2  4574
6     NA         3  1744
7     NA        NA    37

Any clarification you could provide would be greatly appreciated.

Thank you very much!

Ramiro

Naomi H Saito

Apr 25, 2025, 6:41:12 PM

to Alzheimer's Disease Neuroimaging Initiative (ADNI) Data

Hi Ramiro,

Some variables are used only in particular phases.

For example, DXNORM and DXAD were used only for ADNI1.

So ADNI1 observations have either 1 or -4, and nothing was entered for any other phase. (If you add PHASE to your crosstab, you will see this.)

Also, these particular variables can be read together with the DIAGNOSIS variable.

So, if PHASE = ADNI1 and DIAGNOSIS = 1 (CN) then they all have DXNORM = 1.

For PHASE = ADNI1 and DIAGNOSIS = 3 (Dementia), many observations have DXAD = 1 (Alzheimer's Disease = Yes). Those with DXAD = -4 and DIAGNOSIS = 3 should have DXOTHDEM = 1 (Other Dementia (not Alzheimer's Disease) = Yes).

On the other hand, DXPARK was used in all phases; however, the CODE entry in DATADIC lists only 1 = Yes for ADNI1, whereas the other phases use 1 = Yes and 0 = No. (DXPARK for ADNI1 shows either 1 or -4.)

When I look up variables in the DATADIC table, I also check the entries by phase.
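The relationships described in this message can be checked with a short script. This is a minimal sketch in base R on a made-up data frame, assuming the column names PHASE, DIAGNOSIS, DXNORM, DXAD, and DXOTHDEM as they are named in this thread. The rows and the `is_code` helper are hypothetical and do not come from ADNI data.

```r
# Made-up rows for illustration; not real ADNI records.
dxsum_toy <- data.frame(
  PHASE     = c("ADNI1", "ADNI1", "ADNI1", "ADNI2"),
  DIAGNOSIS = c(1L, 3L, 3L, 1L),
  DXNORM    = c(1L, -4L, -4L, NA),
  DXAD      = c(-4L, 1L, -4L, NA),
  DXOTHDEM  = c(-4L, -4L, 1L, NA)
)

# TRUE only where the value is present and equals `code` (NA counts as FALSE).
is_code <- function(x, code) !is.na(x) & x == code

adni1 <- dxsum_toy$PHASE == "ADNI1"

# Rule 1: ADNI1 rows with DIAGNOSIS = 1 (CN) should have DXNORM = 1.
rule1_bad <- adni1 & is_code(dxsum_toy$DIAGNOSIS, 1L) &
  !is_code(dxsum_toy$DXNORM, 1L)

# Rule 2: ADNI1 rows with DIAGNOSIS = 3 and DXAD = -4 should have DXOTHDEM = 1.
rule2_bad <- adni1 & is_code(dxsum_toy$DIAGNOSIS, 3L) &
  is_code(dxsum_toy$DXAD, -4L) & !is_code(dxsum_toy$DXOTHDEM, 1L)

# Rows that break either rule; zero rows means both rules hold.
violations <- dxsum_toy[rule1_bad | rule2_bad, ]
print(nrow(violations))
```

Any rows returned in `violations` on the real table would be worth inspecting before deciding how to recode -4.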

Naomi




Ramiro Magno

Apr 29, 2025, 11:19:52 AM

to Alzheimer's Disease Neuroimaging Initiative (ADNI) Data

Thank you Naomi for the quick and very useful reply!

So I think the bottom line is that it seems reasonable to treat both -4 and blanks as NA, since both represent missing information.

The distinction appears to be that:

- In ADNI1, missing values are explicitly coded as -4, likely because those variables were not collected at the time.

- In later phases (e.g., ADNIGO, ADNI2, ADNI3, ADNI4), missingness is represented by blanks (NA in R), reflecting missing by design.

Thus, although the technical reason for missingness differs slightly, combining -4 and blanks into a single NA value for analysis is appropriate and should not introduce bias for most applications.
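Before collapsing the two codes, it may be worth confirming that -4 and blanks really are separated by phase. Below is a minimal sketch in base R, assuming a made-up data frame with the PHASE and DXNORM columns. The rows are illustrative and are not real ADNI records.

```r
# Made-up rows for illustration; not real ADNI records.
dxsum_toy <- data.frame(
  PHASE  = c("ADNI1", "ADNI1", "ADNIGO", "ADNI2", "ADNI3"),
  DXNORM = c(1L, -4L, NA, NA, NA)
)

# Cross-tabulate phase against the raw codes, keeping NA as its own column.
tab <- table(dxsum_toy$PHASE, dxsum_toy$DXNORM, useNA = "ifany")
print(tab)

# Does -4 occur only in ADNI1, and do blanks occur only outside ADNI1?
minus4_rows <- which(dxsum_toy$DXNORM == -4L)
blank_rows  <- which(is.na(dxsum_toy$DXNORM))

minus4_only_adni1 <- all(dxsum_toy$PHASE[minus4_rows] == "ADNI1")
blank_only_later  <- all(dxsum_toy$PHASE[blank_rows] != "ADNI1")
```

If both flags are TRUE, the phase column alone is enough to tell the two kinds of missingness apart after they have been merged into NA.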

The cross-tabulation you mentioned:

> dx_sum() |> dplyr::count(PHASE = as_phase_fct(PHASE), DXNORM, DIAGNOSIS)
# A tibble: 17 × 4
   PHASE  DXNORM DIAGNOSIS     n
   <fct>   <int>     <int> <int>
 1 ADNI1      -4         2  1606
 2 ADNI1      -4         3  1132
 3 ADNI1       1         1  1130
 4 ADNIGO     NA         1    89
 5 ADNIGO     NA         2   323
 6 ADNIGO     NA         3    63
 7 ADNI2      NA         1  1956
 8 ADNI2      NA         2  2608
 9 ADNI2      NA         3  1107
10 ADNI3      NA         1  1809
11 ADNI3      NA         2  1274
12 ADNI3      NA         3   448
13 ADNI3      NA        NA    35
14 ADNI4      NA         1   710
15 ADNI4      NA         2   369
16 ADNI4      NA         3   126
17 ADNI4      NA        NA     2

Ramiro Magno

Apr 29, 2025, 11:19:57 AM

to Alzheimer's Disease Neuroimaging Initiative (ADNI) Data

Okay, thanks Naomi!

So it seems reasonable to treat both -4 and blanks as NA, since both represent missing information.

> dx_sum() |> dplyr::count(PHASE = as_phase_fct(PHASE), DXNORM, DIAGNOSIS)
# A tibble: 17 × 4
   PHASE  DXNORM DIAGNOSIS     n
   <fct>   <int>     <int> <int>
 1 ADNI1      -4         2  1606
 2 ADNI1      -4         3  1132
 3 ADNI1       1         1  1130
 4 ADNIGO     NA         1    89
 5 ADNIGO     NA         2   323
 6 ADNIGO     NA         3    63
 7 ADNI2      NA         1  1956
 8 ADNI2      NA         2  2608
 9 ADNI2      NA         3  1107
10 ADNI3      NA         1  1809
11 ADNI3      NA         2  1274
12 ADNI3      NA         3   448
13 ADNI3      NA        NA    35
14 ADNI4      NA         1   710
15 ADNI4      NA         2   369
16 ADNI4      NA         3   126
17 ADNI4      NA        NA     2

The distinction appears to be that:

- In ADNI1, missing values are explicitly coded as -4, likely because those variables were not collected at the time.

- In later phases (e.g., ADNIGO, ADNI2, ADNI3, ADNI4), missingness is represented by blanks (NA in R), possibly reflecting missing by design or incomplete data entry.

Thus, although the technical reason for missingness differs slightly, combining -4 and blanks into a single NA value for analysis is appropriate and should not introduce bias for most applications.
