Missingness

Ramiro Magno

Apr 25, 2025, 4:45:17 PM
to Alzheimer's Disease Neuroimaging Initiative (ADNI) Data
Hello ADNIers,

After reviewing the inst_about_data.pdf document, I still have a few uncertainties regarding how missing values are represented.

According to the documentation, missingness is most often indicated by the code -4, but other codes (e.g., -1) can also signify missing data. Additionally, blank fields may represent missingness as well.

I would like to confirm: do these rules apply simultaneously to the same variable, or are they variable-specific?

For example, in the DXSUM dataset, many predicate variables such as DXNORM and DXAD are described in the data dictionary with only:

1 = Yes

There is no explicit mention of 0 = No for these variables (whereas for other variables, such as DXPARK, a 0 is defined).

Upon inspecting the data, for variables like DXNORM, I observe values of 1, -4, and blanks.

My question is:

    Are both -4 and blanks intended to represent missing data in these variables?

    Or does a blank carry a different meaning (e.g., potentially indicating "No") in some cases?

In my current processing, I convert both blanks and -4 to NA in R. However, I am concerned that I might be conflating two distinct types of information (i.e., true missingness vs. an implicit "No").
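
A minimal sketch of that recoding in base R, run on a made-up vector rather than real DXSUM data. The helper name `recode_missing`, its `missing_codes` argument, and the `origin` labels are illustrative assumptions, not part of any ADNI tooling. It records where each NA came from, so the -4 versus blank distinction is still available after both have been collapsed.

```r
# Hypothetical helper: map sentinel codes and blanks to NA, but remember
# which of the two produced each NA.
recode_missing <- function(x, missing_codes = c(-4)) {
  x <- trimws(as.character(x))                 # treat raw fields as text
  is_blank <- is.na(x) | x == ""
  is_code  <- !is_blank & x %in% as.character(missing_codes)
  value <- suppressWarnings(as.integer(x))
  value[is_blank | is_code] <- NA_integer_
  origin <- ifelse(is_blank, "blank", ifelse(is_code, "code", "observed"))
  data.frame(value = value, origin = origin, stringsAsFactors = FALSE)
}

raw <- c("1", "-4", "", NA, "1")               # toy input, not ADNI data
out <- recode_missing(raw)
print(out)
```

Keeping `origin` next to `value` means the two kinds of missingness can be merged for analysis and still be told apart later if needed.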

For reference, here is a summary of DXNORM and DIAGNOSIS counts from the raw DXSUM data:

> dx_sum() |> dplyr::count(DXNORM, DIAGNOSIS)
# A tibble: 7 × 3
  DXNORM DIAGNOSIS     n
   <int>     <int> <int>
1     -4         2  1606
2     -4         3  1132
3      1         1  1130
4     NA         1  4564
5     NA         2  4574
6     NA         3  1744
7     NA        NA    37

Any clarification you could provide would be greatly appreciated. 

Thank you very much!

Ramiro

Naomi H Saito

Apr 25, 2025, 6:41:12 PM
to Alzheimer's Disease Neuroimaging Initiative (ADNI) Data
Hi Ramiro,
Some variables were used only in particular phases.
For example, DXNORM and DXAD were used only in ADNI1.

So ADNI1 observations have either 1 or -4, and nothing was entered for any other phase. (If you add "Phase" to your crosstab, you will see this.)
 
Also, these particular variables can be used together with the DIAGNOSIS variable.

So, if PHASE = ADNI1 and DIAGNOSIS = 1 (CN), then they all have DXNORM = 1.

For PHASE = ADNI1 and DIAGNOSIS = 3 (Dementia), many of them have DXAD = 1 (Alzheimer's Disease = yes). Those with DXAD = -4 and DIAGNOSIS = 3 should have DXOTHDEM = 1 (Other Dementia (not Alzheimer's Disease) = yes).
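
The two rules above can be written as checks on the ADNI1 rows. This is only a sketch: the data frame is invented for illustration (not real DXSUM records), the column names follow the thread, and the helper `check_adni1_rules` is hypothetical.

```r
# Toy rows, invented for illustration only.
dx <- data.frame(
  PHASE     = c("ADNI1", "ADNI1", "ADNI1", "ADNI1", "ADNI2"),
  DIAGNOSIS = c(1, 2, 3, 3, 1),
  DXNORM    = c(1, -4, -4, -4, NA),
  DXAD      = c(-4, -4, 1, -4, NA),
  DXOTHDEM  = c(-4, -4, -4, 1, NA)
)

check_adni1_rules <- function(d) {
  a1 <- d[d$PHASE %in% "ADNI1", ]
  # Rule 1: DIAGNOSIS = 1 (CN) implies DXNORM = 1.
  is_cn <- a1$DIAGNOSIS %in% 1
  cn_ok <- all(a1$DXNORM[is_cn] %in% 1)
  # Rule 2: DIAGNOSIS = 3 with DXAD = -4 implies DXOTHDEM = 1.
  non_ad <- a1$DIAGNOSIS %in% 3 & a1$DXAD %in% -4
  dem_ok <- all(a1$DXOTHDEM[non_ad] %in% 1)
  c(cn_ok = cn_ok, dem_ok = dem_ok)
}

res <- check_adni1_rules(dx)
print(res)
```

Using `%in%` rather than `==` keeps the comparisons NA-safe, which matters once blanks from other fields are mixed in.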

On the other hand, DXPARK was used in all phases; however, the CODE entry in DATADIC for ADNI1 lists only 1 = Yes, while the other phases use 1 = Yes and 0 = No. (DXPARK for ADNI1 shows either 1 or -4.)

When I look up variables in the DATADIC table, I also check by phase.
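
A sketch of that per-phase lookup, on a toy stand-in for DATADIC. The column names (`Phase`, `TBLNAME`, `FLDNAME`, `CODE`) and the exact formatting of the code strings are assumptions to be checked against the real file; the helper `codes_by_phase` is hypothetical.

```r
# Toy dictionary rows mirroring what the thread describes; the layout of
# the real DATADIC table may differ.
datadic <- data.frame(
  Phase   = c("ADNI1", "ADNI2", "ADNI1"),
  TBLNAME = "DXSUM",
  FLDNAME = c("DXPARK", "DXPARK", "DXNORM"),
  CODE    = c("1=Yes", "1=Yes; 0=No", "1=Yes"),
  stringsAsFactors = FALSE
)

# List the documented codes of one field, separately for each phase.
codes_by_phase <- function(dic, field) {
  rows <- dic[dic$FLDNAME == field, ]
  split(rows$CODE, rows$Phase)
}

park <- codes_by_phase(datadic, "DXPARK")
print(park)
```

Fields whose code list differs between phases, as DXPARK does here, are the ones where a pooled recoding deserves a second look.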

Naomi

Ramiro Magno

Apr 29, 2025, 11:19:52 AM
to Alzheimer's Disease Neuroimaging Initiative (ADNI) Data
Thank you Naomi for the quick and very useful reply!

So I think the bottom line is that it seems reasonable to treat both -4 and blanks as NA, since both represent missing information.

The distinction appears to be that:

    In ADNI1, where these variables were in use, a value that was not entered is explicitly coded as -4; in the cross-tabulation below, DXNORM = -4 appears only alongside DIAGNOSIS 2 or 3.

    In later phases (ADNIGO, ADNI2, ADNI3, ADNI4), where these variables were not used, the field is simply blank (NA in R), i.e., missing by design.

Thus, although the technical reason for missingness differs slightly, combining -4 and blanks into a single NA value for analysis is appropriate and should not introduce bias for most applications.

The cross-tabulation you mentioned:

> dx_sum() |> dplyr::count(PHASE = as_phase_fct(PHASE), DXNORM, DIAGNOSIS)
# A tibble: 17 × 4
   PHASE  DXNORM DIAGNOSIS     n
   <fct>   <int>     <int> <int>
 1 ADNI1      -4         2  1606
 2 ADNI1      -4         3  1132
 3 ADNI1       1         1  1130
 4 ADNIGO     NA         1    89
 5 ADNIGO     NA         2   323
 6 ADNIGO     NA         3    63
 7 ADNI2      NA         1  1956
 8 ADNI2      NA         2  2608
 9 ADNI2      NA         3  1107
10 ADNI3      NA         1  1809
11 ADNI3      NA         2  1274
12 ADNI3      NA         3   448
13 ADNI3      NA        NA    35
14 ADNI4      NA         1   710
15 ADNI4      NA         2   369
16 ADNI4      NA         3   126
17 ADNI4      NA        NA     2
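
The table above can be collapsed to show which encoding each phase uses. This sketch re-enters the 17 printed rows by hand (the counts are copied from the output above, not re-derived from DXSUM); the `encoding` labels are an illustrative choice.

```r
# Counts copied from the cross-tabulation printed above.
counts <- data.frame(
  PHASE = c(rep("ADNI1", 3), rep("ADNIGO", 3), rep("ADNI2", 3),
            rep("ADNI3", 4), rep("ADNI4", 4)),
  DXNORM = c(-4, -4, 1, rep(NA, 14)),
  DIAGNOSIS = c(2, 3, 1,  1, 2, 3,  1, 2, 3,  1, 2, 3, NA,  1, 2, 3, NA),
  n = c(1606, 1132, 1130,  89, 323, 63,  1956, 2608, 1107,
        1809, 1274, 448, 35,  710, 369, 126, 2)
)

# Label each row by how DXNORM is encoded.
counts$encoding <- ifelse(is.na(counts$DXNORM), "blank",
                          ifelse(counts$DXNORM == -4, "-4", "value"))

# Total observations per phase and encoding.
by_phase <- xtabs(n ~ PHASE + encoding, data = counts)
print(by_phase)
```

In this summary, -4 and observed values occur only in ADNI1, and every other phase is entirely blank for DXNORM.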
