Help: Anonymisation script determining PHI tags


Eva Herbst

Nov 6, 2024, 7:04:08 AM
to xnat_discussion
Hi everyone,

We are currently writing an anonymisation script and have written some code to replace all tags with anonymised info ("anon" or hashUIDs or numbers) while maintaining the correct data structure.

The next step is to determine which tags we actually want to keep and which we want to overwrite.
However, we were working with this list (https://www.dicomlibrary.com/dicom/dicom-tags/), which has over 2000 tags.
So manually checking which of these might contain patient info will be quite tedious.

Does anyone already have a list of PHI tags vs imaging data tags for MRI and CT?
I guess the list contains all possible DICOM tags. Are there any that are never used in MRI and CT? And has anyone already made a list of which imaging tags need to be kept for MRI and CT (not overwritten because they contain image info)?

We are also thinking of using blankValues on patient name, address and birth date to cover the case of multiple tags containing these, but there are still several fields that contain other patient info (e.g. pregnancy status, date of visit etc.). blankValues will also likely not be sufficient because of possible typos and alternate data encodings.

We are already using removeAllPrivateTags.

Another question: for OB tags, which can be various data types, can we overwrite these with anything? E.g. a string? Some OB info is important for the image AFAIK, but others might include patient info (e.g. (0002,0102) = OB Private Information)


Thank you very much!
Eva


David Clunie

Nov 6, 2024, 8:44:52 AM
to xnat_di...@googlegroups.com
Hi Eva

Here are some general thoughts on DICOM de-identification, which are not XNAT specific ...

I suggest that it is good practice to follow the standard:

- make sure you address all the specified data elements [1][2]
- make sure your de-identified images remain compliant [3] esp. wrt. replacement values

The standard may not be perfect, but at least it is maintained and represents the consensus
of a lot of people working in this space. Doing less may be exposing yourself to unnecessary
risk (on the subject of which, document your decisions in a risk analysis). There should be
enough "options" to cover most use cases (if not, let me know).

When you need to replace rather than remove values, take care the dummy values are compliant
with the VR (e.g., the string "anonymized" is not a valid date, nor a valid code string, UID,
etc.). When a data element can be removed or its value made zero length, do so, rather than
inserting a dummy value.
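
As a minimal sketch of the point about VR-compliant dummy values, the Python below checks a candidate replacement against simplified format rules for three VRs (DA, CS, UI). The function name is_valid_for_vr and the patterns are illustrative assumptions, not part of XNAT or DicomEdit, and they cover only a fraction of the rules in PS3.5.

```python
import re

# Simplified format checks for three VRs; the full rules are in DICOM PS3.5.
# These patterns are illustrative assumptions, not a complete validator.
VR_PATTERNS = {
    "DA": re.compile(r"[0-9]{8}"),                              # YYYYMMDD (no calendar check)
    "CS": re.compile(r"[A-Z0-9_ ]{1,16}"),                      # upper case, digits, space, underscore
    "UI": re.compile(r"(0|[1-9][0-9]*)(\.(0|[1-9][0-9]*))*"),   # dot-separated numeric components
}


def is_valid_for_vr(vr, value):
    """Return True if value is zero length or matches the simplified pattern for vr."""
    if value == "":
        return True  # zero length is acceptable where the element may be empty
    if vr == "UI" and len(value) > 64:
        return False  # UIDs are limited to 64 characters
    pattern = VR_PATTERNS.get(vr)
    if pattern is None:
        raise ValueError(f"no check defined for VR {vr}")
    return pattern.fullmatch(value) is not None
```

With these checks, the string "anonymized" is rejected for DA, CS and UI alike, whereas "19000101", "ANONYMIZED" and a well-formed UID pass.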

Do not attempt to de-identify the File Meta Information (group 0x0002); recreate it entirely [4],
which makes the question of OB (0002,0102) moot. In general, you cannot replace OB data
elements; they need to be removed if they may contain PHI (and aren't checked, e.g. Overlay
Data), or retained if they are safe (e.g., VOI LUT Data).

Removing all private data elements rather than keeping those known to be safe (e.g., [5])
may lead to unhappiness (esp. failure of quantitative downstream apps that depend on them).

Using a "remove values" approach on free text strings like blankValues is probably not very
robust, as opposed to performing some kind of more sophisticated analysis of the text. If
you can't do this, then it may be safer to use a "keep values" approach instead (esp. for
Study Description, Series Description, Protocol Name, etc. that may be more important than
other "descriptors" listed in [1] or determined from the VR).
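
One way to picture the "keep values" approach for descriptors, as a sketch only: a value is retained when it matches an entry in a reviewed allowlist and is blanked otherwise. The allowlist contents and the function keep_or_blank are hypothetical; a real list would come from reviewing the protocols used at your own site.

```python
# Hypothetical allowlist of Series Description values already reviewed as safe.
SAFE_SERIES_DESCRIPTIONS = {"T1 AXIAL", "T2 FLAIR", "DWI B1000", "CT CHEST"}


def keep_or_blank(value, safe_values=SAFE_SERIES_DESCRIPTIONS):
    """Keep a descriptor only if its normalised form is in the allowlist."""
    # Normalise case and whitespace so trivial variations still match.
    normalised = " ".join(value.upper().split())
    return normalised if normalised in safe_values else ""
```

Anything not recognised, such as a description with a free-text note appended, becomes zero length instead of relying on a pattern to spot the identifying part.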

See also the discussion of using XNAT in the recent MIDI-B challenge [6][7] and specifically
the comments wrt. handling unstructured text.

David

1. https://dicom.nema.org/medical/dicom/current/output/chtml/part15/chapter_E.html#sect_E.1.1
2. https://link.springer.com/article/10.1007/s10278-024-01182-y
3. https://www.dclunie.com/dicom3tools/dciodvfy.html
4. https://dicom.nema.org/medical/dicom/current/output/chtml/part15/chapter_E.html#para_41258edc-da3a-43bb-ae04-3734051a876b
5. https://wiki.cancerimagingarchive.net/display/Public/Submission+and+De-identification+Overview#:~:text=Private%20Tag%20Dictionary
6. https://wiki.nci.nih.gov/display/MIDI/2024+MIDI-B+Challenge+Workshop
7. https://docs.google.com/presentation/d/1Y4jiDIJDVl-vMwUuVGzvVPzcktP8jwTb/edit#slide=id.p1


Eva Herbst

Nov 6, 2024, 10:36:45 AM
to xnat_discussion
Hi David,

Thank you so much, that is super helpful!

Shortly after asking the question I also found the E section in the DICOM Standard which is exactly what I was looking for.
I agree using this and replacing required tags (and removing optional ones) seems like a better approach than using blankValues.

Thanks again for your help!
I will have a look at the other resources you sent as well.
And thank you for the clarification about the OB fields.

Best,
Eva

akluiber

Nov 6, 2024, 1:20:11 PM
to xnat_discussion
To add to the discussion, I think it would be useful if users could define tags they trust and run a whitelist operation: basically, the inverse of the tagsToRemove example in the language reference, i.e.:

tagsToKeep := {
    ...
    (0018,0015), //Body Part Examined
    (0018,0050), //Slice Thickness
    (0018,0060), //KVP
    (0018,0090), //Data Collection Diameter
    (0018,1020), //Software Versions
    (0033,{SOME_PCID}01), //some private tag
    (0099,{SOME_OTHER_PCID}01), //some other private tag
    ...
}

keepTags[ tagsToKeep ] //remove all other tags not in tagsToKeep
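
For comparison, the same whitelist idea can be sketched outside DicomEdit in plain Python. This is an illustration under assumptions: the dataset is modelled as a flat dict keyed by (group, element) pairs, keep_tags is a hypothetical helper, and nested sequences are ignored.

```python
# Tags are modelled as (group, element) integer pairs; values are arbitrary.
TAGS_TO_KEEP = {
    (0x0018, 0x0015),  # Body Part Examined
    (0x0018, 0x0050),  # Slice Thickness
    (0x0018, 0x0060),  # KVP
}


def keep_tags(dataset, tags_to_keep):
    """Return a copy of dataset containing only the allowlisted tags."""
    return {tag: value for tag, value in dataset.items() if tag in tags_to_keep}
```

A real whitelist would also have to include the elements the IOD requires (and the pixel data), otherwise the output would no longer be a compliant DICOM object.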

Eva Herbst

Nov 8, 2024, 5:05:11 AM
to xnat_discussion
Hi David,

Thanks again for the help.
Overall our approach is working well, however we still have issues with some VR types.
Specifically, some OB tags are listed in the NEMA Table E with De-identification Action Code D (so they should be replaced rather than removed).
We are not sure what to replace them with, since it says "replace with a non-zero length value that may be a dummy value and consistent with the VR".

E.g. Flow Identifier, Source Identifier, Frame Origin Timestamp, Encapsulated Document, Selector OB Value, Certificate of Signer.

I have the same question for VR type SQ. Some can be removed but not all (some are type D).

Thank you,
Eva

David Clunie

Nov 22, 2024, 4:16:51 AM
to xnat_di...@googlegroups.com
Hi Eva

The point of the replacement with dummy values is that sometimes it is necessary to
preserve compliance with the IOD. The Table in Annex E is in a way an "oversimplification"
of what the IODs in PS3.3 require.

E.g., removing EncapsulatedDocument from an Encapsulated PDF Storage SOP Class instance
would invalidate its compliance (as opposed to replacing it with a short but valid PDF
document value rendered as a blank page).

The same applies for Sequences - if it is type 1 or 2 where it occurs in the IOD, then
appropriate dummy Items are expected to be created.

To test this sort of thing, you might want to compare the output of dciodvfy or a
similar verification tool on the input and output of de-identification; the compliance
should be no worse after de-identification than it was before (recognizing that the
input may not be perfect).
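
The before/after comparison could be sketched along these lines. The code does not run any verification tool itself; it assumes you have already captured the tool's text report for the input and for the output, that the report has one message per line, and that error messages start with "Error". Those format details and the name new_errors are assumptions for illustration.

```python
def new_errors(before_report, after_report):
    """Return error lines present after de-identification but not before."""
    def error_lines(report):
        # Assumption: one message per line, errors prefixed with "Error".
        return {line.strip() for line in report.splitlines()
                if line.strip().startswith("Error")}

    return sorted(error_lines(after_report) - error_lines(before_report))
```

An empty result suggests compliance is no worse than it was on input; any line returned points at something the de-identification step itself introduced.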

David

Simon Doran

Nov 27, 2024, 3:59:44 AM
to xnat_discussion
Hi All,

  Thanks to David for being kind enough, in his remarks above, to point to the work that we did for the MIDI-B challenge. Taking part was a very instructive exercise.

  An important comment that's worth making is regarding David's statement "Removing all private data elements rather than keeping those known to be safe (e.g., [5]) may lead to unhappiness (esp. failure of quantitative downstream apps that depend on them)."

  One of the areas where we "lost marks" in our original "training" run for the MIDI-B challenge was in our approach to (a) private tags and (b) tags covered by the "Clean Descriptors" option in the DICOM Standard Part 15 Section E - David's ref. [1] above.

  In production, we adopt a "safety-first" approach and remove or replace almost all these tags (in a manner compliant with the DICOM standard), except the ones we know will be needed, for example the DICOM Series Description, and the diffusion b-values in an MRI scan. For the circumstances in which we use the data locally - trials where we have a good handle on the "downstream" requirements - this is a good and very appropriate approach. However, if you are preparing data for more general release (as performed by TCIA, for example), there is a completely different risk assessment that needs to be performed, and this needs to ensure maximum retention of the scientific value of data for future scientific studies, whose needs will not have been determined at the time of deidentification.

  @Eva, the facility that you are looking for is already in principle present in DicomEdit 6, as used by XNAT. If you look here, you can find a description of the retainPrivateTags directive, which should do exactly what you want. However, there is a caveat: during the course of the MIDI-B challenge, we discovered an issue with the implementation of this command. We have corrected this in our private version of the repo, but are discussing with the core XNAT Team at Flywheel when and if our change should go into a future release of DicomEdit. See our slides for details.

  It is worth saying that the MIDI-B challenge process got very technical in the end. Everybody participating in the final phase was able to do a very good job of removing patient information and the differences in scores at the end often came down to niceties of the DICOM standard and differences in interpretation of exactly what the role of the deidentification step was in terms of fixing files that were non-DICOM compliant at the start.

  My ambition is to take the learnings from MIDI-B, update our XNAT-compatible script and make it available to the community if this is of interest. However, this won't happen immediately, because of other work pressures.

  Best wishes,

Simon   

Eva Herbst

Nov 27, 2024, 4:29:34 AM
to xnat_discussion
Thank you so much Simon for your thorough reply!

You make good points about the level of anonymisation needed.
For now we are also pursuing a safety-driven approach, since we are only using it for our own studies, where we know which tags we do and do not need. On the MRIs we have tested so far, we were able to remove all private tags with removeAllPrivateTags.
We are currently also removing the CT series description and the diffusion b-values from MRI but you make a good point, we might need these unless we can get this info elsewhere in the clinic and map it back. I just worry in the series description that there might be notes including patient name.

We are also attempting to replace or remove other data listed in Table E from Nema.
We have not dealt with OB or SQ VR types, and our current script also causes errors (we can run half of it including removeAllPrivateTags so we are narrowing down which tags are causing issues).

"However, there is a caveat: during the course of the MIDI-B challenge, we discovered an issue with the implementation of this command. We have corrected this in our private version of the repo, but are discussing with the core XNAT Team at Flywheel when and if our change should go into a future release of DicomEdit. See our slides for details."
Is this referring to slide 14: "E.g., if there are no retained Private Tags in a block, we believe that the Private Creator tag should be removed from the DICOM file. This led to 20% of all our “errors”"?
I am confused by this, I thought the private creator ID is a private tag and would therefore be removed anyway with removeAllPrivateTags?
Or maybe I am not understanding what "block" means?

I had read previously about issues with the privateCreatorID. But from what I read, it was supposed to be replaced with a dummy variable and kept, although on your slides it seems like 
"If the private creator ID is missing from the DICOM or in some other DICOM emergency, bust this out. CODE. set["(0009,1010)", "fubar"]" from DICOM Edit reference

Making your script openly available would be great! Do you have any older versions openly documented from the challenge, so I can use them to inform our current work on our script?

Thank you,
Eva

Simon Doran

Nov 27, 2024, 5:12:08 AM
to xnat_discussion
Hi Eva,

  Apologies, my @Eva should have been @akluiber.

  But, with regard to the point you make, it is actually Slide 9: "Issue tracked down (after closure of training window) to a mismatch in assumptions between DicomEdit and pydicom regarding the VRs of private tags in files encoded with Implicit VR LittleEndian transfer syntax." Below is a fuller analysis of the issue in the paper that will accompany the formal write-up of the MIDI-B challenge (not sure when the publication date of that will be).

  Best wishes,

Simon


More problematically, when we included TCIA’s more extensive set of “safe” private tags, we discovered a previously unreported issue with the output of DicomEdit when the retainPrivateTags option is used with DICOM files encoded using the implicit value representation (VR) transfer syntax. Analysis of DicomEdit’s open-source code showed the order of operations in the algorithm to be (1) compile a list of private tag values for the tags to be retained; (2) delete all private tags; (3) rewrite the retained private tags to the output DICOM file. For files with implicit VR, the “correct” VR is by definition unknown; indeed, in the TCIA-supplied .csv file listing the “safe” tags to keep, a large number of tags have multiple associated VRs.

This leads to the unfortunate circumstance in which DicomEdit makes a different assumption about the output VR for some tags from the one pydicom uses to read the data. An error condition then occurs when a Python application attempts to read the rewritten DICOM file. Further compounding the problem, this issue caused the TCIA assessment script to fail altogether, such that no discrepancy report was generated and this meant we were not able to conduct further validation runs, compromising our anonymisation script development. We eventually solved the problem prior to the final test by changing the behaviour of DicomEdit to: (1) compile a list of private tags present in the file but not in the retain list; (2) delete these tags. Since this operation does not involve rewriting tags, pydicom can read the resulting file.
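
The revised order of operations described above can be illustrated with a small sketch. It is not DicomEdit's code: the dataset is modelled as a dict keyed by (group, element) pairs, drop_unretained_private_tags is a hypothetical name, and Private Creator elements are treated like any other private tag, so they must appear in the retain set to survive.

```python
def is_private(tag):
    """A DICOM tag is private when its group number is odd."""
    group, _element = tag
    return group % 2 == 1


def drop_unretained_private_tags(dataset, retain):
    """Delete private tags that are not in retain; never rewrite retained ones."""
    result = dict(dataset)  # shallow copy: retained values are the same objects
    for tag in list(result):
        if is_private(tag) and tag not in retain:
            del result[tag]
    return result
```

Because retained elements are never deleted and re-created, no decision about their VR has to be made, which is the property that avoided the read errors described above.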

Simon Doran

Nov 27, 2024, 5:15:04 AM
to xnat_discussion
Re the PrivateCreator tag issue, the MIDI-B evaluation template assumed that the PrivateCreator tag should be retained even if it had been determined that all private tags in the block had been removed by the anonymisation. Our argument is that if there are no remaining private tags, the block should not be there at all and we removed the Private Creator tag. 
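
To make the "block" idea concrete, here is a sketch of the rule argued for above (not the MIDI-B template's assumption): a Private Creator element (gggg,00xx) reserves the elements (gggg,xx00) to (gggg,xxFF), and it is dropped when none of those remain. Tags are modelled as (group, element) pairs and remove_orphan_private_creators is a hypothetical helper.

```python
def remove_orphan_private_creators(tags):
    """Drop Private Creator tags whose reserved block holds no remaining elements."""
    def is_creator(tag):
        group, element = tag
        return group % 2 == 1 and 0x0010 <= element <= 0x00FF

    # Blocks still in use: (group, block number) for every private data element.
    used_blocks = {(group, element >> 8) for group, element in tags
                   if group % 2 == 1 and element >= 0x1000}
    # A creator (gggg,00xx) is kept only if block xx in group gggg is in use.
    return {tag for tag in tags if not is_creator(tag) or tag in used_blocks}
```

For example, with (0019,0010) and (0019,1001) present the creator stays, while a lone (0029,0011) with no elements in (0029,1100) to (0029,11FF) is removed.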