need to mask patient info in HL7 messages

harryoo...@hotmail.com

Sep 26, 2012, 9:40:24 PM

I need a script, awk preferred, to mask sensitive patient info in HL7
messages.

The line numbers below do not belong to the HL7 messages; I just added
them for the sake of clarity in this posting.

I have some log files containing thousands of HL7 messages, separated by
blank lines, with real patient data. I need to mask that sensitive
patient info before I can send these files to a third party (a Lab
Report Repository) for them to use.

A) Sample HL7 message :
1 MSH|^~\&|OPEN ENGINE|CLS|Egate|8832253|20120926150049||ORU^R01|Q521477659T517738211|P|2.3
2 PID|1|123456789^^^AB|123456789^^^8832253|777888999^^^ULI~2444690^^^PSID|Name,Masked||19010131|F|||123 Random St^^Calgary^AB^A1B 2D3^CA^H^^83|83|(123)222-3333||ENG|S||100033344555^^^8832253|789030200|||||||||||N
3 PV1|1|D|01362^^^8832253|UR|||112233^Attending, Doctor|||||||||||D|32112345|ab||||||||||||||||||||||||20120925170500|20120925213000
4 OBR|1|001TKPWNZ|0589313008^101MA|2922077^URINE BACTERIAL CULTURE^L01N^M URINE^^MI|||20120925192000|||^CONTRIBUTOR_SYSTEM^SCMLAB^^^^^^^Personnel||||20120925211100|URINE^^^Midstream|10882^Khorrami, Katayoun^004406||||UR-12-1234567||20120926150043||MA|F||1^^^20120925191900^^ST~^^^^^ST|||||||||20120925192000
5 OBX|1|TX|4384297^URINE BACTERIAL CULTURE^L01N^M URINE^^MI||*****Microbiology Urine*****|||A|||F|||20120926150043
6 OBX|2|TX|4384297^URINE BACTERIAL CULTURE^L01N^M URINE^^MI|||||A|||F|||20120926150043
7 OBX|3|TX|4384297^URINE BACTERIAL CULTURE^L01N^M URINE^^MI|| TEST: Urine Culture|||A|||F|||20120926150043

B) Some HL7 fields need_mask :
1 MSH|^~\&|OPEN ENGINE|CLS|Egate|8832253|20120926150049||ORU^R01|Q521477659T517738211|P|2.3
2 PID|1|PID-2.1_Need_Mask^^^AB|PID-3.1_Need_Mask^^^8832253|PID-4.1_Need_Mask^^^ULI~2444690^^^PSID|PID-5_Need_Mask||PID-7_Need_Mask|F|||PID-11.1_Need_Mask^^Calgary^AB^PID-11.5_Need_Mask^CA^H^^83|83|PID-11.13_Need_Mask||ENG|S||100037307051^^^8832253|789030200|||||||||||N
3 PV1|1|D|01362^^^8832253|UR|||PV1-7.1_Need_Mask^PV1-7.2_Need_Mask|||||||||||D|PV1-19_Need_Mask|ab||||||||||||||||||||||||20120925170500|20120925213000
4 OBR|1|001TKPWNZ|0589313008^101MA|2922077^URINE BACTERIAL CULTURE^L01N^M URINE^^MI|||20120925192000|||^CONTRIBUTOR_SYSTEM^SCMLAB^^^^^^^Personnel||||20120925211100|URINE^^^Midstream|10882^Khorrami, Katayoun^004406||||UR-12-0178975||20120926150043||MA|F||1^^^20120925191900^^ST~^^^^^ST|||||||||20120925192000
5 OBX|1|TX|4384297^URINE BACTERIAL CULTURE^L01N^M URINE^^MI||*****Microbiology Urine*****|||A|||F|||20120926150043
6 OBX|2|TX|4384297^URINE BACTERIAL CULTURE^L01N^M URINE^^MI|||||A|||F|||20120926150043
7 OBX|3|TX|4384297^URINE BACTERIAL CULTURE^L01N^M URINE^^MI|| TEST: Urine Culture|||A|||F|||20120926150043

If I have another file that tells what-to-mask-to-what, would it be easier?
Like ...

Format: <mask_to>, <HL7_Header>, <field> [,<subfield>] // Comment

-- mask_spec.txt--
123456789;PID;2;1 // Patient ID
123456789;PID;3;1 // Patient ID
123456789;PID;4;1 // Patient ID
Name,Masked;PID;5 // Patient Name
19010131;PID;7 // Date of Birth
123 Random Street;PID;11;1 // Street Address
A1B 2D3;PID;11;5 // Postal Code
(123)222-3333;PID;11;13 // Phone Number
112233;PV1;7;1 // Physician ID
Attending,Doctor;PV1;7;2 // Attending Doctor Name
-- mask_spec.txt--

Any help appreciated.
TIA

Janis Papanagnou

Sep 27, 2012, 1:42:32 AM

Your sample data are quite confusing and not well suited to showing what
you want. Also, the partial field substitutions are not clear.

> If I have another file that tells what-to-mask-to-what, would it be easier?

It depends.

If you have to mask just specific fields in specific records I'd choose
another approach; mask those fields and, if necessary, save the mapping
in an independent file. Something like

awk 'BEGIN { FS=OFS="|" }
$1=="PID" {
    old_field3 = $3
    new_field3 = "mask3-" ++mask3count
    $3 = new_field3
    print old_field3, new_field3 >"mapping-file" # if necessary

    old_field11 = $11
    new_field11 = "mask11-" ++mask11count
    $11 = new_field11
    print old_field11, new_field11 >"mapping-file" # if necessary

    # etc. for other fields
}

$1=="PV1" {
    # similar as above for other record types
}

# etc. for more record types

{ print $0 } # print every record exactly once, masked or not

' in_data > out_data

That outlined approach can be made more concise by introducing a function
where the field number is a parameter.
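
A minimal sketch of that idea follows. The names mask_fields, mask_field
and count are hypothetical, the mapping file path is passed in as an
argument, and awk fields 3 and 6 are chosen only to keep the sample
record short.

```shell
mask_fields() {
    # $1: path of the mapping file that receives old|new pairs
    awk -v map="$1" '
    # Replace field n with a numbered placeholder and log the mapping.
    function mask_field(n,    old) {
        old = $n
        $n = "mask" n "-" (++count[n])
        print old, $n > map
    }
    BEGIN { FS = OFS = "|" }
    $1 == "PID" { mask_field(3); mask_field(6) }
    { print }    # every record is printed once, masked or not
    '
}

map_file=$(mktemp)
printf 'PID|1|111^^^AB|222|333|Name,Orig\nOBX|1|TX\n' | mask_fields "$map_file"
```

Adding another field then costs one extra mask_field() call instead of
four more lines per field.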

Your fields also seem to be only partly substituted (in some cases?), so
an actual substitution would have to use the match() and substr() functions
(or alternatively the sub()/gsub() functions) to replace only the relevant
part and keep the rest.
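
As an illustration of the match()/substr() variant, here is a sketch
that assumes "the relevant part" is everything before the first "^" of
awk field 3. The function name mask_before_caret and the mask value are
made up for the example.

```shell
mask_before_caret() {
    # $1: replacement text for the part of field 3 before the first "^"
    awk -v mask="$1" 'BEGIN { FS = OFS = "|" }
    $1 == "PID" {
        if (match($3, /\^/))            # RSTART is the position of the first "^"
            $3 = mask substr($3, RSTART)
        else
            $3 = mask                   # no sub-fields: replace the whole field
    }
    { print }'
}

printf 'PID|1|777777777^^^AB|F\n' | mask_before_caret 123456789
```

The separators and the trailing sub-fields (here "^^^AB") pass through
untouched because only the text before RSTART is replaced.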

If you provide more concise sample data and an accurate description of
your substitution rules we can help you in more detail, and likely provide
an even simpler solution.

Janis

Ed Morton

Sep 27, 2012, 8:42:39 AM

On 9/26/2012 8:40 PM, harryoo...@hotmail.com wrote:
> I need a script, awk preferred, to mask sensitive patient info in HL7
> messages.
>
> The line numbers below do not belong to the HL7 messages; I just added
> them for the sake of clarity in this posting.
>
> I have some log files containing thousands of HL7 messages, separated by
> blank lines, with real patient data. I need to mask that sensitive
> patient info before I can send these files to a third party (a Lab
> Report Repository) for them to use.

It's good that you provided some sample input but couldn't you come up with
something much briefer that REPRESENTS your input instead of something which
presumably IS your in/out and is so lengthy with all those non-alpha-numeric
characters and wrapping lines? If you could put a little effort into that it'd
save everyone reading your post from having to put that effort into
understanding your data and so make it much more likely we'd take that time and
come up with the best answer for you.

The expected output for that input would be very useful too.

>
> If I have another file that tells what-to-mask-to-what, would it be easier?

Not unless the fields you want to "mask" vary for different input files or
something.

> Like ...
>
> Format: <mask_to>, <HL7_Header>, <field> [,<subfield>] // Comment

Your format above says comma-separated fields, but your data has semi-colon
separated fields.

> -- mask_spec.txt--
> 123456789;PID;2;1 // Patient ID
> 123456789;PID;3;1 // Patient ID
> 123456789;PID;4;1 // Patient ID
> Name,Masked;PID;5 // Patient Name
> 19010131;PID;7 // Date of Birth
> 123 Random Street;PID;11;1 // Street Address
> A1B 2D3;PID;11;5 // Postal Code
> (123)222-3333;PID;11;13 // Phone Number
> 112233;PV1;7;1 // Physician ID
> Attending,Doctor;PV1;7;2 // Attending Doctor Name
> -- mask_spec.txt--
>
> Any help appreciated.
> TIA
>

It looks like your fields are separated by "|"s and sub-fields by "^"s, and you
are numbering your fields starting at 0 (while awk starts them at 1) and your
sub-fields at 1. I THINK all you need to do is something like:

awk '
BEGIN { FS=OFS="|" }

$1 == "PID" {
    sub(/^[^^]+\^/,"123456789,",$3)
    sub(/^[^^]+\^/,"123456789,",$4)
    sub(/^[^^]+\^/,"123456789,",$5)

    $6 = "Name,Masked"
    $8 = "19010131"

    n = split($12,sf,/\^/)
    sf[1] = "123 Random Street"
    sf[5] = "A1B 2D3"
    sf[13] = "(123)222-3333"
    $12 = sep = ""
    for (i=1;i<=n;i++) {
        $12 = $12 sep sf[i]
        sep = "^"
    }
}

$1 == "PV1" {
    n = split($8,sf,/\^/)
    sf[1] = "112233"
    sf[2] = "Attending,Doctor"
    $8 = sep = ""
    for (i=1;i<=n;i++) {
        $8 = $8 sep sf[i]
        sep = "^"
    }
}

{ print }
' input_file

but without a simpler input file and the expected output it's hard to tell.
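
One detail in the three sub() calls is worth checking against real data:
the pattern /^[^^]+\^/ also matches the first "^", and the replacement
ends in a comma, so one separator is lost and a "," is added. The sketch
below compares the pattern as posted with a variant that touches only
the first sub-field. Whether that is the intended result is an
assumption, and the function name show_sub_difference is made up for
the example.

```shell
show_sub_difference() {
    # $1: one HL7 field value, e.g. 777777777^^^AB
    printf '%s\n' "$1" | awk '{
        posted = fixed = $0
        sub(/^[^^]+\^/, "123456789,", posted)  # as posted: also eats one "^", adds ","
        sub(/^[^^]+/,   "123456789",  fixed)   # assumed intent: first sub-field only
        print posted
        print fixed
    }'
}

show_sub_difference '777777777^^^AB'
```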

Regards,

Ed.

harryoo...@hotmail.com

Sep 27, 2012, 10:10:49 AM

Ed,

Your advice and solution are much appreciated.

Your code works well with the following simplified input.

$ fold -60 infile.txt
PID|1|777777777^^^AB|888888888^^^8832253|999999999^^^ULI~244
4690^^^PSID|Name,Orig||20010131|F|||123 Orig St^^Calgary^AB^
A9B 9D9^CA^H^^83|83|(666)666-666||ENG|S||100033344555^^^8832
253|789030200|||||||||||N
PV1|1|D|01362^^^8832253|UR|||555555^Doctor, Original||||||||
|||D|32112345|ab||||||||||||||||||||||||20120925170500|20120
925213000

$ ./mask.awk < infile.txt | fold -60
PID|1|123456789,^^AB|123456789,^^8832253|123456789,^^ULI~244
4690^^^PSID|Name,Masked||19010131|F|||123 Random Street^^Cal
gary^AB^A1B 2D3^CA^H^^83|83|(666)666-666||ENG|S||10003334455
5^^^8832253|789030200|||||||||||N

PV1|1|D|01362^^^8832253|UR|||112233^Attending,Doctor||||||||
|||D|32112345|ab||||||||||||||||||||||||20120925170500|20120
925213000

Thanks

harryoo...@hotmail.com

Sep 27, 2012, 10:41:56 AM

Janis,

I tried it out and your code works well.

$ cat mask.awk
awk 'BEGIN { FS=OFS="|" }
$1=="PID" {
    old_field3 = $3
    new_field3 = "123456789" ++mask3count
    $3 = new_field3
    print $0
}'

$ fold -60 infile.txt
PID|1|777777777^^^AB|888888888^^^8832253|999999999^^^ULI~244
4690^^^PSID|Name,Orig||20010131|F|||123 Orig St^^Calgary^AB^
A9B 9D9^CA^H^^83|83|(666)666-666||ENG|S||100033344555^^^8832
253|789030200|||||||||||N
PV1|1|D|01362^^^8832253|UR|||555555^Doctor, Original||||||||
|||D|32112345|ab||||||||||||||||||||||||20120925170500|20120
925213000

$ ./mask.awk < infile.txt | fold -60
PID|1|1234567891|888888888^^^8832253|999999999^^^ULI~2444690
^^^PSID|Name,Orig||20010131|F|||123 Orig St^^Calgary^AB^A9B
9D9^CA^H^^83|83|(666)666-666||ENG|S||100033344555^^^8832253|
789030200|||||||||||N

Thanks

harryoo...@hotmail.com

Sep 27, 2012, 11:25:20 AM

P.S.

It was my fault on the Phone Number mask_spec.
It should be
(123)222-3333;PID;13;1 // Phone Number
instead of
(123)222-3333;PID;11;13 // Phone Number
So the awk snippet should be
$14 = "(123)222-3333"
instead.
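
For the spec-file idea from the first post, a rough sketch could look
like the following. It assumes the semicolon-separated layout of
mask_spec.txt (mask;segment;field[;subfield] // comment), maps HL7
field N to awk field N+1, and leaves a record alone when the named
field or sub-field does not exist. The function name mask_by_spec and
all variable names are hypothetical.

```shell
mask_by_spec() {
    # $1: spec file, $2: HL7 file
    awk '
    BEGIN { FS = OFS = "|" }
    NR == FNR {                            # first file: load the rules
        sub(/[ \t]*\/\/.*/, "")            # strip the trailing // comment
        if (split($0, p, ";") < 3) next    # skip blank or malformed lines
        nr++
        val[nr] = p[1]; seg[nr] = p[2]
        fld[nr] = p[3] + 1                 # HL7 field N is awk field N+1
        cmp[nr] = (4 in p) ? p[4] + 0 : 0  # 0 means: replace the whole field
        next
    }
    {
        for (r = 1; r <= nr; r++) {
            if (seg[r] != $1 || fld[r] > NF) continue
            if (!cmp[r]) { $(fld[r]) = val[r]; continue }
            n = split($(fld[r]), c, /\^/)
            if (cmp[r] > n) continue       # no such sub-field: leave as is
            c[cmp[r]] = val[r]
            out = c[1]
            for (i = 2; i <= n; i++) out = out "^" c[i]
            $(fld[r]) = out
        }
        print
    }' "$1" "$2"
}

spec=$(mktemp); msgs=$(mktemp)
printf '%s\n' '123456789;PID;3;1 // Patient ID' \
              'Name,Masked;PID;5 // Patient Name' \
              '(123)222-3333;PID;13;1 // Phone Number' > "$spec"
printf '%s\n' 'PID|1|777777777^^^AB|888888888^^^8832253|999999999^^^ULI|Name,Orig||20010131|F|||123 Orig St^^Calgary|83|(666)666-666' > "$msgs"
mask_by_spec "$spec" "$msgs"
```

Only the first repetition of a repeating field (the part before "~") is
affected, since the sub-field index counts from the start of the field.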

Ed Morton

Sep 28, 2012, 8:49:53 AM

On 9/27/2012 9:10 AM, harryoo...@hotmail.com wrote:
> Ed,
>
> Your advice and solution are much appreciated.
>
> Your code works well with the following simplified input.

Then you could make it much more concise with a function, e.g. (untested):

function updsfs(srcS,deltasA,    srcA,tgtS,sep,i,n) {
    n = split(srcS,srcA,/\^/)
    for (i=1;i<=n;i++) {
        tgtS = tgtS sep (i in deltasA ? deltasA[i] : srcA[i])
        sep = "^"
    }
    split("",deltasA)   # clear every delta, including any index beyond n
    return tgtS
}

BEGIN { FS=OFS="|" }

$1 == "PID" {
    sf[1] = "123456789"
    $3 = updsfs($3,sf)

    sf[1] = "123456789"
    $4 = updsfs($4,sf)

    sf[1] = "123456789"
    $5 = updsfs($5,sf)

    $6 = "Name,Masked"
    $8 = "19010131"

    sf[1] = "123 Random Street"
    sf[5] = "A1B 2D3"
    $12 = updsfs($12,sf)

    $14 = "(123)222-3333"   # phone is PID-13, per the corrected mask_spec line
}

$1 == "PV1" {
    sf[1] = "112233"
    sf[2] = "Attending,Doctor"
    $8 = updsfs($8,sf)
}

{ print }

Regards,

Ed.

Eric

Sep 28, 2012, 2:37:37 PM

On 2012-09-27, harryoo...@hotmail.com <harryoo...@hotmail.com> wrote:
> I need a script, awk preferred, to mask sensitive patient info in HL7
> messages.
>
> The line numbers below do not belong to the HL7 messages; I just added
> them for the sake of clarity in this posting.
>
> I have some log files containing thousands of HL7 messages, separated by
> blank lines, with real patient data. I need to mask that sensitive
> patient info before I can send these files to a third party (a Lab
> Report Repository) for them to use.

Totally OT for this group, but are you sure you should be mapping
(say) all patient IDs to the same value? What about data for the same
patient? I would have said you needed a safe-haven list of real<->fake
IDs (and possibly some other fields).
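
A sketch of such a real<->fake list in awk, under a few assumptions:
the patient ID is the first sub-field of PID-3 (awk field 4), fake IDs
are simply numbered in order of first appearance, and the mapping file
is kept private. The function name pseudonymize_ids and the P%06d
format are made up for the example.

```shell
pseudonymize_ids() {
    # $1: file that receives the real|fake pairs (keep it private)
    awk -v map="$1" 'BEGIN { FS = OFS = "|" }
    $1 == "PID" {
        n = split($4, c, /\^/)              # PID-3: first sub-field is the ID
        if (n > 0 && c[1] != "") {
            if (!(c[1] in fake)) {          # first time this real ID is seen
                fake[c[1]] = sprintf("P%06d", ++seq)
                print c[1], fake[c[1]] > map
            }
            sub(/^[^^]*/, fake[c[1]], $4)   # same real ID, same fake ID
        }
    }
    { print }'
}

map_file=$(mktemp)
printf 'PID|1|x|111^^^A\nPID|1|x|222^^^A\nPID|1|x|111^^^B\n' | pseudonymize_ids "$map_file"
```

Records for the same patient stay linked in the masked output, and the
mapping file is the only place where the real IDs survive.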

Eric
--
ms fnd in a lbry