On 9/26/2012 8:40 PM, harryoo...@hotmail.com wrote:
> I need a script, awk preferred, to mask sensitive patient info in HL7
> messages.
>
> The line numbers below do not belong to the HL7 messages; I just added
> them for the sake of clarity in this posting.
>
> I have some log files containing thousands of HL7 messages, separated by
> blank lines, with real patient data. I need to mask out that sensitive
> patient info before I can send these files to a third party (a Lab
> Report Repository) for them to use.
It's good that you provided some sample input, but couldn't you come up with
something much briefer that REPRESENTS your input, instead of something that
presumably IS your actual input and is lengthy, full of non-alphanumeric
characters, and wrapped across lines? A little effort on that would save
everyone reading your post from having to work out your data, and would make it
much more likely that we'd take the time to come up with the best answer for
you.

The expected output for that input would be very useful too.
>
> A) Sample HL7 message :
> 1 MSH|^~\&|OPEN ENGINE|CLS|Egate|8832253|20120926150049||ORU^R01|Q521477659T517738211|P|2.3
> 2 PID|1|123456789^^^AB|123456789^^^8832253|777888999^^^ULI~2444690^^^PSID|Name,Masked||19010131|F|||123 Random St^^Calgary^AB^A1B 2D3^CA^H^^83|83|(123)222-3333||ENG|S||100033344555^^^8832253|789030200|||||||||||N
> 3 PV1|1|D|01362^^^8832253|UR|||112233^Attending, Doctor|||||||||||D|32112345|ab||||||||||||||||||||||||20120925170500|20120925213000
> 4 OBR|1|001TKPWNZ|0589313008^101MA|2922077^URINE BACTERIAL CULTURE^L01N^M URINE^^MI|||20120925192000|||^CONTRIBUTOR_SYSTEM^SCMLAB^^^^^^^Personnel||||20120925211100|URINE^^^Midstream|10882^Khorrami, Katayoun^004406||||UR-12-1234567||20120926150043||MA|F||1^^^20120925191900^^ST~^^^^^ST|||||||||20120925192000
> 5 OBX|1|TX|4384297^URINE BACTERIAL CULTURE^L01N^M URINE^^MI||*****Microbiology Urine*****|||A|||F|||20120926150043
> 6 OBX|2|TX|4384297^URINE BACTERIAL CULTURE^L01N^M URINE^^MI|||||A|||F|||20120926150043
> 7 OBX|3|TX|4384297^URINE BACTERIAL CULTURE^L01N^M URINE^^MI|| TEST: Urine Culture|||A|||F|||20120926150043
>
> B) Some HL7 fields need_mask :
> 1 MSH|^~\&|OPEN ENGINE|CLS|Egate|8832253|20120926150049||ORU^R01|Q521477659T517738211|P|2.3
> 2 PID|1|PID-2.1_Need_Mask^^^AB|PID-3.1_Need_Mask^^^8832253|PID-4.1_Need_Mask^^^ULI~2444690^^^PSID|PID-5_Need_Mask||PID-7_Need_Mask|F|||PID-11.1_Need_Mask^^Calgary^AB^PID-11.5_Need_Mask^CA^H^^83|83|PID-11.13_Need_Mask||ENG|S||100037307051^^^8832253|789030200|||||||||||N
> 3 PV1|1|D|01362^^^8832253|UR|||PV1-7.1_Need_Mask^PV1-7.2_Need_Mask|||||||||||D|PV1-19_Need_Mask|ab||||||||||||||||||||||||20120925170500|20120925213000
> 4 OBR|1|001TKPWNZ|0589313008^101MA|2922077^URINE BACTERIAL CULTURE^L01N^M URINE^^MI|||20120925192000|||^CONTRIBUTOR_SYSTEM^SCMLAB^^^^^^^Personnel||||20120925211100|URINE^^^Midstream|10882^Khorrami, Katayoun^004406||||UR-12-0178975||20120926150043||MA|F||1^^^20120925191900^^ST~^^^^^ST|||||||||20120925192000
> 5 OBX|1|TX|4384297^URINE BACTERIAL CULTURE^L01N^M URINE^^MI||*****Microbiology Urine*****|||A|||F|||20120926150043
> 6 OBX|2|TX|4384297^URINE BACTERIAL CULTURE^L01N^M URINE^^MI|||||A|||F|||20120926150043
> 7 OBX|3|TX|4384297^URINE BACTERIAL CULTURE^L01N^M URINE^^MI|| TEST: Urine Culture|||A|||F|||20120926150043
>
> If I have another file that tell what-to-mask-to-what, would it be easier?
Not unless the fields you want to "mask" vary between input files.
> Like ...
>
> Format: <mask_to>, <HL7_Header>, <field> [,<subfield>] // Comment
Your format above says comma-separated fields, but your data has
semicolon-separated fields.
> -- mask_spec.txt--
> 123456789;PID;2;1 // Patient ID
> 123456789;PID;3;1 // Patient ID
> 123456789;PID;4;1 // Patient ID
> Name,Masked;PID;5 // Patient Name
> 19010131;PID;7 // Date of Birth
> 123 Random Street;PID;11;1 // Street Address
> A1B 2D3;PID;11;5 // Postal Code
> (123)222-3333;PID;11;13 // Phone Number
> 112233;PV1;7;1 // Physician ID
> Attending,Doctor;PV1;7;2 // Attending Doctor Name
> -- mask_spec.txt--
>
> Any help appreciated.
> TIA
>
It looks like your fields are separated by "|"s and sub-fields by "^"s, and that
you number your fields starting at 0 with the segment name (awk calls that $1,
so your PID-2 is awk's $3) and your sub-fields starting at 1. I THINK all you
need to do is something like:
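awk '
BEGIN { FS = OFS = "|" }
$1 == "PID" {
    # Replace only the first ^-separated sub-field, keeping the "^" separators.
    sub(/^[^^]*/, "123456789", $3)      # PID-2.1
    sub(/^[^^]*/, "123456789", $4)      # PID-3.1
    sub(/^[^^]*/, "123456789", $5)      # PID-4.1
    $6 = "Name,Masked"                  # PID-5
    $8 = "19010131"                     # PID-7
    n = split($12, sf, /\^/)            # PID-11
    sf[1] = "123 Random Street"
    sf[5] = "A1B 2D3"
    $12 = sep = ""
    for (i = 1; i <= n; i++) {          # rebuild the field from its sub-fields
        $12 = $12 sep sf[i]
        sep = "^"
    }
    # In the sample data the phone number is a field of its own, two fields
    # after the address, rather than sub-field 13 of PID-11.
    $14 = "(123)222-3333"
}
$1 == "PV1" {
    n = split($8, sf, /\^/)             # PV1-7
    sf[1] = "112233"
    sf[2] = "Attending,Doctor"
    $8 = sep = ""
    for (i = 1; i <= n; i++) {
        $8 = $8 sep sf[i]
        sep = "^"
    }
}
{ print }
' input_file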
but without a simpler input file and the expected output it's hard to tell.
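If the fields to mask do turn out to vary between files, something driven by
your mask_spec.txt could work instead. Here is a rough sketch, wrapped in a
shell function I have called mask_hl7 (the name is made up). It assumes the spec
is semicolon-separated as in your sample, that field numbers follow your PID-n
convention (segment name is field 0, so it would not be right for MSH), and that
no mask value contains a ";". Entries that point at a field or sub-field not
present in a segment are skipped rather than created, so with your sample data
the "PID;11;13" phone entry would need to be "PID;13".

```shell
# Hypothetical helper; usage: mask_hl7 mask_spec.txt input_file
mask_hl7() {
    awk '
    BEGIN { FS = OFS = "|" }

    # First file: the mask spec, one entry per line:
    #   <mask_to>;<segment>;<field>[;<subfield>] // comment
    NR == FNR {
        sub(/[ \t]*\/\/.*$/, "")            # drop the trailing // comment
        n = split($0, s, ";")
        if (n < 3) next                     # skip blank or malformed lines
        k = ++cnt[s[2]]                     # number of entries per segment
        val[s[2], k]  = s[1]
        fld[s[2], k]  = s[3] + 1            # PID-n is awk field n+1
        comp[s[2], k] = (n >= 4 ? s[4] + 0 : 0)   # 0 means the whole field
        next
    }

    # Second file: the HL7 log. Apply every entry for this segment type.
    ($1 in cnt) {
        for (k = 1; k <= cnt[$1]; k++) {
            f = fld[$1, k]
            c = comp[$1, k]
            if (f > NF) continue            # field absent: leave line alone
            if (c == 0) { $f = val[$1, k]; continue }
            n = split($f, sf, /\^/)
            if (c > n) continue             # sub-field absent: skip
            sf[c] = val[$1, k]
            out = sf[1]
            for (i = 2; i <= n; i++) out = out "^" sf[i]
            $f = out
        }
    }
    { print }
    ' "$1" "$2"
}
```

The spec is read with split() rather than by changing FS, so the same "|"
field separator can stay in force for the HL7 file. Blank lines between
messages and segments with no spec entries pass through unchanged.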
Regards,
Ed.