I need a script, awk preferred, to mask sensitive patient info in HL7
messages.
The line numbers below do not belong to the HL7 messages; I just added
them for the sake of clarity in this posting.
I have some log files containing thousands of HL7 messages, separated by
blank lines, with real patient data. I need to mask that sensitive
patient info before I can send these files to a third party (a Lab
Report Repository) for them to use.
-- mask_spec.txt--
123456789;PID;2;1 // Patient ID
123456789;PID;3;1 // Patient ID
123456789;PID;4;1 // Patient ID
Name,Masked;PID;5 // Patient Name
19010131;PID;7 // Date of Birth
123 Random Street;PID;11;1 // Street Address
A1B 2D3;PID;11;5 // Postal Code
(123)222-3333;PID;11;13 // Phone Number
112233;PV1;7;1 // Physician ID
Attending,Doctor;PV1;7;2 // Attending Doctor Name
-- mask_spec.txt--
> I need a script, awk preferred, to mask sensitive patient info in HL7
> messages.
> The line numbers below do not belong to the HL7 messages; I just added
> them for the sake of clarity in this posting.
> I have some log files containing thousands of HL7 messages, separated by
> blank lines, with real patient data. I need to mask that sensitive
> patient info before I can send these files to a third party (a Lab
> Report Repository) for them to use.
Your sample data are quite confusing and not well suited to showing what
you want. The partial field substitutions are also unclear.
> If I have another file that tells what-to-mask-to-what, would it be easier?
It depends.
If you have to mask just specific fields in specific records I'd choose
another approach; mask those fields and, if necessary, save the mapping
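in an independent file. Something like:
awk 'BEGIN { FS = OFS = "|" }
$1=="PID" {
    $6 = "Name,Masked"; $8 = "19010131"   # spec fields 5 and 7; awk counts "PID" as $1
}
$1=="PV1" {
    # similar as above for other record types
}
# etc. for more record types
{ print }
' in_data > out_data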
That outlined approach can be made more concise by introducing a function
where the field number is a parameter.
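As a minimal sketch of that function idea: it assumes "|"-separated fields and the numbering used in the mask spec, where the segment name is field 0, so spec field n is awk's $(n+1). The function name mask() and the two sample segments are made up for illustration, and sub-fields are not handled separately here.

```shell
# Made-up sample: one PID and one PV1 segment, fields separated by "|".
sample='PID|1|111|222|333|Smith,John||19700101
PV1|1|I|||||9999^Real,Doc'

masked=$(printf '%s\n' "$sample" | awk '
BEGIN { FS = OFS = "|" }

# mask(n, value): overwrite spec field n (awk field n+1) of the current
# record, but only when the record really has that field.
function mask(n, value) {
    if (NF >= n + 1) $(n + 1) = value
}

$1 == "PID" { mask(2, "123456789"); mask(5, "Name,Masked"); mask(7, "19010131") }
$1 == "PV1" { mask(7, "112233^Attending,Doctor") }   # whole field, both sub-fields at once
{ print }
')

printf '%s\n' "$masked"
```

The NF check keeps the function from appending empty fields to segments that are shorter than expected.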
Your fields also seem to be only partly substituted (in some cases?), so
an actual substitution would have to use the match() and substr() functions
(or alternatively the sub()/gsub() functions) to replace only the relevant
part and keep the rest.
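To illustrate the partial substitution, here is a small sketch that replaces only the first "^"-separated part of a field and keeps the rest, once with match()/substr() and once with sub(). The address value is invented, and the assumption that "^" separates sub-fields is mine.

```shell
# Made-up address field: street^other^city^province^postal code
field='55 Real Road^^Ottawa^ON^K1K 1K1'

out=$(printf '%s\n' "$field" | awk '
{
    # Variant 1: sub() on a copy replaces the leading run of non-"^" characters.
    viaSub = $0
    sub(/^[^^]*/, "123 Random Street", viaSub)

    # Variant 2: match() locates the same leading run, substr() keeps the rest.
    if (match($0, /^[^^]*/))
        $0 = "123 Random Street" substr($0, RSTART + RLENGTH)

    print
    print viaSub
}')

printf '%s\n' "$out"
```

One thing to watch with sub()/gsub(): an "&" in the replacement string stands for the matched text, so a mask value containing "&" would need escaping there, while the substr() variant takes the value literally.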
If you provide more concise sample data and an accurate description of
your substitution rules we can help you in more detail, and likely provide
an even simpler solution.
On 9/26/2012 8:40 PM, harryooopot...@hotmail.com wrote:
> I need a script, awk preferred, to mask sensitive patient info in HL7
> messages.
> The line numbers below do not belong to the HL7 messages; I just added
> them for the sake of clarity in this posting.
> I have some log files containing thousands of HL7 messages, separated by
> blank lines, with real patient data. I need to mask that sensitive
> patient info before I can send these files to a third party (a Lab
> Report Repository) for them to use.
It's good that you provided some sample input, but couldn't you come up with something much briefer that REPRESENTS your input, instead of something that presumably IS your input/output and is so lengthy, with all those non-alphanumeric characters and wrapping lines? A little effort there would save everyone reading your post the effort of deciphering your data, and so make it much more likely that we'd take the time and come up with the best answer for you.
The expected output for that input would be very useful too.
> -- mask_spec.txt--
> 123456789;PID;2;1 // Patient ID
> 123456789;PID;3;1 // Patient ID
> 123456789;PID;4;1 // Patient ID
> Name,Masked;PID;5 // Patient Name
> 19010131;PID;7 // Date of Birth
> 123 Random Street;PID;11;1 // Street Address
> A1B 2D3;PID;11;5 // Postal Code
> (123)222-3333;PID;11;13 // Phone Number
> 112233;PV1;7;1 // Physician ID
> Attending,Doctor;PV1;7;2 // Attending Doctor Name
> -- mask_spec.txt--
> Any help appreciated.
> TIA
It looks like your fields are separated by "|"s and sub-fields by "^"s, and you are numbering your fields starting at 0 (while awk starts them at 1) and your sub-fields at 1. I THINK all you need to do is something like:
I made a mistake in the Phone Number line of the mask_spec.
It should be
(123)222-3333;PID;13;1 // Phone Number
instead of
(123)222-3333;PID;11;13 // Phone Number
So the awk snippet should be
$14 = "(123)222-3333"
instead.
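For reference, a minimal sketch of a spec-driven version, not anyone's posted solution. It assumes the mask_spec.txt layout shown in this thread (value;segment;field[;sub-field], with a trailing "// comment"), "|" between fields, "^" between sub-fields, and spec field n being awk's $(n+1). The sample message is made up, and HL7 repetitions ("~") and escape sequences are not handled.

```shell
tmp=$(mktemp -d)

# Spec as posted, with the corrected phone entry.
cat > "$tmp/mask_spec.txt" <<'EOF'
123456789;PID;2;1 // Patient ID
123456789;PID;3;1 // Patient ID
123456789;PID;4;1 // Patient ID
Name,Masked;PID;5 // Patient Name
19010131;PID;7 // Date of Birth
123 Random Street;PID;11;1 // Street Address
A1B 2D3;PID;11;5 // Postal Code
(123)222-3333;PID;13;1 // Phone Number
112233;PV1;7;1 // Physician ID
Attending,Doctor;PV1;7;2 // Attending Doctor Name
EOF

# Made-up message, not real log data.
cat > "$tmp/in_data" <<'EOF'
MSH|^~\&|LAB
PID|1||777^^^HOSP||Real,Name||19800229||||9 Elm St^^Town^ON^K2K 2K2||(613)555-0000^HOME
PV1|1|I|||||4242^Real,Doc^MD
EOF

result=$(awk -F'|' -v OFS='|' '
# First file: load the spec into parallel arrays.
NR == FNR {
    sub(/[ \t]*\/\/.*$/, "")                 # drop the trailing "// comment"
    n = split($0, s, ";")
    if (n < 3) next
    cnt++
    val[cnt] = s[1]; seg[cnt] = s[2]
    fld[cnt] = s[3] + 1                      # spec field n is awk field n+1
    comp[cnt] = (n >= 4) ? s[4] + 0 : 0      # 0 means: replace the whole field
    next
}
# Second file: apply every spec entry that matches this segment.
{
    for (i = 1; i <= cnt; i++) {
        if ($1 != seg[i] || NF < fld[i]) continue
        if (comp[i] == 0) { $(fld[i]) = val[i]; continue }
        m = split($(fld[i]), part, /\^/)
        if (m < comp[i]) continue            # sub-field absent: leave the field alone
        part[comp[i]] = val[i]
        joined = part[1]
        for (j = 2; j <= m; j++) joined = joined "^" part[j]
        $(fld[i]) = joined
    }
    print
}
' "$tmp/mask_spec.txt" "$tmp/in_data")

rm -rf "$tmp"
printf '%s\n' "$result"
```

Empty fields stay empty in this sketch (PID 2 and 4 in the sample), which keeps the masked output the same shape as the input.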
On 2012-09-27, harryooopot...@hotmail.com <harryooopot...@hotmail.com> wrote:
> I need a script, awk preferred, to mask sensitive patient info in HL7
> messages.
> The line numbers below do not belong to the HL7 messages; I just added
> them for the sake of clarity in this posting.
> I have some log files containing thousands of HL7 messages, separated by
> blank lines, with real patient data. I need to mask that sensitive
> patient info before I can send these files to a third party (a Lab
> Report Repository) for them to use.
Totally OT for this group, but are you sure you should be mapping
(say) all patient IDs to the same value? What about data for the same
patient? I would have said you needed a safe-haven list of real<->fake
IDs (and possibly some other fields).
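A rough sketch of what such a real<->fake list could look like in awk, under my own assumptions: the patient ID is the first "^" sub-field of spec field 3 of the PID segment, fake IDs are simply numbered in order of first appearance, and the file name id_map.txt is invented. The map file itself contains the real IDs, so it would have to stay with the data owner.

```shell
tmp=$(mktemp -d)

# Made-up input: two messages for patient 777 and one for patient 888.
cat > "$tmp/in_data" <<'EOF'
PID|1||777^^^HOSP
PID|1||888^^^HOSP
PID|1||777^^^HOSP
EOF

masked=$(awk -F'|' -v OFS='|' -v map="$tmp/id_map.txt" '
$1 == "PID" && NF >= 4 {
    real = $4
    sub(/\^.*$/, "", real)                 # first sub-field of spec field 3
    if (real != "") {
        if (!(real in fake)) {             # first time this ID is seen
            fake[real] = sprintf("%09d", ++seq)
            print real ";" fake[real] > map
        }
        # Swap in the fake ID and keep the remaining sub-fields as they were.
        $4 = fake[real] substr($4, length(real) + 1)
    }
}
{ print }
' "$tmp/in_data")

mapping=$(cat "$tmp/id_map.txt")
rm -rf "$tmp"

printf '%s\n' "$masked"
printf '%s\n' "$mapping"
```

With this, both messages for patient 777 get the same fake ID, so the recipient can still group results by patient without seeing the real identifier.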