need help with regex

Harry

unread,

May 30, 2018, 3:26:29 PM5/30/18

to

Hi All

I am trying to extract some info from an Oracle database field which contains text like this below.

Identifier1^Last^First^Middle^^^^^EX_NBR^PERSONNEL~
Identifier2^Last^First^Middle^^^^^COL_NBR^PERSONNEL~
Identifier3^Last^First^Middle^^^^^MSP_NBR^PERSONNEL~
Identifier4^Last^First^Middle^^^^^PHARMANETPHYS^PERSONNEL

All the 4 lines above are actually within 1 line of field.
I just break them into 4 lines for the sake of clarity.

There may be 3 or 4 sub-fields as above separated by the tilde character.

I want to extract
Identifier3^Last^First^Middle
from the sub-field
Identifier3^Last^First^Middle^^^^^MSP_NBR^PERSONNEL~

MSP_NBR sub-field may or may not be in the middle.
It could be on the fist or on the last sub-field.

Anyone could advice the regex expression (to be used by Oracle's regexp_replace() to extract the info please?

TIA

Icarus Sparry

unread,

May 30, 2018, 3:46:41 PM5/30/18

to

As I understand it, you want to change

something~something~data^^something

to
data

where the first two somethings don't have ~ characters in them. I don't
know Oracle's regexp_replace. In sed I would do it in 2 steps for ease of
thinking, first get the third field and then trim it down.

Get the third field.
s/^[^~]*~[^~]*~$[^~]*$~.*$/\1

translation,
^ match start of line
[^~] a character that is not ~
* change the last thing to 0 or more occurances, i.e. a run of characters
that is not ~
~ match a ~, the first field seperator
[^~]*~ as above, i.e. the second field and it's seperator
$ start capturing the characters we want
[^~]* the third field
$ stop capturing
~ the seperator at the end of the third field
.* matches anything
$ matches end of line

\1 the text captured in the first set of braces

Then to remove everything after the ^^, just

s/^^.*//

It is possible to put this into one replacement, but it is harder.

Harry

unread,

May 30, 2018, 4:20:02 PM5/30/18

to

On Wednesday, May 30, 2018 at 12:46:41 PM UTC-7, Icarus Sparry wrote:

> Get the third field.
> s/^[^~]*~[^~]*~$[^~]*$~.*$/\1

Yeah, but the MSP_NBR sub-field that I want is not always on the 3rd field.
And the field may usually contain 3 or 4 sub-fields, so there may not be
a tilde character after the 3rd sub-field.

Ed Morton

unread,

May 30, 2018, 4:21:24 PM5/30/18

to

idk what kind of regexp the tool you're using supports with here's an ERE
assuming you want a string that starts with Identifier3:

Identifier3(\^[^\^]*){3}

used in awks match() function to demonstrate:

$ awk 'match($0,/Identifier3(\^[^\^]*){3}/) {print
substr($0,RSTART,RLENGTH)}' file
Identifier3^Last^First^Middle

If that's not what you're looking for then please clarify your requirements.

Regards,

Ed.

Harry

unread,

May 30, 2018, 5:05:38 PM5/30/18

to

OK, if I put the text into a text file and use sed ....
this one works for me if MSP_NBR sub-field is in 1st or last sub-field.

sed 's/^$.*~$*$[^~]*$^^^^^MSP_NBR^PERSONNEL.*$/\2/' xx.txt

On Oracle, this one works for me .....

regexp_replace(table.field, '^(.*~)*([^~]*)\^\^\^\^\^MSP_NBR\^PERSONNEL.*$', '\2')

Thanks

Thomas 'PointedEars' Lahn

unread,

May 31, 2018, 1:43:19 PM5/31/18

to

Icarus Sparry wrote:

> As I understand it, you want to change
>
> something~something~data^^something
>
> to
> data
>
> where the first two somethings don't have ~ characters in them. I don't
> know Oracle's regexp_replace. In sed I would do it in 2 steps for ease of

> thinking, first get the third field and then trim it down. […]

If that is the goal, it is of course much easier to do with awk(1) instead:

awk -F '~' '{ sub(/\^\^.*$/, "", $3); print $3 }'

Whenever you think “fields”, think of awk(1) and perl(1) first.

--
PointedEars
<https://github.com/PointedEars> | <http://PointedEars.de/wsvn/>
Twitter: @PointedEars2
Please do not cc me. /Bitte keine Kopien per E-Mail.

Grant Taylor

unread,

May 31, 2018, 1:50:59 PM5/31/18

to

On 05/31/2018 11:43 AM, Thomas 'PointedEars' Lahn wrote:
> Whenever you think “fields”, think of awk(1) and perl(1) first.

Even if "fields" are something other than "part of a line".

If there is something that separates the fields, chances are good that
awk and / or Perl can easily work with it.

--
Grant. . . .
unix || die