
odd egrep behavior


MarkA

Mar 14, 2013, 5:22:28 PM
I am trying to extract a string from the output of a DICOM directory
parser. The string I'm trying to match looks similar to this:

(0004,1500) CS #36 [DICOM\ST000000\SE000000\CR000000.DCM] Referenced File ID

"s" is a much larger string in which the target string is embedded.

From the console, this string matches:

echo $s|egrep -o " \(0004,1500\) CS #[0-9]{2} \[.{,40}\] Reference. File ID "

However, this string does not:

echo $s|egrep -o " \(0004,1500\) CS #[0-9]{2} \[.{,40}\] Referenced File ID "

This also matches:

echo $s|egrep -o " \(0004,1500\) CS #[0-9]{2} \[.{,40}\] Reference[d] File ID "

So does this:

echo $s|egrep -o " \(0004,1500\) CS #[0-9]{2} \[.{1,40}\] Referenced File ID "


In summary, if I use the form, ".{,40}" to specify the variable part of
the string, egrep won't recognize a match between "Referenced" and
"Referenced", but will recognize "Reference." and "Reference[d]". If I
use ".{1,40}", then "Referenced" matches "Referenced", as expected.

This would seem to be a bug, unless there is something about the regular
expression that I'm missing.

I'm using GNU bash, version 4.1.5(1), and GNU grep, version 2.5.4
OS: Ubuntu 10.04.4 LTS
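
For anyone who wants to try this on their own grep, the four variants above can be run side by side with something like the following. This is only a sketch: the value of "s" is a made-up stand-in (the real string was not posted), and the helper name "matches" is hypothetical. It uses printf and a quoted "$s" so that the backslashes and spaces in the record reach grep unchanged.

```shell
# Hypothetical stand-in for the larger string; only the embedded record
# is taken from the post above.
s='junk (0004,1500) CS #36 [DICOM\ST000000\SE000000\CR000000.DCM] Referenced File ID junk'

# matches INTERVAL WORD
# Succeeds if the record matches with the given interval expression
# (e.g. '{,40}') and the given spelling of the word (e.g. 'Reference.').
matches() {
    printf '%s\n' "$s" |
        grep -E -q " \(0004,1500\) CS #[0-9]{2} \[.$1\] $2 File ID "
}

# Run the four combinations described in the post and report each result.
for combo in '{,40} Reference.' '{,40} Referenced' '{,40} Reference[d]' '{1,40} Referenced'; do
    interval=${combo%% *}
    word=${combo#* }
    if matches "$interval" "$word"; then
        echo "match:    $interval  $word"
    else
        echo "no match: $interval  $word"
    fi
done
```

Which lines report "no match" will depend on the grep version in use, so no expected output is given here.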

--
MarkA
Keeper of Things Put There Only Just The Night Before
About eight o'clock

DennisW

Mar 15, 2013, 1:17:07 PM
All your regexes match for me. It is highly unlikely that the brace expression would affect a later literal following a series of literal characters. There may be something unusual in the target string or in your regex that didn't get posted in your question. Try passing them through hexdump to see if there's anything unexpected. You're not using lookarounds or alternation, so those aren't affecting things.
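
A minimal sketch of that check, assuming od is available (it is used here in place of hexdump because it is specified by POSIX; either tool would do). The variable "word" is a placeholder for whatever text is under suspicion:

```shell
# Placeholder for the suspect text; paste the real fragment here.
word='Referenced'

# Character view: stray control or non-ASCII bytes show up as escapes.
printf '%s' "$word" | od -An -c

# Hex view, collapsed to a single string for easy comparison.
hex=$(printf '%s' "$word" | od -An -tx1 | tr -d ' \n')
printf '%s\n' "$hex"
```

For plain ASCII "Referenced" the hex string should be 5265666572656e636564; anything else means an unexpected byte is hiding in the text.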

In any case, this isn't a Bash issue. It's a coreutils issue. If you need additional help with this, I suggest that you try one of the resources listed at http://www.gnu.org/software/coreutils/

MarkA

Mar 18, 2013, 5:17:35 PM
Thank you for your interest and input. I also posted the question on
Ubuntu Forums, where another user was able to reproduce the odd behavior.
It seems to be related to the "{,m}" idiom, which, it turns out, is not
specified in the POSIX standard, while the other variants, {m}, {n,m}, and
{m,}, are.
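
Given that, the portable way to say "at most 40" is to write the lower bound explicitly as {0,40}. A sketch of what that looks like, with a made-up value for "s" standing in for the real parser output:

```shell
# Hypothetical sample input; only the embedded record comes from the
# original post.
s='junk (0004,1500) CS #36 [DICOM\ST000000\SE000000\CR000000.DCM] Referenced File ID junk'

# {0,40} spells out the lower bound instead of relying on the
# non-standard {,40} form.
match=$(printf '%s\n' "$s" |
    grep -E -o ' \(0004,1500\) CS #[0-9]{2} \[.{0,40}\] Referenced File ID ')

# Brackets make the leading and trailing spaces of the match visible.
printf '[%s]\n' "$match"
```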

I agree, it seems unlikely that the brace expansion would affect a literal
several characters later, but there it is. And, I did check a hex dump to
make sure there was nothing odd about the "d" character that wouldn't
match.

In any case, I wound up changing the code to use awk instead of grep for
the match, allowing me to extract the desired field from matching records
in a single step.
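
The awk code itself is not reproduced here, but the general idea can be sketched roughly as follows. Everything in it is illustrative rather than the code actually used: the sample value of "s", the variable names, and the assumption that the wanted field is the text between the square brackets.

```shell
# Hypothetical sample record, one per line as the parser would emit it.
s='(0004,1500) CS #36 [DICOM\ST000000\SE000000\CR000000.DCM] Referenced File ID'

file_id=$(printf '%s\n' "$s" | awk '
    # Select only the Referenced File ID records ...
    /\(0004,1500\) CS #[0-9][0-9] \[/ && / Referenced File ID/ {
        # ... and print whatever sits between the first "[" and the
        # next "]" on the line.
        start = index($0, "[")
        rest  = substr($0, start + 1)
        stop  = index(rest, "]")
        if (stop > 0)
            print substr(rest, 1, stop - 1)
    }')

printf '%s\n' "$file_id"
```

This avoids interval expressions altogether, so it does not depend on how a particular tool treats {,m}.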