
Perl regex - How to make my greedy quantifier greedier?


cibalo

May 16, 2013, 11:46:00 PM
Hello,

I would like to try some string matching in Perl, as described in the title.
Let's create some test files as follows.
$ mkdir -vp testing/dir.a/dir_b/dir-c; cd testing/dir.a/dir_b/dir-c; \
touch This_is_testing1_org.txt This-is-testing2_org.txt \
this_is_testing3_org.txt this-is-testing4_org.txt; cd

What I am looking for is the result similar to:
$ find testing -type f -name "[a-z]*\.txt"
testing/dir.a/dir_b/dir-c/this-is-testing4_org.txt
testing/dir.a/dir_b/dir-c/this_is_testing3_org.txt
I know it is easier to get the result this way.

Now I try with perl regex as:
$ ls testing/dir.a/dir_b/dir-c/* | perl -ne '/^(.*\/)([a-z].*)$/;
print $1, " - ", $2, "\n";'
testing/dir.a/dir_b/ - dir-c/This_is_testing1_org.txt
testing/dir.a/dir_b/ - dir-c/This-is-testing2_org.txt
testing/dir.a/dir_b/dir-c/ - this_is_testing3_org.txt
testing/dir.a/dir_b/dir-c/ - this-is-testing4_org.txt
Actually, I want my leftmost greedy quantifier, (.*\/), to be so
greedy that it prevents the first two output items from being listed.

What interests me most is this:
$ ls testing/dir.a/dir_b/dir-c/* | perl -ne '/^(.*\/)([A-Z].*)$/;
print $1, " - ", $2, "\n";'
testing/dir.a/dir_b/dir-c/ - This_is_testing1_org.txt
testing/dir.a/dir_b/dir-c/ - This-is-testing2_org.txt
testing/dir.a/dir_b/dir-c/ - This-is-testing2_org.txt
testing/dir.a/dir_b/dir-c/ - This-is-testing2_org.txt
"This-is-testing2_org.txt" is repeated three times.

Can you please let me know what I'm missing?

Thank you very much in advance!!!

Best Regards,
cibalo

Damien Wyart

May 17, 2013, 4:14:47 AM
* cibalo <cib...@gmx.co.uk> in comp.lang.perl.misc:
> [...]

> Now I try with perl regex as:
> $ ls testing/dir.a/dir_b/dir-c/* | perl -ne '/^(.*\/)([a-z].*)$/;
> print $1, " - ", $2, "\n";'
> testing/dir.a/dir_b/ - dir-c/This_is_testing1_org.txt
> testing/dir.a/dir_b/ - dir-c/This-is-testing2_org.txt
> testing/dir.a/dir_b/dir-c/ - this_is_testing3_org.txt
> testing/dir.a/dir_b/dir-c/ - this-is-testing4_org.txt
> Actually, I want my leftmost greedy quantifier, (.*\/), to be so
> greedy that it prevents the first two output items from being listed.
> [...]

Strictly answering your question, what you are looking for is the
possessive quantifier '*+'; but this will not work in your regex: you
need to exclude '/' in the second group to match only on the filename.

You can read more on the topic (using regexes with paths and filenames)
here: http://stackoverflow.com/questions/169008/regex-for-parsing-directory-and-filename

--
DW

Christian Winter

May 17, 2013, 4:35:50 AM
On 17.05.2013 05:46, cibalo wrote:
> What I am looking for is the result similar to:
> $ find testing -type f -name "[a-z]*\.txt"
> testing/dir.a/dir_b/dir-c/this-is-testing4_org.txt
> testing/dir.a/dir_b/dir-c/this_is_testing3_org.txt
> I know it is easier to get the result this way.

> Now I try with perl regex as:
> $ ls testing/dir.a/dir_b/dir-c/* | perl -ne '/^(.*\/)([a-z].*)$/;
> print $1, " - ", $2, "\n";'
> testing/dir.a/dir_b/ - dir-c/This_is_testing1_org.txt
> testing/dir.a/dir_b/ - dir-c/This-is-testing2_org.txt
> testing/dir.a/dir_b/dir-c/ - this_is_testing3_org.txt
> testing/dir.a/dir_b/dir-c/ - this-is-testing4_org.txt
> Actually, I want my leftmost greedy quantifier, (.*\/), to be so
> greedy that it prevents the first two output items from being listed.

Two approaches instantly come to my mind:

1. Modifying the pattern for the filename so it loses
its greediness, which unfortunately isn't solved by
adding the non-greedy modifier. A simple solution is
looking for anything but a slash:
/^ (.*\/) ([^\/]+) $/x;

2. Use a negative look-ahead assertion after the first
pattern to invalidate a match if there's still a slash
further ahead:
/^ (.*\/) (?!.*\/) ([a-z].*) $/x;
This one has the added benefit that the capture for $2
is applied to the filename only.
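
The two approaches can be compared side by side with a minimal sketch
like the one below. It assumes the four paths from the original post.
The sub names split_no_slash and split_lookahead are hypothetical, and
the first pattern keeps the [a-z] first-character test from the
original regex, which approach 1 as written above does not include.

```perl
use strict;
use warnings;

# Paths taken from the original post.
my @paths = map { "testing/dir.a/dir_b/dir-c/$_" } qw(
    This_is_testing1_org.txt
    This-is-testing2_org.txt
    this_is_testing3_org.txt
    this-is-testing4_org.txt
);

# Approach 1 (adapted): the filename part may not contain a slash,
# and must still start with a lowercase letter.
sub split_no_slash {
    my ($path) = @_;
    return $path =~ m{^ (.*/) ([a-z][^/]*) $}x ? ($1, $2) : ();
}

# Approach 2: after the directory part, reject the match if any
# further slash lies ahead.
sub split_lookahead {
    my ($path) = @_;
    return $path =~ m{^ (.*/) (?!.*/) ([a-z].*) $}x ? ($1, $2) : ();
}

for my $path (@paths) {
    for my $split (\&split_no_slash, \&split_lookahead) {
        # An empty list means the pattern did not match this path.
        my @parts = $split->($path);
        print join(" - ", @parts), "\n" if @parts;
    }
}
```

Both subs return an empty list for the two capitalised filenames, so
only the lowercase-named files are printed, once per approach.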

-Chris

Ben Morrow

May 17, 2013, 5:07:48 AM

Quoth Christian Winter <thepoet...@arcor.de>:
> On 17.05.2013 05:46, cibalo wrote:
> > What I am looking for is the result similar to:
> > $ find testing -type f -name "[a-z]*\.txt"
> > testing/dir.a/dir_b/dir-c/this-is-testing4_org.txt
> > testing/dir.a/dir_b/dir-c/this_is_testing3_org.txt
> > I know it is easier to get the result this way.
>
> > Now I try with perl regex as:
> > $ ls testing/dir.a/dir_b/dir-c/* | perl -ne '/^(.*\/)([a-z].*)$/;
> > print $1, " - ", $2, "\n";'
> > testing/dir.a/dir_b/ - dir-c/This_is_testing1_org.txt
> > testing/dir.a/dir_b/ - dir-c/This-is-testing2_org.txt
> > testing/dir.a/dir_b/dir-c/ - this_is_testing3_org.txt
> > testing/dir.a/dir_b/dir-c/ - this-is-testing4_org.txt
> > Actually, I want my leftmost greedy quantifier, (.*\/), to be so
> > greedy that it prevents the first two output items from being listed.

It's important to realise that (non-)greediness can never prevent a
pattern from matching, it only changes which parts of the pattern match
which parts of the text. As Damien pointed out, possessiveness, that is
(?>) or .*+, can stop a pattern from matching, but in this case the .*+
in (.*+\/) would run all the way to the end of the string and refuse to
give any up, so you still wouldn't get what you want.
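
The difference can be seen on a single path from the original post. The
snippet below is only an illustrative sketch. The variable names are
made up for the example, and possessive quantifiers need Perl 5.10 or
later.

```perl
use strict;
use warnings;

my $path = "testing/dir.a/dir_b/dir-c/This_is_testing1_org.txt";

# Greedy: .* first grabs everything, then backtracks to an earlier
# slash so that [a-z] can match the "d" of "dir-c/...".
my $greedy = $path =~ m{^(.*/)([a-z].*)$} ? "$1 - $2" : "no match";

# Possessive: .*+ grabs everything and gives nothing back, so the
# slash that follows it can never match.
my $possessive = $path =~ m{^(.*+/)([a-z].*)$} ? "$1 - $2" : "no match";

print "greedy:     $greedy\n";
print "possessive: $possessive\n";
```

The greedy pattern still matches, only with a different split, while
the possessive one fails on every input that has text after a slash.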

> Two approaches instantly come to my mind:
>
> 1. Modifying the pattern for the filename so it loses
> its greediness, which unfortunately isn't solved by
> adding the non-greedy modifier. A simple solution is
> looking for anything but a slash:
> /^ (.*\/) ([^\/]+) $/x;

That's not the same. What the OP hasn't realised is that '*' in a glob
is not equivalent to .*, it's equivalent to [^/]*. So a simple
translation of the original find pattern would be m!/[a-z][^/]*\.txt$!
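
A minimal sketch of that translation follows, assuming the four paths
from the original post. The capture groups and the "if" guard are
additions for illustration. The guard matters because $1 and $2 keep
their values from the last successful match when a later match fails,
which would explain the repeated "This-is-testing2_org.txt" lines in
the original output.

```perl
use strict;
use warnings;

# Paths taken from the original post.
my @paths = map { "testing/dir.a/dir_b/dir-c/$_" } qw(
    This_is_testing1_org.txt
    This-is-testing2_org.txt
    this_is_testing3_org.txt
    this-is-testing4_org.txt
);

# Regex counterpart of the glob [a-z]*\.txt applied to the last path
# component: the glob's * becomes [^/]* rather than .*
my $name_re = qr{^(.*/)([a-z][^/]*\.txt)$};

my @lines;
for my $path (@paths) {
    # Use $1 and $2 only when this match succeeded.
    if ($path =~ $name_re) {
        push @lines, "$1 - $2";
    }
}
print "$_\n" for @lines;
```

Only the two lowercase-named files are printed, each split into its
directory and filename.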

Ben

cibalo

May 17, 2013, 9:14:48 PM
Hello Ben Morrow, Christian Winter and Damien Wyart:

Thank you very much for your valuable replies.
Your suggestions worked for me.

Best Regards,
cibalo
