Struggling with the absurdly simple


The Frog

Feb 6, 2014, 5:35:20 AM
to re...@googlegroups.com
Hi Everyone,

I am analysing a text file with MS Regular Expressions 5.5 and need a little help with one of the types of lines I am parsing. I need to be able to positively identify whether the line conforms to this regex and, if so, capture the first section only.

The text looks like the following:
<one or more words>(<multiple spaces><one or more words>)repeated unknown number of times<optional digits followed by a % sign><end of line>

Grabbing blocks of words seems to be ok, but I am unsure how to do two things here:
1/ Single-character words break the capture group's detection and therefore start a new capture when there should be only one
2/ Making sure that this format of text, and only this format, is captured from the source data

I am able to detect word blocks with the following: ((\w+)\s?\w)+ but this breaks when there are single character words like 'a' and 'I'.
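
For anyone following along, here is a small sketch of why that pattern stops early. It is written in Python rather than against the MS Regular Expressions 5.5 library, on the assumption that both engines treat these basic constructs the same way. Each repetition of the group needs a run of word characters, at most one whitespace character, and then one more word character. A one-letter word gets consumed as that trailing \w, so the next repetition has to start on a space, where \w+ cannot match. The alternative pattern is only a suggestion and assumes the words inside a block are separated by exactly one space.

```python
import re

# The pattern from the post above.
ORIGINAL = re.compile(r'((\w+)\s?\w)+')

# Hypothetical alternative (an assumption, not from the thread): one word,
# then any number of "single space + word" repetitions.
ALTERNATIVE = re.compile(r'\w+(?: \w+)*')

LINE = 'Rip em a new One            I hate garbage            Muppets rule'

# Stops after the one-letter word "a".
print(repr(ORIGINAL.match(LINE).group(0)))

# Runs up to the first gap of two or more spaces.
print(repr(ALTERNATIVE.match(LINE).group(0)))
```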

Some example text:
The cat in the hat                       fiddly diddly dee                  porky pig and elmer fud                grab a snack
One big chicken            four to the floor                          chicken sandwich please                              34.56%
Rip em a new One            I hate garbage            Muppets rule

The desired output is:
The cat in the hat
One big chicken
Rip em a new One

The line before each of these lines is always blank. A totally different layout follows them, but it can take several forms, so it is not much help for identifying the line.

Any help in this would be greatly appreciated. It seems so simple yet it is eluding me.

Cheers

The Frog

Rick Quatro

Feb 6, 2014, 7:43:02 AM
to re...@googlegroups.com
Hi, This one should work:

^(?:\b[\S]+\s)+\S+

Rick

--
--
Sub, Unsub, Read-on-the-web, tune your personal settings for this Regex forum:
http://groups.google.com/group/regex?hl=en
 
---
You received this message because you are subscribed to the Google Groups "Regex" group.
To unsubscribe from this group and stop receiving emails from it, send an email to regex+un...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.



--
Rick Quatro
585-219-8959 Fax/Voice Mail

The Frog

Feb 6, 2014, 7:49:10 AM
to re...@googlegroups.com

Forgot to mention that in the sample text above the first block of words might actually be just a single word:

Chocolate             monkey donkey birdy        fred flintstone       barney rubble            35.00%

Rick Quatro

Feb 6, 2014, 7:55:25 AM
to re...@googlegroups.com
^(?:\b[\S]+\s?)+\S+



The Frog

Feb 6, 2014, 8:07:27 AM
to re...@googlegroups.com

Hi Rick,

Thanks for taking the time to have a look at this. The regex you suggest isn't detecting the line with the data concerned. I removed the '^' at the start, but to no avail. Perhaps I can explain the line another way that might make more sense.

I have a name at the start of the line I want. This name might be a single word or multiple words. It is followed by an unknown number of whitespace characters. There may then be between 1 and 4 more names of the same kind (i.e. one or more words each), separated from one another by an unknown number of whitespace characters. The line optionally ends with an unknown number of whitespace characters followed by a percentage such as 00.00%.

I don't know if this helps or not. It would also be useful to capture the % value at the end if possible. I don't know whether this can be done.
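
One way to read that description as a single whole-line pattern is sketched below. This is only an illustration, not something posted in the thread, and it is in Python rather than MS Regular Expressions 5.5. The constructs used (non-capturing groups, lazy quantifiers, anchors) should be available there too, but that is untested. It assumes that words inside a name are separated by exactly one space, that names are separated by two or more whitespace characters, and that the percentage looks like 34.56% or 35%. The names NAME, LINE and parse are made up for the example.

```python
import re

# Assumption: a name is one or more words joined by single spaces.
NAME = r'\S+(?: \S+)*'

LINE = re.compile(
    r'^\s*(' + NAME + r')'               # group 1: the first name
    + r'(?:\s{2,}' + NAME + r'){1,4}?'   # 1 to 4 further names, not captured
    + r'(?:\s{2,}(\d+(?:\.\d+)?)%)?'     # group 2: optional percentage value
    + r'\s*$'                            # nothing else allowed on the line
)

def parse(line):
    """Return (first_name, percent_or_None), or None if the line does not conform."""
    m = LINE.match(line)
    if m is None:
        return None
    return m.group(1), m.group(2)

samples = [
    'The cat in the hat                       fiddly diddly dee                  porky pig and elmer fud                grab a snack',
    'One big chicken            four to the floor                          chicken sandwich please                              34.56%',
    ' Chocolate             monkey donkey birdy        fred flintstone       barney rubble            35.00%',
    'Muppets rule',
]
for s in samples:
    print(parse(s))
```

The further names use a lazy quantifier so that a trailing percentage is tried as the percentage before it can be swallowed as one more name. The closing $ is what makes the pattern reject lines that merely start in the right way.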

Cheers

The Frog

The Frog

Feb 6, 2014, 8:28:17 AM
to re...@googlegroups.com

Hi Rick,

Thanks once again for helping me with this. I figured out why the regex wasn't working at first: it turns out that the line in question sometimes has a whitespace character at the start. So here is what I did:
^\s*?(\S+\s?)\S+
So the first word(s) are found. Very cool. I still don't know how to tell whether the entire line conforms to the pattern of the data. There are multiple possible layouts and I need to find just this one. Can we use a repeating non-capturing group for the following name(s) and then a capture for the percentage value if it exists? I was also wondering if I made a mistake in the regex I posted, as it seems to be losing the last character of the first name word(s) in the capture group.
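
On the lost last character, a short sketch (Python, assuming the same backtracking behaviour as MS Regular Expressions 5.5) shows what seems to be happening. The \S+ after the capture group must consume at least one character. When the first name is a single word followed by several spaces, the only way to satisfy that is for the group to give back its final letter. The ADJUSTED pattern is a hypothetical alternative, assuming words inside the name are separated by one space. It keeps the whole repetition inside the group.

```python
import re

# The pattern from the post above.
POSTED = re.compile(r'^\s*?(\S+\s?)\S+')

# Hypothetical alternative (an assumption, not from the thread).
ADJUSTED = re.compile(r'^\s*(\S+(?: \S+)*)')

single = ' Chocolate             monkey donkey birdy'
multi = 'The cat in the hat          fiddly diddly dee'

# Group 1 gives up the final "e" so the trailing \S+ has something to match.
print(repr(POSTED.match(single).group(1)))

# With several words, group 1 holds only the first word and one space.
print(repr(POSTED.match(multi).group(1)))

# The alternative captures the whole first name in both cases.
print(repr(ADJUSTED.match(single).group(1)))
print(repr(ADJUSTED.match(multi).group(1)))
```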

Cheers

The Frog

The Frog

Feb 6, 2014, 8:46:01 AM
to re...@googlegroups.com
OK, I have been successful in grabbing each of the names from the line of text, but I am missing the optional value% at the end and I am capturing too much. Here is where I am at so far:
^\s?(\b\S+\b(\s(\S+))*?)(?=\s\s+)

This is also getting a lot of hits on other lines in the data. I'll keep playing...
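
A likely reason for the extra hits is that the lookahead only asks for two or more whitespace characters after the first block and says nothing about the rest of the line. The sketch below (Python, with made-up sample text) shows this, and also uses the fact from the first post that the wanted line always follows a blank line as a cheap pre-filter. The helper candidate_lines is hypothetical. An equivalent for the MS library would need to loop over the lines in the host language.

```python
import re

# The pattern from the post above.
POSTED = re.compile(r'^\s?(\b\S+\b(\s(\S+))*?)(?=\s\s+)')

def candidate_lines(text):
    """Yield non-blank lines whose preceding line is blank."""
    previous_blank = False
    for line in text.splitlines():
        if line.strip() and previous_blank:
            yield line
        previous_blank = not line.strip()

# Invented sample data: only the third line is in the wanted layout.
SAMPLE = (
    'Header line  with two spaces\n'
    '\n'
    'The cat in the hat          fiddly diddly dee        grab a snack\n'
    'Some other layout  follows here\n'
)

hits_everywhere = [l for l in SAMPLE.splitlines() if POSTED.search(l)]
hits_after_blank = [l for l in candidate_lines(SAMPLE) if POSTED.search(l)]

print(len(hits_everywhere))   # every non-blank line here contains a double space
print(hits_after_blank)       # only the line that follows the blank line
```

The blank-line filter narrows things down but does not prove that a line conforms. For that the pattern would still need to describe the whole line up to its end.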

Cheers

The Frog

Rick Quatro

Feb 6, 2014, 8:49:57 AM
to re...@googlegroups.com
Hi, Sorry I haven't been able to respond, as I am in and out of the office. How are you testing the regular expressions? I use RegexBuddy and highly recommend it (or a tool like it).



Prashant

Feb 6, 2014, 8:59:33 AM
to re...@googlegroups.com

I recommend Ultrapico Expresso.

Prashant

