False positives in editing data

RichardOnRails

Nov 19, 2007, 12:39:00 AM

Hi All,

Below is a 33-line program that analyzes a set of names.
The names proper are prefixed with an optional set of period-separated
numbers.
The analysis checks for the following, and flags violations:
1. A number may not start with zeros.
2. No more than two numbers may be prefixed.
3. Names proper must begin with a non-digit

The program works pretty well, except it produces "false positives,"
which I flag on the output displayed following the program. The
problem is around line 29, where I try to determine whether a leading
number has any initial zero by using an RE and testing for nil.

I'd welcome any ideas on how to correct my mistake(s), especially
suggestions for improved style.

Thanks in advance,
Richard

Program
-------------------------------------------------------------------------------
MLN = 2 # MaxLeadingNumbers
input = <<DATA
05Topic 5
1Topic 1
2.002.1Topic 2.2.1
2.1Topic 2.1
2.2.02Topic 2.2.2
DATA
input.each { |sName|
  # Debug
  puts "\n" + "="*10 + "DBG", sName, "="*10+ "DBG"

  # Get leading numbers separated by periods into $1, $2, $3
  sName =~ /^([\d]+)?\.?([\d]+)?\.?([\d]+)?\.?/

  # Save match variables in n[1] ... n[MLN+1]
  n = Array.new
  (1..MLN+1).each { |i| eval %{n[i] = $#{i}} }

  # Debug
  print "DBG0>> "; (1..MLN+1).each { |i| printf("\t#{n[i]}") }; puts
  # puts "n1=#{n[1]}, n2=#{n[2]}, n3=#{n[3]}"

  # Get and check leading, period-separated numbers in dir. names
  (1..MLN+1).each { |i|
    n[i] =~ /^0+/
    i_thNumberExists = n[i] ? true : false
    iThNumbrerHasLeadingZero = i_thNumberExists &&
      (eval("$#{i}.class == NilClass") ? true : false)
    puts "ERROR: Leading zeros in #{n[i]}" if iThNumbrerHasLeadingZero && (i <= MLN)
    puts "ERROR: Depth of hierarchy exceeds #{MLN}" if i_thNumberExists && i > MLN
  }
}

Output
---------------------------------------------

==========DBG
05Topic 5
==========DBG
DBG0>> 05
ERROR: Leading zeros in 05

==========DBG
1Topic 1
==========DBG
DBG0>> 1
ERROR: Leading zeros in 1 [False positive]

==========DBG
2.002.1Topic 2.2.1
==========DBG
DBG0>> 2 002 1
ERROR: Leading zeros in 2 [False positive]
ERROR: Leading zeros in 002
ERROR: Depth of hierarchy exceeds 2

==========DBG
2.1Topic 2.1
==========DBG
DBG0>> 2 1
ERROR: Leading zeros in 2 [False positive]
ERROR: Leading zeros in 1 [False positive]

==========DBG
2.2.02Topic 2.2.2
==========DBG
DBG0>> 2 2 02
ERROR: Leading zeros in 2 [False positive]
ERROR: Leading zeros in 2 [False positive]
ERROR: Depth of hierarchy exceeds 2
>Exit code: 0

RichardOnRails

Nov 19, 2007, 1:39:38 AM

On Nov 19, 12:39 am, RichardOnRails

Hi All,

Problem solved!

I went through a convoluted process to determine whether any of the
initial numbers began with a zero. I simplified it to:

i_thNumberHasLeadingZero = (n[i] =~ /^0+/) ? true : false

I apologize for raising the question. My excuse is that I was
flailing about unable to figure out what was wrong. After I posted,
I took a harder look and realized my big mistake: I needed to use the
KISS method :-)
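
Why the original test fired for every number can be reproduced in a few lines. This is a minimal sketch, assuming the cause was that each use of =~ replaces the match data: once n[i] =~ /^0+/ has run, $1, $2, ... no longer belong to the first regex, and because /^0+/ has no groups they are nil whether or not the number starts with a zero. The method names are hypothetical, introduced only for illustration.

```ruby
# Returns what $1 holds after the leading-zero test has run on +number+.
def group_one_after_zero_test(number)
  "2.1Topic" =~ /^(\d+)/   # first match: $1 is now "2"
  number =~ /^0+/          # second match replaces (or clears) the match data
  $1
end

# The simplified check from the post: rely on the return value of =~ only.
def leading_zero?(number)
  !(number =~ /^0+/).nil?
end

p group_one_after_zero_test("05")  # nil
p group_one_after_zero_test("1")   # nil
p leading_zero?("05")              # true
p leading_zero?("1")               # false
```

Since $1 is nil in both cases, a test of the form "$1.class == NilClass" cannot tell "05" from "1", which matches the output listed in the first post.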

Best wishes,
Richard

Ryan Davis

Nov 19, 2007, 2:23:30 AM

I realize you solved your problem, but there are much more insidious
underlying issues here:

On Nov 18, 2007, at 22:40 , RichardOnRails wrote:

>> MLN = 2 # MaxLeadingNumbers
>>

MLN

>> sName =~ /^([\d]+)?\.?([\d]+)?\.?([\d]+)?\.?/

sName

>> n = Array.new

n

>> print "DBG0>> "; (1..MLN+1).each { |i|
>> printf("\t#{n[i]}") };

nasty compounded lines and obtuse debugging output

>> i_thNumberExists = n[i] ? true : false
>> iThNumbrerHasLeadingZero = i_thNumberExists &&

i_thNumberExists
iThNumbrerHasLeadingZero

ARGH! My eyes! STOP!!!

Ruby is NOT C.

For the love of all that is ruby, please write readable code!

Use English variable names (or whatever your native language is--just
use words). Don't use Hungarian notation, it doesn't help--hell, your
variable names are so ugly you can't even read the misspellings in
them. Don't mix camelcase and underscores. Don't use 'n[i] ? true :
false' when 'n[i]' would do. Don't debug with logging. Write tests.
/[\d]+/ is the same as /\d+/. Those evals are TERRIFYING. Don't use them.

And, most importantly, 2 spaces per indent.

If you follow these suggestions, your code will be much more
understandable and less error prone. Really.
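
The "write tests" advice could look roughly like this for the program in question. It is a sketch under assumptions, not the poster's code: MAX_LEADING_NUMBERS stands in for MLN, "violations" is a hypothetical helper that returns symbols instead of printing, and Minitest (bundled with current Ruby releases) is used as the test framework.

```ruby
require "minitest/autorun"

MAX_LEADING_NUMBERS = 2  # hypothetical readable name for MLN

# Hypothetical helper: returns the rule violations found in one name.
def violations(name)
  prefix  = name[/\A[\d.]*/]   # leading run of digits and dots
  numbers = prefix.split(".")
  found = []
  found << :leading_zero if numbers.any? { |n| n.start_with?("0") }
  found << :too_deep     if numbers.length > MAX_LEADING_NUMBERS
  found
end

class ViolationsTest < Minitest::Test
  def test_clean_name_has_no_violations
    assert_equal [], violations("2.1Topic 2.1")
  end

  def test_leading_zero_is_flagged
    assert_equal [:leading_zero], violations("05Topic 5")
  end

  def test_both_rules_can_fire_together
    assert_equal [:leading_zero, :too_deep], violations("2.002.1Topic 2.2.1")
  end
end
```

Returning data rather than printing is the design choice that makes the checks testable; the debug output in the original then becomes unnecessary.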

Harry Kakueki

Nov 19, 2007, 2:42:26 AM

>
> Problem solved!
>
> I went through a convoluted process to determine whether any of the
> initial numbers began with a zero. I simplified it to:
>
> i_thNumberHasLeadingZero = (n[i] =~ /^0+/) ? true : false
>
> I apologize for raising the question. My excuse is that I was
> flailing about unable to figure out what was wrong. After I posted,
> I took a harder look and realized my big mistake: I needed to use the
> KISS method :-)
>
> Best wishes,
> Richard
>
>

If you solved your problem then you won't need this.
But, just in case you are interested in looking at it from a little
different view....
It is not a complete solution, just an idea.

inp = <<DATA

05Topic 5
1Topic 1
2.002.1Topic 2.2.1
2.1Topic 2.1
2.2.02Topic 2.2.2
DATA

err1 = inp.select{|a| a =~ /^0|\.0/}
puts "A number may not start with zero" unless err1.empty?
print err1
puts
err2 = inp.select{|b| b =~ /^\d+\.\d+\.\d/}
puts "No more than two numbers may be prefixed" unless err2.empty?
print err2
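
The snippet above calls select directly on a String, which worked in the Ruby 1.8 of the time; since Ruby 1.9 a String is no longer Enumerable. A minimal sketch of the same idea for current Ruby, assuming the same sample data (INPUT and lines_matching are hypothetical names):

```ruby
INPUT = <<DATA
05Topic 5
1Topic 1
2.002.1Topic 2.2.1
2.1Topic 2.1
2.2.02Topic 2.2.2
DATA

# Split the text into lines and keep those that match the pattern.
def lines_matching(text, pattern)
  text.lines.map(&:chomp).grep(pattern)
end

zero_errors  = lines_matching(INPUT, /^0|\.0/)
depth_errors = lines_matching(INPUT, /^\d+\.\d+\.\d/)

p zero_errors   # ["05Topic 5", "2.002.1Topic 2.2.1", "2.2.02Topic 2.2.2"]
p depth_errors  # ["2.002.1Topic 2.2.1", "2.2.02Topic 2.2.2"]
```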

Harry

--
A Look into Japanese Ruby List in English
http://www.kakueki.com/ruby/list.html

Raul Parolari

Nov 19, 2007, 5:47:45 AM

RichardOnRails wrote:

> Below is a 33-line program that analyzes a set of names.
> The names proper are prefixed with an optional set of period-separated
> numbers.
> The analysis checks for the following, and flags violations:
> 1. A number may not start with zeros.
> 2. No more than two numbers may be prefixed
> 3. Names proper must begin with a non-digit

> ..

> I'd welcome any ideas on how to correct my mistake(s), especially with
> suggestion for improved style.

Richard,

even after your correction, there are things that we can improve in
your program; the problems I see are:

1) the array of numbers for the prefix is sized to the maximum
(MLN) number of components, rather than to the entries actually present
in the line. This leads to complicated code (evaluating $n variables
beyond the valid ones, etc).
2) the regular expression applied to sName would need to be updated
if MLN changes (in fact, it currently contains more groups than needed).
3) the script does not notice when the prefix contains characters other
than digits and dots. Try, e.g., the prefix '2.!1': it will be
recognized as 2, and the entry considered valid.
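
Point 3 can be checked directly. The sketch below applies the original regex to a name with the prefix '2.!1' and then applies a stricter whole-prefix check; ORIGINAL is copied from the first post, while STRICT_PREFIX, original_captures and prefix_of are hypothetical names used only for illustration.

```ruby
ORIGINAL      = /^([\d]+)?\.?([\d]+)?\.?([\d]+)?\.?/
STRICT_PREFIX = /\A(?:\d+\.?)+\z/   # digits, optionally followed by a dot, repeated

# Captures of the original regex; it always matches because every part is optional.
def original_captures(name)
  name.match(ORIGINAL).captures
end

# Everything before the first letter, or nil when the name has no letter.
def prefix_of(name)
  name[/\A(.*?)[a-zA-Z]/, 1]
end

p original_captures("2.!1Topic")                # ["2", nil, nil]
p prefix_of("2.!1Topic")                        # "2.!1"
p prefix_of("2.!1Topic").match?(STRICT_PREFIX)  # false
p prefix_of("2.1Topic").match?(STRICT_PREFIX)   # true
```

The original pattern silently stops at the "!", so the malformed prefix looks like a valid single number; validating the whole prefix first makes the problem visible.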

So the problem is that the code is very complex, and fragile; the good
news is that we can have a simpler and more robust solution; let us
divide the problem into small steps:

1) we first collect everything until the first letter (not included); we
will consider this the Prefix.
2) we verify that the Prefix has the format that we want: numbers
separated by dots.
3) we then split the Prefix into an array of Numbers.
4) we verify that the Array size does not exceed the MLN max size.
5) we then check if any of the entries begins with 0.
*) oh, and we will use the /x modifier to make the regexps legible

Each of the previous steps will be one line of code, so we should have a
very small script; I will comment the lines with the same number used
above, so you can map every statement with the 'battle plan' above:

class FormatError < Exception
  # used right now only for unrecoverable errors
end

input.each do |line|

  # debug info
  puts "\n" + "="*10 + "DBG", line, "="*10+ "DBG\n"

  # 1) collect everything (not greedy) until first letter
  if line =~ /^ (.*?) [a-zA-Z] /x

    prefix = $1
    puts "prefix= '#{prefix}'"

    # 2) validate format: numbers separated by dots
    if prefix =~ /^ (?:\d+ [.]?)+ $ /x

      # 3) collect the numbers in an array
      arr = prefix.split('.')

      # 4) verify max array size
      puts " (Error: depth of hierarchy > #{MLN}!)" if
        arr.length > MLN

      # 5) any number begins with 0?
      puts " (Error: a nr begins with 0!) " if
        arr.any? { |v| v =~ /^0/ }

      print " Numbers: ", arr.join(', '), "\n"

    else
      raise FormatError, "Prefix format NOT nr.nr.nr.."
    end

  else
    print "the line has no prefix (option or error?)"
  end

end # input

Tested with the same data you used, this program generated the following output:

==========DBG
05Topic 5
==========DBG
prefix= '05'
(Error: a nr begins with 0!)
Numbers: 05

==========DBG
1Topic 1
==========DBG
prefix= '1'
Numbers: 1

==========DBG
2.002.1Topic 2.2.1
==========DBG
prefix= '2.002.1'
(Error: depth of hierarchy > 2!)
(Error: a nr begins with 0!)
Numbers: 2, 002, 1

==========DBG
2.1Topic 2.1
==========DBG
prefix= '2.1'
Numbers: 2, 1

==========DBG
2.2.02Topic 2.2.2
==========DBG
prefix= '2.2.02'
(Error: depth of hierarchy > 2!)
(Error: a nr begins with 0!)
Numbers: 2, 2, 02

As you can see, it seems to work well (it also catches double errors,
which is nice). Notice that if the max array size had to be changed, we
just need to modify the MLN value.

I hope this made sense, and that it helped.
Good luck!

Raul
--
Posted via http://www.ruby-forum.com/.

RichardOnRails

Nov 19, 2007, 9:01:28 PM

Hi Ryan,

> But, just in case you are interested in looking at it from a little
> different view....
> It is not a complete solution, just an idea.

I AM interested, and it's a great idea. Real Ruby rather than my C-
flavored Ruby :-)

Just for our mutual amusement, I replaced your first criterion:

err1 = inp.select{|a| a =~ /^0|\.0/}

with:

err1 = inp.select{|a| a =~ /^0|^\d*\.0/}

to preclude false hits where the text proper might contain ".0", e.g.,
"7How to get 5.05% interest"
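
One caveat about the revised pattern, shown as a sketch: it anchors only the first dot, so a leading zero in a third number is no longer caught. ANY_DEPTH is a hypothetical alternative, not something proposed in the thread; it skips any number of "digits plus dot" groups at the start of the name and then requires a 0.

```ruby
REVISED   = /^0|^\d*\.0/       # the revised pattern from the post above
ANY_DEPTH = /\A(?:\d+\.)*0/    # hypothetical: leading zero at any depth of the prefix

def flagged?(name, pattern)
  !(name =~ pattern).nil?
end

p flagged?("7How to get 5.05% interest", REVISED)    # false
p flagged?("2.002.1Topic 2.2.1", REVISED)            # true
p flagged?("2.2.02Topic 2.2.2", REVISED)             # false
p flagged?("2.2.02Topic 2.2.2", ANY_DEPTH)           # true
p flagged?("7How to get 5.05% interest", ANY_DEPTH)  # false
```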

Bottom line: Your code is what I really need. Thanks, again.

Best wishes,
Richard

Harry Kakueki

Nov 20, 2007, 10:22:26 AM

>
> Hi Ryan,
>
> > But, just in case you are interested in looking at it from a little
> > different view....
> > It is not a complete solution, just an idea.
>
> I AM interested, and it's a great idea. Real Ruby rather than my C-
> flavored Ruby :-)
>
> Just for our mutual amusement, I replaced your first criterion:
>
> err1 = inp.select{|a| a =~ /^0|\.0/}
>
> with:
>
> err1 = inp.select{|a| a =~ /^0|^\d*\.0/}
>
> to preclude false hits where the text proper might contain ".0", e.g.,
> "7How to get 5.05% interest"
>
> Bottom line: Your code is what I really need. Thanks, again.
>
> Best wishes,
> Richard
>
>
>

Yeah, you need to modify the regular expressions and whatever to fit
your requirements.
Like I said, it's not a complete solution, just an idea you can use.

By the way, stop calling me Shirley. :)
And stop calling me Ryan. :)

Good luck,

Message has been deleted

RichardOnRails

Nov 20, 2007, 8:24:48 PM

Hi Harry,

I apologize for the "Ryan" thing. Ryan's was the first response on this
thread, and I got mixed up. I don't know about the "Shirley" thing.

I don't know what's going on there, but I apologize again for any way
I might have erred.

And I fear I owe you a third apology: a minute ago, I started to
reply to your most recent post when I stopped getting a timely
response to my typing. I continued to type, assuming my ISP would
catch up sooner or later, and hit Enter. Suddenly, I got a message
that my half-baked response had been posted. Computers: uggggh!

> Like I said, it's not a complete solution, just an idea you can use.

I know. I just wanted to show you I was giving my full attention to
your message.

Again, thanks for showing how Ruby can be used more effectively for
this kind of validation task.

Best wishes,
Richard

RichardOnRails

Nov 20, 2007, 9:00:02 PM

Hi Raul,

Thanks for your response.

I like your "battle plan". It works well for your exposition here. I
always have to prepare a high-level one when I propose a software
solution to a user.

Harry posted a succinct illustration of applying Ruby to this kind of
validation, which I appreciated. I also appreciate your slightly
longer approach because I need this structure
so that I can make use of data for my purpose beyond validation.

I especially appreciate your showing me how a regex can be written to
handle an arbitrary number of dot-separated numbers (rather than hard-
code distinct sub-expressions).

> if line =~ /^ (.*?) [a-zA-Z] /x

I have one question about this regex. I have a book that I bought in
2002 but never read until today: "Mastering Regular Expressions", 2nd
Ed., from O'Reilly. I haven't been able to find any reference in
there to a question-mark following ".*".

I thought I could simply remove the question-mark. That caused the
match to fail and yield the programmed error msg. I tried omitting
the question-mark and adding a closing "$" to the regex. That made the
parsing fail. So, your question mark is clearly working, but HOW?

I'm fully on board with the rest of your code. Many thanks for it
all!

Regards,
Richard

RichardOnRails

Nov 20, 2007, 9:25:07 PM

Hi Ryan,

Thanks for taking the trouble to critique my code. Since you were
kind enough to take a thorough look at the code and offer detailed
comments, I feel obligated to address each issue you raised.

I agree with some of your responses, don't understand some, and
suspect some are merely a matter of taste. I'm going to tackle them
one by one, but I especially want to thank you for suggesting unit
tests in Ruby, which I haven't used yet (and don't immediately see how
I'd apply them).

> >> MLN = 2 # MaxLeadingNumbers
>
> MLN

Are you suggesting that rather than using an abbreviation for
MaxLeadingNumbers, I should use the longer string? If so, don't you
think that would create unnecessary clutter? Is this more than a
matter of personal taste?

>
> >> sName =~ /^([\d]+)?\.?([\d]+)?\.?([\d]+)?\.?/
>
> sName

Did you draw attention to this because of the Hungarian notation? If
so, do you think I'm unwise to adopt the style once advocated by
Charles Simonyi, super-programmer and co-founder of a giant software
company?

> >> n = Array.new
>
> n
>
> >> print "DBG0>> "; (1..MLN+1).each { |i|
> >> printf("\t#{n[i]}") };
>
> nasty compounded lines and obtuse debugging output

There was one more piece of that line that got cut off somehow:
"puts". If we set aside the tracing issue for a moment, don't
you think it's better to have one-liner debugging statements that can
be easily deleted later, rather than multi-line statements where
something might be overlooked when deleting them later?

> >> i_thNumberExists = n[i] ? true : false
> >> iThNumbrerHasLeadingZero = i_thNumberExists &&
>
> i_thNumberExists
> iThNumbrerHasLeadingZero

To me, "i_thNumberExists" reads naturally in English as "i'th Number
Exists". I started out with iTh ... and thought it less readable. I
eventually replaced all instances of iTh, but posted my code before I
completed that correction. My apologies; I agree that leaving them
inconsistent invites misspellings.

> Ruby is NOT C.

I'm a retired software developer with experience with a lot of
languages, but by far the most experience with C and C++ under
Windows. So it takes a little while to adopt the idioms of a new
language.

> Use English variable names (or whatever your native language is--just
> use words). Don't use hungarian notation, it doesn't help
> --hell, your
> variable names are so ugly you can't even read the misspellings in
> them. Don't mix camelcase and underscores.

I addressed those issues above.

> Don't use 'n[i] ? true : false'
> when 'n[i]' would do.

Agreed. In this case, I wasn't sure that Ruby was honoring that
equivalence, so I wanted to make it explicit.

> Don't debug with logging. Write tests.

GOOD POINT!

> /[\d]+/ is the same as /\d+/.

Thanks.

> Those evals are TERRIFYING. Don't use them.

How else can you loop through $1, $2 ... without repetitive code?

> And, most importantly, 2 spaces per indent.

Agreed. I thought I did that when I posted.

> If you follow these suggestions, your code will be much more
> understandable and less error prone. Really.

I'll keep my fingers crossed :-)

Regards,
Richard

Clifford Heath

Nov 21, 2007, 12:39:05 PM

RichardOnRails wrote:
>>>> sName =~ /^([\d]+)?\.?([\d]+)?\.?([\d]+)?\.?/
>> sName
>
> Did you draw attention to this because of the Hungarian notation? If
> so, do you think I'm unwise to adopt the style once advocated by
> Charles Simonyi, super-programmer and co-founder of a giant software
> company?

Yes. Adamantly, and definitely yes.

It might have been a bad idea when he had it, even though to his
credit he was trying to make the best of a bad situation, where
MS had bought the worst C compiler on the planet because the good
ones weren't for sale - they could make more money *not* selling
to MS. Because of a spate of bugs and bad code churned out by the
MS software factory, many caused by type mismatches on function
parameters that weren't detected either at compile time or at
runtime, Hungarian notation *might* have been a good idea once.

It's definitely *not* a good idea with modern C, and even less of
a good idea with Ruby.

Clifford Heath.

Alex Young

Nov 21, 2007, 1:37:02 PM

Clifford Heath wrote:
> RichardOnRails wrote:
>>>>> sName =~ /^([\d]+)?\.?([\d]+)?\.?([\d]+)?\.?/
>>> sName
>>
>> Did you draw attention to this because of the Hungarian notation? If
>> so, do you think I'm unwise to adopt the style once advocated by
>> Charles Simonyi, super-programmer and co-founder of a giant software
>> company?
>
> Yes. Adamantly, and definitely yes.

Interesting article on why (and where it comes from, and why it might
not have been such a bad idea) here:

http://www.joelonsoftware.com/articles/Wrong.html

--
Alex

Alex Young

Nov 21, 2007, 1:53:48 PM

RichardOnRails wrote:
> On Nov 19, 2:23 am, Ryan Davis <ryand-r...@zenspider.com> wrote:

<snip>

>> Those evals are TERRIFYING. Don't use them.
>
> How else can you loop through $1, $2 ... without repetitive code?

I don't think anyone has replied to this, so...

This is what you've got:

> sName =~ /^([\d]+)?\.?([\d]+)?\.?([\d]+)?\.?/
>
>

> # Save match variables in n[1] ... n[MLN+1]
> n = Array.new
> (1..MLN+1).each { |i| eval %{n[i] = $#{i}} }

You can get an equivalent for n with the String#match method:

irb(main):001:0> name = "2.2.02Topic 2.2.2"
=> "2.2.02Topic 2.2.2"
irb(main):002:0> m = name.match( /^([\d]+)?\.?([\d]+)?\.?([\d]+)?\.?/ )
=> #<MatchData:0x3a3a324>
irb(main):003:0> m[1]
=> "2"
irb(main):004:0> m[2]
=> "2"
irb(main):005:0> m[3]
=> "02"

You should be able to use that to get around your second eval, too.
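
Building on the MatchData object above, the whole list of numbers can be collected without eval. This is a small sketch; PREFIX is the thread's regex with [\d] written as \d, and leading_numbers is a hypothetical helper name.

```ruby
PREFIX = /^(\d+)?\.?(\d+)?\.?(\d+)?\.?/

# Returns the leading numbers that are actually present, in order.
def leading_numbers(name)
  match = name.match(PREFIX)
  match ? match.captures.compact : []   # captures holds $1..$3; compact drops the nils
end

p leading_numbers("2.2.02Topic 2.2.2")  # ["2", "2", "02"]
p leading_numbers("1Topic 1")           # ["1"]
```

MatchData#captures returns the groups as an ordinary Array, so iterating over them needs neither $1, $2, ... nor a counter.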

--
Alex

RichardOnRails

Nov 21, 2007, 8:08:08 PM

Hi Alex,

> m = name.match( /^([\d]+)?\.?([\d]+)?\.?([\d]+)?\.?/ )

Thanks.

Best wishes,
Richard

RichardOnRails

Nov 21, 2007, 8:26:28 PM

Hi Clifford,

> It's definitely *not* a good idea with modern C, and even less of
> a good idea with Ruby.

I don't know anything about Microsoft's choice of compilers. But I
used several C compilers in the '80s, and in all cases found Hungarian
notation helpful. I don't think Microsoft's initial choice of
compilers is relevant to my and others' successful employment of that
convention.

Why would it be a bad idea with modern C compilers or with Ruby? You
offer no reason. All it does is add one or two letters before names!
That doesn't bother any human or compiler or interpreter. To my
ears, it sounds like you're telling me you like chocolate after
learning that I like vanilla.

I'm reminded of George Wallace's assessment of the difference between
Democrats and Republicans: "There's not a dime's worth of difference."

Best wishes,
Richard

Alex Young

Nov 21, 2007, 8:42:05 PM

It depends on the type of Hungarian that you're using, and it's not
clear from your code sample which it is. If you're using an
abbreviation prefix to denote a semantic difference within a type, then
that's (potentially) useful, in both Ruby and C:

us_username = read_unsafe_input()
s_username = sanitise(us_username)

with us_ meaning unsafe and s_ meaning safe, for example. Not something
I'd use myself, but I can see the utility. If it's denoting a class,
then it's not something I can see as useful, in either C or Ruby. In C,
you're duplicating the compiler's type-checking, and in Ruby,
duck-typing means that you shouldn't need to care; it becomes
readability-damaging line noise.

--
Alex

Rick DeNatale

Nov 21, 2007, 8:44:41 PM

On Nov 20, 2007 9:35 PM, RichardOnRails

<RichardDummy...@uscomputergurus.com> wrote:
>
> On Nov 19, 2:23 am, Ryan Davis <ryand-r...@zenspider.com> wrote:
>
> > >> sName =~ /^([\d]+)?\.?([\d]+)?\.?([\d]+)?\.?/
> >
> > sName
>
> Did you draw attention to this because of the Hungarian notation? If
> so, do you think I'm unwise to adopt the style once advocated by
> Charles Simonyi, super-programmer and co-founder of a giant software
> company?

Actually this form of Hungarian notation, which was called System
Hungarian at Microsoft, is NOT what Simonyi originally suggested (and
what was used in the Application Division).

http://talklikeaduck.denhaven2.com/articles/2007/04/09/hungarian-ducks

--
Rick DeNatale

My blog on Ruby
http://talklikeaduck.denhaven2.com/

Todd Benson

Nov 21, 2007, 9:44:44 PM

On Nov 20, 2007 7:24 PM, RichardOnRails

<RichardDummy...@uscomputergurus.com> wrote:
>
> On Nov 20, 10:22 am, Harry Kakueki <list.p...@gmail.com> wrote:
> > By the way, stop calling me Shirley. :)
> > And stop calling me Ryan. :)
> >
> > Good luck,
> >
> > Harry

> Hi Harry,
>
> I apologize for the "Ryan" thing. Ryan was the first response on this
> thread and got mixed up. I don't know about the "Shirley" thing.

Harry is quoting from the movie Airplane (1980) :)

Todd

yermej

Nov 21, 2007, 9:51:22 PM

On Nov 20, 8:00 pm, RichardOnRails
<RichardDummyMailbox58...@uscomputergurus.com> wrote:

> > if line =~ /^ (.*?) [a-zA-Z] /x
>
> I have one question about this regex. I have a book that I bought in
> 2002 but never read until today: "Mastering Regular Expressions", 2nd
> Ed., from O'Reilly. I haven't been able to find any reference in
> there to a question-mark following ".*".
>
> I thought I could simply remove the question-mark. That caused the
> match to fail and yield the programmed error msg. I tried omitting
> the question-mark and add a closing "$" in the regex. That made the
> parsing fail. So, your question mark is clearly working, but HOW?

I can help with this one. Check "Mastering Regular Expressions" for
non-greedy operators. Normally, ".*" will match everything it possibly
can. Adding the "?" causes it to do a minimal match -- it will match
as little as necessary to still fill the requirements. In the above
case, it matches everything until the first letter. Without the "?" it
matches everything until the last letter.

Jeremy

Raul Parolari

Nov 22, 2007, 3:34:34 AM

RichardOnRails wrote:

> Hi Raul,
>
> I like your "battle plan"..

> I especially appreciate your showing me how a regex can be written to
> handle an arbitrary number of dot-separated numbers (rather than hard-
> code distinct sub-expressions).
>
>> if line =~ /^ (.*?) [a-zA-Z] /x
>

> I thought I could simply remove the question-mark.

> So, your question mark is clearly working, but HOW?
>

Richard

I saw that Gavin has given you (in another thread) a general tutorial on
this. I add a simpler explanation just in the context of the problem we
treated;

.* means 'as many characters as possible'

Now, the point 1 of the 'battle plan' was (I quote):

"1) we first collect everything until the first letter (not included);
we
will consider this the Prefix."

So we want to tell the Regexp Engine: "as few characters as possible
until you see a letter (a-zA-Z), then stop right there!".

Let's examine the 2 expressions, with and without the question mark:

(.*?)                                [a-zA-Z]
minimal nr of chars needed until ..  1st letter

(.*)                                 [a-zA-Z]
as many chars as you can possibly
get away with, and then          ..  a letter

An example:

s="2.1Topic 2.1"

md = s.match( /^ (.*?) [a-zA-Z] /x )
md[1] # => "2.1"

md = s.match( /^ (.*) [a-zA-Z] /x )
md[1] # => "2.1Topi"

Have you seen? Both expressions were satisfied, but in different ways:
a) the first (with .*?) tried to find the minimal number of characters
until the first letter, and so it stopped when it found the 'T' of
Topic.

b) the second expression tried to find as many characters as possible,
only bounded by having to then find a letter, so it stopped at the 'c'
of Topic.

With sense of humour, somebody observed that ".*? values contentment
over greed"; and since then the ".*?" were called "not greedy", while
the ".*" were called "greedy".

[I stop here as Gavin described to you ".+" & co].

One piece of advice: the key to learning regular expressions is to read
a good book (just trying them drives one insane) while experimenting
(just reading drives one insane too). The time spent pays you back very
quickly at the first serious exercise (as you can develop a
'battle-plan' rather than a 'guerrilla war' with the regexps).

I am glad that you found the script useful, and I hope that this helped
too.

RichardOnRails

Nov 23, 2007, 11:53:52 AM

Hi Jeremy,

> non-greedy operators.

Thanks for that detailed explanation. I tried Googling for "(.*?)
regular expression", but Google thinks it's too weird to actually
include "(.*?)" in a search ... and I don't blame them. I checked
Amazon to see if there is a later edition of "Mastering ...", to no
avail.

Again, thanks for taking the time to respond.

Regards,
Richard

RichardOnRails

Nov 24, 2007, 2:17:20 PM

Hi Raul,

Thank you very much for your expanded analysis.

> I saw that Gavin has given you (in another thread) ...

I started a new thread on the "(.*?)" because this thread was getting
too long. And Gavin turning me on to "greedy" was a big boost. That
let me find some relevant stuff in "Mastering Regular Expressions, 2nd
ed."

> An example: ...

Your example is great. I went back to Hal Fulton's "The Ruby Way, 2nd
ed." and http://www.ruby-doc.org/core/classes/Regexp.html for
additional Regexp#match documentation.

Notwithstanding your exposition and the documentation cited, my
reptilian brain refuses acceptance on this issue. But by running the
examples given and some of my own construction, I should get over
this hump. (I wrote my own NFSA in C for a client's application
roughly 30 years ago, so I should be equal to the task.)

I'm not going to expose my ignorance with any further questions on this
matter. I'll do my homework :-)

With thanks and best wishes,
Richard

RichardOnRails

Nov 26, 2007, 2:51:33 PM

Hi Alex,

> abbreviation prefix to denote a semantic difference within a
> type, then that's (potentially) useful, in both Ruby and C:
> us_username = read_unsafe_input()
> s_username = sanitise(us_username)

I like that.

> If it's denoting a class,
> then it's not something I can see as useful

I like to know merely by inspection whether a referent denotes an
integer, a string, a hash or an array of such things. I'd like to
avoid "Syntax error" simply because I failed to include a to_s, to_i,
[] or whatever. I really can't see why a prefixed lower-case letter
or two before a camel-case object name can create so much discussion
irrelevant to the question at hand.

Maybe I'll have to "sanitize" all the code I post to exemplify a
coding issue.

Thank you for your response, notwithstanding my lack of total
agreement.

Best wishes,
Richard

RichardOnRails

Nov 26, 2007, 3:24:46 PM

On Nov 21, 8:44 pm, Rick DeNatale <rick.denat...@gmail.com> wrote:
> On Nov 20, 2007 9:35 PM, RichardOnRails
>

Hi Rick,

I loved your blog. Thanks for posting it and informing me about it.

I think my usage of "Hungarian" is consistent with Simonyi's intent, at
least as I understand it. In any case, I find my usage helpful, as
I mentioned to Alex Young on this thread, though I may have to
sanitize future posts to avoid people who don't respond to what I
mean but instead waste time on how I express my question.

Best wishes,
Richard

RichardOnRails

Nov 26, 2007, 3:32:49 PM

On Nov 21, 9:44 pm, Todd Benson <caduce...@gmail.com> wrote:
> On Nov 20, 2007 7:24 PM, RichardOnRails
>

> <RichardDummyMailbox58...@uscomputergurus.com> wrote:
>
> > On Nov 20, 10:22 am, Harry Kakueki <list.p...@gmail.com> wrote:
> > > By the way, stop calling me Shirley. :)
> > > And stop calling me Ryan. :)
>
> > > Good luck,
>
> > > Harry
> > Hi Harry,
>
> > I apologize for the "Ryan" thing. Ryan was the first response on this
> > thread and got mixed up. I don't know about the "Shirley" thing.
>
> Harry is quoting from the movie Airplane (1980) :)
>
> Todd

Thanks, Todd. That was over my head. "Airplane" was not a movie I'd
run to see :-)

Regards,
Richard

Alex Young

Nov 26, 2007, 3:56:05 PM

RichardOnRails wrote:
<snip>

> Hi Alex,
>
>> abbreviation prefix to denote a semantic difference within a
>> type, then that's (potentially) useful, in both Ruby and C:
>> us_username = read_unsafe_input()
>> s_username = sanitise(us_username)
>
> I like that.
>
>> If it's denoting a class,
>> then it's not something I can see as useful
>
> I like to know merely by inspection whether a referent denotes an
> integer, a string, a hash or an array of such things. I'd like to
> avoid "Syntax error" simply because I failed to include a to_s, to_i,
> [] or whatever.

You'll tend to find that types are much less relevant in Ruby than in C.
The actual class of an object is much less important than the methods
it responds to, and you won't get a syntax error unless the syntax is
actually wrong; this won't be a problem with variables because of the
lack of compile-time type checking. I've tried the method you're
espousing myself, and it didn't actually help me at all. I found it
just wasn't worth the effort. However, I'm more than willing to accept
that it's a difference between your coding style and mine rather than
any fundamental problem with the concept that made the difference.
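
The point about syntax errors can be illustrated with a short sketch (both method names are hypothetical). Leaving out a to_s is not a syntax error in Ruby: the file parses and runs, and the mistake shows up as a TypeError only when the offending line executes with an incompatible object.

```ruby
# Explicit conversion: works for any object that responds to to_s.
def describe(count)
  "count = " + count.to_s
end

# Missing conversion: parses fine, fails at run time for non-String arguments.
def describe_without_conversion(count)
  "count = " + count
rescue TypeError => e
  "TypeError: #{e.message}"
end

puts describe(3)                        # count = 3
puts describe_without_conversion("3")   # count = 3
puts describe_without_conversion(3)     # prints a line starting with "TypeError:"
```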

In terms of posting code here, the most important thing is to make it
readable. Most people won't know what your Hungarian prefixes mean, so
they're just line noise to them.

--
Alex

Raul Parolari

Nov 26, 2007, 4:23:08 PM

RichardOnRails wrote:

> I like to know merely by inspection whether a referent denotes an
> integer, a string, a hash or an array of such things. I'd like to
> avoid "Syntax error" simply because I failed to include a to_s, to_i,
> [] or whatever. I really can't see why a prefixed lower-case letter
> or two before a camel-case object name can create so much discussion
> irrelevant to the question at hand.

Richard,

I always found fascinating the issue of 'how we name things', and not
only for philosophical reasons; personally I think that the horrendous
amount of time spent today in what is called "Testing Dept" is due in
part to problems like that.

A bad and careless naming methodology (when there is one!) leads
(especially in a project where people share code) to subtle errors,
flawed assumptions, and ultimately to failures (unfortunately, when there
is somebody put in charge of the 'naming standards', he is often very
politically correct, but not the brightest guy around, and the result is
even worse than 'no standards').

One person who wrote something intelligent about this subject is Damian
Conway in his book "Perl Best Practices" (ok, I will get the usual
parochial boos for naming that language, but ok, life continues),
specifically chapter 3 "Naming Conventions".

Even if the examples are in Perl syntax, the substance goes beyond. He
suggests something different than your approach; the name should
indicate not so much the class/type, but the MEANING of the data
structure; for example (of course I will not use Perl syntax, and avoid
the examples that make sense for Perl only):

# scalars
running_total games_count # (rather than 'total','count')

# booleans
is_valid has_end_tag loading_finished

# arrays
events handlers unknowns
# the iteration var
event handler unknown

# hashes
title_of count_for sales_from isbn_from

He even discusses the role of 'nouns' and 'adjectives' in names.. (a
delight to read!).

The emphasis is on the MEANING of the data, not on the 'class/type': do
you see? But the objective is similar to yours: to grant somebody looking
at somebody else's code (translated: ourselves 6 months later!) at
least a hope of vaguely understanding what is going on!
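
As a rough illustration of how those conventions might look in Ruby, here is a small sketch. The data and every name in it are made up for the example; only the naming pattern is taken from the list above.

```ruby
# Hypothetical data; only the naming pattern matters here.
scores = [10, 25, 7]            # array: plural noun
games_count = scores.size       # scalar: says what is being counted
running_total = 0               # scalar: says which total it is

scores.each do |score|          # iteration variable: singular of the array name
  running_total += score
end

is_valid = running_total > 0    # boolean: reads as a yes/no question

# hash: the name reads naturally together with its key
title_of = { "book-1" => "First Title", "book-2" => "Second Title" }

puts "#{games_count} games, total #{running_total}, valid: #{is_valid}"
puts title_of["book-1"]
```

Read aloud, `title_of["book-1"]` and `scores.each do |score|` say what they do without any type prefix.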

You may want to glance at it, if you find that approach of interest.

Raul

Names are but noise and smoke,
obscuring heavenly light

Johann Wolfgang von Goethe, "Faust: Part I"

RichardOnRails

Nov 26, 2007, 5:24:11 PM

Hi Raul,

> read a good book (just trying them drives one insane)
> while experimenting

I certainly agree with regard to REs. And I've got "Mastering Regular
Expressions, vol. 2" by Friedl.

> (just reading drives one insane too).

Well, they haven't carted me off yet :-)

As I've said, your approach works great. But I do want to
experiment, too. So I tried the following, and I'm hopeful that you
can tolerate another question:

Program
-------
input = <<DATA
05Topic 5
2.002.1Topic 2.2.1
DATA

input.each do |line|
  line =~ /^ (\d+[\.]?)+ [^\.\d] /x
  puts line
  puts $1, $2
  puts
end

Output
------
05Topic 5
05
nil

2.002.1Topic 2.2.1
1
nil

Question
--------
I'm puzzled by "1" in the second output, because the "^" in the RE
specifies that a match must occur at the first character. I expected
to get $1=2, at least, and hopefully $2=002 and $3=1, though I was
willing to work on the last two items.

Am I wrong about the caret?

RichardOnRails

Nov 26, 2007, 6:27:01 PM

Hi Raul,

I forgot to tell you that I finally understand your second example.

> md = s.match( /^ (.*) [a-zA-Z] /x )
> md[1] # => "2.1Topi"

Without the question mark, in principle, the ".*" initially consumes
all the characters, but then it sees the match fails, because there's
nothing left to match the "[a-zA-Z]". So the ".*" sort of "backs off" and
satisfies itself with "2.1Topi", leaving the "c" to satisfy "[a-zA-Z]".

Cool. Actually, I read that in "Mastering Regular Expressions, vol.
2", but it really didn't settle into my Weltanschauung. But I think I've
got it now!

Furthermore, the "non-greedy question mark" says "consume only as much
as you need in order to satisfy the total RE". So "(.*?)" needs to
consume all the characters up to something satisfying the "[a-zA-Z]",
which is the "T".

The one I settled on is:

s="2.1Topic 2.1"
md = s.match( /^ ([\.\d]*) [^\.\d] /x )
#md[0]=2.1T
#md[1]=2.1
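
For anyone who wants to see the greedy and non-greedy behaviour side by side, a minimal sketch follows. The variable names (`greedy`, `lazy`, `strict`) are made up for illustration; the patterns are the ones discussed in this thread.

```ruby
s = "2.1Topic 2.1"

greedy = s.match(/^ (.*)     [a-zA-Z] /x)  # grabs everything, then backs off
lazy   = s.match(/^ (.*?)    [a-zA-Z] /x)  # grows only until a letter can follow
strict = s.match(/^ ([.\d]*) [^.\d]   /x)  # accepts digits and dots only

p greedy[1]   # => "2.1Topi"
p lazy[1]     # => "2.1"
p strict[0]   # => "2.1T"
p strict[1]   # => "2.1"
```

The last two patterns capture the same prefix here, but the strict one also states what the prefix is allowed to contain.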

Raul Parolari

Nov 26, 2007, 8:36:20 PM

RichardOnRails wrote:

>
> I forgot to tell you that I finally understand your second example.
>
>> md = s.match( /^ (.*) [a-zA-Z] /x )
>> md[1] # => "2.1Topi"
>
> Without the question mark, in principle, the ".*" initially consumes
> all the characters, but then it sees the match fails, because there's
> nothing left to match the "[a-zA-Z]". So the ".*" sort of "backs off" and
> satisfies itself with "2.1Topi", leaving the "c" to satisfy "[a-zA-Z]".

Very good, Richard!
It is a question of when one is 'content'; if you need a metaphor to
remember it, think of a Wall Street banker (.*, .+) vs a Franciscan monk
(.*?, .+?)
:-)

> The one I settled on is:
>
> s="2.1Topic 2.1"
> md = s.match( /^ ([\.\d]*) [^\.\d] /x )
> #md[0]=2.1T
> #md[1]=2.1

I see that you have solved the problem in your previous post (that I
could not reply to), when you wrote (removing all other code):

s = "2.002.1Topic 2.2.1"

s =~ /^ (\d+[.]?)+ [^\.\d] /x

I must confess: I was stunned myself that it did not work; foolish of
us, in fact it was working, but you failed to collect the bounty! You
needed parentheses to include the '+'!

s =~ /^ ((\d+[.]?)+) [^\.\d] /x

p $1, $2 # => "2.002.1", "1"

However it is better to avoid collecting also the inner results as they
overwrite each other in $2 and then confuse us (that's the reason that
you saw the last digit captured above..); so let's use the '?:' trick,
to avoid writing in $2, where instead we will capture the 'non
digits/dots' that come after:

s =~ /^ ((?:\d+[.]?)*) ([^\.\d]+) /x

p $1, $2 # => "2.002.1", "Topic "
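
The point about the inner group overwriting itself can be checked with a short sketch. `repeated` and `wrapped` are names chosen for this example only; it also shows that the caret did anchor the match at the first character.

```ruby
s = "2.002.1Topic 2.2.1"

# Repeated capturing group: every pass of the '+' overwrites the capture,
# so only the last pass survives in group 1.
repeated = s.match(/^ (\d+[.]?)+ [^.\d] /x)
p repeated[0]   # => "2.002.1T"  (the match does start at the first character)
p repeated[1]   # => "1"

# Non-capturing inner group inside a capturing outer group.
wrapped = s.match(/^ ((?:\d+[.]?)+) ([^.\d]+) /x)
p wrapped[1]    # => "2.002.1"
p wrapped[2]    # => "Topic "
```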

Do you see it? I think you do. Now, to finish, let's examine how you
solved the problem in this post:

> s="2.1Topic 2.1"
> md = s.match( /^ ([.\d]*) [^\.\d] /x )

Ah, you resorted to 'pragmatism'.. you said: "the bloody '(\d+[.]?)+'
does not work, so I will change it". This was ok, but do you see the
difference between:

((?:\d+[.]?)*) # I changed + -> * to compare

([.\d]*)

aside from the fact that the second one is easier to read? (You may want
to stop reading and think about this, as this is your test to graduate from
"intermediate level regexp" :-)

Ok: if they could speak, they would say respectively:
1) I want 0 or more sequences of (digits followed optionally by a dot)
2) I want 0 or more combinations of digits and dots as they come

Do you see?
both would match: "2.002.1" but the 2nd would also match "...1..37"!
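
A small sketch of that difference, assuming we anchor both patterns with \A and \z so that the whole string has to fit (the anchors and the names `structured` and `loose` are additions for this example, not part of the patterns above):

```ruby
structured = /\A (?:\d+[.]?)* \z/x   # sequences of (digits, optional dot)
loose      = /\A [.\d]*       \z/x   # any mix of digits and dots

["2.002.1", "...1..37"].each do |candidate|
  puts "#{candidate}: structured=#{structured.match?(candidate)} " \
       "loose=#{loose.match?(candidate)}"
end
```

Without the anchors both patterns succeed on any string, because each can match zero characters; the difference only shows when the pattern is forced to cover the whole prefix.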

The last question you had was: how do I pick up the digits once I
collected the "2.002.1"? Study scan in Pickaxe and then do:

str = "2.002.1"

str.scan(/ (\d+) /x) # => [["2"], ["002"], ["1"]]
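
Putting the two steps together (capture the prefix, then scan it), the checks from the original post could be sketched roughly as below. `check_name` and `MAX_DEPTH` are hypothetical names, and treating a lone "0" as acceptable is an assumption, since the original rules only say a number may not start with zeros.

```ruby
MAX_DEPTH = 2  # plays the role of MLN in the original program

# Returns a list of error messages for one name (empty when the name is fine).
def check_name(line)
  prefix  = line[/\A (?:\d+[.]?)* /x]   # "" when there is no numeric prefix
  numbers = prefix.scan(/\d+/)          # e.g. ["2", "002", "1"]

  errors = []
  numbers.each do |number|
    # Assumption: "0" alone is allowed, "05" or "002" is not.
    errors << "Leading zeros in #{number}" if number =~ /\A0\d/
  end
  errors << "Depth of hierarchy exceeds #{MAX_DEPTH}" if numbers.size > MAX_DEPTH
  errors
end

["05Topic 5", "2.002.1Topic 2.2.1", "2.1Topic 2.1"].each do |name|
  puts name
  check_name(name).each { |error| puts "  ERROR: #{error}" }
end
```

No match variables or eval are needed, so there is nothing left for a later match to overwrite.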

All right, let's call it a Regexp day,

MonkeeSage

Nov 26, 2007, 9:14:01 PM

Hi Richard,

Here's a cheat-sheet for ruby regular expression syntax:

http://www.geocities.jp/kosako3/oniguruma/doc/RE.txt

Regards,
Jordan

RichardOnRails

Nov 26, 2007, 10:13:02 PM

Hi Jordan,

> http://www.geocities.jp/kosako3/oniguruma/doc/RE.txt

Thanks. I'm running ruby 1.8.2 (2004-12-25) [i386-mswin32]. How can
I tell if it uses Oniguruma RE ver.5.6.0?

"gem list oniguruma -b" gave me:
*** LOCAL GEMS ***
[nothing]
*** REMOTE GEMS ***
oniguruma (1.1.0, 1.0.1, 1.0.0, 0.9.1, 0.9.0)

Judging by this result, I'd say the "5.6.0" is the version of the
Cheat Sheet itself; and that I don't have Oniguruma installed.

I've got The Ruby Way, ver. 2, that covers Oniguruma. But I'm fairly
new to Ruby, so I wonder whether stepping up to Oniguruma is
prudent.

Any ideas?

Regards,
Richard

MonkeeSage

Nov 26, 2007, 10:32:01 PM

On Nov 26, 9:13 pm, RichardOnRails wrote:

> Thanks. I'm running ruby 1.8.2 (2004-12-25) [i386-mswin32]. How can
> I tell if it uses Oniguruma RE ver.5.6.0?

You don't actually need oniguruma, it's the same syntax as class
Regexp in ruby 1.8 (well, a few things don't work, but 99% does).
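
One way to find out which cheat-sheet features a given interpreter accepts is simply to try compiling them. The sketch below does that; `regexp_supported?` is a made-up helper name, and the two sample patterns are picked on the assumption that look-behind and named groups are among the things the 1.8 engine lacks (Oniguruma became the built-in engine with Ruby 1.9).

```ruby
# Returns true when this interpreter's regexp engine accepts the pattern.
def regexp_supported?(source)
  Regexp.new(source)
  true
rescue RegexpError
  false
end

puts "Ruby #{RUBY_VERSION}"
puts "look-behind:  #{regexp_supported?('(?<=a)b')}"
puts "named groups: #{regexp_supported?('(?<year>\d+)')}"
```

Building the pattern from a string at run time matters here: a regexp literal the engine rejects would stop the whole script from parsing.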

Regards,
Jordan

RichardOnRails

Nov 27, 2007, 7:56:52 PM

Hi Jordan,

> You don't actually need oniguruma, it's the same syntax as class
> Regexp in ruby 1.8 (well, a few things don't work, but 99% does).

Great! Thank you very much for the Cheat Sheet.

Best wishes,
Richard

RichardOnRails

Nov 27, 2007, 8:25:05 PM

Hi Raul,

Thank you for your further support of my obstinacy. Your help has
guided me to the solution I wanted. Your original one is succinct,
perhaps even elegant in that it decomposes the problem into two sub-
problems which admit of essentially one-line solutions. While I truly
appreciate that approach, I wanted to find a "natural" solution,
which is the one included below. It has one caveat: it's aimed at
processing files of only a few megabytes. That said, I'd be pleased
to hear of any downsides you may foresee.

> > I forgot to tell you that I finally understand your second example.

> [snip]
> Very good, Richard!
> It is a question on when one is 'content'; if you need a metaphor ...

Thanks. I've got that stuff wired into my brain now.

> I must confess: I was stunned myself that it did not work; foolish of
> us, in fact it was working, but you failed to collect the bounty! you
> needed parenthesis to include the '+'!

That approach is old news, now that I've conceived of my "natural"
approach.

> However it is better to avoid collecting also the inner results as they
> overwrite each other in $2 and then confuse us

Understood! As you'll see, I avoided that pitfall below.

[snip]

> Do you see it? I think you do.

Quite so.

[snip]

> This was ok, but do you see the
> difference between:
>
> ((?:\d+[.]?)*) # I changed + -> * to compare
>
> ([.\d]*)

[snip]
> Do you see?
For sure!

> All right, let's call it a Regexp day,

I'll drink to that!

With Thanks and Best Wishes, I remain
Yours truly,
Richard

# "Natural" Solution
input = <<DATA
05Topic 05
1.0Topic 1.0
2.002.1Topic 2.2.1
3.15.26.37Topic 3.15.26.37
DATA

MaxDepth = 5
sRE = "^"
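# Build one capture group per possible level: digits, then an optional dot
(1..MaxDepth).each { |i|
  sRE << ' (\d*)(?:\.?)'
}
# The name proper: starts with something that is neither a dot nor a digit
sRE += ' ([^\.\d].*)'
re = Regexp.new(sRE, Regexp::EXTENDED)

input.each { |line|
  puts '='*10
  puts line
  puts '='*10

  # puts re.to_s # Debug
  md = line.match( re )
  (0..MaxDepth+1).each { |i|
    puts "md[#{i}] = " + md[i] if md[i]
  }
  puts
}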

Raul Parolari

Nov 28, 2007, 2:25:23 AM

RichardOnRails wrote:
>
> Thank you for your further support of my obstinacy. Your help has
> guided me to the solution I wanted. Your original one is succinct,
> perhaps even elegant in that it decomposes the problem into two sub-
> problems which admit of essentially one-line solutions. While I truly
> appreciate that approach, I wanted to find a "natural" solution,
> which is the one included below.

Hi, Richard

you mention your 'obstinacy'... and indeed, you have found a way to
implement with Regexps your original design (you did not fool me! :-);
great ingenuity.

However, as much as I am stunned by your progress (your original program
at the top of this post and this one seem like Dante going from Inferno
to Paradiso), I must be frank.

I do not like building arrays to meet some 'maximum threshold', leaving
portions of them empty; it just does not make me 'happy' (in the
Matz/Ruby sense, do you understand?). Of course, it is just an array of
5 positions, but it does not matter; it is just ecologically wrong for
me.
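
To make the 'empty positions' concrete, here is a sketch that builds the same kind of fixed-depth pattern and prints what it captures. It is a compact restatement for illustration, not the program as posted; `MAX_DEPTH` and `source` are names chosen for this example.

```ruby
MAX_DEPTH = 5

# Five "(digits)(optional dot)" slots, then the name proper.
source = "^" + ' (\d*)(?:\.?)' * MAX_DEPTH + ' ([^\.\d].*)'
re = Regexp.new(source, Regexp::EXTENDED)

md = "2.002.1Topic 2.2.1".match(re)
p md.captures
# => ["2", "002", "1", "", "", "Topic 2.2.1"]

# The unused slots are empty strings, not nil, so they have to be filtered out.
p md.captures[0, MAX_DEPTH].reject(&:empty?)
# => ["2", "002", "1"]
```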

I however understand your feeling towards my solution (and the contrast
you create with your 'natural one'); got it! but then I invite you to
explore scan with \G; look at this, that may be doing something more
'natural':

# \G 'anchors' start of next search to end of previous one
re_prefix = %r/\G (\d+ [.]?) /x

input.each { |line| p line.scan(re_prefix).flatten }

Output:
["05"]
["1", "0"]
["2", "002", "1"]
["3", "15", "26", "37"]

Small (1 line!), fast, precise: a beauty.

It is not the complete solution as '\G scan' in Ruby does not allow you
to change the regexp without interrupting the job (there are sad
workarounds for it), so the job is not complete. The library
StringScanner ('strscan') is of interest, as it solves this problem
nicely (and is in C, so is fast).
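
A minimal StringScanner sketch, assuming we want the leading numbers and the name proper from each line. `split_name` is a hypothetical helper; `scan`, `skip` and `rest` are StringScanner methods from the standard library.

```ruby
require 'strscan'

# Returns [leading_numbers, name_proper] for one line.
def split_name(line)
  scanner = StringScanner.new(line.chomp)
  numbers = []

  # Each step uses a different regexp, continuing where the last one stopped.
  while (number = scanner.scan(/\d+/))
    numbers << number
    scanner.skip(/[.]/)   # optional separator; returns nil when absent
  end

  [numbers, scanner.rest]
end

p split_name("2.002.1Topic 2.2.1\n")
# => [["2", "002", "1"], "Topic 2.2.1"]
```

Because the scanner remembers its position, the rest of the line is available without a second pass over the string.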

Perhaps, examine \G and/or StringScanner, and see if you can find a
solution that meets what you are looking for, without (as seen from me)
imperfections.

Congrats for your progress (in 1 week!)

Raul

Some people, when confronted with a problem, think:
"I know, I'll use regular expressions".
Now they have two problems.

Jamie Zawinski

:-)

Raul Parolari

Nov 28, 2007, 2:40:41 AM

Just to correct the position of a parenthesis above:

> re_prefix = %r/\G (\d+ [.]?) /x

re_prefix = %r/\G (\d+) [.]? /x
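
With the corrected pattern, the whole run can be sketched as below. One assumption: it uses `each_line`, since `String#each` from Ruby 1.8 is not available in later versions; `prefixes` is a name added for the example.

```ruby
input = <<DATA
05Topic 05
1.0Topic 1.0
2.002.1Topic 2.2.1
3.15.26.37Topic 3.15.26.37
DATA

# \G anchors each search at the end of the previous match, so the scan
# stops at the first character that is not part of the numeric prefix.
re_prefix = %r/\G (\d+) [.]? /x

prefixes = input.each_line.map { |line| line.scan(re_prefix).flatten }
prefixes.each { |numbers| p numbers }
```

Without \G the scan would carry on into the name proper and pick up the digits of "Topic 2.2.1" as well.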

Raul

RichardOnRails

Nov 28, 2007, 9:44:00 PM

Hi Raul,

> > re_prefix = %r/\G (\d+ [.]?) /x

I have learned something: I saw immediately that the mistyped version
was not what you intended because the dots would have been captured.
I applied your correction and things worked as advertised.

> ... you have found a way to implement with Regexps your

> original design (you did not fool me! :-);

:BG

> great ingenuity [snip] stunned by your progress

Thanks for the compliments, but they're not merited in this respect:

I wrote my first program circa 1955 (on paper only; no execution)
after receiving a letter from a former high-school classmate
announcing that he had encountered new-fangled machines at Princeton
called "computers". He furthermore recounted their instruction set.
I was hooked. After grinding out a degree from night-college and
earning an NSF graduate fellowship in math, I finally got a job
programming a real computer, which I continued until I retired a few
years ago.

> I must be frank.

Absolutely. I've invited that, and am also impressed with your
gracious approach.

> I do not like building arrays to meet some 'maximum threshold', leaving
> portions of them empty; it just does not make me 'happy' (in the
> Matz/Ruby sense, do you understand?). Of course, it is just an array of
> 5 positions, but it does not matter; it is just ecologically wrong for
> me.

I acknowledge and share the aesthetic validity of your displeasure.

> I however understand your feeling towards my solution (and the contrast
> you create with your 'natural one'); got it! but then I invite you to
> explore scan with \G; look at this, that may be doing something more
> 'natural':

> # \G 'anchors' start of next search to end of previous one
> re_prefix = %r/\G (\d+ [.]?) /x
>
> input.each { |line| p line.scan(re_prefix).flatten }

> Small (1 line!), fast, precise: a beauty.

I agree fully!

Of course I couldn't leave "well enough" alone, so here's my mod:

input.each { |line|
  line.scan(re_prefix).flatten.each { |e| printf("%s ", e) }
  puts
}

> Perhaps, examine \G and/or StringScanner, ...

I took a fast look at ruby-doc.org/core/ ... looks good. Thanks

> see if you can find a
> solution that meets what you are looking for, without (as seen from me)
> imperfections.

Your latest approach suits my requirement (and taste) perfectly. I'm
off now to continue work on the project I'm developing, which you
might find interesting. Since it's off-topic, send me an email if you
want details. (My email-address looks artificial in order to deter
spammers, but I do have a mail box for it which I only check
sporadically unless I anticipate legitimate email.)

> Some people, when confronted with a problem, think:
> "I know, I'll use regular expressions".
> Now they have two problems.

:BG

Again, thank you and
Best Wishes,
Richard