[QUIZ] NDiff (#46)

Ruby Quiz

unread,

Sep 9, 2005, 9:06:46 AM9/9/05

to

The three rules of Ruby Quiz:

1. Please do not post any solutions or spoiler discussion for this quiz until
48 hours have passed from the time on this message.

2. Support Ruby Quiz by submitting ideas as often as you can:

http://www.rubyquiz.com/

3. Enjoy!

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

by Bill Kleb

This week's quiz is to write a version of the diff command that compares files
numerically. In Unix-speak,

$ ndiff --help
Usage: ndiff [options] file1 file2

Numerically compare files line by line, numerical field by numerical field.

-d INT --digits INT Maximum number of significant digits that
should match. (default: 0)
-h --help Output this help.
-q --quiet No output, just exit code.
-s --statistics Provide comparison statistics only. (extra credit)
-t DBL --tolerance DBL Tolerate <= DBL distance between numbers.
(default: 0.0)

For example, given fileA,

1.000001
2.00
-3
Cy=0.11278889E-01 Cx=-1.343e+02

And fileB,

1.000000
1.99
-3.4
Cy=0.11278890E-01 Cx=-1.343e+02

the following scenarios could play out:

$ ndiff fileA fileB
1,4c1,4
< 1.000001
< 2.00
< -3
< Cy=0.11278889E-01 Cx=-1.343e+02
---
> 1.000000
> 1.99
> -3.4
> Cy=0.11278890E-01 Cx=-1.343e+02

$ ndiff -t 0.000001 fileA fileB
2,3c2,3
< 2.00
< -3
---
> 1.99
> -3.4

$ ndiff --tolerance 0.01 fileA fileB
3c3
< -3
---
> -3.4

$ ndiff --digits 1 fileA fileB (zero exit code)

$ ndiff -q fileA fileB (non-zero exit code)

and, for extra credit,

$ ndiff --statistics fileA fileB
Numbers compared: 5
Distance range: 0.0..0.4
Average distance: 0.99987e-01 [guess]
Mean distance: 1.0e-06 [guess]

FWIW, the results of this quiz will be used by NASA's FUN3D
team for regression testing aerothermodynamic simulation
software.

Hugh Sasse

unread,

Sep 9, 2005, 9:26:17 AM9/9/05

to

It would seem useful to be produce a file of numerical differences
as well. Writing the subsequent patch program would open another
can of worms, but difference plots definately have their worth.

Hugh

James Edward Gray II

unread,

Sep 9, 2005, 10:06:48 AM9/9/05

to

That's an interesting point. Maybe we can attract Bill's attention
and he'll enlighten us about how NASA would use the data...

James Edward Gray II

Hugh Sasse

unread,

Sep 9, 2005, 10:19:27 AM9/9/05

to

On Fri, 9 Sep 2005, Ruby Quiz wrote:

> Cy=0.11278889E-01 Cx=-1.343e+02
>
> And fileB,
>
> Cy=0.11278890E-01 Cx=-1.343e+02

Named fields are an interesting case: what should happen for
Cy=0.11278889E-01 Cx=-1.343e+02

And fileB,

Cx=0.11278890E-01 Cy=-1.343e+02 Cz=2.9979e+08
?
Fields named in a different order, and an extra field introduced.

Hugh

Jim Freeze

unread,

Sep 9, 2005, 2:34:40 PM9/9/05

to

Hi

To help the quiz writers focus on the 'the quiz' and not the administrative
part of the code and to help promote CommandLine, here is a little snippet
that some may find useful to get started.

#!/usr/bin/env ruby

require 'rubygems'
require 'commandline'

class NDiffApp < CommandLine::Application
def initialize
author "Jim Freeze"
copyright "2005 Jim Freeze"
synopsis "[-hqs] [-d INT] [-t DBL] file1 file2"
short_description "Numerically compare files line by line, "+

"numerical field by numerical field."

option :names => %w(--digits -d),
:opt_description => "Maximum number of significant digits that "+
"should match. (default: 0)",
:arg_description => "INT",
:opt_found => get_arg,
:opt_not_found => "0"

option :names => %w(--tolerance -t),
:opt_description => "Tolerate <= DBL distance between numbers. "+
"(default: 0.0)",
:arg_description => "DBL",
:opt_found => get_arg,
:opt_not_found => "0.0"

option :flag,
:names => %w(--statistics -s),
:opt_description => "Provide comparison statistics only.
(extra credit)"

option :flag,
:names => %w(--quiet -q),
:opt_description => "No output printed to screen. just exit code.??"

option :help

expected_args :file1, :file2
end

def main
NDiff.new(@file1, @file2,
@option_data["--digits"].to_i,
@option_data["--tolerance"].to_f,
@option_data["--statistics"],
@option_data["--quiet"])
end

end#class NDiffApp

class NDiff
# Your code here
def initialize(file1, file2,
digits=0,
tolerance=0.0,
statistics=false,
quiet=false)

p file1
p file2
p digits
p tolerance
p statistics
p quiet
end

end#class NDiff

The printout is quit nice, and it takes no extr effort:

% ./ndiff
Usage: ndiff [-hqs] [-d INT] [-t DBL] file1 file2
% ./ndiff file1
ERROR: Missing expected arguments. Found 1 but expected 2.
Usage: ndiff [-hqs] [-d INT] [-t DBL] file1 file2
% ./ndiff -h
NAME

ndiff - Numerically compare files line by line, numerical field by
numerical field.

SYNOPSIS

ndiff [-hqs] [-d INT] [-t DBL] file1 file2

OPTIONS

--digits,-d INT

Maximum number of significant digits that should match.
(default: 0)

--tolerance,-t DBL

Tolerate <= DBL distance between numbers. (default: 0.0)

--statistics,-s

Provide comparison statistics only. (extra credit)

--quiet,-q
No output printed to screen. just exit code.??

--help,-h
Displays help page.

AUTHOR: Jim Freeze
COPYRIGHT (c) 2005 Jim Freeze

A little note, I did not understand "just exit code" for -q. What is
'-q' supposed
to do. No output at all?

It also appears that the ability to type cast input to floats or ints or
whatever should be added to CommandLine. All comments are appreciated.

--
Jim Freeze

Jacob Fugal

unread,

Sep 9, 2005, 3:04:16 PM9/9/05

to

On 9/9/05, Jim Freeze <j...@freeze.org> wrote:
> Hi
>
> To help the quiz writers focus on the 'the quiz' and not the administrative
> part of the code and to help promote CommandLine, here is a little snippet
> that some may find useful to get started.

<snip>

Thanks, Jim! I'd actually been trying to use CommandLine::Application,
but was having some problems. Comparing what I had against this let me
see where my problems were.

> A little note, I did not understand "just exit code" for -q. What is
> '-q' supposed
> to do. No output at all?

Yes, as in "no output to STDOUT". But there is output: the exit code.
This is a common feature of many unix tools (e.g. diff[1], grep). It
allows the tool to be used in a boolean environment during shell
scripting. An exit code of 0 means there was no difference (within
tolerance), and a non-zero exit code means there was a difference.

Jacob Fugal

Jim Freeze

unread,

Sep 9, 2005, 3:14:25 PM9/9/05

to

On 9/9/05, Jacob Fugal <luk...@gmail.com> wrote:
>
> Thanks, Jim! I'd actually been trying to use CommandLine::Application,
> but was having some problems. Comparing what I had against this let me
> see where my problems were.

Great. Glad to have helped. Any feedback to improve CommandLine is
always appreciated.

> > A little note, I did not understand "just exit code" for -q. What is
> > '-q' supposed
> > to do. No output at all?
>
> Yes, as in "no output to STDOUT". But there is output: the exit code.
> This is a common feature of many unix tools (e.g. diff[1], grep). It
> allows the tool to be used in a boolean environment during shell
> scripting. An exit code of 0 means there was no difference (within
> tolerance), and a non-zero exit code means there was a difference.

Ok, thanks for that explanation. I probably would have called it
"--boolean" instead of --quiet. I always think of quiet as no verbosity,
but I could be just un-informed on this one.

--
Jim Freeze

Bil Kleb

unread,

Sep 10, 2005, 10:28:52 AM9/10/05

to

Sorry, one of my "ignore thread" filters apparently
got out of hand...

We do regression testing as part of our continuous integration
and currently, merely use the stock diff command to compare
output with a pre-existing "golden" output.

This is troublesome if you switch compiler versions, machine
architectures, order of operations, or other things that merely
tweak the answers in the 13th decimal place or something.

Our currently hack is to do some truncation of the output
before write the data. So, diff only winds up comparing
a certain number of decimals, but this is, of course, is
not very flexible.

The patch style output really isn't important. In fact,
just a non-zero exit code is enough for starters.

So we need ability to compare numerical fields line-by-line,
ignoring any text that might be surrounding them, but
make sure that for a given line, the same number of
"numeric fields" exist.

Later,
--
Bil
http://fun3d.larc.nasa.gov

Bil Kleb

unread,

Sep 10, 2005, 10:30:23 AM9/10/05

to

Hugh Sasse wrote:
>
> Named fields are an interesting case: what should happen for
> Cy=0.11278889E-01 Cx=-1.343e+02
>
> And fileB,
>
> Cx=0.11278890E-01 Cy=-1.343e+02 Cz=2.9979e+08
> ?
> Fields named in a different order, and an extra field introduced.

Non-zero exit code due to the different number of "numeric fields".

Ignore named fields.

Bil Kleb

unread,

Sep 10, 2005, 10:33:29 AM9/10/05

to

Jim Freeze wrote:
> Hi

Hello.

> To help the quiz writers focus on the 'the quiz' and not the administrative
> part of the code and to help promote CommandLine, here is a little snippet
> that some may find useful to get started.

Awesome.

Thanks for accommodating my lame quiz-writing abilities.

Thanks again,
--
Bil
http://fun3d.larc.nasa.gov

Bil Kleb

unread,

Sep 10, 2005, 10:36:32 AM9/10/05

to

An exit code is much desired for our application.

Regards,
--
Bil
http://fun3d.larc.nasa.gov

Meador Inge

unread,

Sep 10, 2005, 10:52:17 AM9/10/05

to

> So we need ability to compare numerical fields line-by-line,
> ignoring any text that might be surrounding them, but
> make sure that for a given line, the same number of
> "numeric fields" exist.

But shouldn't the surrounding text matter to some degree? Otherwise:
a b 1.2345 c d
would match:
d c 1.2345 b a
. It seems to me much safer to at least assure that the surrounding
text is the same in both files.

Bil Kleb

unread,

Sep 10, 2005, 11:05:55 AM9/10/05

to

That would be a good option, but at this point, all
we care about is the numerical tolerance aspect -- ndiff
should be primarily focused on numbers.

We can use diff for text-based changes if need be.

Thanks for the interest,
--
Bil
http://fun3d.larc.nasa.gov

Bil Kleb

unread,

Sep 10, 2005, 1:12:12 PM9/10/05

to

Hugh Sasse wrote:
> It would seem useful to be produce a file of numerical differences
> as well.

Yes, but not to us at this point.

Jim Freeze

unread,

Sep 12, 2005, 2:35:58 PM9/12/05

to

Uhm, where are the solutions?

I usually see solutions posted on Sunday.
Here it is Mon afternoon, and I have not seen
any quiz traffic. Is my email playing tricks on me?

--
Jim Freeze

James Edward Gray II

unread,

Sep 12, 2005, 2:45:25 PM9/12/05

to

Only if my email is broken too. Haven't had time to fiddle with it
myself yet, but it looks like I better find some... ;)

James Edward Gray II

Bil Kleb

unread,

Sep 13, 2005, 7:29:52 AM9/13/05

to

James Edward Gray II wrote:

Sigh, maybe it's just my lame quiz writing skills...

Later,
--
Bil
http://fun3d.larc.nasa.gov

Note: posted via newsgroup.

Brian Schröder

unread,

Sep 13, 2005, 9:10:31 AM9/13/05

to

I really want to do this quiz as it looks quite interesting, but I'm
totally swamped with work. I won't have time to do it until next
weekend. But at least some thoughts here:

I would like to write a fdiff i.e. fuzzy-diff. That would behave as follows:

1. Specify fuzzycompare fields with regular expressions.
2. When comparing a row first extract the fuzzyfields from the row,
compare both rows and call a user-specified compare function on the
fuzzyfields.

That would allow for a more general compare approach. I.e.

number-diff:

fuzzyfields = /number-regexp/
compare = proc do | n1, n2| (n1.to_f - n2.to_f).abs < EPSILON end

space-diff:

fuzzyfields = /\s+/
compare = proc do true end

dist-diff

fuzzyfields = /\A.*\Z/
compare = proc do | s1, s2 | edit_distance(s1, s2) < 2 end

etc...

regards,

Brian

--
http://ruby.brian-schroeder.de/

Stringed instrument chords: http://chordlist.brian-schroeder.de/

James Edward Gray II

unread,

Sep 14, 2005, 5:36:57 PM9/14/05

to

On Sep 12, 2005, at 8:24 PM, Daniel Sheppard wrote:

> <<quiz_46_ndiff.rb>> This quiz made me delve into the darkened
> alleyways of undocumented standard libraries (well, bigdecimal).
> Luckily
> the whole thing worked exactly as guessed - probably why nobody has
> bothered documenting it yet.

Nice solution.

Just one small comment:

..
SyncEnumerator.new(file1, file2).each do |line1, line2|
line1, line2 = NumberLine.new(line1), NumberLine.new(line2)
SyncEnumerator.new(line1, line2).collect { |num1, num2|
distances << (num1 - num2).abs }
end
..

I'm pretty sure you meant each() instead of collect() for that second
SyncEnumerator. I just wanted to point that out, since I changed it
in the quiz summary.

James Edward Gray II

Daniel Sheppard

unread,

Sep 14, 2005, 7:48:56 PM9/14/05

to

yeah - I played around with that section a bit and the collect was from
an alternative way of working.

-----Original Message-----
From: James Edward Gray II [mailto:ja...@grayproductions.net]
Sent: Thursday, 15 September 2005 7:37 AM
To: ruby-talk ML
Subject: Re: [SOLUTION] NDiff (#46)

On Sep 12, 2005, at 8:24 PM, Daniel Sheppard wrote:

> <<quiz_46_ndiff.rb>> This quiz made me delve into the darkened
> alleyways of undocumented standard libraries (well, bigdecimal).
> Luckily
> the whole thing worked exactly as guessed - probably why nobody has
> bothered documenting it yet.

Nice solution.

Just one small comment:

.

SyncEnumerator.new(file1, file2).each do |line1, line2|
line1, line2 = NumberLine.new(line1), NumberLine.new(line2)
SyncEnumerator.new(line1, line2).collect { |num1, num2| distances

<< (num1 - num2).abs } end ...

I'm pretty sure you meant each() instead of collect() for that second
SyncEnumerator. I just wanted to point that out, since I changed it in
the quiz summary.

James Edward Gray II

#####################################################################################
This email has been scanned by MailMarshal, an email content filter.
#####################################################################################

Ruby Quiz

unread,

Sep 15, 2005, 8:49:54 AM9/15/05

to

Both solutions had some really interesting material in them. I would certainly
talk about both, if time and space allowed.

Paul Vaillant had some interesting parsing code (see DataFile.parse_line()) and
comparison code (see DataFile::CompareResults.compare()). Paul's entire
solution was also very clean and readable, though it did come out a bit longer
than Daniel Sheppard's code. Finally, Paul's code is over three times faster,
which could be a real asset for significant data sets.

Daniel's code is the one I want to look at below, because of the way it handles
an interesting Ruby problem. What makes this challenge fun, I think, is that
it's one of the edge cases where Ruby's internal iterator can feel awkward.
You're always walking through two separate collections in this problem. Ruby
has some nice solutions for this, buried in the dark corners of its standard
library, but many of us just aren't aware of them. One of these tools,
SyncEnumerator from the generator library, is heavily used in the following
code. Let's see how something like that can make a real difference in the end
result:

require 'generator'
require 'pathname'
require 'optparse'
require 'bigdecimal'

class NumberLine
include Enumerable
def initialize(line)
@line = line
end
def each
@line.scan( / ((-)?([0-9]+|\.)
\.?([0-9]+)?[Ee]?
([-+]?[0-9]+)?) /x) do |match|
yield BigDecimal.new(match[0])
end
end
def to_s
@line
end
end

# ...

This is a simple encapsulation of a single line of one or more numbers. The
only method of interest here is each(), which scan()s the line looking for all
the numbers it can find. That lengthy Regexp just matches an optional minus
sign, followed by some digits (or a period), optionally followed by a period and
some more digits, optionally following by an E or e and another number. This
isn't a fool-proof match since it will catch a String like "-." or even "..",
but it's probably good enough for most cases. Each number matched is fed to
BigDecimal and then yielded to a given block. The use of the library here slows
things down a bit, but it avoids floating point inaccuracies too, which is more
important to this challenge.

# ...

class NumericalDifferencer
attr_accessor :digits, :tolerance
def initialize
@digits = 0
@tolerance = 0.0
end
def each_diff(file1, file2)
line = 0
diff = nil

SyncEnumerator.new(file1, file2).each do |line1, line2|
line1, line2 = NumberLine.new(line1), NumberLine.new(line2)

line += 1
num_enum = SyncEnumerator.new(line1, line2)
if num_enum.all? do |a,b|
unless @digits == 0
a,b = a.round(@digits-1), b.round(@digits-1)
end
(a - b).abs <= @tolerance
end
if diff
yield diff
diff = nil
end
else
diff = NumericalDiff.new(line) unless diff
diff.add_lines(line1, line2)
end
end
yield diff if diff
end
end

# ...

Here we get into the actual file comparison code and some SyncEnumerator fun.
The first SyncEnumerator created here walks through the two files given in step.
Each iteration of the block receives the next line from both files. Those lines
are turned into NumberLine objects, which we have already seen. Then a
SyncEnumerator is made from those objects, which allows the code to traverse
each number found in both lines, again in step. That object is used in the if
condition to ensure that all?() numbers in the line are within the needed
tolerance (rounding if requested). When any numbers of the lines don't match
close enough a NumericalDiff object is created and the current lines, plus any
following lines that also differ, are added to it. When the NumericalDiff is
complete, because the code has returned to matching lines or completed running
through the files, the diff object is yielded to the provided block.

Let's take a look at that NumercialDiff class:

# ...

class NumericalDiff
attr_reader :start_line, :left, :right
def initialize(start_line)
@start_line = start_line
@left = []
@right = []
end
def add_lines(line1, line2)
@left << line1
@right << line2
end
def to_s
lines = "#{@start_line},#{@start_line + @left.length - 1}"
str = "#{lines}c#{lines}\n"
str << @left.collect { |x| "< #{x}" }.join
str << "---\n"
str << @right.collect { |x| "> #{x}" }.join
end
end

# ...

No black magic in here. The work method is to_s() and it should be easy enough
to see that it's just creating the quiz format there.

That's really the majority of the solution. The rest is mainly option parsing:

# ...

differ = NumericalDifferencer.new
quiet = false
statistics = false

opts = OptionParser.new do |opts|
opts.banner = "Usage: #{$0} [options] file1 file2"

opts.separator ""
opts.separator "Numerically compare files line by line, " +
"numerical field by numerical field."
opts.separator ""
opts.on("-d", "--digits INT", Integer,
"Maximum number of significant digits that should match.",
"(default: 0)") do |digits|
differ.digits = digits
end
opts.on("-t", "--tolerance DBL", String,
"Tolerate <= DBL distance between numbers.",
"(default: 0.0)") do |tolerance|
differ.tolerance = BigDecimal.new(tolerance)
end
opts.on("-h", "--help", "Output this help.") do |help|
puts opts
exit 0
end
opts.on("-q", "--quiet", "No output, just exit code.") do |value|
quiet = value
end
opts.on("-s", "--statistics",
"Provide comparison statistics only.") do |value|
statistics = value
end
end

# ...

The top few lines there create a variables to hold the diff tools and setting
information. Then the last chunk of code is OptionParser 101.

Finally, here's the code that kicks it all off:

# ...

begin
opts.parse!(ARGV)
if quiet && statistics
raise "--quiet and --statistics are mutually exclusive"
end
raise "Must pass two filenames" unless ARGV.length == 2
files = ARGV.collect { |x| Pathname.new(x) }
files.each do |f|
unless f.exist? && f.file?
raise "'#{f}' does not exist"
end
end
File.open(files[0]) do |file1|
File.open(files[1]) do |file2|
if(statistics)
distances = []

SyncEnumerator.new(file1, file2).each do |line1, line2|
line1, line2 = NumberLine.new(line1),
NumberLine.new(line2)

SyncEnumerator.new(line1, line2).each do |num1, num2|

distances << (num1 - num2).abs
end

end
class << distances
def median
sorted = self.sort
mid = sorted.size / 2
if sorted.size % 2 == 0
(sorted[mid] + sorted[mid - 1]) / 2
else
sorted[mid]
end
end
def mean
self.inject(0) { |x, sum| sum + x / self.size}
end
end
puts(<<-EOF)
Numbers compared: #{distances.size}
Distance range: #{distances.min}..#{distances.max}
Median distance: #{distances.median}
Mean distance: #{distances.mean}
EOF
elsif(quiet)
differ.each_diff(file1, file2) do |diff|
exit 1
end
exit 0
else
different = false
differ.each_diff(file1, file2) do |diff|
different = true
puts diff
end
exit(different ? 1 : 0)
end
end
end
rescue => e
warn e
warn opts
exit 1
end

I know that looks like a lot of code, but the majority of it is the extra credit
part of the challenge. If you can spot that big if statement in the middle,
you'll see that the majority of the code is in the first branch. That part
calculates statistics, by implementing its own inline custom traversal of the
files (again with SyncEnumerator). The really interesting feature in this code
is that the results are just compiled into an Array and then that Array is
modified with extra methods to make outputting the results easy. If you're not
familiar with Ruby's class << some_object ... end idiom, it simply adds the
methods to just that one object. A handy trick put to good use here, I think.

The other two very short branches (elsif and else), actually solve the quiz.
They just react according to the each_diff() method we examined earlier.

It's probably worth mentioning again here, the generator library is pure Ruby
and makes use of continuations, which can be quite slow. BigDecimal is never
going to be as fast as using Floats either, for obvious reasons. The result is
that this code is quite a bit slower than Paul's solution. (Over three times
slower, in my tests.) That doesn't affect small data sets, like the ones in the
quiz, very much, but you may start waiting a bit on longer inputs.

My thanks to both quiz solvers, for showing off some super cool Ruby techniques.

Tomorrows quiz is a simple challenge for all you web application junkies, that
may just prove interesting to the community at large. Join in the fun and show
us just how cool your favorite web application framework can be...

Bil Kleb

unread,

Sep 15, 2005, 11:48:28 AM9/15/05

to

Ruby Quiz wrote:
> Both solutions had some really interesting material in them. I would certainly
> talk about both, if time and space allowed.

You need to take credit for the Summary
on the webpage version:

http://rubyquiz.com/quiz46.html

Otherwise, the rest of my team will be
(incorrectly) dazzled by "my" Ruby knowledge
when I point them there. :)

James Edward Gray II

unread,

Sep 15, 2005, 3:14:43 PM9/15/05

to

On Sep 15, 2005, at 11:01 AM, Bil Kleb wrote:

> Ruby Quiz wrote:
>
>> Both solutions had some really interesting material in them. I
>> would certainly
>> talk about both, if time and space allowed.
>>
>
> You need to take credit for the Summary
> on the webpage version:
>
> http://rubyquiz.com/quiz46.html
>
> Otherwise, the rest of my team will be
> (incorrectly) dazzled by "my" Ruby knowledge
> when I point them there. :)

My general rule is that if it doesn't have a by-line, it was me.
(This has been added to the Ruby Quiz FAQ.)

However, I made an exception in this case and changed the page as
requested.

James Edward Gray II