[QUIZ] TumbleDRYer (#53)

Ruby Quiz

unread,

Oct 28, 2005, 4:58:50 PM10/28/05

to

The three rules of Ruby Quiz:

1. Please do not post any solutions or spoiler discussion for this quiz until
48 hours have passed from the time on this message.

2. Support Ruby Quiz by submitting ideas as often as you can:

http://www.rubyquiz.com/

3. Enjoy!

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

by Hugh Sasse

The Pragmatic Programmers (and others) recommend that you remove repetition from
code. They call this principle DRY: Don't Repeat Yourself. Benefits include
not having to track down multiple places to make a change affecting the whole
system, prevention of inconsistency, as well as reduced debugging time.
Sometimes code ends up being repetitive. Take this MySQL example from:

http://manuals.rubyonrails.com/read/chapter/48

CREATE TABLE `authors` (
`id` int(11) NOT NULL auto_increment,
`firstname` varchar(50) NOT NULL default '',
`name` varchar(50) NOT NULL default '',
`nickname` varchar(50) NOT NULL default '',
`contact` varchar(50) NOT NULL default '',
`password` varchar(50) NOT NULL default '',
`description` text NOT NULL,
PRIMARY KEY (`id`)
) TYPE=MyISAM AUTO_INCREMENT=3 ;

CREATE TABLE `categories` (
`id` int(11) NOT NULL auto_increment,
`name` varchar(20) NOT NULL default '',
`description` varchar(70) NOT NULL default '',
PRIMARY KEY (`id`)
) TYPE=MyISAM AUTO_INCREMENT=3 ;

CREATE TABLE `categories_documents` (
`category_id` int(11) NOT NULL default '0',
`document_id` int(11) NOT NULL default '0',
) TYPE=MyISAM ;

CREATE TABLE `documents` (
`id` int(11) NOT NULL auto_increment,
`title` varchar(50) NOT NULL default '',
`description` text NOT NULL,
`author_id` int(11) NOT NULL default '0',
`date` date NOT NULL default '0000-00-00',
`filename` varchar(50) NOT NULL default '',
PRIMARY KEY (`id`),
KEY `document` (`title`)
) TYPE=MyISAM AUTO_INCREMENT=14 ;

clearly there's a lot of repetition in there. It would be nice if it were
possible to eliminate it. This quiz is about writing a program which we'll call
TumbleDRYer, which, given repetitive input, will create ruby code to generate
that input, such that the generating program has much less repetition in it.
One way might be to generate a program like this:

# Undoes the work of TumbleDRYer :-)
class WashingMachine
def initialize
@create = 'CREATE TABLE'
@type = 'TYPE=MyISAM'
@varchar_50 = "varchar(50) NOT NULL default '',"
@varchar_20 = "varchar(20) NOT NULL default '',"
@int_11 = "int(11) NOT NULL auto_increment,"
@int_11_0 = "int(11) NOT NULL default '0',"
@pk = "PRIMARY KEY (`id`)"
@auto = "AUTO_INCREMENT="
end

def output
print <<EOF
#{@create} `authors` (
`id` #{@int_11}
`firstname` #{@varchar_50}
`name` #{@varchar_50}
`nickname` #{@varchar_50}
`contact` #{@varchar_50}
`password` #{@varchar_50}
`description` text NOT NULL,
#{@pk}
) #{@type} #{@auto}3 ;

#{@create} `categories` (
`id` #{@int_11}
`name` #{@varchar_20}
`description` varchar(70) NOT NULL default '',
#{@pk}
) #{@type} #{@auto}3 ;

#{@create} `categories_documents` (
`category_id` #{@int_11_0}
`document_id` #{@int_11_0}
) #{@type} ;

#{@create} `documents` (
`id` #{@int_11}
`title` #{@varchar_50}
`description` text NOT NULL,
`author_id` #{@int_11_0}
`date` date NOT NULL default '0000-00-00',
`filename` #{@varchar_50}
#{@pk},
KEY `document` (`title`)
) #{@type} #{@auto}14 ;
EOF
end
end

WashingMachine.new.output

or something of the kind. Or it might be worth using ERB.

Obviously the more human readable to the resulting program is the better: the
point is to produce something that simplifies maintenance, rather than data
compression. Techniques from data compression would seem to be useful, however.
Building up a dictionary of character groups and their frequencies might be
useful, and various strategies are possible for how to fragment the string into
repeated chunks, such as: the longest repeated substring; splitting on some kind
of boundary; or techniques from Ziv-Lempel coding to create the groups.

Bill Guindon

unread,

Oct 28, 2005, 7:06:32 PM10/28/05

to

On 10/28/05, Ruby Quiz <ja...@grayproductions.net> wrote:

[snip]

> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
>
> by Hugh Sasse

[snip]

> CREATE TABLE `documents` (
> `id` int(11) NOT NULL auto_increment,
> `title` varchar(50) NOT NULL default '',
> `description` text NOT NULL,
> `author_id` int(11) NOT NULL default '0',
> `date` date NOT NULL default '0000-00-00',
> `filename` varchar(50) NOT NULL default '',
> PRIMARY KEY (`id`),
> KEY `document` (`title`)
> ) TYPE=MyISAM AUTO_INCREMENT=14 ;

This looks like a fun one to me, as it's one of the main things I use
Ruby for. Looking forward to seeing what others do with it -- and
hoping to clean up an example of my own.

Since I had to rtfm [1] for a couple simple SQL questions (yeah, my
SQL is probably rustier than most), thought I'd save others the time.

q) Why doesn't the 'description' field have a 'default'?
a) "BLOB and TEXT columns cannot be assigned a default value."

q) How should you treat columns that allow NULLs (none of the examples do)?
a) "If neither NULL nor NOT NULL is specified, the column is treated
as though NULL had been specified."

[1] http://dev.mysql.com/doc/refman/5.0/en/create-table.html

--
Bill Guindon (aka aGorilla)

Bill Guindon

unread,

Oct 28, 2005, 8:02:03 PM10/28/05

to

nm, I seriously misread the challenge. either way, looking forward to
it, and might take a stab at it.

Dave Burt

unread,

Oct 28, 2005, 8:29:40 PM10/28/05

to

Bill Guindon wrote:
> Since I had to rtfm [1] for a couple simple SQL questions (yeah, my
> SQL is probably rustier than most), thought I'd save others the time.
>
> q) Why doesn't the 'description' field have a 'default'?
> a) "BLOB and TEXT columns cannot be assigned a default value."

I'm pretty sure that's DBMS-specific (MySQL), and I'm pretty sure SQLite
(for example) allows this. I'm actually not sure about other major DBMSes.

> q) How should you treat columns that allow NULLs (none of the examples
> do)?
> a) "If neither NULL nor NOT NULL is specified, the column is treated
> as though NULL had been specified."

That's standard SQL. Most schema-DDLs include optional columns; I like to
omit both the NULL and the NOT.

While we're looking at the MySQL DDL:

q) What's with the backticks (`)?
a) That's MySQL's way of quoting column names. They're unnecessary as long
as the name is alphabetic and not a keyword. I don't know of any other DBMS
that uses backticks for this.

q) What's with the TYPE=?
a) Again, a MySQL feature - MySQL supports different storage engines (and
here was I thinking a DBMS was supposed to manage storage...) and this
specifies which to use. MyISAM is the crippled default one which doesn't
support referential integrity.

q) Why doesn't auto_increment work with other databases?
a) Because defining an auto-increment field is not specified in SQL-92. They
can all do it, though.

q) Why are data types, "default" and "auto_increment" in lowercase when
other keywords are uppercase?
a) I don't know.

For what it's worth, MySQL 5 is a lot better than MySQL 4 (now with in-line
views, I believe), but I'd still recommend Postgresql or really anything
else over it.

Cheers,
Dave

Donald Huebschman

unread,

Oct 31, 2005, 10:34:11 AM10/31/05

to

I have a beige g3 that apple will not support past 10.2. What version
of ruby is the latest version that will run on a 10.2 system and
where can I get it. 1.8.3 does not compile. Is there anything else I
would need to run ruby?

Thanks,
Donald

Michal Suchanek

unread,

Oct 31, 2005, 3:17:46 PM10/31/05

to

Of course it does not. Look at the patches included in the fink
package for 10.3.
I guess the package would build on 10.2 as well. But I cannot try myself.
Specifically I am not sure the environ patch would work on 10.2 as well.

hth

Michal Suchanek

Bob Showalter

unread,

Nov 1, 2005, 2:10:12 PM11/1/05

to

My approach was to look for repeated "phrases" made up of one or more
"words" within a line. Running my solution against the sample input data
produces the following output:

class WashingMachine
def initialize
@id = '`id` int(11) NOT NULL auto_increment,'
@va = 'varchar(50) NOT NULL default \'\','
@in = 'int(11) NOT NULL default \'0\','
@no = 'NOT NULL default'
@ty = ') TYPE=MyISAM'
@de = '`description`'
@cr = 'CREATE TABLE'
@pr = 'PRIMARY KEY'

end
def output
print <<EOF

#{@cr} `authors` (
#{@id}
`firstname` #{@va}
`name` #{@va}
`nickname` #{@va}
`contact` #{@va}
`password` #{@va}
#{@de} text NOT NULL,
#{@pr} (`id`)
#{@ty} AUTO_INCREMENT=3 ;

#{@cr} `categories` (
#{@id}
`name` varchar(20) #{@no} '',
#{@de} varchar(70) #{@no} '',
#{@pr} (`id`)

#{@ty} AUTO_INCREMENT=3 ;
#{@cr} `categories_documents` (
`category_id` #{@in}
`document_id` #{@in}
#{@ty} ;

#{@cr} `documents` (
#{@id}
`title` #{@va}
#{@de} text NOT NULL,
`author_id` #{@in}
`date` date #{@no} '0000-00-00',
`filename` #{@va}
#{@pr} (`id`),

KEY `document` (`title`)

#{@ty} AUTO_INCREMENT=14 ;
EOF
end
end

Here's my solution:

require 'strscan'
require 'abbrev'

class TumbleDRYer

# minimum length of a phrase to consider
MIN_PHRASE = 10

# minimum times a phrase must occur to consider
MIN_OCCUR = 3

# minimum length for abbreviation
MIN_ABBR = 2

def initialize(string)
@input = string
end

def dry

# this will accumulate a list of repeated phrases to condense
phrases = Array.new

# this will receive the abbreviation for each phrase
abbr = Hash.new

lines = @input.to_a
loop do

# process the input data by lines. we find "phrases" by
# first finding the start and end of each "word" in the line,
# and then combining those words into longer phrases. for
# each phrase, we count the number of times it occurs in the
# total input.
phr = Hash.new
lines.each do |line|
s = StringScanner.new(line)
words = Array.new
loop do
s.scan_until(/(?=\S)/) or break
beg = s.pos
s.scan(/\S+/)
words << [ beg, s.pos ]
end
require 'pp'

# combine words to make 'phrases'
combos(words)

# accumulate phrases, counting their occurences.
# skip phrases that are too short.
words.each do |from, to|
p = line[from, to - from]
next unless p.length >= MIN_PHRASE
phr[p] ||= 0
phr[p] += 1
end
end

# get the longest phrase that occurs the most times
longest = phr.sort_by { |k,v| -(k.length * 1000 + v)
}.find { |k,v| v >= MIN_OCCUR } or break
phrase, occurs = longest

# save the phrase, and then blank it out of the input data
# so we can search for more phrases
phrases << phrase
lines.each { |line| line.gsub!(phrase, ' ' * phrase.length) }

end

# now we have all the phrases we want to replace.
# find unique abbreviations for each phrase.
temp = Hash.new
phrases.each do |phrase|
key = phrase.scan(/\w+/).flatten.to_s.downcase
key = '_' + key unless key =~ /^[_a-zA-Z]/
key += '_' while temp.has_key? key
temp[key] = phrase
end
temp.keys.abbrev.sort.each do |s, key|
phrase = temp[key]
abbr[phrase] = s if abbr[phrase].nil? ||
abbr[phrase].length < MIN_ABBR
end

# generate the output class
puts "class WashingMachine"
puts " def initialize"
phrases.each do |phrase|
puts ' @' + abbr[phrase] + " = '" +
phrase.gsub("'", "\\\\'") + "'"
@input.gsub!(phrase, '#{@' + abbr[phrase] + '}')
end
puts " end\n"
puts " def output\nprint <<EOF"
puts @input
puts "EOF\n end\n"
puts "end"

end

private

def combos(arr, max = arr.size - 1, i = 0)
(i+1..max).each do |j|
arr << [ arr[i][0], arr[j][1] ]
end
combos(arr, max, i + 1) if i < max - 1
end

end

TumbleDRYer.new(ARGF.read).dry

Hugh Sasse

unread,

Nov 1, 2005, 2:42:12 PM11/1/05

to

Interesting solution. I've overlooked a lot of the functionality of
strscan, so that is worth knowing about. The approach for
generating mnemonics was nice as well, as well as the reverse sort
by negation.
Thank you,
Hugh

Brian Schröder

unread,

Nov 1, 2005, 4:34:08 PM11/1/05

to

I decided to cheat, here is my solution:

bschroed@oerfi:~/svn/projekte/tumblezip$ ./tumblezip.rb tumblezip.rb >
tumblezip.tumbled.rb
bschroed@oerfi:~/svn/projekte/tumblezip$ cat tumblezip.tumbled.rb
require "zlib"
puts Zlib::Inflate.inflate(DATA.read)
__END__
xSVÔ/-.ÒOÊÌÓ/*Mªäâ*J-,Í,JUP¯ÊÉLRçâ*(-)VPÕ
+4u¸@@U,³²òÌKËI,IÕËÐ▒.!zE(c))0åJññ(r)~.ññJ:
FTá▒äîÑÈ1}/
bschroed@oerfi:~/svn/projekte/tumblezip$ ruby tumblezip.tumbled.rb
#!/usr/bin/ruby

require 'zlib'

puts %(require "zlib"),
%(puts Zlib::Inflate.inflate(DATA.read)),
"__END__",
Zlib::Deflate.deflate(ARGF.read)

It definitely fails on human readability, though compression and dry
is quite good ;-)

cheers,

Brian

--
http://ruby.brian-schroeder.de/

Stringed instrument chords: http://chordlist.brian-schroeder.de/

Hugh Sasse

unread,

Nov 2, 2005, 5:31:31 AM11/2/05

to

This message is in MIME format. The first part should be readable text,
while the remaining parts are likely unreadable without MIME-aware tools.

Ruby Quiz

unread,

Nov 3, 2005, 8:44:29 AM11/3/05

to

Great quiz Hugh! This is one of those unique ideas that was a lot of fun to
play with. There's always some neat things to be accomplished with code that
writes code.

I want to take a look at Bob Showalter's solution below. It's really just one
linear procedure, so I'll show all the code and then break it down:

That's the code, save one unused line I removed.

The first thing I do when reading code is to try and understand the process. To
get the lay of the land, we need to know what's called where. The path of
execution is trivial to find here. The last line is the only code, besides a
couple of require statements, outside of a class definition. We know that's
what Ruby will run.

That line tells us that we need to be examining two methods
TumbleDRYer#initialize and TumbleDRYer#dry. Don't be shy with that scrollbar
now. Zip around until you find them. TumbleDRYer#initialize is easy enough,
all it does is record the passed input. TumbleDRYer#dry is a monster, so let's
come back to it.

Anything else in this file? Well, we can see that two standard libraries are
used, strscan and abbrev. We can also see three constants at the top of the
class that start to give us ideas about how the code will break down the
sections it is meant to simplify.

Beyond that, there's only one other method, which we can assume is a helper to
TumbleDRYer#dry. It's smaller, so let's see if we can figure it out first. Uh
oh, recursion! My brain just melted and leaked out of my ear. Okay, let's see
if we can cheat and just call it. Copy and paste and irb to the rescue:

>> def combos(arr, max = arr.size - 1, i = 0)
>> (i+1..max).each do |j|

?> arr << [ arr[i][0], arr[j][1] ]

>> end
>> combos(arr, max, i + 1) if i < max - 1
>> end

=> nil
>> combos( (1..10).to_a )
=> nil

Or not. I was hoping for a more informative return value obviously.

Okay, it looks like I actually need to read a little. I see that `arr` is a
parameter of the method. That's how I decided on a sample data set. A little
farther in I can see `arr << ...`. Bingo. It modifies the array. Now I know
how to check it:

>> test_data = (1..10).to_a
=> [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
>> combos test_data
=> nil
>> test_data
=> [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, [1, 1], [1, 1], [1, 0], [1, 0], [1, 1],
[1, 1], [1, 0], [1, 0], [1, 1], [0, 1], [0, 0], [0, 0], [0, 1], [0, 1],
[0, 0], [0, 0], [0, 1], [1, 0], [1, 0], [1, 1], [1, 1], [1, 0], [1, 0],
[1, 1], [0, 0], [0, 1], [0, 1], [0, 0], [0, 0], [0, 1], [1, 1], [1, 1],
[1, 0], [1, 0], [1, 1], [0, 1], [0, 0], [0, 0], [0, 1], [1, 0], [1, 0],
[1, 1], [0, 0], [0, 1], [1, 1]]

Yuck. In the dictionary under "unhelpful", you will find the above listing.
More reading is required.

Well, if I finish the line I stopped at last time, I find this little nugget:
`[ arr[i][0], arr[j][1] ]`. That makes me think `arr` is expected to be an
Array of Arrays. It looks like the subarrays only need two items, since the
indexes are 0 and 1. Third try is a charm:

>> test_data = [1, 3, 5, 7, 9].map { |n| [n, n + 1] }
=> [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
>> combos test_data
=> nil
>> test_data
=> [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [1, 4], [1, 6], [1, 8], [1, 10],
[3, 6], [3, 8], [3, 10], [5, 8], [5, 10], [7, 10]]

That looks promising. Now what are we seeing? Obviously it added some
groupings to the end of the Array. 1 was already seen with 2, but it also
paired it with 4, 6, 8, and 10. That's all of the other second elements. Then
3 is paired with 6, 8, and 10 and it was already paired with 4, which leaves
only 2 missing. Now I see the pattern. Each first element is paired with all
of the second elements that come after it, in the original Array.

Even knowing how that works, I'm not sure I understand what that is actually for
yet. I'll file it away in the back of my mind and see if I'm ready to tackle
TumbleDRYer#dry.

Here's the first chunk of that method, to save you some scrolling:

def dry

# this will accumulate a list of repeated phrases to condense
phrases = Array.new

# this will receive the abbreviation for each phrase
abbr = Hash.new

lines = @input.to_a
loop do

# process the input data by lines. we find "phrases" by
# first finding the start and end of each "word" in the line,
# and then combining those words into longer phrases. for
# each phrase, we count the number of times it occurs in the
# total input.
phr = Hash.new
lines.each do |line|
s = StringScanner.new(line)
words = Array.new
loop do
s.scan_until(/(?=\S)/) or break
beg = s.pos
s.scan(/\S+/)
words << [ beg, s.pos ]
end

# combine words to make 'phrases'
combos(words)

# accumulate phrases, counting their occurrences.

# skip phrases that are too short.
words.each do |from, to|
p = line[from, to - from]
next unless p.length >= MIN_PHRASE
phr[p] ||= 0
phr[p] += 1
end
end

# get the longest phrase that occurs the most times
longest = phr.sort_by { |k,v| -(k.length * 1000 + v)
}.find { |k,v| v >= MIN_OCCUR } or break
phrase, occurs = longest

# save the phrase, and then blank it out of the input data
# so we can search for more phrases
phrases << phrase
lines.each { |line| line.gsub!(phrase, ' ' * phrase.length) }

end

# ...

Luckily, that's well commented code. The comments easily explain what the
variables are for, and then we get an introduction to the process in `loop do
.. end`. As soon as I read that, TumbleDRYer#combos clicks into place.

The Array of Arrays passed to TumbleDRYer#combos holds a start and end position
for each word. The start positions are combined with all of the end positions
after that word to give the possible phrases.

Before the call to TumbleDRYer#combos, we can see the Array of Arrays being
created, with a little StringScanner work. Words are found by scanning
non-whitespace characters, because replacing at any other boundaries pretty much
kills the human readable goal of the output.

Just beneath that call to TumbleDRYer#combos the phrases are assembled, checked
for the minimum length, and tallied in the phrase count. The chunk of code
right after that selects the longest phrase that occurs the most times for
substitution. Note the clever use of negation in #sort_by here, to reverse the
order. The last couple of lines save the phrase and replace it with whitespace,
so it won't match in future iterations.

Remember, this is all inside of a big `loop do ... end`, so by the time we break
out of here, all the replacement phrases will have been selected.

# ...

# now we have all the phrases we want to replace.
# find unique abbreviations for each phrase.
temp = Hash.new
phrases.each do |phrase|
key = phrase.scan(/\w+/).flatten.to_s.downcase
key = '_' + key unless key =~ /^[_a-zA-Z]/
key += '_' while temp.has_key? key
temp[key] = phrase
end
temp.keys.abbrev.sort.each do |s, key|
phrase = temp[key]
abbr[phrase] = s if abbr[phrase].nil? ||
abbr[phrase].length < MIN_ABBR
end

# ...

This section of the method just finds names for each of the replaced phrases.
It generally choses the first few letters or numbers of the phrase, unless it
needs to insert a _ character or two to maintain Ruby syntax or break
ambiguities. A lot of the work here is done by the abbrev library. If you're
not familiar with how that works, ask irb:

>> require "abbrev"
=> true
>> names = %w{James Edward Gray II}
=> ["James", "Edward", "Gray", "II"]
>> names.respond_to? :abbrev
=> true
>> names.abbrev
=> {"II"=>"II", "Gr"=>"Gray", "Gra"=>"Gray", "Jam"=>"James", "Ja"=>"James",
"Edw"=>"Edward", "Edwa"=>"Edward", "Edwar"=>"Edward", "Gray"=>"Gray",
"James"=>"James", "Edward"=>"Edward", "E"=>"Edward", "G"=>"Gray",
"Jame"=>"James", "I"=>"II", "Ed"=>"Edward", "J"=>"James"}

It turns an Array into a Hash of all the unique abbreviations that can be used
to form the words in the original Array.

Last little bit of code:

# ...

# generate the output class
puts "class WashingMachine"
puts " def initialize"
phrases.each do |phrase|
puts ' @' + abbr[phrase] + " = '" +
phrase.gsub("'", "\\\\'") + "'"
@input.gsub!(phrase, '#{@' + abbr[phrase] + '}')
end
puts " end\n"
puts " def output\nprint <<EOF"
puts @input
puts "EOF\n end\n"
puts "end"

end

That just outputs the DRYed code to STDOUT. This output uses a Ruby heredoc
with String interpolation. Unfortunately, I think that means that Ruby code,
with its own String interpolation, will confuse it. Have a look at Dominik
Bathon's code for custom interpolation or Daniel Sheppard's code for ERb
templates that even make use of arguments in replacement.

As always my thanks to all the creative souls that gave this quiz a try. All of
the solutions are educational.

Ruby Quiz will take a week off now so I can run up to Denver and wish my sister,
Nicole Blake Thanner, a happy sweet sixteenth birthday!

James Edward Gray II

unread,

Nov 4, 2005, 7:56:00 PM11/4/05

to

On Nov 3, 2005, at 7:44 AM, Ruby Quiz wrote:

> Ruby Quiz will take a week off now...

Just FYI, there will be a Ruby Quiz on the 11th, but it will be
later. My return flight doesn't land until that evening and I can't
launch it until I'm back home. It will make the right day, but not
at the usual morning time. Thanks for your patients.

James Edward Gray II

Anthony Moralez

unread,

Nov 5, 2005, 7:57:55 PM11/5/05

to

On 11/4/05, James Edward Gray II <ja...@grayproductions.net> wrote:

> Thanks for your patients.

But I'm not a doctor ;)
Thanks James for all of your hard work making the Quz happen.

Anthony Moralez