1. Please do not post any solutions or spoiler discussion for this quiz until
48 hours have passed from the time on this message.
2. Support Ruby Quiz by submitting ideas as often as you can:
3. Enjoy!
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
by Lyndon Samson
All this fiddling with bits in the thread "How to get non-unique elements from
an array" got me digressing to search engines and indexing.
So if you have for example:
Doc1=The quick brown fox
Doc2=Jumped over the brown dog
Doc3=Cut him to the quick
You can build a table with bit number and word.
1 the
2 quick
3 brown
4 fox
5 jumped
6 over
7 dog
8 cut
9 him
10 to
11 quick
To create indices:
Doc1=00000001111
Doc2=00001110101
Doc3=11110000011
You can very quickly return the Docs that contain 'the' [ Doc1,Doc2,Doc3 ], or
brown [ Doc1,Doc2 ] etc.
This week's Ruby Quiz is to write a simple indexer/query system.
[ Note:
In the spirit of that thread, I think part of the quiz should be to solve the
indexing problem in the shortest, most elegant, yet fastest way possible. Maybe
that goes without saying, but I've seen some pretty long quiz solutions in the
past.
--Ryan Leavengood ]
#!/usr/local/bin/ruby
class Index
INDEX_FILE = 'index.dat'
# loads existing index file, if any
def initialize
@terms = {}
@index = {}
if File.exists? INDEX_FILE
@terms, @index = Marshal.load(File.open(INDEX_FILE, 'rb') {|f|
f.read})
end
end
# sets the current document being indexed
def document=(name)
@document = name
end
# adds given term to the index under the current document
def <<(term)
raise "No document defined" unless defined? @document
unless @terms.include? term
@terms[term] = @terms.length
end
i = @terms[term]
@index[@document] ||= 0
@index[@document] |= 1 << i
end
# finds documents containing all of the specified terms.
# if a block is given, each document is supplied to the
# block, and nil is returned. Otherwise, an array of
# documents is returned.
def find(*terms)
@index.each do |document, mask|
if terms.all? { |term| @terms[term] && mask[@terms[term]] != 0 }
yield document
end
end
end
# dumps the entire index
def dump
@index.each do |document, mask|
puts "#{document}:"
@terms.each do |term, value|
puts " #{term}" if mask & value
end
end
end
# saves the index data to disk
def save
File.open(INDEX_FILE, 'wb') do |f|
Marshal.dump([@terms, @index], f)
end
end
end
idx = Index.new
case ARGV.shift
when 'add'
ARGV.each do |fname|
idx.document = fname
IO.foreach(fname) do |line|
line.downcase.scan(/\w+/) { |term| idx << term }
end
end
idx.save
when 'find'
idx.find(*ARGV.collect { |s| s.downcase }) { |document| puts document }
when 'dump'
idx.dump
else
print <<-EOS
Usage: #$0 add file [file...] Adds files to index
#$0 find term [term...] Lists files containing all term(s)
#$0 dump Dumps raw index data
EOS
end
Oops, that comment is wrong. You must supply a block. I forgot to go
back and add the support for returning an array.
Hello Bob,
thank you for the solution, but please try to respect the non-spoiler
period in the future. From the announcement eMail:
1. Please do not post any solutions or spoiler discussion for this quiz until
48 hours have passed from the time on this message.
cheers,
Brian
--
http://ruby.brian-schroeder.de/
Stringed instrument chords: http://chordlist.brian-schroeder.de/
Ummm, today is Monday :o)
>> [snip]
>>
>
> Hello Bob,
>
> thank you for the solution, but please try to respect the non-spoiler
> period in the future. From the announcement eMail:
>
> 1. Please do not post any solutions or spoiler discussion for this
> quiz until
> 48 hours have passed from the time on this message.
I'm pretty sure we're past that now. ;)
Do we need a Ruby Quiz to parse the date out of an email header and
add 48 hours to it? <laughs>
James Edward Gray II
Actually, we need the following, much more practical, quiz:
Write a program that watches a mailing list (or newsgroup, if you
prefer) for a message with a subject line that contains "[QUIZ]". If
the body of that email also contains "Please do not post any solutions
or spoiler discussion for this quiz until 48 hours have passed from the
time on this message.", the program should email the list once per
minute with the number of minutes remaining until the 48 hours have
passed. Sample output should look about like this:
------------------------------
To: "ruby-talk ML" <ruby...@ruby-lang.org>
From: "Annoying mailer" <usubscrib...@gmail.com>
Subject: [QUIZ][COUNTDOWN] Index and Query (#54)
2879 minutes remaining!
------------------------------
------------------------------
To: "ruby-talk ML" <ruby...@ruby-lang.org>
From: "Annoying mailer" <usubscrib...@gmail.com>
Subject: [QUIZ][COUNTDOWN] Index and Query (#54)
2878 minutes remaining!
------------------------------
------------------------------
To: "ruby-talk ML" <ruby...@ruby-lang.org>
From: "Annoying mailer" <usubscrib...@gmail.com>
Subject: [QUIZ][COUNTDOWN] Index and Query (#54)
2877 minutes remaining!
------------------------------
...
Yes, I'm joking.
Keith
Sorry, I thought 48 hours had passed.
From the announcement:
Date: Sat, 12 Nov 2005 10:27:39 +0900
Posted: Fri, 11 Nov 2005 20:27:28 -0500
My post:
Date: Tue, 15 Nov 2005 00:56:22 +0900
Posted: Mon, 14 Nov 2005 10:50:36 -0500
That's more than 48 hours, no?
So sorry for the confusion. I left friday and somehow thought the quiz
was sent today. My inner clock is a bit out of sync at the moment.
Maybe I should write a ruby program to skew my inner clock until it is
in sync again.
Humble apologies again,
brian
class IndexHash
attr_accessor :index
def initialize( documents=nil )
@index = Hash.new( [] )
input( documents ) if documents
end
def input( documents )
documents.each_pair do |symbol, contents|
contents.split.each { |word| insert( symbol, word) }
end
end
def insert( document_symbol, word )
@index[word.downcase] = [] unless @index.has_key?( word.downcase )
@index[word.downcase].push( document_symbol ) unless
@index[word.downcase].include?( document_symbol )
end
def find( word )
@index[ word.downcase ]
end
def words
@index.keys.sort
end
end
class IndexBitmap
attr_accessor :index
def initialize( documents=nil )
@index = []
@documents = {}
input( documents ) if documents
end
def input( documents )
documents.each_pair do |symbol, contents|
contents.split.each { |word| insert( symbol, word) }
end
end
def insert( document_symbol, word )
@index.push( word.downcase ) unless @index.include?( word.downcase )
@documents[ document_symbol ] = 0 unless @documents.has_key?(
document_symbol )
@documents[ document_symbol ] += (1<<@index.index( word.downcase ))
end
def find( word )
result = []
@documents.each_pair do |symbol, value|
result.push( symbol ) if value & (1<<@index.index( word.downcase ))
end
result
end
def words
@index.sort
end
end
To verify this, I used the following tests. I just had to change which
class was being tested (@test_class defined in 'setup'):
require 'test/unit'
require 'index'
class Array
# Contents of the two arrays are the same, but the order may be
different
def equivalent(other)
self.each do |item|
if !other.include?( item ) then
return false
end
end
return true
end
end
DOC1 = "The quick brown fox"
INDEX1 = [ 'the', 'quick', 'brown', 'fox' ]
DOC2 = "Jumped over the brown dog"
INDEX2 = [ 'jumped', 'over', 'the', 'brown', 'dog' ]
DOC3 = "Cut him to the quick"
INDEX3 = [ 'cut', 'him', 'to', 'the', 'quick' ]
class TestIndex < Test::Unit::TestCase
def setup
@test_class = IndexBitmap
@i = @test_class.new
end
def test_index_single_document
@i.input( :doc1=>DOC1 )
assert_equal( INDEX1.sort, @i.words )
end
def test_index_muliple_documents_input_one_at_a_time
@i.input( :doc1=>DOC1 )
@i.input( :doc2=>DOC2 )
@i.input( :doc3=>DOC3 )
assert_equal( (INDEX1+INDEX2+INDEX3).uniq.sort, @i.words )
end
def test_index_muliple_documents_input_all_at_one_time
@i.input( :doc1=>DOC1, :doc2=>DOC2, :doc3=>DOC3 )
assert_equal( (INDEX1+INDEX2+INDEX3).uniq.sort, @i.words )
end
def test_index_single_document_on_new
j = @test_class.new( :doc1=>DOC1 )
assert_equal( INDEX1.sort, j.words )
end
def test_index_muliple_documents_input_all_at_one_time_on_new
j = @test_class.new( :doc1=>DOC1, :doc2=>DOC2, :doc3=>DOC3 )
assert_equal( (INDEX1+INDEX2+INDEX3).uniq.sort, j.words )
end
def test_index_find
@i.input( :doc1=>DOC1, :doc2=>DOC2, :doc3=>DOC3 )
assert_equal( true, [:doc1,:doc2,:doc3].equivalent( @i.find( 'the' )
) )
assert_equal( true, [:doc1,:doc3].equivalent( @i.find( 'quick' ) ) )
assert_equal( true, [:doc2].equivalent( @i.find( 'dog' ) ) )
assert_equal( true, [:doc1,:doc2].equivalent( @i.find( 'brown' ) ) )
end
end
I've read it and remembered a structure that I've studied in college,
the Trie. So I've implemented a very ineficient Trie and tried it.
In a trie, there is a tree of letters. Each word is saved in the tree
(so, words with the same root share a part of the trie, saving space).
I've added to this structure the references for each word.
This is the code:
require "pp"
require "set"
$stdout.sync = true # rubyeclipse requires it
class Trie
def initialize
@containers = Set.new
@tries = Hash.new
end
def containers(word)
if word.length == 0 then
return @containers
end
trie = @tries[ word[0,1] ]
return trie ? trie.containers(word[1...word.length]) : Set.new
end
def add(word, index)
if word.length == 0 then
@containers << index
else
# word[0,1] returns a String. word[0] returns a number (yack!)
trie = @tries[ word[0,1] ] ||= Trie.new
trie.add( word[1...word.length], index )
end
end
end
class Indexer
def initialize( texts )
@trie = Trie.new
texts.each do
|t|
t.split.each do
|w|
@trie.add(w.capitalize, t)
end
end
end
def containers(word)
@trie.containers(word.capitalize)
end
end
texts = ["The quick brown fox", "Jumped over the brown dog", "Cut him
to the quick"]
indexer = Indexer.new(texts)
puts "containers for \"the\""
pp indexer.containers('the') # -> ["The quick brown fox", "Jumped over
the brown dog", "Cut him to the quick"]
puts "containers for \"brown\""
pp indexer.containers('brown') # -> ["The quick brown fox", "Jumped
over the brown dog"]
puts "containers for \"inexistant\""
pp indexer.containers('inexistant') #-> []
This is my first submitted solution. I only started learning ruby in my
free time 4 days ago, so don't hold back the critique. I would really
appreciate comments and suggestions on how I can get more clued in to
the ruby style & conventions. I'm afraid it's a bit longer than the
others, but I'm just learning :)
class Catalogue
def initialize(start_docs=[[]])
#Expects an array of [space-delimited keyword list, object to
catalogue] arrays for each initial object
@keywords = Array.new #Array of used keywords. Position is important.
@cat_objects = Array.new #Array of [keyword bitfield, stored object]
arrays
start_docs.each do |st_doc|
self.catalogue!(st_doc[0], st_doc[1])
end
end
def each_under_kw(keyword)
#Expects a string keyword. Yields objects using that keyword.
if cindex = @keywords.index(keyword.upcase)
@cat_objects.each do |cat_obj|
yield(cat_obj[1]) unless ((cat_obj[0] & (2 ** cindex)) == 0)
end
end
end
def each
@cat_objects.each {|obj| yield obj[1]}
end
def catalogue!(keyword_list, cat_object)
#Adds a new object to the catalogue. Expects a space-delimited list of
keywords and an object to catalogue.
key_bitfield = 0
split_list = keyword_list.upcase.split
unless split_list.empty?
split_list.each do |test_keyword|
cindex = @keywords.index(test_keyword)
if cindex == nil
cindex = @keywords.length
@keywords << test_keyword
end
key_bitfield |= 2 ** cindex
end
@cat_objects << [key_bitfield , cat_object]
end
end
attr_accessor :cat_objects, :keywords
end
# Begin Demonstration
# For this demonstration, the list of keywords itself is the object
stored.
# This does not have to be the case, any object can be stored.
doc1 = "The quick brown fox"
doc2 = "Jumped over the brown dog"
doc3 = "Cut him to the quick"
demo = Catalogue.new([[doc1, doc1], [doc2, doc2]]) #Create the
catalogue with 2 objects
demo.catalogue!(doc3, doc3) #Add an object to the catalogue
print "All phrases:\n"
demo.each do |obj|
print obj + "\n"
end
print "\nList of objects with keyword 'the':\n"
demo.each_under_kw('the') do |obj|
print obj + "\n"
end
print "\nList of objects with keyword 'brown':\n"
demo.each_under_kw('brown') do |obj|
print obj + "\n"
end
print "\nList of objects with keyword 'dog':\n"
demo.each_under_kw('dog') do |obj|
print obj + "\n"
end
print "\nList of objects with keyword 'quick':\n"
demo.each_under_kw('quick') do |obj|
print obj + "\n"
end
#End Demonstration
Here's an update of my solution that fixes my dump method and makes
the block optional on the find method:
#!/usr/local/bin/ruby
# document indexing/searching class
class Index
# default index file name
INDEX_FILE = 'index.dat'
# loads existing index file, if any
def initialize(index_file = INDEX_FILE)
@terms = {}
@index = {}
@index_file = index_file
if File.exists? @index_file
@terms, @index = Marshal.load(
File.open(@index_file, 'rb') {|f| f.read})
end
end
# sets the current document being indexed
def document=(name)
@document = name
end
# adds given term to the index under the current document
def <<(term)
raise "No document defined" unless defined? @document
unless @terms.include? term
@terms[term] = @terms.length
end
i = @terms[term]
@index[@document] ||= 0
@index[@document] |= 1 << i
end
# finds documents containing all of the specified terms.
# if a block is given, each document is supplied to the
# block, and nil is returned. Otherwise, an array of
# documents is returned.
def find(*terms)
results = []
@index.each do |document, mask|
if terms.all? { |term| @terms[term] && mask[@terms[term]] != 0 }
block_given? ? yield(document) : results << document
end
end
block_given? ? nil : results
end
# dumps the entire index, showing each term and the documents
# containing that term
def dump
@terms.sort.each do |term, value|
puts "#{term}:"
@index.sort.each do |document, mask|
puts " #{document}" if mask[@terms[term]] != 0
end
end
end
# saves the index data to disk
def save
File.open(@index_file, 'wb') do |f|
Marshal.dump([@terms, @index], f)
end
end
end
if $0 == __FILE__
idx = Index.new
case ARGV.shift
when 'add'
ARGV.each do |fname|
idx.document = fname
IO.foreach(fname) do |line|
line.downcase.scan(/\w+/) { |term| idx << term }
end
end
idx.save
when 'find'
idx.find(*ARGV.collect { |s| s.downcase }) do |document|
puts document
end
when 'dump'
idx.dump
else
print <<-EOS
Usage: #$0 add file [file...] Adds files to index
#$0 find term [term...] Lists files containing all term(s)
#$0 dump Dumps raw index data
EOS
end
end
This sounds similar to how LZW compression works. Cool!
--
Into RFID? www.rfidnewsupdate.com Simple, fast, news.
> This is my first submitted solution. I only started learning ruby
> in my
> free time 4 days ago, so don't hold back the critique.
Here's a tip that leaped out at me just while glancing at your code.
We spell:
print obj + "\n"
as:
puts obj
Hope that helps.
James Edward Gray II
index.search("+ruby rails -python") {|doc, score| puts "#{score}:#{doc}"}
The results are scored by the number of matching terms. My solution
also allows updates and deletes. The most complicated method in there
is optimize. Basically what this is doing is shortening all the
document bitmaps for each term to remove the deleted document.
Cheers,
Dave
PS: I'm currently benchmarking the solutions with some surprising results.
require 'strscan'
module SimpleFerret
class Analyzer
ENGLISH_STOP_WORDS = [
"a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
"in", "into", "is", "it", "no", "not", "of", "on", "or", "s", "such",
"t", "that", "the", "their", "then", "there", "these", "they", "this",
"to", "was", "will", "with"
]
def initialize(regexp = /[[:alpha:]]+/, stop_words = ENGLISH_STOP_WORDS)
@regexp = regexp
@stop_words = stop_words.inject({}) {|h, word| h[word] = true; h}
end
def each_token(string)
ss = StringScanner.new(string)
while ss.scan_until(@regexp)
token = ss.matched.downcase
yield token unless @stop_words[token]
end
end
end
class Index
def initialize(analyzer = Analyzer.new())
@analyzer = analyzer
@index = Hash.new(0)
@docs = []
@doc_map = {}
@deleted = 0
end
def add(id, string)
delete(id) if @doc_map[id] # clear existing entry using that id
doc_num = @docs.size
@docs << id
@doc_map[id] = doc_num
doc_mask = 1 << doc_num
@analyzer.each_token(string) do |token|
@index[token] |= doc_mask
end
end
alias :[]= :add
def delete(id)
@deleted |= 1 << @doc_map[id]
end
def search(search_string)
must = []
should = []
must_not = []
search_string.split.each do |st|
case st[0]
when ?+: @analyzer.each_token(st) {|t| must << t}
when ?-: @analyzer.each_token(st) {|t| must_not << t}
else @analyzer.each_token(st) {|t| should << t}
end
end
if not must.empty?
bitmap = -1 # 0b111111111111....
must.each {|token| bitmap &= @index[token]}
else # no point in using should if we have must
bitmap = 0
should.each {|token| bitmap |= @index[token]}
end
if bitmap > 0
must_not.each {|token| bitmap &= ~ @index[token]}
end
bitmap &= ~ @deleted
doc_num = 0
results = []
while (bitmap > 0)
if (bitmap & 1) == 1
results << score_result(doc_num, should, must.size)
end
bitmap >>= 1
doc_num += 1
end
results.sort! do |(adoc, ascore), (bdoc, bscore)|
bscore <=> ascore
end.each do |(doc, score)|
yield(doc, score)
end
end
def size
delete_count = 0
bitmask = 1
while bitmask < @deleted
delete_count += 1 if (bitmask & @deleted) > 0
bitmask <<= 1
end
@docs.size - delete_count
end
alias :num_docs :size
def unique_terms
@index.size
end
# will need to give it a name the first time
def write(fname = @fname)
@fname = fname
File.open(fname, "wb") {|f| Marshal.dump(self, f)}
end
def Index.read(fname)
Marshal.load(File.read(fname))
end
# removes deleted documents from the index
def optimize
masks = []; bitmask = 1;
mask = 0; bm = 1; last_mask = -1;
doc_num = 0
while (bitmask < @deleted)
if (@deleted & bitmask) == 0
mask |= bm
bm <<= 1
last_mask <<= 1
doc_num += 1
elsif
@docs.delete_at(doc_num)
masks << mask
mask = 0
end
bitmask <<= 1
end
@doc_map = {}
@docs.each_index {|i| @doc_map[@docs[i]] = i}
masks << last_mask
@index.each_pair do |id, bitmap|
new_bitmap = 0
masks.each do |mask|
new_bitmap |= (bitmap & mask)
bitmap >>= 1
end
if new_bitmap > 0
@index[id] = new_bitmap
else
@index.delete(id)
end
end
@deleted = 0
end
private
def score_result(doc_num, should, must_count)
score = must_count
should.each do |term|
score += 1 if (@index[term] & 1 << doc_num) > 0
end
return [@docs[doc_num], score]
end
end
end
if $0 == __FILE__
include SimpleFerret
INDEX_FILE = "simple.idx"
if File.exists?(INDEX_FILE)
idx = Index.read(INDEX_FILE)
else
idx = Index.new
end
case ARGV.shift
when 'add'
ARGV.each {|fname| idx.add(fname, File.read(fname))}
idx.write(INDEX_FILE)
when 'find'
idx.search(ARGV.join(" ")) { |doc, score| puts "#{score}:#{doc}" }
else
print <<-EOS
Usage: #$0 add file [file...] Adds files to index
#$0 find term [term...] Runs the query on the index
EOS
end
end
> PS: I'm currently benchmarking the solutions with some surprising
> results.
Please do share...
James Edward Gray II
Will do. Still got a bit of work to do.
http://members.iinet.net.au/~soxbox/ruby_quiz_54/indexer.rb
And also a script that benchmarks it against grepping the files manually
(My first attempts ended up being slower... eeep):
http://members.iinet.net.au/~soxbox/ruby_quiz_54/indexer_bm.rb
I used shakespeare's sonnets from project gutenburg to do the testing.
Results:
user system total real
first word 0.157000 0.047000 0.204000 ( 0.219000)
median word 0.109000 0.046000 0.155000 ( 0.156000)
last word 0.141000 0.063000 0.204000 ( 0.203000)
index last word 0.062000 0.000000 0.062000 ( 0.063000)
non-existant 0.063000 0.000000 0.063000 ( 0.062000)
grep first 1.656000 1.453000 3.109000 ( 3.625000)
grep median 1.969000 1.438000 3.407000 ( 3.438000)
grep last 2.343000 1.140000 3.483000 ( 4.125000)
And, if I take out the indexing of the word list file (which wont occur
if you use it from the commandline):
user system total real
first word 0.562000 0.078000 0.640000 ( 0.687000)
median word 0.547000 0.047000 0.594000 ( 0.703000)
last word 0.547000 0.078000 0.625000 ( 1.000000)
index last word 0.469000 0.000000 0.469000 ( 0.500000)
non-existant 0.468000 0.031000 0.499000 ( 0.500000)
grep first 1.875000 1.219000 3.094000 ( 3.219000)
grep median 1.875000 1.250000 3.125000 ( 3.188000)
grep last 1.813000 1.344000 3.157000 ( 3.203000)
#####################################################################################
This email has been scanned by MailMarshal, an email content filter.
#####################################################################################
class Indexer
attr_reader :words, :index
def initialize(docs)
@words = []
@index = {}
docs.each do |key,doc|
docwords = divide_words(doc)
@words |= docwords
@index[key] = 0
docwords.each do |w|
n = @words.index(w)
@index[key] |= 1 << n if n
end
end
end
def divide_words(words)
words_list = words.downcase.split(/[^\w']/).uniq - [""]
words_list.each { |w| w.gsub!(/^\W*|\W*$/, '') }
words_list.uniq!
words_list
end
def [](word)
query(word)
end
def query(query)
search_words = divide_words(query)
bit_mask = 0
search_words.each do |w|
word_index = @words.index(w)
(bit_mask = 0; break) if(!word_index)
bit_mask |= 1 << word_index
end
result = []
if(bit_mask>0) then
@index.each do |name,bits|
(result << name) if(bits & bit_mask == bit_mask)
end
end
result
end
def display
puts "Index #{@words.length} word#{'s' if @words.length > 1}"
puts "[#{@words.join(', ')}]"
@index.each do |k,v|
printf("%s: %b\n", k, v)
end
end
end
docs = {
:doc1 => "The quick brown fox",
:doc2 => "Jumped over the brown dog",
:doc3 => "Cut him to the quick",
:doc4 => "He's got some punctuation.",
:doc5 => "I just need a lot more different words to put in here",
:doc6 => "1 2 3 4 5 6 7 8 9 0 a b c d e f g h i j k l m n o p q r",
:doc7 => "She's going to the 'store' or \"store\""
}
index = Indexer.new(docs)
index.display
puts "[#{index["the"].join(",")}]"
puts "[#{index["quick"].join(",")}]"
puts "[#{index["fox"].join(",")}]"
puts "[#{index["blah"].join(",")}]"
puts "[#{index["fox quick"].join(",")}]"
looking at all those long subscriptions.. couldn't resist to make my own. probably not as complete/fast as those, but small and neat looking :D
should be pretty much self explenators. it's using bignums for the index bitmaps.
have a nice day,
Kash
-----------
docs = {
:doc1 => "The quick brown fox",
:doc2 => "Jumped over the brown dog",
:doc3 => "Cut him to the quick"
}
# building word list
lst = docs.map {|k,v| v.split(/[^\w']+/) }.flatten.uniq
# building index
index = docs.inject(Hash.new(0)) do |hash, (k, v)|
lst.each { |x| hash[k] |= 2**lst.index(x) if v.index(x) }
hash
end
# query the string
q = ARGV[0] || "Cut"
res = index.map { |k,v| k if (v & 2**lst.index(q)) > 0 }.compact
p res
______________________________________________________________
Verschicken Sie romantische, coole und witzige Bilder per SMS!
Jetzt bei WEB.DE FreeMail: http://f.web.de/?mc=021193
> Thanks, I'll remember that. I was getting rather confused about which
> of the print, puts, or p commands to use for any given task.
print() is for when you want to do all the work yourself:
>> print [1, 2, 3]
123=> nil
>> print "a line"
a line=> nil
(Note the lack of newlines above.)
puts() is when you want Ruby to make pretty human readable output for
you:
>> puts [1, 2, 3]
1
2
3
=> nil
>> puts "a line"
a line
=> nil
p() is for "inspect()ing" objects to see what they look like under
the hood (great for debugging):
>> p [1, 2, 3]
[1, 2, 3]
=> nil
>> p "a line"
"a line"
=> nil
http://www.davebalmain.com/articles/2005/11/15/ruby-quiz-54-results
There were certainly a few surprises in there for me. Like how did Bob
and Dale manage to index so quickly without using the StringScanner
class. And why is Bob's search so fast, even though he didn't use an
inverted index. Even for much larger document sets his search runs
almost the same speed as Simple Ferret. He used exactly the same
algorithm as is used in Simple Search, yet his searches are much
quicker. I was also amazed at how well these simple solutions compared
to Lucene, although I realize I'm comparing apples with oranges.
Cheers,
Dave
After reading your results I thought I would try and make a couple of
simple changes. I attempted to cleanup the 'insert' routine since that
is where most of the processing time seemed to be spent. I also added
the ability to perform multi-term searching (individual terms or single
string). This will worsen the look-up times, but it might be a good
change.
If possible, could you run this version through your test to see how it
does?
class IndexHash
def initialize( documents=nil )
@index = Hash.new( [] )
input( documents ) if documents
end
def input( documents )
documents.each_pair do |symbol, contents|
contents.split.each { |word| insert( symbol, word) }
end
end
def insert( document_symbol, word )
w = word.downcase
@index[w] += [ document_symbol ] unless @index[w].include?(
document_symbol )
end
def find( *strings )
result = []
strings.each do |string|
string.split.each do |word|
result += @index[ word.downcase ]
end
end
result.uniq
end
def words
@index.keys.sort
end
end
class IndexBitmap
def initialize( documents=nil )
@index = []
@documents = Hash.new( 0 )
input( documents ) if documents
end
def input( documents )
documents.each_pair do |symbol, contents|
contents.split.each { |word| insert( symbol, word) }
end
end
def insert( document_symbol, word )
w = word.downcase
@index.push( w ) unless @index.include?( w )
@documents[ document_symbol ] |= (1<<@index.index( w ))
end
def find( *strings )
result = []
mask = 0
strings.each do |string|
string.split.each do |word|
w = word.downcase
mask |= (1<<@index.index(w)) if @index.index(w)
end
end
@documents.each_pair do |symbol, value|
result.push( symbol ) if value & mask
Hi Dale,
I've updated the results. You may notice a number of the indexing
times have changed. I modified my test to make it a little fairer.
Anyway, you earned yourself some green realestate on the bottom chart.
Somethings up with the search results from your first bitmap index.
I'm just running that one again. By the way, I had to change the
second last line of your find method to;
result.push( symbol ) if value & mask > 0
Cheers,
Dave
Thanks for updating the results.
My first version of the IndexBitmap class has a bug in the 'insert'
routine. It added values rather than ORed values together. Oops. None
of my tests originally covered this case so it slipped by me.
@documents[ document_symbol ] += (1<<@index.index( word.downcase ))
should be
@documents[ document_symbol ] |= (1<<@index.index( word.downcase ))
Thanks for catching my error where I wasn't checking the mask correctly
in all cases. Again, I lacked a test for this case. Shame. Shame.
Thank you very much for your testing and analysis.
--Dale
Blind dumb luck, I'm sure :)
For kicks, I've created a version of my solution that uses an inverted
index:
http://users.adelphia.net/~showaltb/rubyquiz/54/indexer2.rb
Nice. Search is twice as fast as your original version. You've
restored my faith in smart algorithms. ;-)
Dave
Before we get into the bit magic, check out this simple Hash style solution by
Dale Martenson:
class IndexHash
def initialize( documents=nil )
@index = Hash.new( [] )
input( documents ) if documents
end
def input( documents )
documents.each_pair do |symbol, contents|
contents.split.each { |word| insert( symbol, word) }
end
end
def insert( document_symbol, word )
w = word.downcase
unless @index[w].include?( document_symbol )
@index[w] += [ document_symbol ]
end
end
def find( *strings )
result = []
strings.each do |string|
string.split.each do |word|
result += @index[ word.downcase ]
end
end
result.uniq
end
def words
@index.keys.sort
end
end
We see in initialize() that the index is just a Hash. The input() method is
also easy to understand, as it just adds each word of a document to the Hash, as
an Array of symbolic document names where the word can be found (see insert()).
The other side of the equation is find(). It takes an Array of Strings, dices
those up into words, combs the index for each word, and adds all the connected
document symbols to the result Array it returns.
David Balmain has done some timing of the submitted solutions:
http://www.davebalmain.com/articles/2005/11/15/ruby-quiz-54-results
This isn't the fastest solution for searches, but it still helped me understand
what we are shooting for here.
Now let's take it a step further and look at Bob Showalter's bit solution:
#!/usr/local/bin/ruby
# document indexing/searching class
class Index
# default index file name
INDEX_FILE = 'index.dat'
# loads existing index file, if any
def initialize(index_file = INDEX_FILE)
@terms = {}
@index = {}
@index_file = index_file
if File.exists? @index_file
@terms, @index = Marshal.load(
File.open(@index_file, 'rb') {|f| f.read})
end
end
# ...
We immediately see that this code has two Hashes, one for the terms and one for
the index. We can also see that Marshal is used to save and load these Hashes.
Now let's look at the methods that add to the index:
# ...
# sets the current document being indexed
def document=(name)
@document = name
end
# adds given term to the index under the current document
def <<(term)
raise "No document defined" unless defined? @document
unless @terms.include? term
@terms[term] = @terms.length
end
i = @terms[term]
@index[@document] ||= 0
@index[@document] |= 1 << i
end
# ...
The first method just sets the name of the document we are currently dealing
with. The second adds a single term to the index, under that document name.
The first step in adding a term is to place it in the terms Hash. The key is
the term and the value is the number of pairs already in the Hash (basically a
numerical index). Why didn't Bob just use an Array here? Because it would slow
down lookups. You would have to walk the Array to find the term in question
each time you needed to know its index.
Once you have an index for the new term, it's time to record it in the index
Hash, under the current document name. Each document name is given a single
Integer for a value. The bit at the term index is then just flipped on to
indicate the presence of a term. This will make for some big numbers, but
remember that Ruby will automatically switch to Bignum as needed.
Now we need the tools to get the info back out:
# ...
# ...
Again, find() is our primary search method. It walks the document listing,
checking for any document containing all the terms. (That's different from the
first solution we looked at which found documents containing any terms.) A term
is found, simply by checking to see if a given bit is on. Ruby makes this easy
since the indexing method, [](), returns bits for Integers. If you provided a
block to find(), each document is passed when found. Otherwise, find() collects
and returns an Array of the documents.
Both dump() and save() are obvious and do exactly what the comments say they do.
Here's the last bit of code:
# ...
if $0 == __FILE__
idx = Index.new
case ARGV.shift
when 'add'
ARGV.each do |fname|
idx.document = fname
IO.foreach(fname) do |line|
line.downcase.scan(/\w+/) { |term| idx << term }
end
end
idx.save
when 'find'
idx.find(*ARGV.collect { |s| s.downcase }) do |document|
puts document
end
when 'dump'
idx.dump
else
print <<-EOS
Usage: #$0 add file [file...] Adds files to index
#$0 find term [term...] Lists files containing all term(s)
#$0 dump Dumps raw index data
EOS
end
end
That's a clever interface in very little code. It reads the first argument to
see if you want to "add", "find", or "dump" (similar to svn, cvs, or gem). The
rest of the arguments are the files to "add" or the terms to "find".
The other interesting element at David Balmain's result comparison page, is the
chart of capabilities. Think about how you might support queries like "Apple
AND NOT fruit". See David's Simple Ferret solution for a start on this kind of
logic.
Then think about how you might add the ability to search for phrases instead of
terms, like "Ruby Programming Language". This problem space is vast and
interesting to explore, I think.
My thanks to everyone who worked the quiz and to David Balmain for results that
helped me know what to even look at.
Tomorrow we will explore James's favorite "trick taking game"...