Gmail Calendar Documents Reader Web more »
Recently Visited Groups | Help | Sign in
Google Groups Home
Short But Unique (#83)
There are currently too many topics in this group that display first. To make this topic appear first, remove this option from another topic.
There was an error processing your request. Please try again.
flag
  8 messages - Collapse all  -  Translate all to Translated (View all originals)
The group you are posting to is a Usenet group. Messages posted to this group will make your email address visible to anyone on the Internet.
Your reply message has not been sent.
Your post was successful
 
From:
To:
Cc:
Followup To:
Add Cc | Add Followup-to | Edit Subject
Subject:
Validation:
For verification purposes please type the characters you see in the picture below or the numbers you hear by clicking the accessibility icon. Listen and type the numbers you hear
 
Ruby Quiz  
View profile  
 More options Jun 16 2006, 8:33 am
From: Ruby Quiz <ja...@grayproductions.net>
Date: Fri, 16 Jun 2006 21:33:17 +0900
Local: Fri, Jun 16 2006 8:33 am
Subject: [QUIZ] Short But Unique (#83)
The three rules of Ruby Quiz:

1.  Please do not post any solutions or spoiler discussion for this quiz until
48 hours have passed from the time on this message.

2.  Support Ruby Quiz by submitting ideas as often as you can:

http://www.rubyquiz.com/

3.  Enjoy!

Suggestion:  A [QUIZ] in the subject of emails about the problem helps everyone
on Ruby Talk follow the discussion.  Please reply to the original quiz message,
if you can.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- =-=-=

by Ryan Williams

I use Eclipse (with RadRails!) I have a bunch of files open in tabs. Once enough
files are open, Eclipse starts to truncate the names so that everything fits.
It truncates them from the right, which means that pretty soon I'm left unable
to tell which tab is "users_controller.rb" and which is
"users_controller_test.rb", because they're both truncated to
"users_control...".

The quiz would be to develop an abbrev-like module that shortens a set of
strings so that they are all within a specified length, and all unique.  You
shorten the strings by replacing a sequence of characters with an ellipsis
character [U+2026].  If you want it to be ascii-only, use three periods instead,
but keep in mind that then you can only replace blocks of four or more
characters.

It might look like this in operation:

  ['users_controller', 'users_controller_test',
   'account_controller', 'account_controller_test',
   'bacon'].compress(10)
  => ['users_c...', 'use...test', 'account...', 'acc...test', 'bacon']

There's a lot of leeway to vary the algorithm for selecting which characters to
crop, so extra points go to schemes that yield more readable results.


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Discussion subject changed to "Short But Unique (#83) : Clarifications" by Gautam Dey
Gautam Dey  
View profile  
 More options Jun 17 2006, 4:15 pm
From: Gautam Dey <g...@mac.com>
Date: Sun, 18 Jun 2006 05:15:50 +0900
Local: Sat, Jun 17 2006 4:15 pm
Subject: Re: [QUIZ] Short But Unique (#83) : Clarifications
Two things:

    Are the entries in the array always unique?
Or do we have to be able to handle the array such as:

['users_controller', 'users_controller_test', 'account_controller',  
'account_controller_test',  'bacon', 'users_controller_test']

Also is the unicode ellipsis counted as one or three characters?

-Gautam Dey
On Jun 16, 2006, at 5:33 AM, Ruby Quiz wrote:


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
James Edward Gray II  
View profile  
 More options Jun 18 2006, 12:40 pm
From: James Edward Gray II <ja...@grayproductions.net>
Date: Mon, 19 Jun 2006 01:40:59 +0900
Local: Sun, Jun 18 2006 12:40 pm
Subject: Re: [QUIZ] Short But Unique (#83) : Clarifications
On Jun 17, 2006, at 3:15 PM, Gautam Dey wrote:

> Two things:

>    Are the entries in the array always unique?

Let's assume they are, sure.

> Also is the unicode ellipsis counted as one or three characters?

One, in my opinion.

James Edward Gray II


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Discussion subject changed to "[Solution (Quiz #83)]" by Gautam Dey
Gautam Dey  
View profile  
 More options Jun 20 2006, 4:18 am
From: Gautam Dey <g...@mac.com>
Date: Tue, 20 Jun 2006 17:18:13 +0900
Local: Tues, Jun 20 2006 4:18 am
Subject: Re: [QUIZ] [Solution (Quiz #83)]
My entry is simple, and not very complicated. At first I was thinking  
of make it much more complicated and using the abbrev to get human  
readable entries for and abbreviated version
of the title. But that seemed to complicated things more then help.  
So, I just went for a simple algorithm. My solution basically  
consistest of taking a simple truncation of the file name, then if  
that is already taken, going to the end and shifting the ellipsis to  
the left while reveling more of the last word, till a the title does  
not match anymore. There is a very large possibility of getting an  
infinite loop. And I have not tested it on many strings. Also,  
another flaw is that if two string are identical but smaller then the  
value sent to the function, it will return both string untouched.  
Since it does not touch any string smaller or equal to the length  
passed to it.

Gautam.
------------------------------------------------------------------------
--------------------------------------------------------
#!/usr/bin/env ruby -w
#  Suggestion:  A [QUIZ] in the subject of emails about the problem  
helps everyone
#  on Ruby Talk follow the discussion.  Please reply to the original  
quiz message,
#  if you can.
#
#  -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
=-=-=-=-=-=-=
#
#  by Ryan Williams
#
#  I use Eclipse (with RadRails!) I have a bunch of files open in  
tabs. Once enough
#  files are open, Eclipse starts to truncate the names so that  
everything fits.
#  It truncates them from the right, which means that pretty soon I'm  
left unable
#  to tell which tab is "users_controller.rb" and which is
#  "users_controller_test.rb", because they're both truncated to
#  "users_control...".
#
#  The quiz would be to develop an abbrev-like module that shortens a  
set of
#  strings so that they are all within a specified length, and all  
unique.  You
#  shorten the strings by replacing a sequence of characters with an  
ellipsis
#  character [U+2026].  If you want it to be ascii-only, use three  
periods instead,
#  but keep in mind that then you can only replace blocks of four or  
more
#  characters.
#
#  It might look like this in operation:
#
#    ['users_controller', 'users_controller_test',
#     'account_controller', 'account_controller_test',
#     'bacon'].compress(10)
#    => ['users_c...', 'use...test', 'account...', 'acc...test',  
'bacon']
#
#  There's a lot of leeway to vary the algorithm for selecting which  
characters to
#  crop, so extra points go to schemes that yield more readable results.
#
#  This code is released under the GPL.

require 'Abbrev'
module  GDCompress
   def compress (size)
     usedNameHash = Hash.new
     compressedTitleNames = Array.new
     for tabTitle in self
       newTabTitle = "" # start with empty string.
       if tabTitle.length > size
           caseValue = 0
           loop do
             newTabTitle = tabTitle[0,size-(1+caseValue)] + "…" +  
tabTitle[-caseValue,caseValue]
             #print "\t#{newTabTitle} is the new tabTitleTitle for #
{tabTitle}\n"
             caseValue = caseValue + 3
             break unless usedNameHash[newTabTitle]
           end
       else
         newTabTitle = tabTitle
       end
       usedNameHash[newTabTitle] = tabTitle
       compressedTitleNames[compressedTitleNames.length] = newTabTitle
     end
     compressedTitleNames
   end
end

class Array
   include GDCompress
   extend  GDCompress
end

print ['users_controller', 'users_controller_test',
      'account_controller', 'account_controller_test',
      'bacon'].compress(10)

------------------------------------------------------------------------
---------------------------------------------------

On Jun 16, 2006, at 5:33 AM, Ruby Quiz wrote:


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Discussion subject changed to "Short But Unique (#83)" by Sander Land
Sander Land  
View profile  
 More options Jun 20 2006, 11:38 am
From: "Sander Land" <sander.l...@gmail.com>
Date: Wed, 21 Jun 2006 00:38:40 +0900
Local: Tues, Jun 20 2006 11:38 am
Subject: Re: [QUIZ] Short But Unique (#83)
Here is my solution.
It tries to generate unambiguous abbrevations, if those don't exist,
it uses the least ambiguous one and always avoids using the same
abbrevation twice.
There's also a readability thing built in, strings with many
characters at the beginning or having the characters split equally
over the beginning and ending parts are considered the most readable.

class String
  def compress(total_length, end_length)
    self[0...total_length-end_length] + '...' + self[length-end_length..-1]
  end
end

class Array
  def compress!(max_length)
    max_length = 4 if max_length < 4
        score = Hash.new(0)
        usable_length = max_length - 3
        order = (0..usable_length).sort_by{|len|
[(len-usable_length.to_f/2).abs,len].min}
        to_compress = select {|s| s.length > usable_length}
        to_compress.each {|s| order.map{|l| score[s.compress(usable_length,l)] += 1 } }
        to_compress.each{|s|
          s.replace order.map{|l| s.compress(usable_length,l) }.min{|a,b|
score[a] <=> score[b]}
          score[s] += 100
        }
        self
  end
end

if __FILE__==$0
  p ['users_controller', 'users_controller_test','account_controller',
'account_controller_test','bacon'].compress!(10)
  p Array.new(10){'abcdefghijklmnopqrstuvwxyz'}.compress!(12)
  p ['aaaaaazbbbbb','aaaaaaybbbbb'].compress!(9)
end


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Ruby Quiz  
View profile  
 More options Jun 22 2006, 10:10 am
From: Ruby Quiz <ja...@grayproductions.net>
Date: Thu, 22 Jun 2006 23:10:59 +0900
Local: Thurs, Jun 22 2006 10:10 am
Subject: [SUMMARY] Short But Unique (#83)
This is a pretty real world problem many applications struggle with.  I dare say
it's hard to get right, which is probably why so many applications don't even
try.  We had a few adventurous solvers though and they got some workable
results.  Let's look into Daniel Martin's solution.

I felt Daniel got some great output, by favoring the word boundaries of the
terms to shorten.  Here's a sample of the code being run on the quiz problem
set:

        users...er
        use...test
        account...
        acc...test
        bacon

Notice how the compression favors keeping word intact, when possible.  That
seems to give fairly good results overall.

Alright, let's see how Daniel does it:

        # Returns the length of the longest common
        # substring of "a" and "b"
        def string_similarity(a, b)
          retval = 0
          (0 ... b.length).each { |offset|
            len = 0
            (0 ... b.length - offset).each { |aind|
              if (a[aind] and b[aind+offset] == a[aind])
                len += 1
                retval = len if retval < len
              else
                len = 0
              end
            }
          }
          (1 ... a.length).each { |offset|
            len = 0
            (0 ... a.length - offset).each { |bind|
              if (b[bind] and a[bind+offset] == b[bind])
                len += 1
                retval = len if retval < len
              else
                len = 0
              end
            }
          }
          retval
        end

        # ...

The comment above this method tells you exactly what it does, which is to return
the numerical length of the longest common substring for the parameters.  It
works by walking each possible substring of b and checking that the string
occurs in a somewhere.  It then reverses and repeats the process because one
string might be longer than the other.  The highest count seen overall is
returned.

Daniel mentioned that he has been out of Ruby for a bit and his Ruby might be a
little rusty.  The above code works just fine, of course, but we can shorten it
up a bit, if we want to.  Here's another way to write the above:

        $KCODE = "u"          # make jcode happy (silence a warning)
        require "jcode"       # for String#each_char
        require "enumerator"  # for Enumerable#each_cons

        def string_similarity(str1, str2)
          long, short = [str1, str2].sort_by { |str| str.length }
          long.length.downto(0) do |len|
            long.enum_for(:each_char).each_cons(len) do |substr|
              return substr.length if short.include? substr.join
            end
          end
          return 0
        end

Again, this works the same.  I try all possible substrings of the longer string,
from longest to shortest, for inclusion in the shorter string.  Since the code
works with the longest strings first, we can return the first answer we find.
I'm not saying either method is better, but hopefully you can figure out how
they work, between the two of them.

Let's get back to Daniel's code:

        # ...

        def score_compression(target, start, len, alltargets)
          score = target.length - len
          score += 3 if len == 0
          score += 3 if (target[start,1] =~ %r(_|\W) or
                         target[start-1,2] =~ %r([a-z0-9][A-Z]))
          score += 3 if (target[start+len-1,1] =~ %r(_|\W) or
                         target[start+len-1,2] =~ %r([a-z0-9][A-Z]))
          prebit = target[0,start]
          postbit = target[start+len,target.length]
          scoreminus = 0
          alltargets.each{|s|
            scoreminus += string_similarity(s,prebit)
            scoreminus += string_similarity(s,postbit)
          }
          score - (1.0 / alltargets.length) * scoreminus
        end

        # ...

The above code just rates a possible replacement.  The target variable holds the
original string, start and len mark the area to be replaced with the repeat
string, and alltargets holds the other strings that need compressing.

The most interesting part starts on the third line of the method, where a couple
of checks are used to score substitutions at word boundaries higher.  Note that
the checks include punctuation boundaries and changes in case as a boundary.
This, in my opinion, is the source of the quite readable end result.

Note that the last section of that method compares the pieces of the string that
will be left after the replacement with other strings in the result set using
the substring count method we examined earlier.  This code is there to prevent
two strings from being compressed to the same representation (though I doubt
this weighted scoring system is perfect).

Finally, here's the method that does the actual work:

        # ...

        class Array
          def compress(n, repstr = '...')
            retval = []
            self.each { |s|
              short_specs =
              (s.length - n + repstr.length ..
               s.length - n + repstr.length + 2).inject([]) { |sp,l|
                sp + (0 .. s.length - l).map {|st| [st,l]}
              }.select { |a| a[1] > repstr.length}
              if (s.length <= n) then short_specs.unshift([0,0]) end
              retval.push(
                short_specs.inject([-999,""]) { |record, spec|
                  candidate = s.dup
                  candidate[spec[0],spec[1]] = repstr if spec[1] > 0
                  if retval.include?(candidate)
                    record
                  else
                    score = score_compression(s,spec[0],spec[1],self)
                    if score >= record[0]
                      [score, candidate]
                    else
                      record
                    end
                  end
                }[1]
              )
            }
            retval
          end
        end

        # ...

This code was a little tricky for me to break down, so let me see if I can
explain it in manageable chunks.  This method works over each string in the
Array, building a compressed replacement for each one as it goes.  To build
those replacements it first constructs an Array of indices and lengths where it
could replace content with the replacement string.  It then scores each of those
replacements and selects the highest candidate as the result.

Here's how one might invoke the above:

        # ...

        if __FILE__ == $0
          ARGV[1,ARGV.length].compress(ARGV[0].to_i).each {|s| puts s}
        end

Now, since Unicode has been such a hot topic lately, let's see how this code
handles using a real ellipsis character.  If I change the above code to:

        # ...

        if __FILE__ == $0
          puts ARGV[1..-1].compress(ARGV[0].to_i, "…")
        end

and run the code again, here's the new output:

        users…er
        use…test
        account…
        acc…test
        bacon

Oops, I asked for 10 characters but got less than that.  The reason is that the
code we just examined is counting the byte length of our replacement string, in
this bit of code right here:

        # ...

        class Array
          def compress(n, repstr = '...')
            retval = []
            self.each { |s|
              short_specs =
              (s.length - n + repstr.length ..
               s.length - n + repstr.length + 2).inject([]) { |sp,l|
                sp + (0 .. s.length - l).map {|st| [st,l]}
              }.select { |a| a[1] > repstr.length}

              # ...

As a fix, I'm going to set the $KCODE variable for Unicode support, load the
jcode library for its helper methods, and change all those repstr.length calls
to repstr.jlength.  Here's a look at those changes:

        $KCODE = "u"
        require "jcode"

        # ...

        class Array
          def compress(n, repstr = '...')
            retval = []
            self.each { |s|
              short_specs =
              (s.length - n + repstr.jlength ..
               s.length - n + repstr.jlength + 2).inject([]) { |sp,l|
                sp + (0 .. s.length - l).map {|st| [st,l]}
              }.select { |a| a[1] > repstr.jlength}

              # ...

Does that fix us up?  Let's see:

        users…ller
        users…test
        account…er
        account…st
        bacon

Bingo.  Now we get the extra characters from using the shorter string.  Don't
let people tell you Ruby can't handle Unicode today.

My thanks to all who gave this quiz a shot.  I expect you all to email your
solutions as patches all the applications out there that botch the display of
tabs like this.

Tomorrow we have an easy problem for those working through Learn to Program...


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Daniel Martin  
View profile  
 More options Jun 23 2006, 10:47 pm
From: Daniel Martin <mar...@snowplow.org>
Date: Sat, 24 Jun 2006 11:47:12 +0900
Local: Fri, Jun 23 2006 10:47 pm
Subject: Re: [SUMMARY] Short But Unique (#83)

Ruby Quiz <ja...@grayproductions.net> writes:
> Note that the last section of that method compares the pieces of the
> string that will be left after the replacement with other strings in
> the result set using the substring count method we examined earlier.
> This code is there to prevent two strings from being compressed to
> the same representation (though I doubt this weighted scoring system
> is perfect).

Actually, that's not why that code is there, or what it's doing.  What
it's doing is favoring unique strings over non-unique strings.  For
example:

irb(main):002:0> %w(apple_juice orange_juice grape_juice
                    prune_juice).compress(10)
=> ["apple...", "orange...", "grape...", "prune..."]
irb(main):003:0> %w(orange_juice orange_marmalade orange_flavor
                    orange_pulp).compress(10)
=> ["...juice", "...rmalade", "...flavor", "o...pulp"]

The idea is that stuff common to all or almost-all of the original
strings doesn't really help you when staring at a bunch of tabs.
Abbreviating "orange_juice" as "orange..." makes sense when you have a
bunch of juices, but not when you have a bunch of orange things.

Avoiding previously used abbreviations is done with the bit:

>              if retval.include?(candidate)
>                record

That is, if the to-be-returned list of abbreviations already includes
this string, don't even score it and just assume that whatever else
you had was better.  This leads to bad results when all the possible
abbreviations are already taken.

Also, if you're going to allow for unicode in the output, you really
should allow it in the input, and change all those .length to .jlength
calls, but I'm still holding out for true transparent unicode support
in ruby 2.0...


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
James Edward Gray II  
View profile  
 More options Jun 24 2006, 2:37 pm
From: James Edward Gray II <ja...@grayproductions.net>
Date: Sun, 25 Jun 2006 03:37:57 +0900
Local: Sat, Jun 24 2006 2:37 pm
Subject: Re: [SUMMARY] Short But Unique (#83)
On Jun 23, 2006, at 9:47 PM, Daniel Martin wrote:

Thanks for setting me straight Daniel.

> Also, if you're going to allow for unicode in the output, you really
> should allow it in the input, and change all those .length to .jlength
> calls, but I'm still holding out for true transparent unicode support
> in ruby 2.0...

Good point!

James Edward Gray II


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
End of messages
« Back to Discussions « Newer topic     Older topic »

Create a group - Google Groups - Google Home - Terms of Service - Privacy Policy
©2009 Google