Suggestion: A [QUIZ] in the subject of emails about the problem helps everyone on Ruby Talk follow the discussion. Please reply to the original quiz message, if you can.
I use Eclipse (with RadRails!) I have a bunch of files open in tabs. Once enough files are open, Eclipse starts to truncate the names so that everything fits. It truncates them from the right, which means that pretty soon I'm left unable to tell which tab is "users_controller.rb" and which is "users_controller_test.rb", because they're both truncated to "users_control...".
The quiz would be to develop an abbrev-like module that shortens a set of strings so that they are all within a specified length, and all unique. You shorten the strings by replacing a sequence of characters with an ellipsis character [U+2026]. If you want it to be ascii-only, use three periods instead, but keep in mind that then you can only replace blocks of four or more characters.
> Suggestion: A [QUIZ] in the subject of emails about the problem > helps everyone > on Ruby Talk follow the discussion. Please reply to the original > quiz message, > if you can.
> I use Eclipse (with RadRails!) I have a bunch of files open in > tabs. Once enough > files are open, Eclipse starts to truncate the names so that > everything fits. > It truncates them from the right, which means that pretty soon I'm > left unable > to tell which tab is "users_controller.rb" and which is > "users_controller_test.rb", because they're both truncated to > "users_control...".
> The quiz would be to develop an abbrev-like module that shortens a > set of > strings so that they are all within a specified length, and all > unique. You > shorten the strings by replacing a sequence of characters with an > ellipsis > character [U+2026]. If you want it to be ascii-only, use three > periods instead, > but keep in mind that then you can only replace blocks of four or more > characters.
> There's a lot of leeway to vary the algorithm for selecting which > characters to > crop, so extra points go to schemes that yield more readable results.
My entry is simple, and not very complicated. At first I was thinking of make it much more complicated and using the abbrev to get human readable entries for and abbreviated version of the title. But that seemed to complicated things more then help. So, I just went for a simple algorithm. My solution basically consistest of taking a simple truncation of the file name, then if that is already taken, going to the end and shifting the ellipsis to the left while reveling more of the last word, till a the title does not match anymore. There is a very large possibility of getting an infinite loop. And I have not tested it on many strings. Also, another flaw is that if two string are identical but smaller then the value sent to the function, it will return both string untouched. Since it does not touch any string smaller or equal to the length passed to it.
Gautam. ------------------------------------------------------------------------ -------------------------------------------------------- #!/usr/bin/env ruby -w # Suggestion: A [QUIZ] in the subject of emails about the problem helps everyone # on Ruby Talk follow the discussion. Please reply to the original quiz message, # if you can. # # -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- =-=-=-=-=-=-= # # by Ryan Williams # # I use Eclipse (with RadRails!) I have a bunch of files open in tabs. Once enough # files are open, Eclipse starts to truncate the names so that everything fits. # It truncates them from the right, which means that pretty soon I'm left unable # to tell which tab is "users_controller.rb" and which is # "users_controller_test.rb", because they're both truncated to # "users_control...". # # The quiz would be to develop an abbrev-like module that shortens a set of # strings so that they are all within a specified length, and all unique. You # shorten the strings by replacing a sequence of characters with an ellipsis # character [U+2026]. If you want it to be ascii-only, use three periods instead, # but keep in mind that then you can only replace blocks of four or more # characters. # # It might look like this in operation: # # ['users_controller', 'users_controller_test', # 'account_controller', 'account_controller_test', # 'bacon'].compress(10) # => ['users_c...', 'use...test', 'account...', 'acc...test', 'bacon'] # # There's a lot of leeway to vary the algorithm for selecting which characters to # crop, so extra points go to schemes that yield more readable results. # # This code is released under the GPL.
require 'Abbrev' module GDCompress def compress (size) usedNameHash = Hash.new compressedTitleNames = Array.new for tabTitle in self newTabTitle = "" # start with empty string. if tabTitle.length > size caseValue = 0 loop do newTabTitle = tabTitle[0,size-(1+caseValue)] + "…" + tabTitle[-caseValue,caseValue] #print "\t#{newTabTitle} is the new tabTitleTitle for # {tabTitle}\n" caseValue = caseValue + 3 break unless usedNameHash[newTabTitle] end else newTabTitle = tabTitle end usedNameHash[newTabTitle] = tabTitle compressedTitleNames[compressedTitleNames.length] = newTabTitle end compressedTitleNames end end
class Array include GDCompress extend GDCompress end
> Suggestion: A [QUIZ] in the subject of emails about the problem > helps everyone > on Ruby Talk follow the discussion. Please reply to the original > quiz message, > if you can.
> I use Eclipse (with RadRails!) I have a bunch of files open in > tabs. Once enough > files are open, Eclipse starts to truncate the names so that > everything fits. > It truncates them from the right, which means that pretty soon I'm > left unable > to tell which tab is "users_controller.rb" and which is > "users_controller_test.rb", because they're both truncated to > "users_control...".
> The quiz would be to develop an abbrev-like module that shortens a > set of > strings so that they are all within a specified length, and all > unique. You > shorten the strings by replacing a sequence of characters with an > ellipsis > character [U+2026]. If you want it to be ascii-only, use three > periods instead, > but keep in mind that then you can only replace blocks of four or more > characters.
> There's a lot of leeway to vary the algorithm for selecting which > characters to > crop, so extra points go to schemes that yield more readable results.
Here is my solution. It tries to generate unambiguous abbrevations, if those don't exist, it uses the least ambiguous one and always avoids using the same abbrevation twice. There's also a readability thing built in, strings with many characters at the beginning or having the characters split equally over the beginning and ending parts are considered the most readable.
class String def compress(total_length, end_length) self[0...total_length-end_length] + '...' + self[length-end_length..-1] end end
if __FILE__==$0 p ['users_controller', 'users_controller_test','account_controller', 'account_controller_test','bacon'].compress!(10) p Array.new(10){'abcdefghijklmnopqrstuvwxyz'}.compress!(12) p ['aaaaaazbbbbb','aaaaaaybbbbb'].compress!(9) end
This is a pretty real world problem many applications struggle with. I dare say it's hard to get right, which is probably why so many applications don't even try. We had a few adventurous solvers though and they got some workable results. Let's look into Daniel Martin's solution.
I felt Daniel got some great output, by favoring the word boundaries of the terms to shorten. Here's a sample of the code being run on the quiz problem set:
users...er use...test account... acc...test bacon
Notice how the compression favors keeping word intact, when possible. That seems to give fairly good results overall.
Alright, let's see how Daniel does it:
# Returns the length of the longest common # substring of "a" and "b" def string_similarity(a, b) retval = 0 (0 ... b.length).each { |offset| len = 0 (0 ... b.length - offset).each { |aind| if (a[aind] and b[aind+offset] == a[aind]) len += 1 retval = len if retval < len else len = 0 end } } (1 ... a.length).each { |offset| len = 0 (0 ... a.length - offset).each { |bind| if (b[bind] and a[bind+offset] == b[bind]) len += 1 retval = len if retval < len else len = 0 end } } retval end
# ...
The comment above this method tells you exactly what it does, which is to return the numerical length of the longest common substring for the parameters. It works by walking each possible substring of b and checking that the string occurs in a somewhere. It then reverses and repeats the process because one string might be longer than the other. The highest count seen overall is returned.
Daniel mentioned that he has been out of Ruby for a bit and his Ruby might be a little rusty. The above code works just fine, of course, but we can shorten it up a bit, if we want to. Here's another way to write the above:
$KCODE = "u" # make jcode happy (silence a warning) require "jcode" # for String#each_char require "enumerator" # for Enumerable#each_cons
def string_similarity(str1, str2) long, short = [str1, str2].sort_by { |str| str.length } long.length.downto(0) do |len| long.enum_for(:each_char).each_cons(len) do |substr| return substr.length if short.include? substr.join end end return 0 end
Again, this works the same. I try all possible substrings of the longer string, from longest to shortest, for inclusion in the shorter string. Since the code works with the longest strings first, we can return the first answer we find. I'm not saying either method is better, but hopefully you can figure out how they work, between the two of them.
Let's get back to Daniel's code:
# ...
def score_compression(target, start, len, alltargets) score = target.length - len score += 3 if len == 0 score += 3 if (target[start,1] =~ %r(_|\W) or target[start-1,2] =~ %r([a-z0-9][A-Z])) score += 3 if (target[start+len-1,1] =~ %r(_|\W) or target[start+len-1,2] =~ %r([a-z0-9][A-Z])) prebit = target[0,start] postbit = target[start+len,target.length] scoreminus = 0 alltargets.each{|s| scoreminus += string_similarity(s,prebit) scoreminus += string_similarity(s,postbit) } score - (1.0 / alltargets.length) * scoreminus end
# ...
The above code just rates a possible replacement. The target variable holds the original string, start and len mark the area to be replaced with the repeat string, and alltargets holds the other strings that need compressing.
The most interesting part starts on the third line of the method, where a couple of checks are used to score substitutions at word boundaries higher. Note that the checks include punctuation boundaries and changes in case as a boundary. This, in my opinion, is the source of the quite readable end result.
Note that the last section of that method compares the pieces of the string that will be left after the replacement with other strings in the result set using the substring count method we examined earlier. This code is there to prevent two strings from being compressed to the same representation (though I doubt this weighted scoring system is perfect).
Finally, here's the method that does the actual work:
# ...
class Array def compress(n, repstr = '...') retval = [] self.each { |s| short_specs = (s.length - n + repstr.length .. s.length - n + repstr.length + 2).inject([]) { |sp,l| sp + (0 .. s.length - l).map {|st| [st,l]} }.select { |a| a[1] > repstr.length} if (s.length <= n) then short_specs.unshift([0,0]) end retval.push( short_specs.inject([-999,""]) { |record, spec| candidate = s.dup candidate[spec[0],spec[1]] = repstr if spec[1] > 0 if retval.include?(candidate) record else score = score_compression(s,spec[0],spec[1],self) if score >= record[0] [score, candidate] else record end end }[1] ) } retval end end
# ...
This code was a little tricky for me to break down, so let me see if I can explain it in manageable chunks. This method works over each string in the Array, building a compressed replacement for each one as it goes. To build those replacements it first constructs an Array of indices and lengths where it could replace content with the replacement string. It then scores each of those replacements and selects the highest candidate as the result.
Here's how one might invoke the above:
# ...
if __FILE__ == $0 ARGV[1,ARGV.length].compress(ARGV[0].to_i).each {|s| puts s} end
Now, since Unicode has been such a hot topic lately, let's see how this code handles using a real ellipsis character. If I change the above code to:
# ...
if __FILE__ == $0 puts ARGV[1..-1].compress(ARGV[0].to_i, "…") end
and run the code again, here's the new output:
users…er use…test account… acc…test bacon
Oops, I asked for 10 characters but got less than that. The reason is that the code we just examined is counting the byte length of our replacement string, in this bit of code right here:
As a fix, I'm going to set the $KCODE variable for Unicode support, load the jcode library for its helper methods, and change all those repstr.length calls to repstr.jlength. Here's a look at those changes:
Bingo. Now we get the extra characters from using the shorter string. Don't let people tell you Ruby can't handle Unicode today.
My thanks to all who gave this quiz a shot. I expect you all to email your solutions as patches all the applications out there that botch the display of tabs like this.
Tomorrow we have an easy problem for those working through Learn to Program...
Ruby Quiz <ja...@grayproductions.net> writes: > Note that the last section of that method compares the pieces of the > string that will be left after the replacement with other strings in > the result set using the substring count method we examined earlier. > This code is there to prevent two strings from being compressed to > the same representation (though I doubt this weighted scoring system > is perfect).
Actually, that's not why that code is there, or what it's doing. What it's doing is favoring unique strings over non-unique strings. For example:
The idea is that stuff common to all or almost-all of the original strings doesn't really help you when staring at a bunch of tabs. Abbreviating "orange_juice" as "orange..." makes sense when you have a bunch of juices, but not when you have a bunch of orange things.
Avoiding previously used abbreviations is done with the bit:
> if retval.include?(candidate) > record
That is, if the to-be-returned list of abbreviations already includes this string, don't even score it and just assume that whatever else you had was better. This leads to bad results when all the possible abbreviations are already taken.
Also, if you're going to allow for unicode in the output, you really should allow it in the input, and change all those .length to .jlength calls, but I'm still holding out for true transparent unicode support in ruby 2.0...
>> Note that the last section of that method compares the pieces of the >> string that will be left after the replacement with other strings in >> the result set using the substring count method we examined earlier. >> This code is there to prevent two strings from being compressed to >> the same representation (though I doubt this weighted scoring system >> is perfect).
> Actually, that's not why that code is there, or what it's doing. What > it's doing is favoring unique strings over non-unique strings. For > example:
> The idea is that stuff common to all or almost-all of the original > strings doesn't really help you when staring at a bunch of tabs. > Abbreviating "orange_juice" as "orange..." makes sense when you have a > bunch of juices, but not when you have a bunch of orange things.
> Avoiding previously used abbreviations is done with the bit:
>> if retval.include?(candidate) >> record
> That is, if the to-be-returned list of abbreviations already includes > this string, don't even score it and just assume that whatever else > you had was better. This leads to bad results when all the possible > abbreviations are already taken.
Thanks for setting me straight Daniel.
> Also, if you're going to allow for unicode in the output, you really > should allow it in the input, and change all those .length to .jlength > calls, but I'm still holding out for true transparent unicode support > in ruby 2.0...