URL/ID shortener/exploder

13 views
Skip to first unread message

Andrew Badera

unread,
Jan 6, 2010, 11:33:28 AM1/6/10
to techva...@googlegroups.com
Hi all,

I've implemented these fairly simple algorithms twice now, in both C# and T-SQL:

(End of blog entry) http://snook.ca/archives/php/url-shortener

I can successfully shorten codes and watch them increment through the
character set, so I think that part's right, but no matter I do, I
can't decode/explode shortened values back to original form without
storing them in a DB. Is there a flaw in the decompression algo that
I'm not seeing?

∞ Andy Badera
+1 518-641-1280 Google Voice
∞ This email is: [ ] bloggable [x] ask first [ ] private
∞ Google me: http://www.google.com/search?q=andrew%20badera

Andrew Badera

unread,
Jan 6, 2010, 11:47:26 AM1/6/10
to techva...@googlegroups.com
Actually taking another look at the compression side, this is shaving
off remainders, right? Isn't that kind of ... broken?

--ab

Doon

unread,
Jan 6, 2010, 12:08:47 PM1/6/10
to TechValley Ruby Brigade
I've never seen this done without some sort of DB. Normally you are
using this technique to shorten the ID field.

So as opposed to example.com/12345678 you have example.com/a14s or
the like thusly saving 4 chars. Basically you just creating your own
base System and converting between decimal and your custom base. So
the rounding / dropping of the remainder normally doesn't matter since
you are converting to whole integers normally used as the ID fields.


I do it in python (stolen from someplace on the web) like this..


url56 = '23456789abcdefghijkmnpqrstuvwxyzABCDEFGHJKLMNPQRSTUVWXYZ'

def to_url56(num):
return dec_to_anybase(num, url56)

def from_url56(value):
return anybase_to_dec(value, url56)

# base 10 to any base using basestring for digits
def dec_to_anybase(num, basestring):
base = len(basestring)
new = ''
current = num
while current >= base:
remainder = current % base
digit = basestring[remainder]
new = '%s%s' % (digit, new)
current = current / base
if basestring[current]: # non-zero indexing
new = '%s%s' % (basestring[current], new)
return new

# any base defined by basestring to base 10
def anybase_to_dec(value, basestring):
base = len(basestring)
n = 0
count = 0
while value:
last = len(value) - 1
digit = value[last]
n += basestring.index(digit) * base**count
value = value[:last]
count += 1
return n


-Patrick


On Jan 6, 11:47 am, Andrew Badera <and...@badera.us> wrote:
> Actually taking another look at the compression side, this is shaving
> off remainders, right? Isn't that kind of ... broken?
>
> --ab
>
>
>
> On Wed, Jan 6, 2010 at 11:33 AM, Andrew Badera <and...@badera.us> wrote:
> > Hi all,
>
> > I've implemented these fairly simple algorithms twice now, in both C# and T-SQL:
>

> > (End of blog entry)http://snook.ca/archives/php/url-shortener

Andrew Badera

unread,
Jan 6, 2010, 12:12:27 PM1/6/10
to techva...@googlegroups.com
Yeah, this is DB backed, shortening floating point (18,0) numerics as IDs.

Does the base value need to be even? I'm coming up small percentages
short on decode.

> --
> You received this message because you are subscribed to the Google Groups "TechValley Ruby Brigade" group.
> To post to this group, send email to techva...@googlegroups.com.
> To unsubscribe from this group, send email to techvalleyrb...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/techvalleyrb?hl=en.
>
>
>
>

Doon

unread,
Jan 6, 2010, 12:23:49 PM1/6/10
to TechValley Ruby Brigade
On Jan 6, 12:12 pm, Andrew Badera <and...@badera.us> wrote:
> Yeah, this is DB backed, shortening floating point (18,0) numerics as IDs.
>
> Does the base value need to be even? I'm coming up small percentages
> short on decode.
>

The IDs are floats? so you could have an id 1.2456 ? I not sure if
what will work do to the use of modulus, in the conversion algorithm.
The length of the base shouldn't matter, neither should weather or not
it is even or odd. since you are basically just using it as a
substitution hash. I like the 56char one that I was using since it
removes some of the letter/numbers that can be confused (0O 1l,
etc..).

-Patrick

Andrew Badera

unread,
Jan 6, 2010, 12:24:57 PM1/6/10
to techva...@googlegroups.com
floats with all 18 places to the left of 0. all whole numbers.

Marnen Laibow-Koser

unread,
Jan 6, 2010, 12:33:44 PM1/6/10
to techva...@googlegroups.com, TechValley Ruby Brigade
On Jan 6, 2010, at 12:08 PM, Doon <pmul...@gmail.com> wrote:

> I've never seen this done without some sort of DB. Normally you are
> using this technique to shorten the ID field.
>
> So as opposed to example.com/12345678 you have example.com/a14s or
> the like thusly saving 4 chars. Basically you just creating your own
> base System and converting between decimal and your custom base. So
> the rounding / dropping of the remainder normally doesn't matter since
> you are converting to whole integers normally used as the ID fields.

I just implemented something like this in Ruby yesterday, using Doug
Crockford's enhanced base 32 encoding. I'll post code later today and
maybe put it into a gem.

Note also that in Ruby, Numeric#to_s and String#to_i take an optional
base argument.

Best,
--
Marnen Laibow-Koser
http://www.marnen.org
mar...@marnen.org
>

Marnen Laibow-Koser

unread,
Jan 6, 2010, 12:36:22 PM1/6/10
to techva...@googlegroups.com, techva...@googlegroups.com

On Jan 6, 2010, at 12:24 PM, Andrew Badera <and...@badera.us> wrote:

> floats with all 18 places to the left of 0. all whole numbers.

Then why not make it a bigint?

Doon

unread,
Jan 6, 2010, 12:41:17 PM1/6/10
to TechValley Ruby Brigade
On Jan 6, 12:24 pm, Andrew Badera <and...@badera.us> wrote:
> floats with all 18 places to the left of 0. all whole numbers.


Yeah I re-read it, and was like oh yeah, forgot about the precision
field :)


Can you share the code that is causing the error? Perhaps something
off in your implementation?

-Patric

Andrew Badera

unread,
Jan 6, 2010, 12:41:35 PM1/6/10
to techva...@googlegroups.com
because a bigint in sql server is only 8 bytes. numeric(18,0) starts
there, and can run up to 38 significant places.

Andrew Badera

unread,
Jan 6, 2010, 12:43:06 PM1/6/10
to techva...@googlegroups.com
I may have introduced issues with my debugging as well, but here it
is, current form:

CREATE PROCEDURE dbo.ExplodeShortID
@shortenedID varchar(20),
@idExploded numeric(18,0) OUT
AS

BEGIN
SET @idExploded = -1

IF LTRIM(RTRIM(@shortenedID)) = ''
RETURN

DECLARE @ShortChars varchar(100) =
'8A2aBb95CcDdeFfGgHhJjK6kLMm3NnOPpQqRrSsTtUuV7vXxYyZ4z'
DECLARE @charBase int = LEN(@ShortChars)
DECLARE @currentCharacter char
DECLARE @currPos int = 0

DECLARE @exploded numeric (18, 0) = 0

DECLARE @lenShortenedID int = LEN(@shortenedID)
DECLARE @i int = @lenShortenedID
DECLARE @pow int = 0;
DECLARE @someInt int = 0

PRINT '@shortenedID: ' + @shortenedID
WHILE (@i > 0)
BEGIN
PRINT 'i: ' + CONVERT(varchar(20), @i)

SET @someInt = (-1 * ( @i - LEN(@shortenedID) ))
PRINT '@someInt: ' + CONVERT(varchar(20), @someInt)

PRINT 'Getting SUBSTRING(' + CONVERT(varchar(20), @shortenedID) + ',
' + CONVERT(varchar(20), @someInt) + ', 1)'
SET @currentCharacter = SUBSTRING( @shortenedID, @someInt+1, 1 )
PRINT '@currentCharacter: ' + ISNULL(@currentCharacter, 'NULL')

SET @currPos = CHARINDEX( @currentCharacter, @ShortChars )
PRINT '@currPos: ' + CONVERT(varchar(20), @currPos)

SET @pow = POWER(@charBase, @i-1)
PRINT 'pow: ' + CONVERT(varchar(20), @pow)
SET @exploded = @exploded + (@currPos * @pow)
PRINT 'currPos * pow: ' + CONVERT(varchar(20), (@currPos * @pow))
PRINT 'exploded: ' + CONVERT(varchar(20), @exploded)

SET @i = @i - 1
END

SET @idExploded = @exploded
END

Marnen Laibow-Koser

unread,
Jan 6, 2010, 1:03:08 PM1/6/10
to techva...@googlegroups.com, techva...@googlegroups.com

On Jan 6, 2010, at 12:41 PM, Andrew Badera <and...@badera.us> wrote:

> because a bigint in sql server is only 8 bytes. numeric(18,0) starts
> there, and can run up to 38 significant places.

Ah, I'm happy to say I've never used MS SQL Server. Now I see.

[BTW, top-posting makes discussions rather hard to follow.]

Andrew Badera

unread,
Jan 6, 2010, 1:07:04 PM1/6/10
to techva...@googlegroups.com
On Wed, Jan 6, 2010 at 1:03 PM, Marnen Laibow-Koser <mar...@marnen.org> wrote:
>
>
> On Jan 6, 2010, at 12:41 PM, Andrew Badera <and...@badera.us> wrote:
>
>> because a bigint in sql server is only 8 bytes. numeric(18,0) starts
>> there, and can run up to 38 significant places.
>
> Ah, I'm happy to say I've never used MS SQL Server.  Now I see.
>
> [BTW, top-posting makes discussions rather hard to follow.]

A habit I've tried to break unsuccessfully.

We simply need smarter email clients.

Doon

unread,
Jan 6, 2010, 3:05:40 PM1/6/10
to TechValley Ruby Brigade
On Jan 6, 12:43 pm, Andrew Badera <and...@badera.us> wrote:
> I may have introduced issues with my debugging as well, but here it
> is, current form:

Hmm what version of SQL Server is this? SQL Server 2k5 doesn't seem
to like the procedure (Lots of complainting about assigning defaults
to local variables).

but lets see if we can break this down and see (MY TSQL is a bit
rusty, so No promises )


>
> CREATE PROCEDURE dbo.ExplodeShortID
>         @shortenedID varchar(20),
>         @idExploded numeric(18,0) OUT
> AS
>
> BEGIN
>         SET @idExploded = -1
>
>         IF LTRIM(RTRIM(@shortenedID)) = ''
>                 RETURN

Return if blank..

>
>         DECLARE @ShortChars varchar(100) =
> '8A2aBb95CcDdeFfGgHhJjK6kLMm3NnOPpQqRrSsTtUuV7vXxYyZ4z'

any reason above is varchar(100), why not char(53)? (mostly just
curious, doesn't seem to affect anything...


>         DECLARE @charBase int = LEN(@ShortChars)
>         DECLARE @currentCharacter char
>         DECLARE @currPos int = 0
>
>         DECLARE @exploded numeric (18, 0) = 0
>
>         DECLARE @lenShortenedID int = LEN(@shortenedID)
>         DECLARE @i int = @lenShortenedID
>         DECLARE @pow int = 0;
>         DECLARE @someInt int = 0
>
>         PRINT '@shortenedID: ' + @shortenedID
>         WHILE (@i > 0)
>         BEGIN
>                 PRINT 'i: ' + CONVERT(varchar(20), @i)
>
>                 SET @someInt = (-1 * ( @i - LEN(@shortenedID) ))
>                 PRINT '@someInt: ' + CONVERT(varchar(20), @someInt)
>
>                 PRINT 'Getting SUBSTRING(' + CONVERT(varchar(20), @shortenedID) + ',
> ' + CONVERT(varchar(20), @someInt) + ', 1)'
>                 SET @currentCharacter = SUBSTRING( @shortenedID, @someInt+1, 1 )
>                 PRINT '@currentCharacter: ' + ISNULL(@currentCharacter, 'NULL')
>
>                 SET @currPos = CHARINDEX( @currentCharacter, @ShortChars )

IN looking I think your problem is with the char index? When I run
your code I get this

@shortenedID: a23
i: 3
@someInt: 0
Getting SUBSTRING(a23,
0, 1)
@currentCharacter: a
@currPos: 2
pow: 2809
currPos * pow: 5618
exploded: 5618

But in looking 'a' is in position 4 in the string , not position 2.
Position 2 is 'A' Which makes since by default most stuff in sql is
case-insensitive.

Try using this.

SET @currPos = CHARINDEX( @currentCharacter, @ShortChars COLLATE
Latin1_General_CS_AS)

which then gives this

@shortenedID: a23
i: 3
@someInt: 0
Getting SUBSTRING(a23,
0, 1)
@currentCharacter: a
@currPos: 4

which looks way better..

-Patrick

Andrew Badera

unread,
Jan 6, 2010, 3:14:59 PM1/6/10
to techva...@googlegroups.com

COLLATE did the trick! Thanks a million Doon!

--Andy Badera

Doon

unread,
Jan 6, 2010, 3:23:32 PM1/6/10
to TechValley Ruby Brigade

On Jan 6, 3:14 pm, Andrew Badera <and...@badera.us> wrote:

WooHoo. Not bad for a non M$ guy right :) (We do have 2 MS SQL
servers that we store our backend in, and I am the author of
mod_sql_tds for proftpd, so I guess I not 100% non M$ :)) .

-Patrick

>
> --Andy Badera

Andrew Badera

unread,
Jan 6, 2010, 4:26:21 PM1/6/10
to techva...@googlegroups.com

Marnen Laibow-Koser

unread,
Jan 6, 2010, 5:09:18 PM1/6/10
to techva...@googlegroups.com
On Jan 6, 2010, at 12:33 PM, Marnen Laibow-Koser wrote:

> On Jan 6, 2010, at 12:08 PM, Doon <pmul...@gmail.com> wrote:
>
>> I've never seen this done without some sort of DB. Normally you are
>> using this technique to shorten the ID field.
>>
>> So as opposed to example.com/12345678 you have example.com/a14s or
>> the like thusly saving 4 chars. Basically you just creating your own
>> base System and converting between decimal and your custom base. So
>> the rounding / dropping of the remainder normally doesn't matter since
>> you are converting to whole integers normally used as the ID fields.
>
> I just implemented something like this in Ruby yesterday, using Doug Crockford's enhanced base 32 encoding. I'll post code later today and maybe put it into a gem.

Here's the code. It could probably use some refinement, and as I said, I'd like to make it into a gem, but...

-----begin base32.rb-----

# Utilities to support Crockford's base-32 encoding, described at http://crockford.com/wrmg/base32.html
module Base32
DIGITS = ('0'..'9').to_a + (('a'..'z').to_a - ['i', 'l', 'o', 'u'])

class << self
=begin rdoc
Encodes a number to a base-32 string.
If leading_zero is true, then a leading zero is added to the encoded string if it does not already contain a digit.
This makes it possible to determine if something is a base-32 string by checking for the presence of digits, which may be useful in writing routes.
=end
def encode(n, leading_zero = true)
return nil if !n or n.blank?

string = ''
tmp = n.to_i

begin
string = DIGITS[tmp % 32] + string
tmp = tmp >> 5 # n /= 32
end until tmp.zero?

prefix = (leading_zero and /\d/ !~ string) ? '0' : ''

prefix + string
end

# Decodes a base-32 string to an integer.
def decode(string)
return nil if string.blank?

int = 0

string.downcase.strip.each_char do |char|
int = int << 5 # int *= 32
int += DIGITS.index char
end

int
end
end
end

-------end base32.rb-------

Enjoy!

John

unread,
Jan 8, 2010, 10:10:32 AM1/8/10
to TechValley Ruby Brigade
Andrew,

I was thinking a bit about your motivation for shortening the URLs as
it relates to support. It is my opinion, having done a considerable
amount of support related to seemingly-random mixed-case strings,
particularly those used in passwords, that you might want a solution
like Marnen's that produces strings that consist only of numbers and
lower-case letters. Knowing that the letter your discussing with the
person you are supporting is definitely lower-case is worth the work.

John Fitzpatrick
Vita Rara

On Jan 6, 5:09 pm, Marnen Laibow-Koser <mar...@marnen.org> wrote:
> On Jan 6, 2010, at 12:33 PM, Marnen Laibow-Koser wrote:
>

> > On Jan 6, 2010, at 12:08 PM, Doon <pmuld...@gmail.com> wrote:
>
> >> I've never seen this done without some sort of DB.  Normally you are
> >> using this technique to shorten the ID field.
>
> >> So as opposed to example.com/12345678   you have example.com/a14s or
> >> the like  thusly saving 4 chars.  Basically you just creating your own
> >> base System and converting between decimal and your custom base. So
> >> the rounding / dropping of the remainder normally doesn't matter since
> >> you are converting to whole integers normally used as the ID fields.
>
> > I just implemented something like this in Ruby yesterday, using Doug Crockford's enhanced base 32 encoding.  I'll post code later today and maybe put it into a gem.
>
> Here's the code.  It could probably use some refinement, and as I said, I'd like to make it into a gem, but...
>
> -----begin base32.rb-----
>

> # Utilities to support Crockford's base-32 encoding, described athttp://crockford.com/wrmg/base32.html

Andrew Badera

unread,
Jan 8, 2010, 10:12:59 AM1/8/10
to techva...@googlegroups.com
Hi John,

You'll note easily-confused letters have generally been removed. But
that said, you do still have a pretty decent address space with just
numbers and lower case letters, and it certainly wouldn't hurt.

:)

--ab

Reply all
Reply to author
Forward
0 new messages