Account Options

  1. Sign in
The old Google Groups will be going away soon, but your browser is incompatible with the new version.
Google Groups Home
« Groups Home
Message from discussion Comparing strings from the back?
The group you are posting to is a Usenet group. Messages posted to this group will make your email address visible to anyone on the Internet.
Your reply message has not been sent.
Your post was successful
 
From:
To:
Cc:
Followup To:
Add Cc | Add Followup-to | Edit Subject
Subject:
Validation:
For verification purposes please type the characters you see in the picture below or the numbers you hear by clicking the accessibility icon. Listen and type the numbers you hear
 
Dan Goodman  
View profile  
 More options Sep 10 2012, 1:44 pm
Newsgroups: comp.lang.python
From: Dan Goodman <dg.gm...@thesamovar.net>
Date: Mon, 10 Sep 2012 19:44:46 +0200
Local: Mon, Sep 10 2012 1:44 pm
Subject: Re: Comparing strings from the back?
On 10/09/2012 18:07, Dan Goodman wrote:

> On 04/09/2012 03:54, Roy Smith wrote:
>> Let's assume you're testing two strings for equality.  You've already
>> done the obvious quick tests (i.e they're the same length), and you're
>> down to the O(n) part of comparing every character.

>> I'm wondering if it might be faster to start at the ends of the strings
>> instead of at the beginning?  If the strings are indeed equal, it's the
>> same amount of work starting from either end.  But, if it turns out that
>> for real-life situations, the ends of strings have more entropy than the
>> beginnings, the odds are you'll discover that they're unequal quicker by
>> starting at the end.

>  From the rest of the thread, it looks like in most situations it won't
> make much difference as typically very few characters need to be
> compared if they are unequal.

> However, if you were in a situation with many strings which were almost
> equal, the most general way to improve the situation might be store a
> hash of the string along with the string, i.e. store (hash(x), x) and
> then compare equality of this tuple. Almost all of the time, if the
> strings are unequal the hash will be unequal. Or, as someone else
> suggested, use interned versions of the strings. This is basically the
> same solution but even better. In this case, your startup costs will be
> higher (creating the strings) but your comparisons will always be instant.

Just had another thought about this. Although it's unlikely to be
necessary in practice since (a) it's rarely necessary at all, and (b)
when it is, hashing and optionally interning seems like the better
approach, I had another idea that would be more general. Rather than
starting from the beginning or the end, why not do something like: check
the first and last character, then the len/2 character, then the len/4,
then 3*len/4, then len/8, 3*len/8, etc. You'd need to be a bit clever
about making sure you hit every character but I'm sure someone's already
got an efficient algorithm for this. You could probably even make this
cache efficient by working on cache line length blocks. Almost certainly
entirely unnecessary, but I like the original question and it's a nice
theoretical problem.

Dan


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.