Google Groups no longer accepts new Usenet posts or subscriptions. Historical content remains viewable.

Re: hash function for large strings


Bernie Deitrick

Mar 7, 2006, 12:35:37
It depends on your strings, and how similar they are. Anything from a simple =LEFT(A1,255) to a
more complicated sampling of every third (or fourth or tenth) letter could work....

HTH,
Bernie
MS Excel MVP


"jheby" <jh...@discussions.microsoft.com> wrote in message
news:FB83AD4A-0184-4717...@microsoft.com...
> Excel vlookup function returns #VALUE when the lookup value exceeds 256
> characters. I need a hash function to transform large strings into a value
> that does not exceed the 256 character limit. Are there any quick solutions
> to this problem?
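One general-purpose answer to the question, not proposed in the thread itself, is a fixed-length digest: any input string maps to a short key that fits easily under the limit. A minimal sketch in Python (which Harlan later mentions as a suitable tool for this job); note a digest is one-way, so the lookup table must store digests of its keys too:

```python
import hashlib

def short_key(s):
    """Map an arbitrarily long string to a fixed 64-character hex key."""
    return hashlib.sha256(s.encode("utf-8")).hexdigest()

# A 10,000-character entry still yields a 64-character key:
print(len(short_key("A" * 10000)))  # 64
```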


Niek Otten

Mar 7, 2006, 12:54:30
It depends on what possible values of each character of the string you can
exclude. If they can be each of the possible 256 characters, there are
limited hashing possibilities; you can treat upper- and lowercase letters
the same, since VLOOKUP will do that anyway. But that doesn't save you a
lot. If there are just the alphabet and 0-9, you can exclude a lot more.
Then you can use a number base conversion function to realize the gain in
storage requirement. It may, however, be difficult to find one that uses all
256 possibilities. I have seen one (from Ron Rosenfeld) that uses 0-9 and
upper- and lowercase letters.
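The base-conversion idea can be sketched as follows (Python; the 36-symbol source alphabet and 94-symbol target alphabet are illustrative assumptions, and a real version would need a length prefix so that leading "zero" symbols stay reversible):

```python
SRC = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"   # 36 allowed input symbols
DST = "".join(chr(c) for c in range(33, 127))  # 94 printable ASCII symbols

def repack(s):
    """Re-express a base-36 string in base 94, shrinking it by about 21%."""
    n = 0
    for ch in s:
        n = n * len(SRC) + SRC.index(ch)        # string -> big integer
    out = []
    while n:
        n, r = divmod(n, len(DST))              # integer -> base-94 string
        out.append(DST[r])
    return "".join(reversed(out)) or DST[0]
```

The gain is log(36)/log(94), roughly 0.79, so a 300-character key over this alphabet comes out at about 237 characters.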

I find it difficult to believe that there isn't any logic behind the
population of your keys. Still that is the "key" to hashing.

Can you tell a bit more about the nature of your keys?

--
Kind regards,

Niek Otten

Niek Otten

Mar 7, 2006, 13:00:33
<Still that is the "key" to hashing>

That isn't true. But it is if you want to use it with VLOOKUP.
Maybe you should tell us more about the nature of the lookup problem too.

--
Kind regards,

Niek Otten

"Niek Otten" <nico...@xs4all.nl> wrote in message
news:ejDMvAhQ...@TK2MSFTNGP14.phx.gbl...

jheby

Mar 8, 2006, 09:29:39
I am pulling log entries into Excel from Essbase, a multi-dimensional
database, and want to assign categories to each entry using a vlookup. The
entries could use pretty much any character. Below is a sample of a large
string entry:
'Calculating [
Accounts(AC_IS_Asset_Based_Fixed,AC_IS_Conditional_Sales_Contracts_Fixed,AC_IS_Construction_Fixed,AC_IS_Factoring_Fixed,AC_IS_Finance_Leases_Fixed,AC_IS_Floorplan_Fixed,AC_IS_Inventory_Fixed...]
with fixed members [Business_Type(BT_Adds_Non_Recourse_NB_Recv_Amt,
BT_Adds_Non_Recourse_NB_Equip_Cost_Amt, BT_Adds_Non_Recourse_NB_RE_Cost_Amt,
BT_Adds_Non_Recourse_EB_Recv_Amt, BT_Adds_Non_Recourse_EB_Equip_Cost_Amt, ]

Regards,
Jon

Pete_UK

Mar 8, 2006, 10:47:47
Looking at your example string, you can see many repeating groups of
characters, eg:

AC_IS_ _Fixed BT_Adds_Non_Recourse _Cost_Amt _Recv_Amt

to pick just five. You could envisage, then, a user-defined function
which is applied to the original string and returns a reduced string by
repeated application of SUBSTITUTE( ) to change these character groups
to some other character which is not contained within any of the
strings.

If you apply the UDF to the lookup table as well, then you can use the
modified string as the key field.

Is this the kind of thing you were looking for?

Pete

Niek Otten

Mar 8, 2006, 11:10:45
Also, it is almost only letters, commas, underscores and round and square
brackets. That is about 30 possibilities and should fit in just 5 bits
instead of 8; in other words, pack 8 positions into 5 characters.

But probably Pete's suggestion for repeating strings is easier to code.

--
Kind regards,

Niek Otten

"Pete_UK" <pash...@auditel.net> wrote in message
news:1141832867....@i40g2000cwc.googlegroups.com...

Niek Otten

Mar 8, 2006, 14:14:11
I can't see how you would use such a string as a key. As a search argument,
OK, but not as an identifier that uniquely extracts one key.
Can you tell us how you intend to use these items? How is your search key
assembled, and where does it come from? What data are you trying to look up
using the key?

--
Kind regards,

Niek Otten

"jheby" <jh...@discussions.microsoft.com> wrote in message
news:9D5A6507-411D-42EC...@microsoft.com...
> Unfortunately, the repeating strings, while present in this example, don't
> occur in most cases. This example is more of an anomaly in that regard.

Harlan Grove

Mar 8, 2006, 14:29:31
jheby wrote...

>I am pulling log entries into Excel from Essbase, a multi-dimensional
>database, and want to assign categories to each entry using a vlookup. The
>entries could use pretty much any character. Below is a sample of a large
>string entry:
...

There are tasks for which Excel is most definitely not the most
appropriate tool, and in this case, Excel may be one of the worst tools
for the task. Most scripting languages (Perl, Python, Ruby, gawk)
provide associative arrays that handle hashing arbitrarily long
strings. But if you want a simple spreadsheet approach, get a capable
spreadsheet like the one in OpenOffice. OpenOffice Calc doesn't choke
on string operations with string lengths > 256 characters. This is one
of the ugly little backwaters of Excel functionality that hasn't been
changed in more than a decade.
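Harlan's point about associative arrays, sketched in Python (the entries and category names below are made up for illustration): a dict hashes keys of any length, so the 300K-row tagging pass is a few lines.

```python
# Map unique log entries (any length) to categories; dict keys are unlimited.
category_of = {
    "A" * 1000: "Long calculation",   # a 1000-character key is no problem
    "short entry": "Housekeeping",
}

def tag(entries):
    """Return the category for each log entry, like a VLOOKUP over the map."""
    return [category_of.get(e, "Uncategorised") for e in entries]

print(tag(["A" * 1000, "short entry", "???"]))
# ['Long calculation', 'Housekeeping', 'Uncategorised']
```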

jheby

Mar 8, 2006, 14:55:32
The log that I'm analyzing consists of perhaps 5K unique entries that get
recorded perhaps 300K times over a period of several days. I've created a
map that groups the 5K unique entries into about a hundred categories. I
want to go back and tag each of the 300K instances with its category so that
I can summarize & analyze the log entries by those category groupings -
that's where the need for the vlookup comes in. The majority of the unique
entries are well under 255 characters, but unfortunately, a significant
number of them are not.

jheby

Mar 8, 2006, 14:59:04
Thanks - I hope that the Excel designers are planning to fix this along with
a number of other items in the new product (including the eight-level nesting
limit and the 65K row limit).

"Harlan Grove" wrote:

> jheby wrote...
> >I am pulling log entries into Excel from Essbase, a multi-dimensional
> >database, and want to assign categories to each entry using a vlookup. The
> >entries could use pretty much any character. Below is a sample of a large
> >string entry:

> ....

Pete_UK

Mar 9, 2006, 05:49:31
Well, you could wait for Excel 12 later in the year, but if you want to
do this now, the following illustrates my point:

Your example text string is 428 characters long. I put it in A1 and
this formula in B1:

=SUBSTITUTE(A1,"BT_Adds_Non_Recourse",CHAR(161))

The length came down to 333 characters.

Changing the formula to this:

=SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(A1,"BT_Adds_Non_Recourse",CHAR(161)),"_Cost_Amt",CHAR(162)),"_Recv_Amt",CHAR(163)),"AC_IS_",CHAR(164)),"_Fixed",CHAR(165))

brought the length down to 233 characters - so you could use this as a
key in your VLOOKUP formula. The formula merely changes the character
groups identified earlier into a single character.

Hopefully, you can see how this method could be extended by using a UDF
to pick up other character groups.

Pete

jheby

Mar 9, 2006, 09:41:31
Thanks - that approach will definitely work in this instance. It still
appears to me that I will have to create a VBA-based hashing function to
address the general case.

Regards,
Jon

Pete_UK

Mar 9, 2006, 09:55:26
Hi Jon,

You won't be able to extend the above formula very much because of the
limits on nesting. Here's an example UDF to do the same thing:

Function reduce(test As String) As String
    reduce = test
    If Len(reduce) < 255 Then Exit Function
    reduce = Application.WorksheetFunction.Substitute(reduce, "BT_Adds_Non_Recourse", Chr(161))
    reduce = Application.WorksheetFunction.Substitute(reduce, "_Cost_Amt", Chr(162))
    reduce = Application.WorksheetFunction.Substitute(reduce, "_Recv_Amt", Chr(163))
    reduce = Application.WorksheetFunction.Substitute(reduce, "AC_IS_", Chr(164))
    reduce = Application.WorksheetFunction.Substitute(reduce, "_Fixed", Chr(165))
    reduce = Application.WorksheetFunction.Substitute(reduce, "_NB", Chr(166))
    reduce = Application.WorksheetFunction.Substitute(reduce, "_EB", Chr(167))
    reduce = Application.WorksheetFunction.Substitute(reduce, "Finance_Leases", Chr(168))
    reduce = Application.WorksheetFunction.Substitute(reduce, "Conditional_Sales_Contracts", Chr(169))
End Function

I've added a few extra substitutions to the end, which reduce the length of
the original string to 184 characters when used as:

=reduce(A1)

The function does not change any original text which is already less
than 255 characters in length. Obviously you can apply this to other
strings by copying down, but you can also extend the UDF to cope with
other patterns - just change the character number at the end of the
line.

Hope this helps.

Pete

Harlan Grove

Mar 9, 2006, 13:24:54
Pete_UK wrote...

And thinking holistically, how does this udf help? While it may reduce
the length of the 1st argument to VLOOKUP, it's not usable directly for
the first column of the 2nd argument to VLOOKUP. It's not even usable
directly in an INDEX/MATCH approach because it can't handle range or
array arguments and return values. At best, the OP would need to use
one additional cell per log file record with each of those cells
calling this UDF with the corresponding record value. Calling *any* udf
a few thousand times even on a superfast system will bring back fond
memories of the responsiveness of 110 baud dial-up mainframe
connections.

MATCH is afflicted with the same 256 character limit on comparisons as
VLOOKUP. However, the = operator isn't so afflicted. Use INDEX/MATCH.
Given A1:A6 containing

=REPT("0123456789ABCDEF",16)&" - "&CHAR(64+ROW())

and B1:B6 containing

=ROW()

Then instead of VLOOKUP(x,A1:B6,2,0) use array formulas like

=INDEX(B1:B6,MATCH(TRUE,(A1:A6=x),0))

A kludge, but this will be faster than any udf.

Niek Otten

Mar 9, 2006, 17:24:47
Hi Harlan,

An ingenious solution! I like the out-of-the-box approach.

However, I'm not sure I agree with what you say about performance; in my
experience array formulas are rather slow.
I tested your solution (somewhat modified because of the limits of CHAR()).
Because it requires some effort to arrange for more records than there are
rows, I tested with 30,000 (instead of 300,000) of your formulas; that took
(on my machine) 284 seconds.
I built a reducekey function similar to Pete's; 30,000 reductions took
15 seconds.
If the 5K table is or can be sorted, Pete's solution (which indeed requires
some more work, like adding the reduced key to the lookup table) might be
quite a lot faster than applying array formulas.

I know UDFs have a bad reputation for performance. Indeed the overhead of
switching between the worksheet and VBA is significant. But often the
performance problems seem to be caused by inefficient programming,
especially wrong data types.
My actuarial function system manages 10,000 calls from a worksheet per
second.

It would be a lot of work, but it would be nice to test Jon's actual case
with the different approaches suggested.
--
Kind regards,

Niek Otten


"Harlan Grove" <hrl...@aol.com> wrote in message
news:1141928694....@j33g2000cwa.googlegroups.com...

Harlan Grove

Mar 9, 2006, 18:39:03
Niek Otten wrote...
...

>However, I'm not sure I agree with what you say about performance; in my
>experience array formulas are rather slow.
...

It's not the array formulas per se, it's the 30K LONG string
comparisons. If the OP had relatively static data (apparently the
case), a macro essentially performing repeated Edit > Replace
operations on copies of the log file records would be better than UDFs.
One VBA invocation beats 30K separate ones.

The greater difficulty is whether the chaff that Pete_UK has
*hardcoded* into his udf would apply to *all* the OP's records. Usually
it's a bad idea to generalize from one sample record. But there's also
the question of whether, given the *entire* population of Essbase fields,
categories, dimensions, etc., the approach could prove awkward. For
instance, the OP's single sample record contained

BT_Adds_Non_Recourse_NB_Recv_Amt,
BT_Adds_Non_Recourse_NB_Equip_Cost_Amt,
BT_Adds_Non_Recourse_NB_RE_Cost_Amt,
BT_Adds_Non_Recourse_EB_Recv_Amt,
BT_Adds_Non_Recourse_EB_Equip_Cost_Amt,

Pete_UK's udf would reduce these to

,
Equip,
RE,
,
Equip,

Do you begin to see a problem?

GIGO processing can be very runtime efficient. When you're after
correct results, however, it may require more calculation time and
perhaps a wee bit more time figuring out the full extent of the
problem.

Myself, I don't believe the OP can *RELIABLY* accomplish his/her task
by removing any text other than space characters on either side of
punctuation.

Pete_UK

Mar 9, 2006, 19:02:41
Unfortunately Jon has only posted one example of his long strings - we
have no idea of what the other strings look like, but I had assumed
that similar character groupings would appear in many of them. He says
a significant number of his strings are very long, but that could mean
a couple of thousand if he has a batch of 15k records.

Applying the reduced key formula to the lookup table would only have to
be done once - the values could be fixed and then copied over the
original keys.

Jon said he does 300k weekly, so performance is an issue, and like you,
Niek, I have found that array formulae can be very slow. He wanted a
way to reduce his long strings so that he could use his existing
vlookup formulae, and my approach allows him to do that with little
effort either in setting it up or in using it - I also tested it on 30k
identical records following Harlan's posting, and it took less than 10
seconds.

Anyway, it's up to him - I think writing a generalised hashing
algorithm as he was contemplating would take considerable time.

Pete

Pete_UK

Mar 9, 2006, 21:00:18
Harlan,

this is what the final part of the string gets converted into:

[Business_Type(¡¦£,¡¦_Equip¢, ¡¦_RE¢,¡§£, ¡§_Equip¢, ]

which represents the following:

[Business_Type(                           [Business_Type(
BT_Adds_Non_Recourse_NB_Recv_Amt,         ¡¦£,
BT_Adds_Non_Recourse_NB_Equip_Cost_Amt,   ¡¦_Equip¢,
BT_Adds_Non_Recourse_NB_RE_Cost_Amt,      ¡¦_RE¢,
BT_Adds_Non_Recourse_EB_Recv_Amt,         ¡§£,
BT_Adds_Non_Recourse_EB_Equip_Cost_Amt, ] ¡§_Equip¢, ]

I am substituting the phrases with individual characters in the range
161 onwards, not with blanks. The uniqueness is maintained.
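Pete's scheme can be sketched outside VBA as well (Python; the phrase list is taken from the thread, and uniqueness holds only if characters from code 161 upward never occur in the raw strings). Note the order of the list matters: any phrase that contains another must come first.

```python
# Replace each common phrase with a distinct single character, codes 161 up.
PHRASES = ["BT_Adds_Non_Recourse", "_Cost_Amt", "_Recv_Amt", "AC_IS_", "_Fixed"]

def reduce_key(s):
    """Shorten s by substituting one marker character per phrase."""
    for i, phrase in enumerate(PHRASES):
        s = s.replace(phrase, chr(161 + i))
    return s

print(reduce_key("BT_Adds_Non_Recourse_NB_Recv_Amt"))  # '¡_NB£'
```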

Pete

Harlan Grove

Mar 10, 2006, 02:41:12
"Pete_UK" <pash...@auditel.net> wrote...

>this is what the final part of the string gets converted into:
>
>[Business_Type(¡¦£,¡¦_Equip¢, ¡¦_RE¢,¡§£, ¡§_Equip¢, ]
...

OK, I screwed up. Sorry.

This might shorten the OP's records enough, or it might not. Also, if there
just happen to be more than 128 such common phrases, you'd need to alter
your approach. I still believe a macro approach would be better than a udf
approach, and a macro approach could prompt for selection of a different
range containing a table of common substrings rather than hard coding them
into a udf.


Harlan Grove

Mar 12, 2006, 19:35:17
Pete_UK wrote...

>Well, you could wait for Excel 12 later in the year, but if you want to
...

Maybe you could wait for Excel 13 or 14, but this particular problem
won't be fixed in Excel 12. David Gainer confirmed in the Excel 12 blog
that VLOOKUP will remain limited to searching up to but no more than
255 chars in Excel 12.

FWIW, Lotus 123 97 Edition (almost 10 years old) handles lookup values
> 255 characters, as does the Windows port of Gnumeric 1.6.0. I'd be surprised if Excel wasn't the only arguably high-end spreadsheet that can't handle such strings. Microsoft has a very bad tendency to leave matters of actual functionality unchanged version to version over decades but spend thousands of developers' hours on interface and other eyewash issues.

Excel 12 is Microsoft's latest attempt to prove their assertion that
they're years ahead of the competition. Then stuff like this comes
along and proves that when low-level backwaters of functionality are at
issue, it's Excel that's a decade or so behind the competition. And as
long as the lemmings dutifully upgrade, there'll be no incentive for
them to fix this.

jheby

Mar 16, 2006, 14:07:56
Regarding the array solution that you proposed - I'm assuming that the large
string gets placed at the location of "x" in the following array formula:

=INDEX(B1:B6,MATCH(TRUE,(A1:A6=x),0))

Was that your intent? When I tried that, I got a "formula too long" error.
