Terje, Nick - do you have a minute?

Robert

Jul 4, 2009, 11:59:53 AM

What I have:
1) i7-920 system with 12 gigs of ram
2) 70 gigs of data

What I want is a fast efficient data structure for semi automated queries.
I think it is somewhat possible to derive this from the data..
basically a code generator that would look at each field in a record:
1) looks at min, max - a narrow band can use a smaller datatype with a hard
coded offset
2) distinct values, use a dictionary, maybe run it through a huffman
encoder, or maybe just an enum.
3) Basic curve fitting, normal, binormal, exponential, etc.
4) some binning or bucketing - percentile (or bytile?!? 256 slots, use a
whole byte as a pointer..)
5) extra bits in some fields, ie int with only 250 million values. Move a
couple of booleans here..
6) Some things have a time series, ok 1-D, or X, Y 2-D, we can give
hints up front.
7) ditto with precision, margin of error, implied precision in the values
8) some fields can be omitted, and made part of the
path/filename.extension..
9) flat files would be fine, as no locking is needed. same for
transactions.
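
As a rough illustration of item 1 above (a narrow min/max band stored in a smaller datatype with a hard coded offset), here is a minimal Python sketch. The names `pick_typecode`, `pack_column` and `unpack_column` are made up for this example, and it assumes an integer column that fits in memory; a real generator would stream the data and emit code instead.

```python
import array


def pick_typecode(lo, hi):
    """Return the smallest unsigned array typecode that can hold hi - lo."""
    span = hi - lo
    for code in "BHILQ":
        # itemsize is platform dependent, so ask the array module directly.
        if span < (1 << (8 * array.array(code).itemsize)):
            return code
    raise ValueError("range too wide for a 64-bit unsigned integer")


def pack_column(values):
    """Store each value as (value - minimum) in the narrowest type that fits."""
    lo, hi = min(values), max(values)
    code = pick_typecode(lo, hi)
    return lo, array.array(code, (v - lo for v in values))


def unpack_column(offset, packed):
    """Add the hard coded offset back to recover the original values."""
    return [offset + v for v in packed]


values = [1_000_000, 1_000_200, 1_000_017]
offset, packed = pack_column(values)
print(packed.typecode, packed.itemsize, unpack_column(offset, packed) == values)
```

The values here need 4 bytes each as plain ints but span only 200, so one byte per value plus a single shared offset is enough.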

Question is what can be done with maybe a weekend, or a week of computer
time that would generate a useful recursive data set?

Some rounding would be allowed. Let's say first pass accurate to 1 percent.
Recurse to a bigger dataset with maybe 10x the accuracy. Lazy loading all
the way, so some easy ad-hoc queries are fast, allowing drill down as
needed. At the lowest level, all the raw data is available.

We all know how to pack a data structure by hand. How can this standard bag
of tricks be applied to an arbitrary dataset?

If this is all old news, please post some links. Googling is not helpful
since I do not want PhD papers on the newest trick, just a good general
overview of what is possible, and with a fair chance of being applicable.

nm...@cam.ac.uk

Jul 5, 2009, 3:43:14 AM

In article <h2nuh7$93b$1...@news.eternal-september.org>,

Basically, it can't. Sorry. To summarise the general rules (which
I think you know):

Don't bother with indexing unless there are likely to be only
a few hits - if you are likely to touch every 'cylinder', then you
may as well search sequentially. That isn't true when comparisons
are foully expensive, but they rarely are.

Similarly, it's worth doing only if you are going to be doing
the same comparisons fairly often - obviously more than once on
average. You can have multiple index files for different comparisons,
and that is often a winner.

Structuring a dataset is worthwhile only if you know how you are
going to index it. If you don't, a flat, direct-access file and
using offsets is as good as you will get (i.e. caching methodology).

You have described a wide range of common comparisons, and the
methods of generating an index are different for each; I am pretty
sure that you have thought of most of the simpler ones.

Putting that all together, it isn't hard to write a suitable program
for any particular search (or small set of searches), and it wouldn't
be much harder to write a fairly general toolkit. But how you would
use a general toolkit would depend on the data and searches, and the
decisions would be non-trivial.

Regards,
Nick Maclaren.

Robert

Jul 6, 2009, 3:36:48 AM

<nm...@cam.ac.uk> wrote in message
news:h2plii$pjl$1...@smaug.linux.pwf.cam.ac.uk...

> In article <h2nuh7$93b$1...@news.eternal-september.org>,
> Robert <rob...@nospam.com> wrote:
>>What I have:
>>1) i7-920 system with 12 gigs of ram
>>2) 70 gigs of data

BIG SNIP

> Structuring a dataset is worthwhile only if you know how you are
> going to index it. If you don't, a flat, direct-access file and
> using offsets is as good as you will get (i.e. caching methodology).
>
> You have described a wide range of common comparisons, and the
> methods of generating an index are different for each; I am pretty
> sure that you have thought of most of the simpler ones.

I am not really looking at this as a DB style index. Just packing
efficiently into native data types. Bit twiddling is allowed. Non selective
fields can be simply made part of the file path/name/extension.

We can certainly stream all the data through an analysis pass. Save the
chunk id, field id, min max avg histogram, # of distinct values, etc.
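
A minimal sketch of what such an analysis pass could look like in Python, assuming rows arrive as tuples of numbers. `FieldStats`, `analyze` and the 65536 cap on tracked distinct values are all assumptions made for the example, not something specified in this thread; the histogram is left out to keep it short.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FieldStats:
    count: int = 0
    minimum: float = float("inf")
    maximum: float = float("-inf")
    total: float = 0.0
    # Distinct values seen so far, or None once there are too many to matter.
    distinct: Optional[set] = field(default_factory=set)
    max_distinct: int = 65536

    def update(self, value):
        self.count += 1
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)
        self.total += value
        if self.distinct is not None:
            self.distinct.add(value)
            if len(self.distinct) > self.max_distinct:
                self.distinct = None  # not a dictionary candidate any more

    @property
    def mean(self):
        return self.total / self.count if self.count else 0.0


def analyze(rows):
    """Stream rows once and return one FieldStats per column."""
    stats = []
    for row in rows:
        if not stats:
            stats = [FieldStats() for _ in row]
        for s, value in zip(stats, row):
            s.update(value)
    return stats


for s in analyze([(1, 10), (3, 10), (2, 30)]):
    print(s.minimum, s.maximum, s.mean, len(s.distinct))
```

Everything here is a single sequential pass, so it can run per chunk and the per-chunk results can be merged afterwards.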

A small code generator could build a flat file reader/writer that would
maybe?!? reduce the size of the data by 25-50%. Pack some booleans into
unused bits in an int, etc. Build some enums, and trade code for data.
Do all the easy/obvious packing. Huge win for comma delimited data.
This part is lossless..
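
To make the "booleans in unused bits" idea concrete, here is a small Python sketch. It assumes the example from the first post, an int field with only about 250 million values, which fits in 28 bits (2^28 = 268,435,456) and so leaves 4 spare bits in a 32-bit word. `VALUE_BITS`, `pack` and `unpack` are names invented for the illustration.

```python
VALUE_BITS = 28                      # enough for values below 268,435,456
VALUE_MASK = (1 << VALUE_BITS) - 1
SPARE_BITS = 32 - VALUE_BITS         # 4 booleans fit in the same 32-bit word


def pack(value, flags):
    """Fold up to SPARE_BITS booleans into the top bits of a 32-bit word."""
    if not 0 <= value <= VALUE_MASK:
        raise ValueError("value does not fit in VALUE_BITS")
    if len(flags) > SPARE_BITS:
        raise ValueError("too many flags for the spare bits")
    word = value
    for i, flag in enumerate(flags):
        if flag:
            word |= 1 << (VALUE_BITS + i)
    return word


def unpack(word, n_flags):
    """Recover the value and the first n_flags booleans."""
    value = word & VALUE_MASK
    flags = tuple(bool((word >> (VALUE_BITS + i)) & 1) for i in range(n_flags))
    return value, flags


word = pack(249_999_999, (True, False, True))
print(hex(word), unpack(word, 3))
```

Getting the value back costs one AND, and each boolean costs a shift and an AND, which is the "trade code for data" part.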

Just let it grind, for however long. Get a much smaller, more tractable
dataset.

> Putting that all together, it isn't hard to write a suitable program
> for any particular search (or small set of searches), and it wouldn't
> be much harder to write a fairly general toolkit. But how you would
> use a general toolkit would depend on the data and searches, and the
> decisions would be non-trivial.

Yes, after the above long grinding process, we would then start solving
whatever problem is at hand. With a much smaller dataset prototyping is a
lot less painful.

This can not be a new or original idea. Preprocessing data is as old as
computers. But with multiple cores sitting idle, why not just try ALL of
the standard tricks? Worst case we waste some cpu time that would have gone
unused anyway.

Basically we trade disk/memory space for some bit twiddling to get our
values back out. I hope Terje weighs in on this! Main memory is about 100
clock cycles away from the cpu. i7s do well with bit twiddling and
unaligned accesses. I think if the data is chunked into L1/L2 friendly
sizes we would get a nice perf boost. A few simple heuristics in the code
generator would go a long way here.

I feel there ought to be a sweet spot in there somewhere. Wide datasets
would probably not work as well.. 100 fields, with 100 cycles to burn does
not give you much chance of gain. But with 10 fields, we have 10 cycles
each to do lookups, bit twiddles, etc.

And then there is JPEG or MP3 style compression. Slightly lossy, but maybe
OK. For DOB we do not need seconds of precision. GIS coords do not need to
be to the mm, etc. This would need programmer input.

We also have access to the real data. Save some metadata mapping packed
chunks to raw chunks. Multiple levels of precision. Just change the working
directory! Generate arbitrary precision views of the raw data. As long as
each custom generated reader/writer has a standard interface, it would not
matter which dataset is being used. Prototype on small/fast. Get a working
algorithm, and change the path to use the lossless packed data.

Anyway, thanks for your reply. Glad it was not of the "It is patented"
variety..

Jan Vorbrüggen

Jul 6, 2009, 4:11:24 AM

Robert schrieb:

> What I have:
> 1) i7-920 system with 12 gigs of ram
> 2) 70 gigs of data
>
> What I want is a fast efficient data structure for semi automated queries.
> I think it is somewhat possible to derive this from the data..

So you are looking for a way to compress the data to fit into memory to
speed your queries, is that correct?

Before you go on to the data-specific ways you mention below, have you
tried a simple gzip or bzip2 at the highest compression setting? That
would give you an idea what can be achieved without going to the lengths
of data-specific coding.
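
A quick way to get that baseline on a sample of the data, sketched in Python with the standard `gzip` and `bz2` modules at compression level 9. The function name `baseline_ratios` and the sample record are invented for the example.

```python
import bz2
import gzip


def baseline_ratios(data):
    """Return compressed size / original size for gzip and bzip2 at level 9."""
    if not data:
        raise ValueError("need a non-empty sample")
    return {
        "gzip": len(gzip.compress(data, compresslevel=9)) / len(data),
        "bzip2": len(bz2.compress(data, compresslevel=9)) / len(data),
    }


# Stand-in for a chunk read from one of the real files.
sample = b"2009-07-04,OK,12345\n" * 1000
print(baseline_ratios(sample))
```

Running this on a few representative chunks, rather than the full 70 gigs, should be enough to see roughly where general-purpose compression lands.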

> basically a code generator that would look at each field in a record:
> 1) looks at min, max - a narrow band can use a smaller datatype with a
> hard coded offset
> 2) distinct values, use a dictionary, maybe run it through a huffman
> encoder, or maybe just an enum.
> 3) Basic curve fitting, normal, binormal, exponential, etc.
> 4) some binning or bucketing - percentile (or bytile?!? 256 slots, use
> a whole byte as a pointer..)
> 5) extra bits in some fields, ie int with only 250 million values. Move
> a couple of booleans here..
> 6) Some things have a time series, ok 1-D, or X, Y 2-D, we can give
> hints up front.
> 7) ditto with precision, margin of error, implied precision in the values
> 8) some fields can be omitted, and made part of the
> path/filename.extension..

A lot can be done that way, but I suspect you can never automate it to
the degree you desire in this application: My impression is that you
have a very diverse collection of data.

One of the reasons for that is that many of your - quite valid -
propositions above require knowledge external to the data - at least if
your data is not totally static.

The other reason is, well, basically owed to Gödel and Turing (waves
hands wildly).

Jan

Robert Myers

Jul 6, 2009, 9:39:14 AM

On Jul 6, 4:11 am, Jan Vorbrüggen <Jan.Vorbrueg...@not-thomson.net>
wrote:

>
> A lot can be done that way, but I suspect you can never automate it to
> the degree you desire in this application: My impression is that you
> have a very diverse collection of data.
>
> One of the reasons for that is that many of your - quite valid -
> propositions above require knowledge external to the data - at least if
> your data is not totally static.
>

> The other reason is, well, basically owed to Gödel and Turing (waves
> hands wildly).

Isn't it likely that our brains operate in some kind of middle ground,
where the data represent processes that aren't stationary but change
slowly enough and are sufficiently repetitive and interrelated that we
can pack data densely in fairly ad hoc ways using information derived
only from the data (our experience)? Not a rhetorical question.

I don't know of anyone even close to getting a computer to function
that way, but I don't think anything we know about automata would
forbid arbitrary levels of experience-based data compression,
especially given the demonstrated ability of humans to infer the most
amazingly subtle patterns in the natural world.

Robert.

Robert

Jul 6, 2009, 10:29:02 AM

"Jan Vorbrüggen" <Jan.Vor...@not-thomson.net> wrote in message
news:h2s1kg$p83$1...@s1.news.oleane.net...

> Robert schrieb:
>> What I have:
>> 1) i7-920 system with 12 gigs of ram
>> 2) 70 gigs of data
>>
>> What I want is a fast efficient data structure for semi automated
>> queries.
>> I think it is somewhat possible to derive this from the data..
>
> So you are looking for a way to compress the data to fit into memory to
> speed your queries, is that correct?
>
> Before you go on to the data-specific ways you mention below, have you tried
> a simple gzip or bzip2 at the highest compression setting? That would give
> you an idea what can be achieved without going to the lengths of
> data-specific coding.

Yes, I still get another 50% compression with gzip, but the data-specific
coding is automated and close to free.

So we have binary, packed binary, gzip, and others all sitting somewhere
on a decode time/size chart. Restating my premise, which compression
techniques fit into L1, L2, or L3 cache?

L1 is rather tiny.. Not much we can do here.
L2 is more interesting. 256k is big enough to hold a decent size chunk of
data and a wrapper class to extract it.
L3 is shared on my cpu, but still about 2 megs per real core. Some
percentage of this will be code (OS and other).
RAM would be used for buffering, disk cache, etc

I propose using light and fast packing schemes, where we can hide most of
the time wasted on unpacking behind the memory hierarchy speed hits.

>> basically a code generator that would look at each field in a record:
>> 1) looks at min, max - a narrow band can use a smaller datatype with a
>> hard coded offset
>> 2) distinct values, use a dictionary, maybe run it through a huffman
>> encoder, or maybe just an enum.
>> 3) Basic curve fitting, normal, binormal, exponential, etc.
>> 4) some binning or bucketing - percentile (or bytile?!? 256 slots, use a
>> whole byte as a pointer..)
>> 5) extra bits in some fields, ie int with only 250 million values. Move
>> a couple of booleans here..
>> 6) Some things have a time series, ok 1-D, or X, Y 2-D, we can give
>> hints up front.
>> 7) ditto with precision, margin of error, implied precision in the values
>> 8) some fields can be omitted, and made part of the
>> path/filename.extension..
>
> A lot can be done that way, but I suspect you can never automate it to the
> degree you desire in this application: My impression is that you have a
> very diverse collection of data.

Throw compute time at the problem and see what works.
Anything is better than nothing, and the box is idling a lot anyway.

> One of the reasons for that is that many of your - quite valid -
> propositions above require knowledge external to the data - at least if
> your data is not totally static.

Most datasets are fairly static. In DB terms, the table would be considered
static, and the log file would be dynamic. Since I want a code generator
anyway, we can have different wrapper classes for each case. Pack the
static data, and store the new data in a more verbose way. We could then
stream it all transparently, merely by looking at the file extension.

Packing data that is growing 1 percent a day should be easy!

> The other reason is, well, basically owed to Gödel and Turing (waves hands
> wildly).

Those bastards :)

nm...@cam.ac.uk

Jul 6, 2009, 10:34:37 AM

In article <91ef8bca-b50a-42c8...@r33g2000yqn.googlegroups.com>,

Robert Myers <rbmye...@gmail.com> wrote:
>
>Isn't it likely that our brains operate in some kind of middle ground,
>where the data represent processes that aren't stationary but change
>slowly enough and are sufficiently repetitive and interrelated that we
>can pack data densely in fairly ad hoc ways using information derived
>only from the data (our experience)? Not a rhetorical question.

That's what I have read, yes.

>I don't know of anyone even close to getting a computer to function
>that way, but I don't think anything we know about automata would
>forbid arbitrary levels of experience-based data compression,
>especially given the demonstrated ability of humans to infer the most
>amazingly subtle patterns in the natural world.

There are plenty of programs that attempt to operate that way, but few
achieve the intellectual capacity of a jellyfish, let alone a beetle.

Regards,
Nick Maclaren.

Robert

Jul 6, 2009, 12:10:59 PM

> Isn't it likely that our brains operate in some kind of middle ground,
> where the data represent processes that aren't stationary but change
> slowly enough and are sufficiently repetitive and interrelated that we
> can pack data densely in fairly ad hoc ways using information derived
> only from the data (our experience)? Not a rhetorical question.

Interesting line of thought. Off the cuff I would say it is likely.
How big is the brain of an ant? They act pretty smart. Sure the heuristics
used are not super great, but they perceive the environment, navigate,
coordinate, seek food, etc. In the tropics there are many ants.. I open a
can of Coke, and 5 minutes later there are 50 ants on the spray pattern!
Maybe a few milligrams of mist, but they are on it.

> I don't know of anyone even close to getting a computer to function
> that way, but I don't think anything we know about automata would
> forbid arbitrary levels of experience-based data compression,
> especially given the demonstrated ability of humans to infer the most
> amazingly subtle patterns in the natural world.

I would bet there is some kind of quantum basis to intelligence.
Not sure if automata are an appropriate tool to start with. But seeing that
is what we have, we can try to shoehorn something plausible in.

Difficult to prove or disprove though, so it is hard to get traction on the
problem.

Further, I bet ONE BEER that we have an AI before we understand how our
own consciousness works.

Any takers? It is only one beer.. And maybe a Nobel prize or three!

nm...@cam.ac.uk

Jul 6, 2009, 12:22:24 PM

In article <h2t7ui$59i$1...@news.eternal-september.org>,

Robert <rob...@nospam.com> wrote:
>
>Further, I bet ONE BEER that we have an AI before we understand how our
>own consciousness works.
>
>Any takers? It is only one beer.. And maybe a Nobel prize or three!

By the time that's settled, I shall have lost all interest in beer.

Regards,
Nick Maclaren.

Robert Myers

Jul 6, 2009, 12:49:00 PM

On Jul 6, 12:10 pm, "Robert" <rob...@nospam.com> wrote:

> I would bet there is some kind of quantum basis to intelligence.
> Not sure if automata are an appropriate tool to start with. But seeing that
> is what we have, we can try to shoehorn something plausible in.
>
> Difficult to prove or disprove though, so it is hard to get traction on the
> problem.

It is possible that some non-classical aspect of the brain is
important to intelligence in a fundamental way, but I don't know of
any empirical evidence that puts that idea beyond the status of pure
conjecture.

I'd suggest that the way that most of us experience recall and insight
should be evidence enough that our brain uses some fairly gimmicky
mechanisms for day-to-day operation, with or without the help of non-
classical effects, and I strongly doubt that the gimmicks are
genetically pre-programmed. That is to say, we infer data compression
strategies from the data.

Consciousness, whatever it is, is beside the point. The evidence I
know of suggests that we don't consciously experience what our brains
are really doing.

Robert.

Robert

Jul 6, 2009, 12:58:21 PM

>>Further, I bet ONE BEER that we have an AI before we understand how our
>>own consciousness works.
>>
>>Any takers? It is only one beer.. And maybe a Nobel prize or three!
>
> By the time that's settled, I shall have lost all interest in beer.

OK. I bet TWO beers that we have engineered viruses/nano machines that can
repair age related issues before we have AI.

Fixing things seems easier than designing from scratch. Especially when you
can copy and paste from billions of working copies.

Stephen Fuld

Jul 7, 2009, 12:25:06 AM

Yes, doing the lossless part is pretty simple. For all the integer
values, go through and for each see if the number of different values is
reasonably less than the number that will fit in the space allocated.
If so, substitute an index into a value table. You can do the same with
short alpha fields. Concatenate all those indices together to get one
field that contains the information for all the integer fields. Note
this takes two passes. For longer alpha fields, you can do a simple
compression (assuming you have to read the whole file for your queries
so variable length records isn't a problem.)
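
The two passes described above could be sketched like this in Python. The first pass builds one value table per field, the second replaces each value by its table index and concatenates the indices into a single integer. `build_tables`, `bits_needed` and `make_codec` are hypothetical names, and the sketch assumes every field is a dictionary candidate and the rows fit in memory.

```python
def build_tables(rows):
    """Pass 1: collect the sorted distinct values of every column."""
    return [sorted(set(column)) for column in zip(*rows)]


def bits_needed(n_values):
    """Bits required to index a table with n_values entries (at least 1)."""
    return max(1, (n_values - 1).bit_length())


def make_codec(tables):
    """Pass 2 helpers: pack a row into one integer and unpack it again."""
    lookups = [{value: i for i, value in enumerate(t)} for t in tables]
    widths = [bits_needed(len(t)) for t in tables]

    def encode(row):
        word = 0
        for value, lookup, width in zip(row, lookups, widths):
            word = (word << width) | lookup[value]
        return word

    def decode(word):
        out = []
        # The last field sits in the lowest bits, so peel fields off in reverse.
        for table, width in zip(reversed(tables), reversed(widths)):
            out.append(table[word & ((1 << width) - 1)])
            word >>= width
        return tuple(reversed(out))

    return encode, decode, sum(widths)


rows = [(100, "NY", 7), (200, "CA", 7), (100, "TX", 9), (300, "NY", 9)]
encode, decode, total_bits = make_codec(build_tables(rows))
print(total_bits, [encode(r) for r in rows], decode(encode(rows[2])))
```

In this toy case three fields collapse into 5 bits per record, plus the value tables that are stored once.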

But you mentioned perhaps using lossy compression. The code can't guess
when it is acceptable to allow loss and if so, how much loss is
acceptable. If you are willing to specify that ahead of time, perhaps
on a field by field basis, you can easily do more. For example, if 1%
is acceptable, just right shift all the integer fields one bit per byte
length (i.e. 1 bit for 8 bit fields, 2 bits for 16 bit fields, etc.)
before applying the code above. If 1% is acceptable, all floating point
mantissas can be trimmed to 6 bits.
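
For the floating point case, a small Python sketch of trimming a 32-bit float to 6 explicit mantissa bits. One detail worth noting: plain truncation to 6 bits can be off by up to 2^-6 (about 1.6%), while rounding to nearest keeps the error within 2^-7 (about 0.8%), so the sketch rounds. `trim_mantissa` is an invented name, and the code assumes finite, normal IEEE 754 single precision values; NaN, infinity and denormals are not handled.

```python
import struct

KEEP_BITS = 6                     # explicit mantissa bits to keep
DROP_BITS = 23 - KEEP_BITS        # float32 has 23 explicit mantissa bits


def trim_mantissa(x):
    """Round a finite float to float32 with only KEEP_BITS mantissa bits."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    bits += 1 << (DROP_BITS - 1)                   # round to nearest
    bits &= ~((1 << DROP_BITS) - 1) & 0xFFFFFFFF   # clear the dropped bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


for x in (3.14159, 1234.5678, 0.001):
    y = trim_mantissa(x)
    print(x, y, abs(y - x) / abs(x))
```

After trimming, the low 17 bits of every value are zero, so the field can be stored in 15 bits (sign, 8 exponent bits, 6 mantissa bits) or simply left in place for a general-purpose compressor to exploit.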

You also mentioned leaving out some fields. Without knowing what fields
the queries would want to test against, it seems like it would require
an oracle to do that. But with information from the user, it is simple.

> Just let it grind, for however long. Get a much smaller, more tractable
> dataset.
>
>> Putting that all together, it isn't hard to write a suitable program
>> for any particular search (or small set of searches), and it wouldn't
>> be much harder to write a fairly general toolkit. But how you would
>> use a general toolkit would depend on the data and searches, and the
>> decisions would be non-trivial.
>
> Yes, after the above long grinding process, we would then start solving
> whatever
> problem is at hand. With a much smaller dataset prototyping is a lot
> less painful.
>
> This can not be a new or original idea. Preprocessing data is as old as
> computers.
> But with multiple cores sitting idle, why not just try ALL of the
> standard tricks?
> Worst case we waste some cpu time that would have gone unused anyway.

Well, not quite. You would require multiple passes over the dataset, so
you are using the one resource that is in scarce supply on a multi-core
chip, memory bandwidth.

snip

> And then there is JPEG or MP3 style compression. Slightly lossy, but
> maybe OK.
> For DOB we do not need seconds of precision. GIS coords do not need to be to
> the mm, etc.
> This would need programmer input.

Yes. But once you have it, it would be pretty easy.

snip

> Anyway, thanks for your reply. Glad it was not of the "It is patented"
> variety..

I would say it is too simple and obvious to be patentable, but who
knows???? :-(

Anyway, given programmer input it seems pretty straightforward to write
something that does pretty well. I don't know how useful it would be,
though. It depends to some extent on the original data being "loosely"
coded.

Perhaps you could give some real world examples of the field in the
records of the data sets you have and what the queries look like and we
could be of more help.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Robert

Jul 7, 2009, 4:00:56 AM

Snip

> field that contains the information for all the integer fields. Note this
> takes two passes. For longer alpha fields, you can do a simple
> compression (assuming you have to read the whole file for your queries so
> variable length records isn't a problem.)

I do not much care about the encode times. As long as decode
can be sufficiently fast and hidden by the memory latency.

> But you mentioned perhaps using lossy compression. The code can't guess
> when it is acceptable to allow loss and if so, how much loss is
> acceptable. If you are willing to specify that ahead of time, perhaps on
> a field by field basis, you can easily do more. For example, if 1% is
> acceptable, just right shift all the integer fields one bit per byte
> length (i.e. 1 bit for 8 bit fields, 2 bits for 16 bit fields, etc.)
> before applying the code above. If 1% is acceptable, all floating point
> mantissas can be trimmed to 6 bits.

No need to guess the compression. Just generate multiple
data sets. The code generator would emit a typed class with an interface.
A front end object factory could instantiate the proper version of whatever
object. Specify a root directory for packed values. The packed files could
have different extensions depending on the lossiness level.
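
A bare-bones Python sketch of such a front end factory. The extensions `.raw`, `.pk` and `.p1`, the class names and the `relative_error` attribute are all placeholders invented for the illustration; in the scheme described here the reader classes would be emitted by the code generator.

```python
from pathlib import Path

# Maps a file extension to the reader class that understands that packing.
READERS = {}


def register(extension):
    """Class decorator that records which extension a reader handles."""
    def decorator(cls):
        READERS[extension] = cls
        return cls
    return decorator


class BaseReader:
    relative_error = 0.0          # 0.0 means lossless

    def __init__(self, path):
        self.path = Path(path)    # file is not opened until rows are requested


@register(".raw")
class RawReader(BaseReader):
    """Original records, untouched."""


@register(".pk")
class PackedReader(BaseReader):
    """Lossless packed records."""


@register(".p1")
class OnePercentReader(BaseReader):
    """Lossy records, accurate to about 1 percent."""
    relative_error = 0.01


def open_reader(path):
    """Pick the reader class from the file extension alone."""
    suffix = Path(path).suffix
    if suffix not in READERS:
        raise ValueError(f"no reader registered for {suffix!r}")
    return READERS[suffix](path)


reader = open_reader("packed_root/orders.p1")
print(type(reader).__name__, reader.relative_error)
```

Because callers only ever see the common interface, switching precision level is a matter of pointing at a different file.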

After it is built, while prototyping whatever solution, just change a flag
and use whichever level gives you useful results. If a given set is not
useful, just delete it.. It is little more than scratch data and can be
derived again.

I read a data format spec from ILM or Pixar that worked with huge render
images. I think they rendered at maybe 256*256, then 512*512, etc,
recursively. You can drill down into the data, all the way to film quality
at 4096*4096. You want small and fast, you got it. You want big/raw for the
final output then wait a while and the raid pack will be delivered!

> You also mentioned leaving out some fields. Without knowing what fields
> the queries would want to test against, it seems like it would require an
> oracle to do that. But with information from the user, it is simple.

Not leaving them out, just moving them into the file path/name/extension.
If we have arbitrary data, we can chunk it as needed. We need the file name
to be able to load, so File_1_2_xyz.dat has 3 extra fields with values 1, 2,
xyz. No need to store those dupes in each record.
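
A small Python sketch of reading the lifted fields back out of a name like File_1_2_xyz.dat and re-attaching them to each stored record. It assumes underscore-separated values that do not themselves contain underscores; `fields_from_name` and `rows_with_lifted_fields` are invented names.

```python
from pathlib import Path


def fields_from_name(path):
    """Return the field values encoded after the prefix in the file name."""
    stem = Path(path).stem            # "File_1_2_xyz"
    _prefix, *values = stem.split("_")
    return tuple(values)


def rows_with_lifted_fields(path, stored_rows):
    """Yield full records: lifted fields first, then the stored fields."""
    lifted = fields_from_name(path)
    for row in stored_rows:
        yield lifted + tuple(row)


print(fields_from_name("File_1_2_xyz.dat"))
print(list(rows_with_lifted_fields("data/File_1_2_xyz.dat", [(10,), (11,)])))
```

The values come back as strings here; a generated reader would know each field's type and convert once per file instead of once per record.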

After the first analysis pass, we would know very quickly which fields had
a small number of distinct values. Just sort by those, then chunk them into
say 1 meg pieces, so we can get decent disk speed, while still fitting OK in
cache.
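
The sort-then-chunk step might look roughly like this in Python. This is an in-memory sketch only; for 70 gigs the sort would have to be external or done per input file. `chunk_sorted`, `key_columns` and `row_bytes` are names chosen for the example.

```python
def chunk_sorted(rows, key_columns, row_bytes, chunk_bytes=1 << 20):
    """Sort rows by the low-cardinality columns, then cut ~chunk_bytes pieces."""
    rows_per_chunk = max(1, chunk_bytes // row_bytes)
    ordered = sorted(rows, key=lambda r: tuple(r[c] for c in key_columns))
    for start in range(0, len(ordered), rows_per_chunk):
        yield ordered[start:start + rows_per_chunk]


rows = [(2, "b", i) for i in range(3)] + [(1, "a", i) for i in range(3)]
for chunk in chunk_sorted(rows, key_columns=[0], row_bytes=16, chunk_bytes=48):
    print(chunk)
```

After sorting, records that share the same key values sit next to each other, so a chunk often has a single value for those fields and they can move into the file name.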

Basically I want a knob that goes from small/fast to raw/slow.
Prototyping we use fast. Once we start getting useful answers, we dial the
speed down, to get more accurate.

> Anyway, given programmer input it seems pretty straightforward to write
> something that does pretty well. I don't know how useful it would be,
> though. It depends to some extent on the original data being "loosely"
> coded.
>
> Perhaps you could give some real world examples of the field in the
> records of the data sets you have and what the queries look like and we
> could be of more help.

Over the years I have worked in many industries. Logistics, insurance, GIS,
oil futures settlement. The details are different for each, but the data is
always loose.

As for what the data is used for, different every time. But smaller is
faster, and while prototyping, I can do more iterations of "What if.."

Once you have a nice overview of the data, and see how it fits together, and
can process it in arbitrary ways, with varying degrees of accuracy and
speed, many domain specific things will become possible.

L2 cache seems to be the key metric. I have 256 kb. Let's say 64 kb worth
of rows. Throw in a couple 64 kb lookup tables, pack some booleans into
unused bits, a bit of code running through it. You lift a couple of fields
to the filename, and it adds up fast! Maybe you save a couple of strings,
and a wasted byte for each boolean. Looks like 40-50 bytes saved per
record. L3 is even better, 2 megs per real core. Now we can do more rows
and/or dictionaries but with higher latency.

The cost is some pointer dereferencing wasting a cycle or two, and some bit
twiddling to get your booleans back. This is hopefully cheaper than the
memory overhead..

All we need is some disk space and some idle cpu time.

Jan Vorbrüggen

Jul 7, 2009, 4:50:45 AM

> Packing data that is growing 1 percent a day should be easy!

However, if your packing/compression uses heuristics such as the reduced
range of actual data compared to that of the data type used, you
potentially have to re-process all of the data on the addition of a
single, out-of-range item. And that's only one example.

Jan

Robert

Jul 7, 2009, 9:05:11 AM

"Jan Vorbrüggen" <Jan.Vor...@not-thomson.net> wrote in message
news:h2uo8r$kst$2...@s1.news.oleane.net...

>> Packing data that is growing 1 percent a day should be easy!
>
> However, if your packing/compression uses heuristics such as the reduced
> range of actual data compared to that of the data type used, you
> potentially have to re-process all of the data on the addition of a
> single, out-of-range item. And that's only one example.

If 99% of the records follow a rule, take the easy win. The other percent
can have special handling, even just falling through to the raw values. Add
some kind of special handling indicator to the file name. Update the class
factory to provide a new instance of an appropriate object.
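
One way the "fall through to the raw values" handling could work, sketched in Python. Values inside the narrow band are stored in one byte, and anything else gets an escape code plus an entry in a side table, so a single out-of-range item does not force a repack. `ESCAPE`, `pack_with_exceptions` and `unpack_with_exceptions` are names invented here, and the one-byte band is just an assumption for the example.

```python
import array

ESCAPE = 255   # reserved byte value meaning "look in the exception table"


def pack_with_exceptions(values, offset):
    """Pack values in [offset, offset + 254] as bytes; keep the rest raw."""
    packed = array.array("B")
    exceptions = {}
    for i, value in enumerate(values):
        delta = value - offset
        if 0 <= delta < ESCAPE:
            packed.append(delta)
        else:
            packed.append(ESCAPE)
            exceptions[i] = value      # raw value, keyed by record position
    return packed, exceptions


def unpack_with_exceptions(packed, exceptions, offset):
    """Rebuild the original values, consulting the side table on ESCAPE."""
    return [
        exceptions[i] if b == ESCAPE else offset + b
        for i, b in enumerate(packed)
    ]


values = [1000, 1001, 1254, 5_000_000, 999]
packed, exceptions = pack_with_exceptions(values, offset=1000)
print(list(packed), exceptions)
print(unpack_with_exceptions(packed, exceptions, 1000) == values)
```

As long as exceptions stay rare, the side table is small and the common path is still a single add per value.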

If all you need is arbitrary computing applied to a set of data, and you do
not much care about the sequence, you can cheat a lot!

My current set of data is 70 gigs. Maybe 20 record types out of 30 that I
am interested in. Even with 5 instances of each (raw, packed, reduced
precision, ball park histogram 64k points, ball park 256 points) with maybe
a few hundred lines of code for each wrapper class is 20 * 5 * 200 = 20
kloc..

Double that with a special handling flag, and we bloat to 40 kloc.
Maybe a few megs compiled. A rounding error sitting next to 70 gigs.

Hundreds of megs of code might be a problem, but if the data is that
pathological then just stick to the raw version.

Terje Mathisen

Jul 8, 2009, 5:19:18 AM

Robert wrote:
> What I have:
> 1) i7-920 system with 12 gigs of ram
> 2) 70 gigs of data

I don't really have a minute this summer:

I'm currently a troop leader on the national scout jamboree, when I get
back I'll go directly on a 3-week sailing trip.

If I have WiFi in a harbor or two, I might get back to you...

>
> What I want is a fast efficient data structure for semi automated queries.
> I think it is somewhat possible to derive this from the data..

[snipped data description]

What you're asking for is basically jpeg compression for arbitrary data,
the problem is that we really don't have a good model for determining
when rough averages are fine and when all the info is in a single outlier
or two.

I.e. automated lossy compression is going to be very hard, some hints
might give useful heuristics.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Robert

Jul 9, 2009, 8:58:45 AM

Check this out: http://homepages.cwi.nl/~boncz/x100.html
In particular this: "Super-Scalar RAM-CPU Cache Compression. "
http://www.cwi.nl/htbin/ins1/publications?request=intabstract&key=ZuHeNeBo:ICDE:06

They are getting a factor of 100 perf boost on TPC-H 100 gig runs! And
doing well on other datasets. Through lightweight compression, and other
stuff that seems remarkably similar to my thoughts... DOH!

Plus side is it shows the possibilities are there.

Time for me to start prototyping.

Gavin Scott

Jul 9, 2009, 6:29:09 PM

Robert <rob...@nospam.com> wrote:
> What I have:
> 1) i7-920 system with 12 gigs of ram
> 2) 70 gigs of data

Well, depending on what the problem is worth to you, a 96-128 GB
Nehalem workstation is an off-the-shelf item these days for
somewhere in five figures of US$.

G.

Andy "Krazy" Glew

Jul 9, 2009, 7:02:10 PM

to Robert

One of my bugbears: aging. Heredity and development do not provide a
plan, a design, that is easy to use as a target, a goal to repair things
to. Heredity provides a program, that executes under varying conditions
during development, to produce the body at any point in time. Both the
good parts, and the bad parts.

It can be hard to tell what the good parts and the bad parts of a
program's output are, just by looking at the program.

It is easier to restart the program, than it is to repair the running
system.

It is easier to reboot a computer system, than to locate and fix
corruptions in the running memory image and keep the system running.

Armando Fox et al argue that it is easier to micro-reboot subcomponents
of a computer system, than it is to keep everything running.

My guess, wrt biology: reboot = let the old generation die, and start
off with a new generation.

I.e. it is easier to raise a child than it is to fix the aging issues in
the parent.

--
The content of this message is my personal opinion only.
Although I am an employee - currently of Intel,
in the past of other computer companies such as AMD, Motorola, and Gould
- I reveal this only so that the reader may account
for any possible bias I may have towards my employer's products.
The statements I make here in no way represent my employer's position,
nor am I authorized to speak on behalf of my employer.

In fact, this posting may not even represent my personal opinion,
since occasionally I play devil's advocate.

Robert Myers

Jul 9, 2009, 8:06:14 PM

On Jul 9, 7:02 pm, "Andy \"Krazy\" Glew" <ag-n...@patten-glew.net>
wrote:

> My guess, wrt biology: reboot = let the old generation die, and start
> off with a new generation.
>
> I.e. it is easier to raise a child than it is to fix the aging issues in
> the parent.

With every funeral of a senior scientist, there is progress.

Not mine, but I don't know to whom to attribute it.

Robert.

Paul Wallich

Jul 12, 2009, 3:33:57 PM

Depends on what you mean by settled. Thus far, every time a machine has
demonstrated abilities previously believed to require intelligence, it
has become widely accepted that those abilities don't really require
intelligence, just some kind of "chinese room" unintelligent
manipulation of inputs. (And of course whenever they do those
turing-test-equivalent contests, some of the people fail.)

paul

Robert Myers

Jul 12, 2009, 7:11:59 PM

On Jul 12, 3:33 pm, Paul Wallich <p...@panix.com> wrote:

>Thus far, every time a machine has
> demonstrated abilities previously believed to require intelligence, it
> has become widely accepted that those abilities don't really require
> intelligence, just some kind of "chinese room" unintelligent
> manipulation of inputs.

Which may be what the human brain is doing, anyway:

http://nextbigfuture.com/2009/07/synapse-is-memristor-and-memcapacitors.html

The brain may be little more than a huge regexp processor, where the
rules for regular expressions are constantly updated to suit new
input.

Never underestimate what hardware guys might accomplish, even if the
"hardware guy" is "unintelligent manipulation" by evolutionary
pressure.

The google search "neuron memristor" yields a cornucopia of
fascinating reads.

Robert.

Andrew Reilly

Jul 12, 2009, 9:50:17 PM

Indeed. I've never been impressed by the "Chinese room" argument. The
shape of a machine's pieces doesn't describe how it behaves, whole.
There was a story on the news today about an autonomous glider that can
stay aloft by finding and using thermals, just like birds. "Just a
machine" is definitely a moving goal-post.

Cheers,

--
Andrew