efficient compare

Andersen

unread,

Apr 21, 2006, 6:50:43 AM4/21/06

to

I have two sets A and B, each containing lots of key/value pairs. I want
to compare them, to find out what is missing so that I can transfer
whatever is needed from A to B and whatever is needed from B to A such
that A=B. Any effective algorithms for that? Where should I look?

Rephrasing it in math terminology:
If I have two sets A, and B, containing tuples, and I want to find
complement(A intersect B), how do I do that efficiently?

Bob Badour

unread,

Apr 21, 2006, 9:46:30 AM4/21/06

to

Andersen wrote:

The efficiency will depend on the physical structure representing the
sets and on the costs of indexes or sorts.

Is this for a course you are taking?

Andersen

unread,

Apr 21, 2006, 10:44:39 AM4/21/06

to Bob Badour

Bob Badour wrote:
>
> The efficiency will depend on the physical structure representing the
> sets and on the costs of indexes or sorts.

Thats why I am asking, which structure and indexes would be preferable.
Assume A and B are on different computers, the goal is to minimize the
amount of traffic needed to "unify" these two sets.

> Is this for a course you are taking?

Of course not. I am not looking for a full-fleged solution. I want to
know the "state-of-the-art". I assumed the database community would have
done lots of work on this. I know a trivial solution, hash each set's
key/values and compare, if the differ, split the sets in two pieces
(sorted) and exchange... and repeat this until you track the changes.
This would require log(n) exchanges to get down to any particular bit.
There must be lots of other ways...

Bob Badour

unread,

Apr 21, 2006, 12:51:18 PM4/21/06

to

Andersen wrote:
> Bob Badour wrote:
>>
>> The efficiency will depend on the physical structure representing the
>> sets and on the costs of indexes or sorts.
>
> Thats why I am asking, which structure and indexes would be preferable.
> Assume A and B are on different computers, the goal is to minimize the
> amount of traffic needed to "unify" these two sets.

As an example of the importance of physical structure, where do you
intend to evaluate this result? At the computer containing A, c(A), at
the computer containing B, c(B), or at some other computer c(C) ?

As another example, what is the maximum size of a single datagram passed
over the network and what is the size of the representation of the
tuples? I can think of an optimization that would improve performance if
two or more tuples fit in a single datagram.

If you are trying to minimize traffic between the computers, then
presumably the cost of any sorts or index creation on the computers
matters less. But then again, one would still have to weigh the expected
savings against the costs.

Do you envision this as part of a distributed database? Some sort of
replication architecture? To perform a merge-purge of mailing lists?
Simply to reconcile to similar but independent databases?

Each of those scenarios will affect the opportunities for optimization.
For example, in the case of some sort of replication scheme, presumably
the databases were reconciled at some earlier time. One can also presume
that the volume of updates is low relative to the total size of the
database. Otherwise, each database would very quickly become entirely
out of date with respect to the other.

Since the size of the intersection will be relatively small and because
the dbmses will have to reconcile updates temporally, it probably makes
sense to just share log files from the previous checkpoint forward.
Compressing the log files would be your primary efficiency opportunity.

>> Is this for a course you are taking?
>
> Of course not. I am not looking for a full-fleged solution. I want to
> know the "state-of-the-art". I assumed the database community would have
> done lots of work on this. I know a trivial solution, hash each set's
> key/values and compare, if the differ, split the sets in two pieces
> (sorted) and exchange... and repeat this until you track the changes.
> This would require log(n) exchanges to get down to any particular bit.
> There must be lots of other ways...

Pardon me for observing that it sounds like a question or essay topic
for a course of some sort. People working for a dbms vendor implementing
some sort of distributed database or replication feature would tend to
keep abreast of the state of the art using much better sources than usenet.

-CELKO-

unread,

Apr 21, 2006, 12:56:03 PM4/21/06

to

>> ind out what is missing so that I can transfer whatever is needed from A to B and whatever is needed from B to A such that A=B. <<

Straight forward in Standard SQL.

BEGIN
INSERT INTO B
SELECT *
FROM A EXCEPT B;

INSERT INTO A
SELECT *
FROM B EXCEPT A;
END;

I am assuming that A and B are identical in structure (UNION
compatible). But that leads to the question as to why there are two
tables for one set of entities.

len...@kommunicera.umea.se

unread,

Apr 21, 2006, 1:19:25 PM4/21/06

to

-CELKO- wrote:
[...]

> SELECT *
> FROM A EXCEPT B;
>

Shouldnt that be

SELECT * FROM A
EXCEPT

SELECT * FROM B;

?

/Lennart

[...]

Bob Badour

unread,

Apr 21, 2006, 2:18:42 PM4/21/06

to

len...@kommunicera.umea.se wrote:

> -CELKO- wrote:
> [...]
>
>>SELECT *
>> FROM A EXCEPT B;
>
> Shouldnt that be
>
> SELECT * FROM A
> EXCEPT
> SELECT * FROM B;
>
> ?

But what about the tuples in B that are not in A? In SQL terms:

SELECT * FROM ( SELECT * FROM A UNION SELECT * FROM B )
EXCEPT
SELECT * FROM ( SELECT * FROM A INTERSECT SELECT * FROM B )

But the OP wasn't asking how to express the query at the logical
level--he already did that quite handily, himself. He asked quite
specifically about physical implementation and efficiency.

David Cressey

unread,

Apr 21, 2006, 3:27:08 PM4/21/06

to

"Andersen" <anders...@hotmail.com> wrote in message
news:4448b904$0$16468$892e...@authen.yellow.readfreenews.net...

I'm not sure what you mean by "efficiently". Could you be a little more
specific about the envirnment where you intend to do this, and what would
be wasted by an inefficient solution?

In the meantime, have you tried (A Union B) minus (A intersect B)? Is this
inefficient? Why?

By the way, I'm assuming you don't mean "complement" as in the entire
universe of discourse minus the intersection.

-CELKO-

unread,

Apr 21, 2006, 6:49:37 PM4/21/06

to

Opps! And some products will accept "TABLE A EXCEPT TABLE B", but I
like "SELECT * FROM ." a bit better.

To answer Bob's question about physical implementation and efficiency,
that would vary from product to product. However, I think that DB2
turns the whole row into a long bit string to do the matching, so it
runs really fast. Years ago, I found that the Oracle MINUS operator
was the fastest way to do a relational division.

Alfredo Novoa

unread,

Apr 22, 2006, 5:28:42 AM4/22/06

to

If the two sets are sorted or indexed a O(n) algorithm is trivial. If
they are not, you might use hashing or to sort them first.

Regards
Alfredo

Andersen

unread,

Apr 22, 2006, 7:16:15 AM4/22/06

to Bob Badour

Bob Badour wrote:
>
> As an example of the importance of physical structure, where do you
> intend to evaluate this result? At the computer containing A, c(A), at
> the computer containing B, c(B), or at some other computer c(C) ?

I really want to do unification between A and B. I.e., after A and B
exchange some information, they should each have (A union B) locally.
Only computers in the world are A and B, and there is a network between
them, and we want to minimize traffic. Local computations on A and B
almost take zero time (well exclude solutions where you use some fancy
extremely costly fractal compression or anything of that sort).

> As another example, what is the maximum size of a single datagram passed
> over the network and what is the size of the representation of the
> tuples? I can think of an optimization that would improve performance if
> two or more tuples fit in a single datagram.

MTU=1500?

> If you are trying to minimize traffic between the computers, then
> presumably the cost of any sorts or index creation on the computers
> matters less. But then again, one would still have to weigh the expected
> savings against the costs.

Sorry, my assumption was that index creation and things of that sort are
definitely allowed (I assumed the solution would involve something like
that).

> Do you envision this as part of a distributed database? Some sort of
> replication architecture? To perform a merge-purge of mailing lists?
> Simply to reconcile to similar but independent databases?

Some kind of replication architecture.

> Each of those scenarios will affect the opportunities for optimization.
> For example, in the case of some sort of replication scheme, presumably
> the databases were reconciled at some earlier time. One can also presume
> that the volume of updates is low relative to the total size of the
> database. Otherwise, each database would very quickly become entirely
> out of date with respect to the other.

Right, I will periodically do this unification, to try to keep the
replicated dbms in sync.

> Since the size of the intersection will be relatively small and because
> the dbmses will have to reconcile updates temporally, it probably makes
> sense to just share log files from the previous checkpoint forward.
> Compressing the log files would be your primary efficiency opportunity.

I want to be able to use the same algorithm to sync node A and B where
one of them has not been around at all, or has a very outdated database.
But since the sync is done periodically, most of the time, the computers
should be in sync. So another assumption is that if A = B, then the
algorithm should be very efficient.

> Pardon me for observing that it sounds like a question or essay topic
> for a course of some sort. People working for a dbms vendor implementing
> some sort of distributed database or replication feature would tend to
> keep abreast of the state of the art using much better sources than usenet.

I find asking on usenet much better, as it would maybe take me years to
gain the experience that a particular person has in a field. I have
Ullman/Garcia Molina's Database book, but that would take me some while
to dig through (I did take database courses many years ago in my
undergrad education).

Andersen

unread,

Apr 22, 2006, 7:21:29 AM4/22/06

to Alfredo Novoa

Alfredo Novoa wrote:

>
> If the two sets are sorted or indexed a O(n) algorithm is trivial. If
> they are not, you might use hashing or to sort them first.

Sorry I am not following. O(n) what? In number of msgs or size of total
data transferred? Either case, it does not sound efficient, here is the
algorithm for the latter, assuming n is the number of items in the set A
union B
send A to B, A execute A=A union B
send A to B, B execute B=A union B

Or were you talking bit complexity?

I want to do some kind of hashing...

len...@kommunicera.umea.se

unread,

Apr 22, 2006, 7:24:15 AM4/22/06

to

Bob Badour wrote:
> len...@kommunicera.umea.se wrote:
>
> > -CELKO- wrote:
> > [...]
> >
> >>SELECT *
> >> FROM A EXCEPT B;
> >
> > Shouldnt that be
> >
> > SELECT * FROM A
> > EXCEPT
> > SELECT * FROM B;
> >
> > ?
>
> But what about the tuples in B that are not in A?

Not relevant for my question.

/Lennart

[...]

Andersen

unread,

Apr 22, 2006, 7:27:12 AM4/22/06

to

David Cressey wrote:

>
> I'm not sure what you mean by "efficiently". Could you be a little more
> specific about the envirnment where you intend to do this, and what would
> be wasted by an inefficient solution?

I want to use some kind of hashing to avoid sending the actual data.
Inefficient solution would be transferring actual data that is really of
no use to the receiver. It is a simple optimization problem, minimize
traffic between A and B. Something similar to the RSYNC (by Trigdell)
algorithm, but for sets of tuples rather than BLOBs.

> In the meantime, have you tried (A Union B) minus (A intersect B)? Is this
> inefficient? Why?

The two sets are on different computers and initially computer X only
knows the set A and computer Y knows set B. So your solution is assuming
A and B are knows at a node, which is part of the problem that I want to
solve.

> By the way, I'm assuming you don't mean "complement" as in the entire
> universe of discourse minus the intersection.

Right, sorry, Universe=A union B.

Alfredo Novoa

unread,

Apr 22, 2006, 8:08:07 AM4/22/06

to

>Thats why I am asking, which structure and indexes would be preferable.
>Assume A and B are on different computers, the goal is to minimize the
>amount of traffic needed to "unify" these two sets.

Ah! This is a very different problem!

It depends a lot on the characteristics of the data.

You might search for distributed diff algorithms and to look in
Citeseer.

Regards
Alfredo

David Cressey

unread,

Apr 22, 2006, 8:30:23 AM4/22/06

to

"Andersen" <anders...@hotmail.com> wrote in message

news:444A107F...@hotmail.com...

> Right, I will periodically do this unification, to try to keep the
> replicated dbms in sync.

Is your question really about keeping two tables in sync? I intend the word
"tables", rather than "sets" or "relations".

If so, I have a question? How will one ever delete data from (A union B)?
Does it require a delete to A and a delete to B before the synchronizer has
a chance to undo the deletion?

Also, what wrong with having a rule that enforces uniqueness in A and B,
and just copying one table over to the other, while throwing away entries
rejected by the duplicate catcher?

Is this a question requiring a theoretical answer or are you trying for a
practical solution?

Bob Badour

unread,

Apr 22, 2006, 9:54:23 AM4/22/06

to

Andersen wrote:

> Bob Badour wrote:
>
>> As an example of the importance of physical structure, where do you
>> intend to evaluate this result? At the computer containing A, c(A), at
>> the computer containing B, c(B), or at some other computer c(C) ?
>
> I really want to do unification between A and B. I.e., after A and B
> exchange some information, they should each have (A union B) locally.
> Only computers in the world are A and B, and there is a network between
> them, and we want to minimize traffic. Local computations on A and B
> almost take zero time (well exclude solutions where you use some fancy
> extremely costly fractal compression or anything of that sort).
>
>> As another example, what is the maximum size of a single datagram
>> passed over the network and what is the size of the representation of
>> the tuples? I can think of an optimization that would improve
>> performance if two or more tuples fit in a single datagram.
>
> MTU=1500?

In the general case, the MTU is insufficient. One must also know the
size of the representation of the tuples. However, in the replication
case, this becomes unimportant. Use the logs.

>> If you are trying to minimize traffic between the computers, then
>> presumably the cost of any sorts or index creation on the computers
>> matters less. But then again, one would still have to weigh the
>> expected savings against the costs.
>
> Sorry, my assumption was that index creation and things of that sort are
> definitely allowed (I assumed the solution would involve something like
> that).

It won't matter in the replication case. Use the logs.

>> Do you envision this as part of a distributed database? Some sort of
>> replication architecture? To perform a merge-purge of mailing lists?
>> Simply to reconcile to similar but independent databases?
>
> Some kind of replication architecture.

Then use the logs as I already suggested.

>> Since the size of the intersection will be relatively small and
>> because the dbmses will have to reconcile updates temporally, it
>> probably makes sense to just share log files from the previous
>> checkpoint forward. Compressing the log files would be your primary
>> efficiency opportunity.
>
> I want to be able to use the same algorithm to sync node A and B where
> one of them has not been around at all, or has a very outdated database.
> But since the sync is done periodically, most of the time, the computers
> should be in sync. So another assumption is that if A = B, then the
> algorithm should be very efficient.

If B has a very outdated database, simply copy A over B. This assumes
the data is sufficiently outdated that one can discard any changes to B
that were never reflected in A. If B has not been around, simply copy A
over B. Otherwise, have each send the other its logs from the last sync
forward.

If nothing has changed in either A or B, the logs from the last sync
forward will be empty. In this case A=B, and the algorithm will be very
efficient.

I suppose as an optimization, one can first send over the smaller log
and then send back an edited log of just the changes required. Suppose
c(A) sends a request to sync to c(B) and includes the size of the logs
for A. c(B) then compares that figure with the size of the logs for B.
If the logs of B are smaller, c(B) responds to the request with the logs
of B. If the logs of B are larger, c(B) requests the logs from A.

At this point, either c(A) or c(B) has all of the information. Suppose
c(B) has all of the information. It detects all of the changes required.
It applies the changes required to B and sends a log of the changes
required for A back to c(A). c(A) then applies those changes and the
sync is complete.

If A and B had identical changes since the last synch, c(B) will detect
that no changes are required and send c(A) an empty log of changes.

>> Pardon me for observing that it sounds like a question or essay topic
>> for a course of some sort. People working for a dbms vendor
>> implementing some sort of distributed database or replication feature
>> would tend to keep abreast of the state of the art using much better
>> sources than usenet.
>
>
> I find asking on usenet much better, as it would maybe take me years to
> gain the experience that a particular person has in a field. I have
> Ullman/Garcia Molina's Database book, but that would take me some while
> to dig through (I did take database courses many years ago in my
> undergrad education).

None of what I wrote above required years as an implementer of dbmses. I
simply reasoned it out on my own. The only item of consideration missing
from your original question was the almost certain existence of the
files that log all changes to the dbms.

I suggest where the state of the art lies is in the merging of two log
files to determine the necessary changes and the potential to abbreviate
both log files before the process begins. If the same value changes
twice in a dbms, one might be able to get away with tossing out the
first change and keeping only the last change. Triggered procedures, of
course, will complicate this.

Bob Badour

unread,

Apr 22, 2006, 9:56:57 AM4/22/06

to

len...@kommunicera.umea.se wrote:

Ah, quite likely. Joe is the chief denizen of my killfile so I have no
idea what was in [...]

Andersen

unread,

Apr 22, 2006, 8:33:09 PM4/22/06

to Bob Badour

Bob Badour wrote:

> If B has a very outdated database, simply copy A over B. This assumes
> the data is sufficiently outdated that one can discard any changes to B
> that were never reflected in A. If B has not been around, simply copy A
> over B. Otherwise, have each send the other its logs from the last sync
> forward.

That is the whole problem. I do not know how up to date A and B are, and
I want to efficiently find that out so that I can send as little data
over as possible. Logs might be huge, how do I know which part of the
log to send over, I do not want to send all of it, unless it is needed!
Your example with sending the size of the log over is in line with what
I am looking for. But I want something much more efficient.

Here is a naive scheme I was thinking about:
Send a checksum the sorted set A to c(B), and vice versa.
If the checksums match, you are done.
If not, split the set into two parts, take the checksum of each of the 2
parts and let the nodes exchange. For those that it does not match,
repeat the divide and conquer scheme until you know which parts to
exchange. This would mean you would be done in worst case log(N) rounds.

Bob Badour

unread,

Apr 22, 2006, 8:43:16 PM4/22/06

to

Andersen wrote:

I don't see how you are going to get log(N).

Consider A & B are equal each with 65535 tuples. Insert a tuple into A
at the beginning of the sort order. Insert a tuple into B at the end of
the sort order.

None of your subdivided sets are going to give matching checksums except
in error until you get down to individual tuples.

Andersen

unread,

Apr 22, 2006, 9:00:18 PM4/22/06

to

Bob Badour wrote:

>
> I don't see how you are going to get log(N).

log(N) message exchanges! Not bit complexity (i.e. size of the messages).

> Consider A & B are equal each with 65535 tuples. Insert a tuple into A
> at the beginning of the sort order. Insert a tuple into B at the end of
> the sort order.
>
> None of your subdivided sets are going to give matching checksums except
> in error until you get down to individual tuples.

Here is a different solution to that problem. Keep N bins, where N is
some very large constant. Hash every tuple to an integer between 1-N and
put it in the corresponding bin. Now run the divide and conquer on this.

I.e, first send checksum of the value of all bins (each bin might
contain many tuples). Next level would divide the bins into two parts.
Of course your smallest "atom" or level of granularity would be the
number of bins, so if you have number of tuples far exceeding the number
of bins, you'd be in trouble. But you should pick large enough number of
bins...

Bob Badour

unread,

Apr 22, 2006, 11:18:56 PM4/22/06

to

Andersen wrote:

> Bob Badour wrote:
>
>> I don't see how you are going to get log(N).
>
> log(N) message exchanges! Not bit complexity (i.e. size of the messages).

You are not going to get log(N) in messages either.

>> Consider A & B are equal each with 65535 tuples. Insert a tuple into A
>> at the beginning of the sort order. Insert a tuple into B at the end
>> of the sort order.
>>
>> None of your subdivided sets are going to give matching checksums
>> except in error until you get down to individual tuples.
>
>
> Here is a different solution to that problem. Keep N bins, where N is
> some very large constant. Hash every tuple to an integer between 1-N and
> put it in the corresponding bin. Now run the divide and conquer on this.
>
> I.e, first send checksum of the value of all bins

But you said N is a large number which means you have to send a large
number of checksums.

Frank Hamersley

unread,

Apr 23, 2006, 4:23:59 AM4/23/06

to

My view is aligned with Bob, but if you want to see a case where there
has been some brain power expended on the problem then have a look at rsync.

Cheers, Frank.

Christopher Browne

unread,

Apr 23, 2006, 8:12:22 AM4/23/06

to

I wouldn't think so.

I seem to recall Sybase having a "lock promotion" feature where they
would basically conclude "Hey, you've locked so much of the table that
we'll promote those page locks into a table lock. Yeah, it's a worse
lock, but it's way cheaper to manage than 28,742 page locks, and any
other processes trying to access this table were screwed anyways..."

This looks like a case to use a similar technique.

In effect, "Hey, there are so few matches that we might as well do a
bulk copy of the whole table."
--
wm(X,Y):-write(X),write('@'),write(Y). wm('cbbrowne','gmail.com').
http://linuxdatabases.info/info/lisp.html
"Sigh. I like to think it's just the Linux people who want to be on
the `leading edge' so bad they walk right off the precipice."
-- Craig E. Groeschel

Bob Badour

unread,

Apr 23, 2006, 9:17:15 AM4/23/06

to

Christopher Browne wrote:

But if you consider the example I gave previously that involved a single
insert into each table, swapping logs would be trivial.

Likewise, by comparing the size of the log to the size of the relation
variable, one could choose to send the relvar instead of the logs.
However, one would then have trouble reconciling multiple changes to
similar tuples.

Jay Dee

unread,

Apr 23, 2006, 1:03:20 PM4/23/06

to

From reading the clarifications made in subsequent posts to this
discussion, it seems that you don't have two sets, you have one,
and the issue isn't how to reconcile more than one set but how to
maintain one set in more than one place.

Which begs the question: does it have to be more than one place?
is there no acceptable way to connect data sources and sinks to
one database?

Certainly, there are many good reasons to replicate data stores.
If one of those is applicable and the answer is, "Yes, the data
have to be in more than one place," then consistency and latency
(among others) become significant potential issues.

It seems you've already answered this question and are trying to
figure out how to maintain a distributed database. I suggest you
check out the distributed databases already on the market (dismal
performers) or check out Stonebraker's work-in-process (tolerates
inconsistencies "for a while.")

Andersen

unread,

Apr 23, 2006, 4:02:30 PM4/23/06

to Bob Badour

Bob Badour wrote:

>> I.e, first send checksum of the value of all bins
>
>
> But you said N is a large number which means you have to send a large
> number of checksums.

Eh you misunderstood. Here it is more detailed.
Have N bins. Hash every tuple to get an integer between 1..N. So every
tuple now belongs to a bin.

Basic operation: Hashing a bin. Take all the tuples in the bin and use
some SHA1 hash of each and add or xor them. This means you now have 1
hash value representing the whole bin.

Basic operation: Hashing several bins. Just recursively apply "hashing a
bin". I.e., to get the hash of bins i, j, just hash each bin and add or
xor their values. That means you now have 1 hash value representing two
bins.

Assume computer X has set A and computer Y has set B.

Round1:
Computer A hash all bins 1..N and get a single hash value representing
all. Send to computer B.
Computer B hash all bins 1..N and compare with hash value sent by
computer A. If they match, we are done and have the same data, otherwise
go to round 2.

Round 2:
Computer A hash bins 1..N/2 and bins N/2..N, to get two hash values.
Computer B does the same and compares.

If only first hash mismatch: Round 3 does 2 hashes of bins
0N/4..1N/4, and 1N/4..2N/4.

If only second hash mismatch: Round 3 does 2 hashes of bins
2N/4..3N/4, and 3N/4..4N/4.

If both hashes mistmatch: Round 3 does 4 hashes of bins
0N/4..1N/4, and 1N/4..2N/4, and 2N/4..3N/4, and 3N/4..4N/4.

I think you get the idea. Lets say you change a single tuple... that
means a single bin changes. That would require:
Round 1: 1 msg with 1 hash value,
Round 2: 1 msg with 2 hash values,
Round 3: 1 msg with 2 hash values,
...
Round log(N): 1 msg with 2 hash values.
I.e. log(N) msgs, with total 2*log(N) hashes exchanged.

Of course if N is huge, log(N) will be something like 160 msgs. But you
could say that you stop after msg 30, and simple transfer whatever bins
that have mismatches. That would mean that even if a single tuple
changed, you'd be forced to transfer 2^(-30) fraction of the bins.
Whether that would be a lot or not would depend on the number of msgs in
each bin...

This is a naive algorithm. There must be smarter ones.

Andersen

unread,

Apr 23, 2006, 4:03:30 PM4/23/06

to Frank Hamersley

Frank Hamersley wrote:

>
> My view is aligned with Bob, but if you want to see a case where there
> has been some brain power expended on the problem then have a look at
> rsync.

I know rsync in detail. The rolling checksums is nice, but I really cant
see the application to my example...

Andersen

unread,

Apr 23, 2006, 4:07:57 PM4/23/06

to Jay Dee

Jay Dee wrote:

> From reading the clarifications made in subsequent posts to this
> discussion, it seems that you don't have two sets, you have one,
> and the issue isn't how to reconcile more than one set but how to
> maintain one set in more than one place.

Agree,

> Which begs the question: does it have to be more than one place?
> is there no acceptable way to connect data sources and sinks to
> one database?

Yes, it needs to be in 100 places, eventual consistency is all that matters.

> It seems you've already answered this question and are trying to
> figure out how to maintain a distributed database. I suggest you
> check out the distributed databases already on the market (dismal
> performers) or check out Stonebraker's work-in-process (tolerates
> inconsistencies "for a while.")

Doubt any database systems would help. I am interested in algorithms,
not off the shelf dbms, becaues I doubt they scale to 100 or 500
nodes... geographically distributed... It might seem like asking for a
lot, but I am ready to accept eventual consistency.

Bob Badour

unread,

Apr 23, 2006, 5:26:51 PM4/23/06

to

Andersen wrote:

> Bob Badour wrote:
>
>>> I.e, first send checksum of the value of all bins
>>
>> But you said N is a large number which means you have to send a large
>> number of checksums.
>
> Eh you misunderstood. Here it is more detailed.
> Have N bins. Hash every tuple to get an integer between 1..N. So every
> tuple now belongs to a bin.

[snipped example]

We are dealing with three numbers: C, N, and M. C is the cardinality of
the relation and is linearly proportional to the size of the relation. N
is the size of the hash table, which is arbitrarily chosen as a large
number. M is the number of changes since the last synch and is linearly
proportional to the size of the log file.

We established earlier that C >> M.

Assuming a uniform distribution and C > N, one can expect to find C/N
tuples in each bucket. It makes no sense to choose N < M, because one
will expect to find M/N > 1 updates in each bucket meaning you will send
O(2^N) messages plus O(C) tuples. If we choose N > M and have a uniform
distribution of changes in buckets, we can expect O(M) <= M buckets to
have updates.

In the example snipped, you give M = 1 and log(N) = 160. I simply
observe that 160 >> 1. For M << N, your example suggests you will need
to send O(M * log(N)) messages plus O(M * C/N) tuples, which is much
larger than the logfile O(M).

What did I misunderstand?

Bob Badour

unread,

Apr 23, 2006, 5:48:53 PM4/23/06

to

Andersen wrote:

> Jay Dee wrote:
>
>> Which begs the question: does it have to be more than one place?
>> is there no acceptable way to connect data sources and sinks to
>> one database?
>
> Yes, it needs to be in 100 places, eventual consistency is all that
> matters.
>
>> It seems you've already answered this question and are trying to
>> figure out how to maintain a distributed database. I suggest you
>> check out the distributed databases already on the market (dismal
>> performers) or check out Stonebraker's work-in-process (tolerates
>> inconsistencies "for a while.")
>
> Doubt any database systems would help. I am interested in algorithms,
> not off the shelf dbms, becaues I doubt they scale to 100 or 500
> nodes... geographically distributed... It might seem like asking for a
> lot, but I am ready to accept eventual consistency.

I direct you once again to my original answer: "The efficiency will
depend on the physical structure representing the sets and on the costs
of indexes or sorts."

100 or 500 nodes is physically much different than 2 nodes. Because
there are C(N,2) possible edges between N nodes, we are now talking
about 4950 to 124750 pairwise connections distributed geographically.

Thus, the first optimization is to arrange your nodes into a
more-or-less balanced binary tree.

Pair off your leaf nodes to use the algorithm I already suggested with
the modification that the computer evaluating the necessary changes not
only applies the changes to itself and its paired computer, but it then
pairs itself again with the evaluator from another pair and repeats the
process.

Alternatively, arrange the nodes so that they are each paired with two
computers and alternate between pairings. The trick there is to make
sure you have no islands of nodes separated from the rest.

Since you change the physical structure constantly, I am done. Plonk!

Jay Dee

unread,

Apr 23, 2006, 8:51:39 PM4/23/06

to

You know, when it was just "Table A" and Table B," I was thinking, "Why
not exchange tapes?" A couple tapes and FedEx is quicker than networks
when lots of data are involved.

But it seems that you've got lots of data in lots of places. And, I'll
bet, more than two tables, right? This is a very different situation.

And you expect to find answers on c.d.t? I admire your optimism, but
I'd need some portraits of presidents before telling anyone how I did
it.

Frank Hamersley

unread,

Apr 24, 2006, 12:50:51 AM4/24/06

to

Fair enough - I haven't done the detailed unpicking suffice to say it
seems good enough to lead some ppl to try and replicate massive data sets.

Cheers Frank.

Andersen

unread,

Apr 24, 2006, 11:38:19 AM4/24/06

to Bob Badour

Bob Badour wrote:

> But if you consider the example I gave previously that involved a single
> insert into each table, swapping logs would be trivial.
>
> Likewise, by comparing the size of the log to the size of the relation
> variable, one could choose to send the relvar instead of the logs.
> However, one would then have trouble reconciling multiple changes to
> similar tuples.

I want a generic algorithm. You are right that swapping logs is
excellent for your example, but what if one node does not have half of
what the other node has, then logs might be a bad idea. I want to
periodically run this algorithm to synchronize the nodes. Something
similar to what I described in the other posting (the algorithm I gave),
but maybe something more efficient.

David Cressey

unread,

Apr 24, 2006, 6:21:37 PM4/24/06

to

"Andersen" <anders...@hotmail.com> wrote in message

news:444CF0EB...@hotmail.com...

OK, let's take it from the top.

You want to sunchronize sets A and B, after each of them has been subjected
to inserts and deletes. For some reason,
the idea of keeping and swapping logs of the changes is impractical. I
don't get this, but no matter. and you want a basic algorithm, not just a
tool.

OK, let's start from there. It seems to me that one basic algorithm would
be the merge algorithm. If you stored A and B in such a way that you could
easily retrieve both of them in some order, you could apply a simple merge
process to determine
which elements are missing in either A or B. To simplify (?) let's say that
A and B are tables with at least one candidate key, and that a candidate
key has been arbitrarily chosen as the primary key. Further, let's say
that there is an index on the primary key in both tables, and we're paying
the freight to keep the two indexes up to data for some other reasons.

Now we can retrieve A and B in the same order relatively quickly. If we do
a merge algorithm, and zig zag back and forth between advancing A and
advancing B, we should detect A minus B and B minus A in the process. Now
the question is how to do this process by doing part of the process on each
computer, and swapping a minimal amount of data for all the rows that
match. At the very least, you could exchange only primary keys, instead of
entire records, and each process could detect keys not present in their own
copy of the set. They could then exchange entire rows, only when needed.

This should reduce traffic. I suspect that there are ways of further
compressing the data to be exchanged on matching rows.
I admit that this algorithm doesn't cover updates made to non key data in A
and B. Is this an important consideration in the problem you are
addressing?

Andersen

unread,

Apr 25, 2006, 9:23:42 AM4/25/06

to Bob Badour

I am glad someone is actually commenting on my algorithm.

Bob Badour wrote:
> >
> [snipped example]
>
> We are dealing with three numbers: C, N, and M. C is the cardinality of
> the relation and is linearly proportional to the size of the relation. N
> is the size of the hash table, which is arbitrarily chosen as a large
> number. M is the number of changes since the last synch and is linearly
> proportional to the size of the log file.

C is the cardinality of the sets right? The only relations I know are
the tuples (which form a subset of SxS where S is whatever the
components of the tuple are).

> We established earlier that C >> M.

Not necessary. I want a generic algorithm which works under all
circumstances, where you pay a "cost proportional to the amount of
asynchrony".

> Assuming a uniform distribution and C > N, one can expect to find C/N
> tuples in each bucket. It makes no sense to choose N < M, because one
> will expect to find M/N > 1 updates in each bucket meaning you will send
> O(2^N) messages plus O(C) tuples. If we choose N > M and have a uniform
> distribution of changes in buckets, we can expect O(M) <= M buckets to
> have updates.

Still with you... also agree with assumption of N > M.

> In the example snipped, you give M = 1 and log(N) = 160. I simply
> observe that 160 >> 1. For M << N, your example suggests you will need
> to send O(M * log(N)) messages plus O(M * C/N) tuples, which is much
> larger than the logfile O(M).

I agree to the cost of my algo being
O(M * log(N)) [for transfering M checksum path of the checksum tree) +
O(M * C/N) [cost of M buckets each containing C/N items]

> What did I misunderstand?

Nothing, I think we agree about what my proposal does. I guess the
problem is I do not know exactly what it means to send a logfile. So I
really cannot compare to it. Please help me understand a bit more about
the basics of having log files.

Some questions that come to my mind:
The log would grow beyond C, right? As time goes, the size of the log
grows to inifinity, even with a fixed C? Then you have to find ways
around that with checkpoints etc?

If the log can be really big, how do you know how much of it to
transfer? Lets say the checksum of the two logs on the two computers
mismatch, how much of it should we send?

My main question: If we have several machines, and they are each making
updates locally, and trying to synchronize all with eachother (running
your pairwise log synch algo), the logs are not just this monotonically
increasing log.

At time t nodes A and B and C have the log X. Some moment later, A
appends some entry to its local log X. At the same time, B does some
modification to its X. A third node also makes some updates. How do we
go about making this work?

Jan Hidders

unread,

Apr 25, 2006, 10:47:07 AM4/25/06

to

Andersen wrote:
>
> Rephrasing it in math terminology:
> If I have two sets A, and B, containing tuples, and I want to find
> complement(A intersect B), how do I do that efficiently?

Just as a small side note: since the worst-case communication
complexity of comparing two strings is O(n) where n is the length of
the strings you won't be able to do better than that in the worst case.

-- Jan Hidders

David Cressey

unread,

Apr 25, 2006, 3:48:29 PM4/25/06

to

"Andersen" <anders...@hotmail.com> wrote in message

news:444E22DE...@hotmail.com...

> Nothing, I think we agree about what my proposal does. I guess the
> problem is I do not know exactly what it means to send a logfile. So I
> really cannot compare to it. Please help me understand a bit more about
> the basics of having log files.

The log files can be truncated whenever a synchonization has been acheived
for sure, or am I wrong?

Also, if there are more than two nodes, can't you designate one node as the
"hub" and have all the others synchronize to it?

Andersen

unread,

Apr 27, 2006, 3:42:06 PM4/27/06

to David Cressey

David Cressey wrote:
>
> The log files can be truncated whenever a synchonization has been acheived
> for sure, or am I wrong?

You cannot do that as you might want to synchronize with someone else.
We will have hundreds of nodes, and each will occasionally synchronize
with each other. If you at any time leave the system without altering
any data, they should all eventually converge to have the union of all
the same data.

> Also, if there are more than two nodes, can't you designate one node as the
> "hub" and have all the others synchronize to it?

No, do not want a centralized algorithm for many reasons.

Andersen

unread,

Apr 27, 2006, 3:42:11 PM4/27/06

to David Cressey

David Cressey wrote:
>
> The log files can be truncated whenever a synchonization has been acheived
> for sure, or am I wrong?

You cannot do that as you might want to synchronize with someone else.

We will have hundreds of nodes, and each will occasionally synchronize
with each other. If you at any time leave the system without altering
any data, they should all eventually converge to have the union of all
the same data.

> Also, if there are more than two nodes, can't you designate one node as the

> "hub" and have all the others synchronize to it?

Andersen

unread,

Apr 27, 2006, 3:44:12 PM4/27/06

to Jan Hidders

Of course, but that is a worst case analysis. Roughly speaking (not
exactly the same problem), I want to compare two strings and make one
into the other (similar to RSYNC), but having to transfer as little as
possible over the network.

Jan Hidders

unread,

Apr 30, 2006, 8:06:33 AM4/30/06

to

Andersen schreef:

That problem has clearly the same worst-complexity, and you already
know an algorithm with that worst-case complexity (David Cressey's for
example). If you want something that behaves better on average you
first need to specify more precisely what "on average" means. It also
depends a lot on what you can and cannot do on each client. Can we keep
timestamps per key-value pair? Can we keep timestamps for each peer for
the last moment of synchronization?

-- Jan Hidders