Lets start from this design. We will implement it, run on cluster of
Amazon EC2 instances, measure performance and learn all possible
lessons.
In the same time I would like to relate to the above design in number
of perspectives:
Scalability:
It is viable Idea for the small-to-medium set of reliable computers.
We assume that one central computer is strong
enough to merge results from other computers (lets call them leafs) in
the resonable time.
In fact - merge process also consumes CPU, and in some scale central
computer will also became a bottleneck. So we need to address it in
our design.
Reliability: In the big-enough scale some failures will happens
frequently enough to prevent system from completing many queries
successfully. So
we should improve our design in the way, that system will survive
failure of any particular node. We need to achive a kind of inta-query
failover. It mean that we
do not want to fail the whole query because of one node failure.
Instead we want to recompute minimum required information and get the
result.
Multi-tenancy - the system will be used by the many users, and we
should schedule the work in some logical manner.
David