Bulk queries in Dynamo?


Josh Israel

Mar 7, 2013, 12:30:41 AM
to stanford-...@googlegroups.com
So in the Dynamo paper, they say the API is only get/put. I was thinking about this, and in my own (probably very skewed) experience, request-handling latency correlates more strongly with the number of queries than with the cost of any single query. Particularly when a query has to hit multiple data centers, network latency can be significant compared to the actual look-up time at any single location. In the project I work on, a lot of slow responses come from code like
for (String id : entityIds) {
    do_some_work(get_entity(id));  // one network round trip per id
}

Usually that get_entity call is fast, but the latency adds up quickly. Since it's SQL, we typically fix it by converting to an IN clause or some other bulk query and iterating over the result set (something like the sketch below). Could/does Dynamo support bulk gets?
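
The SQL-side fix, roughly (the entities table, payload column, and do_some_work are just stand-ins for our real schema and code):

import java.sql.*;
import java.util.Collections;
import java.util.List;

// One round trip instead of N: fetch every id in a single IN query.
static void bulk_fetch(Connection conn, List<String> ids) throws SQLException {
    String placeholders = String.join(",", Collections.nCopies(ids.size(), "?"));
    try (PreparedStatement stmt = conn.prepareStatement(
            "SELECT id, payload FROM entities WHERE id IN (" + placeholders + ")")) {
        for (int i = 0; i < ids.size(); i++) {
            stmt.setString(i + 1, ids.get(i));
        }
        try (ResultSet rs = stmt.executeQuery()) {
            while (rs.next()) {
                do_some_work(rs.getBytes("payload"));  // same work as the loop above
            }
        }
    }
}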

In a similar vein, one of the useful things you can do with Bigtable is issue queries for keys in a given range. Is that possible in Dynamo?

dm-list-c...@scs.stanford.edu

Mar 7, 2013, 12:39:38 AM
to Josh Israel, stanford-...@googlegroups.com
At Wed, 6 Mar 2013 21:30:41 -0800 (PST),
Josh Israel wrote:
>
> for (String id : entityIds) {
>     do_some_work(get_entity(id));  // one network round trip per id
> }
>
> Usually that get_entity call is fast, but the latency adds up quickly.
> Since it's SQL, we typically fix it by converting to an IN clause or some
> other bulk query and iterating over the result set. Could/does Dynamo
> support bulk gets?
>
> In a similar vein, one of the useful things you can do with Bigtable is
> issue queries for keys in a given range. Is that possible in Dynamo?

I suspect Dynamo does not offer bulk queries or range queries
(particularly since the keys are hashed with MD5). However, I'm sure
many applications issue multiple queries in parallel to overlap their
latency. Moreover, it is common to structure data so as to minimize
the number of dependent queries, keeping the number of sequential,
back-to-back round trips far below the total number of queries issued.
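
For instance, something like this overlaps the round trips (a sketch
only; Entity, get_entity, and the executor are placeholders, not
Dynamo's actual client API):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

// Submit every get up front, then collect; total latency is roughly one
// round trip rather than N sequential ones. Entity and get_entity stand
// in for the application's data type and client call.
static List<Entity> parallel_get(ExecutorService pool, List<String> ids)
        throws InterruptedException, ExecutionException {
    List<Future<Entity>> futures = new ArrayList<>();
    for (String id : ids) {
        futures.add(pool.submit(() -> get_entity(id)));  // request is now in flight
    }
    List<Entity> results = new ArrayList<>();
    for (Future<Entity> f : futures) {
        results.add(f.get());  // waits, but all requests run concurrently
    }
    return results;
}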

David

Josh Israel

Mar 7, 2013, 1:10:11 AM
to stanford-...@googlegroups.com, Josh Israel, David Mazieres expires 2013-04-20 PDT
That makes sense to me for range queries. However, could you explain why hashing would prevent bulk queries? Naively, it seems like it wouldn't be too difficult to do something like this (rough sketch below):
1) compute the nodes to hit for each key in the bulk query
2) create a map of node id --> key set
3) issue bulk look-up requests for each node in that map
4) group the responses from the nodes by key and return
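
A rough sketch of those four steps, assuming hypothetical node_for()
and Node.bulk_get() helpers (Dynamo's published API exposes neither):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Steps 1-2: group keys by owning node; step 3: one parallel bulk
// request per node; step 4: merge the per-node answers into one map.
static Map<String, byte[]> bulk_get(List<String> keys) throws Exception {
    Map<Node, List<String>> byNode = new HashMap<>();
    for (String key : keys) {
        byNode.computeIfAbsent(node_for(key), n -> new ArrayList<>()).add(key);
    }
    ExecutorService pool = Executors.newCachedThreadPool();
    List<Future<Map<String, byte[]>>> futures = new ArrayList<>();
    for (Map.Entry<Node, List<String>> e : byNode.entrySet()) {
        futures.add(pool.submit(() -> e.getKey().bulk_get(e.getValue())));
    }
    Map<String, byte[]> results = new HashMap<>();
    for (Future<Map<String, byte[]>> f : futures) {
        results.putAll(f.get());
    }
    pool.shutdown();
    return results;
}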

David Mazieres

Mar 7, 2013, 2:53:35 AM
to Josh Israel, stanford-...@googlegroups.com
At Wed, 6 Mar 2013 22:10:11 -0800 (PST),
Josh Israel wrote:
>
> That makes sense to me for range queries. However, could you explain why
> hashing would prevent bulk queries? Naively, it seems like it wouldn't be too
> difficult to do something like this:
> 1) compute the nodes to hit for each key in the bulk query
> 2) create a map of node id --> key set
> 3) issue bulk look-up requests for each node in that map
> 4) group the responses from the nodes by key and return

Hashing means that you don't have any kind of locality and cannot
specify a key range concisely. So any kind of bulk query would just be
equivalent to a whole bunch of concurrent individual queries (which I'm
sure does happen, and which you could view as a bulk query).
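
To see why ranges fall apart, hash a few adjacent keys (made-up names)
and look at where they land on the ring:

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Adjacent keys hash to unrelated ring positions, so a contiguous key
// range fans out across essentially every node.
public class HashSpread {
    public static void main(String[] args) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        for (String key : new String[] { "user:1000", "user:1001", "user:1002" }) {
            BigInteger pos = new BigInteger(1, md5.digest(key.getBytes(StandardCharsets.UTF_8)));
            System.out.println(key + " -> " + pos.toString(16));
        }
    }
}

The three positions share nothing, so a contiguous key range has no
compact description on the ring; the only way to "scan" it is to
enumerate every key individually.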

David