YCSB scans

70 views
Skip to first unread message

Amy Tai

unread,
Jan 13, 2016, 3:43:57 PM1/13/16
to hyperdex-discuss
Hi,

I'm looking at the way YCSB scans are implemented in the Hyperdex YCSB java client, but it doesn't seem to be correct. In particular, in order to scan N items starting at some key K, Hyperdex's scan function (in bindings/java/org/ycsb/Hyperdex.java) searches for datapoints in the range [K, K+N], which is incorrect, because there could be anywhere from 0-N desired data objects in this range. If there are less than N items (which is probably the case, because of the sparseness of data), then Hyperdex is returning incorrect results. 

I changed the Hyperdex scan function to limit search results using the predicate GreaterEqual(K), which seems to give the correct number of results. Unfortunately, this makes the performance of scans abysmal, because a next() call on the search iterator returned to the Hyperdex client requires making another search on the Hyperdex servers.

Two questions: 1) Could you verify that this scan function is indeed incorrect? I'm wondering if this was the YCSB client that was used to generate results in the Hyperdex paper? 2) How are searches implemented on the client side? Does it matter which predicate I use? For example, if the client limits search results with a Range predicate, are all results found and immediately sent to the client? Or does the HD server create an iterator that is then queried everytime the client calls next()? What about, if I use a GreaterEqual predicate? Are all results similarly summarized in an iterator on the server-side, and then lazily fetched when the client calls next()?

Thanks,
Amy

Robert Escriva

unread,
Jan 30, 2016, 10:46:13 AM1/30/16
to hyperdex...@googlegroups.com
Hi Amy,

The scan function was correct for the version of YCSB we developed it
against. The YCSB benchmark we used for the SIGCOMM submission
generates keys from continuous sequence of integers. The search [x,
x+N) is guaranteed to return the N items after item x. You can see this
for yourself in the CoreWorkload.java and the *Generator.java files from
the last version of YCSB avialable at the time of publication[1].

Searches in HyperDex are performed in the most restrictive subspace, and
the predicates are used to compute which subspace that is. For example,
a range search on strings cannot be used client-side to narrow down a
search (because of hashing), but a range search on a numeric value will
be used.

The current HyperDex implementation does exactly what you describe with
the iterator. The implementation we used for SIGCOMM requested more
items concurrently, much like the current implementation of sorted
search does. The current sorted search will do one round trip to each
server, because the search results are inherently limited in number;
whereas, an unsorted search is unbounded and the iterator model made
more sense.

If you simply use a GreaterEqual predicate, it will return more than
half the key space roughly 50% of the time, and as you noticed, will
perform poorly. If you're looking to play around with changing scan, an
incremental improvement would be to use sorted search and limit it to N
results. I would still recommend an upper bound on the search for
efficiency.

-Robert

[1] https://github.com/brianfrankcooper/YCSB/tree/5813dbcaaf0218184086f2f319ea7e36baf52575

Pandian R

unread,
Mar 30, 2017, 11:52:53 AM3/30/17
to hyperdex-discuss
Hi Robert,

I'm also facing the same issue as Amy with the YCSB benchmark for workload E. If I use only the lower bound, it takes a lot of time to return the iterator. Can you explain why this would happen (why it returns more than half the key space) ? It still just returns an iterator and the data will be fetched lazily only on doing next() right ? If I use both upper and lower bound, it returns incorrect values (because the keys are not sequential). I tried using sorted_search but it didn't seem to help (failing with some assertion).

Keeping this aside, I would like to know if it is possible to do a scan range query(or search) on the primary key itself instead of an attribute within the usertable ? (like the seek functionality of HyperLevelDB). I know HyperDex provides additional search functionality to search among the attributes but I only need a scan workload on the primary key itself. I couldn't find any documentation for the same and it seems like we can only do search on some attributes within the table and not the primary key? Any help will be greatly appreciated.

Thanks and Regards,
Pandian Raju

Robert Escriva

unread,
Mar 31, 2017, 3:40:39 PM3/31/17
to hyperdex...@googlegroups.com
Some of the bugs you mention re:incorrect values were fixed after the
last release. The code never had a query planner for search, so if you
ask for an open ended iterator over half the data set, you'll get an
open ended search over half the data set.

You should be able to search over the key just like any other attribute
(assuming it is string or int).

I've shifted my development attention to Consus[1], which provides a
sorted table abstraction, rather than hashing the keyspace.

-Robert

[1] http://hack.systems/2016/12/29/why-consus/
> --
> You received this message because you are subscribed to the Google Groups
> "hyperdex-discuss" group.
> To unsubscribe from this group and stop receiving emails from it, send an email
> to hyperdex-discu...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

Reply all
Reply to author
Forward
0 new messages