skydb query language on Elasticsearch

Greg Brown

Apr 25, 2013, 10:50:18 AM
to sk...@googlegroups.com
Great presentation last night at Boulder/Denver BigData Ben. I thought I'd throw an idea your way.

We are currently using Kissmetrics (now https://qualaroo.com/) for tracking behavioral user data. Good in that it didn't take much work to get it set up, not great in that our data sits with a third party.

I'd rather us have a way to run it internally, but doing so with skydb would require running (and learning to scale) yet another DB.

I think you could map everything you are trying to do onto Elasticsearch and you'd get scaling, multiple data types, and full text search for free. It's very easy to shard ES and allocate users to particular shards. Building this into an ES plugin would allow you to build on top of the existing infrastructure around ES, for instance http://logstash.net/

Full text search is a very necessary feature IMO. For instance, users doing searches on a web site and trying to track how well those searches are working. With ES you get the ability to do multi-lingual search in a very robust and performant way. The ES plugin system would probably allow you to build on top of it pretty easily.

I realize this would probably require a lot of rewriting, but I really like the thought you've put into a query language based around user analytics.

Anyway, just figured I'd throw the idea at you.
-Greg

Ben Johnson

Apr 25, 2013, 12:07:22 PM
to sk...@googlegroups.com
hey Greg-

I think elasticsearch is an awesome tool and I've used it for a couple analytics apps before. I think it might be possible to build something like Sky on top of ES but I don't think you'll get nearly the same type of performance. The overhead of hitting an index to sort by a user id and then copying that data to the plugin for analysis would probably drop performance below 1MM/events/core/second. I also don't think you can control sharding by values inside the document which would cause multiple events for a user to be placed on different shards.

Sky doesn't have any full text search right now. Actually, there aren't any indexes either. It literally executes over your whole dataset on each query (although there's also an option for doing sampling coming up in v0.3.1). Behavioral data is interesting in that, since you're organizing by timelines, you can get a representative sample just by picking out a random sample. Once there are more use cases around full text search and indexes I'll add that feature. It hasn't come up as an issue so far. The complexity of the queries you can do in Sky can make indexing more difficult.
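
To illustrate the sampling idea above, here is a minimal sketch, not Sky's actual implementation. It assumes events are `(user_id, timestamp, action)` tuples (a hypothetical layout) and keeps every event for a random subset of users, so each sampled timeline stays whole. The function name `sample_timelines` is made up for this example.

```python
import random

def sample_timelines(events, fraction, seed=0):
    """Keep all events for a random subset of users.

    events: list of (user_id, timestamp, action) tuples (assumed layout).
    fraction: share of users to keep, between 0 and 1.
    """
    rng = random.Random(seed)
    user_ids = sorted({user_id for user_id, _, _ in events})
    keep_count = max(1, round(len(user_ids) * fraction))
    kept = set(rng.sample(user_ids, keep_count))
    # Sampling by user rather than by event keeps each timeline intact.
    return [event for event in events if event[0] in kept]
```

Sampling whole users rather than individual events matters here because queries walk each user's timeline in order; dropping random events would break the sequences.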

Thanks for sending the idea. I can definitely appreciate not wanting to run/scale a new database. I think the performance trade off will be too big though. I'd be interested to see if it would work if you ever have the interest in trying it.


Greg Brown

Apr 25, 2013, 12:21:00 PM
to sk...@googlegroups.com
Hi Ben,

The way I think you could map to ES would be using routing to map users to particular shards based on a field in each doc: http://www.elasticsearch.org/guide/reference/mapping/routing-field/
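
As a rough illustration of why routing keeps a user's events together: the shard is picked by hashing the routing value and taking it modulo the number of primary shards. The sketch below mimics that idea only. `zlib.crc32` is a stand-in and not the hash ES uses internally, and the shard count and event fields are assumptions.

```python
import zlib

NUM_SHARDS = 5  # assumed number of primary shards

def shard_for(routing_value, num_shards=NUM_SHARDS):
    # Stand-in hash for illustration; ES uses its own hash function.
    return zlib.crc32(routing_value.encode("utf-8")) % num_shards

events = [
    {"user_id": "u1", "action": "signup"},
    {"user_id": "u2", "action": "signup"},
    {"user_id": "u1", "action": "search"},
    {"user_id": "u1", "action": "comment"},
    {"user_id": "u2", "action": "logout"},
]

# Route each event by its user_id and record which shards each user touches.
placement = {}
for event in events:
    placement.setdefault(event["user_id"], set()).add(shard_for(event["user_id"]))
```

Because the shard depends only on the routing value, every event carrying the same `user_id` lands on the same shard.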

Presumably each doc would be an individual event. Most counting operations then become either a count query or a faceted search (http://www.elasticsearch.org/guide/reference/api/search/facets/).

In addition, by leaning heavily on ES filters I think many operations wouldn't need to hit the index much at all: http://www.elasticsearch.org/guide/reference/query-dsl/

I don't know enough about the reality of what skydb queries end up looking like, but I am pretty sure one could intelligently map queries in a very high performance way. By "1MM/events/core/second" do you mean adding events at that rate, being able to do 1 mil queries per sec, or query through 1 mil events?

-Greg

Ben Johnson

Apr 25, 2013, 1:47:15 PM
to sk...@googlegroups.com
hey Greg-

That's good to know about the routing field. I haven't used that in ES before.

I would probably store one document per event too. The problem that you run into with the query system is that ElasticSearch has no way to relate two documents together. I think ES would be perfect to roll up counts or facets of individual events (e.g. how many people performed "commented on a blog post", or break down the number of actions by a dimension like gender). The real power of Sky is when you want to find the next event that occurred after that comment or before the comment or perhaps what occurred in the session that came before the user wrote the comment. Skipping around on the timeline becomes difficult since ES doesn't give a good way to relate documents to each other over time.
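
A small sketch of the kind of timeline query described above, "what happened right after a comment?", written in plain Python rather than Sky's query language. The `(user_id, timestamp, action)` tuple layout and the function name `next_actions` are assumptions for illustration.

```python
from collections import Counter, defaultdict

def next_actions(events, target):
    """Count which action directly follows `target` on each user's timeline.

    events: iterable of (user_id, timestamp, action) tuples (assumed layout).
    """
    # Group events into one timeline per user.
    timelines = defaultdict(list)
    for user_id, timestamp, action in events:
        timelines[user_id].append((timestamp, action))

    counts = Counter()
    for timeline in timelines.values():
        timeline.sort()  # order each user's events by time
        # Walk adjacent pairs: (current event, the event after it).
        for (_, action), (_, following) in zip(timeline, timeline[1:]):
            if action == target:
                counts[following] += 1
    return counts
```

The query depends on the order of events within one user's history, which is the relationship that a per-document count or facet cannot express.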

As far as the performance, I was referring to how many events you can iterate over in a query per second on each CPU core. Right now Sky can normally do ~10MM events/second on each core (although there's a performance regression bug in v0.3.0). A big reason to write my own database is to have control over the very low level event storage and get that kind of crazy performance.


Ben

ch...@jumis.com

Sep 8, 2013, 2:19:43 AM
to sk...@googlegroups.com
Hopping in on this older thread.

Were Sky to support ES JSON semantics for bulk import/export, I could use existing scripts and fire off at the Sky endpoint the same as I do for ES.
Simply having a basic endpoint compatibility layer would mean I could use tools like elasticsearch-hadoop out of the box, for instance. That'd be swell.


That corresponds to Sky's bulk import stream:

Both packages have a simple JSON API (DSL?). In some sense, Sky is a little simpler as the goal is more focused.
ES does not yet have a mature aggregation language. That's something they're putting as a priority for 1.x (enhanced aggregates).
So there's not much reason to consider working into the existing faceting DSL. Sky has something to offer ES on that end.

Bulk export is handled through scan:
That corresponds to "List all events" in Sky.

That's all we'd really need to re-use most tools out there.

As for a plugin, the benefit isn't as big; it'd just make sure Sky popped up/down when ES does and maybe provide for some data locality (and save an extra port).
So we might have a nice _scan request type to populate Sky with results without (necessarily) hopping across the network; instead it'd just be IPC.

Such a plugin would be a lot more work than a simple compatibility class for scan/_bulk, and it doesn't provide all that much benefit that can't already be handled through elasticsearch-hadoop.

So, that's more-or-less how I see the integration story, based on the work I've been doing.

-Charles

Ben Johnson

Sep 11, 2013, 11:35:52 AM
to sk...@googlegroups.com
hey Charles-

Sorry it took so long to get back to you. I think an interface matching ES makes sense and probably wouldn't be that hard. I would structure that as a separate server process that you send the events through, performs the translation, and then inserts via the bulk stream in Sky.
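
A minimal sketch of what the translation step could look like, assuming the ES bulk body is newline-delimited JSON with an action line followed by a source line. The output event shape (`id`, `timestamp`, `data`) and the `user_id` / `timestamp` field names are hypothetical and not taken from Sky's API. Sending the result to Sky's bulk stream is left out.

```python
import json

def translate_bulk(body, id_field="user_id", time_field="timestamp"):
    """Turn an ES-style bulk body into a list of event dicts.

    The output shape is an assumption for illustration only.
    """
    lines = [line for line in body.splitlines() if line.strip()]
    events = []
    i = 0
    while i < len(lines):
        action = json.loads(lines[i])
        op = next(iter(action))  # e.g. "index", "create", "delete"
        i += 1
        if op == "delete":
            continue  # delete actions carry no source line
        source = json.loads(lines[i])
        i += 1
        if op in ("index", "create"):
            data = {k: v for k, v in source.items()
                    if k not in (id_field, time_field)}
            events.append({
                "id": source[id_field],
                "timestamp": source[time_field],
                "data": data,
            })
    return events
```

Keeping this in a separate process, as suggested above, means existing ES bulk scripts would only need to change the endpoint they post to.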

I'm pretty slammed for time right now but if you want to take a crack at it then I'll definitely help out.


Ben
