Bulk load to skydb, large dataset, usage pattern.


harry...@gmail.com

Jan 14, 2014, 8:27:49 PM
to sk...@googlegroups.com

First of all, thanks for all the help getting it up and running. I have successfully done a test load of about 1.5 million events in about 3 minutes. BTW, I am using the sky.go API as indicated in the previous threads.

I would appreciate any feedback on what I am trying to do and whether Skydb is the right tool for the job.

I have web event data keyed by a browser cookie. The event data includes a few URL parameters and a couple of 'factor-type' items. The goal is to link certain events to previous activity by the surfer and extract the relevant/interesting data to a different system (Postgres). Doing this entirely in Postgres is possible, but I am at the mercy of the join working just right all the time, the "join" being the one between recent activity and, say, 7 days of history for each browser. There is no partitioning in Postgres that is guaranteed to make it parallel, and I can't reduce it to an 'embarrassingly parallel' problem. My choices were HBase, RocksDB (the new thing out of Facebook), or SkyDB (which I found via the RocksDB Hacker News thread). If this sounds weird / not the right tool, let me know.

Right now, my test bulk load is as follows:

find logs -name "*.log" |  parallel -L1 -P10 '/home/harry/gocode/bin/skyloader -infile={}'

This runs copies of my loader, posting each event one POST at a time. I did not find a specific bulk load API. I tried to follow
https://github.com/skydb/sky-importer/blob/master/sky_importer.go
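
For reference, the core of each loader process does roughly the following per event (simplified; the REST path and field names below are just my reading of the sky-importer code, so treat them as approximations rather than the official API):

package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "log"
    "net/http"
)

// postEvent sends a single event for one object (browser cookie) to Sky.
// One HTTP round trip per event -- this is the part I'd like to batch.
func postEvent(host, table, objectID, timestamp string, data map[string]interface{}) error {
    body, err := json.Marshal(map[string]interface{}{"data": data})
    if err != nil {
        return err
    }
    // Assumed endpoint layout; adjust to whatever the server actually expects.
    url := fmt.Sprintf("http://%s/tables/%s/objects/%s/events/%s", host, table, objectID, timestamp)
    req, err := http.NewRequest("PUT", url, bytes.NewReader(body))
    if err != nil {
        return err
    }
    req.Header.Set("Content-Type", "application/json")
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    if resp.StatusCode >= 300 {
        return fmt.Errorf("sky returned %s", resp.Status)
    }
    return nil
}

func main() {
    err := postEvent("localhost:8585", "logs", "cookie-abc123",
        "2014-01-14T00:00:00Z",
        map[string]interface{}{"action": "pageview", "url": "/home"})
    if err != nil {
        log.Fatal(err)
    }
}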

Now, the questions on performance:

1. If I have the events within a file already sorted by browser and event time, would it help SkyDB?

2. Do I need to configure anything? number of connections?

3. Is it better to have 10 parallel processes pushing data or just sequentially via one connection?

4. Do you suggest having multiple databases and instances of SkyDB on each host to reduce resource contention and use all 24 cores this machine has?

I am not saying the performance is bad, just asking if there is anything I should keep an eye on.

My second test load just finished, 20 million events, and it still hasn't crashed :). BTW, SkyDB did crash on me once with a "cgo: signal received" error (unfortunately, I lost the stack trace because of a rerun and the wiping of skyd.log). I am running the skydb unstable branch.

$ curl -X GET http://localhost:8585/tables/logs/stats
{"count":19340767}

CPU: 24-core Intel(R) Xeon(R) X5680 @ 3.33GHz
All-flash storage.

Thanks
--
Harry

Ben Johnson

Jan 15, 2014, 9:22:25 AM
to harry...@gmail.com, sk...@googlegroups.com
I would appreciate any feedback on what I am trying to do and whether Skydb is the right tool for the job.

Sky works well for trying to relate events together for a single object (e.g. user). For example, it's good for things like funnel analysis where you want to see users who did action X and then performed action Y. There's some fancier state machine stuff you can do with queries too but it's all within the context of an individual object.

Sky doesn't work well (or at all) for queries that relate objects together. For example, if you wanted to see connections between people then a graph database would work well. That being said, many times you can structure actions between two users as two separate events -- e.g. a friendship request action between two users could be a "send friendship request" event for the sending user and a "receive friendship request" event for the receiving user.


This runs copies of my loader, posting each event one POST at a time. I did not find a specific bulk load API.
The EventStream bulk endpoint helps speed things up significantly. It groups multiple inserts into a single LMDB transaction instead of one transaction per insert. We're currently in the process of moving to LLVM for the backend so code is in flux but improving this bulk endpoint is on my todo list.

Another way to speed up the import is to specify "-nosync" as a command line argument. By default Sky plays it safe and fsyncs after every transaction, but if you're bulk loading then -nosync will let the OS handle flushing the database to disk every so often.


1. If I have the events within a file already sorted by browser and event time, would it help SkyDB?

If you're using the bulk endpoint then it can help some. I wouldn't worry about it too much.


2. Do I need to configure anything? number of connections?

There are only a small handful of options that you can specify from the CLI. You probably don't need to set most of them (except maybe -nosync).

https://github.com/skydb/sky/blob/unstable/skyd/main.go#L35


3. Is it better to have 10 parallel processes pushing data or just sequentially via one connection?

One bulk loader endpoint is probably good enough. LMDB uses a single writer so you won't gain much by adding additional connections.


4. Do you suggest having multiple databases and instances of SkyDB on each host to reduce resource contention and use all 24 cores this machine has?
That is a hefty box. :)

Running one instance of Sky on your machine is good. Sky splits your data into local shards when you first start it up, creating one per logical core. So if your Xeon has hyper-threading then you'll probably see 48 (24 x 2) directories in your /var/lib/sky/data directory. All events for an object go to a single shard, and that shard is determined by an FNV-1a hash of the user id.
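
Just to illustrate the idea (a toy sketch, not Sky's actual code), the shard selection is essentially an FNV-1a hash of the object id taken modulo the number of shards:

package main

import (
    "fmt"
    "hash/fnv"
)

// shardIndex picks one of shardCount local shards for a given object id.
func shardIndex(objectID string, shardCount int) int {
    h := fnv.New32a() // FNV-1a, 32-bit
    h.Write([]byte(objectID))
    return int(h.Sum32() % uint32(shardCount))
}

func main() {
    // With 48 shards, all events for the same cookie land in the same shard.
    fmt.Println(shardIndex("cookie-abc123", 48))
}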


Let me know if you have any other questions!

-- 
Ben

harry...@gmail.com

Jan 15, 2014, 7:22:28 PM
to sk...@googlegroups.com, harry...@gmail.com
Ben,


On Wednesday, January 15, 2014 6:22:25 AM UTC-8, Ben Johnson wrote:
The EventStream bulk endpoint helps speed things up significantly. It groups multiple inserts into a single LMDB transaction instead of one transaction per insert. We're currently in the process of moving to LLVM for the backend so code is in flux but improving this bulk endpoint is on my todo list.


Such an interface would be great. Any rough ETA? I can try to help develop, or at least test, any code. Right now I am getting an insert speed of about 500k events/minute, with which I can insert 10 minutes of live data in about 6 minutes. This is OK for now, but I was looking to have a better buffer for traffic spikes. The data is spread among 84 files with 50k to 60k lines each. If I sort each file in parallel, I can sort all 4.7 million lines in about 16 seconds; a single-threaded sort of the entire data set as one file took 2 min 30 sec. Of course this only sorts within the new/incremental data, which is not comparable to what Sky has to do internally. A wc -l on the whole data set (already cached from a previous run) takes 2 seconds. Hence I am looking for ways to help Sky from an external process as much as possible, or for a bulk load feature that can pass multiple events per POST.
 
 
Another way to speed up the import is to specify "-nosync" as a command line argument. By default Sky plays it safe and fsyncs after every transaction, but if you're bulk loading then -nosync will let the OS handle flushing the database to disk every so often.

 
This speeds things up slightly, but I didn't see a huge difference.

BTW, a multi-process load to Sky was visibly faster than a single-process load. I didn't measure exact numbers, but this may have something to do with the client being able to load a CSV/TSV file and format it to JSON while some other process is doing the actual POST to Sky. So I would hope to be able to continue with at least 3 or 4 parallel connections for writes (unless of course I start using Go channels to pipe data via one connection). I am new to Golang, so it will take me a while to get there.
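
What I have in mind for the single-connection version is roughly this (very much a beginner's sketch; the endpoint path and the TSV column layout are placeholders, and error handling is minimal):

package main

import (
    "bufio"
    "fmt"
    "log"
    "net/http"
    "os"
    "strings"
    "sync"
)

// parse reads tab-separated lines from one log file and turns them into
// ready-to-send JSON bodies, so the formatting work overlaps with the
// HTTP round trips done by the single writer below.
func parse(path string, out chan<- string, wg *sync.WaitGroup) {
    defer wg.Done()
    f, err := os.Open(path)
    if err != nil {
        log.Println(err)
        return
    }
    defer f.Close()
    scanner := bufio.NewScanner(f)
    for scanner.Scan() {
        fields := strings.Split(scanner.Text(), "\t")
        if len(fields) < 3 {
            continue
        }
        // Placeholder layout: fields[0]=cookie, fields[1]=timestamp, fields[2]=url.
        out <- fmt.Sprintf(`{"id":%q,"timestamp":%q,"data":{"url":%q}}`,
            fields[0], fields[1], fields[2])
    }
}

func main() {
    events := make(chan string, 1000)
    var wg sync.WaitGroup
    for _, path := range os.Args[1:] {
        wg.Add(1)
        go parse(path, events, &wg) // one parser goroutine per input file
    }
    go func() {
        wg.Wait()
        close(events)
    }()

    // Single writer: one connection posts everything, since LMDB only runs
    // one write transaction at a time anyway.
    for body := range events {
        resp, err := http.Post("http://localhost:8585/tables/logs/events", // placeholder path
            "application/json", strings.NewReader(body))
        if err != nil {
            log.Println(err)
            continue
        }
        resp.Body.Close()
    }
}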

Thanks
--
Harry

Ben Johnson

Jan 16, 2014, 3:51:03 PM
to harry...@gmail.com, sk...@googlegroups.com, harry...@gmail.com
Such an interface would be great. Any rough ETA?

I don't have an exact ETA. I'm guessing it will be in the next two weeks or so. I'll definitely let you know. It's always good to have more people testing.


Right now I am getting about 500k events / minute insert speed

That sounds about in line with what I'm seeing. The benchmarks for LMDB show ~200k writes/sec and we're currently at 8k/sec, so there's a ton of room for improvement:

http://symas.com/mdb/microbench/

The deserialization over HTTP is fairly slow as well. I wrote megajson (https://github.com/benbjohnson/megajson) to improve performance but it hasn't been integrated yet. It might be a good idea to provide another transport besides HTTP for these bulk loads.


This speeds things up slightly, but I didn't see a huge difference.

That's good to know. I haven't done bulk performance testing since the db package rewrite. If you're running SSDs then it probably won't make a noticeable difference.


-- 
Ben