I would appreciate any feedback on what I am trying to do and whether SkyDB is the right tool for the job.
Sky works well for trying to relate events together for a single object (e.g. user). For example, it's good for things like funnel analysis where you want to see users who did action X and then performed action Y. There's some fancier state machine stuff you can do with queries too but it's all within the context of an individual object.
Sky doesn't work well (or at all) for queries that relate objects together. For example, if you wanted to see connections between people then a graph database would work well. That being said, many times you can structure actions between two users as two separate events -- e.g. a friendship request action between two users could be a "send friendship request" event for the sending user and a "receive friendship request" event for the receiving user.
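The two-events-per-action idea above can be sketched as follows. This is a minimal illustration, not Sky's exact API: the endpoint paths in the comments and the field names ("timestamp", "data", "action", "other_user") are assumptions for the example.

```python
# Sketch: modeling one friendship request as two per-user events,
# so each user's object still holds its own complete event history.
import json

ts = "2014-01-15T12:00:00Z"  # same timestamp for both sides of the action

# One event stored under the sending user's object...
sender_event = {
    "timestamp": ts,
    "data": {"action": "send friendship request", "other_user": "bob"},
}
# ...and a mirrored event stored under the receiving user's object.
receiver_event = {
    "timestamp": ts,
    "data": {"action": "receive friendship request", "other_user": "alice"},
}

# Each would be POSTed to its own object's event endpoint, e.g. (paths assumed):
#   POST /tables/users/objects/alice/events   <- sender_event
#   POST /tables/users/objects/bob/events     <- receiver_event
print(json.dumps(sender_event))
print(json.dumps(receiver_event))
```

With this structure, a per-object funnel query like "sent a request, then received one back" stays within a single object's timeline.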
I run copies of my loader, posting each event one POST at a time. I did not find a specific bulk load API.
1. If I have the events within a file already sorted by (browser, event-time), would it help SkyDB?
If you're using the bulk endpoint then it can help some. I wouldn't worry about it too much.
2. Do I need to configure anything, e.g. the number of connections?
There are only a small handful of options that you can specify from the CLI. You probably don't need to set most of them (except maybe -nosync).
https://github.com/skydb/sky/blob/unstable/skyd/main.go#L35
3. Is it better to have 10 parallel processes pushing data or just sequentially via one connection?
One bulk loader endpoint is probably good enough. LMDB uses a single writer so you won't gain much by adding additional connections.
4. Do you suggest having multiple databases and instances of SkyDB on each host to reduce resource contention and use all 24 cores this machine has?
The EventStream bulk endpoint helps speed things up significantly. It groups multiple inserts into a single LMDB transaction instead of one transaction per insert. We're currently in the process of moving to LLVM for the backend so code is in flux but improving this bulk endpoint is on my todo list.
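The transaction-grouping idea can be sketched loader-side like this. The newline-delimited JSON framing and the field names here are assumptions for illustration, not the EventStream endpoint's documented wire format:

```python
# Sketch: batching many events into one request body so the server can
# apply them inside a single LMDB transaction, instead of one
# transaction per individual POST.
import json

events = [
    {"id": "user-%d" % i,
     "timestamp": "2014-01-15T12:00:%02dZ" % (i % 60),
     "data": {"action": "pageview"}}
    for i in range(100)
]

# One body carrying 100 events -- versus 100 separate POSTs.
body = "\n".join(json.dumps(e) for e in events)
print(len(body.splitlines()))  # 100
```

The win is on the server side: one fsync'd commit per batch instead of per event.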
Another way to speed up the import is to specify "-nosync" as a command line argument. By default Sky plays it safe and fsyncs after every transaction but if you're bulk loading then no sync will let the OS handle flushing the database to disk every so often.
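For reference, starting the daemon with that flag would look something like the following; the flag names come from skyd/main.go linked above, while the port value is just illustrative:

```shell
# Run skyd without an fsync after every transaction -- the OS flushes
# the database to disk periodically instead. Use only for bulk loads
# where you can re-run the import if the machine crashes.
skyd -nosync
```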
Such an interface would be great. Any rough ETA?
I don't have an exact ETA. I'm guessing it will be in the next two weeks or so. I'll definitely let you know. It's always good to have more people testing.
Right now I am getting about 500k events/minute insert speed.
That lines up with what I'm seeing. The benchmarks for LMDB show ~200k writes/sec and we're currently at 8k/sec, so there's a ton of room for improvement:
http://symas.com/mdb/microbench/
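As a quick sanity check, the two throughput figures quoted above are consistent with each other:

```python
# 500k events/minute expressed per second, versus LMDB's
# benchmarked ~200k writes/sec from the Symas microbenchmarks.
events_per_minute = 500_000
events_per_second = events_per_minute / 60
print(round(events_per_second))  # 8333 -- i.e. the "8k/sec" figure

lmdb_benchmark = 200_000
print(round(lmdb_benchmark / events_per_second, 1))  # 24.0x headroom
```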
The deserialization over HTTP is fairly slow as well. I wrote megajson (https://github.com/benbjohnson/megajson) to improve performance but it hasn't been integrated yet. It might be a good idea to provide another transport besides HTTP for these bulk loads.
This seems to speed things up slightly, but I didn't see a huge difference.
That's good to know. I haven't done bulk performance testing since the db package rewrite. If you're running SSDs then it probably won't be a noticeable difference.