Some thoughts on ArangoDB (Participation from the community encouraged)


Frank Mayer

unread,
May 4, 2013, 8:35:04 AM5/4/13
to aran...@googlegroups.com
Hi everyone :) ,

After seeing transactions being implemented in the upcoming version 1.3 of ArangoDB, I'd like to propose/discuss the features that I'd like to see next.
These views are meant to be food for thought, from a fellow developer in the ArangoDB community, with the best of intentions. Hopefully this will turn out to be a vibrant discussion for the good of ArangoDB and its community :)

In general, these are the things I would like to see in ArangoDB (and I'd like to believe, others, too): 

Replication:

- multiple replication of data storage, either of a single-server database or of a sharded one
- replication strategy selectable by the user, not forced by the database in any form


Sharding / Scaling:

- Elastic and automatic horizontal scaling, in order to be *limitless* in terms of scalability
- Optional application-agnostic scaling: the application does not need to know where parts of the data are stored; it can talk to any server in the cluster, and the server will check with its peers in order to gather any missing data
- High-availability mode for cluster set-ups (manual set-up, or automatic replication of data)
- Transactions should work in any of the above set-ups, transparently

Certainly nothing of the above is easy to do, but ArangoDB isn't a project that was done easily. A lot of thought and effort has been given to make it what it is, and we all love it.

However, I remember reading this:  
"No large cluster/zero administration/synchronous master-master replication

Our design aim is to achieve zero administration of consistent, synchronous master-master replicating clusters on few servers. The same data is available on all servers per synchronous replication with minimal administrative effort.

We expect that most projects are not becoming the next Amazon and a single node or small cluster fits in 99 percent of the use cases."  

==> here: http://www.arangodb.org/2012/03/07/avocadodbs-design-objectives
  

While this mentions zero administration, which is good, and the "Amazon" example is theoretically acceptable, it practically raises uncertainty when one starts to use ArangoDB in a project. Here are some points I have personally thought of:

- "My project will probably not be the next Amazon, but I would like to be flexible in horizontal scalability, as my data volume might hopefully grow beyond a single server's capability."
- "I want to use lots of smaller (8 GB) machines instead of a huge 64 GB machine."
- "What happens when I need to geographically replicate or shard my data, not yet knowing how big this might become?"
- "Since replication and/or sharding are not implemented yet, can I safely start without them now and shard/replicate later, or will something that I am using right now in ArangoDB restrict me from doing that later?"
- "When will horizontal scalability become available?"

These are some of the points that come to mind when I think about projects in general and how one could implement them in ArangoDB.

The thing is that, these days, database systems like ArangoDB must be able to handle big data and be flexible enough for horizontal expansion of any size, in order to give users the capability of doing so. It might not be needed for every project, but when one starts a project that might (hopefully) grow a lot, they should not hit those limitations on the database back-end. That would probably be a no-go for people who are looking into what to use for their next project.

Finally, as you know from my contributions concerning the server and the PHP client, I am a big fan.
I'd really like to use ArangoDB for every project where it fits, and I look forward to all the amazing stuff you bring to it.

Thanks!!! :)


PS: I'd also like to hear opinions on this from other developers who are using ArangoDB, or who are in the process of evaluating it.


Patrick Mulder

unread,
May 4, 2013, 4:15:27 PM5/4/13
to aran...@googlegroups.com
Hi Frank,

thanks for sharing this.

From a Ruby perspective, I think an important point is application deployment (and dealing with/skipping schema migrations in this context). In particular, I am thinking of deployment to Heroku, where I test some ideas/prototypes or host some simple static-site/CMS-type projects. So an easy connection from Heroku to ArangoDB would be very interesting to me - and, looking at the list of Heroku addons https://addons.heroku.com/ - probably to others too.

Apart from this, I think data import/export is always interesting, and I am wondering if others think that MRuby would give a nice context for transporting data in and out of the DB.

Last but not least, the Foxx project could be interesting, especially for fast CMS-style data services (just an admin backend, but hosting multiple sites - a bit similar as an entry point to http://www.locomotivecms.com/ - which is MongoDB-based).

Well, so far my thoughts. Maybe helpful.

Cheers,

Patrick

--
You received this message because you are subscribed to the Google Groups "ArangoDB" group.
To unsubscribe from this group and stop receiving emails from it, send an email to arangodb+u...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Frank Mayer

unread,
May 4, 2013, 4:42:21 PM5/4/13
to aran...@googlegroups.com
Hi Patrick,

thanks for your input.

Yes I find the Ruby / Heroku points very interesting, too. It would probably be very good to have an ArangoDB addon for Heroku.

Both JavaScript and mRuby could be used to manipulate data during import/export.
I believe that would be easier to do once this issue (the part with file.io) is solved => https://github.com/triAGENS/ArangoDB/issues/14 (you have already commented on this one ;) )

Your LocomotiveCMS example is something that, I believe, could already be possible. :)


Kind regards,
Frank

Frank Mayer

unread,
May 8, 2013, 12:25:45 PM5/8/13
to aran...@googlegroups.com
Another thing that I forgot to mention in my first post is a backup implementation.

I imagine it could be in the form of a replica (either of a single server or of a cluster of shards) that can be backed up either via some logical snapshot, or by taking it out of the replication set, backing it up, and re-adding it to the set again.
Any other ideas on that from the community?

Kind regards,
Frank

F21

unread,
May 8, 2013, 7:51:28 PM5/8/13
to aran...@googlegroups.com
All very good points.
 
The above is one of the reasons why I moved to ArangoDB from OrientDB after spending/wasting many months on OrientDB. OrientDB looked good on paper and seems to have a decent number of features, but the REST interface was pretty buggy, and there were small problems here and there that made it unsuitable for me to use, as I had to bring in Rexster (which had its own limitations). Making the move to ArangoDB was the right decision for me, because bugs are fixed and features implemented really fast - sometimes on the same day. See https://github.com/triAGENS/ArangoDB/issues/512.
 
I myself am most interested in sharding and backups. I think there's work being done to export data from the database (https://github.com/triAGENS/ArangoDB/issues/105), but since it seems it needs to read and export all the data out of the database's files, I would suspect that it's not really cheap. I would love to have some way to work at a lower level - for example, a small script or program that can just copy all the data files from the filesystem (while the server is running) without causing any corruption. That should be a pretty efficient process for backups.
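A low-level backup along those lines could be sketched roughly as follows. This is purely a hypothetical illustration, not an ArangoDB tool: it assumes (as with append-only datafiles) that sealed files never change, and re-hashes each source file after copying to detect any file that was appended to mid-copy.

```python
import hashlib
import os
import shutil

def _sha256(path):
    """Hash a file in chunks so large datafiles don't need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def backup_datafiles(src_dir, dst_dir):
    """Copy all datafiles from a (hypothetically) running server's directory.

    Files whose content changed during the copy are discarded and left
    for a later pass, so the backup never contains a torn file.
    Returns the names of the files that were copied cleanly.
    """
    os.makedirs(dst_dir, exist_ok=True)
    copied = []
    for name in sorted(os.listdir(src_dir)):
        src = os.path.join(src_dir, name)
        if not os.path.isfile(src):
            continue
        dst = os.path.join(dst_dir, name)
        before = _sha256(src)
        shutil.copy2(src, dst)
        # Re-hash the source: a mismatch means it was written to mid-copy.
        if _sha256(src) == before:
            copied.append(name)
        else:
            os.remove(dst)
    return copied
```

Whether this is actually safe depends on the server's on-disk format, which is why a supported tool from the ArangoDB side would be preferable.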
 
In regards to sharding and scaling, I am really impressed with Elasticsearch's model. In Elasticsearch, we define a number of shards and then a number of replicas (backups) for those shards. I can easily set up a machine by changing a few config params for the server and then connecting it to the network. Once connected, the cluster automatically discovers the node, and if the cluster name matches, the node is allowed to join and the cluster automatically rebalances the data across the new node. This is a really awesome and intelligent process, because the developer no longer has to manually balance/shard the data.
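The rebalancing idea can be pictured with a toy sketch (this is not Elasticsearch's actual allocation algorithm, just the round-robin intuition behind it):

```python
from collections import defaultdict

def rebalance(shards, nodes):
    """Spread shard ids evenly across nodes, round-robin.

    Toy model of what happens when a node joins an Elasticsearch-style
    cluster: the same set of shards is redistributed so that no node
    holds more than its fair share.
    """
    assignment = defaultdict(list)
    for i, shard in enumerate(sorted(shards)):
        assignment[nodes[i % len(nodes)]].append(shard)
    return dict(assignment)
```

Running it with six shards over two nodes gives three shards per node; add a third node and the same call yields two per node - the "automatic rebalance on join" effect described above.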
 
With shards and replicas, there probably isn't really a need for specific (master/master or master/slave) replication anymore, because clients can hit any machine in the cluster and it will provide results.
 
I think this is probably the best way to allow elastic scaling and sharding and store "big data".
 
Cheers,
Francis

Frank Mayer

unread,
May 8, 2013, 9:35:41 PM5/8/13
to aran...@googlegroups.com
Yeah, we're on the same page ;)

Elasticsearch's model looks really good. Maybe the team can look into the inner workings of that, and implement something like this or even better? :)

I, too, started with OrientDB, but for just about the same reasons as you I moved to ArangoDB as soon as I found out about its amazing features and, of course, the team and inspiration behind it.

I believe that if ArangoDB already offered elastic sharding/clusters and backups today, in addition to the transactions it will have in 1.3, it would rank really high in people's consideration.
As I mentioned in my first post in this thread, in today's world it's all about being elastic/expandable, and database systems should not impose any limits. For example, a lot of people chose MongoDB because it has some nice NoSQL features and it is relatively easily expandable through sharding. So in the end, although MongoDB has limitations elsewhere - where ArangoDB really shines (for example transactions, the REST interface, AQL, and much more great stuff the team has put into it) - people need the expandability, so they will unfortunately not choose ArangoDB at this time.
Making elastic clusters/sharding available is for me definitely the next major feature that should be developed for ArangoDB. It will dispel the skepticism and open ArangoDB up to a much broader community.

Cheers,
Frank
Message has been deleted

F21

unread,
May 9, 2013, 12:23:47 AM5/9/13
to aran...@googlegroups.com
Argh! Why is Google Groups deleting my posts randomly?
 
Can an admin please restore it? It was quite a huge chunk, so it will be quite annoying to type again :(
Message has been deleted
Message has been deleted

F21

unread,
May 9, 2013, 9:50:12 PM5/9/13
to aran...@googlegroups.com, f21.g...@gmail.com
@frankmayer: thanks :) I am now ccing everything to myself, just in case it happens again.
 
My original post:
I think replication, sharding, and scaling are targeted for 2.0 (at least from what I have read and from bits of conversations here and there). It is not a matter of whether it will be available, but when :)
 
Given that the ArangoDB devs work really fast, I would expect to see it available soon, but the question is: How far away is 2.0? Would it come after 1.4? Or do we have 1.5, 1.6, etc? An updated roadmap would be really cool.
 
For me, initially, I chose OrientDB because it already had replication and whatnot. But then my conclusion was that I don't need replication and sharding now, as my application is still small; I just need a good backup solution to back up the data. Then, as my application grows and I need to scale, ArangoDB will have gained those features by the time I need them. I have not run ArangoDB in production yet, but I suspect that most people are in the same boat: they can run with just one ArangoDB server, and by the time they need to scale, ArangoDB will already have those features.
 
Of course, if a big company already has those requirements and they feel that ArangoDB has the potential to help them solve their problems, I am sure they will be happy to contribute some resources to implement those things into ArangoDB :)
 
So, this is probably more directed at Jan: Can we get an updated roadmap for ArangoDB? Roughly when can we expect replication or sharding to be available?
 
Cheers!
Francis

Frank Mayer

unread,
May 9, 2013, 10:15:18 PM5/9/13
to aran...@googlegroups.com, f21.g...@gmail.com
Yes, it says 2.0 on the roadmap, so I guess it's still on for that version.

My fear is just that we might build an application using some of the great ArangoDB features, that might bite us when we want to scale out horizontally. That could be anything or nothing.

It could be, for example, that transactions only work locally and not across the cluster. That would mean that one would possibly have to keep a lot of connected data on one instance. That might be a problem.
Or let's say we have a graph (either managed through the graph module, or a simpler but possibly more flexible one). What if this doesn't scale horizontally? Again, it's a problem.

My point is that those are the uncertainties that bug me personally, and also other people planning something.

And it's not that I am the big company that's going to launch the next big thing, but as I wrote earlier,
when you start a project it might or might not take off. If it takes off, you have to be sure that your database back-end can grow with you and not become a problem.
This is of the greatest importance.
Of course, if I write some next big thing and it takes off, I'd surely contribute a lot to ArangoDB itself. If I am able to make some profit, I am happy to use some of it to give back.
After all, I have already made a lot of contributions to it, and a lot more to ArangoDB-PHP, as I believe a lot in contributing to open source :)

Yes, Frank or Jan will probably be the persons most suited to shed some light on the road ahead.
This thread was merely intended to get a conversation going, and also to provide some feedback/ideas to the dev team about what the community would really like to have in ArangoDB. The dev team of course has to weigh things and decide, but I think that input from us is important to the development of the project.

Cheers,
Frank

Frank Celler

unread,
May 13, 2013, 9:04:03 AM5/13/13
to aran...@googlegroups.com, f21.g...@gmail.com
We will start with asynchronous master-slave replication for 1.4. This has at least the following advantages:

- It is a good fit for most use cases.
- It will allow us to implement backup as "slave".
- It easily gives you redundancy by setting up multiple instances.
- It gives you read-scaling.

There are also drawbacks. For example, you need to manually select and switch masters in case of fail-over. However, restricting ourselves to a simple solution (which is still hard enough to implement) should allow us to release V1.4 this summer. If you think about MySQL, you will see that in most cases a master-slave replication is sufficient.
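As an illustration only (a generic model, not ArangoDB's actual replication protocol), asynchronous master-slave replication boils down to an operation log on the master that slaves apply at their own pace:

```python
class Master:
    """Master keeps the data plus an append-only operation log."""

    def __init__(self):
        self.data = {}
        self.oplog = []  # list of (key, value) write operations

    def write(self, key, value):
        self.data[key] = value
        self.oplog.append((key, value))


class Slave:
    """Slave polls the master's log and applies entries it hasn't seen.

    Because polling happens asynchronously, the slave may lag behind
    the master between polls - the trade-off of this scheme.
    """

    def __init__(self):
        self.data = {}
        self.applied = 0  # index of the next oplog entry to apply

    def poll(self, master):
        for key, value in master.oplog[self.applied:]:
            self.data[key] = value
        self.applied = len(master.oplog)
```

The read-scaling and backup-as-slave advantages listed above fall out of this model: any number of slaves can poll the same log, and each one is a full copy of the data.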

The next step will be master-master replication. This, however, requires more complex protocols like Paxos to elect a master, and at least three nodes. We have to decide whether this will be in version 1.5 or maybe already in 2.0. We have to see how much has to be changed.
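The reason at least three nodes are needed can be shown with a tiny sketch of the quorum rule shared by Paxos-style protocols (this is only the majority idea, not a full Paxos implementation):

```python
from collections import Counter

def elect_leader(votes, cluster_size):
    """Return the candidate holding a strict majority of votes, else None.

    With three nodes, two agreeing votes suffice even if one node is
    down, and two rival masters can never both reach a majority - the
    property that makes fail-over safe.
    """
    majority = cluster_size // 2 + 1
    for candidate, count in Counter(votes).items():
        if count >= majority:
            return candidate
    return None
```

With a two-node cluster there is no way to break a tie, which is exactly why the minimum is three.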

We do not want to go the road Riak has chosen, scaling out without limits.
In our experience there really is a trade-off between "scaling" and "querying". If you scale massively, then you basically restrict yourself to key/value queries. That is what Riak is extremely good at. That is not what we are aiming at. We want to replicate to a moderate number of computers and assume that most of the time the whole dataset fits onto a single computer. With SSDs and RAM getting cheaper and cheaper, this is not a totally unreasonable assumption. However, there will be cases where we want to shard the data. If we do this, it will restrict the possible queries for that collection in some way - maybe restricting complex queries to one shard, or only allowing very simple queries on sharded collections. The same is true for transactions.

As Frank has pointed out, there will be problems if, for instance, you implement something based on the graph features and you become the next Twitter. However, if this happens, no automagic sharding or replication will help you. Amazon changed their view of the world and threw away landmarks like consistency so that they could put their vast amount of data into Dynamo. But this requires careful planning of the application. If you have that much data, then you end up with Dynamo or Riak or some other massively distributed key-value store.

With our replication and sharding features, we would like to implement something useful for a lot of cases, but not for extreme ones. For instance, master-master replication makes automatic fail-over much easier. The setup will be more complex, because you need at least three nodes and you have to deal with inconsistencies. Various approaches exist: CouchDB keeps all the document versions, Couchbase only the newest; Riak has vector clocks and CRDTs (conflict-free replicated data types) for some types. If we use conflict documents, then applications need to be aware of them and handle them accordingly. Our idea here is, again, to start with something straightforward (e.g. last-write-wins) and later offer an alternative (e.g. conflict documents). The simple solution should be helpful in most cases; harder cases can be dealt with; extreme cases, however, require support from the application.
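The two strategies can be contrasted in a few lines. This is a hypothetical sketch using a numeric revision stamp per document, not ArangoDB's actual data model:

```python
def merge_last_write_wins(local, remote):
    """Resolve a replication conflict by keeping the newer revision.

    Each document carries a numeric '_rev' stamp; ties go to the
    remote copy. The returned loser is what the 'conflict document'
    alternative would keep around for the application to inspect,
    instead of silently discarding it.
    """
    if local["_rev"] > remote["_rev"]:
        winner, loser = local, remote
    else:
        winner, loser = remote, local
    return winner, loser
```

Last-write-wins needs nothing from the application; the conflict-document variant only changes what happens to the loser, which is why starting simple and adding the alternative later seems feasible.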

Hope that clarifies some points from our roadmap
  Frank

Frank Mayer

unread,
May 14, 2013, 9:34:36 PM5/14/13
to aran...@googlegroups.com, f21.g...@gmail.com
Hi Frank, thanks for your feedback. I have a few questions I'd like to follow up on your post. Will get back to you in a few days.

Dobrosław Żybort

unread,
Jun 10, 2013, 11:28:02 AM6/10/13
to aran...@googlegroups.com
+1 for the Elasticsearch or HDFS model - it's 2-in-1: scaling + backup.

Not having scaling capabilities in a pretty new DB in today's world of "big data everywhere" looks strange.

Best regards,
Dobrosław Żybort

Frank Mayer

unread,
Jun 10, 2013, 5:39:20 PM6/10/13
to aran...@googlegroups.com
I wanted to reply earlier to Frank's post, but didn't get to it :D However, Dobrosław's post reminded me of it...

I think his phrase

'Not having scaling capabilities in a pretty new DB in today's world of "big data everywhere" looks strange.'

actually does "nail" it.

While I realize that good horizontal scaling is not an easy task to accomplish, I strongly believe that without it, ArangoDB will not gain the popularity that it rightfully deserves. After all, it's a very well designed piece of software.

Even if strong servers, SSDs, etc. become cheaper, in terms of scalability, load-balancing, and resilience this model can't beat lots of low-cost machines. And that's not possible with replication only, or with only a handful of shards.

And while big data is big - and will be getting even bigger very fast in the near future - most application developers out there will want to be on the safe side.

Please correct me if I am wrong, but all the signs I am reading in big-data development on the web seem to say that the "extreme cases" you refer to are becoming more and more "the new normal".
A big-data database like ArangoDB should at least support the "normal", whatever that was yesterday, is today, or will be tomorrow.

While the Amazon example is true, it's really only true for that point in time and with their criteria (economic or otherwise). As technology quickly evolves, new approaches can, and will, be chosen.

I mean, the biggest thing that is evidently on people's minds right now is, mostly, scalability. Just look at the number of developers that use MongoDB, even with its shortcomings, just because it's relatively easy to handle and scalable.

I would very much like to see ArangoDB shine in that department. ArangoDB is well designed, and this great design should really carry over into the scalability department :)
And it would be a pity if not having those capabilities held ArangoDB back from success in the big-data world.

Best regards,
Frank Mayer

Dobrosław Żybort

unread,
Jun 11, 2013, 5:53:29 AM6/11/13
to aran...@googlegroups.com
> While I realize that good horizontal scaling is not an easy task to accomplish, I strongly believe that without it, ArangoDB will not gain the popularity that it rightfully deserves. After all, it's a very well designed piece of software.
>
> Even if strong servers and SSDs etc. become cheaper, in terms of scalability, load-balancing and resilience, this model can't beat lots of low-cost machines. And that's not possible with replication only, or only a handful of shards.

+1. Today people even build clusters out of machines with Intel Atom processors.
Personally, I would prefer this way of operating instead of making simple backups.
One disk died? No problem - we still have two more copies in our cluster, so the data is still accessible (I don't need to do anything to restore data from a backup), and when the cluster finds out that one disk is dead (or that some data has only two copies), it should auto-balance/make more copies of the under-replicated data.

> And while big data is big - and will be getting even bigger very fast in the near future - most application developers out there will want to be on the safe side.
>
> Please correct me if I am wrong, but all the signs I am reading in big-data development on the web seem to say that the "extreme cases" that you refer to are becoming more and more "the new normal".
> A big-data database like ArangoDB should at least support the "normal", whatever that was yesterday, is today or will be tomorrow.

+1. There will only be more data every day, no matter what or how we feel about it.

> While the Amazon example is true, it's really only true for that point in time and with their criteria (economic or otherwise). As technology quickly evolves, new approaches can, and will, be chosen.
>
> I mean, the biggest thing that is evidently on people's minds right now is, mostly, scalability. Just look at the number of developers that use MongoDB, even with its shortcomings, just because it's relatively easy to handle and scalable.

And with distribution at their core, new databases are appearing today, like Titan[1] (a graph DB) or RethinkDB[2].


[1] http://thinkaurelius.github.io/titan/

Titan is a scalable graph database optimized for storing and querying graphs containing hundreds of billions of vertices and edges distributed across a multi-machine cluster. Titan is a transactional database that can support thousands of concurrent users executing complex graph traversals.

In addition, Titan provides the following features:

  • Elastic and linear scalability for a growing data and user base.
  • Data distribution and replication for performance and fault tolerance.
  • Multi-datacenter high availability and hot backups.
    ...
[2] http://www.rethinkdb.com/
RethinkDB is built to store JSON documents, and scale to multiple machines with very little effort.
 
I suggest viewing the 2-minute video at http://www.rethinkdb.com/videos/what-is-rethinkdb/
It shows the nice RethinkDB admin panel and how easy it is to, e.g., set the number of shards/replicas.

Best regards,
Dobrosław Żybort

Frank Mayer

unread,
Jun 11, 2013, 6:27:42 AM6/11/13
to aran...@googlegroups.com
FYI, there is also a ~14 min video that shows a bit more: http://www.rethinkdb.com/screencast/

I like their scaling approach a lot. If ArangoDB could combine that with its already great existing capabilities, it would be a huge success! :)
Best regards,
Frank

Claudius Weinberger

unread,
Jun 11, 2013, 6:57:26 AM6/11/13
to aran...@googlegroups.com
Have you also seen this in the RethinkDB FAQs?
"How can I understand the performance of slow queries?
Understanding query performance currently requires a pretty deep understanding of the system. For the moment, the easiest way to get an idea of why your query isn't performing well is to ask us."

This is not, and will not be, a temporary problem. If you support complex queries, joins, graphs, and things like that, you will get performance issues if you shard over too many machines, especially if you shard automatically. And yes, we have had a deep look inside RethinkDB.

ArangoDB will support sharding and if you like you can shard over as many machines as you like. 

But if you choose automatic sharding and too many machines, you will get performance issues with certain queries. For us, it is way better to point out the possible problems and implications beforehand, instead of hiding them behind some obscure FAQ entries.

What do you prefer: the truth, or marketing slogans?
We believe the truth is much better. It would not be helpful to say "you can do everything" and then, once you reach a critical mass, have you encounter performance "problems" that cannot be solved because your queries are too complex. If you know exactly what the dependencies between sharding, scaling, transactions, and queries are, you can design your architecture in such a way that you avoid performance issues.

--

Frank Mayer

unread,
Jun 11, 2013, 7:37:01 AM6/11/13
to aran...@googlegroups.com
Claudius, of course I prefer your way, the actual truth! No doubt about that :)

RethinkDB was an example, as was MongoDB before it, because these are databases that people also know and may have used.
I still prefer ArangoDB over any of those, because I believe you have done amazing work with it, and I believe there are a lot more fine capabilities to come.
However - and this was the initial spark when I started this thread - there were some concerns (as also turned out from others' comments in this thread) about ArangoDB's scalability.

We might have misunderstood Frank's example with SSDs and powerful servers as the primary goal of ArangoDB's scaling.

Your phrase "ArangoDB will support sharding and if you like you can shard over as many machines as you like. " actually cleared things up for me personally. :D 

Sharding is of course not the solution to all problems, but I just want to be sure that I can take advantage of it and mix it with replication if I want - or rather, need - to, whether region-specific (data-center-aware replication) or otherwise.
And you're right: of course one needs in-depth application- and database-specific knowledge in order to manage scaling correctly. 100% agreed! :D


Best regards,
Frank
Message has been deleted

Dobrosław Żybort

unread,
Jun 14, 2013, 5:39:10 PM6/14/13
to aran...@googlegroups.com
> ArangoDB will support sharding and if you like you can shard over as many machines as you like.
 
It wasn't obvious from Frank Celler's mail (or maybe that was my fault for not understanding it properly).
Thank you for the clarification.

Best regards,
Dobrosław Żybort

F21

unread,
Jun 26, 2013, 6:54:15 AM6/26/13
to aran...@googlegroups.com
Great discussion guys! ArangoDB heading down the sharding path is definitely the way I would love to see it going.
 
In regards to sharding not being automatic, I fully agree. If you are sharding across thousands of machines, then things can become slow as data is stored on different machines. However, at such a scale, there should be expertise available to manually "shard" the data.
 
I think ArangoDB could take some inspiration from Elasticsearch's routing: http://www.elasticsearch.org/blog/customizing-your-document-routing/
 
If having data spread across many shards poses a performance problem, the application developer can define a routing key for each document when saving. This ensures that only one shard is queried when a query is run (really fast!). You can then put multiple replicas in place to guard against data failure.
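The mechanism is easy to picture - a sketch of hash-modulo routing in general, not Elasticsearch's exact hash function:

```python
import zlib

def shard_for(routing_key, num_shards):
    """Map a routing key to one shard by hashing it modulo the shard count.

    Every document saved with the same key lands on the same shard, so
    a query that supplies the key only has to touch that single shard
    instead of fanning out to the whole cluster.
    """
    return zlib.crc32(routing_key.encode("utf-8")) % num_shards
```

For example, routing all of one user's documents by their user id keeps that user's data on one shard, which is exactly the "only 1 shard is queried" effect described above.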