LDBC Social Network Benchmark Implementation


Jonathan Ellithorpe

Jul 9, 2019, 8:06:18 PM
to ArangoDB
Hello All,

Has anyone worked on an implementation of the LDBC Social Network Benchmark for ArangoDB?

I see some folks here evidently struggling with ArangoDB performance on even very simple queries (e.g. https://groups.google.com/forum/#!topic/arangodb/sIOQ1xzJSpc), as well as with how to efficiently bulk load graph data (e.g. https://groups.google.com/forum/#!topic/arangodb/4eI3fvUzDYg).

An implementation of the above-mentioned benchmark should serve nicely to show how to use ArangoDB and AQL performantly, including how to bulk load graph data, in addition to demonstrating ArangoDB's performance capabilities.
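
For the bulk loading piece in particular, I'd imagine something along the lines of the sketch below, using the Java driver's batch import API (the collection names "Person" and "knows" and the attributes are placeholders I made up, not taken from any existing loader):

  // Hypothetical bulk-load sketch using the ArangoDB Java driver's
  // importDocuments() batch API; collection names and fields are
  // illustrative only.
  import com.arangodb.ArangoDB;
  import com.arangodb.ArangoDatabase;
  import com.arangodb.entity.BaseDocument;
  import com.arangodb.entity.BaseEdgeDocument;

  import java.util.ArrayList;
  import java.util.List;

  public class BulkLoadSketch {
    public static void main(String[] args) {
      ArangoDatabase db = new ArangoDB.Builder().build().db("ldbc");

      // Batch vertices into one request instead of one insert per document.
      List<BaseDocument> people = new ArrayList<>();
      for (long id = 0; id < 10_000; id++) {
        BaseDocument p = new BaseDocument(String.valueOf(id));
        p.addAttribute("firstName", "Person" + id);
        people.add(p);
      }
      db.collection("Person").importDocuments(people);

      // Edges reference their endpoints by full _id ("collection/key").
      List<BaseEdgeDocument> knows = new ArrayList<>();
      knows.add(new BaseEdgeDocument("Person/0", "Person/1"));
      db.collection("knows").importDocuments(knows);
    }
  }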

Jonathan




Jan Stücke

Jul 10, 2019, 12:06:23 AM
to aran...@googlegroups.com
Hi Jonathan,

this is Jan from ArangoDB.

Thanks for the pointer to the LDBC benchmark. We will have a look at whether it is a suitable setup for ArangoDB. Quite often these benchmarks focus on RDF stores, whereas the graph part of ArangoDB's multi-model offering follows a property graph model.

I have forwarded the bulk load question you mentioned to our Java specialist and hope he will find some time to assist here.

Please note that the problem with the "very simple query" wasn't necessarily on ArangoDB's side and was solved by remodeling the data. The user was storing huge binaries in ArangoDB. That is possible, but it's recommended to store them in a way that allows fast queries on the metadata and only accesses the binary data when necessary. For example, if you store pictures, PDFs, or similar blobs, we recommend keeping the metadata in collection A and the actual blob in collection B if you want to store both in ArangoDB. If you store everything in one big JSON document, a query against it has to load the whole document at runtime, which causes a lot of unneeded processing and increases query runtime.

The recommended way from our side for best performance in these cases is to store the metadata in ArangoDB and keep your binary data on a dedicated filesystem.
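
To make that concrete, here is a minimal sketch of the collection split with the Java driver (collection and field names are purely illustrative):

  // Sketch of the metadata/blob split; "imageMeta", "imageBlobs" and all
  // field names are illustrative, not a fixed schema.
  import com.arangodb.ArangoDB;
  import com.arangodb.ArangoDatabase;
  import com.arangodb.entity.BaseDocument;

  public class BlobSplitSketch {
    public static void main(String[] args) {
      ArangoDatabase db = new ArangoDB.Builder().build().db("example");

      // Small, queryable metadata lives in its own collection...
      BaseDocument meta = new BaseDocument("pic-42");
      meta.addAttribute("fileName", "cat.png");
      meta.addAttribute("contentType", "image/png");
      meta.addAttribute("blobKey", "blob-42"); // pointer to the heavy part
      db.collection("imageMeta").insertDocument(meta);

      // ...while the large payload sits in a separate collection (or,
      // better, on a dedicated filesystem with only a path stored here).
      BaseDocument blob = new BaseDocument("blob-42");
      blob.addAttribute("data", "<base64-encoded bytes>");
      db.collection("imageBlobs").insertDocument(blob);

      // Metadata queries never have to load the blob documents:
      // FOR m IN imageMeta FILTER m.contentType == 'image/png' RETURN m
    }
  }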

Hope that helped.

Best, Jan

--

Jan Stücke
Head of Communications

Jonathan Ellithorpe

Jul 10, 2019, 1:44:32 AM
to ArangoDB
Hi Jan,
ldbc_snb_schema.png

Thanks for that explanation; it does help. I'm glad the issue got resolved (I haven't seen that thread updated with the resolution yet, though).


The LDBC Social Network Benchmark is actually more property-graph focused. I've attached an image of the graph schema to illustrate.


While the schema is relatively straightforward, the benchmark is fairly comprehensive and challenging, comprising a total of 29 queries: 14 complex "analytical" read-only queries, 7 simple read-only queries, and 8 update queries that add people, posts, likes, and so on to the graph.


I have a working implementation for Neo4j (as well as for my own graph database, which I've been working on as a research project) in the following repo:




I just added a skeleton for an ArangoDB implementation. Since I'm not familiar with AQL (I just started playing around with it today), I estimate a full implementation would take me considerable time. I may be able to flesh out the simpler short read queries and updates in a couple of days, but the 14 "analytical"-style complex queries are where things get... well... complicated. The hard part is doing the target database justice and making sure each query is written in the most performant way possible. Even with the gracious help of the (amazing) developers at Apache TinkerPop (many thanks to them), getting a Gremlin implementation just to pass validation took about a man-month of work (including learning Gremlin), and then another week or two on top of that to work out inefficiencies in the query implementations.


I'd be happy to collaborate on this, as I've already been working with this benchmark for quite a while and have datasets (up to 1 TB in size) available for use, along with various tools and validation data for testing. What I do not have, however, is the ArangoDB / AQL expertise to produce the highest-performance complex query implementations possible (the short read and update queries are simple enough that I believe I can work those out fairly easily).


Cheers,

Jonathan





Jan Stücke

Jul 10, 2019, 12:55:34 PM
to aran...@googlegroups.com
Hey Jonathan,

OK, this sounds very interesting! Super cool pre-work; that helps a lot.

I'd be happy to collaborate with you on this, but I have to check with our graph specialists first. I don't want to promise anything and then find our guys fully booked with customer projects and product development.

I'm happy to keep this thread alive and post updates here for everybody, but for the details we could switch to email; you can reach me via jan.s...@arangodb.com. It would be great if you could send me the analytical queries and the number of documents per collection (persons, tags, etc.) in your 1 TB dataset, so I can discuss them with our seniors over here.

Best, Jan


Jonathan Ellithorpe

Jul 10, 2019, 4:14:53 PM
to ArangoDB
Hi Jan,

Yup, completely understand. I'll send you the details you asked about over e-mail later today. 

Best,
Jonathan

Jonathan Ellithorpe

Jul 12, 2019, 12:24:05 AM
to ArangoDB
Hi All,

I've hammered out two of the simple read queries from the benchmark. I thought I would share them and ask for some early feedback, to make sure I'm not missing any obvious query performance optimizations. The graph schema for all of this is attached (same as before):


ldbc_snb_schema.png





ShortQuery1:

  /**
   * Given a start Person, retrieve their first name, last name, birthday, IP
   * address, browser, and city of residence.[1]
   */
  ...
      ArangoDatabase db = ((ArangoDbConnectionState) dbConnectionState).getDatabase();
      // WITH declares the Place collection, which is reached only through
      // the traversal, so AQL knows about it up front.
      String statement =
          "WITH Place"
          + " FOR p IN Person"
          + " FILTER p._key == @personId"          // primary index lookup
          + "   FOR c IN 1..1 OUTBOUND p isLocatedIn"
          + " RETURN {"
          + "   firstName: p.firstName,"
          + "   lastName: p.lastName,"
          + "   birthday: p.birthday,"
          + "   locationIP: p.locationIP,"
          + "   browserUsed: p.browserUsed,"
          + "   cityId: c._key,"
          + "   gender: p.gender,"
          + "   creationDate: p.creationDate"
          + "  }";

      ArangoCursor<BaseDocument> cursor = db.query(
          statement,
          new MapBuilder()
              .put("personId", String.valueOf(operation.personId()))
              .get(),
          new AqlQueryOptions(),
          BaseDocument.class);

      if (cursor.hasNext()) {
        BaseDocument doc = cursor.next();

        resultReporter.report(0,
            new LdbcShortQuery1PersonProfileResult(
                (String) doc.getAttribute("firstName"),
                (String) doc.getAttribute("lastName"),
                (Long) doc.getAttribute("birthday"),
                (String) doc.getAttribute("locationIP"),
                (String) doc.getAttribute("browserUsed"),
                Long.decode((String) doc.getAttribute("cityId")),
                (String) doc.getAttribute("gender"),
                (Long) doc.getAttribute("creationDate")),
            operation);
      } else {
        // No such person: report an empty result.
        resultReporter.report(0, null, operation);
      }


ShortQuery2:

  /**
   * Given a start Person, retrieve the last 10 Messages (Posts or Comments)
   * created by that user. For each message, return that message, the original
   * post in its conversation, and the author of that post. If any of the
   * Messages is a Post, then the original Post will be the same Message, i.e.,
   * that Message will appear twice in that result. Order results descending by
   * message creation date, then descending by message identifier.[1]
   */
  ...
      ArangoDatabase db = ((ArangoDbConnectionState) dbConnectionState).getDatabase();
      // Walk the replyOf chain (0..1024 hops) from each message toward the
      // root of its conversation; the IS_SAME_COLLECTION filter keeps only
      // the root Post (depth 0 covers the case where the message is a Post).
      String statement =
          "WITH Comment, Post"
          + " FOR person IN Person"
          + " FILTER person._key == @personId"
          + "   FOR message IN 1..1 INBOUND person hasCreator"
          + "     SORT message.creationDate DESC, message._key DESC"
          + "     LIMIT @limit"
          + "     FOR originalPost IN 0..1024 OUTBOUND message replyOf"
          + "       FILTER IS_SAME_COLLECTION('Post', originalPost._id)"
          + "         FOR originalPostAuthor IN 1..1 OUTBOUND originalPost hasCreator"
          + " RETURN {"
          + "   messageId: message._key,"
          + "   messageContent: message.content,"
          + "   messageImageFile: message.imageFile,"
          + "   messageCreationDate: message.creationDate,"
          + "   originalPostId: originalPost._key,"
          + "   originalPostAuthorId: originalPostAuthor._key,"
          + "   originalPostAuthorFirstName: originalPostAuthor.firstName,"
          + "   originalPostAuthorLastName: originalPostAuthor.lastName"
          + "  }";

      ArangoCursor<BaseDocument> cursor = db.query(
          statement,
          new MapBuilder()
              .put("personId", String.valueOf(operation.personId()))
              .put("limit", Integer.valueOf(operation.limit()))
              .get(),
          new AqlQueryOptions(),
          BaseDocument.class);

      List<LdbcShortQuery2PersonPostsResult> resultList = new ArrayList<>();

      while (cursor.hasNext()) {
        BaseDocument doc = cursor.next();

        // Messages carry text in "content" or, for photos, in "imageFile".
        String content = (String) doc.getAttribute("messageContent");
        if (content == null) {
          content = (String) doc.getAttribute("messageImageFile");
        }

        resultList.add(new LdbcShortQuery2PersonPostsResult(
            Long.valueOf((String) doc.getAttribute("messageId")),
            content,
            (Long) doc.getAttribute("messageCreationDate"),
            Long.valueOf((String) doc.getAttribute("originalPostId")),
            Long.valueOf((String) doc.getAttribute("originalPostAuthorId")),
            (String) doc.getAttribute("originalPostAuthorFirstName"),
            (String) doc.getAttribute("originalPostAuthorLastName")));
      }

      resultReporter.report(0, resultList, operation);
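
In case it helps with feedback: to check whether these queries hit the indexes I hope they do, the Java driver can ask the server for an execution plan. A minimal sketch for ShortQuery1 (the database name "ldbc" and the bind value are placeholders):

  // Sketch: fetch ShortQuery1's execution plan via explainQuery();
  // "ldbc" and the personId value are placeholders.
  import com.arangodb.ArangoDB;
  import com.arangodb.ArangoDatabase;
  import com.arangodb.entity.AqlExecutionExplainEntity;
  import com.arangodb.model.AqlQueryExplainOptions;
  import com.arangodb.util.MapBuilder;

  public class ExplainShortQuery1 {
    public static void main(String[] args) {
      ArangoDatabase db = new ArangoDB.Builder().build().db("ldbc");
      String statement =
          "WITH Place"
          + " FOR p IN Person FILTER p._key == @personId"
          + "   FOR c IN 1..1 OUTBOUND p isLocatedIn"
          + " RETURN { cityId: c._key }";
      AqlExecutionExplainEntity plan = db.explainQuery(
          statement,
          new MapBuilder().put("personId", "933").get(),
          new AqlQueryExplainOptions());
      // An IndexNode (rather than EnumerateCollectionNode) for the Person
      // FILTER indicates the _key lookup uses the primary index.
      plan.getPlan().getNodes()
          .forEach(node -> System.out.println(node.getType()));
    }
  }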


Thanks in advance!

Best,
Jonathan

Jonathan Ellithorpe

Jul 13, 2019, 8:20:20 PM
to ArangoDB
OK, I've finished all the short read queries and all the update queries. 

You can find them all in the GitHub repo below, which includes tools for performance testing ArangoDB on the benchmark's queries.


And you can find the SF0001 dataset for download below; it includes a script for loading the dataset into an ArangoDB cluster or single instance (you'll probably need to modify it for your own needs, e.g. login credentials or server locations):


The complex queries are not yet implemented. I'm happy to take pull requests on those if anyone is up for the challenge.

Feel free to contact me if you need any help getting things set up.

Best,
Jonathan