Best way to generate sample/mock data for benchmarking and testing


Ryan Quey

Jun 18, 2021, 6:05:29 AM
to Gremlin-users

In their book "Graph Databases in Action", David Bechberger and Josh Perryman state:

"Testing at a sufficient scale means using sufficiently deep and sufficiently connected data, not just sufficiently large data. Testing at scale is especially paramount with highly recursive applications or those using a distributed database. For example, if the application we’re building involves an unbounded recursive traversal that performs acceptably on our test graph with a depth (number of iterations) of 5, it might function horribly on a production graph with a depth of 10.

It’s extremely complex to effectively simulate graph data. The scale-free nature of most graph domains represents a unique challenge. In our experience, the best approach to creating test data isn’t to create it at all; but instead to use real data whenever possible, sometimes in combination with data masking approaches in order to protect personally identifying information in the data. But more often than not this isn't possible. If you do have to simulate data, then take extra care that the data’s shape matches the expected production data." (Graph Databases in Action, 2020, 10.3.3, pg 280)

Any suggestions on how to go about doing this? I want to generate large sample datasets of various shapes and sizes in a TinkerPop-enabled graph database (specifically DSE Graph, but this question is about Gremlin tooling in general). Searching around online didn't turn up any clear solutions.

Besides writing my own scripts to generate some csvs that I can load in, is there a standard way of doing this?

In some ways this question is similar to this SO post, but the difference is that I also want to be able to generate the data dynamically.

I want to be able to specify details about the shape of the sample data, e.g.,

  • number of vertices
  • number of edges
  • depth of the connected records (as per the recommendation from Bechberger and Perryman, cited above)

Specifically, I want to be able to generate some supernodes, but again, I'm wondering about the best way to go about this in general. Any suggestions?
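
For reference, the naive version of what I have in mind, e.g. hand-generating a single supernode with gremlinpython, would be something like this (just a sketch; the server address, labels, and counts are arbitrary):

from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.process.graph_traversal import __
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

conn = DriverRemoteConnection('ws://localhost:8182/gremlin', 'g')
g = traversal().withRemote(conn)

# One hub vertex plus N spokes yields a supernode of degree N
hub = g.addV('person').property('name', 'hub').next()
for i in range(10000):
    (g.addV('person').property('name', f'spoke-{i}')
      .addE('follows').to(__.V(hub.id))
      .iterate())

conn.close()

But hand-rolling scripts like this for every shape (degree distributions, depth, connectivity) is exactly what I'm hoping to avoid.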

Shay Nehmad

Jun 20, 2021, 8:20:41 AM
to Gremlin-users
This is a problem we're facing as well, and I haven't found a good solution yet. We are currently testing against real data, which is itself a real problem.

The best thing we've come up with as a future plan is to retrofit Hypothesis (the Python property-based testing library) to graph data: something like "fuzzing" a graph for the many unit tests that take graphs as input, based on our data model. We'd define fuzzing strategies for each vertex/edge label and run our unit tests against the generated graphs. Something like:

# Pseudocode made concrete as a sketch; the strategies and
# fuzz_scene_into_graph are hypothetical helpers built on our data model
from hypothesis import given

@given(assets_1=group_of_assets(), assets_2=group_of_assets(),
       users_1=users_from_domain('x'),   # one hop from those assets
       users_2=users_from_domain('y'))   # not connected to those assets
def test_get_domain_files(assets_1, assets_2, users_1, users_2, test_gremlin):
    # materialize the fuzzed entities into a graph via the Gremlin fixture
    g = fuzz_scene_into_graph(test_gremlin, assets_1, assets_2, users_1, users_2)
    results = get_domain_files(g)
    # a few generic asserts on the results, given the generated data
    assert all(r['domain'] == 'x' for r in results)
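
To make the per-vertex strategy idea concrete, a minimal sketch of one strategy (the names and properties are made up):

import hypothesis.strategies as st

# Hypothetical: a strategy producing lists of property maps for 'user' vertices
def users_from_domain(domain):
    return st.lists(
        st.fixed_dictionaries({
            'label': st.just('user'),
            'name': st.text(min_size=1, max_size=20),
            'domain': st.just(domain),
        }),
        min_size=1, max_size=50)

The edge-wiring logic (e.g. keeping the domain-y users disconnected from the assets) would live in fuzz_scene_into_graph.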

I'd love to get more concrete examples from the group if anyone has any!

Ryan Quey

Jun 26, 2021, 12:03:43 PM
to gremli...@googlegroups.com
I was waiting to see if there were any more responses, but it sounds like this is my answer! Thanks for sharing.


Joshua Shinavier

Jun 27, 2021, 12:17:35 PM
to Gremlin-users
Hi Ryan,

This is a problem we grappled with at Uber as well. We needed a way to evaluate the performance of different traversal-based queries (mainly filtered shortest paths) on graphs intermediate in size between our tiny, hand-written test datasets and our huge production datasets. Even more importantly, we needed to know what effect ingesting new, large datasets would have on our queries, which suggested either downsampling our production data in a statistically representative way (which was hard and query-intensive) or generating statistically representative graph data.

There are a number of frameworks (e.g. LUBM, the Berlin SPARQL Benchmark, the DBpedia SPARQL Benchmark) which will generate graphs of any size. However, we didn't find any of these helpful or appropriate at the time, because a) they generated graphs according to a fixed schema unrelated to our own schema, and b) they all generated scale-free graphs, which ours was not. Scale-free graphs are pretty common, but they're not universal; in the case of Uber's graph, we actually found log-normal degree distributions in most cases.

What we came up with was a workflow in which we would derive accurate degree distributions from a new dataset before ingesting it into the graph, then simulate the new dataset's effect on our queries using smaller generated graphs in a staging deployment. My coworker Chris Lu extended the graph generation code I wrote for our internal project and made it available as an open-source project here. See also Chris's presentation from Data Day Texas.
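
If you want to do the degree-distribution step by hand first, a per-edge-label degree histogram is a one-liner in Gremlin; a sketch with gremlinpython (the 'follows' label and server address are placeholders):

from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.process.graph_traversal import __
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

g = traversal().withRemote(
    DriverRemoteConnection('ws://localhost:8182/gremlin', 'g'))

# {out_degree: number_of_vertices} for one edge label
degree_hist = g.V().groupCount().by(__.outE('follows').count()).next()

You can then fit whatever distribution (log-normal, power-law, ...) to that histogram.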

The above worked well enough for our purposes, but there is a lot more which could and should be done in this area. I would say an ideal framework for generating graph data ought to produce graphs which:
  1. Conform to a given graph schema
  2. Are characterized by statistics, including degree distributions for each edge label, which are specified by the developer, and
  3. Are characterized by topological characteristics which are specified by the developer
Single-relational graph generation solutions tend to be pretty good at capturing 2 and 3, while the framework I described above was pretty good at capturing 1 and 2 (given that our schemas were pretty simple). I know there are other solutions which have appeared in the last few years, as well. It would be interesting to see a survey of the current state of the art.
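
As a toy illustration of point 2, sampling a per-vertex out-degree from a developer-specified distribution and then wiring edges could look like this (a sketch only; numpy's log-normal is used purely as an example):

import numpy as np

def generate_edges(n_vertices, mu=1.0, sigma=1.0, seed=42):
    # Sample an out-degree for each vertex from a log-normal distribution,
    # then pick each edge's in-vertex uniformly at random
    rng = np.random.default_rng(seed)
    degrees = rng.lognormal(mean=mu, sigma=sigma, size=n_vertices).astype(int)
    edges = []
    for src, deg in enumerate(degrees):
        for _ in range(deg):
            edges.append((src, int(rng.integers(0, n_vertices))))
    return edges

Note that the in-vertices here are chosen uniformly at random, which captures the degree distribution but none of the topology (point 3).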

Josh



Ryan Quey

Jun 29, 2021, 4:34:03 AM
to Gremlin-users
Hi Josh,

Thanks, yeah, this looks interesting. Just to confirm: the general idea for loading data into a TinkerPop-enabled DB would be to use the `com.uber.ugb.db.CsvOutputDB` class to generate a CSV and then bulk load that into the DB, is that correct?

It's too bad it wasn't maintained or continued - I saw the call to action in your coworker's slides inviting participation from the community, but it looks like it never took off. I'm in the same boat, though, to be honest: my current project calls for a quick solution, but I don't really have the bandwidth or justification to keep working on or contributing to something once this project ends in the near future.

Josh Perryman

Jun 29, 2021, 10:17:24 AM
to Gremlin-users
First, thank you for the citation! 

We did include a very, very crude sample data generator in our book's code repository. It was purpose-built for the immediate needs of the book and doesn't have any of the functionality one would want for a general approach.

Thankfully, Josh S's excellent response outlines the true needs and compelling constraints. To that I'll add another resource: the Linked Data Benchmark Council (LDBC) Social Network Benchmark (SNB) is an interesting venture in this arena. I haven't worked with it extensively, but I admire that they have published both comprehensive documentation of the non-trivial schema and data generation tools: LDBC Social Network Benchmark (LDBC-SNB)

-Josh Perryman

Joshua Shinavier

Jun 29, 2021, 12:50:11 PM
to Gremlin-users
(See inline comments)


On Tue, Jun 29, 2021 at 1:34 AM Ryan Quey <rlq...@gmail.com> wrote:
Hi Josh,

Thanks, yeah, this looks interesting. Just to confirm: the general idea for loading data into a TinkerPop-enabled DB would be to use the `com.uber.ugb.db.CsvOutputDB` class to generate a CSV and then bulk load that into the DB, is that correct?


That's one way, but there are also other "DB" implementations for output of the generated graph data. E.g. see GremlinDB, which wraps a GremlinGroovyScriptEngine. The first version of the code only had one way of outputting data: with calls into our internal graph API. Chris went to great lengths to make sure you could combine the open-source generator with a variety of storage solutions, including key/value stores.
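
If you do go the CSV route, loading the output back in is straightforward for small graphs; a rough sketch with gremlinpython, assuming the vertices are already loaded and an edge file with made-up columns src, label, dst:

import csv
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.process.graph_traversal import __
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

g = traversal().withRemote(
    DriverRemoteConnection('ws://localhost:8182/gremlin', 'g'))

with open('edges.csv') as f:
    for row in csv.DictReader(f):
        # one traversal per edge; fine for tests, too slow for real volumes
        g.V(row['src']).addE(row['label']).to(__.V(row['dst'])).iterate()

For production-scale loads you'd want the database's own bulk loader rather than one-edge-at-a-time traversals.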


It's too bad it wasn't maintained or continued - I saw the call to action in your coworker's slides inviting participation from the community, but it looks like it never took off. I'm in the same boat, though, to be honest: my current project calls for a quick solution, but I don't really have the bandwidth or justification to keep working on or contributing to something once this project ends in the near future.


I agree; it would be nice to see this framework used and extended outside of Uber, and also to see a bit more theoretical work exploring some of the assumptions we made. I think UGB is pretty good at doing what it was designed to do: generating statistically representative graphs which can be used in benchmarking and capacity planning. "In the small", however, these graphs are not very realistic, as the in- and out-vertices for any given edge are randomly chosen, whereas in real data you usually find clusters / communities of vertices which connect to each other much more than they connect to vertices outside of the cluster. This structure may have non-negligible effects on query performance.
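
To illustrate, adding a community bias to endpoint selection is only a small change to the uniform sketch above (p_in, the probability that an edge stays inside its community, is a made-up knob):

import numpy as np

def community_edges(n_vertices, n_communities, degrees, p_in=0.8, seed=42):
    # Like the uniform generator, but with probability p_in an edge's
    # in-vertex is drawn from the out-vertex's own community
    rng = np.random.default_rng(seed)
    community = rng.integers(0, n_communities, size=n_vertices)
    members = {c: np.flatnonzero(community == c) for c in range(n_communities)}
    edges = []
    for src, deg in enumerate(degrees):
        for _ in range(int(deg)):
            if rng.random() < p_in:
                dst = rng.choice(members[community[src]])
            else:
                dst = rng.integers(0, n_vertices)
            edges.append((src, int(dst)))
    return edges

Tuning p_in toward 1.0 gives tight clusters; p_in=0.0 recovers the uniform generator above.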

Josh

 