In the book "Graph Databases in Action" by David Bechberger and Josh Perryman, they state that:
"Testing at a sufficient scale means using sufficiently deep and sufficiently connected data, not just sufficiently large data. Testing at scale is especially paramount with highly recursive applications or those using a distributed database. For example, if the application we’re building involves an unbounded recursive traversal that performs acceptably on our test graph with a depth (number of iterations) of 5, it might function horribly on a production graph with a depth of 10.
It’s extremely complex to effectively simulate graph data. The scale-free nature of most graph domains represents a unique challenge. In our experience, the best approach to creating test data isn’t to create it at all; but instead to use real data whenever possible, sometimes in combination with data masking approaches in order to protect personally identifying information in the data. But more often than not this isn't possible. If you do have to simulate data, then take extra care that the data’s shape matches the expected production data." (Graph Databases in Action, 2020, §10.3.3, p. 280)
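(For context, the kind of unbounded recursive traversal they mean is a `repeat()` bounded only by a data-dependent `until()`, with no fixed `times()` cap. The labels and property values in this sketch are just placeholders:)

```groovy
// Unbounded recursion: repeat() limited only by a data-dependent until().
// Its cost depends on the depth and branching factor of the actual graph,
// which is why a depth-5 test graph can hide a depth-10 production problem.
g.V().has('person', 'name', 'alice').
  repeat(out('knows').simplePath()).
  until(has('name', 'bob')).
  path()
```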
Any suggestions on how to go about doing this? I want to generate large sample datasets of various shapes and sizes in a TinkerPop-enabled graph database (specifically DSE Graph, but this question is about Gremlin tooling in general). Searching around online didn't turn up any clear solutions.
Besides writing my own scripts to generate some CSVs that I can load in, is there a standard way of doing this?
In some ways this question is similar to this SO post, but the difference is that I also want to be able to generate the data dynamically.
I want to be able to specify details about the shape of the sample data; e.g., I want to be able to generate some supernodes (a rough sketch of what I mean is below). But again, I'm wondering about the best way to go about this in general. Any suggestions?
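Something like this Gremlin console (Groovy) sketch is the kind of control I'm after; all of the labels, counts, and the degree distribution here are placeholders I made up:

```groovy
// Rough sketch: build a sparse background graph, then a few supernodes
// that fan out to a large slice of it. Labels, counts, and the degree
// distribution are placeholders, not a realistic shape model.
rnd = new Random(42)
normalCount = 10000
superCount  = 5
superDegree = 2000

// ordinary vertices
normals = (0..<normalCount).collect {
    g.addV('person').property('uid', it).next()
}

// sparse background edges, roughly 3 per vertex
normals.each { v ->
    3.times {
        w = normals[rnd.nextInt(normalCount)]
        if (w != v) g.V(v).addE('knows').to(w).iterate()
    }
}

// supernodes, each fanning out to superDegree random vertices
superCount.times { i ->
    s = g.addV('person').property('uid', 'super-' + i).next()
    superDegree.times {
        g.V(s).addE('knows').to(normals[rnd.nextInt(normalCount)]).iterate()
    }
}
```

That works for toy sizes, but it's one operation per element, which is exactly the kind of thing I'd rather replace with proper tooling.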
Hi Josh,

Thanks, yeah, this looks interesting. Just to confirm: the general idea for loading data into a TinkerPop-enabled DB would be to use the `com.uber.ugb.db.CsvOutputDB` class to generate a CSV and then bulk load that into the DB, is that correct?
It's too bad it wasn't maintained or continued - I saw the call to action in your co-worker's slides inviting participation from the community, but it looks like it never took off. I'm in the same boat, though, to be honest: my current project calls for a quick solution, and I don't really have the bandwidth or justification to work on something or contribute much once this project ends in the near future.
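In the meantime, once I have CSVs I'll probably just fall back to something naive in the Gremlin console. Assuming a vertices.csv with a single uid column and an edges.csv with fromUid,toUid,label columns (all placeholder names, plus a placeholder 'person' label), roughly:

```groovy
// Naive per-element CSV loader for the Gremlin console. Assumes headers
// "uid" and "fromUid,toUid,label", and that 'uid' is an indexed property
// key (otherwise the edge pass does full scans). Fine for test data only.
new File('vertices.csv').readLines().drop(1).each { line ->
    g.addV('person').property('uid', line.trim()).iterate()
}
new File('edges.csv').readLines().drop(1).each { line ->
    def (fromUid, toUid, label) = line.split(',')*.trim()
    g.V().has('uid', fromUid).as('a').
      V().has('uid', toUid).
      addE(label).from('a').iterate()
}
```

For real volume I'd switch to DSE's own bulk loading tooling rather than pushing everything through the console.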