Processed result logging


Lucie S.

Sep 2, 2018, 1:16:26 PM
to Gremlin-users
Hi,

I work on a project that requires logging the outcome of the path method whenever a result is of type GraphTraversal. The original result must stay untouched. In other words, I want to silently log all the elements that were passed through while searching for a result. It should also be possible to avoid logging sensitive data (anonymization).

I have already forked the tinkerpop project and created a working implementation of this functionality.
I tried to minimize changes to existing files; four of them are affected: https://github.com/apache/tinkerpop/compare/master...svitaluc:processed-result-log (compared against master, although I branched from 3.3.3; my changes do not conflict with later changes in master so far)

My questions are:
  1. Is there a better way to achieve my objective?
  2. Have I located my adjustments to proper files? Mainly asking about AbstractEvalOpProcessor.java. I tried to follow implementation of authenticator.enableAuditLog about where to incorporate my changes.
  3. Is the new package located where it should be? Should it be e.g. a separate maven module instead?
  4. Do you think that such a functionality could be beneficial to others so it has a potential of a pull request (after some discussion and potential changes)?
  5. Shall I communicate this in another place? Create a HipChat account or send it to d...@tinkerpop.apache.org?

Thank you!

Have a great day.
Lucie S.

HadoopMarc

Sep 3, 2018, 3:15:15 PM
to Gremlin-users
Hi Lucie,

Author of authenticator.enableAuditLog here. Most of these questions are best directed to @StephenMallette; I'll go into 1 and 4.

1. Although not necessarily better as a solution, the alternative would be to take the available audit logs and replay the queries with a separate application. This could have advantages in managing TinkerPop software complexity and possibly Gremlin Server's IO performance, but it would be much harder to replicate exact query results.

4. This is the age of big data and these logs would enable you to mine the actual use of a production graph in much more detail than just from the audit logs. I am not a privacy regulations expert, but I could imagine situations where these logs are needed in addition to the very basic audit logs. So, yes, potential uses are there.

HTH,    Marc


On Sunday, September 2, 2018 at 19:16:26 UTC+2, Lucie S. wrote:

Stephen Mallette

Sep 4, 2018, 4:00:32 PM
to Gremlin-users
I'm not quite sure where to start, so apologies if this reply takes some random turns.

Are you ok with taking this approach:


I'm not sure how much you care about the completeness of your logging solution, but since you're dealing with scripts it's easy to get around that - instead of:

g.V().out()

someone just needs to send:

g.V().out().toList()

and they've bypassed your logging because that script will evaluate to List and not Traversal.  Even without scripts, do you care that path() will return nothing for:

g.V().has('name','josh').out().drop().path()

I'm sure there are other situations like that where path() really won't help you, but perhaps you've thought that part through and are ok with your solution.
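
To make the toList() bypass concrete, here is a minimal sketch in plain Java. It does not use the TinkerPop API: java.util.stream.Stream stands in for a lazy Traversal, and ResultTypeCheck.wouldBeLogged is a hypothetical hook that mimics "log only when the evaluated script result is still a traversal".

```java
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Toy model only: Stream stands in for a lazy Traversal; nothing here is TinkerPop API.
class ResultTypeCheck {

    // Hypothetical hook: log only when the script result is still a lazy pipeline.
    static boolean wouldBeLogged(Object scriptResult) {
        return scriptResult instanceof Stream;
    }

    public static void main(String[] args) {
        // Analogous to g.V().out(): the script evaluates to the pipeline itself.
        Object lazy = Stream.of("lop", "vadas", "josh");
        // Analogous to g.V().out().toList(): the script has already iterated it.
        Object iterated = Stream.of("lop", "vadas", "josh").collect(Collectors.toList());

        System.out.println("pipeline logged: " + wouldBeLogged(lazy));
        System.out.println("list logged:     " + wouldBeLogged(iterated));
    }
}
```

Any check that keys off the type of the final result has this gap, because the script author decides what the last expression evaluates to.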

I think that there's a lot of cost to the approach you've taken, because it forces every traversal to get executed twice, and even "fast" ones will then gain the burden of having to track path data, which will hurt how quickly you're getting results (I guess you're doing that logging async, but I could imagine that process backing up to some degree).
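
One way to keep an async logger from backing up without limit is a bounded hand-off queue. The following is only a sketch under that assumption, in plain Java: BoundedLogBuffer and its method names are hypothetical and are not Gremlin Server classes. The request path never blocks; when the writer falls behind, entries are counted as dropped instead of piling up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of a bounded hand-off between request threads and a log writer.
// BoundedLogBuffer is a hypothetical name, not part of Gremlin Server.
class BoundedLogBuffer {
    private final BlockingQueue<String> queue;
    private final AtomicLong dropped = new AtomicLong();

    BoundedLogBuffer(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    // Called on the request path: never blocks; counts the entry as dropped if the queue is full.
    boolean submit(String entry) {
        boolean accepted = queue.offer(entry);
        if (!accepted) {
            dropped.incrementAndGet();
        }
        return accepted;
    }

    // Called by the writer: moves all currently queued entries into the sink, in order.
    int drainTo(List<String> sink) {
        return queue.drainTo(sink);
    }

    long droppedCount() {
        return dropped.get();
    }

    public static void main(String[] args) {
        BoundedLogBuffer buffer = new BoundedLogBuffer(2);
        // No writer has drained yet, so the third entry does not fit.
        buffer.submit("v[1]");
        buffer.submit("e[9]");
        buffer.submit("v[3]");

        List<String> written = new ArrayList<>();
        buffer.drainTo(written);
        System.out.println("written: " + written + ", dropped: " + buffer.droppedCount());
    }
}
```

Dropping keeps query latency stable but makes the log incomplete; blocking in submit (for example with put) would keep the log complete at the price of slowing down requests. Which of the two is acceptable depends on how much completeness matters for the use case.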

That said, I'm not sure what you should do exactly. I want to suggest you write a custom TraversalStrategy that could inject logger steps of some sort into the traversal. For example, if you only cared about vertices/edges, just search for the various steps that return those types and inject a sideEffect() that does your logging (or, if you're going that far, you might create your own LogStep, or perhaps do a DSL with a log() step). In other words, you convert something like:

g.V().has('name','josh').outE().has('weight',gt(0.1)).inV().out()

to 

g.V().has('name','josh').log().outE().has('weight',gt(0.1)).log().inV().out().log()

and then as each traverser passes through, you somehow coordinate the logging of traversed elements. If you took the TraversalStrategy approach, your "g" would be more fool-proof to some degree, as it would take someone unregistering your custom strategy to get rid of the logging, but I think that could be mitigated with sandboxing if you needed to. The toList() iteration and drop() examples I used above would no longer be problems, and neither would a whole host of other holes. Anyway... I think I'd try to do this kind of thing at a lower level than Gremlin Server if I could.
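
The rewrite described above can be sketched as a toy model in plain Java. This is not the TinkerPop API: steps are plain strings, LogStepInjector is a hypothetical name, and the set of "element-emitting" step names is an assumption made for illustration. A real TraversalStrategy would walk the traversal's Step objects instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of the rewrite a custom strategy would perform; steps are plain strings.
class LogStepInjector {

    // Assumption for this sketch: these step names emit vertices or edges.
    private static final Set<String> ELEMENT_STEPS = new HashSet<>(Arrays.asList(
            "V", "E", "out", "in", "both", "outE", "inE", "bothE", "inV", "outV"));

    // Returns a new step list with a "log" marker after each element-emitting step,
    // placed behind any has() filters that directly follow that step.
    static List<String> inject(List<String> steps) {
        List<String> rewritten = new ArrayList<>();
        for (int i = 0; i < steps.size(); i++) {
            String step = steps.get(i);
            rewritten.add(step);
            if (!ELEMENT_STEPS.contains(step)) {
                continue;
            }
            // Keep filters attached so only elements that survive the filter are logged.
            while (i + 1 < steps.size() && steps.get(i + 1).equals("has")) {
                rewritten.add(steps.get(++i));
            }
            rewritten.add("log");
        }
        return rewritten;
    }

    public static void main(String[] args) {
        List<String> original = Arrays.asList("V", "has", "outE", "has", "inV", "out");
        System.out.println(String.join(".", inject(original)));
    }
}
```

With this rule the example traversal also gets a log marker after inV(), which the hand-written conversion above leaves out; where exactly to log is a design choice. Because the markers sit inside the step sequence, a trailing drop() or toList() no longer hides the elements that passed through earlier steps.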

As for your specific questions:

> Is there a better way to achieve my objective?

Offered a possible answer to that above.

> Have I located my adjustments to proper files? Mainly asking about AbstractEvalOpProcessor.java. I tried to follow implementation of authenticator.enableAuditLog about where to incorporate my changes.

I think so... though TraversalOpProcessor would probably also need some change if you expected to process bytecode-based requests.

> Is the new package located where it should be? Should it be e.g. a separate maven module instead?

I mean... if you continued with this approach, you've effectively forked Gremlin Server, so you can organize it however you like. Ideally, I'd imagine you don't want to maintain a fork, so I would say you would want to figure out how to make this work in such a way that it is instead its own maven module that plugs into a standard build of Gremlin Server. I'm not sure what that would entail offhand.

> Do you think that such a functionality could be beneficial to others so it has a potential of a pull request (after some discussion and potential changes)?

I appreciate your asking that question and presenting this as a possible contribution, but I would say that while the capability might have some general applicability for some other users, the current approach comes with a lot of cost to performance and doesn't cover all the possible execution scenarios, thus allowing the log to be "incomplete". Perhaps that approach is ok for your use case, but I'd imagine that the vast majority of TinkerPop users would likely not want to stomach those drawbacks. If you implemented this as a TraversalStrategy as I suggested and it was done in a really general way, perhaps that would make a good pull request - hard to say. I'm torn on the topic because if we implemented such a thing as a first-class citizen in TinkerPop, I'm not sure it would come in as a TraversalStrategy on its own. I really only offered that to you as a way for you to meet your requirements by way of our exposed methods for extending Gremlin behavior. I've often thought about some form of trace() step to go alongside the profile() and explain() steps that offered some kind of debug feedback or logging - maybe that's what a first-class implementation would look like if we thought that was useful. That would require a bunch of discussion on the dev list.

> Shall I communicate this in another place? Create a HipChat account or send it to d...@tinkerpop.apache.org?

Here was fine :)

HTH,

Stephen


--
You received this message because you are subscribed to the Google Groups "Gremlin-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gremlin-user...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/gremlin-users/d52dbc01-898d-433d-bed8-2e9d1b01f226%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Message has been deleted

Stephen Mallette

Sep 4, 2018, 6:25:54 PM
to Gremlin-users
>  Do tags as @StephenMallette work in google groups?

I see all

On Tue, Sep 4, 2018 at 4:27 PM Lucie S. <svit...@gmail.com> wrote:
Hi Marc,

thanks for your reply!

That's exactly what I was thinking about regarding point 1. To avoid delaying the main operations too much, I do the extra processing of a result in separate thread(s), which costs some performance of course, but is still the better option imho. I find replaying queries harder, as you say. If we were talking about gathering data from a real product at a customer, it could even be impossible.

Do tags as @StephenMallette work in google groups? Not sure if it sends any notification as it seems inactive.

Lucie S.


On Sunday, September 2, 2018 at 19:16:26 UTC+2, Lucie S. wrote:


Lucie S.

Sep 10, 2018, 4:11:21 PM
to Gremlin-users
Hi Stephen,

thank you so much for your extensive reply!!

I think I can survive with the current implementation so far but as soon as I have some other parts of my project done I want to come back to this extension and improve it. I really like the idea of injecting the side effects.

When I'm done with my new implementation and if I feel like it's good enough to share, I will renew this discussion.

I really appreciate your input A LOT! Thank you again.

Have a great day.
Lucie S.



On Tuesday, September 4, 2018 at 22:00:32 UTC+2, Stephen Mallette wrote: