Help Needed: Optimizing Gremlin Traversal for Performance Without Losing Data


Sukru

Feb 2, 2024, 2:16:38 PM
to Gremlin-users
Hello Gremlin Community,

I'm seeking advice on optimizing a Gremlin traversal query. My current query functions correctly but suffers from performance issues, likely due to redundant traversal steps. I've attempted to restructure it for efficiency, but the new version doesn't retrieve all the expected data. Insights or suggestions for improvement while keeping the data integrity intact would be much appreciated.

Original Query Structure:
```
.by(
    __.out("EDGE_A")
    .out("EDGE_B")
    .dedup()
    .project("key1", "key2", "key3", "key4", "key5", "key6", "key7", "key8", "additionalData")
    .by(__.id_())
    .by("key2")
    .by("key3")
    .by("key4")
    .by("key5")
    .by(__.coalesce(__.values("key6"), __.constant("")))
    .by(__.coalesce(__.values("key7"), __.constant("")))
    .by(__.coalesce(__.values("key8"), __.constant("")))
    .by(
        __.in_("EDGE_B")
        .where(__.in_("EDGE_A").as_("alias"))
        .project("data1", "data2", "data3")
        .by("dataKey1")
        .by("dataKey2")
        .by("dataKey3")
        .fold()
    )
    .fold()
)
```
This query involves an 'out' step followed by another 'out' step, which seems inefficient. To enhance performance, I tried reorganizing it as follows:
Modified Query Structure:
```
.by(
    __.out("EDGE_A").as_("alias")
    .out("EDGE_B")
    .dedup()
    .project("key1", "key2", "key3", "key4", "key5", "key6", "key7", "key8", "additionalData")
    .by(__.id_())
    .by("key2")
    .by("key3")
    .by("key4")
    .by("key5")
    .by(__.coalesce(__.values("key6"), __.constant("")))
    .by(__.coalesce(__.values("key7"), __.constant("")))
    .by(__.coalesce(__.values("key8"), __.constant("")))
    .by(
        __.select("alias")
        .unfold()
        .project("data1", "data2", "data3")
        .by("dataKey1")
        .by("dataKey2")
        .by("dataKey3")
        .fold()
    )
    .fold()
)
```
However, the modified query doesn't capture all the data, especially when there are multiple entries to be retrieved. My goal is to eliminate the inefficient traversal pattern without compromising on data completeness.

Any guidance on how to achieve this in an optimized manner would be greatly appreciated.

Thanks in advance!

Valentyn Kahamlyk

Feb 5, 2024, 4:36:29 PM
to Gremlin-users
Hi,

Do you have a sample data set that reproduces the issue? Ideally one with only a few vertices and edges. That often helps a great deal in understanding what is wrong with a query.

For data like
EDGE_A -> VERTEX_A1 -> EDGE_B -> VERTEX_B1
EDGE_A -> VERTEX_A2 -> EDGE_B -> VERTEX_B1

the first query gets VERTEX_B1 after dedup(), then goes back along EDGE_B and finds VERTEX_A1 _and_ VERTEX_A2, then projects them.
The second query also produces VERTEX_B1 after dedup(), but select("alias") then returns only one of VERTEX_A1 _or_ VERTEX_A2, because the traverser that arrived via the other path was eliminated by dedup().
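
To make the difference concrete without a running graph, here is a minimal plain-Python sketch of the two behaviours. It is only a simulation, not Gremlin: the `paths` list and the helper names `dedup_then_select` and `walk_back` are made up for illustration, and it assumes both VERTEX_A1 and VERTEX_A2 are reached from the same start vertex.

```python
# Hypothetical toy data mirroring the example above. Each tuple is one
# traverser path: (vertex reached via EDGE_A, vertex reached via EDGE_B).
paths = [
    ("VERTEX_A1", "VERTEX_B1"),
    ("VERTEX_A2", "VERTEX_B1"),
]


def dedup_then_select(paths):
    """Simulate the modified query: label the A vertex, step to B, dedup(),
    then select the label. Only the first traverser per B vertex survives,
    so only the A vertex on that traverser's path is still available."""
    result = {}
    for a_vertex, b_vertex in paths:
        if b_vertex not in result:  # later traversers are dropped by dedup
            result[b_vertex] = [a_vertex]
    return result


def walk_back(paths):
    """Simulate the original query: dedup the B vertices, then walk back
    over EDGE_B from each one and collect every A vertex that points to it."""
    result = {}
    for a_vertex, b_vertex in paths:
        neighbours = result.setdefault(b_vertex, [])
        if a_vertex not in neighbours:
            neighbours.append(a_vertex)
    return result


print(dedup_then_select(paths))  # {'VERTEX_B1': ['VERTEX_A1']}
print(walk_back(paths))          # {'VERTEX_B1': ['VERTEX_A1', 'VERTEX_A2']}
```

The second result is what the original query returns in "additionalData". Any rewrite has to collect the A vertices per B vertex before, or instead of, discarding the duplicate traversers.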