Performance issues while loading data to Neptune and querying


Prathyusha Reddy

Feb 18, 2019, 10:42:56 AM
to Gremlin-users
Hi,

We have loaded 4 million vertices and 2 million edges into Neptune, and performance issues have started to show up. We used the Neptune Bulk Loader API to insert the data. Our cluster has only one instance, of size r4.2xlarge. While loading the data we saw a CPU spike on the instance, to about 97%.

After loading the data we saw the following issues.

It takes 10 s to run a count: g.V().count()

Traversal queries such as the one below time out:
gremlin> g.V().hasLabel('entity').flatMap(group().by('name').by(both().values('cash_in').sum())).and(select(values).unfold().is(gt(10000)))

{"requestId":"e79898f6-17a5-4714-b268-2894ab45dca9","code":"TimeLimitExceededException","detailedMessage":"A timeout occurred within the script during evaluation of [RequestMessage{, requestId=e79898f6-17a5-4714-b268-2894ab45dca9, op='eval', processor='', args={gremlin=g.V().hasLabel('entity').flatMap(group().by('name').by(both().values('cash_in').sum())).and(select(values).unfold().is(gt(10000))), batchSize=64}}] - consider increasing the timeout"}

After this, any further query against the cluster also times out.

Could someone please help us understand why we are seeing these performance issues with respect to memory and CPU? Is there an anti-pattern in our query above, or do we need to change any DB parameter settings, such as increasing the query timeout? Please advise.


Kelvin Lawrence

Feb 18, 2019, 2:47:07 PM
to Gremlin-users
Hi Prathyusha, as these are Neptune-specific questions, you may want to post them to the AWS Neptune support forum at [1].

The Neptune Bulk Loader will attempt to load your CSV files using many parallel workers. During a bulk load, a very busy CPU means that those workers are able to keep the Neptune engine busy, which is actually a good thing.

Without more information about your data model and how you are connecting to Neptune, it is hard to give you a concrete answer on the performance of your queries. You might want to try decomposing your query into parts and using the profile() step to see where the time is being spent.
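
One way to do that decomposition, as a rough sketch only: the small Python helper below builds one query string per prefix of the traversal from the original post, each ending in profile(), so they can be pasted into the Gremlin console one at a time. The helper name profile_prefixes and the choice to split at the top-level steps are assumptions made for illustration, and whether the profile() step is accepted depends on the server you run against.

```python
from typing import List


def profile_prefixes(source: str, steps: List[str]) -> List[str]:
    """Return one Gremlin string per prefix of `steps`, each ending in .profile().

    Hypothetical helper: it only builds text and does not contact any server.
    """
    queries = []
    for i in range(1, len(steps) + 1):
        # Join the first i steps onto the traversal source, e.g. "g.V().hasLabel(...)".
        prefix = source + "".join("." + step for step in steps[:i])
        queries.append(prefix + ".profile()")
    return queries


# Top-level steps of the traversal quoted in the original post.
STEPS = [
    "V()",
    "hasLabel('entity')",
    "flatMap(group().by('name').by(both().values('cash_in').sum()))",
    "and(select(values).unfold().is(gt(10000)))",
]

QUERIES = profile_prefixes("g", STEPS)

if __name__ == "__main__":
    # Print the queries from shortest to longest so each can be run in turn.
    for query in QUERIES:
        print(query)
```

Running the shortest prefix first and then adding one step at a time should show which step accounts for the jump in elapsed time.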
