Problems loading a large (300 million entries) graph

Tinto Binto

Oct 1, 2016, 8:29:26 AM
to Aurelius
Hi,

I am trying to load a graph into a single-node Titan + Cassandra setup.
The graph has users and posts as vertices (2.5 million)
and follows, comments, created, and liked as edges (250 million).

There is one file per relation (follow, comment, like), and each contains (user_id, post_id) pairs.

I wrote a simple Groovy script to load this into Titan, but it's barfing after loading 20 million follow edges.

I keep getting a Cassandra Thrift socket timeout. I forgot to capture the actual logs; I will run it again to capture them.
But does the script show any obvious issues that could explain this?

thanks
warshi


largeGraphLoad.groovy

Jason Plurad

Oct 1, 2016, 10:19:54 AM
to Aurelius
Hi warshi,

Reduce the number of edges to commit per batch. 100,000 is a lot, even more than the 50,000 that you were using for vertices.

I'd suggest starting much lower, perhaps 10,000, then if that succeeds, you could try increasing it.

-- Jason
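
The batching pattern Jason describes can be sketched roughly as follows. This is a language-agnostic illustration written in Python, not the attached Groovy script: `load_in_batches`, `add_edge`, and `commit` are hypothetical names, standing in for whatever the script uses to add an edge and commit the Titan transaction. The default of 10,000 is simply the starting point suggested above.

```python
from typing import Callable, Iterable, Tuple

# One (user_id, post_id) pair per edge, as in the relation files.
Edge = Tuple[str, str]


def load_in_batches(
    edges: Iterable[Edge],
    add_edge: Callable[[str, str], None],  # hypothetical: adds one edge to the open transaction
    commit: Callable[[], None],            # hypothetical: commits the open transaction
    batch_size: int = 10_000,
) -> int:
    """Add edges one at a time, committing after every `batch_size` edges.

    Returns the number of commits issued.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")

    commits = 0
    pending = 0  # edges added since the last commit
    for user_id, post_id in edges:
        add_edge(user_id, post_id)
        pending += 1
        if pending == batch_size:
            commit()
            commits += 1
            pending = 0

    # Flush the final, partially filled batch.
    if pending:
        commit()
        commits += 1
    return commits
```

Keeping the batch size as a parameter makes it easy to start low and raise it between runs while watching for timeouts.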

warshi

Oct 4, 2016, 5:38:42 PM
to Aurelius
Thanks Jason,

That definitely helped!
The upload completed in 2.5 hours without any problems.

Just curious, what was the bottleneck? Is it Titan or Cassandra?
Does each commit explode into some 8x Cassandra writes, and that's the issue, or is it heap space?

thanks
warshi

Antriksh Shah

Apr 19, 2018, 3:40:36 AM
to Aurelius
Hey Warshi,

Do you remember which bulk upload optimisation settings you used in the properties file?