Neo4j Import Tool Help


o.adeg...@gmail.com

Jun 12, 2018, 8:02:03 PM6/12/18
to Neo4j
Hi, I'm trying to import a large network graph (about 90 GB uncompressed, 30 GB compressed), with 2.36 billion nodes and about as many relationships, into Neo4j using the import tool.

My data is tab-separated, and each row is formatted as:

User1(Follower) User1ID User2(Followed) User2ID

The problem I'm running into is that I have to use two different header files against the same data files to collect the users from both the left-hand and right-hand columns, which raises the initial node count to 4.73 billion. I also end up with a lot of duplicate nodes, because many users in my data follow the same user, and that causes a lot of time to be spent on the node index sort.
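
To illustrate with a made-up row (the names and IDs here are placeholders), a single tab-separated line such as

        alice   1001    bob     2002

is read once with a header that picks out the left-hand user (alice), once with a second header that picks out the right-hand user (bob), and once more for the FOLLOWS relationship, so every row contributes two node records before deduplication, which is where the 4.73 billion count comes from.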

The current time for the node import is 1h 30m, and the node index sort is currently at 4d 12m.

I'm running neo4j on a dedicated server
  • OS: Ubuntu Server 16.04 LTS "Xenial Xerus"
  • RAM: 64 GB
  • Storage: SoftRAID 3x2 TB
  • CPU: Intel Xeon E5-1620, quad-core, 3.60 GHz
I was originally working with the data split across 10 files, but have consolidated it into 2 compressed files to speed up the node import.

The command I ran with the import tool:

neo4j-admin import --database instaGraphPostPurge.db \
    --nodes:User "/home/headers/graph_header_following.csv,/home/instanet/postpurge/following/PPfollowing.tgz" \
    --nodes:User "/home/headers/graph_header2_following.csv,/home/instanet/postpurge/following/PPfollowing.tgz" \
    --relationships:FOLLOWS "/home/headers/graph_relate_following.csv,/home/instanet/postpurge/following/PPfollowing.tgz" \
    --nodes:User "/home/headers/graph_header_followedBY.csv,/home/instanet/postpurge/followedBy/PPfollowedBY.tgz" \
    --nodes:User "/home/headers/graph_header2_followedBY.csv,/home/instanet/postpurge/followedBy/PPfollowedBY.tgz" \
    --relationships:FOLLOWS "/home/headers/graph_relate_followedBY.csv,/home/instanet/postpurge/followedBy/PPfollowedBY.tgz" \
    --delimiter TAB --ignore-duplicate-nodes --ignore-extra-columns
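
The header files referenced above do roughly the following (a simplified sketch, not the exact contents; the columns are tab-separated in reality, and "name"/"userID" are placeholder property names):

        graph_header_following.csv, picks up the left-hand user as a node:
        name    userID:ID(User) :IGNORE :IGNORE

        graph_header2_following.csv, picks up the right-hand user of the same rows:
        :IGNORE :IGNORE name    userID:ID(User)

        graph_relate_following.csv, builds the FOLLOWS relationship:
        :IGNORE :START_ID(User) :IGNORE :END_ID(User)

The :ID(User), :START_ID(User) and :END_ID(User) annotations all refer to the same ID group, which is what ties the node files and the relationship file together; the relationship type itself comes from --relationships:FOLLOWS on the command line.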

The current import tool output:

        ******** DETAILS 2018-06-11 20:31:50.825+0000 ********

        Nodes
        [*Nodes---------------------------------------------------------------------------------------]4.73B
        Memory usage: 36.27 GB
        I/O throughput: 82.71 MB/s
        VM stop-the-world time: 4s 32ms
        Duration: 1h 30m 47s 544ms
        Done batches: 473332

        Prepare node index
        [*SORT----------------------------------------------------------------------------------------]6.24B
        Memory usage: 60.51 GB
        VM stop-the-world time: 7s 99ms
        Duration: 4d 12m 1s 881ms
        Done batches: 624857

        Environment information:
          Free physical memory: 543.31 MB
          Max VM memory: 13.99 GB
          Free VM memory: 19.35 MB
          VM stop-the-world time: 11s 131ms
          Duration: 4d 1h 42m 49s 425ms

.......... .......... .......... .......... .......... 5%
...
.......... .......... .......... .......... .......... 100%


My neo4j.conf settings:

dbms.memory.heap.initial_size=10G
dbms.memory.heap.max_size=20G
dbms.memory.pagecache.size=20G
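
For what it's worth, my understanding from the docs is that neo4j-admin import runs in its own JVM, so the heap and page cache settings above configure the database server rather than the import itself; the import tool's memory seems to be governed by the HEAP_SIZE environment variable and its --max-memory option instead. Something along these lines is what I mean (the 20G and 50G values are just placeholders), but please correct me if those aren't the right knobs:

        HEAP_SIZE=20G neo4j-admin import --max-memory=50G --database instaGraphPostPurge.db \
            ...same --nodes/--relationships arguments as above...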

I was wondering if there is a way to speed up this import without editing my data files, as I have another graph of similar size that I want to import afterwards.

Any help is appreciated. Thanks in advance.