Importing a large XML file to Neo4j with Py2neo


mourchid youssef

Jul 1, 2016, 10:45:43 AM
to Neo4j
I have a problem importing a very large XML file with 36,196,662 lines (2 GB). I am trying to create a Neo4j graph database from this XML file with Py2neo. My XML file looks like this:

and my Python code to import the XML data into Neo4j is as follows:

from xml.dom import minidom
from py2neo import Graph, Node, Relationship, authenticate
from py2neo.packages.httpstream import http
import codecs

http.socket_timeout = 9999

authenticate("localhost:7474", "neo4j", "FCBFAR123")
graph = Graph()

xml_file = codecs.open("User_profilesL2T1.xml", "r", encoding="latin-1")
xml_doc = minidom.parseString(codecs.encode(xml_file.read(), "utf-8"))

persons = xml_doc.getElementsByTagName('user')
label1 = "USER"


def text_or(person, tag, default):
    # Return the text of the first <tag> child, or a default when it is empty.
    child = person.getElementsByTagName(tag)[0].firstChild
    return child.data if child else default


# Adding nodes
for person in persons:
    Id_User = text_or(person, "id", "NO ID")
    Name = text_or(person, "name", "NO NAME")
    Screen_name = text_or(person, "screen_name", "NO SCREEN_NAME")
    Location = text_or(person, "location", "NO Location")
    Description = text_or(person, "description", "NO description")
    Profile_image_url = text_or(person, "profile_image_url", "NO profile_image_url")
    Friends_count = text_or(person, "friends_count", "NO friends_count")
    URL = text_or(person, "url", "NO URL")

    node1 = Node(label1, ID_USER=Id_User, NAME=Name, SCREEN_NAME=Screen_name,
                 LOCATION=Location, DESCRIPTION=Description,
                 Profile_Image_Url=Profile_image_url,
                 Friends_Count=Friends_count, URL=URL)
    graph.merge(node1)



My problem is that when I run the code, it takes a very long time to import this file, almost a week. If anyone can help me import the data faster than that, I would be very grateful.

NB: My laptop configuration: 4 GB RAM, 500 GB hard disk, i5.



Santiago Videla

Jul 1, 2016, 11:49:34 AM
to ne...@googlegroups.com

Michael Hunger

Jul 1, 2016, 12:22:24 PM
to ne...@googlegroups.com, Nigel Small
I think you should use a streaming parser; otherwise you may run out of memory on the Python side as well.
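For example, something along these lines with xml.etree.ElementTree.iterparse from the standard library (a minimal sketch; the tag names are just taken from your script and may need adjusting to the real file):

import xml.etree.ElementTree as ET

def iter_users(path):
    # Yield one dict of properties per <user> element; each element is
    # cleared after use so memory stays flat instead of holding the whole DOM.
    for event, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == "user":
            props = {}
            for tag in ("id", "name", "screen_name", "location", "description",
                        "profile_image_url", "friends_count", "url"):
                child = elem.find(tag)
                if child is not None and child.text:
                    props[tag] = child.text
            yield props
            elem.clear()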

Also I recommend doing transactions in Neo4j with batches of 10k to 100k updates per transaction.

Don't store the "NO xxxx" placeholder fields; just leave them off. They are a waste of space and effort.

I don't know how merge(node) works. I recommend creating a unique constraint on :User(userId) and using a Cypher query like this:

UNWIND {data} AS row
MERGE (u:User {userId: row.userId}) ON CREATE SET u += row

where the {data} parameter is a list (e.g. 10k entries) of dictionaries with the properties.
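Putting the batching and the query together, a rough sketch (assuming py2neo v3, where Graph.run accepts query parameters, and an iterable of property dicts such as the iterparse sketch above; the constraint only needs to be created once):

from py2neo import Graph, authenticate

authenticate("localhost:7474", "neo4j", "FCBFAR123")
graph = Graph()

# One-off: a unique constraint so that MERGE on userId is an index lookup.
graph.run("CREATE CONSTRAINT ON (u:User) ASSERT u.userId IS UNIQUE")

QUERY = """
UNWIND {data} AS row
MERGE (u:User {userId: row.id})
ON CREATE SET u += row
"""

def import_users(users, batch_size=10000):
    # Send the users in batches of batch_size rows per request/transaction.
    batch = []
    for props in users:
        batch.append(props)
        if len(batch) == batch_size:
            graph.run(QUERY, data=batch)
            batch = []
    if batch:
        graph.run(QUERY, data=batch)

# e.g. import_users(iter_users("User_profilesL2T1.xml"))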

Michael

mourchid youssef

Jul 4, 2016, 11:33:32 AM
to Neo4j
Hi Friend,

They propose in this link to convert the XML file into CSV and then import it into Neo4j. I tried this solution, but it takes a long time (a week) to convert the XML to CSV!
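For what it's worth, if the conversion is slow because of DOM parsing, a streaming conversion should avoid that bottleneck. A rough sketch with iterparse and the csv module (assuming Python 3; the tag, field, and input file names are taken from the script above, and the output file name is made up):

import csv
import xml.etree.ElementTree as ET

FIELDS = ["id", "name", "screen_name", "location", "description",
          "profile_image_url", "friends_count", "url"]

with open("users.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for event, elem in ET.iterparse("User_profilesL2T1.xml", events=("end",)):
        if elem.tag == "user":
            row = {}
            for tag in FIELDS:
                child = elem.find(tag)
                row[tag] = child.text if child is not None and child.text else ""
            writer.writerow(row)
            elem.clear()  # free the subtree we just wrote out

The resulting CSV could then be loaded with LOAD CSV (with USING PERIODIC COMMIT) or the neo4j-import tool, which are designed for bulk loads.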