Incremental Data Load in Neo4j DB From Hive

Pranab Banerjee

Jan 27, 2016, 5:32:15 AM

to Neo4j

Hi

This is regarding incremental data ingestion (modified and newly added rows) into Neo4j from a Hive data source.

We need to load the ongoing "ChangeOnly/New" data from a source table in Hive into the Neo4j DB:

1. If a node already exists in the Neo4j DB, only update that specific node's data.

2. If a node doesn't exist in the Neo4j DB, add that specific node's data as a new node.

Can you please suggest an effective solution for data at scale (~5 million rows per day)?

Thanks

Pranab

Michael Hunger

Jan 27, 2016, 6:21:05 AM

to ne...@googlegroups.com, David Fauth

If you have a timestamp or other flag in Hive that marks the data as "new", you can use a plain SELECT statement to fetch just those rows.
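As a rough sketch of the timestamp approach, the helper below builds such a SELECT from a "last loaded" watermark. The table name `source_table`, the column name `last_modified_dt`, and the function `build_incremental_query` are hypothetical, and the sketch assumes the column holds a timestamp or a `yyyy-MM-dd HH:mm:ss` string so that the comparison orders correctly.

```python
from datetime import datetime


def build_incremental_query(table, ts_column, last_loaded):
    """Return a HiveQL SELECT for rows modified after the last_loaded watermark.

    table and ts_column are assumed to come from trusted configuration,
    not from user input, since they are interpolated into the query text.
    """
    watermark = last_loaded.strftime("%Y-%m-%d %H:%M:%S")
    return f"SELECT * FROM {table} WHERE {ts_column} > '{watermark}'"


# Hypothetical names; the watermark would normally be persisted after each successful load.
query = build_incremental_query(
    "source_table", "last_modified_dt", datetime(2016, 1, 25, 9, 22, 4)
)
print(query)
```

After each successful load, the watermark would be advanced to the largest timestamp seen in that batch.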

In general you'd use MERGE with parameters for that:

MERGE (n:Label {id:{id}}) ON CREATE SET n.foo = {foo}, n.bar = {bar}

or, if you always want to update the properties:

MERGE (n:Label {id:{id}})

SET n.foo = {foo}, n.bar = {bar}

For the actual run, there are several options.

You can export the SELECT results to CSV (or make that CSV available via HTTP) and use LOAD CSV:

LOAD CSV WITH HEADERS FROM "URL" as row

MERGE (n:Label {id:row.id}) ON CREATE SET n.foo = row.foo, n.bar = row.bar

or even

LOAD CSV WITH HEADERS FROM "URL" as row

MERGE (n:Label {id:row.id}) ON CREATE SET n += row

Or pass all rows of the batch in as a parameter, e.g. a list of maps like {id: id, data: {col1: value, col2: value}}:

UNWIND {rows} as row

MERGE (n:Label {id:row.id}) ON CREATE SET n += row.data
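One way to prepare the `rows` parameter on the client side is sketched below. It only shapes plain dictionaries into the {id, data} form used by the UNWIND statement above and splits them into batches; the function name `to_batches` and the default batch size are assumptions, and sending each batch to Neo4j is left to whatever driver is in use.

```python
def to_batches(rows, key="id", batch_size=10000):
    """Yield parameter maps shaped like {"rows": [{"id": ..., "data": {...}}, ...]}.

    batch_size is an arbitrary illustrative default, not a recommended value.
    """
    batch = []
    for row in rows:
        # Everything except the key column goes into the "data" map.
        data = {k: v for k, v in row.items() if k != key}
        batch.append({"id": row[key], "data": data})
        if len(batch) == batch_size:
            yield {"rows": batch}
            batch = []
    if batch:
        # Emit the final, possibly smaller, batch.
        yield {"rows": batch}


rows = [
    {"id": 1, "foo": "a", "bar": "b"},
    {"id": 2, "foo": "c", "bar": "d"},
    {"id": 3, "foo": "e", "bar": "f"},
]
batches = list(to_batches(rows, batch_size=2))
print(len(batches))
```

Each yielded map would be passed as the parameters of one UNWIND statement, so one transaction handles one batch.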

Michael


Pranab Banerjee

Jan 27, 2016, 6:56:14 AM

to ne...@googlegroups.com, michael...@neotechnology.com, David Fauth

Thank you, Michael.

I am new to Neo4j, and as far as I know MERGE will not partially use existing patterns; it's all or nothing. That means either the whole pattern is matched or the whole pattern is created.

For example, with data like the below, I am not sure whether MERGE will be effective.

One-time load (/user/home/onetime.csv):

Id  field1  field2  field3  last_modified_dt
==  ======  ======  ======  ===================
1   abcd    efgh    12      2016-01-25 09:22:03  <-(New entry)
2   mnop    efgh    14      2016-01-25 09:22:04  <-(New entry)

After loading the above one-time data into the respective nodes, we received the incremental data below.

Incremental load (/user/home/incremental.csv):

Id  field1  field2  field3  last_modified_dt
==  ======  ======  ======  ===================
2   txyz    efgh    18      2016-01-27 09:48:03  <-(modified data)
3   hijk    octu    17      2016-01-27 09:49:00  <-(New entry)

I'd appreciate any suggestions on this. Thanks in advance.

Pranab

Michael Hunger

Jan 27, 2016, 7:22:41 AM

to Pranab Banerjee, ne...@googlegroups.com, David Fauth

The "pattern" for MERGE in this case is just label + id, so that is exactly what you want.

You would use this one then:

MERGE (n:Label {id:{id}})
SET n.foo = {foo}, n.bar = {bar}

or, for the row-based variant:

LOAD CSV WITH HEADERS FROM "URL" as row
MERGE (n:Label {id:row.id})

SET n += row
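To make the effect concrete, here is a small simulation of that MERGE-then-SET behavior on the sample rows from the question, using a plain dictionary keyed by id in place of the database. It is only an illustration of the upsert semantics, not Neo4j code; `merge_and_set` is a made-up helper, the column name is lowercased to `id` to match `row.id`, and values are kept as strings because LOAD CSV reads every field as a string.

```python
def merge_and_set(store, row):
    """Mimic MERGE (n:Label {id: row.id}) SET n += row on a dict keyed by id."""
    node = store.setdefault(row["id"], {})  # MERGE: find the node by id or create it
    node.update(row)                        # SET n += row: add or overwrite properties
    return node


onetime = [
    {"id": "1", "field1": "abcd", "field2": "efgh", "field3": "12",
     "last_modified_dt": "2016-01-25 09:22:03"},
    {"id": "2", "field1": "mnop", "field2": "efgh", "field3": "14",
     "last_modified_dt": "2016-01-25 09:22:04"},
]
incremental = [
    {"id": "2", "field1": "txyz", "field2": "efgh", "field3": "18",
     "last_modified_dt": "2016-01-27 09:48:03"},
    {"id": "3", "field1": "hijk", "field2": "octu", "field3": "17",
     "last_modified_dt": "2016-01-27 09:49:00"},
]

store = {}
for row in onetime + incremental:
    merge_and_set(store, row)

print(sorted(store))
```

Node 1 is left as loaded, node 2 is updated in place, and node 3 is created, which matches the two requirements in the original question.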

There would also be an option where you first check which fields have changed and only update those, but I'm not sure it's worth it.

Something like:

LOAD CSV WITH HEADERS FROM "URL" as row

MERGE (n:Label {id:row.id})

FOREACH (key IN filter(key IN keys(row) WHERE row[key] <> n[key]) | SET n[key] = row[key] )
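The same "only touch what changed" idea can also be sketched outside the database, for instance when preparing update batches. The helper below compares an incoming row against the currently stored properties and keeps only the differing fields; `changed_fields` is a hypothetical name, and this version treats a property that is missing on the existing node as changed, which is a choice made for this sketch.

```python
def changed_fields(existing, incoming):
    """Return only the incoming properties whose values differ from the existing ones."""
    return {k: v for k, v in incoming.items() if existing.get(k) != v}


# Row with Id 2 from the sample data, before and after the incremental load.
existing = {"id": "2", "field1": "mnop", "field2": "efgh", "field3": "14",
            "last_modified_dt": "2016-01-25 09:22:04"}
incoming = {"id": "2", "field1": "txyz", "field2": "efgh", "field3": "18",
            "last_modified_dt": "2016-01-27 09:48:03"}

delta = changed_fields(existing, incoming)
print(delta)
```

Here `field2` and `id` are unchanged, so only `field1`, `field3`, and `last_modified_dt` end up in the delta.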