batch pipeline integration and batch import

Martin Neumann

16 Apr 2014, 03:54:22
to ne...@googlegroups.com
Hej everyone


I want to build the following pipeline (I'm prototyping right now):

Description:
Data -> Map/Reduce Giraph Pipeline -> graph -> Neo4j -> application

The system would run nightly, rebuilding/replacing the graph. It would then be dumped into a graph DB so that the application layer can query it.
The graph currently has 20 million vertices and 200 million edges, in edge-list format with string vertex IDs and key/value pair data on the edges (one of them is the edge type). The application layer only reads from that graph.

Here my questions:
1. Is Neo4j the right tool for the job? (I have no updates, no transactions but lots of queries)
2. What is the best way to import the data into Neo4j (I have heard the batch import can be slow for large data, and this would be a bottleneck)
3. Is there a simple online query tool I can hand to the application developer to "browse" the graph?

cheers Martin

Michael Hunger

16 Apr 2014, 06:23:12
to ne...@googlegroups.com
1. Yes, that's what Neo4j is built for.
2. Actually the batch-import is fast even for large datasets.
3. How about the Neo4j-Browser, which comes with Neo4j out of the box? See this video for an example: https://www.youtube.com/watch?v=qbZ_Q-YnHYo
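
For reference, a minimal sketch of the embedded BatchInserter API (Neo4j 2.x) that the batch importer builds on; the store path, label, and property names below are just placeholders, not anything from your setup:

import java.util.HashMap;
import java.util.Map;

import org.neo4j.graphdb.DynamicLabel;
import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.Label;
import org.neo4j.unsafe.batchinsert.BatchInserter;
import org.neo4j.unsafe.batchinsert.BatchInserters;

public class BatchInsertSketch {
    public static void main(String[] args) {
        // Non-transactional bulk writer; the store must not be in use elsewhere.
        BatchInserter inserter = BatchInserters.inserter("target/graph.db");
        try {
            Label node = DynamicLabel.label("Node");
            // Deferred schema index: populated when the store first starts up.
            inserter.createDeferredSchemaIndex(node).on("name").create();

            Map<String, Object> props = new HashMap<String, Object>();
            props.put("name", "1");
            long a = inserter.createNode(props, node);
            props.put("name", "2");
            long b = inserter.createNode(props, node);

            Map<String, Object> edgeProps = new HashMap<String, Object>();
            edgeProps.put("weight", 25);
            inserter.createRelationship(a, b,
                    DynamicRelationshipType.withName("CONNECTS"), edgeProps);
        } finally {
            inserter.shutdown(); // writes and flushes the store files
        }
    }
}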

Cheers

Michael




Martin Neumann

23 Apr 2014, 09:06:13
to ne...@googlegroups.com, michael...@neopersistence.com
I tried to batch-import some mock-up data just to test how long this would take. The file structure I use is as follows:

The vertex file:
i:id name l:label
1 1 label
2 2 label
...

Edge file:
start end weight
1 2 25
...

In the real data each vertex has a string ID (and maybe a long DB id) and needs to be indexed on those as well.
Would it be faster to map to long IDs first (the DB ids) before loading, or can I somehow use the string IDs directly?
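
To make that concrete, the kind of in-memory mapping I'm prototyping looks roughly like this (class and property names are just placeholders, and it assumes all ~20M string IDs fit in heap):

import java.util.HashMap;
import java.util.Map;

import org.neo4j.unsafe.batchinsert.BatchInserter;

public class IdMapping {
    // External string vertex ID -> internal long node id from the inserter.
    // ~20M entries should fit in a few GB of heap.
    private final Map<String, Long> idMap = new HashMap<String, Long>(30000000);

    long nodeFor(BatchInserter inserter, String stringId) {
        Long nodeId = idMap.get(stringId);
        if (nodeId == null) {
            Map<String, Object> props = new HashMap<String, Object>();
            props.put("id", stringId); // keep the original ID as a property
            nodeId = inserter.createNode(props);
            idMap.put(stringId, nodeId);
        }
        return nodeId;
    }
}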

Finally, with the current setup I can get 15 million nodes and 150 million edges in ~2 hours. How can I speed this up further? The real data will be slower to load since it has more properties and needs to be indexed. Is it possible to distribute this between machines?

thanks for the help
Martin

Martin Neumann

23 Apr 2014, 14:41:45
to ne...@googlegroups.com, michael...@neopersistence.com
I spent some more time exploring, and it seems a lot of what I had been told by co-workers is no longer true or is outdated. So I think I should rephrase the last question.

I currently have a list of edges; nodes are represented as string IDs, and each edge has a label plus one key/value pair of data where the value is a list of strings. The whole thing needs to be indexed on the string IDs of the nodes. The data is stored as .deflate files in HDFS.

So the real question is: what is the fastest way to turn that dataset into a Neo4j database? Since the whole thing was created by MapReduce, I can extend the pipeline to create other formats if beneficial.
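
For context, reading the .deflate files themselves is not the hard part; a rough sketch with plain java.util.zip, assuming the part files are first copied out of HDFS (the file name is a placeholder):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.zip.InflaterInputStream;

public class EdgeListReader {
    public static void main(String[] args) throws IOException {
        // Assumes the part file was copied out of HDFS (e.g. hadoop fs -get)
        // and is a zlib/deflate stream, which InflaterInputStream understands.
        BufferedReader in = new BufferedReader(new InputStreamReader(
                new InflaterInputStream(new FileInputStream("part-r-00000.deflate")),
                "UTF-8"));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                String[] cols = line.split("\t");
                // cols[0], cols[1]: source/target string IDs;
                // remaining columns: edge label and key/value data.
                // Feed these into the BatchInserter via the id map above.
            }
        } finally {
            in.close();
        }
    }
}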

cheers Martin