Performance issues of import operation.


maxteneff

Feb 22, 2013, 4:37:15 AM
to dex...@googlegroups.com
Hello!
I have a graph with 3M nodes and 10M edges and want to import it into DEX GraphDB.
However, there are multiple edges between some pairs of nodes.
So, instead of importing all the edges, I want to count the duplicate edges between two nodes with a special counter attribute on a single edge.
Code example:
long found_edge = graph.FindEdge(edge_type, a_node, b_node);
if (found_edge == Objects.InvalidOID)
{
    found_edge = graph.NewEdge(edge_type, a_node, b_node);
    graph.SetAttribute(found_edge, edge_attr_count_type, new Value().SetInteger(0));
}
else
{
    int get_count = graph.GetAttribute(found_edge, graph.FindAttribute(edge_type, "COUNT")).GetInteger();
    graph.SetAttribute(found_edge, edge_attr_count_type, val.SetInteger(get_count + 1));
}

Finally, my problem: the FindEdge operation takes too long. With it, my import runs 30 minutes, but without it (importing all edges) only 15 minutes!
Are these normal numbers? Or how can I optimize my import?

Cache size: 2048MB, each edge has only one additional attribute "COUNT".
I use DEXnet 4.7.0 (32bit version).
My hardware configuration: Intel Core i3, 4GB DDR3, Windows 7 Pro.

Thank you.

c3po.ac

Feb 22, 2013, 7:33:04 AM
to dex...@googlegroups.com

Hello,

Do the 10M edges include the multiple edges between the same nodes, or is that the final number of "unique" edges?
What's the average number of duplicates per unique edge?

If you have only a few duplicates, you are adding the cost of a FindEdge (which is expensive) to the NewEdge for a lot of edges. That could explain the performance problem.

But if you have a lot of duplicates, I guess the performance should not be that different: in that case a FindEdge/GetAttribute is added, but the NewEdge is avoided.


To optimize your code a little, you could try these:
  • Avoid doing a FindAttribute for each duplicated edge. You already have "edge_attr_count_type" precalculated somewhere; use it in the GetAttribute too.
  • You could also use the GetAttribute variant that fills a Value passed as an argument instead of returning a new one. That's a little better (probably not much) because you can reuse a single Value object for all your edges. The same Value object can be reused in the SetAttribute of a new edge too.

long found_edge = graph.FindEdge(edge_type, a_node, b_node);
if (found_edge == Objects.InvalidOID)
{
    found_edge = graph.NewEdge(edge_type, a_node, b_node);
    // Reuse the existing Value object ("val")
    graph.SetAttribute(found_edge, edge_attr_count_type, val.SetInteger(0));
}
else
{
    // Get the attribute into the existing "val" object, using the precalculated attribute type
    graph.GetAttribute(found_edge, edge_attr_count_type, val);
    int get_count = val.GetInteger();
    graph.SetAttribute(found_edge, edge_attr_count_type, val.SetInteger(get_count + 1));
}

You could also do some optimizations independent of DEX. For example, you could try sorting your edges first. Then you could just count the equal consecutive ones and do a single NewEdge / SetAttribute for each unique edge, without having to check its existence. But I don't know your data; that may not be possible or easy.
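To make the idea concrete, here is a minimal sketch (in Python, completely independent of the DEX API) of that pre-aggregation step: sort the edge list, count equal consecutive pairs, and emit one (source, target, count) triple per unique edge. The function and sample data are illustrative, not part of the DEX API.

```python
from itertools import groupby

def aggregate_edges(edges):
    """Collapse duplicate (a, b) pairs into (a, b, count) triples.

    After sorting, duplicates are consecutive, so each unique edge
    would need only a single NewEdge/SetAttribute during the import.
    """
    result = []
    for pair, group in groupby(sorted(edges)):
        result.append((pair[0], pair[1], sum(1 for _ in group)))
    return result

edges = [("n1", "n2"), ("n3", "n1"), ("n1", "n2"), ("n1", "n2")]
print(aggregate_edges(edges))
# [('n1', 'n2', 3), ('n3', 'n1', 1)]
```

With the counts precomputed, the import loop never calls FindEdge at all; it just sets the COUNT attribute from the triple.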

Best regards.

On Friday, February 22, 2013 at 10:37:15 AM UTC+1, maxteneff wrote:

maxteneff

Feb 22, 2013, 9:59:10 AM
to dex...@googlegroups.com
Thank you for your answer!

The 10M edges do include multiple edges.
My overall task is to import several similar files, each with 10M edges. Some files have more duplicate edges, others fewer.
The performance problem was detected in a file with only a few duplicates.

Your comments about optimizing my code are absolutely right.
I'll make these corrections and try to optimize my data to get better performance.

Best regards,
Max Bartenev.

On Friday, February 22, 2013 at 4:33:04 PM UTC+4, c3po.ac wrote:

maxteneff

May 12, 2013, 4:51:59 PM
to dex...@googlegroups.com
Hello again!

I am continuing to experiment with the import, and now I have the following problem:
I have a big graph with 10M nodes and 500M edges and am trying to import it into DEX GraphDB.
While the cache in RAM is not full, performance is normal, but once the cache fills up, performance drops sharply.
And the cache fills too fast! My cache size is now 24GB, and it is full after importing about 300M edges!
I think the problem is that the cache holds my whole graph, but for this import I only need the vertices, because I don't check edges for existence.

Import code:
Value val = new Value();
long a_node = graph.FindObject(node_id_type, val.SetString(a));
if (a_node == Objects.InvalidOID)
{
    a_node = graph.NewNode(node_type);
    graph.SetAttribute(a_node, node_id_type, val.SetString(a));
}

long b_node = graph.FindObject(node_id_type, val.SetString(b));
if (b_node == Objects.InvalidOID)
{
    b_node = graph.NewNode(node_type);
    graph.SetAttribute(b_node, node_id_type, val.SetString(b));
}

long new_edge = graph.NewEdge(edge_type, a_node, b_node);
graph.SetAttribute(new_edge, edge_linkid_type, val.SetString(data));

Is it possible to configure the cache for my purposes?

Cache size: 24GB
I use DEXnet 4.8.0 (64bit version).
My hardware configuration: Intel Xeon X5650, 36GB RAM DDR3, Windows Server 2008 64-bit.

Thank you.

c3po.ac

May 13, 2013, 5:11:56 AM
to dex...@googlegroups.com
Hi,

When you say "cache size" I assume you mean the DEX max cache size that you can set in DexConfig (or the config file).
24GB out of 36GB could be a good setting. But DEX needs more memory apart from the cache, and the file-system cache could also grow very large, because the load is probably reading big files and DEX, at some point, must write the persistent db information.

You could check the total memory usage on your computer while DEX is loading. I think that in this case 24GB may be too much cache memory; you could try setting the DEX max cache to 20GB or less. It's better to have less cache than to exceed the total physical memory of the computer.

About the cache contents: DEX tries to keep as much information as possible in the assigned space. When not all of the information fits, it keeps the most recently used data. You don't search the edges, but you are adding new edges and setting attributes on them, so the edge information is being used intensively and must be cached. Even if you stop using edges at some point, DEX will keep them cached until it needs the cache space.
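The retention behavior described above can be illustrated with a toy sketch (this is not DEX internals, just the general least-recently-used idea): a cache that evicts the least recently used entry ends up keeping whatever is touched last, which during this load is the edge data. The class and page names are hypothetical.

```python
from collections import OrderedDict

class LRUCache:
    """Toy page cache that evicts the least recently used entry."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()

    def touch(self, page):
        # Re-inserting moves the page to the "most recently used" end.
        self.pages.pop(page, None)
        self.pages[page] = True
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)  # evict the least recently used page

cache = LRUCache(capacity=3)
# Node pages are touched early; edge pages are touched later in the load.
for page in ["node-1", "node-2", "edge-1", "edge-2", "edge-3"]:
    cache.touch(page)
print(list(cache.pages))
# ['edge-1', 'edge-2', 'edge-3']  -- the node pages were evicted first
```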

Best regards.


On Sunday, May 12, 2013 at 10:51:59 PM UTC+2, maxteneff wrote:

maxteneff

May 13, 2013, 8:10:37 AM
to dex...@googlegroups.com
I tried 2GB, 8GB, 16GB and 20GB caches. Every time, after importing part of the graph (100M edges with 8GB, or 250M edges with 20GB), performance drops sharply.
The drop happens immediately after the cache becomes full, and of course I still have plenty of free RAM.

And one more question.
Now I do all import operations on a single open Database object.
Previously I tried opening the database before each import and closing it afterwards,
and the database close operation took too long, even when I had inserted only a few edges (less than 10k).

On Monday, May 13, 2013 at 1:11:56 PM UTC+4, c3po.ac wrote:

c3po.ac

May 13, 2013, 10:25:48 AM
to dex...@googlegroups.com
If you are only loading new data, then once the cache becomes full, every piece of new information implies writing to disk, so disk writes become the bottleneck.
The same happens when you close the database after loading or updating: all data not previously written to disk must be written when the db is closed.

On Monday, May 13, 2013 at 2:10:37 PM UTC+2, maxteneff wrote:

damaris

May 13, 2013, 11:02:07 AM
to
Hello Max,

If you think we could help you optimize your specific data, do not hesitate to send your data schema to dam...@sparsity-technologies.com or in...@sparsity-technologies.com; we can take a look and may give you some tips!

Best,

Damaris