INSERT DATA performance

r...@swirrl.com

Jun 16, 2015, 9:32:39 AM
to sta...@clarkparsia.com
I've noticed that INSERT DATA is much slower than an INSERT ... WHERE (copying data from one graph to another). Do you have any advice on how to improve the performance of the INSERT DATA update statement?

For example, a statement like this:

INSERT DATA {
   GRAPH <http://graph> {
       [ 10 thousand triples of new data here ]
   }
}

...takes 5 to 10 seconds.

But a statement like this:

INSERT {
  GRAPH <http://graph> {
    ?s ?p ?o
  }
} WHERE {
  SELECT ?s ?p ?o WHERE {
    GRAPH <http://anothergraph> {
      ?s ?p ?o
    }
  } LIMIT 10000 OFFSET 0
}

...takes <1 second.

Background:
~1.2 billion quads total database size
4 GB heap, 7 GB direct memory
8 cores

Evren Sirin

Jun 16, 2015, 10:00:16 AM
to Stardog
There are several differences between the two things you are
comparing. As the second example shows, actual insertion is fast, but
the first example does more than that: there is the time it takes to
send the data from the client to the server, parse the query string,
and perform dictionary encoding for the values in the data. The second
example does none of these and instead reads 10K already-encoded
triples from the index directly (possibly from the cache). But even
with these differences, the time difference between the two queries is
quite high. Is it still that slow if you save the 10K triples in an
RDF file and run the command `data add -g http://graph myDb data.ttl`?
If the client and server are running on the same machine, you can also
try `LOAD <file:[path]/data.ttl> INTO GRAPH <http://graph>`.
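
For reference, a quick sketch of those two suggestions spelled out. The database name `myDb` is taken from the command above; the graph URI and the file path /path/to/data.ttl are placeholders, not values from this thread:

# CLI bulk add: parse data.ttl on the Stardog side and add it to the
# named graph <http://graph> of the database myDb
stardog data add -g http://graph myDb data.ttl

# SPARQL Update alternative, usable when the file is readable from the
# server's filesystem
LOAD <file:///path/to/data.ttl> INTO GRAPH <http://graph>

Either way, the 10K triples are read as RDF rather than being embedded in a SPARQL query string that the server has to parse as part of the update.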

Best,
Evren

r...@swirrl.com

Jun 16, 2015, 10:40:01 AM
to sta...@clarkparsia.com
The timings I supplied were Stardog execution time only (from the Queries pane of the Stardog web console), so they don't include network time.

...But I'll try using `data add` and get back to you.

r...@swirrl.com

Jun 16, 2015, 11:23:30 AM
to sta...@clarkparsia.com, r...@swirrl.com

Hi.

Here is the output from running `stardog data add`:

Loading 10,000 triples into a graph with 35 million existing triples:

> Adding data from file: 10k.nt
> Added 10,000 triples in 00:00:19.600

And loading the same file into an empty graph:

> Adding data from file: 10k.nt
> Added 10,000 triples in 00:00:00.604

Loading 100,000 triples into a graph with 35 million existing triples:

> Adding data from file: 100k.nt
> Added 100,000 triples in 00:02:28.885

And loading that same file into an empty graph:

> Adding data from file: 100k.nt
> Added 100,000 triples in 00:00:01.696

Thanks, Ric

Evren Sirin

Jun 16, 2015, 3:44:20 PM
to Stardog, r...@swirrl.com
The number of triples in the destination graph would not slow things
down. The size of the database and the size of the update are the
important factors. When the database is very large, as in your case,
the number of new terms (URIs and literals) in the additions becomes
an important factor, especially when the system is not warm. The
second add command in your example completes very fast because all the
terms in that file have already been encoded by the first add command.
When the system is warmed up, the time difference between adding new
terms and adding existing terms should decrease. We created issue
#2352 to investigate whether we can improve this behavior and help the
system reach a warm state more quickly.
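
As a rough illustration of this effect (a sketch only: it assumes the database is still called `myDb`, reuses the 10k.nt file from the earlier test, and the two graph URIs are made up), loading the same file into two different graphs should show the encoding cost only on the first add:

# First add: the terms in 10k.nt are new to the dictionary and must be encoded
time stardog data add -g http://graph-a myDb 10k.nt

# Second add: the same terms are already encoded, so only the quads are written
time stardog data add -g http://graph-b myDb 10k.nt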

Best,
Evren