Newbie question


Jeff Mixter

Apr 3, 2014, 10:05:59 AM
to 4store-...@googlegroups.com
I have been working with RDF for a while and have been using 4store to manage and query small numbers of triples (~100,000) for about 4 months. I am now looking to increase my testing data, but I am having trouble finding documentation that clearly outlines how to optimize 4store to handle ~500,000,000 triples. I am using an AWS EC2 instance with 8 cores and 30 GB of RAM. My main problem is that I do not understand how to optimize either the storage of the triples or the means to access and query them.

Having read through multiple posts, I understand that increasing the number of clusters can distribute the load, but I am not sure if/how to install 4store on multiple clusters if it is on one machine. I have also read that you can set up multiple backends, but the documentation seemed to indicate that you need multiple HTTP addresses to do so, and again I was not sure how to do this using one machine. Finally, if I can get multiple clusters set up (with multiple nodes), do I need to segment the triples prior to uploading them, or will 4store manage that for me? In summary, my main questions are:

1) What optimizations can be done using a single AWS instance?

2) How do I enable such optimizations on a single AWS instance?

3) How do I distribute a large number of triples across multiple storage containers and still have a single SPARQL endpoint that will query the whole set?

Sorry for the amateur questions... I am relatively new not only to 4store but also to Linux in general.

swh

Apr 3, 2014, 10:19:50 AM
to 4store-...@googlegroups.com
The main way to optimise is to set the number of "segments".

Each segment has its own process, and has its own subdirectory.

In general one segment per core is a good choice.

If you want to spread the IO across different storage areas, you can move the segment directory (e.g. 0002 in /var/lib/4store/KBNAME/) and symlink it into the real location.
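A minimal sketch of the move-and-symlink step described above, assuming the backend for the KB is stopped first. The helper name `relocate_segment` and the `/mnt/ssd1` mount point are hypothetical and not part of 4store; pass an absolute target path so the symlink resolves from anywhere.

```shell
#!/usr/bin/env bash
# Move one segment directory to another storage area and leave a symlink
# behind, so the segment is still reachable under the KB directory.
# relocate_segment is a hypothetical helper, not a 4store command.
relocate_segment() {
    local kb_dir="$1"    # e.g. /var/lib/4store/KBNAME
    local segment="$2"   # e.g. 0002
    local target="$3"    # absolute path, e.g. /mnt/ssd1/KBNAME (assumed)

    if [ -L "$kb_dir/$segment" ]; then
        echo "already a symlink: $kb_dir/$segment" >&2
        return 1
    fi
    if [ ! -d "$kb_dir/$segment" ]; then
        echo "no such segment directory: $kb_dir/$segment" >&2
        return 1
    fi
    if [ -e "$target/$segment" ]; then
        echo "target already exists: $target/$segment" >&2
        return 1
    fi

    mkdir -p "$target" || return 1
    mv "$kb_dir/$segment" "$target/$segment" || return 1
    ln -s "$target/$segment" "$kb_dir/$segment"
}

# Usage (with the KB's backend stopped):
#   relocate_segment /var/lib/4store/KBNAME 0002 /mnt/ssd1/KBNAME
```

The checks refuse to run twice on the same segment, which avoids nesting a directory inside an earlier copy.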

The main performance benefit you can get is backing it with SSDs; AWS offers that as an option.

I've loaded over 250M triples into a low-powered Linux box with a single SSD, so it makes a significant difference.
http://steveharris.tumblr.com/post/2781566381/want-to-store-hundreds-of-megatriples-on-lowend

Jeff Mixter

Apr 3, 2014, 10:57:40 AM
to 4store-...@googlegroups.com
Thanks for the quick response. The SSD suggestion makes perfect sense, and thanks for the blog link. Not to bother you with more amateur questions, but I have not been able to find any good documentation (or at least documentation that I can decipher) that explains the benefit of having multiple 4store backends. Additionally, I cannot find any good documentation on how to start multiple backends, where one handles just the SPARQL queries and the others store the triples. I assume that you need multiple machines for this (in my case, multiple AWS instances), but again I have found seemingly contradictory information online.

Kevin Ford

Apr 3, 2014, 11:22:15 AM
to 4store-...@googlegroups.com
Hi Jeff,

I assume when you say multiple 4store backends you are describing a
situation where you are clustering 4store across several machines? In
that case, the benefit comes from being able to leverage the memory and
processor speed of each machine.

That may not be necessary in your case (and a good many more cases).

Have you tried setting up your store with something like:

4s-backend-setup --node 0 --cluster 1 --segments 8 mystore

?

That sets up 4store on a single machine with 8 segments (one per core,
as mentioned by Steve). I've noticed that when I forget to use more
than 2 segments (the 4store default) for a big dataset there can be a
big performance hit.
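A small sketch of the one-segment-per-core rule, assuming the `4s-backend-setup` binary name used later in this thread. The helper `setup_command` is hypothetical and `mystore` is a placeholder KB name. It only prints the command so it can be reviewed before anything is run.

```shell
#!/usr/bin/env bash
# Print (rather than run) a setup command with one segment per core.
# setup_command is a hypothetical helper; "mystore" is a placeholder.
setup_command() {
    local kb="$1"
    # Second argument overrides the detected core count;
    # nproc comes from GNU coreutils.
    local cores="${2:-$(nproc)}"
    echo "4s-backend-setup --node 0 --cluster 1 --segments $cores $kb"
}

setup_command mystore 8
# prints: 4s-backend-setup --node 0 --cluster 1 --segments 8 mystore
```

Calling `setup_command mystore` with no second argument uses the core count of the machine it runs on.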

Yours,
Kevin
> --
> You received this message because you are subscribed to the Google
> Groups "4store-support" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to 4store-suppor...@googlegroups.com
> <mailto:4store-suppor...@googlegroups.com>.
> To post to this group, send email to 4store-...@googlegroups.com
> <mailto:4store-...@googlegroups.com>.
> Visit this group at http://groups.google.com/group/4store-support.
> For more options, visit https://groups.google.com/d/optout.

Jeff Mixter

Apr 3, 2014, 12:24:44 PM
to 4store-...@googlegroups.com
Kevin,

Your assumption about the clustering was correct. I have played around with changing the number of segments but have not applied it to a massive dataset yet. One problem I have noticed, even with a medium-sized dataset, is selecting the correct "soft limit" (which I gather is the amount of time in milliseconds before the request times out). The default is 1000, but I have seen that even on a small dataset (~80,000 triples), simple queries like asking for the total number of triples will time out and give you a false answer. Are there any guidelines that suggest what the "soft limit" parameter should be, based on the number of triples or the size of the dataset?

Kevin Ford

Apr 3, 2014, 12:40:29 PM
to 4store-...@googlegroups.com
I don't know about any guidelines, but I routinely set the soft limit to
"-1", which disables it. With the soft limit removed, all results to
the query will be returned (or it might be more accurate to say all the
data will be tested). Be careful of greedy queries, but it works when
you have a large dataset and you know there is a reasonably finite
answer within the entire dataset.

Counting distinct ?s ?p ?o probably qualifies as a greedy query. :)
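A sketch of where the "-1" can be set, printed as commands for review rather than executed. It assumes that both `4s-httpd` and `4s-query` accept `-s <limit>` for the soft limit; check the `-h` output of your build before relying on that. The KB name, the port, and the helper names are placeholders.

```shell
#!/usr/bin/env bash
# Two places the soft limit might be disabled, shown as dry-run output.
# Assumption: 4s-httpd and 4s-query both take "-s <limit>", and -1
# disables the limit. Verify with `4s-httpd -h` and `4s-query -h`.
KB="mystore"   # placeholder KB name
COUNT_QUERY='SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }'

# Server-wide default for every query sent to the HTTP endpoint.
httpd_command() {
    echo "4s-httpd -p 8080 -s -1 $KB"
}

# One-off override for a single command-line query.
query_command() {
    echo "4s-query -s -1 $KB '$COUNT_QUERY'"
}

httpd_command
query_command
```

Setting it per query keeps the default in place for everything else, which limits the damage a greedy query can do.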

Yours,
Kevin

Jeff Mixter

Apr 4, 2014, 10:48:50 AM
to 4store-...@googlegroups.com
Kevin,

Thanks for the suggestions. One last question: do you know how large a 4store index is relative to the total size of the triple file that is uploaded? Is it the same size, smaller, or larger? I am planning on uploading an .nt file that is ~60 GB, and I was not sure how much space I need to allocate on the drive.

Kevin Ford

Apr 4, 2014, 11:28:39 AM
to 4store-...@googlegroups.com
> Do you know how large
> a 4store index is relative to the total size of the triple file
> that is
> uploaded.
-- I haven't a clue. I'm not at work today, otherwise I could give you
a sporting answer.
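Since nobody in the thread gives a ratio, one option is to measure it: load a small sample of the file, check the size of the KB directory, and scale up. The sketch below only does the scaling arithmetic. The helper name `estimate_store_kb` is hypothetical, and it assumes the store grows roughly linearly with input size, which may not hold, so treat the result as a rough lower bound and leave headroom.

```shell
#!/usr/bin/env bash
# Extrapolate the on-disk store size from a sample load (rough estimate).
# estimate_store_kb is a hypothetical helper; all sizes are in KB.
estimate_store_kb() {
    local sample_input_kb="$1"   # size of the sample .nt file
    local sample_store_kb="$2"   # size of the KB directory after loading it
    local full_input_kb="$3"     # size of the full .nt file

    if [ "$sample_input_kb" -le 0 ]; then
        echo "sample input size must be positive" >&2
        return 1
    fi
    # Integer arithmetic; multiply first to avoid losing precision.
    echo $(( sample_store_kb * full_input_kb / sample_input_kb ))
}

# Measure the inputs with du, for example:
#   du -sk sample.nt
#   du -skL /var/lib/4store/mystore   # -L follows symlinked segments
#   du -sk full.nt
```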

Yours,
Kevin

Jeff Mixter

Apr 11, 2014, 11:31:30 AM
to 4store-...@googlegroups.com
Kevin,

I have another newbie question for you. How exactly do you set up different nodes? I read through this documentation https://github.com/garlik/4store/blob/master/src/admin/README.md but it was not clear to me exactly what you need to do on each node. My interpretation was that I can just list the node names (i.e. 4s-backend-setup 'node name') and it should work. Do I need to do anything on the nodes that are listed there? For example, if I run the backend-setup on Test and then run the 4s-boss command (which lists Test, Test2, and Test3 as nodes), do I need to do anything with nodes Test2 and Test3?

Thanks,

Jeff

Kevin Ford

Apr 11, 2014, 12:26:13 PM
to 4store-...@googlegroups.com
In short, I don't know. We have a couple of 4store stores clustered
across a couple of machines, but we managed this via the 4s-cluster-*
method, which involves ssh keys. I've never used the 4s-admin/4s-boss
commands.

That said, looking at the README reminded me of something. My memory is
vague but at one point we had to figure out which port 4store was trying
to use to make sure it was reachable. The README talks about 4s-boss
using port 6733. I suspect you've got that sorted out if 4s-boss can
report the nodes (test1, test2, test3) but I thought I would mention it.

I did note that in this listserv post

https://groups.google.com/forum/#!topic/4store-support/X-Xr1mCsCCg

Manuel suggests the '4s-admin create-store storename' command as a way
to set up the backends on the clustered nodes. (It is followed by
'4s-admin start-stores storename'.) This is not mentioned in the
README. You mentioned using "4s-backend-setup."

On this last point, you specifically said:

> My
> interpretation was that I can just list the node names (i.e.
> 4s-backend-setup 'node name') and it should work.

"node name" is not accurate. It should be the name of the store, not
the node. "node name" - I think - refers to the hostnames or IP
addresses of the machines in the cluster and, if using 4s-boss, would be
set in the /etc/4store.conf file.

So, on each node, 'killall 4s-*' and run 4s-boss. Then, on the master
node, create the appropriate /etc/4store.conf file. Then, still on the
master node, run:

4s-admin create-store viaf
4s-admin start-stores viaf
4s-httpd -p 8080 viaf

HTH,
Kevin