Fuseki or cumulusRDF?


Enayat

Apr 25, 2014, 6:10:01 AM
to cumulus...@googlegroups.com
As you probably know, Jena Fuseki is a very nice triple store that works
fine even with large amounts of data. Using the recent TDB commands of Jena,
you can load a large dump into Fuseki, so we intend to use Fuseki
in our project.
Recently, we received a notification that you are providing a
high-performance approach on cloud-based infrastructure, and I saw that you
provide a SPARQL endpoint on top. So we were curious about your work:
we face large data dumps, and it would be good to have some
indication that your approach works well in this regard. However, we
had some difficulty running and trying it with a simple dump, and it seems to
rely on the Cassandra platform, which we are not familiar with. In any case, it
seems we have to work on it a little more. Just tell me: if I run
Cassandra on my machine, will I be able to run the tool, or is there some
configuration I should do in your tool (e.g., connection settings, ...)?

Andrea Gazzarini

Apr 25, 2014, 6:41:35 AM
to cumulus...@googlegroups.com
Hi Enayat,
thanks for posting.

Let me say a few words about CumulusRDF (you can find more detailed information in our Wiki [1]).

CumulusRDF is an RDF store that uses Apache Cassandra [2] as its underlying storage. Apache Cassandra is a good and proven NoSQL (specifically column-oriented) store for managing huge volumes of data. Here [3] you can see medium and big companies that are using Cassandra.

The first way of using CumulusRDF is as an HTTP service (i.e. a web application providing SPARQL 1.1 and other data services over HTTP).

In this case CumulusRDF runs as a web application and provides a REST interface (i.e. you will be able to load, update and query your data); it needs
  • a servlet engine or an application server (e.g. Apache Tomcat, Jetty, JBoss, Oracle WebLogic, IBM WebSphere, Glassfish)
  • a Cassandra ring. By "ring" I mean a cluster of Cassandra nodes, which may also consist of a single node
To set that up, you can choose between these 2 alternatives:
  • User
    • Download and start Cassandra 1.2.x (a single node is fine for testing and trying)
    • Download and start Tomcat 6.x or greater. Any other servlet engine that supports the Servlet 2.5 spec is also fine
    • (I assume you already have a JVM 1.6)
    • Deploy the CumulusRDF war in the servlet engine (in Tomcat it is just a matter of copying the war archive into the webapps folder)
    • Once you have made sure everything is working, use the interface to load some data and query it
  • Technical user (requires no download of external middleware like Cassandra or Tomcat)
    • assuming you have
      • JDK1.6
      • Maven 3.x
      • SVN client
    • check out the latest stable version (i.e. trunk) from our repository [4]
    • open a shell or a DOS prompt and type
      • mvn clean cassandra:stop cassandra:start tomcat7:run

The second way is faster but requires some technical steps. In any case, at the end of the process you will have

  • a running Cassandra node
  • a running servlet engine
  • a deployed CumulusRDF web application

So you can start loading and querying data.
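
As a minimal sketch of what querying over HTTP could look like once the web application is up: the snippet below builds a standard SPARQL 1.1 Protocol "query via GET" request using only the Python standard library. The endpoint URL is an assumption made for illustration (a local Tomcat on port 8080 and a hypothetical `/cumulusrdf/sparql` path); check the wiki for the path your deployment actually exposes.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: hypothetical endpoint URL, not taken from the CumulusRDF docs.
ENDPOINT = "http://localhost:8080/cumulusrdf/sparql"


def build_sparql_request(endpoint, query):
    """Build a SPARQL 1.1 Protocol GET request asking for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def run_query(endpoint, query, timeout=30):
    """Send the query and return the raw response body as text."""
    request = build_sparql_request(endpoint, query)
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Example call (needs a running endpoint, so it is left commented out):
# print(run_query(ENDPOINT, "SELECT * WHERE { ?s ?p ?o } LIMIT 10"))
```

Keeping the request construction separate from the network call makes it easy to check the generated URL before pointing it at a real server.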
---------------------------

Having said that, CumulusRDF can also be used as an API in your code for loading (and also querying) data; this way things are a bit more performant because no HTTP transfer is involved. As a last note, you can also use a mixed approach: load data fast by embedding the CumulusRDF client API, and publicly expose the loaded data through the SPARQL / HTTP approach.

Let me know if you run into any problems with the above... I'll be happy to help.

Best,

Andrea

[1] https://code.google.com/p/cumulusrdf/wiki/GettingStarted
[2] http://cassandra.apache.org/
[3] http://planetcassandra.org/companies/
[4] https://code.google.com/p/cumulusrdf/source/checkout


Andreas Wagner

Apr 25, 2014, 7:25:08 AM
to cumulus...@googlegroups.com, cumulus...@googlegroups.com
Hi guys,

I am cc'ing the CumulusRDF list, since other users might also be interested in this topic. I fully agree with Andrea ...

On 04/25/2014 12:41 PM, Andrea Gazzarini wrote:
CumulusRDF is an RDF store that uses Apache Cassandra as its underlying storage. Apache Cassandra is a good and proven NoSQL (specifically column-oriented) store for managing huge volumes of data. Here [3] you can see medium and big companies that are using Cassandra.
The key advantage of Cassandra (and other NoSQL technologies) is its linear scalability. That is, you can easily extend your cluster if you need more resources for your application ... We leverage this advantage ...

The first way of using CumulusRDF is as an HTTP service (i.e. a web application providing SPARQL 1.1 and other data services over HTTP).
+1 ... see [1].


In this case CumulusRDF runs as a web application and provides a REST interface (i.e. you will be able to load, update and query your data); it needs
  • a servlet engine or an application server (e.g. Apache Tomcat, Jetty, JBoss, Oracle WebLogic, IBM WebSphere, Glassfish)
  • a Cassandra ring. By "ring" I mean a cluster of Cassandra nodes, which may also consist of a single node
To set that up, you can choose between these 2 alternatives:
  • User
    • Download and start Cassandra 1.2.x (a single node is fine for testing and trying)
    • Download and start Tomcat 6.x or greater. Any other servlet engine that supports the Servlet 2.5 spec is also fine
    • (I assume you already have a JVM 1.6)
    • Deploy the CumulusRDF war in the servlet engine (in Tomcat it is just a matter of copying the war archive into the webapps folder)
    • Once you have made sure everything is working, use the interface to load some data and query it
  • Technical user (requires no download of external middleware like Cassandra or Tomcat)
    • assuming you have
      • JDK1.6
      • Maven 3.x
      • SVN client
    • check out the latest stable version (i.e. trunk) from our repository [4]
    • open a shell or a DOS prompt and type
      • mvn clean cassandra:stop cassandra:start tomcat7:run
I added a brief description of this to our "GettingStarted" page last night ... [2]. However, please note that this is *only for testing* purposes. You can't set up a production/benchmark/development environment like this. It's only for "having a first glimpse" ...


Just as a side note ... there are other nice stores besides Jena Fuseki, too ;)

Kind regards
Andreas

[1] http://code.google.com/p/cumulusrdf/wiki/Webapps
[2] http://code.google.com/p/cumulusrdf/wiki/GettingStarted

Andreas Wagner

Apr 25, 2014, 7:29:04 AM
to cumulus...@googlegroups.com, cumul...@googlecode.com
Note: You may want to have a look at our recent benchmark paper [1].

Kind regards
Andreas

[1] "NoSQL Databases for RDF: An Empirical Evaluation", a benchmark comparison of current NoSQL RDF datastores.

Enayat

Apr 25, 2014, 7:35:06 AM
to cumulus...@googlegroups.com, andreas.jo...@googlemail.com
Hi guys,

Thanks for your detailed explanations. I never said that CumulusRDF is not nice :-), but I am saying that, from a performance perspective, there should be a measurement that lets us say "OMG, this is awesome!". To this end, the functionality should be precisely evaluated. We are trying to understand and compare loading a huge RDF dump into a triple store versus importing it into CumulusRDF and testing the query interface.
Hope this is clear.
Regarding the Cassandra installation, I did not mean that it should be part of the tool; rather, guidance in the tool's documentation on installing its requirements would greatly help users apply your software :-)

Regards,

Andrea Gazzarini

Apr 25, 2014, 7:44:38 AM
to cumulus...@googlegroups.com, Andreas Wagner

Hi,
You're right. At the moment, our official documentation consists of the papers Andreas mentioned in his previous email.
On top of that, we have added and are still adding a lot of improvements, so we will shortly come out with new benchmarks. In 1.1.x (the next release) we have a dedicated module for benchmarking that we will use to provide fresh and updated benchmark data.

Can I ask you

- what is the (approximate) expected amount of data you have to manage?
- what is the (approximate) expected number of queries / second you should support?
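
For the second figure, a rough number can be obtained by timing repeated calls against whatever store is being compared. The helper below is only a sketch under stated assumptions: `queries_per_second` and its parameters are hypothetical names, not part of CumulusRDF or its benchmarking module, and `run_once` stands for any function that issues one query.

```python
import time


def queries_per_second(run_once, repetitions=100, clock=time.perf_counter):
    """Call run_once() `repetitions` times and return the average call rate.

    `clock` is injectable so the arithmetic can be checked without real timing.
    """
    if repetitions <= 0:
        raise ValueError("repetitions must be positive")
    start = clock()
    for _ in range(repetitions):
        run_once()
    elapsed = clock() - start
    # Guard against a zero interval on very coarse clocks.
    return repetitions / elapsed if elapsed > 0 else float("inf")
```

A single-threaded loop like this measures sequential latency rather than sustained load under concurrency, so it gives a lower bound to compare against the target rate, not a full benchmark.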

Best,
Andrea
