We are evaluating new technologies for managing semi-structured data and
documents in one of our applications. We've grown tired of wrestling with
relational databases for this.
I would like to know why I would prefer CouchDB over an RDF
database such as Sesame or Mulgara.
I know some of the RDF advantages, such as open standards, interoperability,
rules engines, semantic queries, community and tool support, maturity, etc.
But I really like the simplicity of the CouchDB model.
Can anyone enlighten me?
Thanks a lot,
Demetrius
--
____________________________
http://www.demetriusnunes.com
Sure, it isn't as sweet as a triple store with a SPARQL endpoint, but it's
hella better than using Sesame or any of those other Java implementations.
They can't scale, whereas CouchDB can. You can also scale your application
to handle the RDF processing, since that happens in the application logic,
where you have more control over it.
I am also in the process (and have been for a while) of building an
Erlang-backed triplestore. It is not a light undertaking by any means, and I
may build it on top of CouchDB (license permitting).
CouchDB + RDF = FTW
We (bibkn.org) have investigated and used SQL databases, an RDF store
(Virtuoso) and CouchDB for bibliographic metadata management. I am the
project manager and data architect for this project.
Relational databases are often a first choice but have many limitations in
managing loosely typed, messy, string-based data sets. So we are
in agreement on not using that technology.
We, bibkn.org, need both the schemalessness of CouchDB at one end of
our workflow and the strongly-typedness of RDF at the other end of the
workflow when all our data has been cleaned up and "ontologized". So we
don't see this as an either/or between CouchDB and RDF stores.
However, we can definitely say one thing: if you need just the
flexible schema aspect and are using RDF to give you that, then that
is massive overkill, and the conceptual overhead of RDF
(ontologies, schemas, namespaces, completely normalized everything, i.e.
URIs for subject, predicate, object) is simply not worth it. If,
however, you want to do logical inference and reasoning over your data,
then clearly the RDF and semantic machinery gives you a whole lot of
goodness that is worth the overhead.
So CouchDB is not a substitute for an RDF-store, but you may be using an
RDF-store for the lesser things it gives you (flexible schema) and in
that case CouchDB can do a lot more for you at a much lower overhead and
much greater ease of use and integration into existing tools.
Additionally SPARQL (like SQL) is not really meant for text search
which is critical for loosely typed data. So even at our RDF end we have
a Solr instance for rapid text search over the RDF store.
Additionally we have couchdb-lucene as an extension on our CouchDB
instance and this has given us everything we need at the loosely typed
data end of our workflow.
So if semi-structured data and document management is your primary use
case and there is no semantic/ontology/inference component then forget
RDF-stores and just go with CouchDB.
In our project we are developing a format on top of JSON to export
bibliographic metadata for integration into JSON-friendly data
consumers. It also happens to have an easy mapping to RDF.
So even if you go to Couch now you may be able to integrate into an
RDF-store at some later stage if the need arises.
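
To make the "easy mapping to RDF" idea concrete, here is a minimal sketch of turning a flat JSON document into triples. It is not the bibkn.org format: the BASE and VOCAB namespaces, the doc_to_triples helper and the sample record are all hypothetical, and literal typing is ignored entirely.

```python
from typing import Any, Dict, Iterator, Tuple

Triple = Tuple[str, str, Any]

# Hypothetical namespaces, chosen only for this illustration.
BASE = "http://example.org/record/"
VOCAB = "http://example.org/vocab#"


def doc_to_triples(doc: Dict[str, Any]) -> Iterator[Triple]:
    """Yield (subject, predicate, object) triples for a flat JSON document."""
    subject = BASE + str(doc["_id"])
    for key, value in doc.items():
        if key.startswith("_"):
            # Skip bookkeeping fields such as _id and _rev.
            continue
        # A list-valued field becomes one triple per item.
        values = value if isinstance(value, list) else [value]
        for item in values:
            yield (subject, VOCAB + key, item)


record = {
    "_id": "rec1",
    "title": "An example record",
    "author": ["A. Author", "B. Author"],
}
triples = list(doc_to_triples(record))
for triple in triples:
    print(triple)
```

Each top-level field becomes a predicate under one vocabulary, which is about the simplest mapping possible; nested objects would need blank nodes or further URIs.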
Hope this helps,
Nitin Borwankar,
Project Manager, Bibliographic Knowledge Network
bibkn.org
Great answer. Thanks a lot. One more question...
I am in Javaland here, so another viable option for my application is
using JCR, such as the Apache Jackrabbit implementation.
Did you happen to take a look at that as well? I think JCR has even more
similarities with CouchDB than RDF does.
How would you compare JCR and CouchDB?
Thanks a lot,
Demetrius
License* permits :)
* http://www.apache.org/licenses/LICENSE-2.0
Cheers
Jan
--
Hi Demetrius,
I am a refugee from Javaland so am familiar with the power and
limitations of Java. Yes, I have looked at JCR and JackRabbit in a
previous project.
These days I just recoil from the verbosity and conceptual layers you
encounter when coding simple things in Java. And then there's XML.....
So I would have held my nose and used JackRabbit if CouchDB didn't exist
- but in my mind it's a distant second in practice even if it is
conceptually similar and close in theory.
Personally, when I see layer upon layer of abstraction in Java
architecture diagrams, I wonder how much of my CPU cost goes into
converting from strings to TypeA to LayeredClassB to factoryC to ORM D
to EJB4 to disk, and back again all the way to strings. So I am moving
away from Java except when the best-of-breed solution is in Java
(Lucene/Solr). I don't hate Java; I just need to justify the
overhead it brings, both in coding and in the build/install/deploy
process.
CouchDB has minimal overhead in roundtrip datatype translations - it's
what I call "WYSIWIS" - "what you see is what you store" i.e. JSON.
There are people looking at an alternative to LAMP which they call JS3:
JavaScript in all three layers, browser/Helma/CouchDB. (Helma,
helma.org, is a middle-tier layer written in Java that runs on Jetty and
uses JS as the language for UI templates and also ORM.) I personally
think CouchDB + CouchDB views just makes it JS2: browser-CouchDB.
I would suggest you download Rhino (a JS interpreter in Java) from
Mozilla, start playing with both CouchDB and Jackrabbit, and then see.
Did I sound biased ? :-)
Nitin Borwankar,
Project Manager, Bibliographic Knowledge Network.
bibkn.org
CouchDB uses JSON, X uses XML.
CouchDB uses views, X uses XQuery, which has some simple indexing and is
a powerful and understandable query language.
CouchDB has a Lucene plugin, Sedna can have an extra full-text index
feature enabled.
Updating data in CouchDB requires that an entire document be updated, X
databases can modify small parts of the document.
CouchDB saves a new document revision on each change, X works on the
current document.
CouchDB handles conflicts using conflict resolution, X applies each
modification query to the current document in the order the queries arrive
(transactions are also supported).
CouchDB uses an HTTP REST API, most X databases use a conventional binary
protocol (Sedna seems to have a good set of libraries for most languages).
CouchDB is distributed and scalable.
In X databases documents can be grouped into collections. (These can
also be used in queries.)
It's probably a moot point, but XQuery is W3C standardized and
implemented by a number of databases.
IMHO compiling a comparison of alternative databases and seeing what
features work best for what data you're working with is the best option.
I went through the semantic databases myself too, because our company had
"Semantics" in mind. I had issues getting them to work and finding help
for most of them, and ended up finding that our data better fit
the document-based database type. For us TQL was the only one
with a significant improvement (we really needed the walk capabilities);
other than that, semantic databases were only a little better than an RDBMS
(although we were actually using an RDBMS in an ugly semantic-like hack:
an atoms table with 3 columns).
Our reason for moving away from RDBMSs was a need to cut down the large
number of queries going between our app and the database. We had a huge
amount of hierarchical data that the entire app was based around (a tree
structure wasn't even guaranteed; something could have multiple parents
referencing it and be part of multiple trees).
We decided on Sedna (XQuery) rather than CouchDB because CouchDB's views
couldn't handle our hierarchical data spread over multiple documents, and we
couldn't put everything in one document because we update small
pieces of data a lot. That doesn't work out well with how entire
documents need to be modified in Couch (transmitting the entire document to
modify a single value, a new document revision saved each time, getting a
conflict because an unrelated part of the document was modified).
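
The conflict case described above can be illustrated without a running server. This is a minimal in-memory sketch, not CouchDB's API: TinyDocStore and Conflict are hypothetical names, and revisions are simplified to plain integers, whereas real CouchDB revision identifiers are strings.

```python
import copy


class Conflict(Exception):
    """Raised when an update is based on a stale revision."""


class TinyDocStore:
    """In-memory stand-in for whole-document, revision-checked updates."""

    def __init__(self):
        self._docs = {}

    def get(self, doc_id):
        # Hand out a copy so callers edit their own version of the document.
        return copy.deepcopy(self._docs[doc_id])

    def put(self, doc):
        current = self._docs.get(doc["_id"])
        # The whole document is replaced, and only if the caller saw the
        # latest revision. Which field changed does not matter.
        if current is not None and doc.get("_rev") != current["_rev"]:
            raise Conflict(doc["_id"])
        stored = copy.deepcopy(doc)
        stored["_rev"] = (current["_rev"] if current else 0) + 1
        self._docs[doc["_id"]] = stored
        return stored["_rev"]


store = TinyDocStore()
store.put({"_id": "node-1", "title": "root", "hits": 0})

# Two clients read the same revision of the document.
a = store.get("node-1")
b = store.get("node-1")

a["hits"] += 1          # client A changes one field
store.put(a)            # accepted, creates the next revision

b["title"] = "renamed"  # client B changes an unrelated field
try:
    store.put(b)        # rejected: B still holds the older revision
    outcome = "stored"
except Conflict:
    outcome = "conflict"
print(outcome)
```

Client B never touched "hits", yet its write is refused, because the revision check covers the document as a whole rather than individual fields.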
Personally I have an idea for another type of database. The one thing
I've always wanted is a database that is program-oriented, i.e. one that
simplifies a database down to what it is: centralized data storage. Instead
of a query language, it would embed an existing programming language into
the database environment. I wrote a bit of API drafting on it.
~Daniel Friesen (Dantman, Nadir-Seen-Fire)
Great answer. Thanks a lot...
Could you say a little more about how this might work? Would you have
parallel CouchDB and RDF stores or are you suggesting that it would be
possible to use SPARQL to query from CouchDB?
In the meantime, here is what I use for RDF parsing in Javascript:
http://dig.csail.mit.edu/2005/ajar/release/tabulator/0.7/doc/api/overview-summary-rdfparser.js.html
It is small, implements the entire RDF spec and is well written.
:-D
These ideas are very new too - I haven't matured them at all or used them
within a production environment, so, more than anything they are ideas and
theories.
Cheers,
Barry
There is a draft for RDF/JSON here:
http://n2.talis.com/wiki/RDF_JSON_Specification
So it may be possible to store RDF as JSON and then use _show &
_list to format the output you need.
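
As a rough illustration of storing RDF as JSON, here is a minimal sketch that groups triples by subject and then by predicate, which is the resource-centric shape the RDF/JSON draft describes as I read it. The to_rdf_json helper and the example.org URIs are hypothetical, and optional object keys such as language or datatype are left out.

```python
import json


def to_rdf_json(triples):
    """Group (subject, predicate, value, kind) tuples by subject, then predicate."""
    graph = {}
    for subject, predicate, value, kind in triples:
        objects = graph.setdefault(subject, {}).setdefault(predicate, [])
        # kind is assumed to be "uri", "literal" or "bnode".
        objects.append({"value": value, "type": kind})
    return graph


DC = "http://purl.org/dc/elements/1.1/"
triples = [
    ("http://example.org/doc/1", DC + "title", "CouchDB and RDF", "literal"),
    ("http://example.org/doc/1", DC + "creator", "http://example.org/people/1", "uri"),
]
doc = to_rdf_json(triples)
print(json.dumps(doc, indent=2))
```

The resulting dictionary is plain JSON, so it could be saved as the body of a document and reshaped on output.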
If you need a SPARQL endpoint, it may be built via an external, I guess,
or even a proxy. I'm currently exploring both. The only problem I
have currently is that you can't query a document per field. I'm
somehow missing JSON query and JSON path for that.
- benoît
CLL Independencia 445
Int. 9
Col: Arcos de Guadalupe
Zapopan, Jalisco
45037
You do NOT have my permission to send me solicitations or to sell my
contact information. Any information released is strictly to you, and is
confidential, and shall not be released to a third party without my
permission. Any communication directly conveyed to me to solicit or is
spam in nature is strictly prohibited. I will pursue all means, including
legal action and seek damages, to prevent solicitation and receiving spam..
Thank you.
> 2009/5/7 Demetrius Nunes <demetri...@gmail.com>:
My key interest is serializing XML-serialized RDF into JSON objects and
using the proper libs to parse that into triples. N3 also wouldn't be bad;
however, N3 was designed more for humans, and since JSON doesn't like
carriage returns in its data, that makes XML the more logical choice...
As for a SPARQL endpoint, I believe a few useful hack-arounds could be
devised (such as this RDF hack-around), but nothing will supplant a
properly built, Erlang-backed, distributed SPARQL endpoint/triple store. The
only problem there is Erlang's lack of XML libraries, which makes building
an RDF parser that much more difficult.
I've been looking into using Redland and Erlang's C bridge to accomplish the
parsing task... I dunno, I'm still researching it, but all this research is
in my free time (which is stretched as it is), so it is slow moving.
Keep me updated as to what you come up with. Last night I made some headway
with serializing XML into JSON objects within CouchDB, and I will begin doing
more comprehensive (i.e. bloggable) experiments with rdfparse.js and the
eulersharp reasoner. I may also see what I can do with Python.
Another interesting thought would be to embed the Python interpreter in
CouchDB and use it as the primary view server - I like JSON as the storage
medium because it is simple and light weight and I like Python's strength as
a general purpose programming language with many mature libraries (including
RDF, hint hint).
Couch just shells out to SpiderMonkey... it's no different than shelling out
to Python (or anything else, hence all the alternative view servers).
Couch was designed with this in mind; hit the wiki for ideas. You can do
pretty much anything you can dream up with externals. The only drawback is
they're not portable (for those rare few without Python in their path, at
least)...
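
To show what "shelling out" to another language can look like, here is a minimal sketch of a view server loop in Python. It assumes the line-based JSON protocol as I understand it (one JSON array per line on stdin, one JSON reply per line on stdout, with reset, add_fun and map_doc commands); check the wiki before relying on the details. The ViewServer class, the serve function and the use of a Python lambda as the map function source are assumptions for illustration, and eval must never be used on untrusted input.

```python
import io
import json
import sys


class ViewServer:
    """Toy query server: keeps map functions and applies them to documents."""

    def __init__(self):
        self.functions = []

    def handle(self, command):
        name, args = command[0], command[1:]
        if name == "reset":
            self.functions = []
            return True
        if name == "add_fun":
            # Assumption: the source is a Python expression that evaluates
            # to a callable returning a list of (key, value) pairs.
            self.functions.append(eval(args[0], {}))
            return True
        if name == "map_doc":
            doc = args[0]
            # One result list per registered function.
            return [[list(pair) for pair in fun(doc)] for fun in self.functions]
        # Assumed error shape, for illustration only.
        return {"error": "unsupported_command", "reason": name}


def serve(stdin=sys.stdin, stdout=sys.stdout):
    """Read one JSON command per line and write one JSON reply per line."""
    server = ViewServer()
    for line in stdin:
        if not line.strip():
            continue
        stdout.write(json.dumps(server.handle(json.loads(line))) + "\n")
        stdout.flush()


# Drive the loop with in-memory streams instead of a real CouchDB process.
commands = [
    ["reset"],
    ["add_fun", "lambda doc: [(doc['type'], 1)]"],
    ["map_doc", {"_id": "a", "type": "book"}],
]
demo_in = io.StringIO("".join(json.dumps(c) + "\n" for c in commands))
demo_out = io.StringIO()
serve(demo_in, demo_out)
print(demo_out.getvalue())
```

Because the server only talks over stdin and stdout, any language with a JSON library could play the same role, which is the point made above about alternative view servers.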
I'll grant you JavaScript isn't as general purpose as it could be, but don't
write it off yet. Keep an eye on the serverjs project and see what shakes
out.