Hi Ian,
Great to hear! This is an area that hasn't gotten a lot of attention
yet but it'll certainly be increasingly important as the
non-relational solutions mature.
> I come from a SQL relational database
> design background, and my goal is to shed some light (for myself and
> others) on exactly what differs in the process of conceptual data
> design between the two paradigms - what's easier, what's harder, what
> can't be done at all, etc.
Well, unfortunately I don't think you'll be able to look at it as
"data design for relational" and "data design for non-relational." The
non-relational field is just way too heterogeneous at this point (and
this is a necessary and good thing). The systems all expose different
abstractions. Those abstractions are the building blocks of data
modeling, so the modeling will by necessity differ from system to system.
If we squint a bit, we can maybe see four broad categories of emerging
non-relational paradigms:
1. Bigtable-like systems (HBase, Hypertable, etc.)
2. Key-value stores (Tokyo Cabinet, Voldemort, etc.)
3. Document databases (CouchDB, MongoDB, etc.)
4. Graph databases (AllegroGraph, Neo4j, Sesame, etc.)
You may be able to treat these categories separately.
>
> I'd love to hear from anyone on this list (either in reply or via
> direct message) if you have any insights. Have you come up with
> alternate designs that work much better in the non-relational world?
> Have you hit your head against anything that seems impossible? Have
> you bridged the gap with any design patterns? Do you do explicit data
> models at all now (e.g. in UML) or have you chucked them entirely in
> favor of semi-structured / document-oriented data blobs?
Well, I'm part of the Neo4j crew and for us the data modeling part has
been *vastly* simplified by moving from RDBMS to a graph paradigm. I'm
biased, of course, but I believe this is because the Neo4j data model
(graph) is extremely close to how humans perceive the real world.
Check out the introductory slides ("Our make believe world") of this
presentation about modeling with graph dbs vs modeling with RDBMS:
http://highscalability.com/paper-graph-databases-and-future-large-scale-knowledge-management
As a corollary, there's a property of modeling with graph databases
that we refer to as "white board friendliness." This is the
observation that when we put a system in production, the layout of the
graph (i.e. the ultimate artifact of the data design process) is
frequently extremely similar to the initial whiteboards that we
produced in the first brainstorm sessions with our customer.
This is pretty cool. It means our data model matches the cognitive
model of the domain expert. This is the model we try to express in our
DDD OO business layer [1] and the model that the developers have
internalized. The productivity gain of not having to translate your
cognitive model to tables and declarative String-based queries is
pretty significant.
So with this in mind our data design process is deceptively simple. We
instruct our fellow developers to:
1. Forget everything you've learned about 1-nNF, E/R, O/R, etc.
Fuck that shit.
2. Grab nearest whiteboard and brainstorm about the domain with
your domain expert.
3. The entities on the whiteboard (customers, carts, documents) are
Nodes, the arrows between them (owns, has_a, buys) are Relationships.
The primitive values (name, age, amount) are Properties.
That's the high level view of it. It works out really well. There's
obviously more to say. For example, we have a number of design
patterns that we apply for structuring the graph at a high level and
for performance. But this is the core of it.
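
A minimal sketch of step 3 in plain Python, for illustration only. It
does not use the Neo4j API; the Node, Relationship and Graph classes
and the sample customer, cart and document are all hypothetical names
chosen to mirror the whiteboard vocabulary above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Node:
    """An entity from the whiteboard, e.g. a customer or a cart."""
    label: str
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Relationship:
    """A named, directed arrow between two nodes."""
    start: Node
    rel_type: str
    end: Node
    properties: Dict[str, Any] = field(default_factory=dict)


class Graph:
    """Toy in-memory property graph (hypothetical, not a real database)."""

    def __init__(self) -> None:
        self.nodes: List[Node] = []
        self.relationships: List[Relationship] = []

    def add_node(self, label: str, **properties: Any) -> Node:
        node = Node(label, dict(properties))
        self.nodes.append(node)
        return node

    def relate(self, start: Node, rel_type: str, end: Node,
               **properties: Any) -> Relationship:
        rel = Relationship(start, rel_type, end, dict(properties))
        self.relationships.append(rel)
        return rel

    def neighbours(self, node: Node, rel_type: str) -> List[Node]:
        """Follow outgoing arrows of one type from the given node."""
        return [r.end for r in self.relationships
                if r.start is node and r.rel_type == rel_type]


# The whiteboard, transcribed almost one to one.
graph = Graph()
alice = graph.add_node("customer", name="Alice", age=34)
cart = graph.add_node("cart")
book = graph.add_node("document", name="Graph basics")
graph.relate(alice, "owns", cart)
graph.relate(cart, "has_a", book, amount=2)

# Traversal replaces the join: customer -> cart -> documents.
for owned_cart in graph.neighbours(alice, "owns"):
    for item in graph.neighbours(owned_cart, "has_a"):
        print(item.properties["name"])
```

Note that the primitive values live on the nodes and relationships
themselves, so there is no separate schema to keep in sync with the
drawing.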
[1] Yes. We're legacy enough to use OO.
Cheers,
-EE
Thanks!
FWIW, I distinguish between emerging non-relational models (outlined
in my previous mail) and traditional non-relational models. In the
traditional group I include:
1. CODASYL/Network
2. Hierarchical/Directories (which also includes LDAP/AD)
3. XML
4. OODBs
That's not meant to be an exhaustive list. It's just our subjective
view of the paradigms that are important enough that we must know how
to relate to them.
> It would be interesting to see
> how data modeling would be different for each system. If they are all
> different that might be a bit of concern as that's quite confusing.
>
> Would you also include structured vs unstructured data for map-reduce
> type systems?
Well, I'm not an expert on map-reduce systems, but to me M/R as a
processing model is orthogonal to the underlying representation model.
Which is why we have M/R on top of Bigtable-like systems, on top of
document stores, on top of OLAP systems like Aster/Greenplum, etc. So
that's not a parameter I include.
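
One way to picture that orthogonality is a toy sketch in Python. The
map_reduce function, the two stand-in stores and the adapter functions
below are all hypothetical and in-memory; no real M/R framework or
database is involved. The same mapper and reducer run unchanged over a
key-value layout and a document layout, and only the small adapter
that reads records differs.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Record = Tuple[str, int]


def map_reduce(records: Iterable[Record],
               mapper: Callable[[Record], Iterable[Tuple[str, int]]],
               reducer: Callable[[str, List[int]], int]) -> Dict[str, int]:
    """Group mapper output by key, then reduce each group."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


# Stand-in for a key-value store: opaque string values behind keys.
kv_store = {"order:1": "alice,30", "order:2": "bob,12", "order:3": "alice,5"}

# Stand-in for a document store: self-describing documents.
documents = [
    {"customer": "alice", "amount": 30},
    {"customer": "bob", "amount": 12},
    {"customer": "alice", "amount": 5},
]


def records_from_kv(store: Dict[str, str]) -> Iterator[Record]:
    for value in store.values():
        customer, amount = value.split(",")
        yield customer, int(amount)


def records_from_documents(docs: List[dict]) -> Iterator[Record]:
    for doc in docs:
        yield doc["customer"], doc["amount"]


def mapper(record: Record) -> Iterator[Tuple[str, int]]:
    customer, amount = record
    yield customer, amount


def reducer(customer: str, amounts: List[int]) -> int:
    return sum(amounts)


totals_kv = map_reduce(records_from_kv(kv_store), mapper, reducer)
totals_docs = map_reduce(records_from_documents(documents), mapper, reducer)
print(totals_kv == totals_docs)
```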
As for structured vs unstructured: My view is that unstructured data
is equal to blobs or media files or something like that, i.e. opaque
entities without relevant internal semantics that are best kept in a
file system (and for text, typically indexed with a full-text search
engine such as Lucene). The metadata for them is
structured or semi-structured data which does generally belong in the
database. The various models differ in their capabilities for
efficiently handling structured and semi-structured data and that's an
important parameter for me.
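
A rough sketch of that split in Python, under stated assumptions: the
store_blob helper, the metadata fields and the list standing in for a
database are all hypothetical. The opaque bytes go to the file system,
and only the structured metadata about them is kept where it can be
queried.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def store_blob(blob_dir: Path, payload: bytes, **metadata: Any) -> Dict[str, Any]:
    """Write the opaque bytes to disk and return a metadata record."""
    digest = hashlib.sha256(payload).hexdigest()
    path = blob_dir / digest  # content-addressed file name
    path.write_bytes(payload)
    return {"path": str(path), "size": len(payload),
            "sha256": digest, **metadata}


blob_dir = Path(tempfile.mkdtemp())
metadata_db: List[Dict[str, Any]] = []  # stand-in for a real database

payload = b"\xff\xd8 not a real image, just sample bytes"
record = store_blob(blob_dir, payload,
                    title="Holiday photo", content_type="image/jpeg")
metadata_db.append(record)

# Queries touch only the metadata; the blob is read on demand.
jpegs = [r for r in metadata_db if r["content_type"] == "image/jpeg"]
first_blob = Path(jpegs[0]["path"]).read_bytes()
```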
Cheers,
-EE
On Mon, Jul 27, 2009 at 8:12 PM, Ian<ianv...@gmail.com> wrote:
> I'd love to hear from anyone on this list (either in reply or via
> direct message) if you have any insights. Have you come up with
> alternate designs that work much better in the non-relational world?
The Apache Jackrabbit [1] project (that implements a hierarchical
content store specified by the JCR spec [2]) has a set of rough
modeling guidelines called David's Model [3]. Some of the rules are
specific to JCR, but many apply well to any unstructured or
semi-structured database.
[1] http://jackrabbit.apache.org/
[2] http://jcp.org/en/jsr/summary?id=170
[3] http://wiki.apache.org/jackrabbit/DavidsModel
BR,
Jukka Zitting