Data Design in Non-Relational Databases

Ian

Jul 27, 2009, 2:12:55 PM

to NOSQL

Hi all,

I'm writing a report about the process of designing data models that
target non-relational stores (Hypertable, CouchDB, SimpleDB, Google
App Engine datastore, etc.). I come from a SQL relational database
design background, and my goal is to shed some light (for myself and
others) on exactly what differs in the process of conceptual data
design between the two paradigms - what's easier, what's harder, what
can't be done at all, etc.

I'd love to hear from anyone on this list (either in reply or via
direct message) if you have any insights. Have you come up with
alternate designs that work much better in the non-relational world?
Have you hit your head against anything that seems impossible? Have
you bridged the gap with any design patterns? Do you do explicit data
models at all now (e.g. in UML) or have you chucked them entirely in
favor of semi-structured / document-oriented data blobs?

I'd be more than happy to share the report here when it's done (should
be 2 or 3 weeks from now). I'll happily cite you in the paper, and
hopefully it'll contribute a little bit to the collective
understanding of this technology.

Thanks!
Ian Varley (in Austin, TX)

Emil Eifrém

Jul 27, 2009, 3:52:46 PM

to nosql-di...@googlegroups.com

On Mon, Jul 27, 2009 at 8:12 PM, Ian<ianv...@gmail.com> wrote:
>
> Hi all,
>
> I'm writing a report about the process of designing data models that
> target non-relational stores (Hypertable, CouchDB, SimpleDB, Google
> App Engine datastore, etc.).

Hi Ian,

Great to hear! This is an area that hasn't gotten a lot of attention
yet but it'll certainly be increasingly important as the
non-relational solutions mature.

> I come from a SQL relational database
> design background, and my goal is to shed some light (for myself and
> others) on exactly what differs in the process of conceptual data
> design between the two paradigms - what's easier, what's harder, what
> can't be done at all, etc.

Well, unfortunately I don't think you'll be able to look at it as
"data design for relational" and "data design for non-relational." The
non-relational field is just way too heterogeneous at this point (and
this is a necessary and good thing). The systems all expose different
abstractions. Those abstractions are the building blocks of data
modeling, so the modeling process will by necessity be very different.

If we squint a bit, we can maybe see four broad categories of emerging
non-relational paradigms:

1. Bigtable-like systems (HBase, Hypertable, etc)
2. Key-value stores (Tokyo, Voldemort, etc)
3. Document databases (CouchDB, MongoDB, etc)
4. Graph databases (AllegroGraph, Neo4j, Sesame, etc)

You may be able to treat these categories separately.
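
As a rough illustration of how differently these four categories shape the same data, here is a minimal sketch in plain Python. It uses no real database client; the customer/order example, the key formats, and every function name are assumptions made up for illustration, not the API of any of the systems listed above.

```python
import json

# The same fact ("customer 42, Alice, placed order 1001 for 59.90"),
# laid out in each of the four styles. All names and keys are hypothetical.

# 1. Bigtable-like: row key -> column family -> column qualifier -> value
bigtable_like = {
    "customer:42": {
        "info": {"name": "Alice"},
        "orders": {"1001": "59.90"},
    },
}

# 2. Key-value: one key, one value that the store typically treats as opaque
key_value = {
    "customer:42": json.dumps(
        {"name": "Alice", "orders": [{"id": 1001, "amount": 59.90}]}
    ),
}

# 3. Document: a nested, self-describing record
document = {
    "_id": "customer:42",
    "name": "Alice",
    "orders": [{"id": 1001, "amount": 59.90}],
}

# 4. Graph: nodes and relationships, each carrying properties
graph = {
    "nodes": {
        "c42": {"label": "Customer", "name": "Alice"},
        "o1001": {"label": "Order", "amount": 59.90},
    },
    "relationships": [("c42", "PLACED", "o1001")],
}


def amounts_bigtable(table, row_key):
    # Read every column in the "orders" family of one row.
    return [float(v) for v in table[row_key]["orders"].values()]


def amounts_key_value(store, key):
    # The application, not the store, decodes the value.
    record = json.loads(store[key])
    return [order["amount"] for order in record["orders"]]


def amounts_document(doc):
    # Walk the nested structure directly.
    return [order["amount"] for order in doc["orders"]]


def amounts_graph(g, customer_id):
    # Follow PLACED relationships out of the customer node.
    return [
        g["nodes"][dst]["amount"]
        for src, rel, dst in g["relationships"]
        if src == customer_id and rel == "PLACED"
    ]
```

Each accessor answers the same question ("what are this customer's order amounts?"), but the shape of the lookup follows the abstraction the store exposes, which is why the categories are probably best treated separately.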

>
> I'd love to hear from anyone on this list (either in reply or via
> direct message) if you have any insights. Have you come up with
> alternate designs that work much better in the non-relational world?
> Have you hit your head against anything that seems impossible? Have
> you bridged the gap with any design patterns? Do you do explicit data
> models at all now (e.g. in UML) or have you chucked them entirely in
> favor of semi-structured / document-oriented data blobs?

Well, I'm part of the Neo4j crew and for us the data modeling part has
been *vastly* simplified by moving from RDBMS to a graph paradigm. I'm
biased, of course, but I believe this is because the Neo4j data model
(graph) is extremely close to how humans perceive the real world.
Check out the introductory slides ("Our make believe world") of this
presentation about modeling with graph dbs vs modeling with RDBMS:

http://highscalability.com/paper-graph-databases-and-future-large-scale-knowledge-management

As a corollary, there's a property of modeling with graph databases
that we refer to as "white board friendliness." This is the
observation that when we put a system in production, the layout of the
graph (i.e. the ultimate artifact of the data design process) is
frequently extremely similar to the initial whiteboards that we
produced in the first brainstorm sessions with our customer.

This is pretty cool. It means our data model matches the cognitive
model of the domain expert. This is the model we try to express in our
DDD OO business layer [1] and the model that the developers have
internalized. The productivity gain of not having to translate your
cognitive model to tables and declarative String-based queries is
pretty significant.

So with this in mind our data design process is deceptively simple. We
instruct our fellow developers to:

1. Forget everything you've learned about 1-nNF, E/R, O/R, etc.
Fuck that shit.
2. Grab nearest whiteboard and brainstorm about the domain with
your domain expert.
3. The entities on the whiteboard (customers, carts, documents) are
Nodes, the arrows between them (owns, has_a, buys) are Relationships.
The primitive values (name, age, amount) are Properties.

That's the high level view of it. It works out really well. There's
obviously more to say. For example, we have a number of design
patterns that we apply for structuring the graph at a high level and
for performance. But this is the core of it.

[1] Yes. We're legacy enough to use OO.

Cheers,

-EE

Todd Hoff

Jul 27, 2009, 4:28:08 PM

to nosql-di...@googlegroups.com

Emil, I like your different categories. It would be interesting to see
how data modeling would differ for each system. If they are all
different, that might be a bit of a concern, as that's quite confusing.

Would you also include structured vs unstructured data for map-reduce
type systems?

Emil Eifrém

Jul 27, 2009, 4:53:44 PM

to nosql-di...@googlegroups.com

On Mon, Jul 27, 2009 at 10:28 PM, Todd Hoff<toddho...@gmail.com> wrote:
>
> Emil I like your different categories.

Thanks!

FWIW, I distinguish between emerging non-relational models (outlined
in my previous mail) and traditional non-relational models. Among the
traditional ones I include:

1. CODASYL/Network
2. Hierarchical/Directories (which also includes LDAP/AD)
3. XML
4. OODBs

That's not meant to be an exhaustive list. It's just our subjective
view of the paradigms that are important enough that we must know how
to relate to them.

> It would be interesting to see
> how data modeling would be different for each system. If they are all
> different that might be a bit of concern as that's quite confusing.
>
> Would you also include structured vs unstructured data for map-reduce
> type systems?

Well, I'm not an expert on map-reduce systems, but to me M/R as a
processing model is orthogonal to the underlying representation model.
Which is why we have M/R on top of Bigtable-like systems, on top of
document stores, on top of OLAP systems like Aster/Greenplum, etc. So
that's not a parameter I include.
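
To make the "orthogonal" point concrete, here is a minimal single-process sketch in Python. It is not a real M/R framework: `map_reduce`, the two toy "stores", and the scan functions are hypothetical stand-ins. The same mapper and reducer run unchanged over a Bigtable-style layout and a document-style layout; only the scan that feeds them differs.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Apply mapper to each record, group emitted values by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


# Two toy representations of the same users (made-up data).
wide_rows = {
    "user:1": {"info": {"city": "Austin"}},
    "user:2": {"info": {"city": "Malmo"}},
    "user:3": {"info": {"city": "Austin"}},
}
documents = [
    {"_id": 1, "address": {"city": "Austin"}},
    {"_id": 2, "address": {"city": "Malmo"}},
    {"_id": 3, "address": {"city": "Austin"}},
]


def scan_wide_rows(table):
    # Representation-specific: read one column from each row.
    for columns in table.values():
        yield columns["info"]["city"]


def scan_documents(docs):
    # Representation-specific: read one nested field from each document.
    for doc in docs:
        yield doc["address"]["city"]


def count_mapper(city):
    # Representation-independent: emit (city, 1) for every record.
    yield city, 1


def sum_reducer(key, values):
    # Representation-independent: total the counts for one city.
    return sum(values)


from_rows = map_reduce(scan_wide_rows(wide_rows), count_mapper, sum_reducer)
from_docs = map_reduce(scan_documents(documents), count_mapper, sum_reducer)
```

Both calls produce the same per-city counts, which is the sense in which the processing model does not depend on the representation model underneath it.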

As for structured vs unstructured: My view is that unstructured data
is equal to blobs or media files or something like that, i.e. opaque
entities without relevant internal semantics that are best kept in a
file system (and for text, typically indexed using something like a
full text search engine such as Lucene). The metadata for them is
structured or semi-structured data which does generally belong in the
database. The various models differ in their capabilities for
efficiently handling structured and semi-structured data and that's an
important parameter for me.
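
A small Python sketch of that split, under stated assumptions: the opaque bytes go to the file system, and only the structured metadata goes into the "database", which here is just a dict standing in for whatever store is used. The function names `store_blob` and `find`, the file names, and the metadata fields are all hypothetical.

```python
import tempfile
from pathlib import Path


def store_blob(blob_dir, name, data, metadata_store, **attrs):
    """Write the opaque bytes to a file and record only metadata in the store."""
    path = Path(blob_dir) / name
    path.write_bytes(data)
    metadata_store[name] = {"path": str(path), "size": len(data), **attrs}
    return name


def find(metadata_store, **criteria):
    """Query the structured metadata; the blobs themselves are never inspected."""
    return [
        name
        for name, meta in metadata_store.items()
        if all(meta.get(attr) == value for attr, value in criteria.items())
    ]


with tempfile.TemporaryDirectory() as tmp:
    metadata = {}  # stand-in for the database holding (semi-)structured data
    store_blob(tmp, "talk.pdf", b"%PDF-1.4 ...", metadata,
               content_type="application/pdf")
    store_blob(tmp, "logo.png", b"\x89PNG ...", metadata,
               content_type="image/png")

    # Queries run against metadata only.
    pdfs = find(metadata, content_type="application/pdf")

    # The blob is fetched from the file system via the recorded path.
    blob_bytes = Path(metadata["talk.pdf"]["path"]).read_bytes()
    recorded_size = metadata["talk.pdf"]["size"]
```

How well a given store handles the metadata side (optional fields, nested attributes, ad hoc queries) is where the models differ.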

Cheers,

-EE

Jukka Zitting

Jul 28, 2009, 6:55:30 AM

to nosql-di...@googlegroups.com

Hi,

On Mon, Jul 27, 2009 at 8:12 PM, Ian<ianv...@gmail.com> wrote:

> I'd love to hear from anyone on this list (either in reply or via
> direct message) if you have any insights. Have you come up with
> alternate designs that work much better in the non-relational world?

The Apache Jackrabbit [1] project (which implements a hierarchical
content store specified by the JCR spec [2]) has a set of rough
modeling guidelines called David's Model [3]. Some of the rules are
specific to JCR, but many apply well to any unstructured or
semi-structured database.

[1] http://jackrabbit.apache.org/
[2] http://jcp.org/en/jsr/summary?id=170
[3] http://wiki.apache.org/jackrabbit/DavidsModel

BR,

Jukka Zitting

stack

Jul 28, 2009, 12:21:44 PM

to NOSQL

Ian:

You might find this link of interest:
http://www.slideshare.net/hmisty/20090713-hbase-schema-design-case-studies.
Evan (Qingyan) Liu collated solutions to common problems first using an
RDBMS and then sketching how it might be done in a BigTable-type
system.

St.Ack

Ian

Jul 29, 2009, 6:05:29 PM

to NOSQL

All:

Thanks for so many great answers and suggestions. I'm following up on
all of them and learning quite a bit. I'm working on incorporating all
of this information into my report, and I'll aim to pass along a
rough draft by Monday-ish for any interested parties to give feedback on.
In quick replies:

Emil: love the categorization. I don't really have the scope within
this report to do justice to graph databases (the existence of which I
only recently learned about). In retrospect, I wish I'd known of them
sooner, as I might have considered shifting the entire focus of the
work to be about graph databases, because they're just so darn cool.
As it is, I'm afraid I'll have to give them short
shrift, but I'm hoping to include at least a section about the
paradigm shift in data design that they imply.

Jukka: David's model looks interesting and useful, and I'll attempt to
incorporate some of that into my analysis as well.

Stack: That's an incredibly helpful slide set, thanks much.

Incidentally, here's the full list of systems that I'm attempting to
survey, or at least give passing reference to. I'll give some amount
of explanation & analysis to:

BigTable (& Hypertable, HBase)
Google App Engine Datastore
Dynamo (& M/DB)
Amazon SimpleDB
Microsoft SQL Data Services
Microsoft Azure Tables
Project Voldemort (LinkedIn Data Store)
Cassandra (Facebook Data Store)
PNUTS (Yahoo Data Store)
CouchDB
MongoDB

Then at least passing mention of:

Neo4j
Tokyo Cabinet
MemcacheDB
SQLite
Berkeley DB
Archipelago::Treasure
Chord & DHash
Scalaris
Redis
Persevere

Does anybody know of any that I've missed? Or think that items in the
second list are sufficiently important and/or different from the first
list that they deserve more in-depth discussion (with the exception of
Neo4j, as mentioned above)?

Thanks!
Ian