
NoSQL Movement?


Xah Lee

Mar 3, 2010, 12:36:26 PM
recently i wrote a blog article on The NoSQL Movement
at http://xahlee.org/comp/nosql.html

i'd like to post it somewhere public to solicit opinions, but in the
20 min or so i spent looking, i couldn't find a proper newsgroup or
private list where my somewhat anti-NoSQL-Movement article would fit.

So, i thought i'd post here to solicit some opinions from the
programmer communities i know.

Here's the plain text version

-----------------------------
The NoSQL Movement

Xah Lee, 2010-01-26

In the past few years, there has been new fashionable thinking against
relational databases, now blessed with a catchy term: NoSQL.
Basically, it holds that relational databases are outdated and not
“horizontally” scalable. I'm quite dubious of these claims.

According to Wikipedia's Scalability article, “vertical scalability”
means adding more resources to a single node, such as more CPUs or
memory. (You can easily do this by running your db server on a more
powerful machine.) “Horizontal scalability” means adding more machines.
(And indeed, this is not simple with SQL databases, but again, it is
the same situation with any software, not just databases. To run one
single piece of software on more machines, the software must have some
sort of grid-computing infrastructure built in. This is not a problem
of the software per se; it is just the way things are. It is not a
problem of databases.)

I'm quite old-fashioned when it comes to computer technology. To
convince me of some revolutionary new-fangled technology, i must see
improvement based on a mathematical foundation. I am an expert in SQL,
and believe that the relational database is pretty much the essence of
database technology with respect to math. Sure, a tight definition of
the relations in your data may not be necessary for many applications
that simply need to store, retrieve, and modify data without much
concern for the relations among them. But that is what relational
database technology does too: you just don't worry about normalization
when you design your table schema.

The NoSQL movement is really a scaling movement: about adding more
machines, about so-called “cloud computing” and services with simple
interfaces. (Like so many fashionable movements in the computing
industry, it is not well defined.) It is not really about
anti-relational designs for your data. It's more about adding features
for practical needs, such as easy-to-use APIs (so your users don't
have to know SQL or schemas), the ability to add more nodes,
commercial interface services to your database, and parallel systems
that access your data. Of course, the big old relational database
companies such as Oracle have provided all of these over the years as
they constantly adapt to the industry's changing needs and cheaper
computing power. If you need any relations in your data, you can't
escape the relational database model. That is just the cold truth of
math.

Important data, such as that used in bank transactions, has relations.
You have to have tight relational definitions and assurance of data
integrity.

Here's a second-hand quote from Microsoft's Technical Fellow David
Campbell. Source:

I've been doing this database stuff for over 20 years and I
remember hearing that the object databases were going to wipe out
the SQL databases. And then a little less than 10 years ago the
XML databases were going to wipe out.... We actually ... you
know... people inside Microsoft, [have said] 'let's stop working
on SQL Server, let's go build a native XML store because in five
years it's all going....'

LOL. That's exactly my thought.

Though i'd have to get some hands-on experience with one of those
new database services to see what it's all about.

--------------------
Amazon S3 and Dynamo

Look at structured storage. That seems to be what these NoSQL
databases are. Most are just a key-value pair structure, or just
storage of documents with no relations. I don't see how this differs
from a SQL database using one single table as the schema.
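
To illustrate the claim, here is a minimal sketch (using Python's
built-in sqlite3; the table and key names are made up for the example)
of a key-value store as one two-column SQL table:

    import sqlite3

    # A key-value "store" as one two-column SQL table.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE kv (key TEXT PRIMARY KEY, value TEXT)")

    def put(key, value):
        # INSERT OR REPLACE acts as an upsert on the primary key.
        conn.execute("INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)",
                     (key, value))

    def get(key):
        row = conn.execute("SELECT value FROM kv WHERE key = ?",
                           (key,)).fetchone()
        return row[0] if row else None

    put("user:42", '{"name": "alice"}')
    print(get("user:42"))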

Amazon's S3 is another storage service; it uses Amazon's Dynamo
storage system, indicated by Wikipedia to be one of those NoSQL dbs.
Looking at the S3 and Dynamo articles, it appears the db is just a
distributed hash table, with an added HTTP access interface. So,
basically, little or no relations. Again, i don't see how this is
different from, say, MySQL with one single table of 2 columns, plus
distributed infrastructure. (A distributed database is often an
integrated feature of commercial dbs; e.g., the Wikipedia Oracle
database article cites Oracle Real Application Clusters.)
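
To make the distributed-hash-table idea concrete, here is a naive
sketch of routing keys to servers by hash (the host names are
invented; real systems such as Dynamo use consistent hashing with
replication rather than this modulo scheme):

    import hashlib

    # Invented server list; a real deployment would discover these.
    SERVERS = ["db1.example.com", "db2.example.com", "db3.example.com"]

    def server_for(key):
        # Hash the key, then pick a server by modulo. Naive: adding a
        # server remaps almost every key; consistent hashing avoids that.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return SERVERS[int(digest, 16) % len(SERVERS)]

    print(server_for("user:42"))  # a given key always goes to the same host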

Here's an interesting quote on S3:

Bucket names and keys are chosen so that objects are addressable
using HTTP URLs:

* http://s3.amazonaws.com/bucket/key
* http://bucket.s3.amazonaws.com/key
* http://bucket/key (where bucket is a DNS CNAME record
pointing to bucket.s3.amazonaws.com)

Because objects are accessible by unmodified HTTP clients, S3 can
be used to replace significant existing (static) web hosting
infrastructure.

So this means, for example, i can store all my images in S3, and in my
html document the inline images are just normal img tags with normal
urls. This applies to any other type of file: pdf, audio, even html
itself. So S3 becomes the web host server as well as the file system.
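
For example, here is a minimal sketch of fetching such an object (the
bucket and key names are invented; urllib2 is the Python 2 standard
library of the era, and any plain HTTP client would do the same):

    import urllib2  # Python 2 standard library

    # Invented bucket and key; any public S3 object is addressable this way.
    url = "http://my-bucket.s3.amazonaws.com/images/logo.png"
    data = urllib2.urlopen(url).read()
    with open("logo.png", "wb") as f:
        f.write(data)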

Here's Amazon's instruction on how to use it as an image server. Seems
quite simple: “How to use Amazon S3 for hosting web pages and media
files?” Source

--------------------
Google BigTable

Another is Google's BigTable. I can't make much comment; to make a
sensible comment, one must have some experience of actually
implementing a database. For example, a file system is a sort of
database. Suppose i created a scheme that lets me access my data as
NTFS files distributed over hundreds of PCs, communicating thru http
running Apache. That would let me access my files; to insert and
delete data, one could have cgi scripts on each machine. Would this be
considered a fantastic new NoNoSQL?

---------------------

comments can also be posted to
http://xahlee.blogspot.com/2010/01/nosql-movement.html

Thanks.

Xah
http://xahlee.org/


MRAB

Mar 3, 2010, 2:09:18 PM
to Xah Lee, pytho...@python.org
Xah Lee wrote:
> recently i wrote a blog article on The NoSQL Movement
> at http://xahlee.org/comp/nosql.html
>
> i'd like to post it somewhere public to solicit opinions, but in the
> 20 min or so i spent looking, i couldn't find a proper newsgroup or
> private list where my somewhat anti-NoSQL-Movement article would fit.
>
> So, i thought i'd post here to solicit some opinions from the
> programmer communities i know.
>
[snip]
Couldn't find a relevant newsgroup, so decided to inflict it on a number
of others...

ccc31807

Mar 3, 2010, 3:54:53 PM
On Mar 3, 12:36 pm, Xah Lee <xah...@gmail.com> wrote:
> recently i wrote a blog article on The NoSQL Movement
> at http://xahlee.org/comp/nosql.html
>
> i'd like to post it somewhere public to solicit opinions, but in the
> 20 min or so i spent looking, i couldn't find a proper newsgroup or
> private list where my somewhat anti-NoSQL-Movement article would fit.

I only read the first two paragraphs of your article, so I can't
respond to it.

I've halfway followed the NoSQL movement. My day job is as a database
manager, and I do SQL databases for a living, as well as Perl. I see a
lot of abuse of relational databases in the Real World, as well as a
lot of abuse of non-SQL alternatives, e.g., (mis)using Excel as a
database. The big enterprise database we have at work is built on IBM
UniQuery, which is a non-SQL flat-file database product, so I've had a
lot of experience with big non-SQL database work.

I've also developed a marked preference for plain text databases. For
a lot of applications they are simpler, easier, and better. I've also
had some experience with XML databases, and find that they are ideal
for applications with 'ragged' data.

As with anything else, you need to match the tool to the job. Yes, I
feel that relational database technology has been much used, and much
abused. However, one of my favorite applications is Postgres, and I
think it's absolutely unbeatable where you have to store data and
perform a large number of queries.

Finally, with regard to Structured Query Language itself, I find that
it's well suited to its purpose. I hand-write a lot of SQL statements
for various purposes, and while, as in any language, you sometimes
find it exceedingly difficult to express concepts that you can think,
it mostly allows the expression of what you want to say.

CC.

toby

Mar 3, 2010, 4:55:58 PM
On Mar 3, 3:54 pm, ccc31807 <carte...@gmail.com> wrote:
> On Mar 3, 12:36 pm, Xah Lee <xah...@gmail.com> wrote:
>
> > recently i wrote a blog article on The NoSQL Movement
> > at http://xahlee.org/comp/nosql.html
>
> > i'd like to post it somewhere public to solicit opinions, but in the
> > 20 min or so i spent looking, i couldn't find a proper newsgroup or
> > private list where my somewhat anti-NoSQL-Movement article would fit.
>
> I only read the first two paragraphs of your article, so I can't
> respond to it.
>
> I've halfway followed the NoSQL movement. My day job is as a database
> manager, and I do SQL databases for a living, as well as Perl. I see a
> lot of abuse of relational databases in the Real World, as well as a
> lot of abuse of non-SQL alternatives, e.g., (mis)using Excel as a
> database. The big enterprise database we have at work is built on IBM
> UniQuery, which is a non-SQL flat-file database product, so I've had a
> lot of experience with big non-SQL database work.
>
> I've also developed a marked preference for plain text databases. For
> a lot of applications they are simpler, easier, and better. I've also
> had some experience with XML databases, and find that they are ideal
> for applications with 'ragged' data.
>
> As with anything else, you need to match the tool to the job. Yes, I
> feel that relational database technology has been much used, and much
> abused. However, one of my favorite applications is Postgres, and I
> think it's absolutely unbeatable

It is beatable outside of its sweet spot, like any system. NoSQL is
not so much about "beating" relational databases as simply a blanket
term for useful non-relational technologies. There's not much point in
reading Xah beyond the heading of his manifesto, as it is no more
sensible to be "anti-NoSQL" than to be "anti-integers" because they
don't store fractions.

> where you have to store data and

"relational data"

> perform a large number of queries.

Why does the number matter?

Jonathan Gardner

Mar 3, 2010, 5:23:25 PM
to ccc31807, pytho...@python.org
On Wed, Mar 3, 2010 at 12:54 PM, ccc31807 <cart...@gmail.com> wrote:
>
> As with anything else, you need to match the tool to the job. Yes, I
> feel that relational database technology has been much used, and much
> abused. However, one of my favorite applications is Postgres, and I
> think it's absolutely unbeatable where you have to store data and
> perform a large number of queries.
>

Let me elaborate on this point for those who haven't experienced this
for themselves.

When you are starting a new project and you don't have a definitive
picture of what the data is going to look like or how it is going to
be queried, SQL databases (like PostgreSQL) will help you quickly
formalize and understand what your data needs to do. In this role,
these databases are invaluable. I can see no comparable tool in the
wild, especially not OODBMS.

As you grow in scale, you may eventually reach a point where the
database can't keep up with you. Either you need to partition the data
across machines or you need more specialized and optimized query
plans. When you reach that point, there are a number of options that
don't include an SQL database. I would expect your project to move
those parts of the data away from an SQL database and towards a more
specific solution.

I see it as a sign of maturity with sufficiently scaled software that
they no longer use an SQL database to manage their data. At some point
in the project's lifetime, the data is understood well enough that the
general nature of the SQL database is unnecessary.

--
Jonathan Gardner
jgar...@jonathangardner.net

Avid Fan

Mar 3, 2010, 5:41:37 PM
Jonathan Gardner wrote:

>
> I see it as a sign of maturity with sufficiently scaled software that
> they no longer use an SQL database to manage their data. At some point
> in the project's lifetime, the data is understood well enough that the
> general nature of the SQL database is unnecessary.
>

I am really struggling to understand this concept.

Is it the normalised table structure that is in question or the query
language?

Could you give some sort of example of where SQL would not be the way
to go? The only things I can think of are simple flat-file databases.

Philip Semanchuk

Mar 3, 2010, 9:53:19 PM
to Python (General)

Well, Zope is backed by an object database rather than a relational one.

Jack Diederich

Mar 3, 2010, 10:43:01 PM
to pytho...@python.org
On Wed, Mar 3, 2010 at 12:36 PM, Xah Lee <xah...@gmail.com> wrote:
[snip]

Xah Lee is a longstanding usenet troll. Don't feed the trolls.

mk

Mar 4, 2010, 7:01:38 AM
to pytho...@python.org
Jonathan Gardner wrote:

> When you are starting a new project and you don't have a definitive
> picture of what the data is going to look like or how it is going to
> be queried, SQL databases (like PostgreSQL) will help you quickly
> formalize and understand what your data needs to do. In this role,
> these databases are invaluable. I can see no comparable tool in the
> wild, especially not OODBMS.

FWIW, I talked to my thesis supervisor about the subject, and he
claimed that there are quite a number of papers on OODBMS that point
to fundamental problems with constructing capable query languages for
OODBMS. Sadly, I have not had time to get and read those sources.

Regards,
mk

Duncan Booth

Mar 4, 2010, 8:10:34 AM
Avid Fan <m...@privacy.net> wrote:

Probably one of the best-known large non-SQL databases is Google's
bigtable. Xah Lee of course dismissed this, since he decided to write
about how bad non-SQL databases are without actually looking at the
prime example.

If you look at some of the uses of bigtable you may begin to understand
the tradeoffs that are made with sql. When you use bigtable you have
records with fields, and you have indices, but there are limitations on
the kinds of queries you can perform: in particular you cannot do joins,
but more subtly there is no guarantee that the index is up to date (so
you might miss recent updates or even get data back from a query when
the data no longer matches the query).

By sacrificing some of SQL's power, Google get big benefits: namely,
updating data is a much more localised operation. Instead of an update
having to lock the indices while they are updated, updates to
different records can happen simultaneously, possibly on servers on
opposite sides of the world. You can have many, many servers all using
the same data, although they may not have identical or completely
consistent views of that data.

Bigtable impacts how you store the data: for example, you need to
avoid reducing data to normal form (no joins!); it's much better and
cheaper just to store all the data you need directly in each record.
Also, aggregate values need to be at least partly pre-computed and
stored in the database.
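
As a rough sketch of what that denormalization means in practice (the
field names are invented; shown as plain Python dicts rather than any
particular datastore API):

    # Normalized, join-dependent design: the tweet row holds only a user
    # id, so displaying it requires a join against the users table.
    tweet_normalized = {"id": 1001, "user_id": 42, "text": "hello"}

    # Denormalized, bigtable-style design: everything needed to render
    # the tweet is stored directly in the record, and the follower count
    # is a pre-computed aggregate, updated when it changes.
    tweet_denormalized = {
        "id": 1001,
        "text": "hello",
        "user_id": 42,
        "user_name": "alice",      # copied from the user record
        "user_followers": 1523,    # pre-computed aggregate
    }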

Boiling this down to a concrete example, imagine you wanted to implement
a system like twitter. Think carefully about how you'd handle a
sufficiently high rate of new tweets reliably with a sql database. Now
think how you'd do the same thing with bigtable: most tweets don't
interact, so it becomes much easier to see how the load is spread across
the servers: each user has the data relevant to them stored near the
server they are using and index changes propagate gradually to the rest
of the system.

--
Duncan Booth http://kupuguy.blogspot.com

ccc31807

Mar 4, 2010, 9:21:49 AM
On Mar 3, 4:55 pm, toby <t...@telegraphics.com.au> wrote:
> >  where you have to store data and
>
> "relational data"

Data is neither relational nor unrelational. Data is data.
Relationships are an artifact, something we impose on the data.
Relations are for human convenience, not something inherent in the
data itself.

> > perform a large number of queries.
>
> Why does the number matter?

Have you ever had to make a large number of queries to an XML
database? In some ways, an XML database is the counterpart to a
relational database, in that the data descriptions constitute the
relations. However, since the search is over the XML elements, and you
can't construct indices for XML databases in the same way you can
with relational databases, a large search can take much longer than
you might expect.

CC.

mk

Mar 4, 2010, 9:47:22 AM
to pytho...@python.org
Duncan Booth wrote:

> If you look at some of the uses of bigtable you may begin to understand
> the tradeoffs that are made with sql. When you use bigtable you have
> records with fields, and you have indices, but there are limitations on
> the kinds of queries you can perform: in particular you cannot do joins,
> but more subtly there is no guarantee that the index is up to date (so
> you might miss recent updates or even get data back from a query when
> the data no longer matches the query).

Hmm, I do understand that bigtable is used outside of traditional
'enterprisey' contexts, but suppose you did want to do an equivalent of
join; is it at all practical or even possible?

I guess when you're forced to use denormalized data, you have to
simultaneously update equivalent columns across many tables yourself,
right? Or is there some machinery to assist in that?

> By sacrificing some of SQL's power, Google get big benefits: namely
> updating data is a much more localised operation. Instead of an update
> having to lock the indices while they are updated, updates to different
> records can happen simultaneously possibly on servers on the opposite
> sides of the world. You can have many, many servers all using the same
> data although they may not have identical or completely consistent views
> of that data.

And you still have the global view of the table spread across, say, 2
servers, one located in Australia, second in US?

> Bigtable impacts how you store the data: for example you need to
> avoid reducing data to normal form (no joins!); it's much better and
> cheaper just to store all the data you need directly in each record.
> Also aggregate values need to be at least partly pre-computed and stored
> in the database.

So you basically end up with a few big tables or just one big table really?

Suppose on top of 'tweets' table you have 'dweebs' table, and tweets and
dweebs sometimes do interact. How would you find such interacting pairs?
Would you say "give me some tweets" to tweets table, extract all the
dweeb_id keys from tweets and then retrieve all dweebs from dweebs table?

> Boiling this down to a concrete example, imagine you wanted to implement
> a system like twitter. Think carefully about how you'd handle a
> sufficiently high rate of new tweets reliably with a sql database. Now
> think how you'd do the same thing with bigtable: most tweets don't
> interact, so it becomes much easier to see how the load is spread across
> the servers: each user has the data relevant to them stored near the
> server they are using and index changes propagate gradually to the rest
> of the system.

I guess that in a purely imaginary example, you could also combine two
databases? Say, a bigtable db contains tweets, but with a classical
customer_id column that is also a key in a traditional RDBMS,
referencing a particular customer?

Regards,
mk


Juan Pedro Bolivar Puente

Mar 4, 2010, 11:51:21 AM
On 04/03/10 16:21, ccc31807 wrote:
> On Mar 3, 4:55 pm, toby <t...@telegraphics.com.au> wrote:
>>> where you have to store data and
>>
>> "relational data"
>
> Data is neither relational nor unrelational. Data is data.
> Relationships are an artifact, something we impose on the data.
> Relations are for human convenience, not something inherent in the
> data itself.
>

No, relations are data. "Data is data" says nothing. Data is
information. Actually, all data are relations: relating /values/ to
/properties/ of /entities/. Relations as understood by the "relational
model" are nothing else but the assumption that properties and
entities are first-class values of the data system and that they can
also be related.

JP

George Neuner

Mar 4, 2010, 12:44:23 PM

Well ... sort of. Information is not data but rather the
understanding of something represented by the data. The term
"information overload" is counter-intuitive ... it really means an
excess of data for which there is little understanding.

Similarly, at the level to which you are referring, a relation is not
data but simply a theoretical construct. At this level testable
properties or instances of the relation are data, but the relation
itself is not. The relation may be data at a higher level.

George

ccc31807

Mar 4, 2010, 12:52:06 PM
On Mar 4, 11:51 am, Juan Pedro Bolivar Puente <magnic...@gmail.com>
wrote:
> No, relations are data.

This depends on your definition of 'data.' I would say that
relationships are information gleaned from the data.

> "Data is data" says nothing. Data is
> information.

To me, data and information are not the same thing, and in particular,
data is NOT information. To me, information consists of the sifting,
sorting, filtering, and rearrangement of data that can be useful in
completing some task. As an illustration, consider some very large
collection of ones and zeros: the information it contains depends on
whether it's viewed as a JPEG, an EXE, XML, WAV, or some other sort of
format. Whichever way it's processed, the 'data' (the ones and zeros)
stays the same, and does not constitute 'information' in its raw
state.

> Actually, all data are relations: relating /values/ to
> /properties/ of /entities/. Relations as understood by the "relational
> model" are nothing else but the assumption that properties and
> entities are first-class values of the data system and that they can
> also be related.

Well, this sort of illustrates my point. The 'values' of 'properties'
relating to specific 'entities' depend on how one processes the data,
which can be processed various ways. For example, 01000001 can either
be viewed as the decimal number 65 or the alpha character 'A', but the
decision as to how to view this value isn't inherent in the data
itself; it is only an artifact of our use of the data to turn it into
information.

CC.

Duncan Booth

Mar 4, 2010, 2:15:42 PM
mk <mrk...@gmail.com> wrote:

> Duncan Booth wrote:
>
>> If you look at some of the uses of bigtable you may begin to
>> understand the tradeoffs that are made with sql. When you use
>> bigtable you have records with fields, and you have indices, but
>> there are limitations on the kinds of queries you can perform: in
>> particular you cannot do joins, but more subtly there is no guarantee
>> that the index is up to date (so you might miss recent updates or
>> even get data back from a query when the data no longer matches the
>> query).
>
> Hmm, I do understand that bigtable is used outside of traditional
> 'enterprisey' contexts, but suppose you did want to do an equivalent
> of join; is it at all practical or even possible?
>
> I guess when you're forced to use denormalized data, you have to
> simultaneously update equivalent columns across many tables yourself,
> right? Or is there some machinery to assist in that?

Or you avoid having to do that sort of update at all.

There are many applications for which bigtable simply wouldn't be
appropriate. My point was that to make best use of bigtable you may
have to make different decisions when designing the software.

>
>> By sacrificing some of SQL's power, Google get big benefits: namely
>> updating data is a much more localised operation. Instead of an update
>> having to lock the indices while they are updated, updates to
>> different records can happen simultaneously possibly on servers on
>> the opposite sides of the world. You can have many, many servers all
>> using the same data although they may not have identical or
>> completely consistent views of that data.
>
> And you still have the global view of the table spread across, say, 2
> servers, one located in Australia, second in US?
>

More likely spread across a few thousand servers. The data migrates round
the servers as required. As I understand it records are organised in
groups: when you create a record you can either make it a root record or
you can give it a parent record. So for example you might make all the data
associated with a specific user live with the user record as a parent (or
ancestor). When you access any of that data then all of it is copied onto a
server near the application as all records under a common root are always
stored together.

>> Bigtable impacts how you store the data: for example you need to
>> avoid reducing data to normal form (no joins!); it's much better and
>> cheaper just to store all the data you need directly in each record.
>> Also aggregate values need to be at least partly pre-computed and
>> stored in the database.
>
> So you basically end up with a few big tables or just one big table
> really?

One. Did I mention that bigtable doesn't require you to have the same
columns in every record? The main use of bigtable (outside of Google's
internal use) is Google App Engine and that apparently uses one table.

Not one table per application, one table total. It's a big table.

>
> Suppose on top of 'tweets' table you have 'dweebs' table, and tweets
> and dweebs sometimes do interact. How would you find such interacting
> pairs? Would you say "give me some tweets" to tweets table, extract
> all the dweeb_id keys from tweets and then retrieve all dweebs from
> dweebs table?

If it is one tweet to many dweebs?

Dweeb.all().filter("tweet =", tweet.key())
or:
GqlQuery("SELECT * FROM Dweeb WHERE tweet = :tweet", tweet=tweet)

or just make the tweet the ancestor of all its dweebs.

Columns may be scalars or lists, so if it is some tweets to many dweebs you
can do basically the same thing.

Dweeb.all().filter("tweets =", tweet.key())

but if there are too many tweets in the list that could be a problem.

If you want dweebs for several tweets you could select with "tweet IN " the
list of tweets or do a separate query for each (not much difference, as I
understand it the IN operator just expands into several queries internally
anyway).
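
For anyone who hasn't used App Engine, here is a minimal sketch of the
model classes those queries assume (the google.appengine.ext.db API of
the time; the Tweet/Dweeb names just follow this thread's running
example, and the property names are invented). It only runs inside the
App Engine environment:

    from google.appengine.ext import db

    class Tweet(db.Model):
        text = db.StringProperty()

    class Dweeb(db.Model):
        # One tweet to many dweebs: each dweeb points back at its tweet.
        tweet = db.ReferenceProperty(Tweet)
        # Some tweets to many dweebs: a list-valued column; the same
        # filter syntax works because list membership is indexed.
        tweets = db.ListProperty(db.Key)
        body = db.StringProperty()

    some_tweet = Tweet(text="hello")
    some_tweet.put()
    Dweeb(tweet=some_tweet, body="a dweeb").put()

    # The queries quoted above, in context:
    dweebs_for_one = Dweeb.all().filter("tweet =", some_tweet.key()).fetch(10)
    dweebs_for_many = Dweeb.all().filter("tweets =", some_tweet.key()).fetch(10)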

Tim X

Mar 4, 2010, 9:20:02 PM
ccc31807 <cart...@gmail.com> writes:

Most XML databases are just a revamp of hierarchical databases, which
are one of the two common models that came before relational
databases. Hierarchical, network and relational databases all have
their uses.

Some 'xml' databases, like eXist-db, have some pretty powerful
indexing technologies. While they differ from relational db indexing,
because they are based around hierarchies rather than relations, they
do provide the ability to do fast queries, in the same way that
indexes in relational databases allow fast queries over relations.
Both solutions can do fast queries; they are just optimised for
different types of queries. Likewise, other database technologies that
tend to fall into this category, such as CouchDB and MongoDB, are
aimed at applications and problems that aren't suitable for the
relational db model and are better suited to the types of applications
they have been designed for.

As usual, Xah's rantings are of little substance here. Yes, he is
right that 'nosql' is essentially just another buzzword like 'web
2.0', but so what? This is an industry that loves its buzzwords. Often
it's just marketing hype or a convenient placeholder for some vague
'concept' a journalist, consultant or blogger wants to wank on about.

You cannot hate or love 'nosql' without defining exactly what you mean
by the term. Xah starts by acknowledging that the term is ill defined
and then goes on to say how he doesn't like it because it lacks the
mathematical precision of the relational algebra that underpins the
relational model. It seems somewhat ironic to put forward an argument
focusing on the importance of precision when you fail to be precise
regarding the thing you're arguing against.

His point is further weakened by the failure to realise that SQL, the
relational model and relational algebra are different things. Not
having SQL doesn't automatically mean you cannot have a relational
model or operations that are based on relational algebra. SQL is just
the convenient query language, and while it has succeeded where other
languages have not, it's just one way of interacting with a relational
database. As a language, SQL isn't even 'pure', in that it has
operations that don't fit with the relational algebra he claims is so
important, and includes facilities that are really business-convenience
operations that actually corrupt the mathematical model and purity
that are the basis of his poorly formed argument. He also overlooks
the fact that none of the successful relational databases have
remained true to either the relational model or the underlying theory.
All of the major RDBMSs have corrupted things for marketing,
performance or maintenance reasons. Only a very few vendors have
stayed true to the relational model, and none of them have achieved
much in the way of market share. I wonder why?

All Xah is doing is being the net's equivalent of radio's 'shock
jock'. He searches for some topical issue, identifies a stance that he
feels will elicit the greatest number of emotional responses, and lobs
it into the crowd. He rarely hangs around to debate his claims. When
he does, he tends to just yell and shout and, more often than not,
uses personal attacks to defend his statements rather than arguing the
topic. His analysis is usually shallow and based on popularism. If
someone disagrees, they are a moron or a fool, and if they agree, they
are a genius just like him.

Just like true radio shock jocks, some will love him and some will
hate him. The only things we can be certain about are that reaction is
a much higher motivator for his posts than conviction, that there is
probably an inverse relationship between IQ and support for his
arguments, and that his opinion probably has the same longevity as the
term nosql.

Now if we can just get back to debating important topics like why
medical reform is the start of communism, how single mothers are
leeching off taxpayers, the positive aspects of slavery, why blacks
are all criminals, how governments are evil, the holocaust conspiracy,
why all muslims are terrorists, the benefits of global warming, and
the bad science corrupting our children's innocence, we can stop
wasting time debating this technology stuff. And please, people,
listen to your hearts and follow your gut. Don't let your intellect or
so-called facts get in the way; trust your emotions.

Tim


--
tcross (at) rapttech dot com dot au

John Nagle

Mar 5, 2010, 1:35:45 AM
Xah Lee wrote:
> recently i wrote a blog article on The NoSQL Movement
> at http://xahlee.org/comp/nosql.html
>
> i'd like to post it somewhere public to solicit opinions, but in the
> 20 min or so i spent looking, i couldn't find a proper newsgroup or
> private list where my somewhat anti-NoSQL-Movement article would fit.

Too much rant, not enough information.

There is an argument against using full relational databases for
some large-scale applications, ones where the database is spread over
many machines. If the database can be organized so that each transaction
only needs to talk to one database machine, the locking problems become
much simpler. That's what BigTable is really about.

For many web applications, each user has more or less their own data,
and most database activity is related to a single user. Such
applications can easily be scaled up with a system that doesn't
have inter-user links. There can still be inter-user references,
but without a consistency guarantee. They may lead to dead data,
like Unix/Linux symbolic links. This is a mechanism adequate
for most "social networking" sites.

There are also some "consistent-eventually" systems, where a query
can see old data. For non-critical applications, those can be
very useful. This isn't a SQL/NoSQL thing; MySQL asynchronous
replication is a "consistent-eventually" system. Wikipedia uses
that for the "special" pages which require database lookups.

If you allow general joins across any tables, you have to have all
the very elaborate interlocking mechanisms of a distributed database.
The serious database systems (MySQL Cluster and Oracle, for example)
do offer that, but there are usually substantial complexity penalties,
and the databases have to be carefully organized to avoid excessive
cross-machine locking. If you don't need general joins, a system which
doesn't support them is far simpler.

John Nagle

Gregory Ewing

Mar 5, 2010, 1:54:48 AM
Duncan Booth wrote:
> Did I mention that bigtable doesn't require you to have the same
> columns in every record? The main use of bigtable (outside of Google's
> internal use) is Google App Engine and that apparently uses one table.
>
> Not one table per application, one table total. It's a big table.

Seems to me in that situation the term "table" ceases to
have any meaning. If every record can have its own unique
structure, and data from completely unrelated applications
can all be thrown in together, there's no more reason to
call it a "table" than, say, "file", "database" or "big
pile of data".

--
Greg

Duncan Booth

Mar 5, 2010, 4:07:49 AM
Gregory Ewing <greg....@canterbury.ac.nz> wrote:

I think it depends on how you look at things.

AppEngine uses a programming model which behaves very much as though
you do have multiple tables: you use a Django-style programming model
to wrap each record in a class, or you can use the GQL query language,
which looks very similar to SQL, and again the class names appear in
place of table names.

So conceptually, thinking of it as separate tables makes a lot of
sense, but the underlying implementation is apparently a single table
where records simply have a type field that is used to instantiate the
proper class on the Python side (and another hidden application field
to prevent you accessing another application's data). Also, of course,
records with the same type still may not all have the same columns if
you change the class definition between runs, so even at this level
they aren't tables in the strict SQL sense.

There may of course be other applications using bigtable where the
bigtable(s) look much more like ordinary tables, but I don't know what
they are.

From your suggested terms, I'd agree you could call it a database rather
than a table but I think it's too structured for "file" or "big pile of
data" to be appropriate.

Juan Pedro Bolivar Puente

Mar 5, 2010, 4:37:04 AM
to ccc31807

Well, it depends, as you said, on the definition of information;
actually, your definition of data fits the information-theoretic
definition of information as a sequence of symbols... But I understand
that in other contexts /information/ can also mean the next level of
abstraction on top of /data/, in the same way that /knowledge/ is the
next level of abstraction on top of information; let's ground our
basis on that.

In any case, your definition and George's still support my point of
view that relations are data: they are stored in the computer as a
sequence of ones and zeroes and, in that sense, are indistinguishable
from anything else in the data space. Of course, they are key data for
being able to recover information, and especially for adding new
information consistently to the data store... That SQL includes
special syntax for manipulating relations should not hide this fact;
and one can still query the relational information in the same way one
would query non-relational data in most DBMSs anyway...

Anyway, I'm sorry for drifting the conversation away... Going back to
the main topic, I agree with the general view on this thread that
relational databases (information-bases? ;) and non-relational ones
are there to do different jobs. It is only by carefully examining a
problem that we can decide which one fits it better, with relational
databases having the clear advantage that their mathematically
grounded basis makes their fitness for most problems quite clear,
while the preference for non-relational systems is a more technical
and empirical question of the trade-offs between consistency,
scalability and so on.

JP

Bruno Desthuilliers

Mar 5, 2010, 5:54:28 AM
Philip Semanchuk wrote:

And it ended up being a *major* PITA on all Zope projects I've worked on...

mk

Mar 5, 2010, 6:56:34 AM
to pytho...@python.org
Bruno Desthuilliers wrote:
>> Well, Zope is backed by an object database rather than a relational one.
>
> And it ended up being a *major* PITA on all Zope projects I've worked on...

Care to write a few sentences on the nature of the problems with ZODB?
I was flirting with the thought of using it on some project.

Regards,
mk

floaiza

Mar 7, 2010, 6:55:14 PM
I don't think there is any doubt about the value of relational
databases, particularly on the Internet. The issue in my mind is how
to leverage all the information that resides in the "deep web" using
strictly the relational database paradigm.

Because that paradigm imposes a tight and rigid coupling between
semantics and syntax, when you attempt to efficiently "merge" or
"federate" data from disparate sources you can find yourself spending
a lot of time and money building mappings and maintaining translators.

That's why approaches that try to separate syntax from the semantics
are now becoming so popular, but, again, as others have said, it is
not a matter of replacing one with the other, but of figuring out how
best to exploit what each technology offers.

I base my remarks on some initial explorations I have made on the use
of RDF Triple Stores, which, by the way, use RDBMSs to persist the
triples, but which offer a really high degree of flexibility WRT the
merging and federating of data from different semantic spaces.
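
As a rough sketch of why an RDBMS can persist triples so readily (a
toy schema with invented names; production triple stores add indexes
over every column permutation), a triple store is at bottom a single
three-column table:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    # A toy triple store: every fact is a (subject, predicate, object) row.
    conn.execute(
        "CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
    conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", [
        ("ex:alice", "foaf:name", "Alice"),
        ("ex:alice", "foaf:knows", "ex:bob"),
        ("ex:bob", "foaf:name", "Bob"),
    ])
    # "Merging" another source is just inserting more rows; no schema
    # change is needed, unlike adding columns to a rigid table.
    for (obj,) in conn.execute("SELECT object FROM triples "
                               "WHERE subject = 'ex:alice' "
                               "AND predicate = 'foaf:knows'"):
        print(obj)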

The way I hope things will move forward is that eventually it will
become inexpensive and easy to "expose" as RDF triples all the
relevant data that now sits in special-purpose databases.

(just an opinion)

Francisco

On Mar 3, 12:36 pm, Xah Lee <xah...@gmail.com> wrote:

> recently i wrote a blog article on The NoSQL Movement
> at http://xahlee.org/comp/nosql.html
[snip]
Xah Lee

Mar 8, 2010, 10:12:05 AM
many people mentioned scalability... though i think it is fruitful to
talk about at what size NoSQL databases offer better scalability than
SQL databases.

For example, if you are within the world's top 100 users of databases
in terms of database size, such as Google, then off-the-shelf tools
may be limiting. But how many users really have such a massive amount
of data?

note that google's need for databases today isn't just a search
engine. Its db size for google search is probably larger than all the
rest of the search-engine companies' combined. Plus, there's youtube
(video hosting), gmail, google code (source-code hosting), google
blog, orkut (social networking), picasa (photo hosting), etc., each
ranked within the top 5 or so against its respective competitors in
terms of number of accounts... so google's data size is probably
number one among the world's database users, probably double or triple
that of the second-largest user. At that point, it seems logical that
they need their own db, relational or not.

Xah
http://xahlee.org/

On Mar 4, 10:35 pm, John Nagle <na...@animats.com> wrote:
> Xah Lee wrote:
> > recently i wrote a blog article on The NoSQL Movement
> > at http://xahlee.org/comp/nosql.html

Duncan Booth

Mar 8, 2010, 2:14:37 PM
Xah Lee <xah...@gmail.com> wrote:

> For example, if you are within the world's top 100 users of databases
> in terms of database size, such as Google, then off-the-shelf tools
> may be limiting. But how many users really have such a massive amount
> of data?

You've totally missed the point. It isn't the size of the data you have
today that matters, it's the size of data you could have in several years'
time.

Maybe today you've got 10 users each with 10 megabytes of data, but you're
aspiring to become the next twitter/facebook or whatever. It's a bit late
as you approach 100 million users (and a petabyte of data) to discover that
your system isn't scalable: scalability needs to be built in from day one.

Xah Lee

Mar 9, 2010, 9:24:42 AM
On Mar 8, 11:14 am, Duncan Booth <duncan.bo...@invalid.invalid> wrote:
> Xah Lee <xah...@gmail.com> wrote:
> > For example, if you are within the world's top 100 users of databases
> > in terms of database size, such as Google, then off-the-shelf tools
> > may be limiting. But how many users really have such a massive amount
> > of data?
>
> You've totally missed the point. It isn't the size of the data you have
> today that matters, it's the size of data you could have in several years'
> time.

so, you're saying that in several years we'd all become the world's
top 100 database users in terms of size, like Google?

Xah
http://xahlee.org/


Bruno Desthuilliers

Mar 9, 2010, 10:02:04 AM
mk wrote:

It would require more than a few sentences. But mostly, it's about the
very nature of the ZODB: it's a giant graph of Python objects.

So:
1/ your "data" are _very_ tightly dependent on the language and
application code
2/ you have to hand-write each and every graph traversal
3/ accessing a given object usually loads quite a few others into
memory

I once thought the Zodb was cool.

Duncan Booth

Mar 10, 2010, 12:26:46 PM
Xah Lee <xah...@gmail.com> wrote:

No, I'm saying that if you plan to build a business that could grow you
should be clear up front how you plan to handle the growth. It's too late
if you suddenly discover your platform isn't scalable just when you need to
scale it.

Xah Lee

Mar 10, 2010, 5:36:58 PM
On Mar 10, 9:26 am, Duncan Booth <duncan.bo...@invalid.invalid> wrote:
> No, I'm saying that if you plan to build a business that could grow you
> should be clear up front how you plan to handle the growth. It's too late
> if you suddenly discover your platform isn't scalable just when you need to
> scale it.

Right, but that doesn't seem to have any relevance to my point. Many
say that scalability is key to NoSQL; i pointed out that unless you
are like google, or ranked in the top 1000 in the world in terms of
data size, the scalability argument isn't that strong.

Xah Lee wrote:

> many people mentioned scalability... though i think it is fruitful to
> talk about at what size NoSQL databases offer better scalability than
> SQL databases.
>
> For example, if you are within the world's top 100 users of databases
> in terms of database size, such as Google, then off-the-shelf tools
> may be limiting. But how many users really have such a massive amount
> of data? note that google's need for

dkeeney

Mar 11, 2010, 11:00:29 AM
On Mar 8, 12:14 pm, Duncan Booth <duncan.bo...@invalid.invalid> wrote:

> You've totally missed the point. It isn't the size of the data you have
> today that matters, it's the size of data you could have in several years'
> time.
>
> Maybe today you've got 10 users each with 10 megabytes of data, but you're
> aspiring to become the next twitter/facebook or whatever. It's a bit late
> as you approach 100 million users (and a petabyte of data) to discover that
> your system isn't scalable: scalability needs to be built in from day one.

Do you have examples of sites that got big by planning their site
architecture from day 0 to be big?

Judging from published accounts, even Facebook and Twitter did not
plan to be 'the next twitter/facebook'; each started with routine
LAMP stack architecture and successfully re-engineered the
architecture multiple times on the way up.

Is there compelling reason to think the 'next twitter/facebook' can't
and won't climb a very similar path?

I see good reasons to think that they *will* follow the same path, in
that there are motivations at both ends of the path for re-engineering
as you go. When the site is small, resources committed to the backend
are not spent on making the frontend useful, so business-wise the best
backend is the cheapest one. When the site becomes super-large, the
backend gets re-engineered based on what the organization learned
while the site was just large; Facebook, Twitter, and Craigslist all
have architectures custom-designed to support their specific needs.
Had they tried to design for large size while they were small, they
would have failed; they couldn't have known enough then about what
they would eventually need.

The only example I can find of a large site that architected large
very early is Google, and they were aiming for a market (search) that
was already known to be huge.

It's reasonable to assume that the 'next twitter/facebook' will *not*
be in web search, social networking, broadcast instant messaging, or
classified ads, just because those niches are taken already. So
whichever 'high-scalability' model the aspiring site uses will be the
wrong one. They might as well start with a quick and cheap LAMP
stack, and re-engineer as they go.

Just one internet watcher's biased opinion...

David


www.rdbhost.com -> SQL databases via a web-service

Jonathan Gardner

Mar 12, 2010, 4:05:27 AM
to Avid Fan, pytho...@python.org
On Wed, Mar 3, 2010 at 2:41 PM, Avid Fan <m...@privacy.net> wrote:
> Jonathan Gardner wrote:
>>
>> I see it as a sign of maturity with sufficiently scaled software that
>> they no longer use an SQL database to manage their data. At some point
>> in the project's lifetime, the data is understood well enough that the
>> general nature of the SQL database is unnecessary.
>>
>
> I am really struggling to understand this concept.
>
> Is it the normalised table structure that is in question or the query
> language?
>
> Could you give some sort of example of where SQL would not be the way
> to go? The only things I can think of are simple flat-file databases.

Sorry for the late reply.

Let's say you have an application that does some inserts and updates
and such. Eventually, you are going to run into a limitation on the
number of inserts and updates you can do at once. The typical solution
to this is to shard your database. However, there are other solutions,
such as storing the data in a different kind of database, one which
is less general but more efficient for your particular data.

Let me give you an example. I worked on a system that would load
recipients for email campaigns into a database table. The SQL database
was nice during the initial design and prototype stage because we
could quickly adjust the tables to add or remove columns and try out
different designs. However, once our system got popular, the
limitation was how fast we could load recipients into the database.
Rather than make our DB bigger or shard the data, we discovered that
storing the recipients outside of the database in flat files was the
precise solution we needed. Each file represented a different email
campaign. The nature of the data was that we didn't need random
access, just serial access. Storing the data this way also meant
sharding the data was almost trivial. Now, we can load a massive
number of recipients in parallel.
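
A rough sketch of the pattern (the file layout and names are invented
for illustration): one append-only flat file per campaign, so loading
is a pure serial write and sharding by campaign is trivial:

    import csv
    import os

    SPOOL_DIR = "campaigns"  # invented layout: one flat file per campaign

    def append_recipients(campaign_id, recipients):
        # Appending is a serial write: no index maintenance, no per-row
        # transaction overhead, and files for different campaigns can be
        # written in parallel on different machines.
        path = os.path.join(SPOOL_DIR, campaign_id + ".csv")
        with open(path, "ab") as f:  # binary mode for Python 2's csv
            csv.writer(f).writerows(recipients)

    def iter_recipients(campaign_id):
        # Sending mail needs only serial access: read start to end.
        with open(os.path.join(SPOOL_DIR, campaign_id + ".csv"), "rb") as f:
            for email, name in csv.reader(f):
                yield email, name

    if not os.path.isdir(SPOOL_DIR):
        os.makedirs(SPOOL_DIR)
    append_recipients("spring_sale", [("a@example.com", "Alice")])
    print(list(iter_recipients("spring_sale")))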

You are going to discover certain patterns in how your data is used
and those patterns may not be best served by a generic relational
database. The relational database will definitely help you discover
and even experiment with these patterns, but eventually, you will find
its limits.

--
Jonathan Gardner
jgar...@jonathangardner.net

D'Arcy J.M. Cain

Mar 12, 2010, 9:30:51 AM
to Jonathan Gardner, pytho...@python.org, Avid Fan
On Fri, 12 Mar 2010 01:05:27 -0800
Jonathan Gardner <jgar...@jonathangardner.net> wrote:
> Let me give you an example. I worked on a system that would load
> recipients for email campaigns into a database table. The SQL database
> was nice during the initial design and prototype stage because we
> could quickly adjust the tables to add or remove columns and try out
> different designs.. However, once our system got popular, the
> limitation was how fast we could load recipients into the database.

Just curious, what database were you using that wouldn't keep up with
you? I use PostgreSQL and would never consider going back to flat
files. The only thing I can think of that might make flat files faster
is that flat files are buffered whereas PG guarantees that your
information is written to disk before returning, but if speed is more
important than 100% reliability you can turn that off and let PG use
the file-system buffering just like flat files.

--
D'Arcy J.M. Cain <da...@druid.net> | Democracy is three wolves
http://www.druid.net/darcy/ | and a sheep voting on
+1 416 425 1212 (DoD#0082) (eNTP) | what's for dinner.

Paul Rubin

Mar 12, 2010, 2:23:10 PM
"D'Arcy J.M. Cain" <da...@druid.net> writes:
> Just curious, what database were you using that wouldn't keep up with
> you? I use PostgreSQL and would never consider going back to flat
> files.

Try making a file with a billion or so names and addresses, then
compare the speed of inserting that many rows into a postgres table
against the speed of copying the file.
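
A sketch of the comparison being made (psycopg2 API; the table name is
invented). Per-row INSERTs pay per-statement parsing, planning and
commit costs that PostgreSQL's bulk COPY path does not:

    import psycopg2

    conn = psycopg2.connect("dbname=test")
    cur = conn.cursor()

    # Slow path: one INSERT per row; each statement is parsed, planned
    # and written through the full transaction machinery.
    def insert_rows(rows):
        for name, address in rows:
            cur.execute(
                "INSERT INTO people (name, address) VALUES (%s, %s)",
                (name, address))
        conn.commit()

    # Fast path: stream a tab-separated flat file through COPY,
    # PostgreSQL's bulk loader.
    def copy_file(path):
        with open(path) as f:
            cur.copy_from(f, "people", sep="\t",
                          columns=("name", "address"))
        conn.commit()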

> The only thing I can think of that might make flat files faster is
> that flat files are buffered whereas PG guarantees that your
> information is written to disk before returning

Don't forget all the shadow page operations and the index operations,
and that a lot of these operations require reading as well as writing
remote parts of the disk, so buffering doesn't help avoid every disk
seek.

Generally when faced with this sort of problem I find it worthwhile to
ask myself whether the mainframe programmers of the 1960's-70's had to
deal with the same thing, e.g. when sending out millions of phone bills,
or processing credit card transactions (TPF), then ask myself how they
did it. Their computers had very little memory or disk space by today's
standards, so their main bulk storage medium was mag tape. A heck of a
lot of these data processing problems can be recast in terms of sorting
large files on tape, rather than updating database one record at a time
on disk or in memory. And that is still what (e.g.) large search
clusters spend a lot of their time doing (look up the term "pennysort"
for more info).
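
To make the tape-era recasting concrete, here is a minimal external
merge sort sketch (file names are placeholders; records are assumed to
be newline-terminated lines): sort memory-sized chunks into runs, then
do a k-way merge, which is exactly the old tape workflow:

    import heapq
    import itertools
    import tempfile

    def external_sort(in_path, out_path, max_lines=1000000):
        # Pass 1: read memory-sized chunks, sort each, and write it out
        # as a sorted "run" (the equivalent of a sorted scratch tape).
        runs = []
        with open(in_path) as f:
            while True:
                chunk = list(itertools.islice(f, max_lines))
                if not chunk:
                    break
                chunk.sort()
                run = tempfile.TemporaryFile(mode="w+")
                run.writelines(chunk)
                run.seek(0)
                runs.append(run)
        # Pass 2: k-way merge of the runs; heapq.merge is lazy, so only
        # one line per run is held in memory at a time.
        with open(out_path, "w") as out:
            out.writelines(heapq.merge(*runs))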

Jonathan Gardner

Mar 14, 2010, 3:42:31 AM
to da...@druid.net, pytho...@python.org
On Fri, Mar 12, 2010 at 11:23 AM, Paul Rubin <no.e...@nospam.invalid> wrote:
> "D'Arcy J.M. Cain" <da...@druid.net> writes:
>> Just curious, what database were you using that wouldn't keep up with
>> you?  I use PostgreSQL and would never consider going back to flat
>> files.
>
> Try making a file with a billion or so names and addresses, then
> compare the speed of inserting that many rows into a postgres table
> against the speed of copying the file.
>

Also consider how much work it is to partition data from flat files
versus PostgreSQL tables.

>> The only thing I can think of that might make flat files faster is
>> that flat files are buffered whereas PG guarantees that your
>> information is written to disk before returning
>
> Don't forget all the shadow page operations and the index operations,
> and that a lot of these operations require reading as well as writing
> remote parts of the disk, so buffering doesn't help avoid every disk
> seek.
>

Plus the fact that your other DB operations slow down under the load.

--
Jonathan Gardner
jgar...@jonathangardner.net

D'Arcy J.M. Cain

Mar 14, 2010, 9:55:13 AM
to Jonathan Gardner, pytho...@python.org
On Sat, 13 Mar 2010 23:42:31 -0800
Jonathan Gardner <jgar...@jonathangardner.net> wrote:
> On Fri, Mar 12, 2010 at 11:23 AM, Paul Rubin <no.e...@nospam.invalid> wrote:
> > "D'Arcy J.M. Cain" <da...@druid.net> writes:
> >> Just curious, what database were you using that wouldn't keep up with
> >> you?  I use PostgreSQL and would never consider going back to flat
> >> files.
> >
> > Try making a file with a billion or so names and addresses, then
> > compare the speed of inserting that many rows into a postgres table
> > against the speed of copying the file.

That's a straw man argument. Copying an already-built database to
another copy of the database won't take significantly longer than
copying an already-built file. In fact, it's the same operation.

> Also consider how much work it is to partition data from flat files
> versus PostgreSQL tables.

Another straw man. I'm sure you can come up with many contrived
examples to show one particular operation faster than another.
Benchmark writers (bad ones) do it all the time. I'm saying that in
normal, real world situations where you are collecting billions of data
points and need to actually use the data that a properly designed
database running on a good database engine will generally be better than
using flat files.

> >> The only thing I can think of that might make flat files faster is
> >> that flat files are buffered whereas PG guarantees that your
> >> information is written to disk before returning
> >
> > Don't forget all the shadow page operations and the index operations,
> > and that a lot of these operations require reading as well as writing
> > remote parts of the disk, so buffering doesn't help avoid every disk
> > seek.

Not sure what a "shadow page operation" is but index operations are
only needed if you have to have fast access to read back the data. If
it doesn't matter how long it takes to read the data back then don't
index it. I have a hard time believing that anyone would want to save
billions of data points and not care how fast they can read selected
parts back or organize the data though.

> Plus the fact that your other DB operations slow down under the load.

Not with the database engines that I use. Sure, speed and load are
connected whether you use databases or flat files but a proper database
will scale up quite well.

Steve Holden

Mar 14, 2010, 10:16:43 AM
to pytho...@python.org
D'Arcy J.M. Cain wrote:

> On Sat, 13 Mar 2010 23:42:31 -0800
> Jonathan Gardner <jgar...@jonathangardner.net> wrote:
>> On Fri, Mar 12, 2010 at 11:23 AM, Paul Rubin <no.e...@nospam.invalid> wrote:
>>> "D'Arcy J.M. Cain" <da...@druid.net> writes:
>>>> Just curious, what database were you using that wouldn't keep up with
>>>> you? I use PostgreSQL and would never consider going back to flat
>>>> files.
>>> Try making a file with a billion or so names and addresses, then
>>> compare the speed of inserting that many rows into a postgres table
>>> against the speed of copying the file.
>
> That's a straw man argument. Copying an already built database to
> another copy of the database won't take significantly longer than
> copying an already built file. In fact, it's the same operation.
>
>> Also consider how much work it is to partition data from flat files
>> versus PostgreSQL tables.
>
> Another straw man. I'm sure you can come up with many contrived
> examples to show one particular operation faster than another.
> Benchmark writers (bad ones) do it all the time. I'm saying that in
> normal, real-world situations where you are collecting billions of
> data points and need to actually use the data, a properly designed
> database running on a good database engine will generally be better
> than using flat files.
>
>>>> The only thing I can think of that might make flat files faster is
>>>> that flat files are buffered whereas PG guarantees that your
>>>> information is written to disk before returning
>>> Don't forget all the shadow page operations and the index operations,
>>> and that a lot of these operations require reading as well as writing
>>> remote parts of the disk, so buffering doesn't help avoid every disk
>>> seek.
>
> Not sure what a "shadow page operation" is but index operations are
> only needed if you have to have fast access to read back the data. If
> it doesn't matter how long it takes to read the data back then don't
> index it. I have a hard time believing that anyone would want to save
> billions of data points and not care how fast they can read selected
> parts back or organize the data though.
>
>> Plus the fact that your other DB operations slow down under the load.
>
> Not with the database engines that I use. Sure, speed and load are
> connected whether you use databases or flat files but a proper database
> will scale up quite well.
>
A common complaint about large database loads taking a long time comes
about because of trying to commit the whole change as a single
transaction. Such an approach can indeed cause stress on the database
system, but it isn't usually necessary.
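
Just to illustrate the point, here's a minimal sketch of committing in
chunks with Python and psycopg2 (the connection string, table, and the
stand-in data are all invented for the example):

    import psycopg2  # PostgreSQL driver, assumed installed

    rows = [("Alice", "123 Main St"), ("Bob", "456 Oak Ave")]  # stand-in data

    conn = psycopg2.connect("dbname=test")  # hypothetical database
    cur = conn.cursor()
    BATCH = 10000  # commit every 10,000 rows instead of one huge transaction
    for i, (name, addr) in enumerate(rows, 1):
        cur.execute("INSERT INTO people (name, address) VALUES (%s, %s)",
                    (name, addr))
        if i % BATCH == 0:
            conn.commit()
    conn.commit()  # pick up the final partial batch
    conn.close()

Each commit caps how much uncommitted work the server has to track, at
the price of losing all-or-nothing semantics for the load.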

I don't know about PostgreSQL's capabilities in this area but I do know
that Oracle (which claims to be all about performance, though in fact I
believe PostgreSQL is its equal in many applications) allows you to
switch off the various time-consuming features such as transaction
logging in order to make bulk updates faster.

I also question how many databases would actually find a need to store
addresses for a sixth of the world's population, but this objection is
made mostly for comic relief: I understand that tables of such a size
are necessary sometimes.

There was a talk at OSCON two years ago by someone who was using
PostgreSQL to process 15 terabytes of medical data. I'm sure he'd have
been interested in suggestions that flat files were the answer to his
problem ...

Another talk a couple of years before that discussed how PostgreSQL was
superior to Oracle in handling a three-terabyte data warehouse (though
it conceded Oracle's superiority in handling the production OLTP system
on which the warehouse was based - but that was four years ago).

http://images.omniti.net/omniti.com/~jesus/misc/BBPostgres.pdf

Of course if you only need sequential access to the data then the
relational approach may be overkill. I would never argue that relational
is the best approach for all data and all applications, but it's often
better than its less-informed critics realize.

regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
See PyCon Talks from Atlanta 2010 http://pycon.blip.tv/
Holden Web LLC http://www.holdenweb.com/
UPCOMING EVENTS: http://holdenweb.eventbrite.com/

D'Arcy J.M. Cain

unread,
Mar 14, 2010, 10:48:10 AM3/14/10
to Steve Holden, pytho...@python.org
On Sun, 14 Mar 2010 10:16:43 -0400
Steve Holden <st...@holdenweb.com> wrote:
> A common complaint about large database loads taking a long time comes
> about because of trying to commit the whole change as a single
> transaction. Such an approach can indeed cause stress on the database
> system, but it isn't usually necessary.

True.

> I don't know about PostgreSQL's capabilities in this area but I do know
> that Oracle (which claims to be all about performance, though in fact I
> believe PostgreSQL is its equal in many applications) allows you to
> switch off the various time-consuming features such as transaction
> logging in order to make bulk updates faster.

Yes, PostgreSQL has a bulk loading option as well. It's usually useful
when you are copying data from one database into another and need to do
it quickly.
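
PostgreSQL's bulk path is the COPY command. For instance, a rough
sketch of driving it from Python with psycopg2 (the file layout and
table name are invented for the example):

    import psycopg2

    conn = psycopg2.connect("dbname=test")  # hypothetical database
    cur = conn.cursor()
    # COPY streams the whole file in one operation, skipping per-row
    # INSERT overhead; names.tsv is a hypothetical tab-separated file
    # of (name, address) lines.
    with open("names.tsv") as f:
        cur.copy_from(f, "people", sep="\t", columns=("name", "address"))
    conn.commit()
    conn.close()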

> I also question how many databases would actually find a need to store
> addresses for a sixth of the world's population, but this objection is
> made mostly for comic relief: I understand that tables of such a size
> are necessary sometimes.

How about Microsoft storing its user base? Oh wait, they only store
registered users with legitimate copies. Never mind.

> Of course if you only need sequential access to the data then the
> relational approach may be overkill. I would never argue that relational
> is the best approach for all data and all applications, but it's often
> better than its less-informed critics realize.

As a rule I find that in the real world, the larger the dataset the
more likely it is that you need a proper database. For smaller datasets
it doesn't matter, so why not use a DB anyway and be prepared for when
that "throwaway" system suddenly becomes your company's core
application.

News123

unread,
Mar 14, 2010, 10:52:06 AM3/14/10
to
Hi Duncan,

any project/product has to adapt over time.

Not using SQL just because your 20-user application with 100 data sets
might grow into the world's biggest database doesn't seem right to me.

I strongly believe in not overengineering a product.

For anything I do I use the most convenient Python library first.
This allows me to get results quickly and to get feedback about the
product ASAP.

Later on I switch to more performant versions.

bye

N

Jonathan Gardner

unread,
Mar 15, 2010, 12:57:58 AM3/15/10
to D'Arcy J.M. Cain, pytho...@python.org
On Sun, Mar 14, 2010 at 6:55 AM, D'Arcy J.M. Cain <da...@druid.net> wrote:

> On Sat, 13 Mar 2010 23:42:31 -0800
> Jonathan Gardner <jgar...@jonathangardner.net> wrote:
>> On Fri, Mar 12, 2010 at 11:23 AM, Paul Rubin <no.e...@nospam.invalid> wrote:
>> > "D'Arcy J.M. Cain" <da...@druid.net> writes:
>> >> Just curious, what database were you using that wouldn't keep up with
>> >> you?  I use PostgreSQL and would never consider going back to flat
>> >> files.
>> >
>> > Try making a file with a billion or so names and addresses, then
>> > compare the speed of inserting that many rows into a postgres table
>> > against the speed of copying the file.
>
> That's a straw man argument.  Copying an already built database to
> another copy of the database won't be significantly longer than copying
> an already built file.  In fact, it's the same operation.
>

I don't understand what you're trying to get at.

Each bit of data follows a particular path through the system. Each
bit of data has its own requirements for availability and consistency.
No, relational DBs don't have the same performance characteristics as
other data systems, because they do different things.

If you have data that fits a particular style well, then I suggest
using that style to manage that data.

Let's say I have data that needs to hang around for a little while and
then disappear into the archives. Let's say you hardly ever do random
access on this data because you always work with it serially or in
large batches. This is exactly like the recipient data for the email
campaign.

>> Also consider how much work it is to partition data from flat files
>> versus PostgreSQL tables.
>
> Another straw man.  I'm sure you can come up with many contrived
> examples to show one particular operation faster than another.
> Benchmark writers (bad ones) do it all the time.  I'm saying that in
> normal, real-world situations where you are collecting billions of
> data points and need to actually use the data, a properly designed
> database running on a good database engine will generally be better
> than using flat files.
>

You're thinking in general terms. Yes, RDBMSes do wonderful things in
the general case. However, in very specific circumstances, an RDBMS
does a whole lot worse.

Think of the work involved in sharding an RDBMS instance. You need to
properly implement two-phase commit above and beyond the normal work
involved. I haven't run into a multi-master replication system that is
trivial. When you find one, let me know, because I'm sure there are
caveats and corner cases that make things really hard to get right.

Compare this to simply distributing flat files to one of many
machines. It's a whole lot easier to manage and easier to understand,
explain, and implement.
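
For what it's worth, a toy sketch of that kind of distribution in
Python (the shard count, file names, and record layout are all made
up):

    # Route each record to one of N shard files by hashing a key;
    # each shard file can then be shipped to its own machine.
    import zlib

    N_SHARDS = 8  # hypothetical number of machines

    shards = [open("recipients.%d.txt" % i, "w") for i in range(N_SHARDS)]
    with open("recipients.txt") as src:
        for line in src:
            key = line.split("\t")[0]  # assume the key is the first field
            shards[zlib.crc32(key.encode()) % N_SHARDS].write(line)
    for f in shards:
        f.close()

No two-phase commit and no replication setup; rebalancing means
changing N_SHARDS and re-splitting the files.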

You should use the right tool for the job. Sometimes the data doesn't
fit the profile of an RDBMS, or the RDBMS overhead makes managing the
data more difficult than it needs to be. In those cases, it makes a
whole lot of sense to try something else out.

>> >> The only thing I can think of that might make flat files faster is
>> >> that flat files are buffered whereas PG guarantees that your
>> >> information is written to disk before returning
>> >
>> > Don't forget all the shadow page operations and the index operations,
>> > and that a lot of these operations require reading as well as writing
>> > remote parts of the disk, so buffering doesn't help avoid every disk
>> > seek.
>

> Not sure what a "shadow page operation" is but index operations are
> only needed if you have to have fast access to read back the data.  If
> it doesn't matter how long it takes to read the data back then don't
> index it.  I have a hard time believing that anyone would want to save
> billions of data points and not care how fast they can read selected
> parts back or organize the data though.
>

I don't care how the recipients for the email campaign were indexed. I
don't need an index because I don't do random access. I simply need
the list of people I am going to send the email campaign to, properly
filtered and de-duped, of course. This doesn't have to happen within
the database. There are wonderful tools like "sort" and "uniq" to do
this work for me, far faster than an RDBMS can do it. In fact, I don't
think you can come up with a faster solution than "sort" and "uniq".
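
A rough sketch of that pipeline driven from Python (the file names are
invented for the example):

    # De-dupe a recipient list by shelling out to the standard Unix
    # tools, which stream and spill to temporary files instead of
    # holding everything in memory.
    import subprocess

    with open("recipients.sorted.txt", "w") as out:
        subprocess.check_call(["sort", "-u", "recipients.txt"], stdout=out)

sort -u does the sort and the de-dupe in a single pass over the output.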

>> Plus the fact that your other DB operations slow down under the load.
>
> Not with the database engines that I use.  Sure, speed and load are
> connected whether you use databases or flat files but a proper database
> will scale up quite well.
>

I know for a fact that "sort" and "uniq" are far faster at this job
than any RDBMS. The reason is obvious: they stream through the data
sequentially, with none of the transactional or indexing overhead.

--
Jonathan Gardner
jgar...@jonathangardner.net
