Good short paper on Parallel DB, Hadoop, HadoopDB

ffoe...@gmail.com

Jul 30, 2010, 5:29:59 PM

to vscse-big-data-...@googlegroups.com

email me off list if you can't access the PDF :)

----
http://www.springerlink.com/index/2682X0T22T30141J.pdf

Tradeoffs between Parallel Database Systems, Hadoop, and HadoopDB as Platforms for Petabyte-Scale Analysis
Daniel J. Abadi
Yale University
d...@cs.yale.edu

Abstract. As the market demand for analyzing data sets of increasing variety and scale continues to explode, the software options for performing this analysis are beginning to proliferate. No fewer than a dozen companies have launched in the past few years that sell parallel database products to meet this market demand. At the same time, MapReduce-based options, such as the open source Hadoop framework, are becoming increasingly popular, and there have been a plethora of research publications in the past two years that demonstrate how MapReduce can be used to accelerate and scale various data analysis tasks.
Both parallel databases and MapReduce-based options have strengths and weaknesses that a practitioner must be aware of before selecting an analytical data management platform. In this talk, I describe some experiences in using these systems, and the advantages and disadvantages of the popular implementations of these systems. I then discuss a hybrid system that we are building at Yale University, called HadoopDB, that attempts to combine the advantages of both types of platforms. Finally, I discuss our experience in using HadoopDB for both traditional decision support workloads (i.e., TPC-H) and also scientific data management (analyzing the Uniprot protein sequence, function, and annotation data).
Keywords: MapReduce, parallel databases, scalable systems, fault tolerant systems, analytical data management.

Tom Scavo

Jul 30, 2010, 6:27:20 PM

to vscse-big-data-...@googlegroups.com

On Fri, Jul 30, 2010 at 4:29 PM, <ffoe...@gmail.com> wrote:
> email me off list if you can't access the PDF :)
>
> ----
> http://www.springerlink.com/index/2682X0T22T30141J.pdf
>
> Tradeoffs between Parallel Database Systems, Hadoop, and HadoopDB as Platforms for Petabyte-Scale Analysis
> Daniel J. Abadi
> Yale University
> d...@cs.yale.edu

Here's a blog post by the author himself with no strings attached :-)

http://dbmsmusings.blogspot.com/2009/07/announcing-release-of-hadoopdb-longer.html

This is a *must read* article, by the way, if you want to understand
the trade-offs between SQL and NoSQL.

Tom

Michael Matthews

Jul 30, 2010, 6:40:32 PM

to vscse-big-data-...@googlegroups.com

FYI, with regard to SQL vs. not SQL, if anyone is interested: my (former) company was evaluating CouchDB and MongoDB. We were storing large amounts of digital media metadata, and my understanding is that both scale pretty well and are used by some pretty important companies. Plus they both offer robust Python drivers, which is pretty much all you need in the world anyway.

ffoe...@gmail.com

Jul 31, 2010, 3:03:32 AM

to vscse-big-data-...@googlegroups.com

Another one I discovered (commercial, $$$) is Greenplum: http://www.greenplum.com

There is a free single-node license:

• Unlimited production usage on a single commodity x86 server using up to 2 CPU sockets (and unlimited CPU cores), or in a single virtual machine using up to 8 virtual CPU cores.

• Fully parallel SQL and MapReduce processing leverages multi-core parallel-processing engine for every query.

• No storage capacity cap – from GBs to 10s of TBs.

• Hybrid row and column-oriented processing.

• Free community support as well as a low-cost, paid support option.

• Ability to expand beyond Single-Node Edition to a multi-node massively-parallel Greenplum Database deployment.

With the new 6-core Intels and a ton of RAM, a 20K machine should be able to handle a lot. The limiting factor there would be the NIC. I wonder if this would work well with a parallel FS. Hmmmm.
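For a rough sense of why the NIC could be the limiting factor, here is a back-of-envelope sketch. The numbers are assumptions for illustration only (10 TB of data, 1 Gb/s and 10 Gb/s links, no protocol overhead); none of them come from Greenplum, and `transfer_hours` is a hypothetical helper, not part of any product.

```python
def transfer_hours(data_tb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Hours needed to move `data_tb` terabytes over a `link_gbps` gigabit/s link.

    Assumes decimal units (1 TB = 1e12 bytes, 1 Gb/s = 1e9 bits/s).
    `efficiency` is an assumed fraction of line rate actually achieved.
    """
    bits = data_tb * 1e12 * 8                      # terabytes -> bits
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical scenario: pulling 10 TB on or off a single node.
    for gbps in (1, 10):
        print(f"10 TB over {gbps} Gb/s: {transfer_hours(10, gbps):.1f} h")
```

Under these assumptions, moving 10 TB over a single 1 Gb/s link takes roughly 22 hours at full line rate, so loading data from (or spilling it to) a remote or parallel file system would likely be bound by the network rather than by cores or RAM.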
