Business Intelligence solution in Cloud Computing


SRINIVASAN GANESAN

unread,
Jun 17, 2008, 3:40:14 PM6/17/08
to cloud-c...@googlegroups.com
Folks,
Thanks for sharing valuable points... Just by reading the postings I have picked up quite a bit of information.
I was wondering if any of you have experience with (or know a vendor for) running a data-warehouse-based business intelligence solution in a cloud.
For instance: accept data through FTP, run it through an ETL tool to load the dimensional model, and point the reports, dashboards and whatnot against the model...
Do the cloud vendors support this model?
Thanks
Ramesh.



Khazret Sapenov

unread,
Jun 17, 2008, 4:09:05 PM6/17/08
to cloud-c...@googlegroups.com
Ramesh,
I know of a similar solution from NASDAQ.
Quote:
NASDAQ Market Replay provides a NASDAQ-validated replay and analysis of the activity in the stock market. The application is built using the Adobe Flex and AIR platform, and utilizes the Amazon Simple Storage Service (S3) for persisting historical market data.
Source: https://data.nasdaq.com/mr.aspx
 
salut,
Khaz Sapenov

Naren Chawla

unread,
Jun 17, 2008, 4:35:34 PM6/17/08
to cloud-c...@googlegroups.com

You can also look at LucidEra - http://www.lucidera.com/solutions/index.php

 

--Naren


Subhasis Dasgupta

unread,
Jun 18, 2008, 3:43:24 AM6/18/08
to cloud-c...@googlegroups.com
These are links I have seen but have not used; they are providing BI solutions on EC2:
Pentaho
 http://blog.vmdatamine.com/2007/08/pentaho-business-intelligence-suite-on.html
Weka
http://blog.vmdatamine.com/2008/02/gridweka-on-ec2.html

-Subhasis





--
Subhasis Dasgupta
Indian Representative
Kaavo Inc
Stamford
CT, USA
www.kaavo.com
Phone : +919830282548
skype : subhasis.dasgupta

Dilli Babu

unread,
Jun 18, 2008, 8:59:32 AM6/18/08
to Cloud Computing
For data-intensive requirements such as clickstream analysis, call data reports, etc., there is a cloud edition available from Vertica on Amazon Web Services.

Check here for the details:

http://solutions.amazonwebservices.com/connect/entry.jspa?externalID=1469

If you have huge data volumes and have issues generating data-intensive reports, Vertica's columnar, on-the-cloud architecture will be a good option.
--
Best Regards,
Dilli Babu
On-line Computing Architect,
DataSisar,
5 & 6 Walton road,
Bangalore-560001
E-mail: dill...@datasisar.com
Mobile:+919449191299
Visit: http://www.datasisar.com


Utpal Datta

unread,
Jun 18, 2008, 11:15:40 AM6/18/08
to cloud-c...@googlegroups.com
Hi All

I am *very* new to this group. But I am really excited by the quality of postings in the group. I am learning a lot, quickly.

I have a couple of questions. Maybe someone has some answers.

1. I think "data in the cloud" is so far a big block to widespread adoption and to using the cloud for large, sensitive and mission-critical applications (especially for financial organizations). Is someone thinking of a way to leave the data within the user premises and do just the computing in the cloud? Kind of a reverse connection back to the user datacenter.

That way the conventional data repositories can still be used. The users will not have to worry about the reliability, availability and (to a large part) security of the data. We still have to worry about the security of the data travelling back and forth between the cloud and the user data center.

This probably is more relevant for medium to large scale users with
"sensitive" data.

Comments? tips?

2. Considering that "cloud computing" is at the beginning of its adoption curve, the user data center will, for a long time, have a mixture of their own physical and virtual devices within their datacenter, along with their "virtual" datacenters in one or more clouds (maybe from different vendors).

The user will obviously look for a management portal that seamlessly crosses the boundaries of physical, virtual and cloud devices (for discovery and monitoring, at the very least).

Is there some talk/thought on standardizing the "cloud management actions" and "cloud management data" interfaces?

Comments? tips?

Thanks

--utpal

Khazret Sapenov

unread,
Jun 18, 2008, 1:50:41 PM6/18/08
to cloud-c...@googlegroups.com
On Wed, Jun 18, 2008 at 11:15 AM, Utpal Datta <utpa...@gmail.com> wrote:

1. I think "data in the cloud" is so far a big block to widespread
adoption and using cloud for large, sensitive and mission critical
applications (espicially for Financial organization). Is someone
thinking of a way to leave the data within the user-premises and do
just the computing in the cloud? Kind of a reverse connection back to
the user datacenter.

That way the conventional data respositories can still be used. The
users will not have to worry about the reliability, availability and
(to a large part) security of the data. We still have to worry about
the security of the data travelling back and forth to and from the
cloud to the user data center.

This probably is more relevant for medium to large scale users with
"sensitive" data.

Comments? tips?
I was also thinking about some kind of staged DMZ-like data island on premises (with enforced access policies) that has a protected communication/transport channel to various compute cloud providers.

As a simple example, I had a use case with a Maya 3D render job using NFS/SMB shares for input and output files, where the NFS server was located on premises and the rendering process was done by multiple remote nodes on Amazon Elastic Compute Cloud, orchestrated by LSF.
 
salut,
Khaz Sapenov
 
 

Chris K Wensel

unread,
Jun 19, 2008, 11:08:15 AM6/19/08
to cloud-c...@googlegroups.com
On Jun 18, 2008, at 8:15 AM, Utpal Datta wrote:

> 1. I think "data in the cloud" is so far a big block to widespread
> adoption and using cloud for large, sensitive and mission critical
> applications (espicially for Financial organization). Is someone
> thinking of a way to leave the data within the user-premises and do
> just the computing in the cloud? Kind of a reverse connection back to
> the user datacenter.
>
> That way the conventional data respositories can still be used. The
> users will not have to worry about the reliability, availability and
> (to a large part) security of the data. We still have to worry about
> the security of the data travelling back and forth to and from the
> cloud to the user data center.
>
> This probably is more relevant for medium to large scale users with
> "sensitive" data.
>
> Comments? tips?


I've been processing large historical data sets for a financial company I'm consulting with, using Cascading/Hadoop on EC2/S3.

The biggest bottleneck has been getting data to the compute
infrastructure.

The obvious pattern is to have datacenter processes push data to S3, then have the temporary cluster spin up and pull the data from S3, do something interesting, push the results back to S3, notify the datacenter that the job is complete (via SQS), and have the datacenter pull the results down from S3.
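For what it's worth, a rough sketch of that round trip, just to make the moving parts concrete (the bucket name, queue URL, and key layout are hypothetical; boto3 is used purely for illustration, and the cluster-side crunching is omitted):

# Sketch only: datacenter-side push, and pull-after-notification. The cluster-side
# processing (Hadoop/Cascading in my case) reads and writes the same S3 keys.
import json
import boto3  # assumes AWS credentials are configured in the environment

s3 = boto3.client("s3")
sqs = boto3.client("sqs")
BUCKET = "example-analytics-bucket"                                       # hypothetical
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/job-done"   # hypothetical

def push_input(local_path, dataset_key):
    """Datacenter process: push a dataset file up to S3 for the cluster to consume."""
    s3.upload_file(local_path, BUCKET, "input/" + dataset_key)

def wait_and_fetch_results(local_dir):
    """Datacenter process: block on the SQS queue, then pull the result set down."""
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, WaitTimeSeconds=20)
        for msg in resp.get("Messages", []):
            body = json.loads(msg["Body"])   # e.g. {"result_key": "output/run-42.gz"}
            key = body["result_key"]
            s3.download_file(BUCKET, key, local_dir + "/" + key.split("/")[-1])
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
            return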

Because of the need to support both well-defined daily processes and ad-hoc processes, my client's data generally needs to stay on S3. Having it pulled from a remote datacenter on duplicate runs would be extraordinarily slow and expensive, considering Amazon charges for bandwidth in and out. Plus, it is a bit cheaper just to keep data on S3 than to buy a NAS for storage.

That said, with bandwidth being the bottleneck in the face of the ability to spin up 100 or 1000 nodes to crunch numbers, larger pipes into a vendor's cloud would be very welcome. Otherwise your cloud solution is only as fast as getting data in and out of it.

chris

--
Chris K Wensel
ch...@wensel.net
http://chris.wensel.net/
http://www.cascading.org/


timothy norman huber

unread,
Jun 19, 2008, 11:18:21 AM6/19/08
to cloud-c...@googlegroups.com

On Jun 19, 2008, at 8:08 AM, Chris K Wensel wrote:

> That said, with bandwidth being the bottleneck in the face of the
> ability to spin up 100 or 1000 nodes to crunch numbers

how big are the datasets you're working with? Random or linear access ?

Timothy Huber
Strategic Account Development

tim....@metaram.com
cell 310 795.6599

MetaRAM Inc.
181 Metro Drive, Suite 400
San Jose, CA 95110

Utpal Datta

unread,
Jun 19, 2008, 12:16:01 PM6/19/08
to cloud-c...@googlegroups.com
You make all the right points on speed, bandwidth, Amazon charging for bandwidth, etc. But consider the needs of the user (say a large financial company with a sensitive, business-critical application):

1. Who will guarantee that the data in S3 is secure from physical and logical access?

2. Who will guarantee that the data is always available, using a multi-site recovery system (which is what they would have in their own data center) that meets their RPO (Recovery Point Objective) and RTO (Recovery Time Objective) guidelines?

Either Amazon or other cloud providers will make these available with EC2 and S3 (or some other storage mechanism with more robust security and availability characteristics), or the users will have to build something similar on their own using EC2 as their basic building block.

This will be a *very* non-trivial task for any user to do on their own, and they will have to decide whether to put resources into building this on a cloud or to invest more in their own datacenter.

So I guess a lot will depend on the level of maturity of the clouds. I am not sure whether all this work belongs in a mid-layer outside of the original cloud, leaving the cloud providers to just provide the basic building blocks.

--utpal

Chris K Wensel

unread,
Jun 19, 2008, 12:27:51 PM6/19/08
to cloud-c...@googlegroups.com
> On Jun 19, 2008, at 8:08 AM, Chris K Wensel wrote:
>
>> That said, with bandwidth being the bottleneck in the face of the
>> ability to spin up 100 or 1000 nodes to crunch numbers
>
> how big are the datasets you're working with? Random or linear
> access ?
>

Total data is in the 100s of GB. Individual workloads are ~10 GB. All linear (this being Hadoop), but there is much joining, binning, and crunching between the multiple input datasets (the actual workload translates to ~60 MapReduce jobs, all rendered and managed by Cascading).

So it kinda sucks to have uploads of data to the cluster take longer than it does to compute on it. Worse, since my client then has to fetch the derived data back.

ckw

Marc Evans

unread,
Jun 19, 2008, 12:46:44 PM6/19/08
to cloud-c...@googlegroups.com
I have been contemplating this issue of safe storage in the cloud. My opinion is that what I need is at least four in-the-cloud storage vendors, on top of which I can layer RAID-5 behavior, combined with a loopback-encrypted file system. Even with that, pulling the data into the compute cloud places the data in danger of being observed and possibly tampered with. This all ignores latency problems, which I am certain will be an issue, as well as transit costs.
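A toy illustration of that layering, under the stated assumption of four hypothetical storage back ends: three data stripes plus one XOR parity stripe, each encrypted before it leaves the premises. A real implementation would still need chunking, key management, and rebuild logic:

import functools, operator
from cryptography.fernet import Fernet   # symmetric encryption, illustration only

VENDORS = ["vendor_a", "vendor_b", "vendor_c", "vendor_d"]   # hypothetical back ends
key = Fernet.generate_key()              # in practice this key never leaves the premises
cipher = Fernet(key)

def split_with_parity(data, stripes=3):
    """Split data into equal stripes plus one XOR parity stripe (RAID-5-like):
    any single lost stripe can be rebuilt by XOR-ing the remaining ones."""
    size = -(-len(data) // stripes)      # ceiling division
    parts = [data[i * size:(i + 1) * size].ljust(size, b"\0") for i in range(stripes)]
    parity = bytes(functools.reduce(operator.xor, col) for col in zip(*parts))
    return parts + [parity]

def store(data):
    """Encrypt every stripe locally, then hand one stripe to each storage vendor."""
    return {vendor: cipher.encrypt(stripe)
            for vendor, stripe in zip(VENDORS, split_with_parity(data))}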

I personally would like my application-at-the-edge software to also span a number of in-the-cloud vendors, so that I don't experience vendor lock-in problems. In particular, I am concerned that my public-facing services will be targets of DDoS attacks and that, as a result, vendors will consider abruptly discontinuing service.

For these reasons, I have not been able to consider much of what the in-the-cloud providers can offer to date, though I continue to build proof-of-concept packages in preparation for the point in time when the industry evolves enough to facilitate my needs. I am very curious whether others have similar concerns and whether plausible solutions are being found...

- Marc

Chaz.

unread,
Jun 19, 2008, 12:57:55 PM6/19/08
to cloud-c...@googlegroups.com
While data access and recovery is a very important aspect of cloud
computing, I'm curious as to the legal issues surrounding the movement
of data across national boundaries or even across company boundaries.

How does the "cloud" protect data going from the owner to the computing
service without being compromised (read that as sniffed)? Will a
computing service in country A have the right to impose restrictions on
data from another country (even if the results of the computing don't
affect the citizens of country A)? An so on.

Chuck Wegrzyn

Chaz.

unread,
Jun 19, 2008, 1:00:04 PM6/19/08
to cloud-c...@googlegroups.com
Hi Marc! Those are not the only problems. I'd be worried about trans-country laws governing the data. After all, once it is in country A, the laws of that country would hold.

Chuck Wegrzyn

Marc Evans

unread,
Jun 19, 2008, 1:06:35 PM6/19/08
to cloud-c...@googlegroups.com
Hey Chuck!

I agree with your concerns. Thus far I have been using vendors within
single governance regions, and then having a policy engine at my
application layer to govern where data is allowed to be operated upon.
So, EU data stays in the EU for example. As the vendors grow to span
multiple boundaries, if they are not providing programmatic interfaces
to allow application layer control of these issues, I may need to avoid
those vendors.

- Marc

On SaaS

unread,
Jun 19, 2008, 1:10:54 PM6/19/08
to cloud-c...@googlegroups.com
Data locality is definitely a huge issue in the cloud. My company works with a lot of multi-nationals with huge data sets in various countries. Many countries, especially the EU members as well as others like Mexico, have fairly strict laws around privacy data (e.g., data with personal info, etc.). Some of these multi-nationals have to architect their on-premise software around these restrictions (e.g., putting on-premise software in each country) and restrict the data movement. One of them took several months to study the laws and the legality of data location and movement before implementing their solution.

So the location of the cloud and the data is definitely going to be very important to these multi-nationals. That's part of the reason why Amazon has an EU cloud and Salesforce is building a cloud in Singapore. Some countries are also wary of putting any data inside the U.S. due to concerns about the Patriot Act. In general, the country where the data resides has jurisdiction over it.

--
OnSaaS.net - Blogging about the SaaS and cloud computing world
OnSaaS.info - Providing a continuous stream of SaaS and cloud computing news

Pittard, Rick

unread,
Jun 19, 2008, 1:13:22 PM6/19/08
to cloud-c...@googlegroups.com
One big concern is compliance with the data privacy laws in the EU and other countries, which require protection of personal data and that it not be transmitted to locations that have weaker protections. Since the laws in the US are generally less protective than those in the EU, additional controls/agreements need to be in place to legally move the data from the EU to the US.

Rick

Chaz.

unread,
Jun 19, 2008, 1:25:33 PM6/19/08
to cloud-c...@googlegroups.com
I think privacy is one aspect of data movement but what I see as a
bigger problem is that it might become a national security issue. How
about one country not allowing the data to leave once "it" has
possession? Or organizations like the NSA mining the data as it passes
through the borders.

Chuck Wegrzyn


Chaz.

unread,
Jun 19, 2008, 1:22:10 PM6/19/08
to cloud-c...@googlegroups.com
That probably works well now. In the future I would expect compute clouds to be available in 'cheaper' locales (think of Washington State...lol) or Finland; at that point it becomes a real issue.

Chuck

Chaz.

unread,
Jun 19, 2008, 1:40:20 PM6/19/08
to cloud-c...@googlegroups.com
While I think trans-national data movement will be an area that requires governance of some kind, I think that companies can get around the problem in other ways. I think it just requires looking at the problem in a different way.

I'd think the approach is to keep the data still and move the computing
to it. The idea is to see the thousands of machines it takes to hold the
petabytes worth of data as the compute cloud. What needs to move to it
is the programs that can process the data. I've been working on this
approach for the last 3 years (Twisted Storage).

Chuck Wegrzyn

Khazret Sapenov

unread,
Jun 19, 2008, 1:51:55 PM6/19/08
to cloud-c...@googlegroups.com


On Thu, Jun 19, 2008 at 1:40 PM, Chaz. <eprpa...@gmail.com> wrote:
[snip]

I'd think the approach is to keep the data still and move the computing
to it. The idea is to see the thousands of machines it takes to hold the
 petabytes worth of data as the compute cloud. What needs to move to it
is the programs that can process the data. I've been working on this
approach for the last 3 years (Twisted Storage).

Chuck Wegrzyn
 
This is a valid approach, which I have personally called the "Plumber Pattern": an application, encapsulated in some kind of container (e.g., a virtual machine image), is marshalled to secure data islands to iteratively do its unique work (say, match on some criterion in Interpol, FBI, CIA, MI5 and other databases, all distributed across continents). Due to the utterly confidential nature of these types of data, it is impossible to move them to public storage (at least at this time). The above-mentioned case might be extrapolated to some lines of business with reduced privacy/security requirements as well.
 
Khaz Sapenov 
 

Stuart Altenhaus

unread,
Jun 19, 2008, 1:44:28 PM6/19/08
to cloud-c...@googlegroups.com
I think Chaz is right. There are privacy issues regarding the use and exposure of data that vary country by country. If the cloud computes the data, there is no control over where that data is moved for computation, right?

R/s,
Stu Altenhaus

Sent from my Verizon Wireless BlackBerry


On SaaS

unread,
Jun 19, 2008, 1:57:18 PM6/19/08
to cloud-c...@googlegroups.com
That depends on how the cloud is architected, no?

And I would think the cloud providers will have to start answering these questions if they want large enterprises to start adopting the cloud. There may be no control over which server in the cloud is doing the computation, but service providers may provide options to restrict based on geographic domains.

We have quite a few people here from the cloud providers, maybe they can share some insight?

thx

--
OnSaaS.net - Blogging about the SaaS and cloud computing world
OnSaaS.info - Providing a continuous stream of SaaS and cloud computing news

Chaz.

unread,
Jun 19, 2008, 2:00:55 PM6/19/08
to cloud-c...@googlegroups.com
I know from my work that many firms are reluctant to let their data "out the door" since they see that as their edge in the market. But setting that aside for a minute, it seems to make more sense to move "small" programs (relative to the size of the data) than to move massive amounts of data.

So my question is as follows: what makes a good "storage cloud"?

Chuck Wegrzyn

Chaz.

unread,
Jun 19, 2008, 2:20:08 PM6/19/08
to cloud-c...@googlegroups.com
That is one approach - again, it seems to indicate the model is the data moving to the compute resources. The other approach is to look at it from the data perspective - can the data sit someplace and the compute come to it?


Chuck Wegrzyn


Ray Nugent

unread,
Jun 19, 2008, 2:17:13 PM6/19/08
to cloud-c...@googlegroups.com
Chris, a couple of thoughts -

1) Is it possible to have the app run on AWS so that the derived data does not need to traverse back down in real time? (That way you could use a lazy download in the background to archive it in their DC while their app accesses the copy in real time on AWS.)

2) I've been thinking about the problem of upload times as well (in the context of large DNA data sets). The cost of loading into AWS is not that prohibitive, so if one were to pre-process that data such that it could be uploaded in a bunch of parallel processes to AWS, you could reduce the bottleneck considerably. In theory.

Ray


Marc Evans

unread,
Jun 19, 2008, 2:45:44 PM6/19/08
to cloud-c...@googlegroups.com
In my experience, there are cases where having the data/computation as close to the customer edge as possible is what is required for an acceptable user experience. In other cases, the relationship of the user/data/computation is not important. Most often, there is a mix of both. One of the ideas behind Hadoop, as I understand it, is to bring the computation to the data location, while also providing for the data to be in several locations. The scheduler is critical to making good use of data locality. So yes, I believe that what you are looking for does exist within Hadoop at a minimum, though I also believe that there is a lot of room to evolve the techniques that it uses.

- Marc

Jim Peters

unread,
Jun 19, 2008, 2:40:35 PM6/19/08
to cloud-c...@googlegroups.com
Even if the cloud providers come up with excellent answers to the security and reliability questions, who's going to trust them? Credit card numbers are one thing, but cloud data is something else entirely.

+J
--
Jim Peters
+415-608-0851

Chaz.

unread,
Jun 19, 2008, 3:10:33 PM6/19/08
to cloud-c...@googlegroups.com
Jim,

I definitely agree with your point. I can't think of very many multi-nationals that would let their data out to wander around. I'd think they would want to protect their data and move the computing resources close to it....

Chuck


Stuart Altenhaus

unread,
Jun 19, 2008, 2:39:03 PM6/19/08
to cloud-c...@googlegroups.com
If the programs are moved to the data, then what is the distinction between cloud computing and CORBA? It seems like the same basic tenets would have to be in place.

(I'm new to the concept of cloud computing, but I do see the opportunities for advancing a network of computers that renders geo-location trivial. Surely by enhancing existing network clouds so that computing power is placed at each node, a net-centric approach is achieved... The telcos do that today, right?)

Sent from my Verizon Wireless BlackBerry


Utpal Datta

unread,
Jun 19, 2008, 3:43:50 PM6/19/08
to cloud-c...@googlegroups.com
Maybe this is a redundant question: where is this protected data residing? In the cloud, or in the user's data center?

If it is in the cloud, then we are still dealing with the security, availability and recoverability issues (that everyone agrees on).

If it is in the user's data center, then how will the computing resources offered (and controlled by Amazon) be brought to that specific user's datacenter?

--utpal

Lynne VanArsdale

unread,
Jun 19, 2008, 3:57:37 PM6/19/08
to cloud-c...@googlegroups.com
Just joined cloud-computing and this is the first conversation I've received.
 
A couple of weeks ago I attended Gartner Security where Neil MacDonald spoke on "Adaptive Security." In a nutshell, this approach builds a resilient system for secure data, acting much like the human immune system.  It involves whitelisting as the foundation, blacklisting as a mid-tier and learned/adaptive mechanisms at the top.  In such an environment, elements would be "autonomic" and self-managing to a large degree, and would share and communicate with other elements to protect workloads and information (as opposed to endpoints).  There is a lot more to this vision, and it is probably a number of years away, but it may be a reasonable approach to address the concerns about data security being discussed here.
 
In any case, does anyone know of any product or standards efforts for the industry to collaborate on a more cohesive architecture for security in the cloud?

 

Chris K Wensel

unread,
Jun 19, 2008, 4:03:21 PM6/19/08
to cloud-c...@googlegroups.com
CORBA isn't about mobility; it's just typesafe OO RPC. There was work done by ObjectSpace and General Magic in the '90s on agent-based computing (move the code to the data), but that movement died off.

If the cloud is a collection of compute resources, and you need to apply them to lots of your data, you have little choice but to move your data. You can't move the compute power (unless you order a shipping container of servers, I guess).

ckw

--

Chris K Wensel

unread,
Jun 19, 2008, 3:55:05 PM6/19/08
to cloud-c...@googlegroups.com
1) is it possible to have the app run on AWS so that the derived data does not need to traverse back down in real time (that way you could use a lazy download in the background to archive it in their DC while their app accesses the copy in real time on AWS.)?


The pattern is roughly this:

-- load dataset to S3 from datacenter (in small pieces, in parallel), repeat

- identify current dataset
- boot hadoop cluster
- start job on given dataset
- head of job pulls down parts from S3 in parallel (very natural with Hadoop)
- complete middle of job
- tail of job stuffs results sets into S3 in parallel (again fairly natural with Hadoop)
-- repeat above concurrently as datasets become available (easy to have concurrent Hadoop clusters in EC2).

-- pull data from S3 in parts in parallel 

note 'job' above means a given data processing flow. in terms of Hadoop, the 'job' could be dozens of MapReduce jobs on the cluster.

2) I've been thinking about the problem of upload times as well (in the context of large DNA data sets). The cost of loading into AWS is not that prohibitive so if one where to pre-process that data such that it could be uploaded in a bunch of parrallel processes to AWS you could reduce the bottleneck considerably. In theory.


You will see a boost if you spawn multiple connections from one location to S3. It seems (it was clearly so in the past; I'm unsure as of today) that individual connections were throttled, and up to a point bandwidth from a given IP was throttled. So doing things in parallel, by breaking your big data into small parts, gives you a boost. I can't remember the numbers, else I'd share; it's been a couple of months since that project.

One benefit of using small parts is that a given part will be available before the 'whole' is available. S3 won't show things for download that aren't finished uploading. So this also improves things (especially when coupled with SQS).

By 'parts' I mean: I may have 10 GB of data locally. I will break it into n-MB pieces (compressed) and push them up to S3 (in parallel). Having a manifest (*.parts file) is great when you need to manage the integrity of the individual parts (MD5) and of the whole (parts list all available, MD5 on the parts file). This in part guarantees you aren't processing a job on partial data (because the upload failed and no one noticed).
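A stripped-down sketch of that parts-plus-manifest scheme, for illustration only (the bucket name and part size are made up; boto3 and a thread pool stand in for whatever upload tooling is actually used):

import gzip, hashlib, json
from concurrent.futures import ThreadPoolExecutor
import boto3

s3 = boto3.client("s3")
BUCKET = "example-upload-bucket"   # hypothetical
PART_SIZE = 64 * 1024 * 1024       # 64 MB of raw data per part, before compression

def upload_in_parts(local_path, prefix):
    """Compress and upload fixed-size parts in parallel, then write a manifest
    listing every part and its MD5 so a job can verify completeness first."""
    parts = []
    with open(local_path, "rb") as fh, ThreadPoolExecutor(max_workers=8) as pool:
        futures = []
        for i, chunk in enumerate(iter(lambda: fh.read(PART_SIZE), b"")):
            body = gzip.compress(chunk)
            name = "%s/part-%05d.gz" % (prefix, i)
            parts.append({"key": name, "md5": hashlib.md5(body).hexdigest()})
            futures.append(pool.submit(s3.put_object, Bucket=BUCKET, Key=name, Body=body))
        for fut in futures:
            fut.result()  # surface any failed part upload
    # The manifest goes up last; a job only starts once it sees a complete *.parts file.
    manifest = json.dumps(parts).encode()
    manifest_key = prefix + "/dataset.parts"
    s3.put_object(Bucket=BUCKET, Key=manifest_key, Body=manifest,
                  Metadata={"md5": hashlib.md5(manifest).hexdigest()})
    return manifest_key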

Chaz.

unread,
Jun 19, 2008, 4:18:45 PM6/19/08
to cloud-c...@googlegroups.com
Security is a funny issue. Can you ever use a cloud computing complex and know for certain your data is protected? I'm betting there is no foolproof way that it can be. So the only real way is to fall back to what we know today: maintain physical control of it, for once that is gone you are on your own, baby.

Chaz.

unread,
Jun 19, 2008, 4:30:30 PM6/19/08
to cloud-c...@googlegroups.com
I don't believe it is possible to have data security in the "cloud" without having physical security of the data. After all, whenever I use a cloud computer I hope that no one has hacked it to replace the security modules, or to map memory and look into a running program, etc.

Now even if you build out an autonomic system, we will never have secure cloud computing. No system today is so tight that it can't be hacked. Just look at all the attempts to protect DVDs or BD discs...

Chuck Wegrzyn


Chaz.

unread,
Jun 19, 2008, 4:34:38 PM6/19/08
to cloud-c...@googlegroups.com
And CORBA isn't what I am thinking of, or even Hadoop, but things like JavaSpaces(?).

I'm not sure I would agree that you have to ship your data somewhere else. After all, a "cloud data provider" could create just the right secure environment for holding the data and processing it (isn't that really what S3 is all about?). The only thing the using company needs to do is write the program and have it installed, more or less automagically, on the machines that hold the user's data.

Chuck Wegrzyn

Chris K Wensel

unread,
Jun 19, 2008, 4:39:58 PM6/19/08
to cloud-c...@googlegroups.com

If you are deploying an application in EC2, you must architect it to survive failure, because it will fail in varying degrees. Subsequently, features of AWS allow you to do that, roughly (booting a pre-configured Xen VM, SimpleDB, SQS, S3, etc.).

I suggest you do the same regarding security: just assume it's a hostile environment.

The question is, what features of AWS support you in this? Shared keychains/stores, encrypted volumes, a CA, Kerberos? Or will this always be left to the user? And could you ever really trust those services the same way you trust them not to lose data?

That said, not being a security person: what 'cloud security services' could a provider provide? Or should they even bother?

ckw

http://chris.wensel.net/
http://www.cascading.org/


Ray Nugent

unread,
Jun 19, 2008, 4:51:48 PM6/19/08
to cloud-c...@googlegroups.com
Hey Chuck,

I think the front page of the Wall Street Journal proves that even having physical security of your data does not provide security! :-)

Security is really a business issue. Each layer of security should cost no more than the data is worth. So the concept of "secure enough" becomes important. What security is appropriate for a given type of data, and is it more or less secure in the cloud than in the corporate DC? Is data inherently "less secure" by virtue of being in the cloud than, say, on an employee's laptop or flash dongle or "on the wire"? I don't think corporate data centers are as secure as you're suggesting they are...

Ray


Ray Nugent

unread,
Jun 19, 2008, 4:55:41 PM6/19/08
to cloud-c...@googlegroups.com
Chris, it's the last step I wonder about. If you leave the resultant data on S3 and run whatever app they have that operates against that data on EC2, it seems you could save some time?

Ray


Chris K Wensel

unread,
Jun 19, 2008, 5:47:57 PM6/19/08
to cloud-c...@googlegroups.com
If there were a next processing step, then yes, it would save time. But those jobs represent all the work being done that isn't done by the clients/customers of my client.

Chaz.

unread,
Jun 19, 2008, 6:52:32 PM6/19/08
to cloud-c...@googlegroups.com
Ray,

You are absolutely correct. Once you have a person involved it can be compromised. It is all about risk, and how to make it so small that it would take an act of God (or a really large budget) to breach it!

Chuck


Chaz.

unread,
Jun 19, 2008, 6:50:51 PM6/19/08
to cloud-c...@googlegroups.com
I think you missed the point. Even with shared keychains, the use of X.509, etc., there is no guarantee your data is safe. Once it is in the hands of a third party, you'd better assume it is compromised.

Perhaps the real solution is to carefully architect your solution to provide "bulk" services outside the company and leave the critical things - those that are absolutely vital - inside the company.

Chuck Wegrzyn

Alan Ho

unread,
Jun 20, 2008, 2:00:07 AM6/20/08
to cloud-c...@googlegroups.com
Hi Chris,

I've looked at this issue quite a bit too. There are a few ways that I think the problem can be "relieved":

1. Don't encourage your clients to download the entire data set. As long as you provide URLs to the "crunched data", they should only have to pull the data as needed. You can also index the data using SDB - a nice convenience for searching the data. (See the sketch after this list.)

2. See if the customers can split the dataset into sub-datasets, each reachable via some sort of URL. When you run your Map job, each of the Map nodes will be responsible for downloading the data from your clients - you might get some benefits from the parallelization of the download.

3. Use S3 as more of a backing store - if you don't have many clients consuming the data, or you think that the clients will download the data soon after the MapReduce job is complete, they can download it directly from HDFS (http://hadoop.apache.org/core/docs/r0.17.0/hdfs_design.html#Browser+Interface)
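As an illustration of point 1, here is a small sketch of handing out per-object, time-limited URLs instead of the whole result set (the bucket and key names are hypothetical; boto3 is used only as an example client):

import boto3

s3 = boto3.client("s3")
BUCKET = "example-crunched-data"   # hypothetical results bucket

def result_url(result_key, valid_seconds=3600):
    """Return a time-limited URL for one piece of crunched data, so a client
    pulls only the objects it actually needs rather than the full data set."""
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": BUCKET, "Key": result_key},
        ExpiresIn=valid_seconds,
    )

# e.g. hand the client just the indexed pieces it asked for:
# urls = [result_url(k) for k in ["output/run-42/summary.csv"]]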

I don't know if that helps.

Regards,
Alan Ho




Nati Shalom

unread,
Jun 20, 2008, 3:33:55 AM6/20/08
to cloud-c...@googlegroups.com
A practical way of dealing with data in the cloud, IMO, is to decouple the way we persist the data from the application. What that basically means is that the application loads the data into an in-memory cloud, and that memory cloud keeps the data synchronized with persistent storage asynchronously.

There is an open-source version that does this with Amazon SimpleDB and GigaSpaces as the data grid: http://www.openspaces.org/display/EDS/External+Data+Source+by+Amazon+SimpleDB - It basically means that the data is stored in Amazon S3 as the persistent storage. When the application boots, the data grid loads the data from S3 through the SimpleDB interface into the memory of the cloud resources. The application uses the in-memory data. Updates to the memory are propagated asynchronously back to S3 through the same data grid and SimpleDB interface. You can do pretty much the same thing with MySQL instead of SimpleDB, as I noted in one of my previous posts: http://natishalom.typepad.com/nati_shaloms_blog/2008/03/scaling-out-mys.html

Once the persistent storage is decoupled from our application, we can easily use the same model for keeping our persistent data outside of the cloud, i.e. in our local IT.

The nice thing is that we can be very flexible with our strategy as it relates to where the data will reside, how it will be stored, and at what rate it will be synchronized from the application. We can change it over time to best fit our application scenario and constraints without touching our application code.
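To make the decoupling concrete, here is a minimal write-behind sketch, deliberately independent of any particular data grid or store; the persist hook is a placeholder for SimpleDB, MySQL, or an on-premises repository:

import queue, threading

class WriteBehindCache:
    """Toy in-memory front end: reads and writes hit memory; a background
    thread drains updates to whatever persistent store is plugged in."""

    def __init__(self, persist):
        self._data = {}                  # the "in-memory cloud" the app works against
        self._persist = persist          # callable(key, value) that stores durably
        self._pending = queue.Queue()
        threading.Thread(target=self._drain, daemon=True).start()

    def get(self, key):
        return self._data.get(key)

    def put(self, key, value):
        self._data[key] = value          # synchronous, in memory
        self._pending.put((key, value))  # persisted later, asynchronously

    def _drain(self):
        while True:
            key, value = self._pending.get()
            self._persist(key, value)    # e.g. write to SimpleDB, MySQL, or local IT

# usage: cache = WriteBehindCache(persist=lambda k, v: print("persisted", k))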


" And CORBA isn't what I am thinking of, or even HADOOP but things like
JavaSpaces (?)."

JavaSpaces is indeed more relevant for this type of scenarios.
What's unique about JavaSpaces IMO is that it can be used for handling
both the compute side and the data storage. The references above shows
how you could use space-based storage for handling the data side.
Now that its stored in the in-memory space cluster you can easily use
the same space to route business logic on those in-memory instances in
parallel. There's a nice way to abstract that from the user using a
remoting abstraction - see more details on how that works here:
http://uri-cohen.blogspot.com/2008/02/openspaces-svf-remoting-on-steroid
s.html

Nati S.
GigaSpaces


Oscar Koeroo

unread,
Jun 20, 2008, 4:00:55 AM6/20/08
to cloud-c...@googlegroups.com
Isn't the discussion relative to the level of assurance (in terms of
security here) that is supplied and demanded per use case?

For absolute control, you should stay absolutely in control of the resources, and I don't think cloud computing is something for you.

If you want a secured environment, you should understand that the administrator of the resource can read the memory. If you want to prevent that, then you should look into techniques to hide the memory contents. Ultimately you must get your instructions through a (virtual) CPU, which can also be observed in what it is doing, but that's the only threat left in that solution.

Most cloud computing solutions don't allow for that level of security in the cloud.

The other, more generic, good thing to do IMHO is to encrypt all your data that resides in the cloud, regardless of whether this is somewhere between absolutely needed and a tiny wish. This would also mitigate cases where some transfer protocols are inadvertently left unencrypted.

For some bio-medical use cases in grid computing (more my main field) this approach is also being used. Decryption happens just prior to the actual processing. A more advanced solution is the sliding decrypted window approach, where the dataset is decrypted per section or block. CPU usage goes up, but most of the file/database stays encrypted, and the opportunity to snoop around on the resource is very limited.
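A toy version of that per-block idea, assuming the dataset was encrypted chunk by chunk before it left the owner's site (AES-GCM from the Python cryptography package, the nonce scheme, and the chunk size are all illustrative choices, not a statement of how any particular grid project does it):

from cryptography.hazmat.primitives.ciphers.aead import AESGCM

CHUNK = 1 << 20  # 1 MiB of plaintext per block

def encrypt_blocks(plaintext, key):
    """Encrypt each fixed-size block independently so blocks can later be
    decrypted one at a time (nonce derived from the block index)."""
    aead = AESGCM(key)
    return [aead.encrypt(i.to_bytes(12, "big"), plaintext[off:off + CHUNK], None)
            for i, off in enumerate(range(0, len(plaintext), CHUNK))]

def process_encrypted(blocks, key, handle_block):
    """Sliding-window style processing: only the block currently being worked
    on exists in cleartext; everything else stays encrypted at rest."""
    aead = AESGCM(key)
    for i, ct in enumerate(blocks):
        clear = aead.decrypt(i.to_bytes(12, "big"), ct, None)
        handle_block(clear)   # the actual analysis step
        del clear             # drop the cleartext before moving on

# usage sketch:
# key = AESGCM.generate_key(bit_length=256)
# blocks = encrypt_blocks(b"x" * (3 * CHUNK), key)
# process_encrypted(blocks, key, handle_block=lambda b: None)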

cheers,

Oscar Koeroo


Ray Nugent

unread,
Jun 20, 2008, 10:54:59 AM6/20/08
to cloud-c...@googlegroups.com
Nati, I agree that decoupling will help. However your point here confuses me -

"Once the persistent storage is decoupled from our application we can easily use the same model for keeping our persistent data outside of the cloud i.e. in our local IT."

Keeping persistence that far away is bound to have a pretty significant impact on your performance, isn't it?

Ray


Chris K Wensel

unread,
Jun 20, 2008, 11:15:29 AM6/20/08
to cloud-c...@googlegroups.com
Thanks for the comments, Alan. My previous post should outline how we have parallelized much of the infrastructure to alleviate my client's issues to a reasonable degree. In short, we employed the patterns you suggest, but not the specific technologies, for various reasons. I'd be happy to go into a little more detail offline.

The gist of my comments in this thread is to complain that you currently can't, unfortunately, scale bandwidth into a cloud to match the relative scale of the compute resources. Many hours to upload and relatively few minutes to crunch is an annoying imbalance.

For the analytics-in-the-cloud space, there is an opportunity for a vendor to offer whatever services (many introduced in this thread by others) alleviate the imbalance.

cheers,
ckw





Alan Ho

unread,
Jun 20, 2008, 12:11:41 PM6/20/08
to cloud-c...@googlegroups.com
Yeah. The whole issue with SOA as it is today is that you are expected to move the data to where it is processed. What we really need is the ability to move the processing to where the data is (which is kinda the point of Hadoop).

Cheers,
Alan Ho




Gavan Corr

unread,
Jun 20, 2008, 12:32:42 PM6/20/08
to cloud-c...@googlegroups.com
Hadoop, fantastic idea (it would be great if it worked...)

If you need a production-ready environment in finance, it's a long way off. The distributed caching products - GemFire, Oracle's Coherence and Nati's GigaSpaces - are all miles ahead of Hadoop at this point, some more than others ;-)






Chris K Wensel

unread,
Jun 20, 2008, 1:28:49 PM6/20/08
to cloud-c...@googlegroups.com
Hadoop works fine if you have proper expectations with respect to its architecture. It is not intended for real-time processing. Coherence or GigaSpaces, as you point out, are great for that.

But if you have dozens or more things that need to get done reliably against reasonably large datasets, it will get them done. One user recently commented that he hadn't noticed part of his production cluster was down (hardware failure), since the cluster just kept running his scheduled jobs.

You have to apply the appropriate tools to the problem. I've used Hadoop to process historical stock trade and equity data. It was a perfect fit for the requirements.

ckw

Chris Marino

unread,
Jun 20, 2008, 2:13:01 PM6/20/08
to cloud-c...@googlegroups.com
Somewhat related to point #1 below, there is a new class of BI tool/client, sometimes called a Data Browser, that relies on data being exposed as data services. These clients have local DBs and can perform analytics as well as reports, etc. In the purest form there is no data warehouse, but practically speaking, the data warehouse is exposed via data services, which can be incrementally supplemented by other data sources.
 
Kirix has a DataBrowser (www.kirix.com).
 
This whole BI area is sometimes called BI 2.0.  Good article here:
 
 
One other point, I came across another cloud BI provider the other day: Good Data (www.gooddata.com).
 

__________________________________
Chris Marino
SnapLogic, Inc.

Really Simple Integration
www.snaplogic.com
650-655-7200

Michael Moran

unread,
Jun 20, 2008, 2:26:13 PM6/20/08
to cloud-c...@googlegroups.com
Nati,
 
I am intrigued by the idea of decoupling that you mention below.
 
Question: Is the scenario you described any different from the one you presented at the Spring Experience conference in Miami FL, back in December 2007?
 
Thanks,
Michael

Nati Shalom

unread,
Jun 20, 2008, 2:49:33 PM6/20/08
to cloud-c...@googlegroups.com

" Question: Is the scenario you described any different from the one you presented at the Spring Experience conference in Miami FL, back in December 2007?"

 

The principles are quite the same