1) Is it possible to have the app run on AWS so that the derived data does not need to travel back down in real time? That way you could use a lazy download in the background to archive it in their DC while their app accesses the copy on AWS in real time.
2) I've been thinking about the problem of upload times as well (in the context of large DNA data sets). The cost of loading into AWS is not that prohibitive, so if one were to pre-process the data so that it could be uploaded to AWS by a bunch of parallel processes, you could reduce the bottleneck considerably. In theory.
Ray
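A minimal sketch of the parallel upload idea in point 2, assuming the pre-processing step is simply cutting the payload into fixed-size parts. `upload_part` is a hypothetical callable standing in for whatever storage client is actually used; it is not an AWS API.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, List, Tuple


def split_into_parts(data: bytes, part_size: int) -> Iterator[Tuple[int, bytes]]:
    """Yield (part_number, chunk) pairs that together cover the whole payload."""
    if part_size <= 0:
        raise ValueError("part_size must be positive")
    for number, offset in enumerate(range(0, len(data), part_size)):
        yield number, data[offset:offset + part_size]


def upload_in_parallel(
    data: bytes,
    part_size: int,
    upload_part: Callable[[int, bytes], str],
    workers: int = 8,
) -> List[str]:
    """Upload all parts concurrently and return the receipts in part order.

    upload_part is a hypothetical (part_number, chunk) -> receipt callable;
    in practice it would wrap a real storage client.
    """
    parts = list(split_into_parts(data, part_size))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map keeps results in input order even if parts finish out of order.
        return list(pool.map(lambda part: upload_part(*part), parts))
```

Threads suit this case because each upload spends its time waiting on the network. Whether it helps in practice depends on the uplink being able to carry several streams at once.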
----- Original Message ----
From: Chris K Wensel <ch...@wensel.net>
To: cloud-computing@googlegroups.com
Sent: Thursday, June 19, 2008 9:27:51 AM
Subject: Re: Business Intelligence solution in Cloud Computing
> On Jun 19, 2008, at 8:08 AM, Chris K Wensel wrote:
>> That said, with bandwidth being the bottleneck in the face of the ability to spin up 100 or 1000 nodes to crunch numbers
> how big are the datasets you're working with? Random or linear access?
Total data is in the 100s of GB. Individual workloads are ~10 GB. All linear (this being Hadoop), but there is much joining, binning, and crunching between the multiple input datasets (the actual workload translates to ~60 MapReduce jobs, all rendered and managed by Cascading).
So it kinda sucks to have uploads of data to the cluster take longer than it does to compute on it. Worse since my client then has to fetch the derived data back.
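To put rough numbers on the upload bottleneck, here is a back-of-the-envelope sketch. The ~10 GB workload size comes from the message above; the 100 Mbit/s link speed is purely an assumption for illustration, not a figure from the thread.

```python
def transfer_seconds(size_bytes: float, link_mbps: float, efficiency: float = 1.0) -> float:
    """Seconds needed to move size_bytes over a link of link_mbps megabits/second.

    efficiency (0 < e <= 1) is an assumed fudge factor for protocol overhead.
    """
    if link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_mbps must be > 0 and efficiency in (0, 1]")
    bits = size_bytes * 8
    return bits / (link_mbps * 1_000_000 * efficiency)


# One ~10 GB workload over an assumed 100 Mbit/s uplink.
seconds = transfer_seconds(10e9, 100)
print(f"{seconds / 60:.1f} minutes")
```

At that assumed speed a single workload takes about 13 minutes to upload before any node can start on it, and the derived data has to make the same trip back.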
In my experience, there are cases where having the data and computation as close to the customer edge as possible is required for an acceptable user experience. In other cases, the relationship between user, data, and computation is not important. Most often, there is a mix of both. One of the ideas behind Hadoop, as I understand it, is to bring the computation to the data's location, while also allowing the data to be in several locations. The scheduler is critical to making good use of data locality. So yes, I believe that what you are looking for does exist within Hadoop at a minimum, though I also believe that there is a lot of room to evolve the techniques that it uses.
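The locality idea can be illustrated with a toy scheduler. This is only a sketch of a greedy heuristic under simplified assumptions (one task per data block, a fixed number of free slots per node); it is not Hadoop's actual scheduler, and all names here are made up for the example.

```python
from typing import Dict, List, Set


def assign_tasks(
    block_locations: Dict[str, Set[str]],
    free_slots: Dict[str, int],
) -> Dict[str, str]:
    """Map each data block to a node, preferring nodes that already hold it."""
    slots = dict(free_slots)  # copy so the caller's dict is not mutated
    assignment: Dict[str, str] = {}
    remote: List[str] = []

    # First pass: blocks with the fewest replicas pick first, on a local node.
    ordered = sorted(block_locations.items(), key=lambda kv: (len(kv[1]), kv[0]))
    for block, holders in ordered:
        local = [node for node in sorted(holders) if slots.get(node, 0) > 0]
        if local:
            node = max(local, key=lambda n: slots[n])  # least loaded holder
            assignment[block] = node
            slots[node] -= 1
        else:
            remote.append(block)

    # Second pass: leftovers run on any node with capacity (data must be moved).
    for block in remote:
        candidates = [node for node in sorted(slots) if slots[node] > 0]
        if not candidates:
            break  # cluster is full; remaining blocks stay unscheduled
        assignment[block] = candidates[0]
        slots[candidates[0]] -= 1
    return assignment
```

Only the blocks that fall through to the second pass cost any network transfer, which is why keeping the data in several locations gives the scheduler more room to work with.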
Chaz. wrote:
> That is one approach - again it seems to indicate the model is the data moving to the compute resources. The other approach is to look at it from the data perspective - can the data sit some place and the compute come to it?
> Chuck Wegrzyn
> On SaaS wrote:
>> That depends on how the cloud is architected, no?
>> And I would think the cloud providers will have to start answering these questions if they want large enterprises to start adopting the cloud. There may be no control of which server in the cloud is doing the computation, but service providers may provide options to restrict based on geographic domains.
>> We have quite a few people here from the cloud providers, maybe they can share some insight?
>> thx
>> On Jun 19, 2008, at 10:44 AM, Stuart Altenhaus wrote:
>>> I think Chaz is right. There are privacy issues regarding use and exposure of data that vary country by country. If the cloud computes the data, there is no control on where that data is moved for computation, right?
>>> Date: Thu, 19 Jun 2008 13:40:20
>>> To: cloud-computing@googlegroups.com
>>> Subject: Re: Issues of data in the cloud...
>>> While I think trans-national data movement will be an area that requires governance of some kind I think that companies can get around the problem in other ways. I think it just requires looking at the problem in a different way.
>>> I'd think the approach is to keep the data still and move the computing to it. The idea is to see the thousands of machines it takes to hold the petabytes worth of data as the compute cloud. What needs to move to it is the programs that can process the data. I've been working on this approach for the last 3 years (Twisted Storage).
>>> Chuck Wegrzyn
>>> Pittard, Rick wrote:
>>>> One big concern is compliance with the data privacy laws in the EU and other countries, which require protection of personal data and that it not be transmitted to locations that have less protection. Since the laws in the US are generally less protective than those in the EU, additional controls/agreements need to be in place to legally move the data from the EU to the US.
>>>> Rick
>>>> -----Original Message-----
>>>> From: cloud-computing@googlegroups.com [mailto:cloud-computing@googlegroups.com] On Behalf Of Chaz.
>>>> Sent: Thursday, June 19, 2008 11:58 AM
>>>> To: cloud-computing@googlegroups.com
>>>> Subject: Issues of data in the cloud...
>>>> While data access and recovery is a very important aspect of cloud computing, I'm curious as to the legal issues surrounding the movement of data across national boundaries or even across company boundaries.
>>>> How does the "cloud" protect data going from the owner to the computing service without being compromised (read that as sniffed)? Will a computing service in country A have the right to impose restrictions on data from another country (even if the results of the computing don't affect the citizens of country A)? And so on.
>>>> Chuck Wegrzyn
>>>> Utpal Datta wrote:
>>>>> You make all the right points on speed, bandwidth, Amazon charging on bandwidth etc. But consider the need for the user (say a large financial company with a sensitive business critical application),
>>>>> 1. who will guarantee that the data in S3 is secure from physical and logical access
>>>>> 2. who will guarantee that the data is always available using a multi-site recovery system (that is what they would have in their own data center) that meets their RPO (Recovery Point Objective) and RTO (Recovery Time Objective) guidelines.
>>>>> Either Amazon or other Cloud providers will make these available with EC2 with SP3 (or some other storage mechanism with more robust security and availability characteristics) or the users will have to build something similar on their own using EC2 as their basic building block.
>>>>> This will be a *very* non-trivial task for any user to do on their own and they will have to make the decision to put resources to build this on a cloud or to invest more on their own datacenter.
>>>>> So I guess a lot will depend on the level of maturity of the clouds. Not sure if all this work belongs in a mid-layer outside of the original cloud, leaving the cloud providers just to provide the basic building blocks.
>>>>> --utpal
>>>>> On Thu, Jun 19, 2008 at 11:08 AM, Chris K Wensel <ch...@wensel.net> wrote:
>>>>>> On Jun 18, 2008, at 8:15 AM, Utpal Datta wrote:
>>>>>>> 1. I think "data in the cloud" is so far a big block to widespread adoption and to using the cloud for large, sensitive and mission critical applications (especially for financial organizations). Is someone thinking of a way to leave the data within the user premises and do just the computing in the cloud? Kind of a reverse connection back to the user datacenter.
>>>>>>> That way the conventional data repositories can still be used. The users will not have to worry about the reliability, availability and (to a large part) security of the data. We still have to worry about the security of the data travelling back and forth between the cloud and the user data center.
>>>>>>> This probably is more relevant for medium to large scale users with "sensitive" data.
>>>>>>> Comments? tips?
>>>>>> I've been processing large historical data sets for a Financial company I'm consulting with using Cascading/Hadoop on EC2/S3.
>>>>>> The biggest bottleneck has been getting data to the compute infrastructure.
>>>>>> The obvious pattern is to have datacenter processes push data to S3, then have the temporary cluster spin up and pull data from S3, do something interesting, then push the results to S3, notify the datacenter the job is complete (SQS), have the datacenter pull down the results from S3.
>>>>>> Because of the need to support both well defined daily processes and ad-hoc processes, my client's data generally needs to stay on S3. Having it pulled from a remote datacenter on duplicate runs would be extraordinarily slow and expensive considering Amazon charges for bandwidth in and out. Plus, it is a bit cheaper just to keep data on S3 than to buy a NAS for storage.
>>>>>> That said, with bandwidth being the bottleneck in the face of the ability to spin up 100 or 1000 nodes to crunch numbers, larger pipes into a vendor's Cloud would be very welcome. Otherwise your Cloud solution is only as fast as getting data in and out of it.
>> --
>> OnSaaS.net - Blogging about the SaaS and cloud computing world
>> OnSaaS.info - Providing a continuous stream of SaaS and cloud computing news
>> Follow on http://twitter.com/onsaas, http://friendfeed.com/onsaas
Even if the cloud providers come up with excellent answers to the security and reliability questions, who's going to trust them? Credit card numbers are one thing, but cloud data is something else entirely.
I definitely agree with your point. I can't think of very many multi-nationals that would let their data out to wander around. I'd think they would want to protect their data and move the computing resources close to it....
If the programs are moved to the data, then what is the distinction between cloud computing and CORBA? Seems like the same basic tenets would have to be in place.
(I'm new to the concept of cloud computing, but do see the opportunities for advancing a network of computers that renders geo location trivial. Surely if existing network clouds were enhanced so that computing power was placed at each node, a net-centric approach would be achieved... The telcos do that today, right?)
Date: Thu, 19 Jun 2008 14:00:55
To: cloud-computing@googlegroups.com
Subject: Re: Issues of data in the cloud...
I know from my work that many firms are reluctant to let their data "out the door" since they see that as their edge in the market. But even that aside for a minute, it seems to make more sense to move "small" programs (relative to the size of the data) than to move massive amounts of data.
So my question is as follows: what makes a good "storage cloud"?
Chuck Wegrzyn
Khazret Sapenov wrote:
> On Thu, Jun 19, 2008 at 1:40 PM, Chaz. <eprparad...@gmail.com> wrote:
> [snip]
> I'd think the approach is to keep the data still and move the computing
> to it. The idea is to see the thousands of machines it takes to hold the
> petabytes worth of data as the compute cloud. What needs to move to it
> is the programs that can process the data. I've been working on this
> approach for the last 3 years (Twisted Storage).
> Chuck Wegrzyn
> This is a valid approach, which I personally call the "Plumber Pattern": an application, encapsulated in some kind of container (e.g. a virtual machine image), is marshalled to secure data islands to iteratively do its unique work (say, matching on some criterion in Interpol, FBI, CIA, MI5 and other databases, all distributed across continents). Due to the utterly confidential nature of these types of data, it is impossible to move them to public storage (at least at this time). The above-mentioned case might be extrapolated to some lines of business with reduced privacy/security requirements as well.
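A toy sketch of the "Plumber Pattern" as described above, with assumptions stated loudly: a plain Python function stands in for the containerized application, and `DataIsland`, `visit_islands`, and the record layout are hypothetical names invented for this example. The point it tries to show is that the program travels and only small results come back.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]
Program = Callable[[Iterable[Record]], List[str]]


@dataclass
class DataIsland:
    """A site whose records never leave; only program results do."""
    name: str
    _records: List[Record] = field(default_factory=list, repr=False)

    def run(self, program: Program) -> List[str]:
        # The program executes next to the data and returns a small result.
        return program(self._records)


def visit_islands(program: Program, islands: Iterable[DataIsland]) -> Dict[str, List[str]]:
    """Carry one program to each island in turn and collect its results."""
    return {island.name: island.run(program) for island in islands}


def match_surname(records: Iterable[Record]) -> List[str]:
    """Example program: return only the ids of matching records."""
    return [r["id"] for r in records if r.get("surname") == "Doe"]
```

A real deployment would also need each island to sandbox and audit the visiting program, which this sketch leaves out entirely.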
Just joined cloud-computing and this is the first conversation I've received.
A couple of weeks ago I attended Gartner Security where Neil MacDonald spoke on "Adaptive Security." In a nutshell, this approach builds a resilient system for secure data, acting much like the human immune system. It involves whitelisting as the foundation, blacklisting as a mid-tier and learned/adaptive mechanisms at the top. In such an environment, elements would be "autonomic" and self-managing to a large degree, and would share and communicate with other elements to protect workloads and information (as opposed to endpoints). There is a lot more to this vision, and it is probably a number of years away, but it may be a reasonable approach to address the concerns about data security being discussed here.
In any case, does anyone know of any product or standards efforts for the industry to collaborate on a more cohesive architecture for security in the cloud?
> I definitely agree with your point. I can't think of very many multinationals that would let their data out to wander around. I'd think they would want to protect their data and move the computing resources close to it....
> Chuck
> Jim Peters wrote: > > Even if the cloud providers come up with excellent answers to the > > security and reliability questions, who's going to trust them? Credit > > card numbers are one thing, but cloud data is something else entirely.
> > +J
> > On Thu, Jun 19, 2008 at 10:57 AM, On SaaS <ons...@gmail.com > > <mailto:ons...@gmail.com>> wrote:
> > That depends on how the cloud is architected, no?
> > And I would think the cloud providers will have to start answering > > these questions if they want large enterprises to start adopting the > > cloud. There maybe no control of which server in the cloud is doing > > the computation, but service providers may provide options to > > restrict based on geographic domains.
> > We have quite a few people here from the cloud providers, maybe they > > can share some insight?
> > thx
> > On Jun 19, 2008, at 10:44 AM, Stuart Altenhaus wrote:
> >> I think Chaz is right. There are privacy issues regarding use and > >> exposure of data that vary country by country. If the cloud > >> computes the data, there is no control on where that data is moved > >> for computation, right?
> >> Date: Thu, 19 Jun 2008 13:40:20 > >> To:cloud-computing@googlegroups.com<To%3Acloud-computing@googlegroups.com> > >> <mailto:cloud-computing@googlegroups.com> > >> Subject: Re: Issues of data in the cloud...
> >> While I think trans-national data movement will be an area that > >> requires > >> governance of some kind I think that companies can get around the > >> problem in other ways. I think it just requires looking at the > >> problem > >> in a different way.
> >> I'd think the approach is to keep the data still and move the > >> computing > >> to it. The idea is to see the thousands of machines it takes to > >> hold the > >> petabytes worth of data as the compute cloud. What needs to move > >> to it > >> is the programs that can process the data. I've been working on this > >> approach for the last 3 years (Twisted Storage).
> >> Chuck Wegrzyn
> >>> Pittard, Rick wrote:
> >>> One big concern is compliance with the data privacy laws in the EU and other countries, which require protection of personal data and that it not be transmitted to locations that have fewer protections. Since the laws in the US are generally less protective than those in the EU, additional controls/agreements need to be in place to legally move the data from the EU to the US.
> >>> Rick
> >>> -----Original Message----- > >>> From: cloud-computing@googlegroups.com > >>> <mailto:cloud-computing@googlegroups.com> > >>> [mailto:cloud-computing@googlegroups.com] On Behalf Of Chaz. > >>> Sent: Thursday, June 19, 2008 11:58 AM > >>> To: cloud-computing@googlegroups.com > >>> <mailto:cloud-computing@googlegroups.com> > >>> Subject: Issues of data in the cloud...
> >>> While data access and recovery is a very important aspect of cloud > >>> computing, I'm curious as to the legal issues surrounding the > >>> movement > >>> of data across national boundaries or even across company > boundaries.
> >>> How does the "cloud" protect data going from the owner to the computing service without being compromised (read that as sniffed)? Will a computing service in country A have the right to impose restrictions on data from another country (even if the results of the computing don't affect the citizens of country A)? And so on.
> >>> Chuck Wegrzyn
> >>> Utpal Datta wrote: > >>>> You make all the right points on speed, bandwidth, Amazon > >>>> charging on > >>>> bandwidth etc. But consider the need for the user (say a large > >>>> financial company with a sensitive business critical application),
> >>>> 1. who will guarantee that the data in S3 is secure from > >>>> physical and > >>>> logical access
> >>>> 2. who will guarantee that the data is always available using a > >>>> multi-site recovery system (that is what they would have in > >>>> their own > >>>> data center) that meets their RPO (Recovery Point Objective) and > RTO > >>>> (Recovery Time Objective) guidelines.
> >>>> Either Amazon or other Cloud providers will make these available > >>>> with > >>>> EC2 with SP3 (or some other storage mechanism with more robust > >>>> security and availability characteristics) or the users will have > to > >>>> build something similar on their own using EC2 as their basic > >>>> building > >>>> block.
> >>>> This will be a *very* non-trivial task for any user to do on > >>>> their own > >>>> and they will have to make the decision to put resources to > >>>> build this > >>>> on a cloud or to invest more on their own datacenter.
> >>>> So I guess a lot will depend on the level of maturity of the > clouds. > >>>> Not sure if all this work belong in a mid-layer outside of the > >>>> original cloud and leave the cloud providers just to provide the > >>>> basic > >>>> building blocks
> >>>> --utpal
> >>>> On Thu, Jun 19, 2008 at 11:08 AM, Chris K Wensel > >>>> <ch...@wensel.net <mailto:ch...@wensel.net>> > >>> wrote: > >>>>> On Jun 18, 2008, at 8:15 AM, Utpal Datta wrote:
> >>>>>> 1. I think "data in the cloud" is so far a big block to widespread adoption and to using the cloud for large, sensitive and mission-critical applications (especially for financial organizations). Is someone thinking of a way to leave the data within the user premises and do just the computing in the cloud? Kind of a reverse connection back to the user datacenter.
> >>>>>> That way the conventional data repositories can still be used. The users will not have to worry about the reliability, availability and (to a large part) security of the data. We still have to worry about the security of the data travelling back and forth between the cloud and the user data center.
> >>>>>> This probably is more relevant for medium to large scale users > >>>>>> with > >>>>>> "sensitive" data.
> >>>>>> Comments? tips? > >>>>> I've been processing large historical data sets for a Financial > >>>>> company I'm consulting with using Cascading/Hadoop on EC2/S3.
> >>>>> The biggest bottleneck has been getting data to the compute > >>>>> infrastructure.
> >>>>> The obvious pattern is to have datacenter processes push data > >>>>> to S3, > >>>>> then have the temporary cluster spin up and pull data from S3, do > >>>>> something interesting, then push the results to S3, notify the > >>>>> datacenter the job is complete (SQS), have the datacenter pull > down > >>>>> the results from S3.
> >>>>> Because of the need to support both well defined daily > >>>>> processes and > >>>>> ad-hoc processes, my clients data generally needs to stay on S3. > >>>>> Having it pulled from a remote datacenter on duplicate runs > >>>>> would be > >>>>> extraordinarily slow and expensive considering Amazon charges for > >>>>> bandwidth in and out. Plus, it is a bit cheaper just to keep > >>>>> data on > >>>>> S3 than to buy a NAS for storage.
> >>>>> That said, with bandwidth being the bottleneck in the face of the > >>>>> ability to spin up 100 or 1000 nodes to crunch numbers, larger > >>>>> pipes > >>>>> into a vendors Cloud would be very welcome. Otherwise your Cloud > >>>>> solution is only as fast as getting data in and out of it.
CORBA isn't about mobility, it's just typesafe OO RPC. There was work done by ObjectSpace and GeneralMagic in the '90s on agent-based computing (move code to the data), but that movement died off.
if the Cloud is a collection of compute resources, and you need to apply them to lots of your data, you have little choice but to move your data. you can't move the compute power. (unless you order a shipping container of servers I guess)
ckw
On Jun 19, 2008, at 11:39 AM, Stuart Altenhaus wrote:
> If the programs are moved to the data, then what is the distinction > between cloud computing and CORBA? Seems like the same basic tenets > would have to be in place.
> (I'm new to the concept of cloud computing, but do see the > opportunities for advancing a network of computers that renders geo > location trivial. Surely enhancing existing network clouds such that > the computing power were placed at each node, a net-centric approach > is achieved... The telcos do that today, right?)
> Date: Thu, 19 Jun 2008 14:00:55 > To:cloud-computing@googlegroups.com > Subject: Re: Issues of data in the cloud...
> I know from my work that many firms are reluctant to let their data "out the door" since they see that as their edge in the market. But even that aside for a minute, it seems to make more sense to move "small" programs (relative to the size of the data) than to move massive amounts of data.
> So my question is as follows: what makes a good "storage cloud"?
> Chuck Wegrzyn
> Khazret Sapenov wrote:
>> On Thu, Jun 19, 2008 at 1:40 PM, Chaz. <eprparad...@gmail.com >> <mailto:eprparad...@gmail.com>> wrote:
>> [snip] >> I'd think the approach is to keep the data still and move the >> computing >> to it. The idea is to see the thousands of machines it takes to >> hold the >> petabytes worth of data as the compute cloud. What needs to >> move to it >> is the programs that can process the data. I've been working on >> this >> approach for the last 3 years (Twisted Storage).
>> Chuck Wegrzyn
>> This is a valid approach, which I personally call the "Plumber Pattern": an application, encapsulated in some kind of container (e.g. a virtual machine image), is marshalled to secure data islands to iteratively do its unique work (say, matching on some criterion in Interpol, FBI, CIA, MI5 and other databases, all distributed across continents). Due to the utterly confidential nature of these types of data, it is impossible to move them to public storage (at least at this time). The above-mentioned case might be extrapolated to some lines of business with reduced privacy/security requirements as well.
> 1) is it possible to have the app run on AWS so that the derived > data does not need to traverse back down in real time (that way you > could use a lazy download in the background to archive it in their > DC while their app accesses the copy in real time on AWS.)?
The pattern is roughly this:
-- load dataset to S3 from datacenter (in small pieces, in parallel), repeat
- identify current dataset
- boot hadoop cluster
- start job on given dataset
- head of job pulls down parts from S3 in parallel (very natural with Hadoop)
- complete middle of job
- tail of job stuffs result sets into S3 in parallel (again fairly natural with Hadoop)
-- repeat above concurrently as datasets become available (easy to have concurrent Hadoop clusters in EC2).
-- pull data from S3 in parts in parallel
note 'job' above means a given data processing flow. in terms of Hadoop, the 'job' could be dozens of MapReduce jobs on the cluster.
> 2) I've been thinking about the problem of upload times as well (in the context of large DNA data sets). The cost of loading into AWS is not that prohibitive, so if one were to pre-process that data such that it could be uploaded in a bunch of parallel processes to AWS, you could reduce the bottleneck considerably. In theory.
you will see a boost if you spawn multiple connections from one location to S3. it seems (this was clearly the case in the past, unsure as of today) that individual connections were throttled, and that up to a point bandwidth from a given IP was throttled. so doing things in parallel by breaking your big data into small parts gives you a boost. I can't remember the numbers, else I'd share. it's been a couple of months since that project.
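A minimal sketch of the "many connections at once" idea, in Python rather than anything used on the original project. `upload_all`, the `upload_part` callable and the `max_workers` default are all hypothetical names and values; `upload_part` stands in for whatever S3 client call you actually use, and the right number of concurrent connections would have to be found by measurement.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def upload_all(parts, upload_part, max_workers=8):
    """Upload every part concurrently; return (uploaded, failed) name lists.

    `parts` maps part name -> bytes. `upload_part(name, data)` is a
    hypothetical callable supplied by the caller (for example a thin
    wrapper around an S3 client); it is expected to raise on failure.
    """
    uploaded, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # One task per part, so several connections are open at once.
        futures = {pool.submit(upload_part, name, data): name
                   for name, data in parts.items()}
        for future in as_completed(futures):
            name = futures[future]
            try:
                future.result()
                uploaded.append(name)
            except Exception:
                # Record the failure instead of aborting the other uploads.
                failed.append(name)
    return sorted(uploaded), sorted(failed)
```

Returning the failed names rather than raising lets the caller retry just those parts.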
one benefit of using small parts is that a given part will be available before the 'whole' is available. S3 won't show things for download that aren't finished uploading. So this also improves things (especially when coupled with SQS).
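A rough sketch of how a consumer could start on parts as they land, assuming one "part uploaded" message per part. Here a standard-library `queue.Queue` stands in for SQS, and `consume_parts`, `handle` and the timeout are hypothetical. Repeated names are ignored, since a queue such as SQS may deliver a message more than once.

```python
import queue


def consume_parts(expected, notifications, handle, timeout=1.0):
    """Call handle(name) once for each expected part as its message arrives.

    `notifications` is any object with a Queue-style get(timeout=...);
    in this sketch it stands in for an SQS queue. Returns the set of
    part names that never arrived (empty when everything showed up).
    """
    missing = set(expected)
    while missing:
        try:
            name = notifications.get(timeout=timeout)
        except queue.Empty:
            break  # nothing more arrived in time; report what is missing
        if name in missing:  # skip duplicates and unknown names
            missing.discard(name)
            handle(name)
    return missing
```

A non-empty return value is the signal not to treat the dataset as complete.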
by 'parts' I mean, I may have locally 10G of data. I will break it into n MB pieces (compressed) and push them up to S3 (in parallel). having a manifest (*.parts file) is great when you need to manage the integrity of individual parts (MD5) and the whole (parts list all available, MD5 on parts file). This in part guarantees you aren't processing a job on partial data (because the upload failed and no one noticed).
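To make the manifest idea concrete, here is a small sketch under stated assumptions: the part naming scheme, the "md5, two spaces, name" line layout and the `fetch` callable are all invented for illustration and are not the actual *.parts format described above. Compression is left out to keep it short.

```python
import hashlib


def split_into_parts(data, part_size):
    """Yield (part_name, chunk) pairs of at most part_size bytes each."""
    for index, offset in enumerate(range(0, len(data), part_size)):
        yield "part-%05d" % index, data[offset:offset + part_size]


def build_manifest(parts):
    """Return manifest text: one 'md5  name' line per part, in order."""
    lines = ["%s  %s" % (hashlib.md5(chunk).hexdigest(), name)
             for name, chunk in parts]
    return "\n".join(lines) + "\n"


def verify(manifest_text, fetch):
    """Return names of parts that are missing or fail their MD5 check.

    `fetch(name)` is a hypothetical callable returning the part's bytes,
    or None if the part is not available yet.
    """
    bad = []
    for line in manifest_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines, e.g. an empty manifest
        digest, name = line.split("  ", 1)
        chunk = fetch(name)
        if chunk is None or hashlib.md5(chunk).hexdigest() != digest:
            bad.append(name)
    return bad
```

The "MD5 on parts file" check would then be an MD5 of the manifest text itself, published alongside it; a job would only start once `verify` returns an empty list.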
Security is a funny issue. Can you ever use a cloud computing complex and know for certain your data is protected? I'm betting there is no foolproof way that it can be. So the only real way is to fall back to what we know today: maintain physical control of it, for once that is gone you are on your own, baby.
Utpal Datta wrote:
> Maybe this is a redundant question: where is this protected data residing? In the cloud or in the user's data center?
> If it is in the cloud, then we are still dealing with security, availability and recoverability issues (that everyone agrees on).
> If it is in the user's data center, then how will the computing resources offered (and controlled by Amazon) be brought to that specific user's datacenter?
> --utpal
> On Thu, Jun 19, 2008 at 3:10 PM, Chaz. <eprparad...@gmail.com> wrote: >> Jim,
>> I definitely agree with your point. I can't think of very many multinationals that would let their data out to wander around. I'd think they would want to protect their data and move the computing resources close to it....
>> Chuck
I don't believe it is possible to have data security in the "cloud" without having physical security of the data. After all whenever I use a cloud computer I hope that no one has hacked it to replace the security modules, or to map memory and look into a running program, etc.
Now if you have to build out an autonomic system we will never have secure cloud computing. No system today is so tight that it can't be hacked. Just look at all the attempts to protect DVDs or BD disks...
Lynne VanArsdale wrote: > Just joined cloud-computing and this is the first conversation I've > received.
> A couple of weeks ago I attended Gartner Security where Neil MacDonald > spoke on "Adaptive Security." In a nutshell, this approach builds a > resilient system for secure data, acting much like the human immune > system. It involves whitelisting as the foundation, blacklisting as a > mid-tier and learned/adaptive mechanisms at the top. In such an > environment, elements would be "autonomic" and self-managing to a large > degree, and would share and communicate with other elements to protect > workloads and information (as opposed to endpoints). There is a lot > more to this vision, and it is probably a number of years away, but it > may be a reasonable approach to address the concerns about data security > being discussed here.
> In any case, does anyone know of any product or standards efforts for > the industry to collaborate on a more cohesive architecture for security > in the cloud?
And CORBA isn't what I am thinking of, or even Hadoop, but things like JavaSpaces (?).
I'm not sure I would agree you have to ship your data to somewhere else. After all a "cloud data provider" could create just the secure environment for holding the data and processing it (isn't that really what S3 is all about?). The only thing the using company needs to do is write the program and have it installed, more or less automagically, on the machines that hold the user's data.
Chris K Wensel wrote:
> CORBA isn't about mobility, it's just typesafe OO RPC. There was work done by ObjectSpace and GeneralMagic in the '90s on agent-based computing (move code to the data), but that movement died off.
> if the Cloud is a collection of compute resources, and you need to > apply them to lots of your data, you have little choice but to move > your data. you can't move the compute power. (unless you order a > shipping container of servers I guess)
> ckw
> On Jun 19, 2008, at 11:39 AM, Stuart Altenhaus wrote:
>> If the programs are moved to the data, then what is the distinction >> between cloud computing and CORBA? Seems like the same basic tenets >> would have to be in place.
>> (I'm new to the concept of cloud computing, but do see the >> opportunities for advancing a network of computers that renders geo >> location trivial. Surely enhancing existing network clouds such that >> the computing power were placed at each node, a net-centric approach >> is achieved... The telcos do that today, right?)
>> Date: Thu, 19 Jun 2008 14:00:55 >> To:cloud-computing@googlegroups.com >> Subject: Re: Issues of data in the cloud...
>> I know from my work that many firms are reluctant to let their data "out the door" since they see that as their edge in the market. But even that aside for a minute, it seems to make more sense to move "small" programs (relative to the size of the data) than to move massive amounts of data.
>> So my question is as follows: what makes a good "storage cloud"?
>> Chuck Wegrzyn
>> Khazret Sapenov wrote:
>>> On Thu, Jun 19, 2008 at 1:40 PM, Chaz. <eprparad...@gmail.com >>> <mailto:eprparad...@gmail.com>> wrote:
>>> [snip] >>> I'd think the approach is to keep the data still and move the >>> computing >>> to it. The idea is to see the thousands of machines it takes to >>> hold the >>> petabytes worth of data as the compute cloud. What needs to >>> move to it >>> is the programs that can process the data. I've been working on >>> this >>> approach for the last 3 years (Twisted Storage).
>>> Chuck Wegrzyn
>>> This is a valid approach, which I personally call the "Plumber Pattern": an application, encapsulated in some kind of container (e.g. a virtual machine image), is marshalled to secure data islands to iteratively do its unique work (say, matching on some criterion in Interpol, FBI, CIA, MI5 and other databases, all distributed across continents). Due to the utterly confidential nature of these types of data, it is impossible to move them to public storage (at least at this time). The above-mentioned case might be extrapolated to some lines of business as well, with reduced privacy/security requirements.
If you are deploying an application in EC2, you must architect it to survive failure, because it will fail in varying degrees. Consequently, features of AWS allow you to do that, roughly (booting a pre-configured Xen VM, SimpleDB, SQS, S3, etc.).
I suggest you do the same regarding security, just assume it's a hostile environment.
The question is, what features of AWS support you in this? Shared keychains/stores, encrypted volumes, CA, Kerberos, ...? Or will this always be left to the user? And could you ever really trust those services the same way you trust them to not lose data?
That said, I'm not a security person. What 'cloud security services' could a provider provide? Or should they even bother?
> Security is a funny issue. Can you ever use a cloud computing complex > and know for certain your data is protected? I'm betting there is no > fool proof way that it can be. So the only real way is to fall back to > what we know today: maintain physical control of it for once that is > gone you are on your own baby.
> Chuck Wegrzyn
> Utpal Datta wrote:
>> Maybe this is a redundant question: where is this protected data residing? In the cloud or in the user's data center?
>> If it is in the cloud then we are still dealing with Security, Availability and Recoverability issues (that everyone agrees on).
>> If it is in the user's data center then how will the computing resources offered (and controlled by Amazon) be brought to that specific user's datacenter?
>> --utpal
>> On Thu, Jun 19, 2008 at 3:10 PM, Chaz. <eprparad...@gmail.com> wrote: >>> Jim,
>>> I definitely agree with your point. I can't think of very many multinationals that would let their data out to wander around. I'd think they would want to protect their data and move the computing resources close to it....
>>> Chuck
>>> Jim Peters wrote: >>>> Even if the cloud providers come up with excellent answers to the >>>> security and reliability questions, who's going to trust them? >>>> Credit >>>> card numbers are one thing, but cloud data is something else >>>> entirely.
>>>> +J
>>>> On Thu, Jun 19, 2008 at 10:57 AM, On SaaS <ons...@gmail.com >>>> <mailto:ons...@gmail.com>> wrote:
>>>> That depends on how the cloud is architected, no?
>>>> And I would think the cloud providers will have to start answering these questions if they want large enterprises to start adopting the cloud. There may be no control of which server in the cloud is doing the computation, but service providers may provide options to restrict based on geographic domains.
>>>> We have quite a few people here from the cloud providers, >>>> maybe they >>>> can share some insight?
>>>> thx
>>>> On Jun 19, 2008, at 10:44 AM, Stuart Altenhaus wrote:
>>>>> I think Chaz is right. There are privacy issues regarding use >>>>> and >>>>> exposure of data that vary country by country. If the cloud >>>>> computes the data, there is no control on where that data is >>>>> moved >>>>> for computation, right?
>>>>> Date: Thu, 19 Jun 2008 13:40:20 >>>>> To:cloud-computing@googlegroups.com >>>>> <mailto:cloud-computing@googlegroups.com> >>>>> Subject: Re: Issues of data in the cloud...
>>>>> While I think trans-national data movement will be an area that >>>>> requires >>>>> governance of some kind I think that companies can get around >>>>> the >>>>> problem in other ways. I think it just requires looking at the >>>>> problem >>>>> in a different way.
>>>>> I'd think the approach is to keep the data still and move the >>>>> computing >>>>> to it. The idea is to see the thousands of machines it takes to >>>>> hold the >>>>> petabytes worth of data as the compute cloud. What needs to >>>>> move >>>>> to it >>>>> is the programs that can process the data. I've been working >>>>> on this >>>>> approach for the last 3 years (Twisted Storage).
>>>>> Chuck Wegrzyn
>>>>> Pittard, Rick wrote:
>>>>>> One big concern is compliance with the data privacy laws in the EU and other countries, which require protection of personal data and that it not be transmitted to locations that have fewer protections. Since the laws in the US are generally less protective than those in the EU, additional controls/agreements need to be in place to legally move the data from the EU to the US.
>>>>>> Rick
>>>>>> -----Original Message----- >>>>>> From: cloud-computing@googlegroups.com >>>>>> <mailto:cloud-computing@googlegroups.com> >>>>>> [mailto:cloud-computing@googlegroups.com] On Behalf Of Chaz. >>>>>> Sent: Thursday, June 19, 2008 11:58 AM >>>>>> To: cloud-computing@googlegroups.com >>>>>> <mailto:cloud-computing@googlegroups.com> >>>>>> Subject: Issues of data in the cloud...
>>>>>> While data access and recovery is a very important aspect of >>>>>> cloud >>>>>> computing, I'm curious as to the legal issues surrounding the >>>>>> movement >>>>>> of data across national boundaries or even across company >>>>>> boundaries.
>>>>>> How does the "cloud" protect data going from the owner to the computing service without being compromised (read that as sniffed)? Will a computing service in country A have the right to impose restrictions on data from another country (even if the results of the computing don't affect the citizens of country A)? And so on.
>>>>>> Chuck Wegrzyn
>>>>>> Utpal Datta wrote: >>>>>>> You make all the right points on speed, bandwidth, Amazon >>>>>>> charging on >>>>>>> bandwidth etc. But consider the need for the user (say a >>>>>>> large >>>>>>> financial company with a sensitive business critical >>>>>>> application),
>>>>>>> 1. who will guarantee that the data in S3 is secure from >>>>>>> physical and >>>>>>> logical access
>>>>>>> 2. who will guarantee that the data is always available >>>>>>> using a >>>>>>> multi-site recovery system (that is what they would have in >>>>>>> their own >>>>>>> data center) that meets their RPO (Recovery Point >>>>>>> Objective) and RTO >>>>>>> (Recovery Time Objective) guidelines.
>>>>>>> Either Amazon or other Cloud providers will make these >>>>>>> available >>>>>>> with >>>>>>> EC2 with SP3 (or some other storage mechanism with more >>>>>>> robust >>>>>>> security and availability characteristics) or the users >>>>>>> will have to >>>>>>> build something similar on their own using EC2 as their basic >>>>>>> building >>>>>>> block.
>>>>>>> This will be a *very* non-trivial task for any user to do on >>>>>>> their own >>>>>>> and they will have to make the decision to put resources to >>>>>>> build this >>>>>>> on a cloud or to invest more on their own datacenter.
>>>>>>> So I guess a lot will depend on the level of maturity of >>>>>>> the clouds. >>>>>>> Not sure if all this work belong in a mid-layer outside of >>>>>>> the >>>>>>> original cloud and leave the cloud providers just to >>>>>>> provide the >>>>>>> basic >>>>>>> building blocks
>>>>>>> --utpal
>>>>>>> On Thu, Jun 19, 2008 at 11:08 AM, Chris K Wensel >>>>>>> <ch...@wensel.net <mailto:ch...@wensel.net>> >>>>>> wrote: >>>>>>>> On Jun 18, 2008, at 8:15 AM, Utpal Datta wrote:
>>>>>>>>> 1. I think "data in the cloud" is so far a big block to widespread adoption and using cloud for large, sensitive and mission critical applications (especially for Financial organization). Is someone thinking of a way to leave the data within the user-premises and do just the computing in the cloud? Kind of a reverse connection back to the user datacenter.
>>>>>>>>> That way the conventional data repositories can still be used. The users will not have to worry about the reliability, availability and (to a large part) security of the data. We still have to worry about the security of the data travelling back and forth to and from the cloud to the user data center.
>>>>>>>>> This probably is more relevant for medium to large scale >>>>>>>>> users >>>>>>>>> with >>>>>>>>> "sensitive" data.
>>>>>>>>> Comments? tips? >>>>>>>> I've been processing large historical data sets for a >>>>>>>> Financial >>>>>>>> company I'm consulting with using Cascading/Hadoop on EC2/ >>>>>>>> S3.
>>>>>>>> The biggest bottleneck has been getting data to the compute >>>>>>>> infrastructure.
>>>>>>>> The obvious pattern is to have datacenter processes push >>>>>>>> data >>>>>>>> to S3, >>>>>>>> then have the temporary cluster spin up and pull data from >>>>>>>> S3, do >>>>>>>> something interesting, then push the results to S3, notify >>>>>>>> the >>>>>>>> datacenter the job is complete (SQS), have the datacenter >>>>>>>> pull down >>>>>>>> the results from S3.
I think the front page of the Wall Street Journal proves that even having physical security of your data does not provide security! :-)
Security is really a business issue. Each layer of security should cost no more than the data is worth. So the concept of "secure enough" becomes important. What security is appropriate for a given type of data, and is it more or less secure in the cloud than in the corp DC? Is data inherently "less secure" by virtue of being in the cloud than, say, on an employee's laptop or flash dongle or "on the wire"? I don't think corporate data centers are as secure as you're suggesting they are...
----- Original Message ---- From: Chaz. <eprparad...@gmail.com> To: cloud-computing@googlegroups.com Sent: Thursday, June 19, 2008 1:30:30 PM Subject: Re: Issues of data in the cloud...
I don't believe it is possible to have data security in the "cloud" without having physical security of the data. After all whenever I use a cloud computer I hope that no one has hacked it to replace the security modules, or to map memory and look into a running program, etc.
Now if you have to build out an autonomic system we will never have secure cloud computing. No system today is so tight that it can't be hacked. Just look at all the attempts to protect DVDs or BD disks...
Chuck Wegrzyn
Lynne VanArsdale wrote: > Just joined cloud-computing and this is the first conversation I've > received.
> A couple of weeks ago I attended Gartner Security where Neil MacDonald > spoke on "Adaptive Security." In a nutshell, this approach builds a > resilient system for secure data, acting much like the human immune > system. It involves whitelisting as the foundation, blacklisting as a > mid-tier and learned/adaptive mechanisms at the top. In such an > environment, elements would be "autonomic" and self-managing to a large > degree, and would share and communicate with other elements to protect > workloads and information (as opposed to endpoints). There is a lot > more to this vision, and it is probably a number of years away, but it > may be a reasonable approach to address the concerns about data security > being discussed here.
> In any case, does anyone know of any product or standards efforts for > the industry to collaborate on a more cohesive architecture for security > in the cloud?
Chris, it's the last step I wonder about. If you leave the resultant data on S3 and run whatever app they have that operates against that data on EC2 it seems you could save some time?
----- Original Message ---- From: Chris K Wensel <ch...@wensel.net> To: cloud-computing@googlegroups.com Sent: Thursday, June 19, 2008 12:55:05 PM Subject: Re: Business Intelligence solution in Cloud Computing
1) is it possible to have the app run on AWS so that the derived data does not need to traverse back down in real time (that way you could use a lazy download in the background to archive it in their DC while their app accesses the copy in real time on AWS.)?
The pattern is roughly this:
-- load dataset to S3 from datacenter (in small pieces, in parallel), repeat
- identify current dataset
- boot hadoop cluster
- start job on given dataset
- head of job pulls down parts from S3 in parallel (very natural with Hadoop)
- complete middle of job
- tail of job stuffs result sets into S3 in parallel (again fairly natural with Hadoop)
-- repeat above concurrently as datasets become available (easy to have concurrent Hadoop clusters in EC2).
-- pull data from S3 in parts in parallel
note 'job' above means a given data processing flow. in terms of Hadoop, the 'job' could be dozens of MapReduce jobs on the cluster.
2) I've been thinking about the problem of upload times as well (in the context of large DNA data sets). The cost of loading into AWS is not that prohibitive, so if one were to pre-process that data such that it could be uploaded in a bunch of parallel processes to AWS you could reduce the bottleneck considerably. In theory.
you will see a boost if you spawn multiple connections from one location to S3. it seems (was clearly the case in the past, unsure as of today) that individual connections were throttled, and up to a point bandwidth from a given IP was throttled. so doing things in parallel by breaking your big data into small parts gives you a boost. I can't remember the numbers, else I'd share. it's been a couple of months since that project.
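The "multiple connections" idea above can be sketched in Python with a thread pool. This is only an illustration, not the code used on that project: `upload_parts_in_parallel`, `upload_one`, and `max_workers` are hypothetical names, and `upload_one` stands in for whatever S3 client call you actually use.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def upload_parts_in_parallel(parts, upload_one, max_workers=8):
    """Upload (name, payload) pairs concurrently.

    upload_one is a caller-supplied function (hypothetical here) that pushes
    one part to storage and raises on failure. Returns two sorted lists:
    names that uploaded cleanly and names that failed.
    """
    succeeded, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # One task per part, so each part gets its own connection.
        futures = {pool.submit(upload_one, name, payload): name
                   for name, payload in parts}
        for future in as_completed(futures):
            name = futures[future]
            try:
                future.result()
                succeeded.append(name)
            except Exception:
                # Record the failure so the caller can retry just this part.
                failed.append(name)
    return sorted(succeeded), sorted(failed)
```

Returning the failed names separately means a retry only has to resend the parts that broke, not the whole dataset.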
one benefit of using small parts, is that a given part will be available before the 'whole' is available. S3 won't show things for download that aren't finished uploading. So this also improves things (especially when coupled with SQS).
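A rough sketch of why that helps on the consuming side: work can start on each part as soon as its "upload finished" message arrives. A standard-library `queue.Queue` stands in for SQS here, and `consume_parts` and `process` are made-up names for illustration.

```python
import queue


def consume_parts(notifications, expected_count, process):
    """Process each part as soon as its notification arrives.

    notifications is a queue.Queue of part names (a stand-in for SQS).
    Returns the part names in the order they were processed.
    """
    seen = []
    while len(seen) < expected_count:
        # Raises queue.Empty if nothing arrives within 5 seconds.
        name = notifications.get(timeout=5)
        if name in seen:
            # Standard SQS queues are at-least-once, so skip repeats.
            continue
        process(name)
        seen.append(name)
    return seen
```

Parts are handled in arrival order, not upload order, so the processing step should not assume part 0 shows up first.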
by 'parts' I mean, I may have locally 10G of data. I will break it into n MB pieces (compressed) and push them up to S3 (in parallel). having a manifest (*.parts file) is great when you need to manage the integrity of individual parts (MD5) and the whole (parts list all available, MD5 on parts file). This in part guarantees you aren't processing a job on partial data (because the upload failed and no one noticed).
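For what it's worth, the parts-plus-manifest idea can be sketched like this. The manifest layout (one `md5  name` line per part) and the function names are assumptions for illustration; the actual *.parts format isn't spelled out above.

```python
import gzip
import hashlib


def split_into_parts(data: bytes, part_size: int):
    """Yield (index, gzip-compressed chunk) for fixed-size slices of data."""
    for index, start in enumerate(range(0, len(data), part_size)):
        yield index, gzip.compress(data[start:start + part_size])


def build_manifest(named_parts):
    """Return manifest text with one 'md5  name' line per part (assumed layout)."""
    lines = [f"{hashlib.md5(payload).hexdigest()}  {name}"
             for name, payload in named_parts]
    return "\n".join(lines) + "\n"


def verify_parts(manifest_text, manifest_md5, fetched):
    """True only if the manifest matches its own MD5 and every listed part
    is present in fetched (name -> bytes) with a matching MD5."""
    if hashlib.md5(manifest_text.encode()).hexdigest() != manifest_md5:
        return False
    for line in manifest_text.splitlines():
        digest, name = line.split("  ", 1)
        payload = fetched.get(name)
        if payload is None or hashlib.md5(payload).hexdigest() != digest:
            return False
    return True
```

A job would call `verify_parts` before starting and refuse to run if it returns False, which is the "not processing partial data" guarantee described above.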
If there was a next processing step, then yes it would save time. But those jobs represent all the work being done that isn't done by clients/customers of my client.
> Chris, it's the last step I wonder about. If you leave the resultant > data on S3 and run whatever app they have that operates against that > data on EC2 it seems you could save some time?
> Ray
You are absolutely correct. Once you have a person involved it can be compromised. It is all about risk and how to make it so small it would take an act of God (or a really large budget) to breach it!
> I think the front page of the Wall Street Journal proves that even > having physical security of your data does not provide security! :-)
> Security is really a business issue. Each layer of security should cost > no more than the data is worth. So the concept of "secure enough" > becomes important. What security is appropriate for a given type of data > and is it more or less secure in the cloud than in the corp DC? Is data > inherently "less secure" by virtue of being in the cloud than, say, an > employees laptop or flash dongle or "on the wire"? I don't think > corporate data centers are a secure as you're suggesting they are...
> Ray
I think you missed the point. Even with shared keychains, X.509, etc., there is no guarantee your data is safe. Once it is in the hands of a 3rd party you had better assume it is compromised.
Perhaps the real solution is to carefully architect your solution to provide "bulk" services outside the company and leave the critical things - those that are absolutely vital - to inside the company.
> If you are deploying an application in EC2, you must architect it to > survive failure, because it will fail in varying degrees. Subsequently > features of AWS allow you to do that, roughly (booting a pre- > configured xen vm, simple db, sqs, s3, etc etc).
> I suggest you do the same regarding security, just assume it's a > hostile environment.
> The question is, what features of AWS support you in this? shared > keychains/stores, encrypted volumes, CA, kerberos, ?? or will this > always be left to the user. or could you ever really trust those > services the same way you trust them to not lose data.
> That said, not being a security person. What 'cloud security services' > could a provider provide? Or should they even bother.
> ckw
> On Jun 19, 2008, at 1:18 PM, Chaz. wrote:
>> Security is a funny issue. Can you ever use a cloud computing complex and know for certain your data is protected? I'm betting there is no foolproof way that it can be. So the only real way is to fall back to what we know today: maintain physical control of it, for once that is gone you are on your own, baby.
>> Chuck Wegrzyn
>> Utpal Datta wrote:
>>> Maybe this is a redundant question: where is this protected data residing? In the cloud or in the user's data center?
>>> If it is in the cloud then we are still dealing with Security, Availability and Recoverability issues (that everyone agrees on).
>>> If it is in the user's data center then how will the computing resources offered (and controlled by Amazon) be brought to that specific user's datacenter?
>>> --utpal
>>> On Thu, Jun 19, 2008 at 3:10 PM, Chaz. <eprparad...@gmail.com> wrote: >>>> Jim,
>>>> I definitely agree with your point. I can't think of very many multinationals that would let their data out to wander around. I'd think they would want to protect their data and move the computing resources close to it....
>>>> Chuck
>>>> Jim Peters wrote: >>>>> Even if the cloud providers come up with excellent answers to the >>>>> security and reliability questions, who's going to trust them? >>>>> Credit >>>>> card numbers are one thing, but cloud data is something else >>>>> entirely.
>>>>> +J
>>>>> On Thu, Jun 19, 2008 at 10:57 AM, On SaaS <ons...@gmail.com >>>>> <mailto:ons...@gmail.com>> wrote:
>>>>> That depends on how the cloud is architected, no?
>>>>> And I would think the cloud providers will have to start >>>>> answering >>>>> these questions if they want large enterprises to start >>>>> adopting the >>>>> cloud. There maybe no control of which server in the cloud is >>>>> doing >>>>> the computation, but service providers may provide options to >>>>> restrict based on geographic domains.
>>>>> We have quite a few people here from the cloud providers, >>>>> maybe they >>>>> can share some insight?
>>>>> thx
>>>>> On Jun 19, 2008, at 10:44 AM, Stuart Altenhaus wrote:
>>>>>> I think Chaz is right. There are privacy issues regarding use >>>>>> and >>>>>> exposure of data that vary country by country. If the cloud >>>>>> computes the data, there is no control on where that data is >>>>>> moved >>>>>> for computation, right?
>>>>>> R/s, >>>>>> Stu Altenhaus
>>>>>> -----Original Message----- >>>>>> From: "Chaz." <eprparad...@gmail.com <mailto:eprparad...@gmail.com >>>>>> Date: Thu, 19 Jun 2008 13:40:20 >>>>>> To:cloud-computing@googlegroups.com >>>>>> <mailto:cloud-computing@googlegroups.com> >>>>>> Subject: Re: Issues of data in the cloud...
>>>>>> While I think trans-national data movement will be an area that >>>>>> requires >>>>>> governance of some kind I think that companies can get around >>>>>> the >>>>>> problem in other ways. I think it just requires looking at the >>>>>> problem >>>>>> in a different way.
>>>>>> I'd think the approach is to keep the data still and move the >>>>>> computing >>>>>> to it. The idea is to see the thousands of machines it takes to >>>>>> hold the >>>>>> petabytes worth of data as the compute cloud. What needs to >>>>>> move >>>>>> to it >>>>>> is the programs that can process the data. I've been working >>>>>> on this >>>>>> approach for the last 3 years (Twisted Storage).
>>>>>> Chuck Wegrzyn
>>>>>> Pittard, Rick wrote:
>>>>>>> One big concern is compliance with the data privacy laws in the EU and other countries, which require protection of personal data and that it not be transmitted to locations that have less protection. Since the laws in the US are generally less protective than those in the EU, additional controls/agreements need to be in place to legally move the data from the EU to the US.
>>>>>>> Rick
>>>>>>> -----Original Message----- >>>>>>> From: cloud-computing@googlegroups.com >>>>>>> <mailto:cloud-computing@googlegroups.com> >>>>>>> [mailto:cloud-computing@googlegroups.com] On Behalf Of Chaz. >>>>>>> Sent: Thursday, June 19, 2008 11:58 AM >>>>>>> To: cloud-computing@googlegroups.com >>>>>>> <mailto:cloud-computing@googlegroups.com> >>>>>>> Subject: Issues of data in the cloud...
>>>>>>> While data access and recovery is a very important aspect of >>>>>>> cloud >>>>>>> computing, I'm curious as to the legal issues surrounding the >>>>>>> movement >>>>>>> of data across national boundaries or even across company >>>>>>> boundaries.
>>>>>>> How does the "cloud" protect data going from the owner to the computing service without being compromised (read that as sniffed)? Will a computing service in country A have the right to impose restrictions on data from another country (even if the results of the computing don't affect the citizens of country A)? And so on.
>>>>>>> Chuck Wegrzyn
>>>>>>> Utpal Datta wrote: >>>>>>>> You make all the right points on speed, bandwidth, Amazon >>>>>>>> charging on >>>>>>>> bandwidth etc. But consider the need for the user (say a >>>>>>>> large >>>>>>>> financial company with a sensitive business critical >>>>>>>> application),
>>>>>>>> 1. who will guarantee that the data in S3 is secure from >>>>>>>> physical and >>>>>>>> logical access
>>>>>>>> 2. who will guarantee that the data is always available >>>>>>>> using a >>>>>>>> multi-site recovery system (that is what they would have in >>>>>>>> their own >>>>>>>> data center) that meets their RPO (Recovery Point >>>>>>>> Objective) and RTO >>>>>>>> (Recovery Time Objective) guidelines.
>>>>>>>> Either Amazon or other Cloud providers will make these >>>>>>>> available >>>>>>>> with >>>>>>>> EC2 with SP3 (or some other storage mechanism with more >>>>>>>> robust >>>>>>>> security and availability characteristics) or the users >>>>>>>> will have to >>>>>>>> build something similar on their own using EC2 as their basic >>>>>>>> building >>>>>>>> block.
>>>>>>>> This will be a *very* non-trivial task for any user to do on >>>>>>>> their own >>>>>>>> and they will have to make the decision to put resources to >>>>>>>> build this >>>>>>>> on a cloud or to invest more on their own datacenter.
>>>>>>>> So I guess a lot will depend on the level of maturity of >>>>>>>> the clouds. >>>>>>>> Not sure if all this work belong in a mid-layer outside of >>>>>>>> the >>>>>>>> original cloud and leave the cloud providers just to >>>>>>>> provide the >>>>>>>> basic >>>>>>>> building blocks
>>>>>>>> --utpal
>>>>>>>> On Thu, Jun 19, 2008 at 11:08 AM, Chris K Wensel >>>>>>>> <ch...@wensel.net <mailto:ch...@wensel.net>> >>>>>>> wrote: >>>>>>>>> On Jun 18, 2008, at 8:15 AM, Utpal Datta wrote:
>>>>>>>>>> 1. I think "data in the cloud" is so far a big block to widespread adoption and using cloud for large, sensitive and mission critical applications (especially for Financial organizations). Is someone thinking of a way to leave the data within the user premises and do just the computing in the cloud? Kind of a reverse connection back to the user datacenter.
>>>>>>>>>> That way the conventional data repositories can still be used. The users will not have to worry about the reliability, availability and (to a large part) security of the data. We still have to worry about the security of the data travelling back and forth to and from the cloud to the user data center.
>>>>>>>>>> This probably is more relevant for medium to large scale >>>>>>>>>> users >>>>>>>>>> with >>>>>>>>>> "sensitive" data.
>>>>>>>>>> Comments? tips? >>>>>>>>> I've been processing large
I've looked at this issue quite a bit too. There are a few ways that I think the problem can be "relieved"
1. Don't encourage your clients to download the entire data-set. As long as you provide URLs to the "crunched data", they should only have to pull the data as needed. You can also index the data using SDB (SimpleDB), a nice convenience for searching the data.
2. See if the customers can split the dataset into sub-datasets, each reachable via some sort of URL. When you run your Map job, each of the Map nodes will be responsible for downloading the data from your clients - you might get some benefits from the parallelization of the download.
3. Use S3 for more of a backing store - If you don't have many clients consuming the data, or you think that the clients will download the data soon after the mapreduce job is complete, they can download it directly from the HDFS (http://hadoop.apache.org/core/docs/r0.17.0/hdfs_design.html#Browser+I...)
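To make point 2 a bit more concrete, here is a minimal sketch of pulling several sub-datasets in parallel. It uses a local thread pool rather than Hadoop Map tasks, so it only illustrates the shape of the idea. The names `fetch_all` and `fake_fetch` and the `client.example` URLs are made up for illustration; a real job would substitute an actual HTTP fetch.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch_one, max_workers=8):
    """Fetch every sub-dataset URL concurrently and return {url: payload}.

    `fetch_one` is any callable taking a URL and returning its bytes;
    it is injected so the transfer mechanism stays an open choice.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each URL with its result.
        return dict(zip(urls, pool.map(fetch_one, urls)))


def fake_fetch(url):
    # Stand-in for a real HTTP GET (e.g. urllib.request.urlopen(url).read()).
    return ("data for " + url).encode()


# Hypothetical sub-dataset URLs, one per slice of the client's data.
urls = ["http://client.example/part-%03d" % i for i in range(4)]
parts = fetch_all(urls, fake_fetch)
```

Whether this helps in practice depends on where the bottleneck is: if the client's uplink is already saturated by one stream, more parallel streams will not buy much.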
----- Original Message ---- From: Ray Nugent <rnug...@yahoo.com> To: cloud-computing@googlegroups.com Sent: Thursday, June 19, 2008 11:17:13 AM Subject: Re: Business Intelligence solution in Cloud Computing
Chris, couple of thoughts -
1) is it possible to have the app run on AWS so that the derived data does not need to traverse back down in real time (that way you could use a lazy download in the background to archive it in their DC while their app accesses the copy in real time on AWS.)?
2) I've been thinking about the problem of upload times as well (in the context of large DNA data sets). The cost of loading into AWS is not that prohibitive, so if one were to pre-process that data such that it could be uploaded in a bunch of parallel processes to AWS you could reduce the bottleneck considerably. In theory.
Ray
----- Original Message ---- From: Chris K Wensel <ch...@wensel.net> To: cloud-computing@googlegroups.com Sent: Thursday, June 19, 2008 9:27:51 AM Subject: Re: Business Intelligence solution in Cloud Computing
> On Jun 19, 2008, at 8:08 AM, Chris K Wensel wrote:
>> That said, with bandwidth being the bottleneck in the face of the >> ability to spin up 100 or 1000 nodes to crunch numbers
> how big are the datasets you're working with? Random or linear > access ?
total data is 100's of G. Individual work loads are ~10G. All linear (this being Hadoop), but there is much joining, binning, and crunching between the multiple input datasets (the actual workload translates to ~60 MapReduce jobs, all rendered and managed by Cascading).
So it kinda sucks to have uploads of data to the cluster take longer than it does to compute on it. Worse since my client then has to fetch the derived data back.
A practical way of dealing with data in the cloud IMO is to decouple the way we persist the data from the application. What that basically means is that the application loads the data into an in-memory cloud and that memory cloud keeps the data synchronized with persistent storage asynchronously. There is an open-source version that does that with Amazon SimpleDB and GigaSpaces as the data grid: http://www.openspaces.org/display/EDS/External+Data+Source+by+Amazon+SimpleDB
It basically means that the data is stored in Amazon S3 as the persistent storage. When the application boots, the data grid loads the data from S3, using the SimpleDB interface, into the memory of the cloud resources. The application uses the in-memory data. Updates to the memory are propagated asynchronously back to S3 through the same data grid and SimpleDB interface. You can do pretty much the same thing with MySQL instead of SimpleDB, as I noted in one of my previous posts: http://natishalom.typepad.com/nati_shaloms_blog/2008/03/scaling-out-mys.html
Once the persistent storage is decoupled from our application we can easily use the same model for keeping our persistent data outside of the cloud, i.e. in our local IT.
The nice thing is that we can be very flexible with our strategy as it relates to where the data will reside, how it will be stored, and at what rate it will be synchronized from the application. We can change it over time to best fit our application scenario and constraints without touching our application code.
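For readers who have not seen this pattern, the load-on-boot and asynchronous write-back described above can be sketched roughly as follows. This is not the GigaSpaces or SimpleDB API; `WriteBehindStore` is a made-up name, and a plain dict stands in for the persistent store (S3, SimpleDB, MySQL, or a local data center).

```python
import queue
import threading


class WriteBehindStore:
    """In-memory view of a persistent store, synchronized asynchronously."""

    def __init__(self, backing):
        self._memory = dict(backing)    # load everything into memory on boot
        self._backing = backing         # stand-in for the persistent storage
        self._pending = queue.Queue()   # updates waiting to be persisted
        threading.Thread(target=self._drain, daemon=True).start()

    def get(self, key):
        # Reads are served from memory only.
        return self._memory[key]

    def put(self, key, value):
        # The write returns immediately; persistence happens in the background.
        self._memory[key] = value
        self._pending.put((key, value))

    def flush(self):
        # Block until every queued update has reached the backing store.
        self._pending.join()

    def _drain(self):
        while True:
            key, value = self._pending.get()
            self._backing[key] = value
            self._pending.task_done()


persistent = {"a": 1}
store = WriteBehindStore(persistent)
store.put("b", 2)
store.flush()
```

Because the application only ever talks to `get` and `put`, the backing store can be swapped or relocated without touching application code. The price is a window in which an update exists only in memory.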
" And CORBA isn't what I am thinking of, or even HADOOP but things like JavaSpaces (?)."
JavaSpaces is indeed more relevant for this type of scenario. What's unique about JavaSpaces IMO is that it can be used for handling both the compute side and the data storage. The references above show how you could use space-based storage for handling the data side. Once the data is stored in the in-memory space cluster, you can easily use the same space to route business logic to those in-memory instances in parallel. There's a nice way to abstract that from the user using a remoting abstraction - see more details on how that works here: http://uri-cohen.blogspot.com/2008/02/openspaces-svf-remoting-on-steroids.html
[mailto:cloud-computing@googlegroups.com] On Behalf Of Chaz. Sent: Thursday, June 19, 2008 11:35 PM To: cloud-computing@googlegroups.com Subject: Re: Issues of data in the cloud...
And CORBA isn't what I am thinking of, or even HADOOP but things like JavaSpaces (?).
I'm not sure I would agree you have to ship your data to somewhere else.
After all a "cloud data provider" could create just the secure environment for holding the data and processing it (isn't that really what S3 is all about?). The only thing the using company needs to do is write the program and have it installed, more or less automagically, on the machines that hold the user's data.
Chuck Wegrzyn
Chris K Wensel wrote: > CORBA isn't about mobility, it's just typesafe OO RPC. There was work
> done by ObjectSpace and GeneralMagic in the 90's on agent based > computing (move code to the data). but that movement died off.
> if the Cloud is a collection of compute resources, and you need to > apply them to lots of your data, you have little choice but to move > your data. you can't move the compute power. (unless you order a > shipping container of servers I guess)
> ckw
> On Jun 19, 2008, at 11:39 AM, Stuart Altenhaus wrote:
>> If the programs are moved to the data, then what is the distinction >> between cloud computing and CORBA? Seems like the same basic tenets >> would have to be in place.
>> (I'm new to the concept of cloud computing, but do see the >> opportunities for advancing a network of computers that renders geo >> location trivial. Surely enhancing existing network clouds such that
>> the computing power were placed at each node, a net-centric approach
>> is achieved... The telcos do that today, right?)
>> Date: Thu, 19 Jun 2008 14:00:55 >> To:cloud-computing@googlegroups.com >> Subject: Re: Issues of data in the cloud...
>> I know from my work that many firms are reluctant to let their data "out the door" since they see that as their edge in the market. But even that aside for a minute, it seems to make more sense to move "small" programs (relative to the size of the data) than to move massive amounts of data.
>> So my question is as follows: what makes a good "storage cloud"?
>> Chuck Wegrzyn
>> Khazret Sapenov wrote:
>>> On Thu, Jun 19, 2008 at 1:40 PM, Chaz. <eprparad...@gmail.com >>> <mailto:eprparad...@gmail.com>> wrote:
>>> [snip] >>> I'd think the approach is to keep the data still and move the >>> computing >>> to it. The idea is to see the thousands of machines it takes to >>> hold the >>> petabytes worth of data as the compute cloud. What needs to >>> move to it >>> is the programs that can process the data. I've been working on >>> this >>> approach for the last 3 years (Twisted Storage).
>>> Chuck Wegrzyn
>>> This is a valid approach, which I personally call the "Plumber Pattern": an application, encapsulated in some kind of container (e.g. a virtual machine image), is marshalled to secure data islands to iteratively do its unique work (say, matching on some criterion in Interpol, FBI, CIA, MI5 and other databases, all distributed across continents). Due to the utterly confidential nature of these types of data, it is impossible to move them to public storage (at least at this time). The above-mentioned case might be extrapolated to some lines of business as well, with reduced privacy/security requirements.
Isn't the discussion relative to the level of assurance (in terms of security here) that is supplied and demanded per use case?
For absolute control, you should stay absolutely in control of the resources and I don't think Cloud Computing is something for you.
If you want a secured environment you should understand that the administrator of the resource can read the memory. If you want to prevent that, then you should look into secured techniques to hide the memory contents. Ultimately your instructions must run through a (virtual) CPU, whose activity can also be obscured, but that is the only threat left in that solution.
Most cloud computing solutions don't allow for that level of security in the cloud.
The other, more generic good thing to do IMHO is to encrypt all your data that resides in the Cloud, regardless of whether this falls somewhere between an absolute need and a mild wish. This would also cover cases where some transfer protocol is inadvertently left unencrypted.
For some bio-medical use cases in Grid computing (more my main field) this approach is also being used. Decryption happens just prior to the actual processing. A more advanced solution is the sliding decrypted window approach, where the dataset is decrypted per section or block. CPU usage goes up, but most of the file/database stays encrypted and the opportunity to snoop around on the resource is very limited.
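A rough sketch of the sliding decrypted window idea, under loud assumptions: the "cipher" here is a toy XOR keystream built from SHA-256 purely so the example runs with the standard library. It is NOT secure and only stands in for a real block-addressable cipher. The names `xor_block`, `encrypt` and `sliding_window` and the sample records are made up for illustration.

```python
import hashlib

BLOCK = 32  # bytes per window; equals the SHA-256 digest size


def _keystream(key, index):
    # Toy keystream block: SHA-256 over the key plus the block counter.
    return hashlib.sha256(key + index.to_bytes(8, "big")).digest()


def xor_block(key, index, block):
    # XOR is its own inverse, so this both encrypts and decrypts a block.
    return bytes(b ^ k for b, k in zip(block, _keystream(key, index)))


def encrypt(key, data):
    # Encrypt the whole dataset block by block before it is stored.
    return b"".join(
        xor_block(key, i // BLOCK, data[i:i + BLOCK])
        for i in range(0, len(data), BLOCK)
    )


def sliding_window(key, ciphertext):
    """Yield one decrypted block at a time; the rest stays encrypted."""
    for i in range(0, len(ciphertext), BLOCK):
        yield xor_block(key, i // BLOCK, ciphertext[i:i + BLOCK])


key = b"demo key"
records = b"patient-0001;patient-0002;patient-0003;patient-0004;"
stored = encrypt(key, records)

# Process block by block: only the current block is ever in the clear.
total = sum(block.count(b";") for block in sliding_window(key, stored))
```

The processing step has to tolerate records that straddle a block boundary; counting separators does, but a real analysis would need to carry partial records from one window to the next.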
Chaz. wrote: > I don't believe it is possible to have data security in the "cloud" > without having physical security of the data. After all whenever I use a > cloud computer I hope that no one has hacked it to replace the security > modules, or to map memory and look into a running program, etc.
> Now if you have to build out an autonomic system we will never have > secure cloud computing. No system today is so tight that it can't be > hacked. Just look at all the attempts to protect DVDs or BD disks...
> Chuck Wegrzyn
> Lynne VanArsdale wrote: >> Just joined cloud-computing and this is the first conversation I've >> received.
>> A couple of weeks ago I attended Gartner Security where Neil MacDonald >> spoke on "Adaptive Security." In a nutshell, this approach builds a >> resilient system for secure data, acting much like the human immune >> system. It involves whitelisting as the foundation, blacklisting as a >> mid-tier and learned/adaptive mechanisms at the top. In such an >> environment, elements would be "autonomic" and self-managing to a large >> degree, and would share and communicate with other elements to protect >> workloads and information (as opposed to endpoints). There is a lot >> more to this vision, and it is probably a number of years away, but it >> may be a reasonable approach to address the concerns about data security >> being discussed here.
>> In any case, does anyone know of any product or standards efforts for >> the industry to collaborate on a more cohesive architecture for security >> in the cloud?
Nati, I agree that decoupling will help. However your point here confuses me -
"Once the persistent storage is decoupled from our application we can easily use the same model for keeping our persistent data outside of the cloud i.e. in our local IT."
Keeping persistence that far away is bound to have a pretty significant impact on your performance, isn't it?
----- Original Message ---- From: Nati Shalom <na...@gigaspaces.com> To: cloud-computing@googlegroups.com Sent: Friday, June 20, 2008 12:33:55 AM Subject: RE: Issues of data in the cloud...
A practical way of dealing with data in the cloud IMO is to decouple the way we persist the data from the application. What that basically means is that the application loads the data into an in-memory cloud and that memory cloud keeps the data synchronized with persistent storage asynchronously. There is an open-source version that does that with Amazon SimpleDB and GigaSpaces as the data grid: http://www.openspaces.org/display/EDS/External+Data+Source+by+Amazon+SimpleDB
It basically means that the data is stored in Amazon S3 as the persistent storage. When the application boots, the data grid loads the data from S3, using the SimpleDB interface, into the memory of the cloud resources. The application uses the in-memory data. Updates to the memory are propagated asynchronously back to S3 through the same data grid and SimpleDB interface. You can do pretty much the same thing with MySQL instead of SimpleDB, as I noted in one of my previous posts: http://natishalom.typepad.com/nati_shaloms_blog/2008/03/scaling-out-mys.html
Once the persistent storage is decoupled from our application we can easily use the same model for keeping our persistent data outside of the cloud, i.e. in our local IT.
The nice thing is that we can be very flexible with our strategy as it relates to where the data will reside, how it will be stored, and at what rate it will be synchronized from the application. We can change it over time to best fit our application scenario and constraints without touching our application code.
" And CORBA isn't what I am thinking of, or even HADOOP but things like JavaSpaces (?)."
JavaSpaces is indeed more relevant for this type of scenario. What's unique about JavaSpaces IMO is that it can be used for handling both the compute side and the data storage. The references above show how you could use space-based storage for handling the data side. Once the data is stored in the in-memory space cluster, you can easily use the same space to route business logic to those in-memory instances in parallel. There's a nice way to abstract that from the user using a remoting abstraction - see more details on how that works here: http://uri-cohen.blogspot.com/2008/02/openspaces-svf-remoting-on-steroids.html
Nati S. GigaSpaces
-----Original Message----- From: cloud-computing@googlegroups.com [mailto:cloud-computing@googlegroups.com] On Behalf Of Chaz. Sent: Thursday, June 19, 2008 11:35 PM To: cloud-computing@googlegroups.com Subject: Re: Issues of data in the cloud...
And CORBA isn't what I am thinking of, or even HADOOP but things like JavaSpaces (?).
I'm not sure I would agree you have to ship your data to somewhere else.
After all a "cloud data provider" could create just the secure environment for holding the data and processing it (isn't that really what S3 is all about?). The only thing the using company needs to do is write the program and have it installed, more or less automagically, on the machines that hold the user's data.
Chuck Wegrzyn
Chris K Wensel wrote: > CORBA isn't about mobility, it's just typesafe OO RPC. There was work
> done by ObjectSpace and GeneralMagic in the 90's on agent based > computing (move code to the data). but that movement died off.
> if the Cloud is a collection of compute resources, and you need to > apply them to lots of your data, you have little choice but to move > your data. you can't move the compute power. (unless you order a > shipping container of servers I guess)
> ckw
> On Jun 19, 2008, at 11:39 AM, Stuart Altenhaus wrote:
>> If the programs are moved to the data, then what is the distinction >> between cloud computing and CORBA? Seems like the same basic tenets >> would have to be in place.
>> (I'm new to the concept of cloud computing, but do see the >> opportunities for advancing a network of computers that renders geo >> location trivial. Surely enhancing existing network clouds such that
>> the computing power were placed at each node, a net-centric approach
>> is achieved... The telcos do that today, right?)
>> Date: Thu, 19 Jun 2008 14:00:55 >> To:cloud-computing@googlegroups.com >> Subject: Re: Issues of data in the cloud...
>> I know from my work that many firms are reluctant to let there data >> "out >> the door" since they see that as their edge in the market. But even >> that >> aside for a minute, it seems to make more sense to move "small" >> programs >> (relative to the size of the data) then to move massive amounts of >> data.
>> So my question is as follows: what makes a good "storage cloud"?
>> Chuck Wegrzyn
>> Khazret Sapenov wrote:
>>> On Thu, Jun 19, 2008 at 1:40 PM, Chaz. <eprparad...@gmail.com >>> <mailto:eprparad...@gmail.com>> wrote:
>>> [snip] >>> I'd think the approach is to keep the data still and move the >>> computing >>> to it. The idea is to see the thousands of machines it takes to >>> hold the >>> petabytes worth of data as the compute cloud. What needs to >>> move to it >>> is the programs that can process the data. I've been working on >>> this >>> approach for the last 3 years (Twisted Storage).
>>> Chuck Wegrzyn
>>> This is a valid approach, which I personally call the "Plumber Pattern": an application, encapsulated in some kind of container (e.g. a virtual machine image), is marshalled to secure data islands to iteratively do its unique work (say, matching on some criterion in Interpol, FBI, CIA, MI5 and other databases, all distributed across continents). Due to the utterly confidential nature of these types of data, it is impossible to move them to public storage (at least at this time). The above-mentioned case might be extrapolated to some lines of business as well, with reduced privacy/security requirements.
Thanks for the comments, Alan. My previous post should outline how we have parallelized much of the infrastructure to alleviate my client's issues to a reasonable degree. In short, we employed the patterns you suggest, but not the specific technologies, for various reasons. I'd be happy to go into a little more detail offline.
The gist of my comments in this thread is a complaint: unfortunately, you currently can't scale bandwidth into a cloud to match the scale of the compute resources. Many hours to upload and relatively few minutes to crunch is an annoying imbalance.
For the analytics in the cloud space, there is an opportunity for a vendor to offer whatever services (many introduced in this thread by others) to alleviate the imbalance.
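To put rough numbers on the upload-versus-crunch imbalance described above, here is a back-of-the-envelope sketch. The ~10 GB workload size comes from earlier in this thread; the link speeds are assumptions chosen only for illustration, and `transfer_hours` is a hypothetical helper, not part of any tool mentioned here.

```python
def transfer_hours(size_gb: float, link_mbps: float) -> float:
    """Hours needed to move size_gb (decimal gigabytes) over a link
    that sustains link_mbps megabits per second."""
    bits = size_gb * 8e9                 # gigabytes -> bits
    seconds = bits / (link_mbps * 1e6)   # megabits/s -> bits/s
    return seconds / 3600


# Workload size is from the thread (~10 GB); link speeds are assumed.
for mbps in (10, 100, 1000):
    print(f"{mbps:>5} Mbit/s: {transfer_hours(10, mbps):.2f} h")
```

Adding compute nodes shortens the crunch time but leaves the upload time untouched, since that depends only on data size and the sustained link rate.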
> I've looked at this issue quite a bit too. There are a few ways that I think the problem can be "relieved"
> 1. Don't encourage your clients to download the entire data-set. As long as you provide URLs to the "crunched data", they should only have to pull the data as needed. You can also index the data using SDB - a nice convenience function for searching the data.
> 2. See if the customers can split the dataset into sub-datasets, each reachable via some sort of URL. When you run your Map job, each of the Map nodes will be responsible for downloading the data from your clients - you might get some benefits from the parallelization of the download.
> 3. Use S3 for more of a backing store - If you don't have many clients consuming the data, or you think that the clients will download the data soon after the mapreduce job is complete, they can download it directly from the HDFS (http://hadoop.apache.org/core/docs/r0.17.0/hdfs_design.html#Browser+I...)
Yeah. The whole issue with SOA as it is today is that you are expected to move the data to where the data is processed. What we really need is the ability to move the processing to where the data is (Which is kinda the point of Hadoop)
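A toy sketch of the "move the processing to where the data is" idea, for illustration only. This is not Hadoop's API: `DataNode`, `count_words` and `run_everywhere` are hypothetical names, and the "nodes" are just in-process objects. The point it shows is that the small job travels to each node and only a small summary travels back.

```python
from collections import Counter
from typing import Callable, Iterable, List


class DataNode:
    """Hypothetical stand-in for a machine that stores part of a dataset."""

    def __init__(self, records: List[str]):
        self._records = records  # the bulky data stays on the node

    def run(self, job: Callable[[Iterable[str]], Counter]) -> Counter:
        # The (small) job comes to the (large) data; only the result leaves.
        return job(self._records)


def count_words(records: Iterable[str]) -> Counter:
    """Map-side work: summarise local records into a small Counter."""
    counts: Counter = Counter()
    for line in records:
        counts.update(line.split())
    return counts


def run_everywhere(nodes: List[DataNode], job) -> Counter:
    """Reduce-side work: merge the per-node summaries."""
    total: Counter = Counter()
    for node in nodes:
        total += node.run(job)
    return total


nodes = [DataNode(["a b a"]), DataNode(["b c"])]
totals = run_everywhere(nodes, count_words)
```

In a real cluster the scheduler would decide which machine runs which piece of the job; here the loop in `run_everywhere` simply visits every node in turn.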
----- Original Message ---- From: Chris K Wensel <ch...@wensel.net> To: cloud-computing@googlegroups.com Sent: Friday, June 20, 2008 8:15:29 AM Subject: Re: Business Intelligence solution in Cloud Computing
Thanks for the comments, Alan. My previous post should outline how we have parallelized much of the infrastructure to alleviate my client's issues to a reasonable degree. In short, we employed the patterns you suggest, but not the specific technologies, for various reasons. I'd be happy to go into a little more detail offline.
The gist of my comments in this thread is a complaint: unfortunately, you currently can't scale bandwidth into a cloud to match the scale of the compute resources. Many hours to upload and relatively few minutes to crunch is an annoying imbalance.
For the analytics in the cloud space, there is an opportunity for a vendor to offer whatever services (many introduced in this thread by others) to alleviate the imbalance.
cheers, ckw
On Jun 19, 2008, at 11:00 PM, Alan Ho wrote:
Hi Chris,
I've looked at this issue quite a bit too. There are a few ways that I think the problem can be "relieved"
1. Don't encourage your clients to download the entire data-set. As long as you provide URLs to the "crunched data", they should only have to pull the data as needed. You can also index the data using SDB - a nice convenience function for searching the data.
2. See if the customers can split the dataset into sub-datasets, each reachable via some sort of URL. When you run your Map job, each of the Map nodes will be responsible for downloading the data from your clients - you might get some benefits from the parallelization of the download.
3. Use S3 for more of a backing store - If you don't have many clients consuming the data, or you think that the clients will download the data soon after the mapreduce job is complete, they can download it directly from the HDFS (http://hadoop.apache.org/core/docs/r0.17.0/hdfs_design.html#Browser+I...)
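A minimal sketch of suggestion 2 above, assuming the dataset has already been split into pieces that are each reachable by URL. It uses a thread pool on one machine rather than Hadoop map tasks, the URLs are made up, and the fetcher is injected so the sketch runs offline; a real fetcher could be something like `lambda url: urllib.request.urlopen(url).read()`.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def fetch_all(urls: List[str],
              fetch: Callable[[str], bytes],
              workers: int = 8) -> Dict[str, bytes]:
    """Fetch every URL concurrently and return {url: payload}."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in input order, so zip pairs
        # each URL with its own payload.
        return dict(zip(urls, pool.map(fetch, urls)))


def fake_fetch(url: str) -> bytes:
    """Offline stand-in for a real HTTP download."""
    return f"data from {url}".encode()


# Hypothetical sub-dataset URLs, one per piece of the split dataset.
parts = [f"http://client.example/dataset/part-{i:05d}" for i in range(4)]
results = fetch_all(parts, fake_fetch)
```

Whether this helps depends on where the bottleneck is: parallel pulls can fill a link that a single stream leaves idle, but they cannot exceed the capacity of the client's uplink.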
Hadoop, fantastic idea (it would be great if it worked...)
If you need a production-ready environment in finance, it's a long way off. The distributed caching products (GemFire, Oracle's Coherence and Nati's GigaSpaces) are all miles ahead of Hadoop at this point, some more than others ;-)
> Yeah. The whole issue with SOA as it is today is that you are expected to move the data to where the data is processed. What we really need is the ability to move the processing to where the data is (Which is kinda the point of Hadoop)
> Cheers,
> Alan Ho