Folks,
Thanks for sharing valuable points. Just by reading the postings I have picked up quite a bit of information.
I was wondering if any of you have experience with (or know a vendor for) running a data-warehouse-based business intelligence solution in a cloud. For instance: accept data through FTP, run it through an ETL tool to load the dimensional model, and point the reports, dashboards and whatnot against the model.
Do the cloud vendors support this model?
Thanks,
Ramesh.
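To make the pipeline in the question concrete, here is a minimal sketch of the "load the dimensional model, then report against it" step. It is only an illustration under stated assumptions: SQLite stands in for the warehouse, the CSV text stands in for a file received over FTP, and the table and column names (`dim_customer`, `dim_product`, `fact_sales`) are hypothetical, not taken from any vendor's product.

```python
import csv
import io
import sqlite3

# Hypothetical sample of a delivered file; in practice this would be read
# from the FTP drop directory rather than from an in-memory string.
SAMPLE_CSV = """order_id,customer,product,amount
1,Acme,Widget,10.50
2,Acme,Gadget,4.25
3,Globex,Widget,21.00
"""


def create_schema(conn):
    """Create a tiny star schema: two dimensions and one fact table."""
    conn.executescript("""
        CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT UNIQUE);
        CREATE TABLE dim_product  (product_key  INTEGER PRIMARY KEY, name TEXT UNIQUE);
        CREATE TABLE fact_sales (
            order_id     INTEGER PRIMARY KEY,
            customer_key INTEGER REFERENCES dim_customer(customer_key),
            product_key  INTEGER REFERENCES dim_product(product_key),
            amount       REAL
        );
    """)


def dimension_key(conn, table, key_col, name):
    """Insert the dimension member if it is new, then return its surrogate key."""
    conn.execute(f"INSERT OR IGNORE INTO {table} (name) VALUES (?)", (name,))
    row = conn.execute(
        f"SELECT {key_col} FROM {table} WHERE name = ?", (name,)
    ).fetchone()
    return row[0]


def load(conn, csv_text):
    """The 'T' and 'L' of ETL: resolve dimension keys, then insert fact rows."""
    for rec in csv.DictReader(io.StringIO(csv_text)):
        customer_key = dimension_key(conn, "dim_customer", "customer_key", rec["customer"])
        product_key = dimension_key(conn, "dim_product", "product_key", rec["product"])
        conn.execute(
            "INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
            (int(rec["order_id"]), customer_key, product_key, float(rec["amount"])),
        )
    conn.commit()


def sales_by_customer(conn):
    """A report query of the kind a dashboard would run against the model."""
    return conn.execute("""
        SELECT c.name, SUM(f.amount)
        FROM fact_sales f
        JOIN dim_customer c USING (customer_key)
        GROUP BY c.name
        ORDER BY c.name
    """).fetchall()


conn = sqlite3.connect(":memory:")
create_schema(conn)
load(conn, SAMPLE_CSV)
print(sales_by_customer(conn))
```

Nothing here is cloud-specific; the open question in the thread is whether a vendor will host each of these pieces.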
Ramesh,
I know of a similar solution from NASDAQ. Quote:
*NASDAQ Market Replay provides a NASDAQ-validated replay and analysis of the activity in the stock market. The application is built using the Adobe Flex and AIR platform, and utilizes the Amazon Simple Storage Service (S3) for persisting historical market data.*
Sources: https://data.nasdaq.com/mr.aspx and http://www.infoq.com/articles/nasdaq-case-study-air-and-s3
salut,
Khaz Sapenov
On Tue, Jun 17, 2008 at 3:40 PM, SRINIVASAN GANESAN <s...@yahoo.com> wrote:
> Folks,
> Thanks for sharing valuable points. Just by reading the postings I have
> picked up quite a bit of information.
> I was wondering if any of you have experience (or know a vendor) in running
> a data warehouse based business intelligence solution in a cloud.
> For instance, accept data through FTP, run it through an ETL tool to load
> the dimensional model and point the reports, dashboards and what not against
> the model...
> Do the cloud vendors support this model?
> Thanks
> Ramesh.
--- On Tue, 6/17/08, Khazret Sapenov <sape...@gmail.com> wrote:
From: Khazret Sapenov <sape...@gmail.com> Subject: Re: Business Intelligence solution in Cloud Computing To: cloud-computing@googlegroups.com Date: Tuesday, June 17, 2008, 1:09 PM
-- Subhasis Dasgupta Indian Representative Kaavo Inc Stamford CT, USA www.kaavo.com Phone : +919830282548 skype : subhasis.dasgupta
For data-intensive requirements such as clickstream analysis, call data reports, etc., there is a cloud edition of Vertica available on Amazon Web Services.
If you have huge data volumes and have trouble generating data-intensive reports, Vertica's columnar architecture in the cloud will be a good option.
--
Best Regards,
Dilli Babu
On-line Computing Architect,
DataSisar,
5 & 6 Walton road,
Bangalore-560001
E-mail: dillib...@datasisar.com
Mobile:+919449191299
Visit:http://www.datasisar.com
I am *very* new to this group, but I am really excited by the quality of postings here. I am learning a lot, quickly.
I have a couple of questions. Maybe someone has some answers.
1. I think "data in the cloud" has so far been a big obstacle to widespread adoption of the cloud for large, sensitive and mission-critical applications (especially in financial organizations). Is anyone thinking of a way to leave the data on the user's premises and do just the computing in the cloud? A kind of reverse connection back to the user's datacenter.
That way the conventional data repositories can still be used. The users will not have to worry about the reliability, availability and (to a large extent) security of the data. We would still have to worry about the security of the data travelling back and forth between the cloud and the user's data center.
This is probably more relevant for medium- to large-scale users with "sensitive" data.
Comments? Tips?
2. Considering that "cloud computing" is at the beginning of its adoption curve, users will, for a long time, have a mixture of their own physical and virtual devices within their datacenter along with "virtual" datacenters in one or more clouds (maybe from different vendors).
Users will obviously look for a management portal that seamlessly crosses the boundaries of physical, virtual and cloud devices (for discovery and monitoring at the very least).
Is there any talk or thought on standardizing the "cloud management actions" and "cloud management data" interfaces?
On Wed, Jun 18, 2008 at 8:59 AM, Dilli Babu <dil...@gmail.com> wrote:
> On Jun 18, 12:43 pm, "Subhasis Dasgupta" <dasgupta.subha...@gmail.com>
> wrote:
>> This is one link I have seen, but I have not used it; they are providing BI
>> solutions on EC2:
>> Pentaho
On Wed, Jun 18, 2008 at 11:15 AM, Utpal Datta <utpal8...@gmail.com> wrote:
> 1. I think "data in the cloud" is so far a big block to widespread
> adoption and using cloud for large, sensitive and mission critical
> applications (especially for Financial organization). Is someone
> thinking of a way to leave the data within the user-premises and do
> just the computing in the cloud? Kind of a reverse connection back to
> the user datacenter.
>
> That way the conventional data repositories can still be used. The
> users will not have to worry about the reliability, availability and
> (to a large part) security of the data. We still have to worry about
> the security of the data travelling back and forth to and from the
> cloud to the user data center.
>
> This probably is more relevant for medium to large scale users with
> "sensitive" data.
>
> Comments? tips?
I was also thinking about some kind of staged, DMZ-like data island on premises (with enforced access policies) that has a protected communication/transport channel to various compute cloud providers.
As a simple example, I had a use case with a Maya3D render job using NFS/SMB shares for input and output files, where the NFS server was located on premises and the rendering was done by multiple remote nodes on Amazon Elastic Compute Cloud, orchestrated by LSF.
> 1. I think "data in the cloud" is so far a big block to widespread
> adoption and using cloud for large, sensitive and mission critical
> applications (especially for Financial organization). Is someone
> thinking of a way to leave the data within the user-premises and do
> just the computing in the cloud? Kind of a reverse connection back to
> the user datacenter.
>
> That way the conventional data repositories can still be used. The
> users will not have to worry about the reliability, availability and
> (to a large part) security of the data. We still have to worry about
> the security of the data travelling back and forth to and from the
> cloud to the user data center.
>
> This probably is more relevant for medium to large scale users with
> "sensitive" data.
>
> Comments? tips?
I've been processing large historical data sets for a financial company I'm consulting with, using Cascading/Hadoop on EC2/S3.
The biggest bottleneck has been getting data to the compute infrastructure.
The obvious pattern is to have datacenter processes push data to S3, then have a temporary cluster spin up, pull the data from S3, do something interesting, and push the results back to S3. The cluster then notifies the datacenter that the job is complete (via SQS), and the datacenter pulls down the results from S3.
Because of the need to support both well-defined daily processes and ad-hoc processes, my client's data generally needs to stay on S3. Pulling it from a remote datacenter on repeated runs would be extraordinarily slow and expensive, considering Amazon charges for bandwidth in and out. Plus, it is a bit cheaper just to keep data on S3 than to buy a NAS for storage.
That said, with bandwidth being the bottleneck in the face of the ability to spin up 100 or 1000 nodes to crunch numbers, larger pipes into a vendor's cloud would be very welcome. Otherwise your cloud solution is only as fast as getting data in and out of it.
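The push / compute / notify / pull pattern described above can be sketched roughly as follows. This is not the actual Cascading/Hadoop setup: the `ObjectStore` class is an in-memory stand-in for S3, a `queue.Queue` stands in for SQS, and all function and key names are hypothetical, chosen only to show the order of the steps.

```python
import queue


class ObjectStore:
    """In-memory stand-in for an object store such as S3 (an assumption)."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        self._objects[key] = data

    def get(self, key):
        return self._objects[key]


def datacenter_push(store, key, records):
    """Step 1: the datacenter pushes its input data to the object store."""
    store.put(key, list(records))


def cluster_job(store, notifications, input_key, output_key):
    """Steps 2-4: the temporary cluster pulls, computes, pushes, notifies."""
    records = store.get(input_key)
    # "Do something interesting": here, just a count and a sum.
    result = {"count": len(records), "total": sum(records)}
    store.put(output_key, result)
    notifications.put({"status": "done", "output_key": output_key})


def datacenter_collect(store, notifications):
    """Step 5: the datacenter waits for the notification, then pulls results."""
    message = notifications.get(timeout=1)
    return store.get(message["output_key"])


store = ObjectStore()
notifications = queue.Queue()  # stand-in for a message queue such as SQS

datacenter_push(store, "input/day1", [3, 5, 8])
cluster_job(store, notifications, "input/day1", "output/day1")
result = datacenter_collect(store, notifications)
print(result)
```

Note that the input stays in the store after the job finishes, which is the point made above: repeated and ad-hoc runs can reuse it without another upload from the datacenter.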
You make all the right points on speed, bandwidth, Amazon charging for bandwidth, etc. But consider the needs of the user (say a large financial company with a sensitive, business-critical application):
1. Who will guarantee that the data in S3 is secure from unauthorized physical and logical access?
2. Who will guarantee that the data is always available through a multi-site recovery system (which is what they would have in their own data center) that meets their RPO (Recovery Point Objective) and RTO (Recovery Time Objective) guidelines?
Either Amazon or other cloud providers will make these available with EC2 and S3 (or some other storage mechanism with more robust security and availability characteristics), or the users will have to build something similar on their own using EC2 as their basic building block.
This will be a *very* non-trivial task for any user to do on their own, and they will have to decide whether to put resources into building this on a cloud or to invest more in their own datacenter.
So I guess a lot will depend on the level of maturity of the clouds. I am not sure whether all this work belongs in a mid-layer outside of the original cloud, leaving the cloud providers to provide just the basic building blocks.
--utpal
> On Jun 19, 2008, at 8:08 AM, Chris K Wensel wrote:
>> That said, with bandwidth being the bottleneck in the face of the
>> ability to spin up 100 or 1000 nodes to crunch numbers
> how big are the datasets you're working with? Random or linear
> access?
Total data is hundreds of GB. Individual workloads are ~10 GB. All linear (this being Hadoop), but there is much joining, binning, and crunching between the multiple input datasets (the actual workload translates to ~60 MapReduce jobs, all rendered and managed by Cascading).
So it kinda sucks to have uploads of data to the cluster take longer than the computation on it. Worse, my client then has to fetch the derived data back.
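For a sense of scale, a back-of-the-envelope estimate of the upload time for a ~10 GB workload can be sketched as below. The link speeds and the `efficiency` factor are hypothetical inputs, not measurements from this project, and the estimate ignores latency and per-request overhead.

```python
def transfer_seconds(size_gb, link_mbps, efficiency=0.8):
    """Estimate the seconds needed to move size_gb gigabytes over a link.

    size_gb    -- payload size in decimal gigabytes (10**9 bytes)
    link_mbps  -- nominal link speed in megabits per second (10**6 bits/s)
    efficiency -- assumed usable fraction of the nominal speed
    """
    if link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_mbps must be positive and efficiency in (0, 1]")
    bits = size_gb * 8 * 1000**3
    return bits / (link_mbps * 1000**2 * efficiency)


# Hypothetical uplinks for a ~10 GB workload.
for mbps in (10, 100, 1000):
    minutes = transfer_seconds(10, mbps) / 60
    print(f"{mbps:>5} Mbit/s: about {minutes:.1f} minutes")
```

If the compute step on a large cluster finishes faster than the figure for the available uplink, the transfer rather than the computation sets the turnaround time.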
I have been contemplating this issue of safe storage in the cloud. My opinion is that what I need is at least four in-the-cloud storage vendors, on top of which I can layer RAID5 behavior, combined with a loopback encrypted file system. Even with that, pulling the data into the compute cloud puts the data in danger of being observed and possibly tampered with. This all ignores latency, which I am certain will be a problem, as well as transit costs.
I personally would like my application-at-the-edge software to also span a number of in-the-cloud vendors, so that I don't experience vendor lock-in. In particular, I am concerned that my public-facing services will be targets of DDoS attacks and that, as a result, vendors will consider abruptly discontinuing service.
For these reasons, I have not been able to consider much of what in-the-cloud providers offer to date, though I continue to build proof-of-concept packages in preparation for the point when the industry evolves enough to meet my needs. I am very curious whether others have similar concerns and whether plausible solutions are being found...
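The idea of layering RAID5-style redundancy over several storage vendors can be illustrated with a small parity sketch. Assumptions to note: each list element stands for the share held by one vendor, the parity share is kept in a fixed position (closer to RAID4 than to true RAID5, which rotates parity), and the encryption layer is left out entirely. All function names are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def stripe(data, n_providers):
    """Split data into n_providers - 1 data shares plus one XOR parity share."""
    k = n_providers - 1
    chunk_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(chunk_len * k, b"\0")
    chunks = [padded[i * chunk_len:(i + 1) * chunk_len] for i in range(k)]
    return chunks + [xor_blocks(chunks)]


def recover(shares, missing_index):
    """Rebuild the share at missing_index from all the other shares."""
    others = [s for i, s in enumerate(shares) if i != missing_index]
    return xor_blocks(others)


def reassemble(shares, original_len):
    """Concatenate the data shares (all but the last) and strip the padding."""
    return b"".join(shares[:-1])[:original_len]


data = b"quarterly ledger extract"
shares = stripe(data, 4)  # one share per hypothetical storage vendor

# Simulate one vendor becoming unavailable, then rebuild its share.
lost = 1
degraded = list(shares)
degraded[lost] = None
degraded[lost] = recover(degraded, lost)

print(reassemble(degraded, len(data)) == data)
```

A single parity share tolerates the loss of any one vendor; it does nothing for confidentiality, which is why the post pairs it with encryption.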
While data access and recovery are very important aspects of cloud computing, I'm curious about the legal issues surrounding the movement of data across national boundaries, or even across company boundaries.
How does the "cloud" protect data going from the owner to the computing service from being compromised (read that as sniffed)? Will a computing service in country A have the right to impose restrictions on data from another country (even if the results of the computing don't affect the citizens of country A)? And so on.
Hi Marc! Those are not the only problems. I'd be worried about trans-country laws governing the data. After all, once it is in country A, the laws of that country would hold.
Marc Evans wrote:
> I have been contemplating this issue of safe storage in the cloud. My
> opinion is that what I need is at least 4 in the cloud storage vendors,
> which I can then layer RAID5 behavior on top of, combined with a
> loopback encryption file system. Even with that, pulling the data into
> the compute cloud places the data in danger of being observable and
> possibly tamperable. This all ignores latency problems, which I am
> certain will be a problem, as well as transit costs.
>
> I personally would like my application-at-the-edge software to also span
> a number of in the cloud vendors, so that I don't experience vendor
> lock-in problems. In particular, I am concerned that my public facing
> services will be targets of DDoS attacks and as a result vendors will
> consider abruptly discontinuing service.
>
> For these reasons, I have not been able to consider much of what in the
> cloud providers can offer to date, though I continue to build proof of
> concept packages in preparation for the point in time that the industry
> evolves enough to facilitate my needs. I am very curious if others have
> similar concerns and if plausible solutions are being found...
>
> - Marc
I agree with your concerns. Thus far I have been using vendors within single governance regions, with a policy engine at my application layer to govern where data is allowed to be operated upon. So EU data stays in the EU, for example. As vendors grow to span multiple boundaries, if they do not provide programmatic interfaces that allow application-layer control of these issues, I may need to avoid those vendors.
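An application-layer policy check of the kind mentioned above might look something like the following minimal sketch. The policy table, provider names and region labels are all hypothetical; a real policy engine would need far richer rules. The one deliberate design choice shown is deny-by-default: data of an unknown origin may not be processed anywhere.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    name: str
    region: str


# Hypothetical policy: data origin -> regions where it may be processed.
POLICY = {
    "EU": {"EU"},
    "US": {"US", "EU"},
}


def may_process(data_origin, provider, policy=POLICY):
    """Return True only if the provider's region is allowed for this origin."""
    return provider.region in policy.get(data_origin, set())


def eligible_providers(data_origin, providers, policy=POLICY):
    """Filter a list of providers down to those the policy allows."""
    return [p for p in providers if may_process(data_origin, p, policy)]


providers = [
    Provider("vendor-a", "EU"),
    Provider("vendor-b", "US"),
]

print([p.name for p in eligible_providers("EU", providers)])
print([p.name for p in eligible_providers("US", providers)])
```

A check like this only works if each vendor exposes where the data will actually be stored and processed, which is the programmatic interface asked for above.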
Chaz. wrote:
> Hi Marc! Not the only problems. I'd be worried about trans-country laws
> governing the data. After all once it is in country A, the laws of that
> country would hold.
>
> Chuck Wegrzyn
> Marc Evans wrote: >> I have been contemplating this issue of safe storage in the cloud. My >> opinion is that what I need is at least 4 in the cloud storage vendors, >> which I can then layer RAID5 behavior on top of, combined with a >> loopback encryption file system. Even with that, pulling the data into >> the compute cloud places the data in danger of being observable and >> possibly tamperable. This all ignores latency problems, which I am >> certain will be a problem, as well as transit costs.
>> I personally would like my application-at-the-edge software to also span >> a number of in the cloud vendors, so that I don't experience vendor >> lock-in problems. In particular, I am concerned that my public facing >> services will be targets of DDoS attacks and as a result vendors will >> consider abruptly discontinuing service.
>> For these reasons, I have not been able to consider much of what in the >> cloud providers can offer to date, though I continue to build proof of >> concept packages in preparation for the point in time that the industry >> evolves enough to facilitate my needs. I am very curious if others have >> similar concerns and if plausible solutions are being found...
>> - Marc
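The "RAID5 behavior on top of at least 4 storage vendors" idea in Marc's message can be sketched roughly as follows. This is a single-stripe, single-parity illustration only: parity rotation and the loopback encryption layer are left out, and the function names are made up for the example.

```python
from functools import reduce
from typing import Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def split_with_parity(data: bytes, vendors: int = 4) -> tuple[list[bytes], int]:
    """Split data into vendors-1 equal chunks plus one XOR parity chunk."""
    k = vendors - 1
    size = -(-len(data) // k)  # ceiling division
    padded = data.ljust(size * k, b"\0")
    chunks = [padded[i * size:(i + 1) * size] for i in range(k)]
    parity = reduce(xor_bytes, chunks)
    # The original length is returned so padding can be stripped on rebuild.
    return chunks + [parity], len(data)


def rebuild(pieces: list[Optional[bytes]], length: int) -> bytes:
    """Recover the original bytes when at most one vendor's piece is missing."""
    missing = [i for i, p in enumerate(pieces) if p is None]
    if len(missing) > 1:
        raise ValueError("single parity tolerates only one lost vendor")
    if missing:
        present = [p for p in pieces if p is not None]
        pieces = list(pieces)
        # XOR of all surviving pieces equals the missing one.
        pieces[missing[0]] = reduce(xor_bytes, present)
    return b"".join(pieces[:-1])[:length]
```

Each of the four pieces would go to a different vendor, so losing any one vendor (outage, or abrupt discontinuation of service) still leaves the data recoverable. It does nothing for the latency and transit-cost concerns raised in the same message.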
>> Utpal Datta wrote: >>> You make all the right points on speed, bandwidth, Amazon charging on >>> bandwidth etc. But consider the need for the user (say a large >>> financial company with a sensitive business critical application),
>>> 1. who will guarantee that the data in S3 is secure from physical and >>> logical access
>>> 2. who will guarantee that the data is always available using a >>> multi-site recovery system (that is what they would have in their own >>> data center) that meets their RPO (Recovery Point Objective) and RTO >>> (Recovery Time Objective) guidelines.
>>> Either Amazon or other Cloud providers will make these available with >>> EC2 with SP3 (or some other storage mechanism with more robust >>> security and availability characteristics) or the users will have to >>> build something similar on their own using EC2 as their basic building >>> block.
>>> This will be a *very* non-trivial task for any user to do on their own >>> and they will have to make the decision to put resources to build this >>> on a cloud or to invest more on their own datacenter.
>>> So I guess a lot will depend on the level of maturity of the clouds. >>> Not sure if all this work belong in a mid-layer outside of the >>> original cloud and leave the cloud providers just to provide the basic >>> building blocks
>>> --utpal
>>> On Thu, Jun 19, 2008 at 11:08 AM, Chris K Wensel <ch...@wensel.net> wrote: >>>> On Jun 18, 2008, at 8:15 AM, Utpal Datta wrote:
>>>>> 1. I think "data in the cloud" is so far a big block to widespread >>>>> adoption and using cloud for large, sensitive and mission critical >>>>> applications (especially for Financial organization). Is someone >>>>> thinking of a way to leave the data within the user-premises and do >>>>> just the computing in the cloud? Kind of a reverse connection back to >>>>> the user datacenter.
>>>>> That way the conventional data repositories can still be used. The >>>>> users will not have to worry about the reliability, availability and >>>>> (to a large part) security of the data. We still have to worry about >>>>> the security of the data travelling back and forth to and from the >>>>> cloud to the user data center.
>>>>> This probably is more relevant for medium to large scale users with >>>>> "sensitive" data.
>>>>> Comments? tips? >>>> I've been processing large historical data sets for a Financial >>>> company I'm consulting with using Cascading/Hadoop on EC2/S3.
>>>> The biggest bottleneck has been getting data to the compute >>>> infrastructure.
>>>> The obvious pattern is to have datacenter processes push data to S3, >>>> then have the temporary cluster spin up and pull data from S3, do >>>> something interesting, then push the results to S3, notify the >>>> datacenter the job is complete (SQS), have the datacenter pull down >>>> the results from S3.
>>>> Because of the need to support both well defined daily processes and >>>> ad-hoc processes, my client's data generally needs to stay on S3. >>>> Having it pulled from a remote datacenter on duplicate runs would be >>>> extraordinarily slow and expensive considering Amazon charges for >>>> bandwidth in and out. Plus, it is a bit cheaper just to keep data on >>>> S3 than to buy a NAS for storage.
>>>> That said, with bandwidth being the bottleneck in the face of the >>>> ability to spin up 100 or 1000 nodes to crunch numbers, larger pipes >>>> into a vendor's Cloud would be very welcome. Otherwise your Cloud >>>> solution is only as fast as getting data in and out of it.
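The point that a Cloud solution "is only as fast as getting data in and out of it" can be put into rough numbers. The sketch below is back-of-the-envelope arithmetic only: the data size, link speed, and CPU-hours are assumed figures for illustration, and protocol overhead is ignored.

```python
def transfer_hours(gigabytes: float, megabits_per_second: float) -> float:
    """Hours needed to move `gigabytes` over a link of the given speed."""
    bits = gigabytes * 8 * 1000**3  # decimal gigabytes to bits
    seconds = bits / (megabits_per_second * 1000**2)
    return seconds / 3600


def job_hours(gigabytes: float, megabits_per_second: float,
              cpu_hours: float, nodes: int) -> float:
    """Upload time plus compute time spread evenly across `nodes`."""
    return transfer_hours(gigabytes, megabits_per_second) + cpu_hours / nodes


if __name__ == "__main__":
    # Assumed workload: 1000 GB of input, a 100 Mbit/s pipe, 1000 CPU-hours.
    for n in (100, 1000):
        print(n, "nodes:", round(job_hours(1000, 100, 1000, n), 1), "hours")
```

Under these assumed figures the upload alone takes about 22 hours, so going from 100 to 1000 nodes only shortens the whole job from roughly 32 to roughly 23 hours. That is why keeping the data on S3 between runs, rather than re-uploading it, matters so much.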
Data locality is definitely a huge issue in the cloud. My company works with a lot of multi-nationals with huge data sets in various countries. Many countries, especially those in the EU as well as others such as Mexico, have fairly strict laws around private data (e.g., data containing personal information). Some of these multi-national companies have to architect their on-premise software around these restrictions (e.g., putting on-premise software in each country) and restrict the data movement. One of them took several months to study the laws and legality of data location and movement before implementing their solution.
So the location of the cloud and data is definitely going to be very important to these multi-nationals. That's part of the reason why Amazon has an EU cloud and Salesforce is building a cloud in Singapore. Some countries are also wary of putting any data inside the U.S. due to concerns about the Patriot Act. In general, the country where the data resides has jurisdiction over it.
> While data access and recovery is a very important aspect of cloud > computing, I'm curious as to the legal issues surrounding the movement > of data across national boundaries or even across company boundaries.
> How does the "cloud" protect data going from the owner to the computing > service without being compromised (read that as sniffed)? Will a > computing service in country A have the right to impose restrictions on > data from another country (even if the results of the computing don't > affect the citizens of country A)? And so on.
> Chuck Wegrzyn
One big concern is compliance with the data privacy laws in the EU and other countries, which require that personal data be protected and not be transmitted to locations that offer less protection. Since the laws in the US are generally less protective than those in the EU, additional controls/agreements need to be in place to legally move the data from the EU to the US.
Rick

-----Original Message-----
From: cloud-computing@googlegroups.com [mailto:cloud-computing@googlegroups.com] On Behalf Of Chaz.
Sent: Thursday, June 19, 2008 11:58 AM
To: cloud-computing@googlegroups.com
Subject: Issues of data in the cloud...
While data access and recovery is a very important aspect of cloud computing, I'm curious as to the legal issues surrounding the movement of data across national boundaries or even across company boundaries.
How does the "cloud" protect data going from the owner to the computing service without being compromised (read that as sniffed)? Will a computing service in country A have the right to impose restrictions on data from another country (even if the results of the computing don't affect the citizens of country A)? And so on.
Chuck Wegrzyn
I think privacy is one aspect of data movement, but what I see as a bigger problem is that it might become a national security issue. How about one country not allowing the data to leave once "it" has possession? Or organizations like the NSA mining the data as it passes across borders?
On SaaS wrote: > Data locality is definitely a huge issue in the cloud. My company works > with a lot of multi-nationals with huge data sets in various countries. > Many countries, especially those in the EU as well as others such as > Mexico, have fairly strict laws around private data (e.g., data > containing personal information). Some of these multi-national companies > have to architect their on-premise software around these restrictions > (e.g., putting on-premise software in each country) and restrict the > data movement. One of them took several months to study the laws and > legality of data location and movement before implementing their solution.
> So the location of the cloud and data is definitely going to be very > important to these multi-nationals. That's part of the reason why > Amazon has an EU cloud and Salesforce is building a cloud in > Singapore. Some countries are also wary of putting any data > inside the U.S. due to concerns about the Patriot Act. In general, the > country where the data resides has jurisdiction over it.
That probably works well now. In the future I would expect compute clouds to be available in 'cheaper' locales (think of Washington State...lol, or Finland); at that point it becomes a real issue.
> I agree with your concerns. Thus far I have been using vendors within > single governance regions, and then having a policy engine at my > application layer to govern where data is allowed to be operated upon. > So, EU data stays in the EU for example. As the vendors grow to span > multiple boundaries, if they are not providing programmatic interfaces > to allow application layer control of these issues, I may need to avoid > those vendors.
> - Marc
While trans-national data movement will be an area that requires governance of some kind, I think that companies can get around the problem in other ways. It just requires looking at the problem in a different way.
I'd think the approach is to keep the data still and move the computing to it. The idea is to see the thousands of machines it takes to hold the petabytes worth of data as the compute cloud. What needs to move to the data are the programs that can process it. I've been working on this approach for the last 3 years (Twisted Storage).
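As a rough illustration of "keep the data still and move the computing to it", here is a small Python sketch. It is not how Twisted Storage works (nothing in this thread describes its internals); `StorageNode`, `run_everywhere`, and the sample job are hypothetical names invented for the example.

```python
from collections import Counter
from typing import Callable, Iterable


class StorageNode:
    """A machine that both holds records and can run code against them."""

    def __init__(self, name: str, records: Iterable[dict]):
        self.name = name
        self._records = list(records)  # the raw data never leaves the node

    def run(self, job: Callable[[list], Counter]) -> Counter:
        # The program travels to the data; only its small result is returned.
        return job(self._records)


def count_by_symbol(records: list) -> Counter:
    """Example job: count records per symbol on a single node."""
    return Counter(r["symbol"] for r in records)


def run_everywhere(nodes: list, job: Callable[[list], Counter]) -> Counter:
    """Ship the same job to every node and merge the partial results."""
    total = Counter()
    for node in nodes:
        total += node.run(job)
    return total
```

Only the per-node `Counter` results cross the network, which is typically far smaller than the underlying records. The same shape also keeps each node's data inside whatever jurisdiction the node sits in.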
Pittard, Rick wrote: > One big concern is compliance with the data privacy laws in the EU and > other countries, which require that personal data be protected and > not be transmitted to locations that offer less protection. Since the > laws in the US are generally less protective than those in the EU, > additional controls/agreements need to be in place to legally move the > data from the EU to the US.
> Rick
On Thu, Jun 19, 2008 at 1:40 PM, Chaz. <eprparad...@gmail.com> wrote: > [snip] > I'd think the approach is to keep the data still and move the computing > to it. The idea is to see the thousands of machines it takes to hold the > petabytes worth of data as the compute cloud. What needs to move to it > is the programs that can process the data. I've been working on this > approach for the last 3 years (Twisted Storage).
> Chuck Wegrzyn
This is a valid approach, which I personally call the "Plumber Pattern": an application, encapsulated in some kind of container (e.g. a virtual machine image), is marshalled to secure data islands to iteratively do its unique work (say, matching on some criterion in Interpol, FBI, CIA, MI5 and other databases, all distributed across continents). Due to the utterly confidential nature of these types of data, it is impossible to move them to public storage (at least at this time). The above-mentioned case might be extrapolated to some lines of business as well, with reduced privacy/security requirements.
I think Chaz is right. There are privacy issues regarding the use and exposure of data that vary country by country. If the cloud does the computing, there is no control over where that data is moved for computation, right?
Date: Thu, 19 Jun 2008 13:40:20 To:cloud-computing@googlegroups.com
Subject: Re: Issues of data in the cloud...
While I think trans-national data movement will be an area that requires governance of some kind I think that companies can get around the problem in other ways. I think it just requires looking at the problem in a different way.
I'd think the approach is to keep the data still and move the computing to it. The idea is to see the thousands of machines it takes to hold the petabytes worth of data as the compute cloud. What needs to move to it is the programs that can process the data. I've been working on this approach for the last 3 years (Twisted Storage).
Chuck Wegrzyn
Pittard, Rick wrote:
> One big concern is compliance with the data privacy laws in the EU and
> other countries, which require that personal data be protected and that it
> not be transmitted to locations that have fewer protections. Since the
> laws in the US are generally less protective than those in the EU,
> additional controls/agreements need to be in place to legally move the
> data from the EU to the US.
> Rick
> -----Original Message-----
> From: cloud-computing@googlegroups.com
> [mailto:cloud-computing@googlegroups.com] On Behalf Of Chaz.
> Sent: Thursday, June 19, 2008 11:58 AM
> To: cloud-computing@googlegroups.com
> Subject: Issues of data in the cloud...
> While data access and recovery is a very important aspect of cloud
> computing, I'm curious as to the legal issues surrounding the movement
> of data across national boundaries or even across company boundaries.
> How does the "cloud" protect data going from the owner to the computing
> service without being compromised (read that as sniffed)? Will a
> computing service in country A have the right to impose restrictions on
> data from another country (even if the results of the computing don't
> affect the citizens of country A)? And so on.
> Chuck Wegrzyn
> Utpal Datta wrote:
>> You make all the right points on speed, bandwidth, Amazon charging on
>> bandwidth etc. But consider the need for the user (say a large
>> financial company with a sensitive business critical application),
>> 1. who will guarantee that the data in S3 is secure from physical and
>> logical access
>> 2. who will guarantee that the data is always available using a
>> multi-site recovery system (that is what they would have in their own
>> data center) that meets their RPO (Recovery Point Objective) and RTO
>> (Recovery Time Objective) guidelines.
>> Either Amazon or other Cloud providers will make these available with
>> EC2 with SP3 (or some other storage mechanism with more robust
>> security and availability characteristics) or the users will have to
>> build something similar on their own using EC2 as their basic building
>> block.
>> This will be a *very* non-trivial task for any user to do on their own
>> and they will have to make the decision to put resources to build this
>> on a cloud or to invest more on their own datacenter.
>> So I guess a lot will depend on the level of maturity of the clouds.
>> Not sure if all this work belongs in a mid-layer outside of the
>> original cloud, leaving the cloud providers just to provide the basic
>> building blocks.
>> --utpal
>> On Thu, Jun 19, 2008 at 11:08 AM, Chris K Wensel <ch...@wensel.net> wrote:
>>> On Jun 18, 2008, at 8:15 AM, Utpal Datta wrote:
>>>> 1. I think "data in the cloud" is so far a big block to widespread
>>>> adoption and using cloud for large, sensitive and mission critical
>>>> applications (especially for financial organizations). Is someone
>>>> thinking of a way to leave the data within the user premises and do
>>>> just the computing in the cloud? Kind of a reverse connection back to
>>>> the user datacenter.
>>>> That way the conventional data repositories can still be used. The
>>>> users will not have to worry about the reliability, availability and
>>>> (to a large part) security of the data. We still have to worry about
>>>> the security of the data travelling back and forth to and from the
>>>> cloud to the user data center.
>>>> This probably is more relevant for medium to large scale users with
>>>> "sensitive" data.
>>>> Comments? tips?
>>> I've been processing large historical data sets for a Financial
>>> company I'm consulting with using Cascading/Hadoop on EC2/S3.
>>> The biggest bottleneck has been getting data to the compute
>>> infrastructure.
>>> The obvious pattern is to have datacenter processes push data to S3,
>>> then have the temporary cluster spin up and pull data from S3, do
>>> something interesting, then push the results to S3, notify the
>>> datacenter the job is complete (SQS), have the datacenter pull down
>>> the results from S3.
>>> Because of the need to support both well defined daily processes and
>>> ad-hoc processes, my client's data generally needs to stay on S3.
>>> Having it pulled from a remote datacenter on duplicate runs would be
>>> extraordinarily slow and expensive considering Amazon charges for
>>> bandwidth in and out. Plus, it is a bit cheaper just to keep data on
>>> S3 than to buy a NAS for storage.
>>> That said, with bandwidth being the bottleneck in the face of the
>>> ability to spin up 100 or 1000 nodes to crunch numbers, larger pipes
>>> into a vendor's Cloud would be very welcome. Otherwise your Cloud
>>> solution is only as fast as getting data in and out of it.
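The push / process / notify / pull pattern Chris describes above can be sketched in Python as follows. To keep it runnable without an AWS account, `ObjectStore` and `MessageQueue` are in-memory stand-ins written for this example; they are not the S3 or SQS APIs, and the key names and the trivial "something interesting" computation are assumptions as well.

```python
from collections import deque


class ObjectStore:
    """In-memory stand-in for an object store such as S3 (not the real API)."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        self._objects[key] = data

    def get(self, key):
        return self._objects[key]


class MessageQueue:
    """In-memory stand-in for a message queue such as SQS (not the real API)."""

    def __init__(self):
        self._messages = deque()

    def send(self, body):
        self._messages.append(body)

    def receive(self):
        # Return the oldest message, or None when the queue is empty.
        return self._messages.popleft() if self._messages else None


def datacenter_push(store, key, rows):
    """Step 1: a datacenter process pushes input data to the store."""
    store.put(key, rows)


def cluster_job(store, queue, input_key, output_key):
    """Steps 2-4: the temporary cluster pulls, computes, pushes results, notifies."""
    rows = store.get(input_key)
    result = {"count": len(rows), "total": sum(rows)}  # placeholder computation
    store.put(output_key, result)
    queue.send({"status": "complete", "output_key": output_key})


def datacenter_pull(store, queue):
    """Step 5: the datacenter reads the notification and pulls the results."""
    message = queue.receive()
    if message is None or message["status"] != "complete":
        return None
    return store.get(message["output_key"])


store, queue = ObjectStore(), MessageQueue()
datacenter_push(store, "input/trades", [10, 20, 30])
cluster_job(store, queue, "input/trades", "output/summary")
summary = datacenter_pull(store, queue)
print(summary)
```

Note that the input stays in the store after the job finishes, which matches the point above about keeping data on S3 so that repeated or ad-hoc runs do not have to pull it from the datacenter again.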
And I would think the cloud providers will have to start answering these questions if they want large enterprises to start adopting the cloud. There may be no control over which server in the cloud is doing the computation, but service providers may provide options to restrict it to particular geographic domains.
We have quite a few people here from the cloud providers, maybe they can share some insight?
thx
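As a rough illustration of what "restrict to particular geographic domains" could look like from the customer's side, here is a small Python sketch. It does not describe any provider's actual feature; the `Server` type, the region labels and the `place_job` function are hypothetical. The customer still has no say over which individual server runs the job, only over the set of regions it may be picked from.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Server:
    """Hypothetical cloud server tagged with the region it is located in."""
    name: str
    region: str


def eligible_servers(servers: List[Server], allowed_regions: Set[str]) -> List[Server]:
    """Keep only servers located in a region the customer has allowed."""
    return [server for server in servers if server.region in allowed_regions]


def place_job(servers: List[Server], allowed_regions: Set[str]) -> Server:
    """Pick a server for the job, refusing to run outside the allowed regions."""
    candidates = eligible_servers(servers, allowed_regions)
    if not candidates:
        raise ValueError("no server satisfies the geographic restriction")
    # A real scheduler would weigh load and cost; taking the first is a simplification.
    return candidates[0]


fleet = [Server("s1", "us"), Server("s2", "eu"), Server("s3", "eu")]
chosen = place_job(fleet, {"eu"})
print(chosen)
```

Failing loudly when no server qualifies, rather than falling back to another region, is the design choice that matters here: a silent fallback would defeat the restriction.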
I know from my work that many firms are reluctant to let their data "out the door" since they see that as their edge in the market. But even setting that aside for a minute, it seems to make more sense to move "small" programs (relative to the size of the data) than to move massive amounts of data.
So my question is as follows: what makes a good "storage cloud"?
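The argument for moving small programs rather than massive data comes down to simple arithmetic, sketched below in Python. The dataset size, program size and link speed are assumed figures chosen for illustration, not numbers from this thread, and the calculation ignores protocol overhead, latency and compression.

```python
def transfer_seconds(size_bytes: float, link_bits_per_second: float) -> float:
    """Ideal time to move size_bytes over a link, ignoring overhead and latency."""
    return size_bytes * 8 / link_bits_per_second


# All three figures are assumptions for the sake of the example.
DATASET_BYTES = 1e12   # 1 TB of data
PROGRAM_BYTES = 1e6    # 1 MB program
LINK_BPS = 100e6       # 100 Mbit/s link

data_seconds = transfer_seconds(DATASET_BYTES, LINK_BPS)
program_seconds = transfer_seconds(PROGRAM_BYTES, LINK_BPS)

print(f"dataset: {data_seconds / 3600:.1f} hours")
print(f"program: {program_seconds:.2f} seconds")
```

Under these assumptions the dataset takes about 22 hours to move while the program takes a fraction of a second, a ratio equal to the ratio of their sizes (a million to one here).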
> On Thu, Jun 19, 2008 at 1:40 PM, Chaz. <eprparad...@gmail.com> wrote:
> [snip]
> I'd think the approach is to keep the data still and move the computing
> to it. The idea is to see the thousands of machines it takes to hold the
> petabytes worth of data as the compute cloud. What needs to move to it
> is the programs that can process the data. I've been working on this
> approach for the last 3 years (Twisted Storage).
> Chuck Wegrzyn
> This is valid approach, that I personally called "Plumber Pattern", when
> application, encapsulated in some kind of container (e.g. virtual
> machine image) is marshalled to secure data islands to iteratively do
> its unique work (say, do a matches on some criterium in Interpol, FBI,
> CIA, MI5 and other databases, all distributed across continents). Due to
> utterly confidential nature of these types of data, it is impossible to
> move them to public storage (at least this time). Above-mentioned case
> might be extrapolated to some lines of business as well with reduced
> privacy/security requirements.
That is one approach - again it seems to indicate the model is the data moving to the compute resources. The other approach is to look at it from the data perspective - can the data sit some place and the compute come to it?
On SaaS wrote:
> That depends on how the cloud is architected, no?