You can also look at LucidEra - http://www.lucidera.com/solutions/index.php
--Naren
I am *very* new to this group, but I am really excited by the quality
of postings here. I am learning a lot, quickly.
I have a couple of questions. Maybe someone has some answers.
1. I think "data in the cloud" is so far a big block to widespread
adoption and using cloud for large, sensitive and mission critical
applications (espicially for Financial organization). Is someone
thinking of a way to leave the data within the user-premises and do
just the computing in the cloud? Kind of a reverse connection back to
the user datacenter.
That way the conventional data respositories can still be used. The
users will not have to worry about the reliability, availability and
(to a large part) security of the data. We still have to worry about
the security of the data travelling back and forth to and from the
cloud to the user data center.
This probably is more relevant for medium to large scale users with
"sensitive" data.
Comments? tips?
2. Considering that cloud computing is at the beginning of its
adoption curve, the user's data center will, for a long time, contain a
mixture of their own physical and virtual devices alongside their
"virtual" data centers in one or more clouds (possibly from different
vendors).
Users will obviously look for a management portal that seamlessly
crosses the boundaries of physical, virtual, and cloud devices (for
discovery and monitoring at the very least).
Is there any talk or thought about standardizing "cloud management
action" and "cloud management data" interfaces?
Comments? Tips?
Thanks
--utpal
1. I think "data in the cloud" is so far a big block to widespread
adoption and using cloud for large, sensitive and mission critical
applications (espicially for Financial organization). Is someone
thinking of a way to leave the data within the user-premises and do
just the computing in the cloud? Kind of a reverse connection back to
the user datacenter.
That way the conventional data respositories can still be used. The
users will not have to worry about the reliability, availability and
(to a large part) security of the data. We still have to worry about
the security of the data travelling back and forth to and from the
cloud to the user data center.
This probably is more relevant for medium to large scale users with
"sensitive" data.
Comments? tips?
> 1. I think "data in the cloud" is so far a big block to widespread
> adoption and using cloud for large, sensitive and mission critical
> applications (espicially for Financial organization). Is someone
> thinking of a way to leave the data within the user-premises and do
> just the computing in the cloud? Kind of a reverse connection back to
> the user datacenter.
>
> That way the conventional data respositories can still be used. The
> users will not have to worry about the reliability, availability and
> (to a large part) security of the data. We still have to worry about
> the security of the data travelling back and forth to and from the
> cloud to the user data center.
>
> This probably is more relevant for medium to large scale users with
> "sensitive" data.
>
> Comments? tips?
I've been processing large historical data sets for a financial
company I'm consulting with, using Cascading/Hadoop on EC2/S3.
The biggest bottleneck has been getting data to the compute
infrastructure.
The obvious pattern is to have datacenter processes push data to S3,
then have the temporary cluster spin up and pull the data from S3, do
something interesting, and push the results back to S3, notifying the
datacenter that the job is complete (via SQS) so it can pull the
results down from S3.
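Roughly, the round trip looks like this (a sketch using the classic
Python boto library; the bucket name, queue name, and key paths are
made up):

# Sketch of the push / crunch / notify / pull pattern above (classic boto).
# Bucket name, queue name, and key paths are illustrative assumptions.
import boto
from boto.sqs.message import Message

BUCKET = "example-jobs"        # hypothetical bucket
QUEUE = "example-job-done"     # hypothetical queue

# Datacenter side: push the input data up to S3.
s3 = boto.connect_s3()
bucket = s3.get_bucket(BUCKET)
bucket.new_key("input/day.tsv").set_contents_from_filename("day.tsv")

# Cluster side, after the jobs finish: push results, then notify via SQS.
bucket.new_key("output/day.tsv").set_contents_from_filename("results.tsv")
queue = boto.connect_sqs().create_queue(QUEUE)
note = Message()
note.set_body("output/day.tsv")            # tell the datacenter which key
queue.write(note)

# Datacenter side again: poll the queue, then pull the results down.
msg = queue.read(visibility_timeout=60)
if msg is not None:
    bucket.get_key(msg.get_body()).get_contents_to_filename("results.tsv")
    queue.delete_message(msg)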
Because of the need to support both well-defined daily processes and
ad-hoc processes, my client's data generally needs to stay on S3.
Having it pulled from a remote datacenter on duplicate runs would be
extraordinarily slow and expensive, considering Amazon charges for
bandwidth in and out. Plus, it is a bit cheaper just to keep the data
on S3 than to buy a NAS for storage.
That said, with bandwidth being the bottleneck in the face of the
ability to spin up 100 or 1000 nodes to crunch numbers, larger pipes
into a vendor's cloud would be very welcome. Otherwise your cloud
solution is only as fast as getting data in and out of it.
chris
--
Chris K Wensel
ch...@wensel.net
http://chris.wensel.net/
http://www.cascading.org/
> That said, with bandwidth being the bottleneck in the face of the
> ability to spin up 100 or 1000 nodes to crunch numbers
How big are the datasets you're working with? Random or linear access?
Timothy Huber
Strategic Account Development
tim....@metaram.com
cell 310 795.6599
MetaRAM Inc.
181 Metro Drive, Suite 400
San Jose, CA 95110
1. Who will guarantee that the data in S3 is secure from physical and
logical access?
2. Who will guarantee that the data is always available, using a
multi-site recovery system (which is what they would have in their own
data center) that meets their RPO (Recovery Point Objective) and RTO
(Recovery Time Objective) guidelines?
Either Amazon or other cloud providers will make these available with
EC2 and S3 (or some other storage mechanism with more robust
security and availability characteristics), or users will have to
build something similar on their own using EC2 as their basic building
block.
That will be a *very* non-trivial task for any user to do on their own,
and they will have to decide whether to put resources into building this
on a cloud or to invest more in their own datacenter.
So I guess a lot will depend on the level of maturity of the clouds.
I'm not sure whether all this work belongs in a mid-layer outside of the
original cloud, leaving the cloud providers to supply just the basic
building blocks.
--utpal
Total data is hundreds of GB. Individual workloads are ~10 GB. All
linear (this being Hadoop), but there is much joining, binning, and
crunching between the multiple input datasets (the actual workload
translates to ~60 MapReduce jobs, all rendered and managed by Cascading).
So it kinda sucks to have uploads of data to the cluster take longer
than it does to compute on it. Worse, my client then has to fetch
the derived data back.
ckw
I personally would like my application-at-the-edge software to span
a number of in-the-cloud vendors, so that I don't experience vendor
lock-in problems. In particular, I am concerned that my public-facing
services will be targets of DDoS attacks and that, as a result, vendors
will consider abruptly discontinuing service.
For these reasons, I have not been able to consider much of what in-the-
cloud providers can offer to date, though I continue to build proof-of-
concept packages in preparation for the point in time when the industry
evolves enough to meet my needs. I am very curious whether others have
similar concerns and whether plausible solutions are being found...
- Marc
How does the "cloud" protect data going from the owner to the computing
service without being compromised (read that as sniffed)? Will a
computing service in country A have the right to impose restrictions on
data from another country (even if the results of the computing don't
affect the citizens of country A)? An so on.
Chuck Wegrzyn
I agree with your concerns. Thus far I have been using vendors within
single governance regions, and then having a policy engine at my
application layer govern where data is allowed to be operated upon.
So EU data stays in the EU, for example (a minimal sketch of the idea
follows below). As the vendors grow to span multiple boundaries, if
they do not provide programmatic interfaces that allow application-layer
control of these issues, I may need to avoid those vendors.
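As a trivial illustration, the policy check is just a table lookup in
front of every job dispatch (the classifications and region names below
are invented):

# Minimal sketch of an application-layer data-residency policy engine.
# The classifications and region names are hypothetical.
ALLOWED_REGIONS = {
    "eu-personal":  {"eu-west-1"},               # EU data stays in the EU
    "us-financial": {"us-east-1"},
    "public":       {"us-east-1", "eu-west-1"},
}

def assert_residency(data_class, target_region):
    """Refuse to operate on data in a region its policy forbids."""
    allowed = ALLOWED_REGIONS.get(data_class, set())
    if target_region not in allowed:
        raise PermissionError("policy forbids %r data in %s (allowed: %s)"
                              % (data_class, target_region, sorted(allowed)))

assert_residency("eu-personal", "eu-west-1")  # fine
assert_residency("eu-personal", "us-east-1")  # raises PermissionError

The point is that the application, not the vendor, holds the table, so
moving to a multi-region vendor only changes the dispatch code underneath.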
- Marc
On SaaS wrote:
> Data locality is definitely a huge issue in the cloud. My company works
> with a lot of multi-nationals with huge data sets in various countries.
> Many countries, especially the EU ones, as well as Mexico, have
> fairly strict laws around privacy data (e.g., data with personal
> info, etc.). Some of these multi-nationals have had to architect
> their on-premise software around these restrictions (e.g., putting
> on-premise software in each country) and restrict data movement. One
> of them took several months to study the laws and legality of data
> location and movement before implementing their solution.
>
> So the location of the cloud and the data is definitely going to be very
> important to these multi-nationals. That's part of the reason why
> Amazon has an EU cloud and Salesforce is building a cloud in
> Singapore. Some countries are also wary of putting any data
> inside the U.S. due to concerns about the Patriot Act. In general, the
> country where the data resides has jurisdiction over it.
>
> --
> OnSaaS.net - Blogging about the SaaS and cloud computing world
> OnSaaS.info - Providing a continuous stream of SaaS and cloud computing news
> Follow on http://twitter.com/onsaas, http://friendfeed.com/onsaas
I'd think the approach is to keep the data still and move the computing
to it. The idea is to treat the thousands of machines it takes to hold
the petabytes worth of data as the compute cloud. What needs to move is
the programs that can process the data (a rough sketch of the pattern is
below). I've been working on this approach for the last 3 years
(Twisted Storage).
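Not Twisted Storage itself, but a toy sketch of the general pattern:
each storage node runs a tiny service that accepts a program, runs it
against the node-local data, and returns only the small answer. The
port, path, and job protocol here are all invented for illustration:

# Toy sketch of moving the computation to the data (not Twisted Storage).
# A client submits Python source defining result(path); only the small
# answer crosses the network. WARNING: exec() of submitted code is shown
# purely to illustrate the pattern; a real system would sandbox jobs.
from xmlrpc.server import SimpleXMLRPCServer

DATA_PATH = "/var/data/records.tsv"   # node-local data; it never moves

def run_job(source):
    env = {}
    exec(source, env)                 # job must define result(path)
    return env["result"](DATA_PATH)

if __name__ == "__main__":
    server = SimpleXMLRPCServer(("0.0.0.0", 9000), allow_none=True)
    server.register_function(run_job)
    server.serve_forever()

A client can then, say, count records without moving them:

from xmlrpc.client import ServerProxy
job = "def result(p):\n    return sum(1 for _ in open(p))"
print(ServerProxy("http://storage-node:9000").run_job(job))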
Chuck Wegrzyn
So my question is as follows: what makes a good "storage cloud"?
Chuck Wegrzyn
On SaaS wrote:
> That depends on how the cloud is architected, no?
>
> And I would think the cloud providers will have to start answering these
> questions if they want large enterprises to start adopting the
> cloud. There may be no control over which server in the cloud is doing the
> computation, but service providers may offer options to restrict it based
> on geographic domains.
>
> We have quite a few people here from the cloud providers, maybe they can
> share some insight?
>
> thx
>
> On Jun 19, 2008, at 10:44 AM, Stuart Altenhaus wrote:
>
>> I think Chaz is right. There are privacy issues regarding the use and
>> exposure of data that vary country by country. If the cloud computes
>> the data, there is no control over where that data is moved for
>> computation, right?
>>
>> R/s,
>> Stu Altenhaus
>>
>> Sent from my Verizon Wireless BlackBerry
>>
>> -----Original Message-----
>> From: "Chaz." <eprpa...@gmail.com <mailto:eprpa...@gmail.com>>
>>
>> Date: Thu, 19 Jun 2008 13:40:20
>> To:cloud-c...@googlegroups.com
>> <mailto:cloud-c...@googlegroups.com>
I definitely agree with your point. I can't think of very many
multi-nationals that would let their data out to wander around. I'd
think they would want to protect their data and move the computing
resources close to it....
Chuck
If it is in the cloud, then we are still dealing with the security,
availability, and recoverability issues (which everyone agrees on).
If it is in the user's data center, then how will the computing
resources offered (and controlled) by Amazon be brought to that specific
user's datacenter?
--utpal
If the cloud is a collection of compute resources, and you need to
apply them to lots of your data, you have little choice but to move
your data. You can't move the compute power (unless you order a
shipping container of servers, I guess).
ckw
1) Is it possible to have the app run on AWS so that the derived data does not need to traverse back down in real time? That way you could use a lazy download in the background to archive it in their DC while their app accesses the copy on AWS in real time.
2) I've been thinking about the problem of upload times as well (in the context of large DNA data sets). The cost of loading into AWS is not that prohibitive, so if one were to pre-process the data so that it could be uploaded in a bunch of parallel processes to AWS, you could reduce the bottleneck considerably. In theory.
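For example (a sketch with classic boto and multiprocessing; the bucket
name, chunk size, and worker count are guesses that would need tuning
against your uplink):

# Sketch: chunk a large file and upload the pieces to S3 from parallel
# worker processes. Bucket, chunk size, and pool size are assumptions.
import os
from multiprocessing import Pool
import boto

BUCKET = "dna-uploads"         # hypothetical bucket
CHUNK = 64 * 1024 * 1024       # 64 MB per piece

def split(path):
    """Write fixed-size pieces of path to disk and return their names."""
    parts = []
    with open(path, "rb") as src:
        index = 0
        while True:
            data = src.read(CHUNK)
            if not data:
                break
            name = "%s.part%04d" % (path, index)
            with open(name, "wb") as out:
                out.write(data)
            parts.append(name)
            index += 1
    return parts

def upload(part):
    # Each worker opens its own connection; connections don't share well
    # across fork().
    bucket = boto.connect_s3().get_bucket(BUCKET)
    bucket.new_key(os.path.basename(part)).set_contents_from_filename(part)

if __name__ == "__main__":
    Pool(processes=8).map(upload, split("reads.fastq"))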
Now, if we have to build out an autonomic system first, we will never
have secure cloud computing. No system today is so tight that it can't
be hacked. Just look at all the attempts to protect DVDs or BD discs...
Chuck Wegrzyn
Lynne VanArsdale wrote:
> Just joined cloud-computing and this is the first conversation I've
> received.
>
> A couple of weeks ago I attended Gartner Security where Neil MacDonald
> spoke on "Adaptive Security." In a nutshell, this approach builds a
> resilient system for secure data, acting much like the human immune
> system. It involves whitelisting as the foundation, blacklisting as a
> mid-tier and learned/adaptive mechanisms at the top. In such an
> environment, elements would be "autonomic" and self-managing to a large
> degree, and would share and communicate with other elements to protect
> workloads and information (as opposed to endpoints). There is a lot
> more to this vision, and it is probably a number of years away, but it
> may be a reasonable approach to address the concerns about data security
> being discussed here.
>
> In any case, does anyone know of any product or standards efforts for
> the industry to collaborate on a more cohesive architecture for security
> in the cloud?
>
>
> On 6/19/08, *Chaz.* <eprpa...@gmail.com> wrote:
>
>
> Jim,
>
> I definitely agree with your point. I can't think of very many
> multi-nationals that would let their data out to wander around. I'd
> think they would want to protect their data and move the computing
> resources close to it....
>
> Chuck
>
> Jim Peters wrote:
> > Even if the cloud providers come up with excellent answers to the
> > security and reliability questions, who's going to trust them? Credit
> > card numbers are one thing, but cloud data is something else
> > entirely.
> >
> > +J
> >
> > On Thu, Jun 19, 2008 at 10:57 AM, On SaaS <ons...@gmail.com> wrote:
I'm not sure I would agree that you have to ship your data somewhere
else. After all, a "cloud data provider" could create just the right
secure environment for holding the data and processing it (isn't that
really what S3 is all about?). The only thing the using company needs to
do is write the program and have it installed, more or less
automagically, on the machines that hold its data.
Chuck Wegrzyn
I suggest you do the same regarding security: just assume it's a
hostile environment.
The question is, what features of AWS support you in this? Shared
keychains/stores, encrypted volumes, a CA, Kerberos? Or will this
always be left to the user? And could you ever really trust those
services the same way you trust them not to lose data?
That said, I'm not a security person. What "cloud security services"
could a provider provide? Or should they even bother?
ckw
http://chris.wensel.net/
http://www.cascading.org/
You are absolutely correct. Once you have a person involved, it can be
compromised. It is all about risk and how to make it so small that it
would take an act of God (or a really large budget) to breach it!
Chuck
Perhaps the real solution is to carefully architect your solution to
provide "bulk" services outside the company and keep the critical
things - those that are absolutely vital - inside the company.
Chuck Wegrzyn
The nice thing is that we can be very flexible with our strategy as it
relates to where the data will reside, how it will be stored, and at
what rate it will be synchronized from the application. We can change it
over time to best fit our application scenario and constraints without
touching our application code.
" And CORBA isn't what I am thinking of, or even HADOOP but things like
JavaSpaces (?)."
JavaSpaces is indeed more relevant for this type of scenarios.
What's unique about JavaSpaces IMO is that it can be used for handling
both the compute side and the data storage. The references above shows
how you could use space-based storage for handling the data side.
Now that its stored in the in-memory space cluster you can easily use
the same space to route business logic on those in-memory instances in
parallel. There's a nice way to abstract that from the user using a
remoting abstraction - see more details on how that works here:
http://uri-cohen.blogspot.com/2008/02/openspaces-svf-remoting-on-steroid
s.html
Nati S.
GigaSpaces
-----Original Message-----
From: cloud-c...@googlegroups.com
[mailto:cloud-c...@googlegroups.com] On Behalf Of Chaz.
For absolute control, you should stay absolutely in control of the
resources, and I don't think cloud computing is something for you.
If you want a secured environment, you should understand that the
administrator of the resource can read the memory. If you want to
prevent that, then you should look into secure techniques to hide the
memory contents. Ultimately your instructions must go through a
(virtual) CPU, which can also be obscured as to what it is doing, but
that's the only threat left in that solution.
Most cloud computing solutions don't allow for that level of security in
the cloud.
The other, more generic good thing to do, IMHO, is to encrypt all your
data that resides in the cloud, regardless of whether the need is
somewhere between absolutely necessary and a tiny wish. This would also
cover cases where some transfer protocol is inadvertently unencrypted.
For some bio-medical use cases in Grid computing (more my main field)
this approach is also being used. Decryption happens just prior to the
actual processing. A more advanced solution is the sliding-decrypted-
window approach, where the dataset is decrypted per section or block.
CPU usage goes up, but most of the file/database stays encrypted, and
the opportunity to snoop around on the resource is very limited.
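As a rough illustration of the per-block idea (not the actual Grid
tooling): if the file is encrypted with AES in CTR mode, any block can
be decrypted on its own by seeking the counter, so only a small window
is ever in the clear. The sketch below uses PyCryptodome and omits key
handling; a real scheme would also fold a per-file nonce into the
counter.

# Illustrative sliding-window decryption. AES-CTR lets the keystream
# start at any 16-byte boundary, so one block can be decrypted without
# touching the rest of the file. Key handling and nonces are omitted.
from Crypto.Cipher import AES
from Crypto.Util import Counter

BLOCK = 1024 * 1024                   # 1 MB window; an arbitrary choice

def encrypt_file(src, dst, key):
    ctr = Counter.new(128, initial_value=0)
    cipher = AES.new(key, AES.MODE_CTR, counter=ctr)
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        for chunk in iter(lambda: fin.read(BLOCK), b""):
            fout.write(cipher.encrypt(chunk))

def decrypt_block(path, key, index):
    """Decrypt only block `index`; everything else stays encrypted."""
    # The CTR counter advances once per 16-byte AES block, so block
    # `index` starts at counter value index * (BLOCK // 16).
    ctr = Counter.new(128, initial_value=index * (BLOCK // 16))
    cipher = AES.new(key, AES.MODE_CTR, counter=ctr)
    with open(path, "rb") as f:
        f.seek(index * BLOCK)
        return cipher.decrypt(f.read(BLOCK))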
cheers,
Oscar Koeroo
Hi Chris,
I've looked at this issue quite a bit too. There are a few ways that I think the problem can be "relieved":
1. Don't encourage your clients to download the entire data set. As long as you provide URLs to the "crunched data", they should only have to pull the data as needed (see the sketch after this list). You can also index the data using SDB - a nice convenience for searching it.
2. See if the customers can split the dataset into sub-datasets, each reachable via some sort of URL. When you run your Map job, each of the Map nodes will be responsible for downloading the data from your clients - you might get some benefit from parallelizing the download.
3. Use S3 as more of a backing store. If you don't have many clients consuming the data, or you think the clients will download the data soon after the MapReduce job is complete, they can download it directly from HDFS (http://hadoop.apache.org/core/docs/r0.17.0/hdfs_design.html#Browser+Interface).
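For item 1, generating such URLs is nearly a one-liner with classic
boto's signed query URLs (the bucket name and prefix below are made up):

# Sketch: hand clients expiring signed URLs to individual result objects
# instead of the whole data set. Bucket name and prefix are made up.
import boto

bucket = boto.connect_s3().get_bucket("crunched-results")
for key in bucket.list(prefix="run-2008-06-19/"):
    # Anyone holding this URL can fetch just that object, for one hour.
    print(key.generate_url(expires_in=3600))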
I don't know if that helps.
Regards,
Alan Ho
__________________________________
Chris Marino
SnapLogic, Inc.
Really Simple Integration
www.snaplogic.com
650-655-7200
" Question: Is the scenario you described any different from the one you presented at the Spring Experience conference in Miami FL, back in December 2007?"
The principles are quite the same – this