Re: [DataCleaner] Digest for datacleaner-dev@googlegroups.com - 2 Messages in 1 Topic


Prabhuram Mohan

Oct 16, 2011, 3:54:21 PM
to datacle...@googlegroups.com
Hi Kasper,

I appreciate your response.

I am a Data & BI Engineer by profession. I have consulted for several banks. I feel at home with SQL, SAS & R.
I have administered Dataflux and worked on Informatica DQ.

Dataflux and IDQ are monolithic and expensive. I thought something better should be possible.

I liked Data Cleaner. Sure, I would like to collaborate.

To start with: do you have any Amazon AMI images with Data Cleaner? If not, I would recommend creating one. It's an easy way to take Data Cleaner for a spin.

I have experience with Amazon EC2.

If you have some time we can have a chat.

thx
prabhu



On Fri, Oct 14, 2011 at 8:36 AM, <datacle...@googlegroups.com> wrote:

Group: http://groups.google.com/group/datacleaner-dev/topics

    Pmohan <mprab...@gmail.com> Oct 13 09:29AM -0700
     
    I was just exploring what it would take to
     
    1) make DataCleaner deployable in the cloud,
    2) make it accessible through a web client / API,
    3) support distributed job processing across multiple machines (a long shot,
    though).
     
    Thanks
    Pmohan

     

    "Kasper Sørensen" <kas...@eobjects.dk> Oct 13 07:03PM +0200
     
    Hi Pmohan,
     
    Thanks for the question, an interesting one!
     
    1+2) Actually we have been playing around with this idea already at Human
    Inference. We've already made some loose plans to be able to deploy DC jobs
    as invokable web services, running on a server. The architecture completely
    supports this idea and I see no major impediments, except "just doing it".
     
    3) For some tasks this is a good fit, for some features not. Specifically,
    the transformer and filter components are very analogous to the "map" part
    of a MapReduce system (like Hadoop or GridGain) and thus could be REALLY
    scalable. The Analyzer components are also somewhat analogous to "reduce" in
    a MapReduce system, but to make it work there we would have to impose
    certain restrictions on what an Analyzer can do, and specifically on how it
    saves state. So yes, it is in our thoughts, but it's not likely to be
    something we would create in the short term.
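Kasper's map/reduce analogy can be sketched roughly like this (illustrative Python with made-up `transform`/`analyze` functions; this is not DataCleaner's actual API):

```python
# Transformers/filters are "map"-like: purely per-row, no shared state,
# so they parallelize trivially. Analyzers are "reduce"-like: they fold
# rows into an aggregate, which is where the restrictions on how state
# is saved (it must be mergeable) would come in.

def transform(row):
    # map-like step: standardize each value independently
    return {k: v.strip().lower() for k, v in row.items()}

def analyze(rows):
    # reduce-like step: aggregate value counts per column
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts.setdefault(column, {}).setdefault(value, 0)
            counts[column][value] += 1
    return counts

rows = [{"name": " Alice "}, {"name": "alice"}, {"name": "Bob "}]
result = analyze(transform(r) for r in rows)
```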
     
    Now that you have a few answers, may I ask (out of curiosity) why you are
    asking? Are you considering building such an application? Would you maybe be
    interested in a cooperation?
     
    Best regards,
    Kasper
     

     

--
You received this message because you are subscribed to the Google Groups "DataCleaner-dev" group.
To post to this group, send email to datacle...@googlegroups.com.
To unsubscribe from this group, send email to datacleaner-d...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/datacleaner-dev?hl=en.

Kasper Sørensen

Oct 17, 2011, 4:53:47 AM
to datacle...@googlegroups.com
Hi Prabhu,

No, we don't have any Amazon images available; if you want to create one, I think that's fine. I don't see it as the primary way our users will want to access DC, since the installation is already very easy and unobtrusive.

But if you create an image I will be happy to link to it or add a news item about it on the DC website.

Would deploying DC on Amazon qualify as "DC in the cloud" in your opinion? Because then I think we're talking about slightly different things. I was replying more on the grounds of creating a web application that uses the DC engine to execute jobs in the browser. You're talking more about using the computing power of, e.g., Amazon's cloud offering to run DC jobs. Given that those machines work just like any other machines, I don't see any issues in doing so!

Best regards,
Kasper

Prabhuram Mohan

Oct 17, 2011, 9:37:15 AM
to datacle...@googlegroups.com
Hi Kasper,

I also don't think having a DC AMI qualifies as DC in the cloud. That said, having an image with DC set up along with sample files, sample jobs and other related tools would be immensely useful in my view. That is not a big deal anyway.

Looking at the bigger picture, DC in the cloud would be something like:


Batch Mode
    1) I go to a website, pick the data source (already uploaded / available via FTP), define a task/job and submit it through a web service / web interface.
    2) DC Cloud accepts the job and queues it for execution.
    3) Once the job is complete, the user is notified by callbacks.
    4) The results are available on the cloud, accessible through the web interface / web service.

Realtime Mode
    1) The user app makes a web service call to the DC cloud with a task (the task may be name validation, address validation, fuzzy duplicate detection, etc.)
    2) The DC cloud executes the task (using a predefined knowledge base) and returns a response to the user service.


From the user perspective, all I need to know is the web service endpoint / web app URL; the rest of the complexity (setup, scaling, recovery, etc.) is taken care of by the DC cloud.

Let me know your thoughts.

I think at this point it would be nice to define what DC Cloud is.

thanks
Prabhu
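The batch-mode flow above can be sketched as a toy in-memory stand-in (every name here, from the class to the placeholder result, is hypothetical; no such DC Cloud API exists):

```python
# Toy stand-in for the hypothetical DC Cloud batch flow:
# submit a job (steps 1-2), run the queue (step 3), fetch results (step 4).

class FakeDCCloud:
    def __init__(self):
        self.jobs = {}
        self.next_id = 1

    def submit_job(self, source, task, callback=None):
        # steps 1-2: accept the job and queue it for execution
        job_id = self.next_id
        self.next_id += 1
        self.jobs[job_id] = {"source": source, "task": task,
                             "status": "queued", "callback": callback}
        return job_id

    def run_queued(self):
        # step 3: execute queued jobs and notify via callback
        for job in self.jobs.values():
            if job["status"] == "queued":
                job["status"] = "complete"
                job["result"] = {"rows_profiled": 100}  # placeholder result
                if job["callback"]:
                    job["callback"](job["result"])

    def get_result(self, job_id):
        # step 4: results remain accessible after completion
        return self.jobs[job_id].get("result")

cloud = FakeDCCloud()
notified = []
job_id = cloud.submit_job("ftp://example/customers.csv", "profile",
                          callback=notified.append)
cloud.run_queued()
result = cloud.get_result(job_id)
```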


Kasper Sørensen

Oct 17, 2011, 1:27:59 PM
to datacle...@googlegroups.com
Hi Prabhu,

Good to get some more details. I agree with your characteristics, and these match pretty well with what we've had in mind at Human Inference. In particular we are looking for a way in which we can easily publish and invoke DC jobs through SOAP services. We also already have a running prototype of a webapp that runs DataCleaner jobs upon HTTP requests and displays the result (similar to the GUI results) as HTML.

But to make it really work, of course, there needs to be a lot more. The major things to overcome are, in my opinion:
• Persisting the profiling results of a job.
• Displaying results as HTML (this is already supported by the result handling framework, so it should be quite easily doable).
• Support for "single row"/push execution, where the results are not aggregated, but DC is simply used for individual data cleansing and validation steps. This sounds similar to what you call realtime mode.
• Scheduling of jobs.
• I would also like to see comparative result views, where we display timelines/trends of profiling results.
• We would probably provide a set of template jobs, which should use the existing "template job" mapping mechanism in DC, but in a browser of course.
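The "single row"/push execution point can be illustrated with a minimal sketch (function names are made up; the idea is just that each request carries one row and no aggregate state is kept):

```python
# Push execution: one row in, one cleansed/validated row out. Because
# nothing is aggregated, each request is independent, so the
# reduce-side state restrictions never apply.

def cleanse(row):
    return {k: v.strip() for k, v in row.items()}

def validate_email(value):
    return "@" in value and "." in value.split("@")[-1]

def handle_request(row):
    # one request in, one answer out; no state survives the call
    row = cleanse(row)
    return {"row": row, "email_valid": validate_email(row["email"])}

response = handle_request({"email": " jane@example.org "})
```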
I also noticed a couple of very good points in your previous mail. In particular, the idea about job completion callbacks was an aspect I had not thought about. I guess there can be different types of callbacks, probably with email as the primary one.

So ... moving forward, I think we should probably have a chat about how you or your organization can get involved. I am available on Skype (username: kasper.sorensen.human.inference) - is that doable?

BR
Kasper

Prabhuram Mohan

Oct 18, 2011, 8:38:53 PM
to datacle...@googlegroups.com
Hi Kasper,

I know you would have thought a lot about persisting the profiling results. Here are some thoughts to add to it.

Persisting Profiling Results:

At a minimum we would need the following for persisting the results:

1) Repository - to store the results themselves (maybe a group of tables)
2) Structure - even though you can store the results in the repository with a unique key, some structure is necessary to organise the results. Think of this like a folder structure.
3) Security - Access Control Lists are required to provide access to the correct people and at the same time allow them to share the results and collaborate.

We could have something like:

    A private location specific to a user on the system. However, the user can choose to share a folder with another user.

        com.dccloud.<OrgName>.<SubOrg>.<uniqueuserid>.<project>.<subproject-lvl1>.<subproject-lvl2>.<prof-result-name> ==> <GUID>

        This whole thing is mapped to the result set of a profiling job <GUID>. This GUID is present in all the result tables to tie the result together. I hope this makes sense.

    A project-based folder structure:

        com.dccloud.<OrgName>.<SubOrg>.<project>.<subproject-lvl1>.<subproject-lvl2>.<prof-result-name> ==> <GUID>

    There also needs to be an ACL table which will tell who has access to which folder.
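A minimal sketch of that scheme - hierarchical result names mapped to GUIDs, plus an ACL on folder prefixes - might look like this (all names hypothetical):

```python
# Results are stored under dotted "folder" paths mapped to GUIDs; an
# ACL table grants users access per folder prefix, so sharing a folder
# shares everything beneath it.

import uuid

class ResultRepository:
    def __init__(self):
        self.results = {}  # dotted path -> (GUID, result rows)
        self.acl = {}      # folder prefix -> set of users with access

    def store(self, path, rows):
        guid = str(uuid.uuid4())
        self.results[path] = (guid, rows)
        return guid

    def grant(self, folder, user):
        self.acl.setdefault(folder, set()).add(user)

    def fetch(self, path, user):
        # allowed if the user has rights on any folder prefix of the path
        if not any(path.startswith(folder) and user in users
                   for folder, users in self.acl.items()):
            raise PermissionError(f"{user} cannot read {path}")
        return self.results[path]

repo = ResultRepository()
path = "com.dccloud.acme.bi.prabhu.projA.profile1"
guid = repo.store(path, [{"column": "email", "null_count": 3}])
repo.grant("com.dccloud.acme.bi.prabhu", "prabhu")
fetched_guid, rows = repo.fetch(path, "prabhu")
```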

A parting thought - do you know if there is any data profiling tool available for Hadoop / big data profiling?

thanks
Prabhu
