Re: [DataCleaner] Digest for datacleaner-dev@googlegroups.com - 2 Messages in 1 Topic


Prabhuram Mohan

Oct 16, 2011, 3:54:21 PM
to datacle...@googlegroups.com
Hi Kasper,

I appreciate your response.

I am a Data & BI Engineer by profession. I have consulted for several banks. I feel at home with SQL, SAS & R.
I have administered Dataflux and worked on Informatica DQ.

Dataflux and IDQ are monolithic and expensive. I thought something better should be possible.

I liked Data Cleaner. Sure, I would like to collaborate.

To start with: do you have any Amazon AMI images with Data Cleaner? If not, I would recommend creating one. It's an easy way to take Data Cleaner for a spin.

I have experience with Amazon EC2.

If you have some time we can have a chat.

thx
prabhu



On Fri, Oct 14, 2011 at 8:36 AM, <datacle...@googlegroups.com> wrote:

Group: http://groups.google.com/group/datacleaner-dev/topics

    Pmohan <mprab...@gmail.com> Oct 13 09:29AM -0700
     
    I was just exploring what it would take to
     
    1) make DataCleaner deployable in the cloud,
    2) make it accessible through a web client / API,
    3) support distributed job processing across multiple machines (a long shot,
    though).
     
    Thanks
    Pmohan

     

    "Kasper Sørensen" <kas...@eobjects.dk> Oct 13 07:03PM +0200
     
    Hi Pmohan,
     
    Thanks for the question, an interesting one!
     
    1+2) Actually we have been playing around with this idea already at Human
    Inference. We've already made some loose plans to be able to deploy DC jobs
    as invokable web services, running on a server. The architecture completely
    supports this idea and I see no major impediments, except "just doing it".
     
    3) For some tasks this is a good fit, for some features not. Specifically,
    the transformer and filter components are very analogous to the "map" part
    of a MapReduce system (like Hadoop or GridGain) and thus could be REALLY
    scalable. The Analyzer components are also somewhat analogous to "reduce" in
    a MapReduce system, but to make it work there we would have to impose
    certain restrictions on what an Analyzer can do, and specifically on how it
    saves state. So yes, it is in our thoughts, but it's not likely to be
    something we would create in the short term.
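Kasper's map/reduce analogy can be sketched roughly like this (illustrative Python with made-up `transform`/`analyze` functions; this is not DataCleaner's actual API):

```python
# Transformers/filters are "map"-like: purely per-row, no shared state,
# so they parallelize trivially. Analyzers are "reduce"-like: they fold
# rows into an aggregate, which is where the restrictions on how state
# is saved (it must be mergeable) would come in.

def transform(row):
    # map-like step: standardize each value independently
    return {k: v.strip().lower() for k, v in row.items()}

def analyze(rows):
    # reduce-like step: aggregate value counts per column
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts.setdefault(column, {}).setdefault(value, 0)
            counts[column][value] += 1
    return counts

rows = [{"name": " Alice "}, {"name": "alice"}, {"name": "Bob "}]
result = analyze(transform(r) for r in rows)
```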
     
    Now that you have a few answers, may I ask (out of curiosity) why you are
    asking? Are you considering building such an application? Would you maybe be
    interested in a cooperation?
     
    Best regards,
    Kasper
     

     

--
You received this message because you are subscribed to the Google Groups "DataCleaner-dev" group.
To post to this group, send email to datacle...@googlegroups.com.
To unsubscribe from this group, send email to datacleaner-d...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/datacleaner-dev?hl=en.

Kasper Sørensen

Oct 17, 2011, 4:53:47 AM
to datacle...@googlegroups.com
Hi Prabhu,

No, we don't have any Amazon images available; if you want to create one, I think that's fine. I don't see it as the primary way our users will want to access DC, since the installation is already very easy and unobtrusive.

But if you create an image I will be happy to link to it or add a news item about it on the DC website.

Would deploying DC on Amazon qualify as "DC in the cloud" in your opinion? Because then I think we're talking about slightly different things. I was replying more on the grounds of creating a web application that uses the DC engine to execute jobs in the browser. You're talking more about using the computing power of, e.g., Amazon's cloud offering to run DC jobs. Given that those machines work just like any other machines, I don't see any issues in doing so!

Best regards,
Kasper

Prabhuram Mohan

Oct 17, 2011, 9:37:15 AM
to datacle...@googlegroups.com
Hi Kasper,

I also don't think having a DC AMI qualifies as DC in the cloud. That said, having an image with DC set up along with sample files, sample jobs and other related tools would be immensely useful in my view. That is not a big deal anyway.

Looking at the bigger picture, DC in the cloud would be something like:


Batch Mode
    1) I go to a website, pick the data source (already uploaded / available via FTP), define a task/job and submit it through a web service / web interface.
    2) DC Cloud accepts the job and queues it for execution.
    3) Once the job is complete, the user is notified by callbacks.
    4) The results are available on the cloud, accessible through the web interface / web service.

Realtime Mode
    1) The user app makes a web service call to the DC cloud with a task (the task may be name validation, address validation, fuzzy duplicate detection, etc.)
    2) The DC cloud executes the task (using a predefined knowledge base) and returns a response to the user service.


From the user perspective, all I need to know is the web service endpoint / web app URL; the rest of the complexity (setup, scaling, recovery, etc.) is taken care of by the DC cloud.

Let me know your thoughts.

I think at this point it would be nice to define what DC Cloud is.

thanks
Prabhu
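The batch-mode flow above can be sketched as a toy in-memory stand-in (every name here, from the class to the placeholder result, is hypothetical; no such DC Cloud API exists):

```python
# Toy stand-in for the hypothetical DC Cloud batch flow:
# submit a job (steps 1-2), run the queue (step 3), fetch results (step 4).

class FakeDCCloud:
    def __init__(self):
        self.jobs = {}
        self.next_id = 1

    def submit_job(self, source, task, callback=None):
        # steps 1-2: accept the job and queue it for execution
        job_id = self.next_id
        self.next_id += 1
        self.jobs[job_id] = {"source": source, "task": task,
                             "status": "queued", "callback": callback}
        return job_id

    def run_queued(self):
        # step 3: execute queued jobs and notify via callback
        for job in self.jobs.values():
            if job["status"] == "queued":
                job["status"] = "complete"
                job["result"] = {"rows_profiled": 100}  # placeholder result
                if job["callback"]:
                    job["callback"](job["result"])

    def get_result(self, job_id):
        # step 4: results remain accessible after completion
        return self.jobs[job_id].get("result")

cloud = FakeDCCloud()
notified = []
job_id = cloud.submit_job("ftp://example/customers.csv", "profile",
                          callback=notified.append)
cloud.run_queued()
result = cloud.get_result(job_id)
```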


Kasper Sørensen

Oct 17, 2011, 1:27:59 PM
to datacle...@googlegroups.com
Hi Prabhu,

Good to get some more details. I agree with your characteristics, and these match pretty well with what we've had in mind at Human Inference. In particular we are looking for a way in which we can easily publish and invoke DC jobs through SOAP services. We also already have a running prototype of a webapp that runs DataCleaner jobs upon HTTP requests and displays the result (similar to the GUI results) as HTML.

But to make it really work, of course, there needs to be a lot more. The major things to overcome are, in my opinion:
• Persisting the profiling results of a job.
• Displaying results as HTML (this is already supported by the result handling framework, so it should be quite easily doable).
• Support for "single row"/push execution, where the results are not aggregated, but DC is simply used for individual data cleansing and validation steps. This sounds similar to what you call realtime mode.
• Scheduling of jobs.
• I would also like to see comparative result views, where we display timelines/trends of profiling results.
• We would probably provide a set of template jobs, which should use the existing "template job" mapping mechanism in DC, but in a browser of course.
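The "single row"/push execution point can be illustrated with a minimal sketch (function names are made up; the idea is just that each request carries one row and no aggregate state is kept):

```python
# Push execution: one row in, one cleansed/validated row out. Because
# nothing is aggregated, each request is independent, so the
# reduce-side state restrictions never apply.

def cleanse(row):
    return {k: v.strip() for k, v in row.items()}

def validate_email(value):
    return "@" in value and "." in value.split("@")[-1]

def handle_request(row):
    # one request in, one answer out; no state survives the call
    row = cleanse(row)
    return {"row": row, "email_valid": validate_email(row["email"])}

response = handle_request({"email": " jane@example.org "})
```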
I also noticed a couple of very good points in your previous mail. In particular, the idea about job completion callbacks was an aspect I had not thought about. I guess there can be different types of callbacks, probably with email as the primary one.

So ... moving forward, I think we should probably have a chat about how you or your organization can get involved. I am available on Skype (username: kasper.sorensen.human.inference) - is that doable?

BR
Kasper

Prabhuram Mohan

Oct 18, 2011, 8:38:53 PM
to datacle...@googlegroups.com
Hi Kasper,

I know you would have thought a lot about persisting the profiling results. Here are some thoughts to add to it.

Persisting Profiling Results:

At a minimum we would need the following for persisting the results:

1) Repository - to store the results themselves (maybe a group of tables)
2) Structure - even though you can store the results in the repository with a unique key, some structure is necessary to organise the results. Think of this like a folder structure.
3) Security - Access Control Lists are required to provide access to the correct people and at the same time allow them to share the results and collaborate.

We could have something like:

    A private location specific to a user on the system. However, the user can choose to share a folder with another user.

        com.dccloud.<OrgName>.<SubOrg>.<uniqueuserid>.<project>.<subproject-lvl1>.<subproject-lvl2>.<prof-result-name> ==> <GUID>

        This whole thing is mapped to the result set of a profiling job <GUID>. This GUID is present in all the result tables to tie the result together. I hope this makes sense.

    A project-based folder structure:

        com.dccloud.<OrgName>.<SubOrg>.<project>.<subproject-lvl1>.<subproject-lvl2>.<prof-result-name> ==> <GUID>

    There also needs to be an ACL table which will tell who has access to which folder.
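A minimal sketch of that scheme - hierarchical result names mapped to GUIDs, plus an ACL on folder prefixes - might look like this (all names hypothetical):

```python
# Results are stored under dotted "folder" paths mapped to GUIDs; an
# ACL table grants users access per folder prefix, so sharing a folder
# shares everything beneath it.

import uuid

class ResultRepository:
    def __init__(self):
        self.results = {}  # dotted path -> (GUID, result rows)
        self.acl = {}      # folder prefix -> set of users with access

    def store(self, path, rows):
        guid = str(uuid.uuid4())
        self.results[path] = (guid, rows)
        return guid

    def grant(self, folder, user):
        self.acl.setdefault(folder, set()).add(user)

    def fetch(self, path, user):
        # allowed if the user has rights on any folder prefix of the path
        if not any(path.startswith(folder) and user in users
                   for folder, users in self.acl.items()):
            raise PermissionError(f"{user} cannot read {path}")
        return self.results[path]

repo = ResultRepository()
path = "com.dccloud.acme.bi.prabhu.projA.profile1"
guid = repo.store(path, [{"column": "email", "null_count": 3}])
repo.grant("com.dccloud.acme.bi.prabhu", "prabhu")
fetched_guid, rows = repo.fetch(path, "prabhu")
```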

A parting thought - do you know if there is any data profiling tool available for Hadoop / big data profiling?

thanks
Prabhu
