Account Options

  1. Sign in
The old Google Groups will be going away soon, but your browser is incompatible with the new version.
Google Groups Home
« Groups Home
data cleansing: externally or internally?
There are currently too many topics in this group that display first. To make this topic appear first, remove this option from another topic.
There was an error processing your request. Please try again.
flag
  11 messages - Collapse all  -  Translate all to Translated (View all originals)
The group you are posting to is a Usenet group. Messages posted to this group will make your email address visible to anyone on the Internet.
Your reply message has not been sent.
Your post was successful
 
From:
To:
Cc:
Followup To:
Add Cc | Add Followup-to | Edit Subject
Subject:
Validation:
For verification purposes please type the characters you see in the picture below or the numbers you hear by clicking the accessibility icon. Listen and type the numbers you hear
 
geos  
View profile  
 More options Nov 4 2011, 2:51 am
Newsgroups: comp.databases.oracle.server, comp.databases.oracle.misc
Followup-To: comp.databases.oracle.misc
From: geos <g...@nowhere.invalid>
Date: Fri, 04 Nov 2011 07:51:49 +0100
Local: Fri, Nov 4 2011 2:51 am
Subject: data cleansing: externally or internally?
there is a big text file with dirty data. a company wants it to be
clean. there are some known patterns expressed as like or regexp. I
first thought about two approaches:
1) do this on the system level
2) or in a database
for the latter case it looks to me that I could use external tables or
load data into temporary table and then do the cleaning.

I am looking for pros and cons of each variant. my intuition tells me
that loading into temporary table would give the most flexibility but
also take additional space. I am not sure about the other methods. I
would appreciate your opinion about what I should pay attention to when
choosing the other methods. how are they restricted in terms of
performance, flexibility and capabilities (eg. multitable loading)? I am
also interested in good practices and your experience in similar cases
you can share.

thank you,
geos

--
NOTE: Follow Up set to comp.databases.oracle.misc


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Robert Klemme  
View profile  
 More options Nov 4 2011, 5:23 am
Newsgroups: comp.databases.oracle.misc
From: Robert Klemme <shortcut...@googlemail.com>
Date: Fri, 04 Nov 2011 10:23:02 +0100
Local: Fri, Nov 4 2011 5:23 am
Subject: Re: data cleansing: externally or internally?
On 11/04/2011 07:51 AM, geos wrote:

> there is a big text file with dirty data. a company wants it to be
> clean. there are some known patterns expressed as like or regexp. I
> first thought about two approaches:
> 1) do this on the system level

What exactly do you mean by "system level"?  Is this an alias for
"without database"?

> 2) or in a database
> for the latter case it looks to me that I could use external tables or
> load data into temporary table and then do the cleaning.

> I am looking for pros and cons of each variant. my intuition tells me
> that loading into temporary table would give the most flexibility but
> also take additional space. I am not sure about the other methods. I
> would appreciate your opinion about what I should pay attention to when
> choosing the other methods. how are they restricted in terms of
> performance, flexibility and capabilities (eg. multitable loading)? I am
> also interested in good practices and your experience in similar cases
> you can share.

There are a lot of questions to be asked before one can come up with
reasonable advice:

Is this a one off task or needs to be done on a regular basis?
What data is contained in the file?
What format is the file?
How large is the file?
Where does the data ultimately go (next processing stage)?
What does "dirty data" mean?  Put differently: what kind of cleanup
activity do you need to do?
What other requirements are there (execution speed, development time /
cost)?

Cheers

        robert


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
mhoys  
View profile  
 More options Nov 4 2011, 5:26 am
Newsgroups: comp.databases.oracle.misc
From: mhoys <matthias.h...@gmail.com>
Date: Fri, 4 Nov 2011 02:26:08 -0700 (PDT)
Local: Fri, Nov 4 2011 5:26 am
Subject: Re: data cleansing: externally or internally?
On Nov 4, 7:51 am, geos <g...@nowhere.invalid> wrote:

Hi,
I've always done it with external/temp tables... but that's because
I'm more familiar with (PL/)SQL than with shell scripting. You could
write your cleaning functions/procedures once and put them in a
package for later reuse. And if you know APEX, it's easy to build a
web interface around them. Some things are also more easy to do with
SQL than with shell scripting, for example data aggregations. I'm not
sure about the performance. Maybe Perl might perform better on very
large flat files (this could also be dependent on the type of file
system).

HTH,
Matthias


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Deadly Dirk  
View profile  
 More options Nov 4 2011, 1:51 pm
Newsgroups: comp.databases.oracle.misc
From: Deadly Dirk <d...@pfln.invalid>
Date: Fri, 4 Nov 2011 17:51:02 +0000 (UTC)
Local: Fri, Nov 4 2011 1:51 pm
Subject: Re: data cleansing: externally or internally?

On Fri, 04 Nov 2011 07:51:49 +0100, geos wrote:
> there is a big text file with dirty data.

How big is "big"?

> a company wants it to be
> clean. there are some known patterns expressed as like or regexp. I
> first thought about two approaches:
> 1) do this on the system level
> 2) or in a database

Database is not well suited for things like that. Personally, I would use
Perl. Perl is my favorite tool because it's extremely versatile and fast
but any scripting language with regex support will probably do.

--
I don't think, therefore I am not.


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Mark D Powell  
View profile  
 More options Nov 4 2011, 2:33 pm
Newsgroups: comp.databases.oracle.misc
From: Mark D Powell <Mark.Powe...@hp.com>
Date: Fri, 4 Nov 2011 11:33:23 -0700 (PDT)
Local: Fri, Nov 4 2011 2:33 pm
Subject: Re: data cleansing: externally or internally?
Robert has listed some good points.  The devil is always in the details.  What does 'dirty' mean in this context.  Is the data dirty cause someone had the shift key depresssed when he or she tried to enter a numeric digit or it is dirty because a related reocord or table row (FK) does not exist for some of the data?  Perhaps both conditions are true.

You may in fact want to filter the data via shell scripts for some of the issues and then use Oracle tools for other issues.

HTH -- Mark D Powell --


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
joel garry  
View profile  
 More options Nov 4 2011, 8:12 pm
Newsgroups: comp.databases.oracle.misc
From: joel garry <joel-ga...@home.com>
Date: Fri, 4 Nov 2011 17:12:04 -0700 (PDT)
Local: Fri, Nov 4 2011 8:12 pm
Subject: Re: data cleansing: externally or internally?
On Nov 3, 11:51 pm, geos <g...@nowhere.invalid> wrote:

Out of database.  I've done this so many times, whenever the database
is the limiting factor I take it out of the db, clean, and put it back
in.  Two biggest factors are redo generation and memory limitations,
the latter of which are handled better by unix and pipelining.  Plus,
it's easier to split transforms into little pieces and pipe them
together.

jg
--
@home.com is bogus.
http://www.computerworld.com/s/article/9221459/Oracle_s_best_of_breed...


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Cliff  
View profile  
 More options Nov 6 2011, 7:27 am
Newsgroups: comp.databases.oracle.misc
From: Cliff <clifford1.buetiko...@yahoo.com>
Date: Sun, 6 Nov 2011 07:27:02 -0500
Local: Sun, Nov 6 2011 7:27 am
Subject: Re: data cleansing: externally or internally?

My 2 cents and IMHO:

Database

        CON

        - additional storage space

        - increased backup time as you'd basically be making multiple copes of
your data as you transform it

        -version control - Do not do this in the database as it's easier to
keep track of your transformations via file names than via multiple
tables in the database.

        PRO

         - If you're transforming current production data being used every
day, you know you have the current dataset all of the time.  However,  
I would write some pre-code to verify my assumptions prior to each run.

External file

        CON

                - transforming external to the database can be more complex if you
have multiple decision points in your transformation rules.

                - Some editing tools choke on huge file sizes

        PRO

        - easier to export out of the database and perform multiple small
transformations with your favorite tool ( sed and awk are mine) and
easier to debug.

        - the additional file space used by the transformation iterations are
easier to recover from file manipulation than by shrinking a tablespace
or creating a hughe tablespace specifically for this task.

        -  Depending on whether this is a one-off or a monthly clean-up
script, consider having your script run in 2 modes,   Proposed changes
and changes that were made;  that way you can get business sigh-off on
what your script is actually doing.


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
geos  
View profile  
 More options Nov 8 2011, 4:47 pm
Newsgroups: comp.databases.oracle.misc
From: geos <g...@nowhere.invalid>
Date: Tue, 08 Nov 2011 22:47:16 +0100
Local: Tues, Nov 8 2011 4:47 pm
Subject: Re: data cleansing: externally or internally?
thank you for all your answers!

you pointed out many interesting things so that I have to think about'em
before I write more in the subject.

thank you,
geos


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Ramon F Herrera  
View profile  
 More options Nov 10 2011, 8:48 am
Newsgroups: comp.databases.oracle.misc
From: Ramon F Herrera <ra...@conexus.net>
Date: Thu, 10 Nov 2011 05:48:57 -0800 (PST)
Local: Thurs, Nov 10 2011 8:48 am
Subject: Re: data cleansing: externally or internally?
On Nov 4, 12:51 am, geos <g...@nowhere.invalid> wrote:

After more than a decade of experience my advice to you is: Use Oracle
as little as possible.

I wrote all my business logic in C/C++ making calls to the database
only as needed, and now my applications run much, much, much, much
faster. Not to mention the improved development and debug (can use a
debugger, not sure whether Oracle has something similar).

In essence, the only commands that I run in the database are basic
ones such as SELECT and UPDATE. No IFs or BUTs.

-Ramon


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Robert Klemme  
View profile  
 More options Nov 11 2011, 3:51 pm
Newsgroups: comp.databases.oracle.misc
From: Robert Klemme <shortcut...@googlemail.com>
Date: Fri, 11 Nov 2011 21:51:08 +0100
Local: Fri, Nov 11 2011 3:51 pm
Subject: Re: data cleansing: externally or internally?
On 11/10/2011 02:48 PM, Ramon F Herrera wrote:

You still did not disclose the type of processing you want to do.
Without that information advice cannot be targeted at your scenario.

> After more than a decade of experience my advice to you is: Use Oracle
> as little as possible.

> I wrote all my business logic in C/C++ making calls to the database
> only as needed, and now my applications run much, much, much, much
> faster. Not to mention the improved development and debug (can use a
> debugger, not sure whether Oracle has something similar).

> In essence, the only commands that I run in the database are basic
> ones such as SELECT and UPDATE. No IFs or BUTs.

This cannot be generalized as advice!  You do not even mention the type
of application(s) you are talking about.  What may work good for the
application types you work on may not work at all for other application
types.

Kind regards

        robert


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
joel garry  
View profile  
 More options Nov 14 2011, 12:43 pm
Newsgroups: comp.databases.oracle.misc
From: joel garry <joel-ga...@home.com>
Date: Mon, 14 Nov 2011 09:43:42 -0800 (PST)
Local: Mon, Nov 14 2011 12:43 pm
Subject: Re: data cleansing: externally or internally?
On Nov 11, 12:51 pm, Robert Klemme <shortcut...@googlemail.com> wrote:

I agree.  All the answers added together are pretty good advice, I
could have qualified mine better with something about what database
limitations I was referring to.

When I first switched from VMS to the unix world, I took over
maintenance on a system that was like Ramon described, mostly because
of the limitations of the db engine (Unify, things like not being able
to join 5 tables).  That is when I came to the conclusion that
whatever else you like about C, using it directly as an application
language is just a huge mistake.  I can't judge C++, as I never
learned any good theoretical foundation for it (which could be my
shortcoming, or not).  The only rational rationalization for these
languages is the large number of CS students that get trained in them,
and that could be a mistake too.  And these mistakes keep getting made
over and over.  People tend to favor what they know, and favor new
code or "paradigms" over long term maintenance.

jg
--
@home.com is bogus.
Talk about dirty data left behind...
http://www.signonsandiego.com/news/2011/nov/14/buyer-of-minivan-finds...


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
End of messages
« Back to Discussions « Newer topic     Older topic »