Account Options

  1. Sign in
The old Google Groups will be going away soon, but your browser is incompatible with the new version.
Google Groups Home
« Groups Home
Message from discussion data cleansing: externally or internally?

Received: by 10.43.117.133 with SMTP id fm5mr24531063icc.7.1320452017827;
        Fri, 04 Nov 2011 17:13:37 -0700 (PDT)
Path: p6ni72034pbn.0!nntp.google.com!news2.google.com!postnews.google.com!e5g2000prf.googlegroups.com!not-for-mail
From: joel garry <joel-ga...@home.com>
Newsgroups: comp.databases.oracle.misc
Subject: Re: data cleansing: externally or internally?
Date: Fri, 4 Nov 2011 17:12:04 -0700 (PDT)
Organization: http://groups.google.com
Lines: 36
Message-ID: <155e37ce-71ae-443b-b69b-2b23b29b8082@e5g2000prf.googlegroups.com>
References: <j9022g$4b5$1@news.task.gda.pl>
NNTP-Posting-Host: 64.58.150.28
Mime-Version: 1.0
X-Trace: posting.google.com 1320452017 1639 127.0.0.1 (5 Nov 2011 00:13:37 GMT)
X-Complaints-To: groups-abuse@google.com
NNTP-Posting-Date: Sat, 5 Nov 2011 00:13:37 +0000 (UTC)
Complaints-To: groups-abuse@google.com
Injection-Info: e5g2000prf.googlegroups.com; posting-host=64.58.150.28; posting-account=tpQovAkAAABNoH5bwsZAiff2L0zxGwdv
User-Agent: G2/1.0
X-HTTP-UserAgent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.18)
 Gecko/20110614 Firefox/3.6.18 ( .NET CLR 3.5.30729),gzip(gfe)
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

On Nov 3, 11:51=A0pm, geos <g...@nowhere.invalid> wrote:
> there is a big text file with dirty data. a company wants it to be
> clean. there are some known patterns expressed as like or regexp. I
> first thought about two approaches:
> 1) do this on the system level
> 2) or in a database
> for the latter case it looks to me that I could use external tables or
> load data into temporary table and then do the cleaning.
>
> I am looking for pros and cons of each variant. my intuition tells me
> that loading into temporary table would give the most flexibility but
> also take additional space. I am not sure about the other methods. I
> would appreciate your opinion about what I should pay attention to when
> choosing the other methods. how are they restricted in terms of
> performance, flexibility and capabilities (eg. multitable loading)? I am
> also interested in good practices and your experience in similar cases
> you can share.
>
> thank you,
> geos
>
> --
> NOTE: Follow Up set to comp.databases.oracle.misc

Out of database.  I've done this so many times, whenever the database
is the limiting factor I take it out of the db, clean, and put it back
in.  Two biggest factors are redo generation and memory limitations,
the latter of which are handled better by unix and pipelining.  Plus,
it's easier to split transforms into little pieces and pipe them
together.

jg
--
@home.com is bogus.
http://www.computerworld.com/s/article/9221459/Oracle_s_best_of_breed_strat=
egy_as_described_by_president_Mark_Hurd?taxonomyId=3D154