From: Deadly Dirk
Newsgroups: comp.databases.oracle.misc
Subject: Re: data cleansing: externally or internally?
Date: Fri, 4 Nov 2011 17:51:02 +0000 (UTC)

On Fri, 04 Nov 2011 07:51:49 +0100, geos wrote:

> there is a big text file with dirty data.

How big is "big"?

> a company wants it to be
> clean. there are some known patterns expressed as like or regexp. I
> first thought about two approaches:
> 1) do this on the system level
> 2) or in a database

A database is not well suited for things like that. Personally, I would use Perl. Perl is my favorite tool because it's extremely versatile and fast, but any scripting language with regex support will probably do.

-- 
I don't think, therefore I am not.
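
To make the "scripting language with regex support" suggestion concrete, here is a minimal sketch in Python rather than Perl. The three rules in RULES and the function names are purely hypothetical, since the thread never says what the real patterns are; substitute the patterns you actually know. It reads the file one line at a time, so memory use should stay flat no matter how big "big" turns out to be.

```python
import re

# Hypothetical cleanup rules (pattern, replacement). These are assumptions
# for illustration only; the original post does not list the real patterns.
RULES = [
    (re.compile(r"[ \t]+"), " "),                 # collapse runs of spaces/tabs
    (re.compile(r"\s*,\s*"), ","),                # drop whitespace around commas
    (re.compile(r"\bn/a\b", re.IGNORECASE), ""),  # blank out "N/A" placeholders
]


def clean_line(line):
    """Strip the line, then apply every rule in order."""
    line = line.strip()
    for pattern, replacement in RULES:
        line = pattern.sub(replacement, line)
    return line


def clean_lines(lines):
    """Lazily yield cleaned lines, skipping any that end up empty."""
    for line in lines:
        cleaned = clean_line(line)
        if cleaned:
            yield cleaned


def clean_file(src_path, dst_path):
    """Stream src_path to dst_path one line at a time."""
    with open(src_path, encoding="utf-8", errors="replace") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for cleaned in clean_lines(src):
            dst.write(cleaned + "\n")
```

Something like clean_file("dirty.txt", "clean.txt") would then write the cleaned copy next to the original (both file names are placeholders). Whether this is fast enough depends on the file size and on how many patterns there are, so it is worth timing on a sample first.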