Huge files manipulation

klashxx

10 Nov 2008, 05:24:53

Hi, I need a fast way to delete duplicate entries from very huge
files (>2 GB); these files are in plain text.

To clarify, this is the structure of the file:

30xx|000009925000194653|00000000000000|20081031|02510|00000005445363|
01|F|0207|00|||+0005655,00|||+0000000000000,00
30xx|000009925000194653|00000000000000|20081031|02510|00000005445363|
01|F|0207|00|||+0000000000000,00|||+0000000000000,00
30xx|4150010003502043|CARDS|20081031|MP415001|00000024265698|01|F|
1804|
00|||+0000000000000,00|||+0000000000000,00

Using a key formed by the first 7 fields, I want to print or delete
only the duplicates (the delimiter is the pipe).

I tried all the usual methods (awk / sort / uniq / sed / grep ...) but
it always ended with the same result (out of memory!).

I'm using large HP-UX servers.

I'm very new to Perl, but I read somewhere that the Tie::File module
can handle very large files. I tried it but cannot get the right code...

Any advice will be very welcome.

Thank you in advance.

Regards

PD:I do not want to split the files.

RedGrittyBrick

10 Nov 2008, 06:13:34

When you try the following do you run out of memory?

perl -n -e '/^(\w*|\w*|\w*|\w*|\w*|\w*|\w*)|/ \
and print unless $seen{$1}++' \
hugefilename

You might trade CPU for RAM by making a hash of the key (in the
cryptographic digest sense, not the Perl associative array sense).

Tie::File works with files larger than memory, but I'm not sure how you
would use it for your problem. It's storing the index of seen keys that
is the problem.

I'd maybe tie my %seen to a dbm file. See `perldoc -f tie`
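
For instance, something along these lines (an untested sketch; it assumes
DB_File and Digest::MD5 are available, and the file name 'seen.db' is
arbitrary) keeps the seen-key index on disk and stores only a 16-byte
digest per key:

#!/usr/bin/perl
use strict;
use warnings;
use DB_File;                 # any DBM module would do
use Digest::MD5 qw(md5);

# Keep the "seen" index in a disk-based DBM file rather than in RAM,
# and store a 16-byte MD5 digest of the key instead of the key itself.
tie my %seen, 'DB_File', 'seen.db', O_RDWR|O_CREAT, 0644, $DB_HASH
    or die "Cannot tie seen.db: $!";

while (my $line = <>) {
    my $key = join '|', (split /\|/, $line)[0..6];   # first 7 fields
    print $line unless $seen{ md5($key) }++;         # keep the first occurrence only
}

untie %seen;

Hashing the keys keeps the DBM file small, at the price of a tiny
theoretical chance of digest collisions.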

--
RGB

Tad J McClellan

10 Nov 2008, 07:13:07

RedGrittyBrick <RedGrit...@spamweary.invalid> wrote:

> perl -n -e '/^(\w*|\w*|\w*|\w*|\w*|\w*|\w*)|/ \


ITYM:

perl -n -e '/^(\w*\|\w*\|\w*\|\w*\|\w*\|\w*\|\w*)\|/ \


--
Tad McClellan
email: perl -le "print scalar reverse qq/moc.noitatibaher\100cmdat/"

RedGrittyBrick

10 Nov 2008, 08:10:11

Tad J McClellan wrote:
> RedGrittyBrick <RedGrit...@spamweary.invalid> wrote:
>
>> perl -n -e '/^(\w*|\w*|\w*|\w*|\w*|\w*|\w*)|/ \
>
>
> ITYM:
>
> perl -n -e '/^(\w*\|\w*\|\w*\|\w*\|\w*\|\w*\|\w*)\|/ \
>
>

Doh!

--
RGB

xho...@gmail.com

10 Nov 2008, 10:42:30

klashxx <kla...@gmail.com> wrote:
> Hi , i need a fast way to delete duplicates entrys from very huge
> files ( >2 Gbs ) , these files are in plain text.
>
> ..To clarify, this is the structure of the file:
>
> 30xx|000009925000194653|00000000000000|20081031|02510|00000005445363|
> 01|F|0207|00|||+0005655,00|||+0000000000000,00
> 30xx|000009925000194653|00000000000000|20081031|02510|00000005445363|
> 01|F|0207|00|||+0000000000000,00|||+0000000000000,00
> 30xx|4150010003502043|CARDS|20081031|MP415001|00000024265698|01|F|
> 1804|
> 00|||+0000000000000,00|||+0000000000000,00
>
> Having a key formed by the first 7 fields i want to print or delete
> only the duplicates( the delimiter is the pipe..).

Given the line wraps, it is hard to figure out what the structure
of your file is. Every line has from 7 to infinity fields, with the
first one being 30xx? When you say "print or delete", which one? Do you
want to do both in a single pass, or have two different programs, one for
each use-case?

>
> I tried all the usual methods ( awk / sort /uniq / sed /grep .. ) but
> it always ended with the same result (out of memory!)

Which of those programs was running out of memory? Can you use sort
to group lines according to the key without running out of memory?
That is what I do: first use the system sort to group the keys, then Perl
to finish up.
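
For example (a rough, untested sketch): after grouping the records with
something like "sort -t'|' -k1,7 big_file > big_file.sorted" (the syntax
may need adjusting for HP-UX sort), a single filtering pass only has to
remember the previous key, so memory use stays constant:

perl -ne '
    $key = join "|", (split /\|/)[0..6];             # first 7 fields
    print unless defined $prev and $key eq $prev;    # keep the first line of each key group
    $prev = $key;
' big_file.sorted > big_file.dedup

To print only the duplicates instead, invert the test:
print if defined $prev and $key eq $prev;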

How many duplicate keys do you expect there to be? If the number of
duplicates is pretty small, I'd come up with the list of them:

cut -d\| -f1-7 big_file|sort|uniq -d > dup_keys.

And then load dup_keys into a Perl hash, then step through big_file
comparing each line's key to the hash.
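
A sketch of that second stage (untested; it assumes the dup_keys file
produced above and a single input file called big_file):

#!/usr/bin/perl
use strict;
use warnings;

# Load the (hopefully small) set of duplicated keys.
open my $dk, '<', 'dup_keys' or die "Cannot open dup_keys: $!";
my %is_dup;
while (my $key = <$dk>) {
    chomp $key;
    $is_dup{$key} = 1;
}
close $dk;

# Stream through the big file, printing every line whose key is duplicated.
# (With "unless" it would instead keep only the lines whose key is unique.)
open my $in, '<', 'big_file' or die "Cannot open big_file: $!";
while (my $line = <$in>) {
    my $key = join '|', (split /\|/, $line)[0..6];
    print $line if $is_dup{$key};
}
close $in;

Keeping exactly one copy of each duplicated key would only need an extra
counter hash in the second loop.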

> I 'm very new to perl, but i read somewhere tha Tie::File module can
> handle very large files ,

Tie::File has substantial per-line overhead. So unless the lines
are quite long, Tie::File doesn't increase the size of the file you
can handle by all that much. Also, it isn't clear how you would use
it anyway. It doesn't help you keep huge hashes, which is what you need
to group keys efficiently if you aren't pre-sorting. And while it makes it
*easy* to delete lines from the middle of large files, it does not make
it *efficient* to do so.

> i tried but cannot get the right code...

We can't very well comment on code we can't see.

...


> PD:I do not want to split the files.

Why not?

Xho

--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.

Rocco Caputo

10 Nov 2008, 10:44:16

On Mon, 10 Nov 2008 02:24:53 -0800 (PST), klashxx wrote:
> Hi , i need a fast way to delete duplicates entrys from very huge
> files ( >2 Gbs ) , these files are in plain text.
>
> ..To clarify, this is the structure of the file:
>
> 30xx|000009925000194653|00000000000000|20081031|02510|00000005445363|
> 01|F|0207|00|||+0005655,00|||+0000000000000,00
> 30xx|000009925000194653|00000000000000|20081031|02510|00000005445363|
> 01|F|0207|00|||+0000000000000,00|||+0000000000000,00
> 30xx|4150010003502043|CARDS|20081031|MP415001|00000024265698|01|F|
> 1804|
> 00|||+0000000000000,00|||+0000000000000,00
>
> Having a key formed by the first 7 fields i want to print or delete
> only the duplicates( the delimiter is the pipe..).
>
> I tried all the usual methods ( awk / sort /uniq / sed /grep .. ) but
> it always ended with the same result (out of memory!)

> PD:I do not want to split the files.

Exactly how are you using awk/sort/uniq/sed/grep? Which part of the
pipeline is running out of memory?

Depending on what you're doing, and where you're doing it, you may be
able to tune sort to use more memory (much faster) or to put its
temporary files on a faster filesystem.

Are the first seven fields always the same width? If so, you needn't
bother with the pipes.

Must the order of the lines in the file be preserved?

--
Rocco Caputo - http://poe.perl.org/

bugbear

10 Nov 2008, 11:45:08

klashxx wrote:
> Hi , i need a fast way to delete duplicates entrys from very huge
> files ( >2 Gbs ) , these files are in plain text.
>
> ..To clarify, this is the structure of the file:
>
> 30xx|000009925000194653|00000000000000|20081031|02510|00000005445363|
> 01|F|0207|00|||+0005655,00|||+0000000000000,00
> 30xx|000009925000194653|00000000000000|20081031|02510|00000005445363|
> 01|F|0207|00|||+0000000000000,00|||+0000000000000,00
> 30xx|4150010003502043|CARDS|20081031|MP415001|00000024265698|01|F|
> 1804|
> 00|||+0000000000000,00|||+0000000000000,00
>
> Having a key formed by the first 7 fields i want to print or delete
> only the duplicates( the delimiter is the pipe..).

Hmm. If you're blowing RAM, I suggest a first pass that generates
a data structure containing, for each record, a signature formed by
hashing your 7 key fields (I suggest this key should be 8-12 bytes),
a file offset, and a size.

This can be considered an "index" into your huge file.
Hopefully this index is sufficiently digested to fit in RAM.

The data structure is now sorted (and grouped) by signature.
The grouped signatures are then ordered by earliest file offset.

The file is now processed again, using the index; each group of records
sharing a signature is simply read in (seek/read) and checked
"long hand".

Matches are output, which was (after all) the point.

I would suggest that a large RAM cache be used for this stage, to minimise
IO.
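
A sketch of that two-pass scheme in Perl (untested; the file name is made
up, Digest::MD5 supplies the signature, and it assumes the per-record
index itself fits in RAM):

#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

my $file = 'big_file';

# Pass 1: build an index of 16-byte signature => list of [offset, length].
open my $in, '<', $file or die "Cannot open $file: $!";
my %index;
my $offset = 0;
while (my $line = <$in>) {
    my $sig = md5( join '|', (split /\|/, $line)[0..6] );
    push @{ $index{$sig} }, [ $offset, length $line ];
    $offset = tell $in;                  # start of the next record
}

# Pass 2: revisit only the signatures seen more than once, re-read those
# records (seek/read) and compare the real keys "long hand" to rule out
# digest collisions; print the second and later occurrences of each key.
for my $sig (sort { $index{$a}[0][0] <=> $index{$b}[0][0] } keys %index) {
    my @recs = @{ $index{$sig} };
    next if @recs < 2;
    my %seen_key;
    for my $rec (@recs) {
        seek $in, $rec->[0], 0 or die "seek failed: $!";
        read $in, my $line, $rec->[1];
        my $key = join '|', (split /\|/, $line)[0..6];
        print $line if $seen_key{$key}++;
    }
}
close $in;

Sorting the signature groups by their earliest offset keeps the output
roughly in file order.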

BugBear

Jürgen Exner

10 Nov 2008, 12:25:33

klashxx <kla...@gmail.com> wrote:
>Hi , i need a fast way to delete duplicates entrys from very huge
>files ( >2 Gbs ) , these files are in plain text.
>
>Having a key formed by the first 7 fields i want to print or delete
>only the duplicates( the delimiter is the pipe..).

Hmmm, what is the ratio of unique lines to total lines? I.e. are there
many duplicate lines or only few?

If the number of unique lines is small then a standard approach of
recording each unique line in a hash may work. Then you can simply check
whether a line with that content already exists() and delete/print the
duplicate as you encounter it further down the file.

If the number of unique lines is large then that will no longer be
possible and you will have to trade speed and simplicity for memory.
For each line I'd compute a checksum and record that checksum together
with the exact position of each matching line in the hash.
Then in a second pass those lines with unique checksums are unique while
lines with the same checksum (more than one line was recorded for a
given checksum) are candidates for duplicates and need to be compared
individually.
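
A slight variation on that idea, sketched below (untested): instead of
recording exact positions, the first pass only counts how often each
checksum occurs, and the second pass re-reads the file sequentially and
compares the candidate lines exactly (the exact-comparison hash stays
small because it only ever holds candidates):

#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

my $file = shift || 'big_file';

# Pass 1: count how many lines share each 16-byte checksum.
open my $in, '<', $file or die "Cannot open $file: $!";
my %count;
$count{ md5($_) }++ while <$in>;

# Pass 2: lines with a unique checksum are certainly unique; the rest are
# candidates and get compared exactly.
seek $in, 0, 0 or die "seek failed: $!";
my %seen_exact;
while (my $line = <$in>) {
    next unless $count{ md5($line) } > 1;
    print $line if $seen_exact{$line}++;   # print the duplicates only
}
close $in;

For the key-based notion of "duplicate" wanted here, checksum and compare
only the first seven fields rather than the whole line.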

jue

Tim Greer

10 Nov 2008, 13:26:11

klashxx wrote:

> Hi , i need a fast way to delete duplicates entrys from very huge
> files ( >2 Gbs ) , these files are in plain text.
>
> ..To clarify, this is the structure of the file:
>
> 30xx|000009925000194653|00000000000000|20081031|02510|00000005445363|
> 01|F|0207|00|||+0005655,00|||+0000000000000,00
> 30xx|000009925000194653|00000000000000|20081031|02510|00000005445363|
> 01|F|0207|00|||+0000000000000,00|||+0000000000000,00
> 30xx|4150010003502043|CARDS|20081031|MP415001|00000024265698|01|F|
> 1804|
> 00|||+0000000000000,00|||+0000000000000,00
>
> Having a key formed by the first 7 fields i want to print or delete
> only the duplicates( the delimiter is the pipe..).
>
> I tried all the usual methods ( awk / sort /uniq / sed /grep .. ) but
> it always ended with the same result (out of memory!)

What is the code you're using now?

>
> PD:I do not want to split the files.

Splitting could potentially help solve the problem in its current form.
Have you considered using a database, if for nothing more than this type
of task, even if you want to continue storing the fields/data in the
files?
--
Tim Greer, CEO/Founder/CTO, BurlyHost.com, Inc.
Shared Hosting, Reseller Hosting, Dedicated & Semi-Dedicated servers
and Custom Hosting. 24/7 support, 30 day guarantee, secure servers.
Industry's most experienced staff! -- Web Hosting With Muscle!

cartercc

10 Nov 2008, 14:17:56

1. Create a database with a data table with two columns, ID_PK and DATA.
2. Read the file line by line and insert each row into the database,
using the first seven fields as the key. This will ensure that you
have no duplicates in the database: as each PK must be unique, your
insert statement will fail for duplicates.
3. Do a select statement on the database and print out all the
records returned (a rough DBI sketch follows below).
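
Roughly like this with DBI (untested; DBD::SQLite is just one convenient
back end, and the table, column, and file names are invented):

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=dedup.db', '', '',
                       { RaiseError => 0, PrintError => 0, AutoCommit => 0 })
    or die $DBI::errstr;

# Step 1: one table, with the key as its primary key.
$dbh->do('CREATE TABLE IF NOT EXISTS rec (id_pk TEXT PRIMARY KEY, data TEXT)');

# Step 2: insert every line; the PRIMARY KEY constraint silently rejects
# any line whose key has already been stored.
my $ins = $dbh->prepare('INSERT INTO rec (id_pk, data) VALUES (?, ?)');
open my $in, '<', 'big_file' or die "Cannot open big_file: $!";
while (my $line = <$in>) {
    my $key = join '|', (split /\|/, $line)[0..6];   # first 7 fields
    $ins->execute($key, $line);
}
close $in;
$dbh->commit;

# Step 3: every stored row has a unique key; print them back out.
my $sth = $dbh->prepare('SELECT data FROM rec');
$sth->execute;
while (my ($data) = $sth->fetchrow_array) {
    print $data;
}
$dbh->disconnect;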

Splitting the file is irrelevant as virtually 100% of the time you
have to split all files to take a gander at the data.

I still don't understand why you can't use a simple hash rather than a
DB, but my not understanding that point is irrelevant as well.

CC

Michele Dondi

11 Nov 2008, 07:11:49

On Mon, 10 Nov 2008 02:24:53 -0800 (PST), klashxx <kla...@gmail.com>
wrote:

>Hi , i need a fast way to delete duplicates entrys from very huge
>files ( >2 Gbs ) , these files are in plain text.

[cut]


>Any advice will be very well come.
>
>Thank you in advance.

Wouldn't it have been nice of you to mention that you asked the very same
question elsewhere? <http://perlmonks.org/?node_id=722634> Did they
help you there? How did they fail to do so?


Michele
--
{$_=pack'B8'x25,unpack'A8'x32,$a^=sub{pop^pop}->(map substr
(($a||=join'',map--$|x$_,(unpack'w',unpack'u','G^<R<Y]*YB='
.'KYU;*EVH[.FHF2W+#"\Z*5TI/ER<Z`S(G.DZZ9OX0Z')=~/./g)x2,$_,
256),7,249);s/[^\w,]/ /g;$ \=/^J/?$/:"\r";print,redo}#JAPH,

s...@netherlands.com

16 Nov 2008, 15:51:41

On Mon, 10 Nov 2008 02:24:53 -0800 (PST), klashxx <kla...@gmail.com> wrote:

I can do this for you with custom algorithym's.
Each file is sent to me and each has an independent
fee, based on processing time. Or I can license my
technologoy to you, flat-fee, per usage based.

Let me know if your interrested, post a contact email address.


sln

Tad J McClellan

16 Nov 2008, 18:36:41

s...@netherlands.com <s...@netherlands.com> wrote:

> I can do this for you with custom algorithym's.

^^
^^

Your algorithym (sic) possesses something?


> Let me know if your interrested, post a contact email address.

^^^^
^^^^

Put in apostrophe's where they are not needed, leave them out
where theyre needed. Interresting.

David Combs

30 Nov 2008, 21:16:14

In article <slrngi1bk9...@tadmc30.sbcglobal.net>,

Tad J McClellan <ta...@seesig.invalid> wrote:
>s...@netherlands.com <s...@netherlands.com> wrote:
>
>> I can do this for you with custom algorithym's.
> ^^
> ^^
>
>Your algorithym (sic) possesses something?
>
>
>> Let me know if your interrested, post a contact email address.
> ^^^^
> ^^^^
>
>Put in apostrophe's where they are not needed, leave them out

What's with the schools these days?

On the net, at least, I hardly ever see "you're" any more -- it's
always "your".

(I bet the Chinese and Russians don't make that mistake! :-( )


David

