I need ideas on how to sort 350 million lines of data

cha...@lonemerchant.com

May 17, 2008, 11:21:21 AM

I have roughly 350 million lines of data in the following form

name, price, weight, brand, sku, upc, size

sitting on my home PC.

Is there some kind of sane way to sort this without taking up too much
ram or jacking up my limited CPU time?

Andrew Rich

May 17, 2008, 11:49:51 AM

What operating system ?

I would throw it into MySQL

<cha...@lonemerchant.com> wrote in message
news:1c6aca8c-f1ac-4113...@t12g2000prg.googlegroups.com...

cha...@lonemerchant.com

May 17, 2008, 11:56:30 AM

> > ram or jacking up my limited CPU time?

At the risk of sounding like a total dumba--, is it possible to
load a .csv file directly into MySQL?

cha...@lonemerchant.com

May 17, 2008, 12:00:55 PM

> upload a .cvs file directly into mysql?

Never mind. I can google the answer. Thanks.

xho...@gmail.com

May 18, 2008, 4:51:42 PM

cha...@lonemerchant.com wrote:
> I have roughly 350 million lines of data in the following form
>
> name, price, weight, brand, sku, upc, size

Name, in particular, seems like it might be able to contain embedded
punctuation and might be escaped in some way. That could complicate
things.

> sitting on my home PC.

What kind of PC is your home PC?

> Is there some kind of sane way to sort this without taking up too much
> ram

As long as you have plenty of scratch space, Linux's system sort will
use temp files to sort things much larger than main memory. For all I
know, the sort in Windows' DOS emulator will as well. The real question
is whether you can get the system sort command to sort on the field and
collation sequence you want. If not, you could use Perl to
transform the data into something more acceptable, use the system sort,
then transform it back.
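
A rough sketch of that transform / sort / transform-back idea, written in Python here rather than Perl. It assumes one record per line, comma-separated fields with optional double quotes, and a Unix-like sort(1) on the PATH. The function names (decorate, undecorate, sort_by_field) and the ".decorated" / ".sorted" scratch file names are made up for illustration. Keys are compared as byte strings under LC_ALL=C, so a numeric field such as price would need to be zero-padded or otherwise normalized in the decorate step.

```python
import csv
import os
import subprocess


def decorate(in_path, out_path, index):
    """Prefix each line with its sort key and a tab so sort(1) can key on field 1."""
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        for line in src:
            line = line.rstrip("\r\n")
            # The csv module copes with quoted fields that contain commas.
            row = next(csv.reader([line], skipinitialspace=True), [])
            key = row[index] if len(row) > index else ""
            # Tabs inside the key would confuse the delimiter, so flatten them.
            dst.write(key.replace("\t", " ") + "\t" + line + "\n")


def undecorate(in_path, out_path):
    """Strip the key prefix again, leaving the original lines in sorted order."""
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        for line in src:
            dst.write(line.split("\t", 1)[1])


def sort_by_field(in_path, out_path, index=0, tmp_dir=None):
    """Sort in_path by one CSV field, letting the system sort do the heavy lifting."""
    decorated = in_path + ".decorated"   # hypothetical scratch file names
    sorted_path = in_path + ".sorted"
    try:
        decorate(in_path, decorated, index)
        cmd = ["sort", "-t", "\t", "-k", "1,1", "-o", sorted_path]
        if tmp_dir:
            cmd += ["-T", tmp_dir]       # where sort keeps its temp files
        cmd.append(decorated)
        # LC_ALL=C gives plain byte-order collation.
        subprocess.run(cmd, check=True, env={**os.environ, "LC_ALL": "C"})
        undecorate(sorted_path, out_path)
    finally:
        for path in (decorated, sorted_path):
            if os.path.exists(path):
                os.remove(path)
```

Something like sort_by_field("products.csv", "by_brand.csv", index=3) would sort on the brand column (the file names are placeholders). Note that this needs scratch space for two extra copies of the data on top of whatever sort itself uses.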

> or jacking up my limited CPU time?

Sorting 350 million records will take some CPU time. I don't know what
you consider to be "jacking up" or how limited you think your CPU time is.
My CPUs are limited to about 86,400 seconds per day, whether I am using
them or not.

Xho

--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.

Bill H

May 18, 2008, 8:14:35 PM

Just out of curiosity I would like to know how someone has a file
containing 350 million lines of product information sitting on a home
PC in the first place. I mean it had to have come from some sort of
database to start with, and with those numbers we aren't talking about
a secondhand store.

Bill H

Ted Zlatanov

May 19, 2008, 1:00:15 PM

On Sat, 17 May 2008 08:21:21 -0700 (PDT) cha...@lonemerchant.com wrote:

c> I have roughly 350 million lines of data in the following form
c> name, price, weight, brand, sku, upc, size

c> sitting on my home PC.

c> Is there some kind of sane way to sort this without taking up too much
c> ram or jacking up my limited CPU time?

One simple way, without using databases, is to take smaller pieces (say,
10K lines each) and sort them individually by whatever field you need.
Then you take the top or bottom of each piece, make a new set, and sort
that set for the final result.

If you need to sort the whole list and not just get the max/min, apply
the same algorithm except you keep each sorted piece open and keep
taking the smallest/largest element from the top/bottom of the piece
that contains it.

For more information and if my explanation doesn't make sense, look up
the "merge sort" algorithm.
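
A minimal sketch of the whole-list version, in Python, under some assumptions: one record per line, comma-separated fields with optional double quotes, and keys compared as plain strings. The names field_key and external_sort and the ".chunk" suffix are invented for the example.

```python
import csv
import heapq
import itertools
import os
import tempfile


def field_key(line, index):
    """Return one CSV field of the line to use as the sort key."""
    row = next(csv.reader([line], skipinitialspace=True), [])
    return row[index] if len(row) > index else ""


def external_sort(in_path, out_path, index=0, chunk_lines=10_000):
    """Sort a big text file by one field while holding only one piece in memory."""
    def key(line):
        return field_key(line, index)

    chunk_paths = []
    try:
        # Pass 1: read a piece, sort it in memory, write it to a temp file.
        with open(in_path, newline="") as src:
            while True:
                chunk = list(itertools.islice(src, chunk_lines))
                if not chunk:
                    break
                if not chunk[-1].endswith("\n"):
                    chunk[-1] += "\n"    # last line of the file may lack a newline
                chunk.sort(key=key)
                fd, path = tempfile.mkstemp(suffix=".chunk")
                with os.fdopen(fd, "w", newline="") as tmp:
                    tmp.writelines(chunk)
                chunk_paths.append(path)

        # Pass 2: keep every sorted piece open and keep taking the smallest
        # line from the top of whichever piece currently holds it.
        files = [open(path, newline="") for path in chunk_paths]
        try:
            with open(out_path, "w", newline="") as dst:
                dst.writelines(heapq.merge(*files, key=key))
        finally:
            for f in files:
                f.close()
    finally:
        for path in chunk_paths:
            os.remove(path)
```

One caveat with the numbers in this thread: 350 million lines in 10K-line pieces is 35,000 temp files, which is more than most systems will let a single process hold open at once. In practice you would pick a much larger chunk_lines, or merge the pieces in several rounds.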

Ted

bugbear

May 20, 2008, 6:42:55 AM

Ted Zlatanov wrote:
> One simple way, without using databases, is to take smaller pieces (say,
> 10K lines each) and sort them individually by whatever field you need.
> Then you take the top or bottom of each piece, make a new set, and sort
> that set for the final result.
>
> If you need to sort the whole list and not just get the max/min, apply
> the same algorithm except you keep each sorted piece open and keep
> taking the smallest/largest element from the top/bottom of the piece
> that contains it.
>
> For more information and if my explanation doesn't make sense, look up
> the "merge sort" algorithm.

IIRC Linux/Unix sort uses quicksort when the data fits in RAM
and merge sort (via disc) if the data size exceeds RAM size,
again using quicksort in RAM when the portion to be
merged fits in RAM.

BugBear

Ted Zlatanov

May 20, 2008, 11:42:19 AM

On Tue, 20 May 2008 11:42:55 +0100 bugbear <bugbear@trim_papermule.co.uk_trim> wrote:

b> Ted Zlatanov wrote:
>> One simple way, without using databases, is to take smaller pieces (say,
>> 10K lines each) and sort them individually by whatever field you need.
>> Then you take the top or bottom of each piece, make a new set, and sort
>> that set for the final result.
>>
>> If you need to sort the whole list and not just get the max/min, apply
>> the same algorithm except you keep each sorted piece open and keep
>> taking the smallest/largest element from the top/bottom of the piece
>> that contains it.
>>
>> For more information and if my explanation doesn't make sense, look up
>> the "merge sort" algorithm.

b> IIRC Linux/Unix sort used quicksort for in RAM
b> and merge sort (via disc) if the data size exceeds RAM size,
b> again using quicksort in RAM when the portion to be
b> merged fit in RAM.

Yes, but a) it writes its temp files in /tmp (unless you use -T in newer sort
implementations), b) it's not as flexible as what I described, and c) it
only works on Unix-like systems (on Windows you have to install Cygwin
or other packages, etc.).

(b) is particularly important IMO for anything but simple sorting.

Ted

RedGrittyBrick

May 20, 2008, 2:51:17 PM

In an earlier thread* you'll see the OP is planning to download 350
million records one at a time from the doba.com website. Sinan pointed
out this would take 3.7 years of continuous scraping (at 3 pages/sec).

Perhaps the OP is planning ahead.

--
RGB
* "Need ideas on how to make this code faster than a speeding turtle"

cha...@lonemerchant.com

May 20, 2008, 4:07:17 PM

On May 18, 1:51 pm, xhos...@gmail.com wrote:
> cha...@lonemerchant.com wrote:
> > I have roughly 350 million lines of data in the following form
>
> > name, price, weight, brand, sku, upc, size
>
> Name, in particular, seems like it might be able to contain embedded
> punctuation and might be escaped in some way. That could complicate
> things
>
> > sitting on my home PC.
>
> What kind of PC is your home PC?
>

My home PC is a 700 MHz Intel with 256 MB of RAM, running Fedora Core 6 Linux.

Bill H

May 20, 2008, 6:04:01 PM

On May 20, 2:51 pm, RedGrittyBrick <RedGrittyBr...@SpamWeary.foo>
wrote:

> * "Need ideas on how to make this code faster than a speeding turtle"

Well, if he was downloading them individually he should have sorted
them at the same time and killed two birds with one stone in those 3.7
years.

Bill H

BTW - what's up with Google now using CAPTCHA in their posting??