name, price, weight, brand, sku, upc, size
sitting on my home PC.
Is there some kind of sane way to sort this without taking up too much
ram or jacking up my limited CPU time?
I would throw it into MySQL.
<cha...@lonemerchant.com> wrote in message
news:1c6aca8c-f1ac-4113...@t12g2000prg.googlegroups.com...
At the risk of sounding like a total dumba--, is it possible to
upload a .csv file directly into MySQL?
Never mind. I can google the answer. Thanks.
Name, in particular, seems like it might be able to contain embedded
punctuation and might be escaped in some way. That could complicate
things.
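
To illustrate the concern about embedded punctuation, here is a minimal Python sketch. The sample row is made up, and it assumes the file uses standard double-quote CSV quoting, which the thread does not confirm. Splitting on commas breaks a quoted name apart, while a real CSV parser keeps it whole.

```python
import csv
import io

# Hypothetical sample row: the name field contains a comma and is quoted.
sample = '"Widget, large",9.99,1.5,Acme,SKU123,012345678905,L\n'

# Naive splitting cuts the quoted name into two pieces.
naive_fields = sample.rstrip("\n").split(",")

# csv.reader honours the quoting and returns the seven expected fields.
parsed_fields = next(csv.reader(io.StringIO(sample)))

print(len(naive_fields))   # 8
print(len(parsed_fields))  # 7
print(parsed_fields[0])    # Widget, large
```

If the real data escapes punctuation some other way (backslashes, for instance), the parser settings would have to change to match.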
> sitting on my home PC.
What kind of PC is your home PC?
> Is there some kind of sane way to sort this without taking up too much
> ram
As long as you have plenty of scratch space, Linux's system sort will
use temp files to sort things much larger than main memory. For all I
know, Windows' DOS emulator's sort will as well. But it is a matter of
whether you can get the system sort command to sort on the field and
collation sequence you want sorted. If not, you could use Perl to
transform the data into something more acceptable, use the system sort,
then transform it back.
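
The transform, sort, transform-back idea could look roughly like the sketch below. It is written in Python rather than Perl, and it assumes the sort field is the price in the second column and that prices are non-negative; the helper names decorate and undecorate are made up for this example. Each line gets a fixed-width key in front so that a plain byte-wise sort puts the lines in numeric order, and the key is stripped afterwards.

```python
import csv
import io

PRICE_FIELD = 1  # assumed column order: name, price, weight, brand, sku, upc, size

def decorate(line, field=PRICE_FIELD):
    """Prefix the line with a zero-padded price so a plain text sort orders by price."""
    row = next(csv.reader(io.StringIO(line)))
    key = "%012.2f" % float(row[field])  # e.g. 9.99 -> 000000009.99
    return key + "\t" + line

def undecorate(line):
    """Drop the key that decorate() added, restoring the original line."""
    return line.split("\t", 1)[1]

# Hypothetical rows; sorted() stands in for the system sort command here.
rows = [
    "Gadget,120.00,2.0,Acme,G1,111,M",
    '"Widget, large",9.99,1.5,Acme,W1,222,L',
    "Doohickey,10.50,0.3,Acme,D1,333,S",
]
decorated = [decorate(r) for r in rows]
result = [undecorate(d) for d in sorted(decorated)]
print(result[0])  # "Widget, large",9.99,1.5,Acme,W1,222,L
```

For the real file, the decorated lines would be written out, run through the system sort, and undecorated while reading the sorted output back.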
> or jacking up my limited CPU time?
Sorting 350 million records will take some CPU time. I don't know what
you consider to be "jacking up" or how limited you think your CPU time is.
My CPUs are limited to about 86,400 seconds per day, whether I am using
them or not.
Xho
Just out of curiosity, I would like to know how someone has a file
containing 350 million lines of product information sitting on a home
PC in the first place. I mean, it had to have come from some sort of
database to start with, and with those numbers we aren't talking about
a secondhand store.
Bill H
c> I have roughly 350 million lines of data in the following form
c> name, price, weight, brand, sku, upc, size
c> sitting on my home PC.
c> Is there some kind of sane way to sort this without taking up too much
c> ram or jacking up my limited CPU time?
One simple way, without using databases, is to take smaller pieces (say,
10K lines each) and sort them individually by whatever field you need.
Then you take the top or bottom of each piece, make a new set, and sort
that set for the final result.
If you need to sort the whole list and not just get the max/min, apply
the same algorithm except you keep each sorted piece open and keep
taking the smallest/largest element from the top/bottom of the piece
that contains it.
For more information and if my explanation doesn't make sense, look up
the "merge sort" algorithm.
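
As a rough sketch of the second variant (sort pieces, keep them all open, keep taking the smallest element), here is one way it might look in Python. The function name external_sort and its parameters are invented for this example, the key assumes price is the second field with no embedded commas, and every line is assumed to end with a newline.

```python
import heapq
import itertools
import os
import tempfile

def external_sort(lines, key, chunk_size=10_000, tmp_dir=None):
    """Yield lines in sorted order while holding about chunk_size lines in memory."""
    chunk_paths = []
    try:
        it = iter(lines)
        while True:
            # Sort one piece in memory and write it to its own temp file.
            chunk = list(itertools.islice(it, chunk_size))
            if not chunk:
                break
            chunk.sort(key=key)
            fd, path = tempfile.mkstemp(dir=tmp_dir, text=True)
            with os.fdopen(fd, "w") as f:
                f.writelines(chunk)
            chunk_paths.append(path)
        # Keep every sorted piece open and repeatedly take the smallest head.
        files = [open(p) for p in chunk_paths]
        try:
            yield from heapq.merge(*files, key=key)
        finally:
            for f in files:
                f.close()
    finally:
        for p in chunk_paths:
            os.remove(p)

# Hypothetical rows, sorted by price with a tiny chunk size to force a merge.
rows = ["c,3.50,x\n", "a,1.25,x\n", "d,10.00,x\n", "b,2.00,x\n", "e,0.99,x\n"]
by_price = lambda line: float(line.split(",")[1])
result = list(external_sort(rows, key=by_price, chunk_size=2))
print([r[0] for r in result])  # ['e', 'a', 'b', 'c', 'd']
```

With 350 million lines, 10K-line pieces would mean about 35,000 files open at once, which is more than most systems allow by default, so a real run would need larger pieces or several merge passes.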
Ted
IIRC Linux/Unix sort uses quicksort for in-RAM sorting
and merge sort (via disk) if the data size exceeds RAM size,
again using quicksort in RAM when the portion to be
merged fits in RAM.
BugBear
b> Ted Zlatanov wrote:
>> One simple way, without using databases, is to take smaller pieces (say,
>> 10K lines each) and sort them individually by whatever field you need.
>> Then you take the top or bottom of each piece, make a new set, and sort
>> that set for the final result.
>>
>> If you need to sort the whole list and not just get the max/min, apply
>> the same algorithm except you keep each sorted piece open and keep
>> taking the smallest/largest element from the top/bottom of the piece
>> that contains it.
>>
>> For more information and if my explanation doesn't make sense, look up
>> the "merge sort" algorithm.
b> IIRC Linux/Unix sort used quicksort for in RAM
b> and merge sort (via disc) if the data size exceeds RAM size,
b> again using quicksort in RAM when the portion to be
b> merged fit in RAM.
Yes, but a) it writes them in /tmp (unless you use -T in newer sort
implementations), b) it's not as flexible as what I described, and c) it
only works on Unix-like systems (on Windows you have to install Cygwin
or other packages, etc.).
(b) is particularly important IMO for anything but simple sorting.
Ted
In an earlier thread* you'll see the OP is planning to download 350
million records one at a time from the doba.com website. Sinan pointed
out this would take 3.7 years of continuous scraping (at 3 pages/sec).
Perhaps the OP is planning ahead.
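
The 3.7-year figure can be checked with a few lines of Python, assuming one record per page and 365.25 days per year:

```python
records = 350_000_000
pages_per_second = 3

seconds = records / pages_per_second
years = seconds / (86_400 * 365.25)  # 86,400 seconds per day

print(round(years, 1))  # 3.7
```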
--
RGB
* "Need ideas on how to make this code faster than a speeding turtle"
My home PC is a 700 MHz Intel with 256 MB RAM, running Fedora Core Linux 6.
Well, if he was downloading them individually, he should have sorted
them at the same time and killed two birds with one stone in those 3.7
years.
Bill H
BTW, what's up with Google now using CAPTCHA in their posting?