Anton Landerer writes:
> For my test run, I'm using an input of 100 WARC files from the
> latest crawl. It takes almost 2 hours for my cluster to process all
> these WARCs. My laptop can run about 1 WARC per minute (including
> downloading/streaming) with smart_open, so it can do the same job in
> 1.5 hours.
Just to confirm: your laptop is doing pretty well, but you're
certainly losing something in your cluster setup somehow.
For _local_ files, for example, here's my result for searching 100
local WARC files using 10 threads on a lightly loaded 2.10 GHz Intel
Xeon with 2 x 18-core CPUs:
>: time ls *001??.warc.gz | parallel -j 10 "uz '{}' | egrep -iac '\bthe\b'"
325270
300294
313315
...
311426
316116
327770
real 8m11.985s
user 84m48.443s
sys 6m25.282s
That's 5088 CPU-seconds (the user time) for 100 searches =~ 51
seconds of compute time each.
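For what it's worth, the same test is easy to reproduce in Python if you'd rather stay in one language end to end. This is just a sketch of an equivalent to the shell pipeline above (the glob pattern is the same one; note that, like egrep -c, it counts matching _lines_, not matches):

```python
# Rough Python equivalent of: parallel -j 10 "uz '{}' | egrep -iac '\bthe\b'"
# Counts, per gzipped file, the number of lines containing a
# case-insensitive whole-word "the", using a pool of 10 worker threads.
import glob
import gzip
import re
from concurrent.futures import ThreadPoolExecutor

WORD = re.compile(rb'\bthe\b', re.IGNORECASE)

def count_matches(path):
    # Stream the file line by line so memory stays flat even for
    # multi-gigabyte WARCs; count lines with at least one match,
    # mirroring egrep -c.
    hits = 0
    with gzip.open(path, 'rb') as f:
        for line in f:
            if WORD.search(line):
                hits += 1
    return hits

if __name__ == '__main__':
    paths = sorted(glob.glob('*001??.warc.gz'))
    with ThreadPoolExecutor(max_workers=10) as pool:
        for path, n in zip(paths, pool.map(count_matches, paths)):
            print(path, n)
```

Threads are fine here despite the GIL, since most of the time goes to gzip decompression and I/O, which release it.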
So streaming, even assuming you're using a single download thread for
your laptop test, must be taking most of the time.
And indeed streaming files these days from AWS S3 is taking me at least an
hour per segment (for 2022-33, as it happens) using 10 threads == 600
thread-minutes for 800 files == 45 seconds per file.
So adding the two per-file costs on my setup we get 51 + 45 = 96
seconds =~ 1.5 minutes per WARC file, the same as your laptop.
Conclusion: your cluster is not set up properly, as you're not
getting _any_ benefit (indeed it's costing you) from multiplexing the
job over somewhere between 20 and 40 pairs of threads.
ht
--
Henry S. Thompson, School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND
e-mail: h...@inf.ed.ac.uk
URL: https://www.ltg.ed.ac.uk/~ht/
[mail from me _always_ has a .sig like this -- mail without it is forged spam]