Parsing WARC - Beautiful Soup - Extract Specific Content - SLOW


Alban Tranchard

Jun 11, 2020, 5:37:01 AM
to Common Crawl
Dear All, Dear Sebastian

I had the ambition to do the following: go over all English-content WARC records and check whether the HTML text relates to a cyber topic, extracting it if so.
I have studied the various repos carefully and am now experimenting on AWS EMR with PySpark.
Before running a full script, I am working/testing in an EMR Jupyter notebook.

If I select all the English-content text pages in the latest May-June crawl, I have a whopping 80,764,430 records to go through:

df = spark.read.option("mergeSchema", "true").parquet('s3://commoncrawl/cc-index/table/cc-main/warc/')
df.createOrReplaceTempView('ccindex')
sqldf = spark.sql('SELECT url, warc_filename, warc_record_offset, warc_record_length \
                  FROM ccindex \
                  WHERE crawl = "CC-MAIN-2020-24" \
                  AND content_languages="eng" \
                  AND subset = "warc" \
                  AND (content_mime_detected="text/html" OR content_mime_detected="application/xhtml+xml")')

sqldf.createOrReplaceTempView('records')


To test-drive my code, I select the first 10,000 records:

warc_recs=spark.sql('Select url, warc_filename, warc_record_offset, warc_record_length FROM records LIMIT 10000').rdd
output = warc_recs.mapPartitions(process_warc_records)

output_schema = StructType([
        StructField("key", LongType(), True),
        StructField("uri", StringType(), True),
        StructField("html", StringType(), True)
    ])

spark.createDataFrame(output, schema=output_schema) \
        .coalesce(1) \
        .write \
        .mode('append').format('parquet') \
        .option("compression", 'gzip') \
        .save('s3a://chk2817-commoncrawl/crawl_output')

On a cluster with 1 master c5.xlarge and 5 c5.xlarge core nodes, it took ~1700 seconds to complete. The setup is EMR-5.30.0 with Hadoop, Ganglia, Hive, Spark, and Livy, with the following config: [{"classification":"spark","properties":{"maximizeResourceAllocation":"true"}}]
and with all the dependencies installed via bootstrap actions.

  • Is this expected to be so slow?
  • Is it just a question of scaling up the cluster instances? If so, what would you recommend?
  • Is it the notebook environment making it slow? Would a script be much faster?
  • Is my ambition realistic at all?




I have the following basic functions/routines:

def process(record):
    try:
        content = record.content_stream().read()
        encoding = EncodingDetector.find_declared_encoding(content, is_html=True)
        soup = BeautifulSoup(content, "lxml", from_encoding=encoding)
        # strip all script and style elements
        for script in soup(["script", "style"]):
            script.decompose()

        regex = re.compile(r"\b(?=\w*cyber)\w+\b", re.IGNORECASE)
        text = soup.get_text(" ", strip=True)

        if len(re.findall(regex, text)) > 10 and len(text) > 500:
            regex2 = re.compile(r"\b(?=\w*sex)\w+\b", re.IGNORECASE)
            if len(re.findall(regex2, text)) > 0:
                return 0, "", "adult"
            else:
                return 1, record.rec_headers.get_header('WARC-Target-URI'), content
        else:
            return 0, "", "not cyber"
            #return 2, record.rec_headers.get_header('WARC-Target-URI'), content
    except Exception:
        return 0, "", "error"

and,

def process_warc_records(rows):

    s3client = boto3.client('s3')
    for row in rows:
        url = row['url']
        warc_path = row['warc_filename']
        offset = int(row['warc_record_offset'])
        length = int(row['warc_record_length'])
        rangereq = 'bytes={}-{}'.format(offset, (offset+length-1))
        response = s3client.get_object(Bucket='commoncrawl',Key=warc_path,Range=rangereq)
        record_stream = BytesIO(response["Body"].read())
        try:
            for record in ArchiveIterator(record_stream):
                yield process(record)
        except ArchiveLoadFailed as exception:
            continue



[attachment: code cyber extract.ipynb]

Colin Dellow

Jun 11, 2020, 10:52:59 AM
to Common Crawl
On Thursday, 11 June 2020 05:37:01 UTC-4, Alban Tranchard wrote:
> Dear All, Dear Sebastian

Hi Alban,
A c5.xlarge has 4 cores, so with 5 such instances, you have 20 cores. If the job takes 1700 seconds, it means you are taking 34,000 CPU seconds to process 10,000 pages, or 3.4 seconds per page.

That is very slow.

For comparison, when doing a regex search on a single WARC file using a single core of an m6g instance, I can process 1 WARC per 50 seconds. Since a WARC has ~50,000 records, that's a rate of 0.001 seconds per page.

This is not an apples-to-apples comparison (WARC vs range requests, regex vs parsing HTML), but I think you can use some of the same principles I use to improve your processing time.
  • Is it just a question of scaling up the cluster instances, if so what would you recommend
More instances will let you complete faster, but your total cost spent will be the same. It will likely be much more expensive than it ought to be.
  • Is it the notebook environment making it slow? would a script be much faster?
I don't think it's notebook vs script.

Looking at your code, I have a few questions/comments:

1) How much of the 1,700 seconds is spent doing the spark.sql query on the parquet index? It should be a minimal amount (< 10%), but just want to be sure.

2) In your process method, can you check for the presence of your desired keyword _before_ parsing as HTML, and before doing content encoding detection & decoding? Parsing HTML and character sets is notoriously slow. If you know that you only care about pages with the word "cyber" in them, do a simple check for that string and stop processing before doing your more expensive operations.

3) The goal is to have 100% CPU usage on your Spark cluster while your job is running. If you SSH to a worker node and monitor CPU usage with htop (or monitor in the EC2 console), do you observe 100% CPU use? I suspect you won't.

I suspect that will be because you are using the Parquet indexes and individual Range queries to fetch single WARC entries. This means that before the processing for a page can happen, you have to wait for the page to be fetched from S3. S3 is pretty fast, but even so, each request will take 20-100ms. During that time, your CPU is sitting idle, which is inefficient. By contrast, your own code probably only needs 1-10ms to process the page once it's been fetched.

A solution here is to fetch the pages in parallel - rather than requesting pages serially, one at a time, request them concurrently in batches of, say, 100 at a time using an asynchronous HTTP client, then process each batch as it completes.

There's a lot of wiggle room for fine-tuning this process, but even a very naive approach should result in significant improvements. Depending on your needs, that improvement may be enough.
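One way to sketch that batching (an untested sketch, not a tested implementation; it assumes the caller creates a boto3 S3 client, and rows shaped like the columnar-index rows earlier in the thread):

```python
from concurrent.futures import ThreadPoolExecutor
from io import BytesIO

def byte_range(offset, length):
    # HTTP Range header value for a single WARC record
    return 'bytes={}-{}'.format(offset, offset + length - 1)

def fetch_record(s3client, row):
    # One ranged GET against the public commoncrawl bucket
    resp = s3client.get_object(
        Bucket='commoncrawl',
        Key=row['warc_filename'],
        Range=byte_range(int(row['warc_record_offset']),
                         int(row['warc_record_length'])))
    return BytesIO(resp['Body'].read())

def fetch_batch(s3client, rows, max_workers=100):
    # Keep up to max_workers range requests in flight at once, so the
    # CPU is not sitting idle during each 20-100 ms S3 round-trip
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda r: fetch_record(s3client, r), rows))
```

A thread pool is the simplest stand-in for an async HTTP client here; since the workers spend their time blocked on I/O, Python's GIL is not the bottleneck.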

Hope that helps!
Colin

Alban Tranchard

Jun 11, 2020, 11:56:52 AM
to Common Crawl


Thanks Colin for the reply, I am stuck and anxious to get some help here :=)

So I must admit, as a newbie, some of the details of your feedback are difficult to understand, but I get the gist of it.

Clearly the bottleneck is in process_warc_records / process. I will adapt it to only parse if the regex finds keywords directly in record.content_stream().read().

If accessing individual ranges is slower (I understand S3 cannot fetch multiple ranges in one request), then maybe a better option would be to go through entire WARC files instead: check for WARC-Type: response, Content-Length > min_size, and check the content type (like an is_html function).
But I am not sure how to check for the eng language. Maybe it's not needed, and I can simply rely on the regex search as a first collection pass, then refine further by analyzing the fetched and saved pages.

The other option you mention is fetching pages in parallel. That, I do not know how to do - the "100 at a time using an asynchronous HTTP client" part...

Thanks
   
 

Colin Dellow

Jun 11, 2020, 12:34:35 PM
to Common Crawl
On Thursday, 11 June 2020 11:56:52 UTC-4, Alban Tranchard wrote:
> if accessing individual range is slower (which i understand s3 cannot get multiple ranges) then maybe a better option would be to go through entire WARC files instead? check for WARC-Type: response, Content-Length: x > min_size, check content_type (like is_html function)
> but not sure how to check for the eng language. Maybe not needed and simply rely on the regex search as a first collection pass. and then refine further by analyzing further the fetched-saved pages

This might provide a good enough speedup, and would probably be fast to prototype and try.

If you did want to check for the eng language, you'd need to tweak your code to process the WARC's entries in groups of 3. For each URL that is captured, there is a request entry, a response entry, and a metadata entry. The metadata entry will have a line like:

languages-cld2: {"reliable":true,"text-bytes":2464,"languages":[{"code":"en","code-iso-639-3":"eng","text-covered":0.99,"score":952.0,"name":"ENGLISH"}]}

that you can inspect to figure out the language of the preceding response. (In the parquet index case, you don't see the request/metadata entries -- only the reponse entries.)
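A sketch of inspecting that line (it assumes the metadata record's body has already been read and decoded to a string; the field layout follows the languages-cld2 example above):

```python
import json

def is_english(metadata_body):
    # Scan the metadata record's "key: value" lines for languages-cld2
    # and check the top-ranked language of the preceding response
    for line in metadata_body.splitlines():
        if line.startswith('languages-cld2:'):
            info = json.loads(line.split(':', 1)[1])
            languages = info.get('languages', [])
            return bool(languages) and languages[0].get('code-iso-639-3') == 'eng'
    return False
```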

 

> Other option you mention, fetch pages in parallel. That, I do not know how to improve by doing the 100 at a time using an asynchronous HTTP client...

Unfortunately, I'm not super familiar with Python, so I can only sketch the approach and you'd have to do some trial and error to make it work.

The gist is to change your process_warc_records. Right now it uses a synchronous S3 client and fetches requests serially. Instead, we'd like to change it to use an HTTP client that can fetch things in parallel. Something like the code in https://gist.github.com/cldellow/ccc53258dc6ede488a5811bb0739b54b should work, although I will caution that I haven't tried this code and there may be some dumb mistakes in it. Hopefully it illustrates the idea, though.


 

   
 

Alban Tranchard

Jun 11, 2020, 1:31:06 PM
to Common Crawl
Thanks Colin 

Very helpful, I will definitely try your code.

A stupid question, I am sure: the WARC record's .content_stream().read() returns a bytes object.

I am trying a simple regex like regex=re.compile(b"\b(?=\w*html)\w+\b",re.IGNORECASE), which compiles to b'\x08(?=\\w*html)\\w+\x08',
but re.findall(regex,content) returns nothing ([]).
Am I missing a dumb silly thing? I wanted to avoid decoding up-front.
A

Colin Dellow

Jun 11, 2020, 1:52:31 PM
to Common Crawl
The Python regex module does work on bytes, but rather than fight it, I'd make a tradeoff here and use something less powerful than regexes for the initial filter. The goal is to find a fast enough filter that lets you reject entries that you definitely won't care about. Once something has passed that filter, you can use the more expensive regex + HTML parsing methods.

Based on your first example, perhaps checking for the word cyber is enough?

content.find(b'cyber') >= 0

Or, to simulate case-insensitivity, check for popular casings:

content.find(b'cyber') >= 0 or content.find(b'CYBER') >= 0 or content.find(b'Cyber') >= 0  

This means you'll miss some things that your case insensitive regex would have accepted (eg CyBeR) -- that's probably a reasonable tradeoff.

You could even consider collapsing two of the cases, cyber and Cyber, by only checking for their common substring:

content.find(b'yber') >= 0 or content.find(b'CYBER') >= 0

The tradeoff here is that while the initial filter is faster, because it's only checking 2 strings instead of 3, there will be more false positives that have to be processed by your more expensive second stage (e.g. the Czech word vybere will pass this initial filter, and then get rejected by your second stage).

False positives are OK, because they only affect performance, not correctness. Your regular code will still have a chance to filter these out after they do decoding, HTML parsing and your regexes. With some trial and error, you'll find the right balance between making an initial filter that is fast enough and minimizes false positives. As with requesting entries in parallel, there's lots of wiggle room for optimizing it, but you'll get most of the improvement from doing _something_, even if it's fairly naive.
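Putting that together, a minimal version of the two-stage filter might look like this (a sketch; the casing and false-positive tradeoffs are the ones discussed above):

```python
def passes_prefilter(content):
    # Cheap byte-level scan of the raw, undecoded record body.
    # 'yber' collapses the cyber/Cyber casings into a single check;
    # false positives (e.g. Czech 'vybere') slip through here and are
    # rejected later by the expensive decode + parse + regex stage.
    return content.find(b'yber') >= 0 or content.find(b'CYBER') >= 0
```

In the process method, this check would run before EncodingDetector and BeautifulSoup, so most records get rejected for the cost of a substring scan.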


Tom Morris

Jun 11, 2020, 6:27:59 PM
to common...@googlegroups.com
You know that there are WET files available with the text pre-extracted for you, right? That would save having to use Beautiful Soup at all.

Tom

--
You received this message because you are subscribed to the Google Groups "Common Crawl" group.
To unsubscribe from this group and stop receiving emails from it, send an email to common-crawl...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/common-crawl/d8b05ec2-176c-4812-9949-41ea849eabedo%40googlegroups.com.

Greg Lindahl

Jun 12, 2020, 12:50:12 AM
to common...@googlegroups.com
On Thu, Jun 11, 2020 at 10:31:05AM -0700, Alban Tranchard wrote:

> like regex=re.compile(b"\b(?=\w*html)\w+\b",re.IGNORECASE) being

This is a common Python mistake: if you want \b to mean the regex \b,
you either have to use a raw literal (in Python 3, rb"" works for bytes
too) or escape it like

re.compile(b"\\b(?=\\w*html)\\w+\\b", re.IGNORECASE)

I correct this in my smart colleague's code once every few weeks, so
don't feel bad :-)
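To make the failure mode concrete (Python 3, where raw bytes literals rb"" are accepted):

```python
import re

# In a plain bytes literal, \b is the backspace byte 0x08,
# not the two characters backslash + b that the regex engine needs
assert b"\b" == bytes([0x08])

# Escaping the backslashes (or using a raw literal) fixes it
pattern = re.compile(b"\\b(?=\\w*html)\\w+\\b", re.IGNORECASE)
assert re.findall(pattern, b"some html content") == [b"html"]

# A raw bytes literal spells the same pattern without the doubling
assert rb"\b(?=\w*html)\w+\b" == b"\\b(?=\\w*html)\\w+\\b"
```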

Alban Tranchard

Jun 12, 2020, 1:40:32 AM
to Common Crawl
Hi Tom

Yes, it definitely sounds easier to work with and ingest the WET files.

Though I was wondering if it is easy to clean and extract the relevant text for a topic from plain text without any HTML info.

For example, with the HTML info you could search for keywords in header tags, then extract the text of all children elements and omit the rest of the page that might not be relevant.
With plain text, that seems way more challenging?

Thanks

Alban Tranchard

Jun 12, 2020, 1:42:49 AM
to Common Crawl
Thanks a lot!

So basic that I could not find the solution on SO!

Thanks again

Sebastian Nagel

Jun 12, 2020, 2:15:42 AM
to common...@googlegroups.com
Hi Alban, hi Tom,

> Yes it sounds definitely easier to work and ingest from WET files.

I definitely agree and want to second Tom on this point:

- starting with the latest crawl, the content language(s) are annotated in the WET files
  (you'll find more details in the May/June crawl announcement)

- using WET files definitely makes sense if you're interested in English content, since
  languages are not evenly distributed and English is dominant [1] in the Common Crawl data.
  That's different if you focus on a language from the long tail, where you can pick
  the relatively few pages with the help of the columnar index. Some numbers from the May/June crawl:
2.75 billion page captures in total
1.15 billion English pages
1.33 million Icelandic pages
53.2 TiB WARC files
8.4 TiB WET files


> For ex, with html info, you could search keywords in header tags and the extract text of all
> children elements and omit the rest of a page that might not be relevant.
> With plain text, it seems way more challenging ?

With plain text only, that's not possible. You need the HTML, i.e. the WARC files.

But you could also later use the index (CDX or columnar) to fetch the HTML from the WARC archives for your result set - provided it's reasonably small.

Best,
Sebastian

[1] https://commoncrawl.github.io/cc-crawl-statistics/plots/languages

Alban Tranchard

Jun 12, 2020, 7:45:25 AM
to Common Crawl
Dear Sebastian and All,

Thanks for all the info. I have managed to progress today - a good Friday, one could say.

So the revised strategy is to create a first extraction from the WET files. On that I can already do further text analysis to refine the selection before getting the relevant WARC records.

On the AWS cluster with 5 workers, I managed to parse 100 WET files in about 1050 seconds, which results in 1732 URI entries containing at least 10 references to *cyber* without being of adult nature.

In your experience, is this acceptable performance as it stands? Is it now a question of scaling the cluster, or are there ways to further optimize the code?

On the AWS EMR Spark config itself, all I did was set [{"classification":"spark","properties":{"maximizeResourceAllocation":"true"}}] in the cluster configuration - not sure if additional config optimization is required.


main code

input_data = sc.textFile('s3://.../input/wet_paths_100.txt', minPartitions=400)  # yields 415 partitions

def process_wetfiles(iterator):
    # S3 client is not thread-safe: initialize once per partition,
    # outside the per-file loop
    no_sign_request = botocore.client.Config(signature_version=botocore.UNSIGNED)
    s3client = boto3.client('s3', config=no_sign_request)
    bucketname = 'commoncrawl'

    # compile once per partition; note the escaped backslashes instead
    # of a bare \b (thanks Greg, small errors are always the most painful!)
    regex_cyber = re.compile(b"\\b(?=\\w*cyber)\\w+\\b", re.IGNORECASE)
    regex_adult = re.compile(b"\\b(?=\\w*sex)\\w+\\b", re.IGNORECASE)

    for uri in iterator:
        wettemp = TemporaryFile(mode='w+b')
        try:
            s3client.download_fileobj(bucketname, uri, wettemp)
        except botocore.exceptions.ClientError:
            wettemp.close()
            continue

        wettemp.seek(0)
        stream = wettemp

        try:
            archive_iterator = ArchiveIterator(stream, no_record_parse=False)
            for record in archive_iterator:
                try:
                    if record.rec_headers['WARC-Identified-Content-Language'] == 'eng' \
                            and record.rec_type == 'conversion' \
                            and record.content_type == 'text/plain':
                        content = record.content_stream().read()
                        if len(re.findall(regex_cyber, content)) >= 10 \
                                and len(re.findall(regex_adult, content)) == 0:
                            # maybe skipping the decode could speed this up?
                            yield 1, record.rec_headers['WARC-Target-URI'], content.decode('utf-8')
                        else:
                            yield 0, '', ''
                    else:
                        yield 0, '', ''
                except Exception:
                    yield 0, '', ''
        except ArchiveLoadFailed:
            continue
        finally:
            stream.close()

output = input_data.mapPartitions(process_wetfiles).filter(lambda x: x[0] == 1)  # only keep entries matching the regex criteria

# coalesce(1): for this test case I chose 1 as the file size is small
spark.createDataFrame(output, schema=output_schema) \
        .coalesce(1) \
        .write \
        .mode('append').format('parquet') \
        .option("compression", 'gzip') \
        .save('s3a://.../crawl_output')


Thanks again for your valuable feedback; as a newbie in AWS, clusters, Spark, etc., this has been a fun learning week :=)

 

Sebastian Nagel

Jun 12, 2020, 10:15:33 AM
to common...@googlegroups.com
Hi Alban,

> on the AWS cluster with 5 workers, i managed to parse 100 WET files in about *1050 seconds*

There are 60,000 WET files for May/June 2020, so to process the entire data you'd need
about 7 days:

1050 * 600 / (60*60*24) = 7.29

My feeling: could be somewhat faster but you're already close to the optimum.

You could try the following optimizations:

- use the regex b"cyber" instead of b"\\b(?=\\w*cyber)\\w+\\b"
  * my benchmarks show that it's faster by a factor of almost 2
  * there's a subtle but maybe negligible difference:
    the word "cybercyber" counts as two matches.
    If exact word counts are mandatory, it's probably more efficient
    to re-count matches in a second pass
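A sketch of that two-pass counting (the threshold mirrors the job above; treat the structure, not the numbers, as the point):

```python
import re

FAST = re.compile(b"cyber", re.IGNORECASE)
FULL = re.compile(b"\\b(?=\\w*cyber)\\w+\\b", re.IGNORECASE)

def cyber_word_count(content, threshold=10):
    # First pass: raw substring count, roughly 2x faster.
    # Only candidates that reach the threshold pay for the second,
    # word-level pass (which counts "cybercyber" once, not twice).
    if len(FAST.findall(content)) < threshold:
        return 0
    return len(FULL.findall(content))
```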


> yield 0,'',''

- better "continue" *and* remove the filter ".filter(lambda x: x[0]==1)".
  In the worst case, this forces Spark to switch twice to Python and back to Scala/Java.
  PySpark is a hybrid:
  - a Spark data flow running in the JVM is configured from Python
  - certain methods (filter, map, reduce, etc.) are executed in Python, and data
    is piped back and forth between the JVM and Python
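The shape of that change, in isolation (with hypothetical is_match/extract helpers standing in for the regex checks and field extraction):

```python
def process_wetfiles(records, is_match, extract):
    # Before: yield (0, '', '') for every miss, then drop them with
    # .filter(lambda x: x[0] == 1) -- each placeholder row still
    # crosses the Python <-> JVM boundary.
    # After: skip misses inside the generator, so only real hits
    # are ever handed back to Spark.
    for record in records:
        if not is_match(record):
            continue
        yield extract(record)
```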

- you could configure boto3 not to use https (use_ssl=False, see [1]). Colin reported in [2]
  that this makes a difference. I still have not tried it; the impact could be
  small when you fetch large files rather than single WARC records.
  Since the data is public and you're on the AWS cloud, SSL might not be necessary.
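For reference, the client construction with both tweaks (unsigned requests as in the job above, plus plain HTTP; not benchmarked here either):

```python
import boto3
import botocore.client

# Unsigned, plain-HTTP client for the public commoncrawl bucket;
# dropping TLS may reduce per-request overhead inside AWS
no_sign_request = botocore.client.Config(signature_version=botocore.UNSIGNED)
s3client = boto3.client('s3', use_ssl=False, config=no_sign_request)
```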


Best,
Sebastian

[1] https://boto3.amazonaws.com/v1/documentation/api/latest/reference/core/session.html
[2] https://code402.com/blog/s3-scans-vs-index/



Alban Tranchard

Jun 12, 2020, 12:40:27 PM
to common...@googlegroups.com
Thanks Sebastian

I have made the change and it works like a charm.
Awaiting AWS to grant me a higher CPU limit :=)

One confirmation, to make sure I am not doing something wrong here:

I capture the URI and the text. I assume the URI is what I need to work with the columnar index afterwards - or am I missing other important rec_headers fields?

Thanks again



Alban Tranchard

Jun 13, 2020, 4:09:57 AM
to Common Crawl
Dear all

I wanted to provide an update.
This morning I ran the code on the remaining 47,901 WET files, using a cluster of 75 c5.xlarge instances (300 vCPUs) on AWS.
The query ran in 53 minutes.

Yesterday I noticed that the number of partitions given in coalesce() for writing back to S3 played an important part in CPU usage;
for this run, I have 400 input data partitions and 40 in the coalesce().

While the CPU usage was maxed out for the first part of the job, I noticed the following profile and wanted to ask whether this is expected in your experience
(see screenshots). I also had two failed tasks - not sure how to debug those...

Overall, FYI: over the latest month's WET extract, I have collected a total of 880,894 URI entries with cyber mentioned at least 10 times and of non-adult content - 3.8 GB of text to be further analyzed and refined.
NLP work all the way now.

A first milestone for me and all my thanks for all the help and advice given by you. It would have been much more painful without it.

[attachments: cpu usage.PNG, memory.PNG, load.PNG]

Sebastian Nagel

Jun 15, 2020, 9:33:11 AM
to Common Crawl
Hi Alban,

> the query has run in 53mins

Great! That means a "grep" filtered by language over an entire monthly crawl in less than one hour. Even with the initial cluster of only 5 * c5.xlarge the job should take less than one day.

> 400 input data partitions and 40 in the coalesce()

To utilize all nodes of the cluster, you'd need at least 75 partitions, or a multiple of that to utilize all cores (150, assuming a c5.xlarge has 2 physical cores / 4 threads).

Also, running the job on a smaller cluster may give you better utilization. Finally, you could use EC2 spot instances to save money - if so, split the job into parts (e.g. each processing 10% of the WET files) and upload the results to S3 after each part completes.

> I assume the uri is what I need to work with the columnar index after in case, or am I missing other important rec_headers fields?

Yes, you can either use the CDX index (https://index.commoncrawl.org/) to look up the URLs or do a join on the columnar index. There is a description of how to do this given a list of domain names:

It should be straightforward to adapt this to a join on the "url" column.

Best,
Sebastian

Sebastian Nagel

Jun 15, 2020, 12:11:58 PM
to common...@googlegroups.com
Sorry, I forgot to copy the link. Here it is:

> There is a description how to do this given a list of domain names:

https://github.com/commoncrawl/cc-index-table/blob/master/src/sql/examples/cc-index/count-domains-alexa-top-1m.sql

Tom Morris

Jun 15, 2020, 2:19:06 PM
to common...@googlegroups.com
On Mon, Jun 15, 2020 at 9:33 AM Sebastian Nagel <seba...@commoncrawl.org> wrote:
>
> > 400 input data partitions and 40 in the coalesce()
>
> To utilize all nodes of the cluster, you'd need at least 75 partitions or a multiple of them to utilize all cores (150 assumed that c5.xlarge has 2 physical cores / 4 threads).
>
> Also running the job on a smaller cluster may give you a better utilization. Finally, you could use EC2 spot instances to save money - if so: split the job into parts (eg. each processing 10% of the WET files) and upload the results to S3 after a job completes.

I agree with all of Sebastian's suggestions, but another way to get better utilization, if turnaround time is important, would be to use more partitions. 
It strikes me that 400 (or is it actually 415 from your earlier note?) isn't a good match for your 300 available processors.
I suspect the phases are: 1) 300 partitions fully occupying the cluster, 2) remaining ~100 partitions partially occupying the cluster, 3) stragglers, 4) output

Using either fewer processors or many more partitions might make for better fitting of load to capacity.

Also, writing a single output file represents a natural bottleneck. It doesn't look like you've got a ton of output data currently, but it's something to keep in mind. You're still paying for those 300 vCPUs while just a few of them are doing the output.

Tom