Client context parameters are configurable on a client instance via the client_context_params parameter in the Config object. For more detailed instructions and examples on the exact usage of context params see the configuration guide.
This by itself isn't tremendously better than the client in the accepted answer (although the docs say that it does a better job retrying uploads and downloads on failure) but considering that resources are generally more ergonomic (for example, the s3 bucket and object resources are nicer than the client methods) this does allow you to stay at the resource layer without having to drop down.
In Boto3, there are two ways to retrieve an object: get_object and download_fileobj. get_object is easier to work with but slower for large objects, while download_fileobj is a managed transfer service that uses parallel range GETs if an object is larger than a configured threshold. My FastS3 library mirrors this logic, reimplemented in Rust. s3fs enables reading from objects using a pattern similar to standard Python file opens and reads.
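The "parallel range GETs above a threshold" idea can be sketched in plain Python. This is only an illustration of the technique, not Boto3's or FastS3's actual implementation: the names split_ranges, ranged_download and fetch_range are hypothetical, and the threshold, part size and worker count are values picked for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def split_ranges(size: int, part_size: int) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) byte ranges that cover `size` bytes."""
    return [(start, min(start + part_size, size) - 1)
            for start in range(0, size, part_size)]


def ranged_download(fetch_range: Callable[[int, int], bytes],
                    size: int,
                    threshold: int = 8 * 1024 * 1024,
                    part_size: int = 8 * 1024 * 1024,
                    max_workers: int = 10) -> bytes:
    """Fetch an object of known `size`, in parallel parts if it is large.

    `fetch_range(start, end)` is a hypothetical callable that returns the
    bytes start..end (inclusive) of the object.
    """
    if size == 0:
        return b""
    if size <= threshold:
        # Small object: a single request is enough.
        return fetch_range(0, size - 1)
    ranges = split_ranges(size, part_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so the parts join correctly.
        return b"".join(pool.map(lambda r: fetch_range(*r), ranges))
```

With a Boto3 client, fetch_range could be something like lambda s, e: client.get_object(Bucket=bucket, Key=key, Range=f"bytes={s}-{e}")["Body"].read(), with the object size taken from a prior head_object call.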
The first experiment measures retrieval (GET) time for large objects using FastS3, s3fs, and both Boto3 codepaths. The goal is to retrieve an object from FlashBlade S3 into Python memory as fast as possible. All four functions scale linearly as the object size increases, with the Rust-based FastS3 being 3x and 2x faster than s3fs-read/boto3-get and boto3-download respectively.
Python's prominence in data science and machine learning continues to grow, and the mismatch in performance between accessing object storage data and compute hardware (GPUs) continues to widen. Faster object storage client libraries are required to keep modern processors fed with data. This blog post has shown that one way to significantly improve performance is to replace native Python Boto3 code with compiled Rust code. Just as NumPy makes computation in Python efficient, a new library needs to make S3 access more efficient.
tl;dr: You can download files from S3 with requests.get() (whole or as a stream) or use the boto3 library. Although there are slight differences in speed, the network I/O matters more than which implementation you choose.
So what's the fastest way to download them? In chunks, all in one go, or with the boto3 library? I should warn that if the object we're downloading is not publicly exposed, I actually don't know how to download it other than by using the boto3 library. In this experiment I'm only concerned with publicly available objects.
I'm actually quite new to boto3 (the cool thing was to use boto before) and from some StackOverflow-surfing I found this solution to support downloading of gzipped or non-gzipped objects into a buffer:
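The snippet referred to above is not reproduced here. As a rough sketch of the idea only, assuming the object is downloaded with the client's download_fileobj method and that gzip is detected from the two-byte gzip magic number (the original solution may well have checked something else, such as the Content-Encoding header), it could look like this. The function name download_to_bytes is hypothetical.

```python
import gzip
import io

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def download_to_bytes(s3_client, bucket: str, key: str) -> bytes:
    """Download an object into memory, decompressing it if it is gzipped.

    `s3_client` is expected to be a boto3 S3 client, e.g. boto3.client("s3").
    """
    buffer = io.BytesIO()
    s3_client.download_fileobj(bucket, key, buffer)
    raw = buffer.getvalue()
    if raw[:2] == GZIP_MAGIC:
        # Assumption: content starting with the gzip magic number is gzipped.
        return gzip.decompress(raw)
    return raw
```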
The first part of the DAG includes the import statements required. The core part of the DAG is the s3_extract function. This function uses the Airflow S3 Hook to initialize a connection to AWS. It then gets the file using the key and bucket name. The file is then downloaded using the download_fileobj method provided by the boto3 S3 client.
I started thinking that client-side encryption would be useful as well. AES is tried and tested, and it's easy to find sample code to do it. But it seems wasteful to first create encrypted files on your hard drive, then upload them to AWS and finally delete everything.
Also, when you are seeking back to start, you need to reset the AES encryption, as boto3 does two passes on the upload (presumably for checksumming). Here's the final wrapper (with "write" support as well to support on-the-fly decryption when downloading from S3 and writing to disk):
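The author's wrapper is not reproduced here. The sketch below only illustrates the "reset the cipher when the stream is rewound" logic for the read side, under loud assumptions: TransformingReader and make_toy_xor are hypothetical names, and the toy XOR transform is NOT encryption. It merely stands in for a stateful AES cipher object from a real crypto library so that the example runs with the standard library alone.

```python
import io
from typing import Callable

Transform = Callable[[bytes], bytes]


class TransformingReader:
    """File-like reader that passes data through a stateful transform."""

    def __init__(self, fileobj, make_transform: Callable[[], Transform]):
        self._fileobj = fileobj
        self._make_transform = make_transform
        self._transform = make_transform()
        self._in_sync = True  # transform state matches the stream position

    def read(self, size: int = -1) -> bytes:
        if not self._in_sync:
            raise ValueError("read after seek to a non-zero position")
        return self._transform(self._fileobj.read(size))

    def seek(self, offset: int, whence: int = io.SEEK_SET) -> int:
        pos = self._fileobj.seek(offset, whence)
        if pos == 0:
            # Rewound to the start: begin a fresh pass with a fresh cipher.
            self._transform = self._make_transform()
        self._in_sync = pos == 0
        return pos

    def tell(self) -> int:
        return self._fileobj.tell()


def make_toy_xor(key: bytes) -> Callable[[], Transform]:
    """Toy position-dependent XOR. A placeholder only, NOT secure."""
    def factory() -> Transform:
        pos = 0

        def transform(chunk: bytes) -> bytes:
            nonlocal pos
            out = bytes(b ^ key[(pos + i) % len(key)]
                        for i, b in enumerate(chunk))
            pos += len(chunk)
            return out
        return transform
    return factory
```

Because the transform is recreated on every rewind, two full passes over the same source produce identical output, which is what a second pass over the upload needs. Seeking to the end to probe the size and then back to 0 is also tolerated.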
Armed with the above class, it becomes trivial to adapt the boto3 AWS S3 examples to encrypt on the fly during upload, and decrypt on the fly during download. Note that you need to configure boto3 properly before running the code below, so follow the SDK docs first and only do this after you've successfully run their example without encryption.
From what I have tried, download / upload would not work at all. The reason for that is that upload_fileobj & download_fileobj are creating threads.
From what I know of Streamlit so far, subthreading can produce issues if not handled properly.
For example, I have a missing ReportContext when running a slightly modified version of your code. See "Improve 'missing ReportContext' threading error", Issue #1326 in streamlit/streamlit on GitHub.
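One thing that might be worth trying, since the problem appears to come from the transfer threads: boto3's TransferConfig accepts use_threads=False, which runs the managed transfer in the calling thread. Whether this actually avoids the ReportContext error is an assumption I have not verified. The helper name download_single_threaded is hypothetical.

```python
def download_single_threaded(s3_client, bucket: str, key: str, fileobj) -> None:
    """Download an object with boto3's managed transfer, without threads.

    `s3_client` is expected to be a boto3 S3 client, e.g. boto3.client("s3").
    """
    # Imported here so this module can be loaded even where boto3 is absent.
    from boto3.s3.transfer import TransferConfig

    # use_threads=False keeps all transfer logic in the calling thread.
    config = TransferConfig(use_threads=False)
    s3_client.download_fileobj(bucket, key, fileobj, Config=config)
```

The same Config argument is accepted by upload_fileobj, so the upload path could be handled the same way.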