AWS S3 MultiPart Upload with Python and Boto3

In this blog post, I’ll show you how to perform multi-part uploads to S3 for files of basically any size. We’ll also make use of callbacks in Python to keep track of progress while our files are being uploaded to S3, and of threading to speed up the process and make the most of it. And I’ll explain everything you need to do to get your environment set up and the implementation you need to have it all up and running!


Hi,


This is part of my course on S3 Solutions at Udemy, in case you’re interested in how to implement solutions with S3 using Python and Boto3.

First things first, you need to have your environment ready to work with Python and Boto3. If you haven’t set things up yet, please check out my blog post here and get ready for the implementation.

I assume you already checked out my Setting Up Your Environment for Python and Boto3 post, so I’ll jump right into the Python code.

The first thing we need to make sure of is that we import boto3:

import boto3

We should now create our S3 resource with boto3 to interact with S3:

s3 = boto3.resource('s3')

Ok, we’re ready to develop, let’s begin!

Let’s start by defining ourselves a method in Python for the operation:

def multi_part_upload_with_s3():

There are basically three things we need to implement. First is the TransferConfig, where we will configure our multi-part upload and also make use of threading in Python to speed up the process dramatically. So let’s start with TransferConfig and import it:

from boto3.s3.transfer import TransferConfig

Now we need to make use of it in our multi_part_upload_with_s3 method:

config = TransferConfig(multipart_threshold=1024 * 25, max_concurrency=10,
                        multipart_chunksize=1024 * 25, use_threads=True)

Here’s a base configuration with TransferConfig. Let’s break down each element and explain it all:

multipart_threshold: The transfer size threshold for which multi-part uploads, downloads, and copies will automatically be triggered.

max_concurrency: The maximum number of threads that will be making requests to perform a transfer. If use_threads is set to False, the value provided is ignored as the transfer will only ever use the main thread.

multipart_chunksize: The partition size of each part for a multi-part transfer.

use_threads: If True, threads will be used when performing S3 transfers. If False, no threads will be used in performing transfers: all logic will be run in the main thread.

This is how I configured my TransferConfig, but you can definitely play around with it and make some changes to thresholds, chunk sizes and so on; here’s a quick sketch of an alternative before we continue.
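Note that multipart_threshold and multipart_chunksize are expressed in bytes, so the 1024 * 25 above works out to 25 KB. If you’d rather work with larger parts, a configuration along these lines would do it (the 25 MB figure is just an assumption for illustration, not a recommendation):

MB = 1024 * 1024  # TransferConfig sizes are expressed in bytes

config = TransferConfig(multipart_threshold=25 * MB,   # switch to multi-part above 25 MB
                        max_concurrency=10,            # up to 10 worker threads
                        multipart_chunksize=25 * MB,   # 25 MB per part
                        use_threads=True)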

Now we need to find the right file candidate to test out how our multi-part upload performs. So let’s read a rather large file (in my case, this PDF document was around 100 MB).

First, let’s import the os library in Python:

import os

Now let’s point to largefile.pdf, which is located under our project’s directory; the call to os.path.dirname(__file__) gives us the path of the directory containing our script, and we append the file name to it:

file_path = os.path.dirname(__file__) + '/largefile.pdf'

Now that we have our file in place, let’s give it a key for S3 so we can follow along with S3’s key-value methodology and place our file inside a folder called multipart_files, under the name largefile.pdf:

key_path = 'multipart_files/largefile.pdf'

Now, let’s proceed with the upload process and call our client to do so:

s3.meta.client.upload_file(file_path, BUCKET_NAME, key_path,
                           ExtraArgs={'ACL': 'public-read',
                                      'ContentType': 'application/pdf'},
                           Config=config,
                           Callback=ProgressPercentage(file_path))

Here I’d like to draw your attention to the last part of this method call: Callback. If you’re familiar with a functional programming language, and especially with JavaScript, then you must be well aware of its existence and purpose.
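In boto3’s case, Callback can be any callable that accepts the number of bytes transferred since the last invocation. Here’s a minimal sketch with a hypothetical simple_progress function, just to show the shape of it:

# A minimal sketch: any callable taking the bytes transferred since the
# last call can serve as Callback; simple_progress is just a hypothetical example.
def simple_progress(bytes_amount):
    print("transferred another %d bytes" % bytes_amount)

# s3.meta.client.upload_file(file_path, BUCKET_NAME, key_path,
#                            Callback=simple_progress)

A plain function works, but it can’t easily remember any state between calls, which is exactly why we’ll use a class with a __call__ method instead.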

What a Callback basically does is call the passed-in function, method or, in our case, class instance, which is ProgressPercentage, every time another chunk of the transfer completes. This way, we’ll be able to keep track of the progress of our multi-part upload: the current percentage, the total and remaining size and so on. But how is this going to work? Where does ProgressPercentage come from? Nowhere, we need to implement it for our needs, so let’s do that now.

Either create a new .py file or use your existing one; it doesn’t really matter where we declare the class, it’s all up to you. So let’s begin:

class ProgressPercentage(object):

In this class declaration, we’re receiving only a single parameter, which will later be our file name, so we can keep track of its upload progress. Let’s continue with our implementation and add an __init__ method to our class so we can make use of some instance variables we will need (note that we’ll also need to import threading for the lock):

def __init__(self, filename):
    self._filename = filename
    self._size = float(os.path.getsize(filename))
    self._seen_so_far = 0
    self._lock = threading.Lock()

Here we’re preparing the instance variables we’ll need while tracking our upload progress. filename and size are very self-explanatory, so let’s explain the other ones:

seen_so_far: the number of bytes already uploaded at any given time. For starters, it’s just 0.

lock: as you can guess, will be used to synchronize the worker threads so that their updates to the progress counter don’t step on each other.

Here comes the most important part of ProgressPercentage, and that is the __call__ method, which is what lets an instance be used as the Callback, so let’s define it:

def __call__(self, bytes_amount):

bytes_amount will of course be the number of bytes transferred to S3 since the last time the callback was invoked. What we need is a way to get the information about the current progress and print it out accordingly, so that we know for sure where we are. Let’s start by acquiring the thread lock and move on:

with self._lock:

After getting the lock, let’s first add bytes_amount to seen_so_far, so that it holds the cumulative number of bytes transferred:

self._seen_so_far += bytes_amount

Next, we compute the percentage of the progress so we can track it easily:

percentage = (self._seen_so_far / self._size) * 100

We’re simply dividing the already uploaded byte size by the whole size and multiplying it by 100 to get the percentage. Now, for all of this to be actually useful, we need to print it out. So let’s do that now. I’m making use of Python’s sys library to print everything out, so I’ll import it; if you prefer something else, you can definitely use that instead:

import sys

Now let’s use it to print things out:

sys.stdout.write("\r%s  %s / %s  (%.2f%%)" %
                 (self._filename, self._seen_so_far, self._size, percentage))

As you can clearly see, we’re simply printing out filename, seen_so_far, size and percentage in a nicely formatted way.

One last thing before we finish and test things out is to flush sys.stdout, so the progress line is written to the terminal immediately instead of sitting in the output buffer:

sys.stdout.flush()

Now we’re ready to test things out. Here’s a complete look to our implementation in case you want to see the big picture:

import threading

import boto3
import os
import sys

from boto3.s3.transfer import TransferConfig

BUCKET_NAME = "YOUR_BUCKET_NAME"
s3 = boto3.resource('s3')


def multi_part_upload_with_s3():
    # Multipart upload
    config = TransferConfig(multipart_threshold=1024 * 25, max_concurrency=10,
                            multipart_chunksize=1024 * 25, use_threads=True)
    file_path = os.path.dirname(__file__) + '/largefile.pdf'
    key_path = 'multipart_files/largefile.pdf'
    s3.meta.client.upload_file(file_path, BUCKET_NAME, key_path,
                            ExtraArgs={'ACL': 'public-read', 'ContentType': 'application/pdf'},
                            Config=config,
                            Callback=ProgressPercentage(file_path)
                            )


class ProgressPercentage(object):
    def __init__(self, filename):
        self._filename = filename
        self._size = float(os.path.getsize(filename))
        self._seen_so_far = 0
        self._lock = threading.Lock()

    def __call__(self, bytes_amount):
        # To simplify we'll assume this is hooked up
        # to a single filename.
        with self._lock:
            self._seen_so_far += bytes_amount
            percentage = (self._seen_so_far / self._size) * 100
            sys.stdout.write(
                "\r%s  %s / %s  (%.2f%%)" % (
                    self._filename, self._seen_so_far, self._size,
                    percentage))
            sys.stdout.flush()

Let’s now add a main method to call our multi_part_upload_with_s3:

if __name__ == '__main__':
    multi_part_upload_with_s3()

Let’s hit run and see our multi-part upload in action:

[Screenshot: the multi-part upload progress output, from AWS with Python and Boto3: Implementing Solutions with S3 on Udemy]

As you can see, we have a nice progress indicator and two size descriptors: the first for the bytes already uploaded and the second for the whole file size.
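If you’d like a sense of what that looks like without the screenshot, here’s roughly the kind of line our format string produces (the path and numbers are hypothetical, assuming a 100 MB file that’s halfway done):

/path/to/project/largefile.pdf  52428800 / 104857600.0  (50.00%)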

So this is basically how you implement multi-part upload on S3. There are definitely several ways to implement it (I’ve sketched one alternative below); however, I believe this one is clean and sleek.
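For the curious, one of those other ways is to drive the multi-part upload yourself with the low-level client API: create_multipart_upload, upload_part and complete_multipart_upload. Here’s a rough sketch of that approach; the bucket, key, file name and 25 MB part size are just assumptions for illustration, and real code would also want error handling (and abort_multipart_upload on failure):

import boto3

client = boto3.client('s3')
bucket = 'YOUR_BUCKET_NAME'   # assumption: replace with your bucket
key = 'multipart_files/largefile.pdf'
part_size = 25 * 1024 * 1024  # parts must be at least 5 MB (except the last one)

# Start the multi-part upload and remember its UploadId
mpu = client.create_multipart_upload(Bucket=bucket, Key=key)
parts = []

with open('largefile.pdf', 'rb') as f:
    part_number = 1
    while True:
        chunk = f.read(part_size)
        if not chunk:
            break
        # Upload each chunk as a numbered part and keep its ETag
        response = client.upload_part(Bucket=bucket, Key=key,
                                      PartNumber=part_number,
                                      UploadId=mpu['UploadId'],
                                      Body=chunk)
        parts.append({'ETag': response['ETag'], 'PartNumber': part_number})
        part_number += 1

# Tell S3 to stitch the parts together into the final object
client.complete_multipart_upload(Bucket=bucket, Key=key,
                                 UploadId=mpu['UploadId'],
                                 MultipartUpload={'Parts': parts})

As you can see, upload_file with a TransferConfig hides all of this bookkeeping (plus the threading) behind a single call, which is why I prefer it.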

Make sure to subscribe to my blog or reach me at niyazierdogan@windowslive.com for more great posts and surprises on my Udemy courses.

Have a great day!

AWS with Python and Boto3: RDS PostgreSQL and DynamoDB CRUD course is out!

Do you want to learn how to launch managed relational databases, or RDS, on AWS? Do you want to learn how to connect to your RDS DB instances using Python and the psycopg2 library and implement all Create, Read, Update and Delete (CRUD) operations? Or do you want to learn how to implement NoSQL DynamoDB tables on AWS and work with data, from scanning and querying to update, read and delete operations?

Then this is the course you need on RDS and DynamoDB on AWS!

In this course, we’ll start by taking a look at the tools and the environment that we need to work with AWS resources. We’ll be using Python 3, and as for the IDE, I recommend you use PyCharm from JetBrains. It even has a free Community Edition!

After I teach you how to set up your environment on both macOS and Windows, we’ll create our credentials for AWS, namely the AWS Access Key and AWS Secret Access Key for programmatic access to AWS resources. You’ll learn how to set your AWS credentials globally on your computer using the AWS CLI. As one last tip before jumping into the implementation, I’ll show you how to get auto-complete capabilities in your PyCharm IDE with PyBoto3!

Once we’re ready with our environment setup, we’ll start implementing our solution on AWS! And remember, we’ll do everything with Python code; not a single thing by hand!

We’ll start off with RDS, or Relational Database Service, from AWS. I’ll teach you how to launch your own Amazon RDS instances purely with your Python code! Then we’ll learn how to connect to our RDS database instance using Python and the psycopg2 library. After that, I’ll teach you how to execute your queries against RDS PostgreSQL using psycopg2, and we’ll implement SELECT, INSERT, DELETE and UPDATE, so basically all the CRUD operations, against our own launched RDS PostgreSQL instance on AWS!

Next up is DynamoDB! With this very popular NoSQL service from AWS, I’ll teach you how to create your own DynamoDB tables on AWS with Python! You’ll learn how to provide a key schema and attribute definitions and apply throughput to your tables.

And I’ll share the great news that there is a local version of DynamoDB that you can simply run on your computer to play around with! I will show you how to get and run the local version of DynamoDB on your computer, and we’ll set up our environment and boto3 client configuration accordingly.

Then we’ll make our way through putting new items, and updating, deleting and reading them. Once we learn the basic CRUD operations with DynamoDB, we’ll move on to the more advanced operations of scanning and querying.

We’ll also implement a script to insert our sample data set of “movies” into our DynamoDB Movies table! Once we insert the data, we’ll start exploring how we can search it using the DynamoDB query operation, and we’ll also learn how to use conditions. And finally, we’ll take a look at the scan operation, which basically scans your whole data set and retrieves the results you need. To filter the results from a scan operation, we’ll apply filter expressions to our scan and see how things work with DynamoDB.

Lots of information, hands-on practice and experience are waiting for you in this course on AWS. So don’t waste any more time and join me in this course to sharpen your skills on AWS using Python and Boto3!