Speeding up Python code using multithreading

Hi lovely people! πŸ‘‹ A lot of the time we end up writing Python code that makes remote requests, reads multiple files, or processes some data. And in many of those cases I have seen programmers use a simple for loop that takes forever to finish executing. For example:

import requests
from time import time

url_list = [
    "https://via.placeholder.com/400",
    "https://via.placeholder.com/410",
    "https://via.placeholder.com/420",
    "https://via.placeholder.com/430",
    "https://via.placeholder.com/440",
    "https://via.placeholder.com/450",
    "https://via.placeholder.com/460",
    "https://via.placeholder.com/470",
    "https://via.placeholder.com/480",
    "https://via.placeholder.com/490",
    "https://via.placeholder.com/500",
    "https://via.placeholder.com/510",
    "https://via.placeholder.com/520",
    "https://via.placeholder.com/530",
]

def download_file(url):
    # stream=True defers downloading the response body, so we get
    # the status code without pulling down the whole image
    html = requests.get(url, stream=True)
    return html.status_code

start = time()

# Sequential version: each request must finish before the next begins
for url in url_list:
    print(download_file(url))

print(f'Time taken: {time() - start}')

Output:

<--truncated-->
Time taken: 4.128157138824463

This is a perfectly sensible example: the code will open each URL, wait for it to load, print its status code, and only then move on to the next URL. This kind of code is a very good candidate for multi-threading.

Modern systems can run a lot of threads, which means you can work on multiple tasks at once with very low overhead. Downloading URLs is I/O-bound: the program spends most of its time waiting on the network, and Python threads are a great fit for that kind of work (the GIL is released during blocking I/O). Why don’t we try and make use of that to make the above code process these URLs faster?

We will make use of the ThreadPoolExecutor from the concurrent.futures library. It is super easy to use. Let me show you some code and then explain how it works.

import requests
from concurrent.futures import ThreadPoolExecutor, as_completed
from time import time

url_list = [
    "https://via.placeholder.com/400",
    "https://via.placeholder.com/410",
    "https://via.placeholder.com/420",
    "https://via.placeholder.com/430",
    "https://via.placeholder.com/440",
    "https://via.placeholder.com/450",
    "https://via.placeholder.com/460",
    "https://via.placeholder.com/470",
    "https://via.placeholder.com/480",
    "https://via.placeholder.com/490",
    "https://via.placeholder.com/500",
    "https://via.placeholder.com/510",
    "https://via.placeholder.com/520",
    "https://via.placeholder.com/530",
]

def download_file(url):
    html = requests.get(url, stream=True)
    return html.status_code

start = time()

processes = []
with ThreadPoolExecutor(max_workers=10) as executor:
    for url in url_list:
        # submit() schedules download_file(url) on the pool and
        # immediately returns a Future for the pending task
        processes.append(executor.submit(download_file, url))

# as_completed() yields each Future as soon as it finishes
for task in as_completed(processes):
    print(task.result())


print(f'Time taken: {time() - start}')

Output:

<--truncated-->
Time taken: 0.4583399295806885

We just sped up our code by a factor of almost 9 (4.13 seconds down to 0.46 seconds)! And we didn’t even do anything super involved. The performance benefit would have been even greater if there were more URLs.

So what is happening? When we call executor.submit we are adding a new task to the thread pool, and submit immediately returns a Future object representing that pending task. We store each Future in the processes list. Later we iterate over processes and print out the results.
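
To make the Future part concrete, here is a minimal sketch (reusing download_file and url_list from above; the future variable name is mine, not from the original code) of what a single submitted task looks like:

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=10) as executor:
    # submit() returns immediately with a Future for the pending task
    future = executor.submit(download_file, url_list[0])
    # done() reports whether the task has finished or been cancelled
    print(future.done())
    # result() blocks until the task finishes and returns its value,
    # re-raising any exception raised inside download_file
    print(future.result())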

The as_completed function yields the items (tasks) from the processes list as soon as they complete. There are two reasons a task can reach the completed state: it has either finished executing or it was cancelled. We could also pass a timeout parameter to as_completed; if the remaining tasks have not completed within that time period, as_completed raises a TimeoutError rather than yielding them.
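
Here is a minimal sketch of that timeout behaviour, assuming the download_file and url_list from above (the 2-second limit is an arbitrary value I picked for illustration). Note that the as_completed loop sits inside the with block here, so the pool has not yet waited for every task to finish:

from concurrent.futures import ThreadPoolExecutor, as_completed, TimeoutError

processes = []
with ThreadPoolExecutor(max_workers=10) as executor:
    for url in url_list:
        processes.append(executor.submit(download_file, url))

    try:
        # Give the whole batch at most 2 seconds in total; each task
        # is still yielded as soon as it finishes within that window
        for task in as_completed(processes, timeout=2):
            print(task.result())
    except TimeoutError:
        print("Some downloads did not complete within 2 seconds")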

You should explore multi-threading a bit more. For trivial projects it is the quickest way to speed up your code. If you want to learn more, read the official docs. They are super helpful.
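
For example, the same docs cover executor.map, which rolls the submit-and-collect pattern above into a single call. A minimal sketch, again reusing download_file and url_list from this post:

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=10) as executor:
    # map() runs download_file across the pool and yields results
    # in the same order as url_list (not in completion order)
    for status_code in executor.map(download_file, url_list):
        print(status_code)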

Have a good day! See you later!


✍️ Comments

Javier Ruiz

By applying your great tip I just sped up an import process into an Atlassian Jira instance by a factor of 50 or more! Amazing!

Edan Saltz

Increasing max_workers to 14 further decreased the time by ~30%. Playing with it got me to this optimal number for max_workers.

Yasoob
In reply to Edan Saltz

Yeah it is all about testing the limits of your system. πŸ‘

Apteka

I’m new to Python. I’m positively surprised at how easy multithreading is in this language. Thanks

Yknip

Thanks for putting this up. It definitely helped speed up my process!
