Trying to understand Async in Python - Part 1 | Introduction

I have a Python program that currently spawns about 1000 threads, which pull data from various websites via GET requests once every thirty minutes. When they finish, they go to sleep until the thirty minutes have passed.

What's the problem? When I tried to spawn more than 2000 threads, they crashed: my Raspberry Pi 4, with 4 GB of RAM, couldn't run them, even though it's such a simple mission!

I had to find a solution, and as a beginner Python user, I think I found the answer, and I am going to explore it with you: asynchronous programming.

So what even is async programming?

While func1 waits for information, let's run and finish func2 and only then return to func1.

Or, put another way: if func1 has nothing to calculate right now, run func2 and then get back to func1 when func2 finishes.

Let's say I have two functions. The first is printSlow(), which prints the text of a slow-responding website like slowWeb.com. Seriously, that site is so slow that it takes a whole minute to return an answer.

The second is printFast(), which prints the text of fastWeb.com, which responds in 40 seconds.

  • Those websites don't really exist.

A total beginner who wants to print the content of these two websites will run printSlow(), wait for an answer, and then run printFast(). The runtime of getting text from those two websites is 100 seconds (60 + 40).
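To make the first approach concrete, here is a minimal sketch of it. Since the two websites don't exist, I am assuming a made-up fetch() helper that just sleeps instead of sending a real GET request, and I scaled the waiting times down so that 0.3 seconds stands for slowWeb.com's 60 seconds and 0.2 seconds for fastWeb.com's 40 seconds.

```python
import time

# Assumption: scaled-down delays, 0.3 s plays the role of 60 s and 0.2 s of 40 s.
SLOW_DELAY = 0.3
FAST_DELAY = 0.2


def fetch(site, delay):
    """Hypothetical stand-in for a GET request: sleeping simulates the server's response time."""
    time.sleep(delay)
    return f"text of {site}"


def printSlow():
    print(fetch("slowWeb.com", SLOW_DELAY))


def printFast():
    print(fetch("fastWeb.com", FAST_DELAY))


def run_sequentially():
    start = time.perf_counter()
    printSlow()  # blocks until the slow site answers
    printFast()  # only starts after printSlow() has finished
    return time.perf_counter() - start


elapsed = run_sequentially()
print(f"sequential: about {elapsed:.2f} s")
```

The measured time should come out close to 0.5 seconds, the sum of the two waits, which matches the 100 seconds of the story.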

When one becomes smarter, one would run those functions in two threads: Thread1 for printSlow(), Thread2 for printFast(), and then start the threads. The runtime of getting text from these websites using threads is 60 seconds - assuming that both threads were started at the same time.
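Here is a rough sketch of the threaded approach using Python's built-in threading module. The fetch() helper is my own made-up stand-in that sleeps instead of doing a real GET request, and the delays are scaled down (0.3 seconds stands for 60 seconds, 0.2 seconds for 40 seconds).

```python
import threading
import time

# Assumption: scaled-down delays, 0.3 s plays the role of 60 s and 0.2 s of 40 s.
SLOW_DELAY = 0.3
FAST_DELAY = 0.2


def fetch(site, delay):
    """Hypothetical stand-in for a GET request: sleeping simulates the server's response time."""
    time.sleep(delay)
    return f"text of {site}"


def printSlow():
    print(fetch("slowWeb.com", SLOW_DELAY))


def printFast():
    print(fetch("fastWeb.com", FAST_DELAY))


def run_in_threads():
    thread1 = threading.Thread(target=printSlow)
    thread2 = threading.Thread(target=printFast)
    start = time.perf_counter()
    thread1.start()  # both threads start almost at the same time
    thread2.start()
    thread1.join()   # wait until both have finished
    thread2.join()
    return time.perf_counter() - start


elapsed = run_in_threads()
print(f"threads: about {elapsed:.2f} s")
```

Both waits overlap, so the total should land near 0.3 seconds (the slower of the two) rather than 0.5.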

That solution is pretty neat, but what if we want to get the data from 2000 different websites? Or what if we want to get data from 1000 websites every half hour? Will we spawn 1000 threads just for them to execute once every half an hour?

That's wasteful, and fortunately - there's a better way. It was rather complicated trying to understand what it is (and I am still trying) - so I will try to simplify it the best I can.

The third way is using async functions, which basically allow you to pause the current function and run another function until the first one is ready to run again.

In our example:

  1. printSlow() sends a GET request and says to the server "Hey, I need information".
  2. Because it takes a minute for the server to respond, printSlow() will pause itself and let printFast() run.
  3. Now printFast() is running - it sends a GET request and says to the server "Hey, I need information".
  4. printFast() will wait for data, and then print it.
  5. Now, when printFast() is finished, printSlow() will resume and keep waiting for its response. Then it will print it.

That took a total of one minute - no time was wasted here, and it was all done on a single thread - so no wasted CPU power!
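The steps above can be sketched with Python's asyncio module. This is only my simplified picture, not a real scraper: the fetch() coroutine is a made-up helper that uses asyncio.sleep() to simulate waiting for the server, and the delays are scaled down (0.3 seconds stands for 60 seconds, 0.2 seconds for 40 seconds). The `await` keyword marks the spot where a function pauses itself and lets another one run.

```python
import asyncio
import time

# Assumption: scaled-down delays, 0.3 s plays the role of 60 s and 0.2 s of 40 s.
SLOW_DELAY = 0.3
FAST_DELAY = 0.2


async def fetch(site, delay):
    """Hypothetical stand-in for a GET request."""
    # While this coroutine sleeps, the event loop is free to run other coroutines.
    await asyncio.sleep(delay)
    return f"text of {site}"


async def printSlow(order):
    text = await fetch("slowWeb.com", SLOW_DELAY)  # pauses here
    print(text)
    order.append("slow")


async def printFast(order):
    text = await fetch("fastWeb.com", FAST_DELAY)  # pauses here
    print(text)
    order.append("fast")


async def main():
    order = []  # records which function printed first
    start = time.perf_counter()
    # Run both coroutines concurrently on a single thread.
    await asyncio.gather(printSlow(order), printFast(order))
    return time.perf_counter() - start, order


elapsed, order = asyncio.run(main())
print(f"async: about {elapsed:.2f} s, finished in order: {order}")
```

printFast() gets its answer first and prints it, then printSlow() resumes and prints its own. The whole run should take roughly as long as the slow site alone.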

The concept is quite difficult to understand, so you can picture it like this:

  1. printSlow() sends a request to the server
  2. While it waits for a response, it spawns a second thread for printFast() and runs it.
  3. printSlow() will continue to run once printFast()'s thread is dead.
  • Note that we DON'T actually spawn another thread - it's just a way to picture the concept; everything is done on a single thread.

Basically, printSlow() knows that it should wait for a response and lets printFast() do its thing, knowing that once printFast() finishes it can continue running, wait for the response, and print it once it arrives from the server.

You are probably still confused. It's okay! Let's say we have 5 different functions to fetch data from 5 different websites. With async functions:

  1. func1 runs, and while it waits for a response -> it starts func2.
  2. While func2 waits for a response -> it starts func3.
  3. While func3 waits for a response -> it starts func4.
  4. While func4 waits for a response -> it starts func5.
  5. While func5 waits for a response -> it notices that it has no further funcs to start. So it waits for a result, prints it, and lets func4 run again.
  6. func4 has got a result -> it prints it and lets func3 run again.
  7. func3 has got a result -> it prints it and lets func2 run again.
  8. func2 has got a result -> it prints it and lets func1 run again.
  9. func1 has got a result -> it prints it.

Now all of the functions have finished, and the total time was just the time it took the slowest server to respond, instead of the sum of the wait times of every server we called.
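Here is a small sketch of the five-function case with asyncio, to check the claim about the total time. The site names and response times are invented for illustration, and asyncio.sleep() stands in for the real waiting. One thing I noticed while playing with it: as far as I can tell, the functions don't have to finish in the reverse order they started in. Each one simply continues as soon as its own response arrives.

```python
import asyncio
import time

# Assumption: made-up sites with made-up (scaled-down) response times in seconds.
DELAYS = {
    "site1": 0.25,
    "site2": 0.10,
    "site3": 0.20,
    "site4": 0.05,
    "site5": 0.15,
}


async def fetch_and_print(site, delay, finished):
    # Simulated wait for the server; other coroutines run in the meantime.
    await asyncio.sleep(delay)
    print(f"{site} answered after {delay} s")
    finished.append(site)


async def main():
    finished = []  # records the order in which the functions finish
    start = time.perf_counter()
    await asyncio.gather(
        *(fetch_and_print(site, delay, finished) for site, delay in DELAYS.items())
    )
    return time.perf_counter() - start, finished


elapsed, finished = asyncio.run(main())
print(f"total: about {elapsed:.2f} s (sum of all waits: {sum(DELAYS.values()):.2f} s)")
```

The total should be close to 0.25 seconds, the slowest response, while the sum of all the waits is 0.75 seconds.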

A literal real-world example:

Imagine you deliver two packages to two different neighbors. You could knock on neighbor 1's door, wait 10 minutes for him to open his door and get the package, and then proceed to neighbor 2, knock, wait for 9 minutes and deliver the package.

That's like the first approach - one person doing one task at a time for two neighbors. A total of 19 minutes - slow and inefficient.

We can also have 2 delivery guys that deliver packages at the same time: one delivery guy for each neighbor.

That's like the second approach - two people, each doing one task at a time for one person. That's fast - it takes only 10 minutes for both of them to finish their tasks, but it can get very expensive: if we had 30 neighbors, it would've taken 30 delivery guys for such a simple task!

We can also have a different scenario: one delivery guy knocks on the door, and the house owner yells "I will open the door in ten minutes!" Meanwhile, the delivery guy goes to the 2nd neighbor, who yells "It will take me 9 minutes to open the door!". The delivery guy knows he has no one to visit next, so he will wait 9 minutes and deliver his package when the time's up. Then he will go back to the first neighbor just in time for him to open his door and get the package.

That's approach number 3 - one person handles multiple tasks at a time and is efficient about it, as it takes him just 10 minutes to deliver both packages.

Grasping the concept of async is tiring, I know. Please rest a little; in the next part I will use the async library in Python, to the best of my understanding. Stay well!