Timing Python Processes

Posted on January 14, 2018 by Andrew Treadway in R bloggers

[This article was first published on Open Source Automation, and kindly contributed to R-bloggers.]

Python offers several ways to time a process. One of the most common is the standard library module time, which we’ll demonstrate with an example. Another package that is very useful for timing a process, and particularly for telling you how far along it is, is tqdm. As we’ll show a little further down the post, tqdm prints a progress bar while a process is running.

Basic timing example

Suppose we want to scrape the HTML from some collection of links. In this case, we’re going to get a collection of URLs from Bloomberg’s homepage. To do this, we’ll use BeautifulSoup to get a list of full-path URLs. The code below gives us a list of around 100 URLs. This first section of code should run pretty quickly; timing becomes useful once we cycle through some (or all) of these links and scrape the HTML from the respective pages.


# load packages
import time
from bs4 import BeautifulSoup
import requests

# get HTML of Bloomberg's website
html = requests.get("https://www.bloomberg.com/").content

# convert HTML to BeautifulSoup object
soup = BeautifulSoup(html, "lxml")

# find all links on Bloomberg's homepage
links = soup.find_all("a")

# get the URL of each link object
urls = [x.get("href") for x in links]

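# filter out links that have no href attribute (None values)
urls = [x for x in urls if x is not None]

# keep only the URLs with a full path, i.e. those starting with "http"
# (startswith avoids matching relative links that merely contain "http")
with_http = [x for x in urls if x.startswith("http")]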

# check how many URLS we have
len(with_http)
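
The filter above keeps only links that already carry a full path, so relative links such as "/markets" are dropped. If those were wanted too, one option is the standard library’s urljoin, sketched below. This is not part of the original workflow, and the sample href values are hypothetical stand-ins rather than links taken from the real page.

```python
from urllib.parse import urljoin

# Base URL of the page the links were scraped from
base_url = "https://www.bloomberg.com/"

# Hypothetical href values, standing in for the scraped ones
sample_hrefs = [
    "/markets",
    "https://www.bloomberg.com/technology",
    "news/articles",
]

# urljoin resolves relative links against the base URL
# and leaves absolute URLs unchanged
resolved = [urljoin(base_url, href) for href in sample_hrefs]

for url in resolved:
    print(url)
```

Resolving links this way would make the list longer than the roughly 100 URLs mentioned above, so the timings later in the post would not be directly comparable.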

Let’s suppose we want to get the HTML of the first 50 webpages associated with our URLs. This could take a little while, so let’s time it using the time module.


# get start epoch time
start = time.time()

# create an empty dictionary to store HTML
# from each page
html_store = {}

# get HTML from first 50 webpages
for url in with_http[:50]:
    
    html = requests.get(url).content
    html_store[url] = html

# get end epoch time
end = time.time()

# print number of seconds the above process took
print(end - start)

Above, we’re calling time.time(), which returns the current epoch time, i.e. the number of seconds since January 1, 1970 (UTC). Because we call it immediately before and immediately after the for loop, subtracting the first value from the second gives the number of seconds the loop took to run. Hitting the 50 links took just under 28 seconds, though that time will vary depending on your internet connection.
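
The same start/stop pattern can be wrapped up for reuse. The sketch below is one possible approach, not the method used in this post: it swaps time.time() for time.perf_counter(), a standard library clock intended for measuring intervals, and stores the result in a dictionary. The names timed and timings are made up for illustration, and time.sleep stands in for the scraping loop so the example runs without a network connection.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Time the enclosed block and store the elapsed seconds in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the block raises, so a partial timing is still recorded
        results[label] = time.perf_counter() - start


timings = {}

with timed("sleep", timings):
    time.sleep(0.2)  # stand-in for the scraping loop

print(f"sleep took {timings['sleep']:.2f} seconds")
```

Unlike time.time(), the value returned by time.perf_counter() has no meaning on its own; only the difference between two calls does.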

How much longer will this loop last?

Although the above tells us how long our process took, it does nothing to let us know how the process is going while it is actually running. We had to wait until the process was over to find out how long it took. The tqdm package solves this problem. As mentioned above, tqdm displays a progress status while a process is running, showing how far Python has come in the loop. Not only that, tqdm also estimates how much longer the process has to run, and it updates this estimate as the code runs. Let’s use it in our code from above.


# import the tqdm class from the tqdm package
from tqdm import tqdm

# create an empty dictionary to store HTML
# from each page
html_store = {}

# get HTML from first 50 webpages
for url in tqdm(with_http[:50]):
    
    html = requests.get(url).content
    html_store[url] = html

In the above code, we simply wrap the list of URLs we’re looping through in tqdm, which is imported from the tqdm package of the same name.
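
To illustrate the general idea of wrapping an iterable, here is a minimal sketch of a generator that yields each item and reports the running count. It is not how tqdm is implemented; the name simple_progress and its parameters are hypothetical, and it assumes the iterable has a length unless total is passed in.

```python
import sys
import time


def simple_progress(iterable, total=None, stream=sys.stderr):
    """Yield items from iterable, writing a running count after each one."""
    if total is None:
        total = len(iterable)  # assumes a sized iterable such as a list
    start = time.perf_counter()
    for count, item in enumerate(iterable, start=1):
        yield item
        # Execution resumes here once the caller has finished with the item
        elapsed = time.perf_counter() - start
        stream.write(f"\r{count}/{total} [{elapsed:.1f}s elapsed]")
        stream.flush()
    stream.write("\n")


# Usage: wrap any sized iterable, just as the URL list is wrapped above
squares = [n * n for n in simple_progress(range(5))]
print(squares)
```

Because the wrapper only sits between the loop and the iterable, the loop body itself does not need to change.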

The output from tqdm looks like this:

[Figure: tqdm progress bar partway through the loop, showing 29 / 50 iterations, 18 seconds elapsed and an estimated 13 seconds remaining]

Above, the 18 represents that 18 seconds have passed so far in the loop, while the 13 represents how many seconds tqdm estimates are left in the process, based on how long the code has been running so far. The 29 / 50 shows that Python has hit 29 of the 50 links in the loop. At the end of the loop, the bar shows 50 / 50 along with the total elapsed time.
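
As a rough check on the numbers quoted above, a remaining-time estimate can be approximated from the average time per completed iteration. The sketch below is a naive version of that arithmetic and not tqdm’s exact calculation (tqdm smooths its rate estimate); the function name estimate_remaining is hypothetical.

```python
def estimate_remaining(elapsed_seconds, completed, total):
    """Naive estimate: average seconds per finished item times items left."""
    if completed <= 0:
        raise ValueError("at least one iteration must have completed")
    average_per_item = elapsed_seconds / completed
    return average_per_item * (total - completed)


# 18 seconds elapsed after 29 of 50 links, as in the example above
remaining = estimate_remaining(18, 29, 50)
print(round(remaining))  # prints 13
```

With these inputs the simple average happens to agree with the 13 seconds reported, although a single slow page can shift either estimate noticeably.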

We can also use tqdm with list or dictionary comprehensions. Let’s change our code from above to work as a dictionary comprehension.


html_store = {url : requests.get(url).content for url in tqdm(with_http[:50])}

Please check out my other posts by clicking one of the links on the right side of the page, or by perusing the archives here: http://theautomatic.net/blog/.

If you have requests for tutorials, please leave a note on the Contact page.

The post Timing Python Processes appeared first on Open Source Automation.
