Be Careful Searching Python Dictionaries!
For my talk on High Performance Computing in R (which I had to reschedule due to a nasty stomach bug), I used Wikipedia linking data, an adjacency list of articles and the articles to which they link. This data was linked from DataWrangling and was originally created by Henry Haselgrove. The dataset is small on disk, but I needed a dataset that was huge, very huge. So, without a simple option off the top of my head, I took this data and expanded a subset of it into an incidence matrix, occupying 2GB in RAM. Perfect!
The subsetting was a bit of a problem because I had to maintain the links within the subgraph induced by the chosen subset. This required me to search dictionary objects for keys, and this is where things went awry. Usually, I try to be as efficient as possible. Since I was just producing a little example, and would never ever run this code otherwise, I wasn’t as careful.
Each record in the adjacency list pairs a from node with the list of to nodes it links to. I processed the data as follows:
1. First, look at the from node; if from is in the chosen subset, keep it and proceed to step 2, otherwise, throw out the record.
2. Then, take the to nodes and keep only those that are also in the subset.
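A minimal sketch of that pass is below. It assumes the chosen subset is stored in a dict called index keyed by article ID, and that each record is already split into a from node and a list of to nodes; neither of those names appears in the original code, so treat this as illustrative only.

# Illustrative only: 'index' holds the randomly chosen subset of article IDs,
# 'frm' is the source node of a record, and 'to' is its list of target nodes.
def keep_record(frm, to, index):
    if frm not in index:                              # step 1: source node must be in the subset
        return None
    kept = [node for node in to if node in index]     # step 2: keep only targets in the subset
    return frm, kept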
How we search for an element in a dict makes a huge, huge difference in running time. It is pretty easy to tell when you have fallen into this pitfall because your program will go ridiculously slow. In my case I had 50,000 unique keys, all of type int (originally).
The Wrong Way
Anybody with a background in mathematics will naturally frame the problem this way: is k in the set of keys of the dictionary? After all, it is the keys we are searching, not the values.
for node in to:
    if node in d.keys():   # builds a list of all keys, then scans it
        pass               # keep the node
    else:
        pass               # toss out the node
or similarly
to = filter(lambda x: x in index.keys(), to)
This is bad. Using this method is the difference between your code taking a few seconds and running for a few hours.
The Better Ways
The correct way doesn’t make as much sense logically unless we understand one thing: the in operator is overloaded for dictionaries to search the keys.
If we do not know this, it seems that k in d is ambiguous. My first thought was that it would search the values. Actually, it searches the keys.
>>> a = {'a': 1, 'b': 2, 'c': 3}
>>> 'b' in a
True
>>> 2 in a
False
>>> 2 in a.values()
True
So we can use the following,
for node in to:
    if node in d:   # hash lookup on the dict itself
        pass        # keep the node
    else:
        pass        # drop the node
or similarly,
result = filter(lambda x: x in index, to)
To just check for the existence of a key, such as in my case, we can use has_key(). (Note that has_key() exists only in Python 2; it was removed in Python 3 in favor of the in operator.)
for node in to:
    if d.has_key(node):
        pass   # keep the node
    else:
        pass   # drop the node
Another way is to use exceptions, which may feel natural to the beginning programmer. If we look up index[k] and k is not a key in the dictionary, a KeyError is raised. We can catch this exception just to shut up the interpreter.
for k in to:
    try:
        dummy = index[k]
    except KeyError:
        pass   # drop the node
    else:
        pass   # keep the node
Timing the Methods
I stripped down my code to run one iteration of the processing using the various methods, all Python scripts. Since the subsets are chosen randomly each time, I ran each method 100 times and averaged the running time.
Clarification: I did not want to strip down my code sample too much. The curious reader will notice that 5000ms is pretty high for just a search operation. There is some other stuff going on before the search takes place: generating a random list, opening a file, reading a line from it, and parsing the line. This overhead increases the running time. The table below is solely to show the relative differences among the methods, not absolute search times.
To time a function, we can use a function decorator as follows. Define the function print_timing, and then add a decorator to the function being timed:
import time

def print_timing(func):
    def wrapper(*arg):
        t1 = time.clock()
        res = func(*arg)
        t2 = time.clock()
        print '%0.3fms' % ((t2 - t1) * 1000.0)
        return res
    return wrapper

@print_timing
def my_function():
    # stuff to be timed
    return
Adapted from a post on DaniWeb.
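The full benchmarking script is not reproduced here. A minimal sketch of the 100-run averaging, assuming each search method is wrapped in a zero-argument function such as the hypothetical search_with_keys(), might look like this:

# Hypothetical harness: 'func' stands in for one of the search methods above.
import time

def average_runtime_ms(func, runs=100):
    total = 0.0
    for _ in xrange(runs):
        t1 = time.clock()
        func()
        t2 = time.clock()
        total += (t2 - t1)
    return total / runs * 1000.0   # average milliseconds per run

# print '%0.3fms' % average_runtime_ms(search_with_keys)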
The table below shows the results. The point to take home is never, ever search dict.keys(). Using has_key() and exceptions gives a pretty good improvement, and simply searching the dict object itself (k in d) takes the least amount of time.
The difference in average time between the slowest and fastest methods is 74ms per iteration. After processing 100 nodes, that difference grows to about 7 seconds, and it becomes unacceptable when processing even just 1,000 to nodes.
So why is searching d.keys() so slow? When we call d.keys(), we get a list, and searching a list is an O(n) operation. The other methods that I studied rely on the dict, and only the dict object. A Python dict is essentially a hash table, so checking for the existence of a key is, on average, an O(1) operation.
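To see the difference for yourself, a quick (hypothetical) micro-benchmark with timeit against a 50,000-key dict, like the one described above, might look like this:

# Hypothetical micro-benchmark: compares the list scan against the hash lookup.
import timeit

setup = "d = dict((i, i) for i in xrange(50000))"   # 50,000 int keys, as in the post

slow = timeit.timeit("25000 in d.keys()", setup=setup, number=1000)   # O(n) list scan per test
fast = timeit.timeit("25000 in d", setup=setup, number=1000)          # O(1) hash lookup per test
print 'd.keys(): %0.3fs    d: %0.3fs' % (slow, fast)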
While writing this post, I found this web page discussing common Python idioms and hints to improve efficiency. I recommend it to anyone who programs in Python.