Imagine: you are working on a specific problem. After much trial and error, your code finally does what it is supposed to do. The solution is neither elegant nor very efficient, but you have already spent more time on it than you wanted to. You tell yourself you will refactor the code sometime in the future. As time goes by, you never quite find the time (or desire) to do so, and you leave the code as it is. After all, it works, and you have more important issues to deal with first. Sound familiar?
Fast forward to the moment you are ready to put everything into production. You notice that a small part of the pipeline is remarkably slow. You dive into the code, and you identify the problem. Shock. It is the code you had written months ago that you told yourself you would refactor at some point. Well, that moment has finally arrived.
According to Donald Knuth's famous decades-old quote, “Premature optimization is the root of all evil,” this workflow is completely fine. The logic behind the quote is that programmers spend far too much time on things they may never need in the end. One example is optimizing a preprocessing function that ultimately does not make it to production. Another is prematurely scaling to millions of users while you are still a startup with only a handful of them.
In this blog post, I’m not going to tell you that Donald Knuth was completely wrong or that his reasoning is outdated. Instead, I mainly want to give you some dos and don’ts, show you how to identify possible optimizations in your code, and show you how to carry them out.
The power of vectorization
First, I would like to introduce you to a very basic toy example. Assume we have an array of integers, and we would like to randomly sample an increase between 0 and 2 for each value in the array. We can solve this in various ways. I have outlined three of them with their corresponding computation times for various data lengths.
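The three variants might look something like this (a minimal sketch with hypothetical data and function names; the increase between 0 and 2 is inclusive, hence the upper bound of 3 in np.random.randint):

```python
import numpy as np

data = np.random.randint(0, 100, size=100_000)

# 1. Classic for loop: sample one increase per element, append as we go
def increase_loop(values):
    result = []
    for v in values:
        result.append(v + np.random.randint(0, 3))  # increase of 0, 1, or 2
    return np.array(result)

# 2. List comprehension: same per-element sampling, no explicit append
def increase_listcomp(values):
    return np.array([v + np.random.randint(0, 3) for v in values])

# 3. Vectorized: sample all increases in a single numpy call
def increase_vectorized(values):
    return values + np.random.randint(0, 3, size=len(values))
```

Timing these (e.g., with %timeit) for growing input sizes reproduces the pattern described below.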
For this simple task, we see that the classic for loop is a tad slower than the list comprehension. The difference comes from the additional overhead of repeatedly appending. However, both are much slower than the ‘vectorized’ method, and the gap widens as the data grows.
By sampling only once, we eliminate significant overhead and take advantage of numpy’s computations, which are implemented in C. Without going into too much detail, vectorization reduces the number of CPU instructions, memory reads, and cache misses. More recent versions of numpy also take advantage of parallelism through SIMD (Single Instruction, Multiple Data). In essence, vectorization transforms your code so that it can be translated into fast machine code instead of being executed by the comparatively slow Python interpreter.
Since vectorization works so well precisely because it minimizes costly operations, I would like to introduce a concept I like to call ‘pseudo-vectorization’. With pseudo-vectorization, you first identify the potential bottlenecks or costly operations in your code. Then, you reduce them by preprocessing or cleverly transforming your data. But how can we identify these potential bottlenecks in the first place? This is where line_profiler comes in.
Identifying and solving bottlenecks in your code
Line_profiler is a convenient package that shows how long individual lines of code take to execute. It is straightforward to set up, and with a few lines of code, you get an incredibly insightful overview of the execution time of a particular function. The function from the image returns a slightly modified time series data frame for a day with 24 timesteps. It only needs to modify specific columns with randomly sampled nearest existing values, while certain other values need to remain constant (due to the scale of certain variables).
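A minimal setup might look like the sketch below. The function body is a hypothetical stand-in for the one from the image (perturbing one column of a 24-timestep day while leaving the rest untouched); the profiling pattern itself, wrapping the function with LineProfiler, is the part that carries over:

```python
import numpy as np
import pandas as pd

def modify_day(day_df):
    """Hypothetical stand-in for the post's function: perturb some
    columns of a 24-timestep day, keep the others constant."""
    values = day_df.to_numpy().copy()
    values[:, 0] = values[:, 0] + np.random.randint(0, 3, size=len(values))
    return pd.DataFrame(values, index=day_df.index, columns=day_df.columns)

def profile_modify_day():
    # requires `pip install line_profiler`
    from line_profiler import LineProfiler

    day_df = pd.DataFrame(np.random.rand(24, 5), columns=list("abcde"))
    profiler = LineProfiler()
    profiled = profiler(modify_day)  # wrap the function we want to inspect
    profiled(day_df)
    profiler.print_stats()           # per-line hit counts and timings
```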
To reduce the computation time, the first step was to compute this information/metadata once and pass it to the function. Since the function had to be applied tens of thousands of times, waiting times accumulated, so I decided to have a look at it with line_profiler. The output was quite shocking. It was not any of the lines inside the function body but the return line that took over 85% of the total computation time. The function we had written turned out to be quite efficient; it was merely the final step of creating a data frame from an array and assigning the original columns that was so expensive. By simply returning the array instead, we sped up our code by a factor of almost 10.
But Olivier, now our function returns an array instead of a data frame. That’s not what we want. Yes, I know, and that is what I was referring to during my introduction to pseudo-vectorization. You will often have to make slight but intelligent modifications to your pipeline to benefit from it. Instead of always returning a data frame, you should slightly adapt your code so that you create a data frame only once, after concatenating all the numpy arrays returned by the adapted function. With these minor changes, that part of the pipeline is now 10 times faster.
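The pattern looks roughly like this (column names, day count, and the function body are hypothetical; the point is that the DataFrame is constructed once, outside the repeated function):

```python
import numpy as np
import pandas as pd

COLUMNS = list("abcde")  # hypothetical column names

def modify_day_fast(day_values):
    # operate on the raw numpy array; no DataFrame construction in here
    out = day_values.copy()
    out[:, 0] = out[:, 0] + np.random.randint(0, 3, size=len(out))
    return out

days = [np.random.rand(24, len(COLUMNS)) for _ in range(1_000)]

# create the DataFrame once, after concatenating all per-day arrays
result = pd.DataFrame(np.concatenate([modify_day_fast(d) for d in days]),
                      columns=COLUMNS)
```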
You could go even further and vectorize the loops as well, speeding up the function even more. However, vectorizing too much can be detrimental to your code's readability, as you can see in the image below, which does the same as the previous functions. As a rule of thumb, I wouldn't maniacally vectorize everything if there is no real need. Nevertheless, it is an instrumental skill to have in your pocket.
Of course, it is not always as straightforward or easy as in this example. Identifying and fixing bottlenecks in your code is a learning experience. It is something you will get better at as time progresses. Especially vectorizing your code can be pretty challenging at first. The most impactful optimizations originate from processes that need to be repeated multiple times.
Another example from a real case study is calculating the feature importance for a time series classification problem. Since we couldn't rely on an existing package due to the multi-output of the model and the multivariate time series input, we had to write our own implementation of the Shapley values. In that instance, I had written a method that calculated the feature importance case by case, rerunning the predictions for every case. Initially, this was fine, but as the project matured, it started to become a problem.
Eventually, I decided to rewrite the method. I identified that the main performance issues came from running the predictions so many times (we had hundreds of features, so for a single case, we already had to perform inference close to a thousand times). I rewrote the code so that we would run the predictions only once, with a large batch size. Again, this required some changes to the pipeline (you had to keep track of some metadata and then eventually assign the concatenated predictions back to that metadata). Still, the potential speedup (which I had first verified) was enormous, resulting in a method that was now 15-20x faster. Although optimizations often take the form of a for loop or a list comprehension, this is not exclusively the case. Inefficiently applying a custom function multiple times can also cause severe trouble, as we will see in the next section.
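The batching trick above can be sketched as follows. The predict function and the shapes are invented stand-ins (the real model and Shapley logic are not shown in the post); what carries over is keeping the per-case sizes as metadata, running one big batch, and splitting the concatenated predictions back afterwards:

```python
import numpy as np

def predict(batch):
    # stand-in for the model's (expensive) inference call
    return batch.sum(axis=(1, 2))

# hypothetical: 8 cases, each with 50 perturbed inputs of shape (10, 4)
cases = [np.random.rand(50, 10, 4) for _ in range(8)]

# slow variant: call predict() once per case (or per perturbation)
# fast variant: one large batch, then split using the stored sizes
split_points = np.cumsum([len(c) for c in cases])[:-1]  # metadata
all_preds = predict(np.concatenate(cases))
per_case_preds = np.split(all_preds, split_points)
```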
Assessing the efficiency of pandas
Import pandas as pd is probably the number one import among data scientists and sits at the top of basically every one of your notebooks. And with good reason. It is convenient, user-friendly, and reasonably fast, right? But how efficient is the pandas library really? And when should we use it?
Since pandas uses numpy under the hood, almost all of the built-in methods in pandas are fast. Unlike numpy, they are specifically designed to work well with tabular data of mixed types (strings, integers, lists, etc.), and pandas is generally considered faster once you have more than roughly 500k observations. The main drawbacks of pandas are that it consumes much more memory than numpy and that its indexing is remarkably slow compared to numpy, which excels in both aspects. As a rule of thumb, I would suggest converting to numpy if you are working solely with numerical data or if you cannot resort to the built-in pandas methods.
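In practice, that conversion is a one-liner. A small sketch (the data and operations are hypothetical) of dropping to numpy once and staying there for the numerical work:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(500_000, 3), columns=["a", "b", "c"])

# purely numerical work: convert once and stay in numpy
arr = df.to_numpy()
col_means = arr.mean(axis=0)   # equivalent to df.mean(), on the raw array
subset = arr[arr[:, 0] > 0.5]  # boolean indexing, cheaper than df[df["a"] > 0.5]
```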
Imagine: you want to apply a custom function to multiple columns of your data frame for the first time. A quick Google search later, you will probably have learned about pandas.DataFrame.apply. You copy-paste one of the first answers on StackOverflow, tweak it to your data frame, and the code does exactly what you want it to do. It might be a bit slow, but hey, it was the top answer on that StackOverflow thread. Surely there won't be a better solution, right?
Let's put that to the test. As a proof of concept, I have created a toy function that scales the number of flights based on an airport-specific maximum value previously stored in a dictionary. It thus requires exactly two columns from the data frame. We will start by assessing the performance of the often recommended apply-syntax.
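A sketch of that setup (the airport names, sizes, and the scaling helper are invented for illustration; the original code from the post is in the image):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
airports = ["AMS", "BRU", "CDG", "FRA", "LHR"]
df = pd.DataFrame({
    "airport": rng.choice(airports, size=20_000),
    "flights": rng.integers(0, 500, size=20_000),
})
max_flights = {a: 500 for a in airports}  # previously stored per-airport maxima

def scale_flights(airport, flights):
    return flights / max_flights[airport]

# the often recommended, row-wise apply syntax
df["scaled"] = df.apply(
    lambda row: scale_flights(row["airport"], row["flights"]), axis=1
)
```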
2.7 seconds for a little more than 20,000 observations. Not great, but reasonable. At first. Somewhere along my first project, I remember waiting almost half an hour for an apply function to finish on a data frame of a couple of million rows. I did this a couple of times until I was finally fed up with it. So I did what every programmer does in that situation: consult Google and look for better alternatives. One of the first suggestions I came across was a list comprehension combined with zip. I adopted the suggested syntax and started to wait. For merely 15 seconds. This alternative was over 100 times faster. How is that possible? And why does apply even exist then?
Let's look at our toy example. The situation above was not a lucky shot: apply is indeed remarkably slow. But why? Apply is a bit of a misfit within the pandas family. It is a catch-all wrapper for everything the built-in methods cannot do. Because it is a generic wrapper, it makes no assumptions about the underlying data. Unlike the other methods, it does not take advantage of any vectorization but executes the function row by row or column by column. On top of that, it consumes a ton of memory and constructs a new pandas Series for the result, which adds considerable allocation overhead.
Should list comprehensions automatically become the new standard for data science, then? Not necessarily. First of all, it is essential to highlight what is doing the heavy lifting here: the combination with zip lets you traverse several columns in lockstep as plain Python objects, bypassing pandas' expensive per-row machinery. That, rather than any actual parallelism, is what causes the impressive speedup. Although it is a clever trick, there are other alternatives to the apply syntax. For instance, the built-in map method iterates over the columns in the same way and achieves the same performance increase.
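Both alternatives, shown on the same hypothetical setup as before (the data and the scale_flights helper are invented for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
airports = ["AMS", "BRU", "CDG", "FRA", "LHR"]
df = pd.DataFrame({
    "airport": rng.choice(airports, size=20_000),
    "flights": rng.integers(0, 500, size=20_000),
})
max_flights = {a: 500 for a in airports}

def scale_flights(airport, flights):
    return flights / max_flights[airport]

# list comprehension + zip: iterate over the raw column values in lockstep
df["scaled"] = [scale_flights(a, f) for a, f in zip(df["airport"], df["flights"])]

# the built-in map achieves the same speedup with even less syntax
df["scaled"] = list(map(scale_flights, df["airport"], df["flights"]))
```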
Even though pandas - if used appropriately - is quite efficient, it still has some significant drawbacks in terms of performance. The three most important are its slowness for heavy numerical computations, its limited scalability when working with massive datasets, and its lack of support for multiprocessing (pandas uses only a single CPU core).
Although it would fall outside the scope of this blog to discuss them all, numerous alternatives have been released over time. For example, dask, modin, koalas, polars, and recently, terality. All these packages try to resolve those issues while mostly keeping the pandas syntax unchanged. So, feel free to explore those libraries if all of the above tips are to no avail or you are still unsatisfied with the performance.
Even though the advantages of vectorization are undeniable, I don't expect you to start rewriting all your inefficient functions immediately. With this blog post, I merely wanted to provide some tips to help you identify potential bottlenecks in your code, point out some common pitfalls, and show how to solve them. Though this will often require some changes to the existing pipeline, it is usually well worth it in terms of performance. It was not my intention to provide countless examples of code optimizations. Rather, I wanted to introduce a new way of approaching repetitive tasks. As with everything in programming, practice makes perfect, so adapt these practices to a workflow you feel comfortable with.