Lightning fast Python with Numba (2)

Now you know how to speed up data processing with Numba. Today we will dig deeper into vectorization, window calculations, and real multithreading in Python.

Vectorizing windows

So, you can replace a loop over a vector with a single vector operation that is over 200 times faster. But what if you need a sliding calculation such as a moving average?

For simple calculations you can use an algorithmic trick, if you can find one. Generally speaking, though, you will end up with a loop of some kind, so you need a way to make it fast. This is where numba.guvectorize comes into play.

import numpy as np
from numba import guvectorize

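# Moving average via cumulative sums: the sum of x[i:i + w] equals
# cumsum[i + w] - cumsum[i], so no explicit loop is needed.
# Returns len(x) - window_width + 1 values, one per full window.
def numpy_f(x, window_width):
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[window_width:] - cumsum[:-window_width]) / window_width


# The same moving average as an explicit loop compiled by Numba.
# res has the same length as x: res[i] is the mean of the window ending at i,
# and the positions before the first full window are set to NaN,
# so numba_f(x, w)[w - 1:] matches numpy_f(x, w).
@guvectorize(['void(float64[:], int64[:], float64[:])'], '(n),()->(n)')
def numba_f(x, window_arr, res):
    window_width = window_arr[0]
    res[:window_width - 1] = np.nan
    # sum of the first full window
    xsum = x[:window_width].sum()
    res[window_width - 1] = xsum / window_width
    # slide the window: add the new item, drop the oldest one
    for i in range(window_width, len(x)):
        xsum += x[i] - x[i - window_width]
        res[i] = xsum / window_width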


# your data has 10 million items
x = np.random.uniform(0, 20., size=10_000_000)
# the window is quite large
N = 1000

# calculate with numpy.cumsum()
numpy_f(x, N)
# and with numba.guvectorize
numba_f(x, N)

Now compare the performance:

numpy: 95 ms
numba: 52 ms

And Numba wins again, running about 1.8 times faster! Keep in mind that NumPy relies here on well-optimized C functions, yet a fairly simple loop with a Numba decorator is still faster.
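To see what the cumsum trick in numpy_f actually computes, it may help to run it on a tiny array and compare it with a naive loop. This is only an illustrative sketch: the names moving_average_cumsum and moving_average_naive are made up for this example and are not part of the benchmark.

```python
import numpy as np


def moving_average_cumsum(x, window_width):
    # c[k] holds the sum of the first k items, so c[0] == 0
    c = np.cumsum(np.insert(x, 0, 0))
    # the sum of x[i:i + w] is c[i + w] - c[i]
    return (c[window_width:] - c[:-window_width]) / window_width


def moving_average_naive(x, window_width):
    # one mean per full window, computed the slow and obvious way
    n_windows = len(x) - window_width + 1
    return np.array([x[i:i + window_width].mean() for i in range(n_windows)])


x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
fast = moving_average_cumsum(x, 3)
slow = moving_average_naive(x, 3)
print(fast)  # one value per full window, i.e. len(x) - 3 + 1 == 4 items
print(np.allclose(fast, slow))
```

Each difference of two cumulative sums gives the sum of one window, so the whole result comes from two slices and one subtraction.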

Unfortunately, there are only a few functions like cumsum() that let you avoid iterating over a NumPy array. So when you need more sophisticated calculations over sliding windows (and you will definitely need them sooner or later), you can use numba.guvectorize to keep your code clean and fast at the same time. Compared with a plain Python loop, this can improve performance by 1-2 orders of magnitude.

Multithreaded Python, really?

Because of the Global Interpreter Lock (GIL), only one thread executes Python code at any given time, so threads give you no real parallelism. If your code is I/O-bound, gevent partially solves the problem. For computationally intensive programs, however, you have to resort to multiprocessing, and processes are heavyweight and require more complicated data handling and interprocess data exchange.

Fortunately, you can release the GIL and take advantage of fully functional threads in Python. All you need to do is add the nogil option to the jit decorator.

import numpy as np
import threading
from numba import jit


@jit('void(double[:], double[:], int64, int64, double[:])', nopython=True, nogil=True)
def numba_f(x, y, S, N, res):
    # process N items starting at index S; the GIL is released while this runs
    for i in range(S, min(S + N, len(x))):
        res[i] = np.log(np.exp(x[i]) * np.log(y[i]))


# number of threads
T = 8
# data size, 80 million items
N = 80_000_000
# data
x = np.random.uniform(0, 20., size=N)
y = np.random.uniform(0, 20., size=N)
# array for results
r = np.zeros(N)

# data size for each thread (N is divisible by T here)
chunk_N = N // T
# starting index for each thread
chunks = [i * chunk_N for i in range(T)]

threads = [threading.Thread(target=numba_f, args=(x, y, chunk, chunk_N, r)) for chunk in chunks]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
# all threads have finished here

# also run a 1-threaded version for comparison
numba_f(x, y, 0, N, r)

Did it bring a noticeable performance improvement? Yes, it did.

numba multithread  : 1.49 s
numba single thread: 7.52 s

Multithreading gives a 5x speedup. So you no longer have to write C extensions to achieve real parallelism with threads.
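The threaded example above works because the data size divides evenly by the number of threads. When it does not, the chunk boundaries need a little more care so that no items are skipped. The sketch below shows one possible way to do it. The names chunk_bounds and worker are hypothetical, and worker is a plain NumPy stand-in for a function compiled with nogil=True, used here only to keep the example self-contained.

```python
import threading

import numpy as np


def chunk_bounds(n_items, n_threads):
    """Return (start, stop) pairs that cover range(n_items) without gaps."""
    edges = [i * n_items // n_threads for i in range(n_threads + 1)]
    return list(zip(edges[:-1], edges[1:]))


def worker(x, y, start, stop, res):
    # hypothetical stand-in for a nogil-compiled function;
    # each thread writes only to its own slice of res
    res[start:stop] = np.log(np.exp(x[start:stop]) * np.log(y[start:stop]))


n = 100_003  # deliberately not divisible by the thread count
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 20.0, size=n)
y = rng.uniform(2.0, 20.0, size=n)  # y starts at 2 so that log(y) stays positive
res = np.zeros(n)

threads = [threading.Thread(target=worker, args=(x, y, start, stop, res))
           for start, stop in chunk_bounds(n, 8)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

expected = np.log(np.exp(x) * np.log(y))
print(np.allclose(res, expected))
```

Because the slices never overlap, the threads need no locking around the shared result array.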


P.S. This post is not a replacement for the Numba documentation. Please read it carefully, as there are a few constraints and important notes that might influence your code and even your program design.