I do a lot of data science stuff: reading and writing large data files, running some complicated and resource-consuming software (like Apache Spark), programming in several languages simultaneously (R, Python, JavaScript, Scala), quick prototyping in Jupyter notebooks. Some libraries I use do not work on Windows, so I need Linux as well.

Not surprisingly, soon after Windows Subsystem for Linux (WSL) was announced I tried it, as I had got sick and tired of all that VM stuff. The installation procedure was pretty straightforward. However, soon I bumped into problems with some packages and particularly Jupyter notebooks, so I had to remove WSL completely. Eventually I worked out an installation process that leads to a problem-free configuration where everything works smoothly.

By the way, I use the very same flow for ordinary Linux servers (with a couple of exceptions I will mention later).

How to install and uninstall WSL

After enabling WSL you might need to reinstall it (for instance, when you get into a vicious circle of incompatible packages, or when something does not work and you don’t know why). You can do that from a regular Windows command prompt with lxrun.exe.

To remove WSL and all your Linux files, just type a simple command:

lxrun /uninstall /full

You will see a confirmation prompt (unless you use the /y option), and after a few minutes WSL will be uninstalled and all your files will be deleted.

After that you can re-install WSL:

lxrun /install /y

Installing system packages

I usually start with installing compilers, git, and some useful libraries.

sudo apt-get install -y build-essential gfortran
sudo apt-get install -y zlib1g-dev liblapack-dev libatlas-base-dev libopenblas-dev libhdf5-dev libedit-dev
sudo apt-get install -y git

Due to WSL peculiarities, some network applications do not work properly. Unfortunately, Jupyter is one of them. Fortunately, there is a fix.

sudo add-apt-repository ppa:aseering/wsl
sudo apt-get update
sudo apt-get install libzmq3 libzmq3-dev

As you might guess, I also use numba quite a lot, so I need LLVM. The problem is that LLVM develops quickly and new releases often break API compatibility. Currently numba requires LLVM 3.7, but it is not available in the official repositories of Ubuntu 14.04, which lies at the core of WSL. That is why we need some manual editing.

sudo tee /etc/apt/sources.list.d/llvm.list << EOF
deb http://apt.llvm.org/trusty/ llvm-toolchain-trusty main
deb-src http://apt.llvm.org/trusty/ llvm-toolchain-trusty main
# 3.7
deb http://apt.llvm.org/trusty/ llvm-toolchain-trusty-3.7 main
deb-src http://apt.llvm.org/trusty/ llvm-toolchain-trusty-3.7 main
# 3.8 
deb http://apt.llvm.org/trusty/ llvm-toolchain-trusty-3.8 main
deb-src http://apt.llvm.org/trusty/ llvm-toolchain-trusty-3.8 main
# 3.9 
deb http://apt.llvm.org/trusty/ llvm-toolchain-trusty-3.9 main
deb-src http://apt.llvm.org/trusty/ llvm-toolchain-trusty-3.9 main
EOF
sudo apt-key adv --recv-keys --keyserver keyserver.ubuntu.com 15CF4D18AF4F7421
sudo apt-get update
sudo apt-get install llvm-3.7
sudo ln -s /usr/bin/llvm-config-3.7 /usr/bin/llvm-config

What we do here is pretty simple:

  • add additional sources for LLVM packages
  • add a new key
  • update info about available packages
  • install LLVM 3.7
  • create a symlink so that llvm-config references the right version (a quick check is shown below).
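
To verify that the symlink resolves to the right version, a quick check from Python (just a convenience sketch; running llvm-config --version in the shell does the same) could be:

import subprocess

# llvm-config --version prints the version the symlink points to; expect 3.7.x
print(subprocess.check_output(['llvm-config', '--version']).decode().strip())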

Installing Python packages

First of all, we need pip and the Python development package with C headers and other bits needed for building custom packages.

sudo apt-get install -y python-pip python-dev

The rest will be installed through pip, so I just list the packages:

  • cython
  • virtualenv
  • numpy
  • numexpr
  • blosc
  • enum34
  • llvmlite
  • numba
  • scipy
  • tables
  • feather-format
  • pandas
  • sklearn
  • statsmodels
  • jupyter

And now you can start jupyter to check it:

jupyter notebook --no-browser

Connect to http://localhost:8888 (your settings might differ), and create or open a notebook to run some code. If the kernel does not crash immediately, then everything is fine.
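
If you want a slightly more thorough check than the kernel simply surviving, a minimal smoke-test cell (assuming the packages listed above installed cleanly) could look like this:

# a minimal smoke test for the baseline stack
import numpy as np
import pandas as pd
import numba

@numba.jit
def add_one(x):
    return x + 1

print('numpy %s, pandas %s, numba %s' % (np.__version__, pd.__version__, numba.__version__))
# should print 15 without the kernel dying
print(add_one(np.arange(5)).sum())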

Take into account that this is the baseline installation. You can now install other great software like xgboost, tensorflow, mxnet, neon, etc. I don’t need them in every project, which is why they are not part of the baseline.

To sum up

WSL is a great tool despite some current constraints (like graphics and networking). Nevertheless, as a data scientist you can use WSL to your advantage, and it will be of great help, especially if you need to live in both worlds (Windows and Linux).

After a long struggle I managed to build TensorFlow from source with GPU support for CUDA compute capability 3.0. So now I can dig deeper into what TensorFlow is and how one can solve analytics tasks with it.

Let’s begin with a logistic regression, a simple, yet pretty powerful tool suitable for real-life business problems.

import tensorflow as tf

# to begin with, load your data
# you have to implement your own load_data function
train_X, train_Y, test_X, test_Y = load_data()

# data format is as usual:
# train_X and test_X have shape (num_instances, num_features)
# train_Y and test_Y have shape (num_instances, num_classes)
num_features = train_X.shape[1]
num_classes = train_Y.shape[1]

# Create variables
# X is a symbolic variable which will contain input data
# shape [None, num_features] suggests that we don't limit the number of instances in the model
# while the number of features is known in advance
X = tf.placeholder("float", [None, num_features])
# same with labels: number of classes is known, while number of instances is left undefined
Y = tf.placeholder("float",[None, num_classes])

# W - weights array
W = tf.Variable(tf.zeros([num_features,num_classes]))
# B - bias array
B = tf.Variable(tf.zeros([num_classes]))

# Define a model
# a simple linear model y = Wx + b wrapped into softmax
logits = tf.matmul(X, W) + B
pY = tf.nn.softmax(logits)
# pY will contain predictions the model makes, while Y contains real data

# Define a cost function
# note that softmax_cross_entropy_with_logits expects the raw logits,
# not the softmax output (it applies softmax internally)
cost_fn = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits, Y))
# You could also put it in a more explicit way
# cost_fn = -tf.reduce_sum(Y * tf.log(pY))

# Define an optimizer
# I prefer Adam
opt = tf.train.AdamOptimizer(0.01).minimize(cost_fn)
# but there is also a plain old SGD if you'd like
#opt = tf.train.GradientDescentOptimizer(0.01).minimize(cost_fn)

# Create and initialize a session
sess = tf.Session()
init = tf.initialize_all_variables()
sess.run(init)

num_epochs = 40
for i in range(num_epochs):
  # run an optimization step with all train data
  sess.run(opt, feed_dict={X:train_X, Y:train_Y})
  # thus, a symbolic variable X gets data from train_X, while Y gets data from train_Y

# Now assess the model
# create a variable which reflects how good your predictions are
# here we just compare if the predicted label and the real label are the same
accuracy = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(pY,1), tf.argmax(Y,1)), "float"))
# and finally, run calculations with all test data
accuracy_value = sess.run(accuracy, feed_dict={X:test_X, Y:test_Y})
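
The load_data function above is the hypothetical helper you are supposed to provide yourself. For a quick experiment, a minimal sketch could use the digits dataset from scikit-learn (which is already in the baseline package list); this is just an illustration, not part of the model itself:

import numpy as np
from sklearn.datasets import load_digits

def load_data(train_fraction=0.8):
    digits = load_digits()
    X = digits.data.astype('float32')
    # one-hot encode the labels into shape (num_instances, num_classes)
    Y = np.zeros((len(digits.target), 10), dtype='float32')
    Y[np.arange(len(digits.target)), digits.target] = 1.0
    # simple random split into train and test parts
    idx = np.random.permutation(len(X))
    split = int(train_fraction * len(X))
    train, test = idx[:split], idx[split:]
    return X[train], Y[train], X[test], Y[test]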

Now you know how to speed up data processing with Numba. Today we will dig deeper into vectorization, window calculations, and real multithreading in Python.

Vectorizing windows

So, you can replace looping over a vector with a single vector operation, which is over a hundred times faster. But what if you need some sliding calculation like a moving average?

For simple calculations you can sometimes find an algorithmic trick. But generally speaking, you will eventually end up with a loop of some kind, so you should be able to make it fast. This is where numba.guvectorize comes into play.

import numpy as np
from numba import guvectorize

# a moving average computed with cumulative sums:
# the difference between two cumsum values that are window_width apart
# is exactly the sum of the window between them
def numpy_f(x, window_width):
  cumsum = np.cumsum(np.insert(x, 0, 0))
  return (cumsum[window_width:] - cumsum[:-window_width]) / window_width


@guvectorize(['void(float64[:], int64[:], float64[:])'], '(n),()->(n)')
def numba_f(x, window_arr, res):
    window_width = window_arr[0]
    # running sum of the first full window
    xsum = x[:window_width].sum()
    # entries before the first full window are left untouched
    res[window_width - 1] = xsum / window_width
    # slide the window: add the new element, drop the oldest one
    for i in range(window_width, len(x)):
        xsum += x[i] - x[i - window_width]
        res[i] = xsum / window_width


# your data has 10 million items
x = np.random.uniform(0, 20., size=int(1e7))
# the window is quite large
N = 1000

# calculate with numpy.cumsum()
numpy_f(x, N)
# and with numba.guvectorize
numba_f(x, N)

Now compare the performance:

numpy: 95 ms
numba: 52 ms

And numba wins again, with a speed-up of over 1.8x! Take into account that numpy here relies on very well optimized C functions, yet a simple loop with a numba decorator is still faster.

Unfortunately, there are just a few functions like cumsum() which can help you avoid iterating over a numpy array. So when you need more sophisticated calculations over sliding windows (and you will definitely need them sooner or later), you can use numba.guvectorize to make your code clean and fast at the same time and gain a performance improvement of 1-2 orders of magnitude.
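
As an illustration of a window calculation that has no cumsum-style shortcut, here is a sketch of a moving maximum in the same guvectorize style (my own example, reusing the x and N defined above and the same signature conventions as numba_f):

@guvectorize(['void(float64[:], int64[:], float64[:])'], '(n),()->(n)')
def moving_max(x, window_arr, res):
    window_width = window_arr[0]
    # entries before the first full window are left untouched, as in numba_f above
    for i in range(window_width - 1, len(x)):
        # plain nested loop over the window; numba compiles it to native code
        m = x[i - window_width + 1]
        for j in range(i - window_width + 2, i + 1):
            if x[j] > m:
                m = x[j]
        res[i] = m

# usage mirrors numba_f above
moving_max(x, N)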

Multithreaded Python, really?

Due to the Global Interpreter Lock (GIL), only one thread is active in a Python program at any given time. As a result, you can’t get real parallelization. This problem may be partially worked around with gevent if your code is I/O-bound. However, for computationally intensive programs you had to resort to multiprocessing, even though processes are heavyweight and require more complicated data manipulation and sophisticated interprocess data exchange.

Fortunately, you can release the GIL and take advantage of fully functional threads in Python. All you need to do is add the nogil option to the jit decorator.

import numpy as np
import threading
from numba import jit


@jit('void(double[:], double[:], int64, int64, double[:])', nopython=True, nogil=True)
def numba_f(x, y, S, N, res):
    for i in xrange(S, min(S+N, len(x))):
        res[i] = np.log(np.exp(x[i]) * np.log(y[i]))


# number of threads
T = 8
# data size, 80 million items
N = 80000000
# data
x = np.random.uniform(0, 20., size=N)
y = np.random.uniform(0, 20., size=N)
# array for results
r = np.zeros(N)

# data size for each thread
chunk_N = N // T
# starting index for each thread
chunks = [i * chunk_N for i in range(T)]

threads = [threading.Thread(target=numba_f, args=(x,y,chunk,chunk_N,r)) for chunk in chunks]
for thread in threads:
  thread.start()
for thread in threads:
  thread.join()
# all threads have finished here

# also run a 1-threaded version for comparison
numba_f(x, y, 0, N, r)
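
For the record, here is roughly how such a comparison can be timed; this is just a sketch with time.time(), not the exact benchmarking code behind the numbers below:

import time

# time the multithreaded run
start = time.time()
threads = [threading.Thread(target=numba_f, args=(x, y, chunk, chunk_N, r)) for chunk in chunks]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
print('numba multithread  : %.2f s' % (time.time() - start))

# time the single-threaded run
start = time.time()
numba_f(x, y, 0, N, r)
print('numba single thread: %.2f s' % (time.time() - start))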

Did it bring a noticeable performance improvement? Yes, it did.

numba multithread  : 1.49 s
numba single thread: 7.52 s

Multithreading gives a 5x speed-up. Thus you no longer have to write C extensions to achieve real parallelization with threads.


P.S. This post is not a replacement for the Numba documentation. Please read it carefully, as there are a few constraints and important notes which might influence your code and even your program design.

Even if your numpy uses all available cores of your processor, your programs can still be pretty slow. This is especially the case if you process massive datasets and perform a lot of data transformations and calculations.

Switch to vector operations

The very first hint would be to quit the pythonic way of programming and get rid of list comprehensions and all other loops, as these make your programs as fast as snails.

Instead, use numpy’s vector operations, which bring you a significant performance improvement. For instance, suppose you have a vector of timestamps and need to know all the intervals between consecutive ones. Let’s write simple code containing both a list comprehension and a vector operation:

import numpy as np
# x contains a timeseries; fill it with random data for a test
x = np.random.random(int(1e7))
# this is a classic pythonic list comprehension
intervals = [x[i] - x[i-1] for i in range(1, len(x))]
# and a vector subtraction which gives the very same result
intervals = x[1:] - x[:-1]

But compare the performance:

list comprehension: 5642 ms
vector operation: 32 ms

This is an enormous speed up - 176 times faster. Those who have mastered vector operations and never write loops and list comprehensions anymore believe that this is the fastest way of data processing. And they couldn’t be more wrong…

Numba makes it even faster

To put it simply, Numba is a just-in-time compiler that makes your Python code as fast as code written in C. All you need to do is add a decorator to your usual Python function (you have to comply with certain constraints, though, as not just any code can be made fast). Let me demonstrate it with the simplest example:

import numpy as np
from numba import jit

def python_f(x):
  s = 0
  for i in range(len(x)):
    s += x[i]
  return s

def numpy_f(x):
  return x.sum()

@jit
def numba_f(x):
  s = 0
  for i in range(len(x)):
    s += x[i]
  return s

# data
x = np.random.random(int(3e7))
# try them all
python_f(x)
numpy_f(x)
numba_f(x)

And now execution times:

python: 8641 ms
numpy: 52 ms
numba (first call): 116 ms
numba (next calls): 35 ms

The first call of a numba function includes compilation, which is why it takes longer. But once compiled, it is 1.5 times faster than highly optimized numpy code. This is just unbelievable! The simplest summation gives you an immense performance increase. But this is not all numba can do for you.
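
If you want to measure these timings yourself, a minimal sketch with time.time() might look like the following (to see the compilation overhead, the first timed call to numba_f must be the very first call in the process):

import time

def timed(label, fn, *args):
    # measure one call of fn and report it in milliseconds
    start = time.time()
    fn(*args)
    print('%s: %.0f ms' % (label, 1000 * (time.time() - start)))

timed('python', python_f, x)
timed('numpy', numpy_f, x)
timed('numba (first call)', numba_f, x)   # includes JIT compilation
timed('numba (next call)', numba_f, x)    # already compiled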

Vectorize everything

As long as you can express your calculation as vector-wide operations and data transformations, numpy is great. However, eventually you bump into a situation where you need some element-wise logic… and that is where numpy stops delivering.

import numpy as np
from numba import vectorize

def python_f(x,y):
  if x > y:
     m = x
  else:
     m = y
  return m

@vectorize
def numba_f(x,y):
  if x > y:
     m = x
  else:
     m = y
  return m

# your data, 10 million items
size = int(1e7)
x = np.random.randint(0, 20, size=size)
y = np.random.randint(0, 20, size=size)

# pythonic way of element-wise operations
[python_f(x[i], y[i]) for i in range(len(x))]
# numpy has its own vectorize
numpy_f = np.vectorize(python_f)
numpy_f(x, y)
# but numba's vectorize is what we're after
numba_f(x, y)

You might be eager to see the performance figures. Here they are:

python: 6657 ms
numpy: 3068 ms
numba (first call): 70 ms
numba (next calls): 24 ms

Now you see it - numba is 127 times faster than numpy.

Show your types

As you might have noticed, the first call to a numba function takes much longer, as it includes a compilation phase. Fortunately, you can avoid this with so-called eager, or decoration-time, compilation: just pass type signatures to the decorator.

from numba import vectorize, int32

@vectorize([int32(int32, int32)])
def numba_f(x,y):
  if x > y:
     m = x
  else:
     m = y
  return m

You can pass several signatures if you work with different data types in the same function:

@vectorize([int32(int32, int32), 
            int64(int64, int64), 
            float32(float32, float32), 
            float64(float64, float64)])

Now your first call will be fast as well, since compilation is performed in advance. However, if you call this function with data of some other type, you will get a run-time error, as shown below. So be cautious and think carefully about your data and how you process it.
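
To see what such a run-time error looks like, here is a small illustration (my own example using the eagerly compiled numba_f above); numpy refuses to apply the int32-only ufunc to float arrays:

import numpy as np

a = np.arange(10, dtype=np.int32)
b = np.arange(10, dtype=np.int32)
print(numba_f(a, b))       # fine: matches the int32(int32, int32) signature

fa = a.astype(np.float64)
fb = b.astype(np.float64)
try:
    numba_f(fa, fb)        # no matching signature for float64 inputs
except TypeError as e:
    print('run-time error: %s' % e)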

Almost everybody now uses numpy, as it is extremely helpful for data analysis. However, oftentimes (if not almost always) numpy does not deliver its full strength, since it is installed in a very inefficient way: linked against old-fashioned ATLAS and BLAS libraries which can use only one CPU core, even when your computer is equipped with a multicore processor or even several processors.

This post is intended for Linux users only. All the shell commands below are for Ubuntu, but you will easily find analogues for your distribution. Those from the Windows world might need additional googling. The other option is to switch to a scientific Python bundle like Anaconda (my choice), Canopy or some other.

You might easily check if you are affected by the single core numpy problem. To do this just create a simple test program:

import numpy as np
size = 10000
a = np.random.random_sample((size, size))
b = np.random.random_sample((size, size))
n = np.dot(a,b)

And run it as a background process

python test.py &

Afterwards, run top to check the performance:

top

You will see a python process at the very top of your process list. Now pay attention to the %CPU column of that process: a value around 100 means that it is actually using only 1 CPU core.

If this is the case, you might want to significantly improve numpy’s performance. Fortunately, this is pretty easy.

Check libraries

There are two quite different situations:

  • you have some ATLAS/BLAS libraries installed already;
  • you don’t have any libraries yet.

To find out what you have, check whether your numpy is linked to BLAS

cd /usr/local/lib/python2.7/dist-packages/numpy/core
ldd multiarray.so

In earlier numpy versions (before 1.10) you have to check linkage of _dotblas.so (instead of multiarray.so), so you should do:

cd /usr/local/lib/python2.7/dist-packages/numpy/core
ldd _dotblas.so

The output will look something like:

        linux-vdso.so.1 =>  (0x00007fffe58a2000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f8adbff4000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f8adbdd6000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f8adba10000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f8adc68c000)

If there is no mention of libblas.so (as in the output above), then you have situation #2: you don’t have any BLAS library installed. This means that your numpy uses its own internal library of linear algebra routines, which is extremely slow. You will get the greatest performance improvement, but at the expense of recompiling and reinstalling numpy.

If libblas.so is mentioned in your ldd output, you may just reassign the BLAS library, which is fast and easy.
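
Besides ldd, you can also ask numpy itself which BLAS/LAPACK it was built against; np.show_config() prints the build-time information (treat it as a complementary check, since its output format varies between numpy versions):

import numpy as np
# prints the BLAS/LAPACK libraries and paths numpy was built with
np.show_config()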

In any case, you need a better BLAS library first.

Install OpenBLAS library

OpenBLAS is a very good library implementing the linear algebra algorithms and functions that lie at the core of many modern data analysis methods.

However, to begin with, you need a Fortran compiler, so install the gfortran package; the g77 compiler that you most probably have is incompatible with OpenBLAS.

sudo apt-get install gfortran

Now download OpenBLAS sources from Github

git clone https://github.com/xianyi/OpenBLAS.git

Enter the directory and compile sources

cd OpenBLAS
make FC=gfortran

When make has successfully finished, install the library

sudo make install

The default installation directory is /opt/OpenBLAS. You may choose a different location, though:

sudo make install PREFIX=/your/preferred/location

Reassign BLAS

If you had another BLAS library at the beginning, you need to make OpenBLAS library the preferred choice of all BLAS libraries installed.

sudo update-alternatives --install /usr/lib/libblas.so.3 libblas.so.3 \
	/opt/OpenBLAS/lib/libopenblas.so 50

Now run the test again and make sure that all CPU cores are now being used. This is all you have to do. Now enjoy the full speed numpy.

Build the right numpy

Those who did not have any BLAS libraries are left with nothing to do but reinstall numpy.

First of all, get rid of the wrong numpy you already have.

sudo pip uninstall numpy

Then create a file .numpy-site.cfg in your home directory with the following content:

[default]
include_dirs = /opt/OpenBLAS/include
library_dirs = /opt/OpenBLAS/lib

[openblas]
openblas_libs = openblas
include_dirs = /opt/OpenBLAS/include
library_dirs = /opt/OpenBLAS/lib

[lapack]
lapack_libs = openblas

[atlas]
atlas_libs = openblas
libraries = openblas

If you have chosen a different location for your OpenBLAS installation, edit the paths accordingly.

And install numpy again

sudo pip install numpy

If there were no errors during compilation and installation and everything went just fine, run the test again to make sure that all CPU cores are now being used.

If you prefer a manual compilation/installation, like I often do, you may try the following approach. First, download numpy sources.

pip install -d . numpy

This will create a file named something like numpy-1.10.2.tar.gz in the current directory (the version will certainly change in the future). Unpack it and enter the source directory.

tar xzf numpy-1.10.2.tar.gz
cd numpy-1.10.2

Now create site.cfg file (notice that the name is a bit different here) with the very same content as .numpy-site.cfg above.

And finally build and install numpy

python setup.py build
sudo python setup.py install

Now you can run the test to see how fast your numpy is.
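
To put a number on the improvement, you can also time the very same test; a minimal sketch:

import time
import numpy as np

size = 10000
a = np.random.random_sample((size, size))
b = np.random.random_sample((size, size))

start = time.time()
n = np.dot(a, b)
print('np.dot of two %dx%d matrices took %.1f s' % (size, size, time.time() - start))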