Optimize your CPU for Deep Learning

Param Popat
Towards Data Science
10 min read · May 8, 2019

In the last few years, Deep Learning has picked up pace in academia as well as industry. Every company is now looking for AI-based solutions to its problems. This boom has its own merits and demerits, but that’s for another article, another day. The surge in Machine Learning practitioners has also reached deep into academia, and almost every student in every domain now has access to AI and ML knowledge via courses, MOOCs, books, articles, and, of course, papers.

This rise was, however, bottlenecked by the availability of hardware resources. It has been suggested and demonstrated that a Graphics Processing Unit (GPU) is one of the best devices you can have to perform your ML tasks at pace. But a good high-performance GPU carries a price tag that can go up to $20,449.00 for a single NVIDIA Tesla V100 32GB with server-grade compute capabilities, and even a consumer laptop with a decent GPU such as a 1050Ti or 1080Ti costs around $2,000. To ease the pain, Google, Kaggle, Intel, and Nvidia provide cloud-based high-compute systems for free, with restrictions on space, compute capability, memory, or time. But these online services have their drawbacks, including managing the data (upload/download), data privacy, and so on. These issues lead to the main point of my article: “Why not optimize our CPUs to attain a speed-up in Deep Learning tasks?”

Intel has provided optimizations for Python, TensorFlow, PyTorch, etc., along with a whole range of Intel-optimized support libraries such as NumPy and scikit-learn. These are freely available to download and set up, and they provide a speed-up of anywhere from 2x to 5x even on a CPU like the Intel Core i7, which is not a high-performance CPU like the Xeon series. In the remainder of the article, I will demonstrate how to set up Intel’s optimizations on your PC/laptop and will present the speed-up data that I observed.

Performance Boost Obtained

For the four experiments listed below, I will present the time and utilization improvements that I observed.

  1. 10-layer deep CNN for CIFAR-100 image classification.
  2. 3-layer deep LSTM for IMDB sentiment analysis.
  3. 6-layer deep dense ANN for MNIST image classification.
  4. 9-layer deep fully convolutional auto-encoder for MNIST.

These tasks were coded in Keras with the TensorFlow backend, and the datasets reside on the same drive as the code and executable libraries. The drive used is an SSD.
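For concreteness, here is a minimal sketch of what task 3 (the dense ANN for MNIST) might look like in Keras. The article does not list the exact architectures, so the layer sizes and batch size below are illustrative assumptions:

from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Flatten
from keras.utils import to_categorical

# Load MNIST and normalize pixel values to [0, 1]
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0
y_train = to_categorical(y_train, 10)
y_test = to_categorical(y_test, 10)

# Illustrative 6-layer dense network; the actual layer sizes used in the
# benchmarks are not given in this article
model = Sequential([
    Flatten(input_shape=(28, 28)),
    Dense(512, activation='relu'),
    Dense(256, activation='relu'),
    Dense(128, activation='relu'),
    Dense(64, activation='relu'),
    Dense(32, activation='relu'),
    Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=50, batch_size=128, validation_data=(x_test, y_test))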

We will consider six combinations of optimizations, as follows.

  1. Intel(R) Core (TM) i7.
  2. Intel(R) Xeon(R) CPU E3–1535M v6.
  3. Intel(R) Core (TM) i7 with Intel Python (Intel i7*).
  4. Intel(R) Xeon(R) CPU E3–1535M v6 with Intel Python (Intel Xeon*).
  5. Intel(R) Core (TM) i7 with Intel Python and Processor Thread optimization (Intel i7(O)).
  6. Intel(R) Xeon(R) CPU E3–1535M v6 with Intel Python and Processor Thread optimization (Intel Xeon(O)).

For each task, the number of epochs was fixed at 50. In the chart below we can see that for an Intel(R) Core(TM) i7–7700HQ CPU @ 2.80GHz, the average time per epoch is nearly 4.67 seconds, and it drops to 1.48 seconds with proper optimization, a 3.2x speed-up. For an Intel(R) Xeon(R) CPU E3–1535M v6 @ 3.10GHz, the average time per epoch is nearly 2.21 seconds, and it drops to 0.64 seconds with proper optimization, a 3.45x speed-up.

Average time per epoch

The gains are not limited to time: the optimized distribution also lowers CPU utilization, which leads to better heat management, and your laptop won’t get as hot as it used to while training a deep neural network.

Utilization

We can see that without any optimization, CPU utilization while training maxes out at 100%, slowing down all other processes and heating up the system. With proper optimizations, however, utilization drops to 70% for the i7 and 65% for the Xeon, while still providing a performance gain in terms of time.

These two metrics can be summarized in relative terms as follows.

In the above graph, lower is better; in relative terms, the Intel Xeon with all the optimizations stands as the benchmark, and the Intel Core i7 takes almost twice as much time per epoch as the Xeon, even after optimization. The graph clearly shows the bright side of Intel’s Python optimizations in terms of both training time and CPU usage.

Setting Up Intel’s Python Distribution

Intel Software has provided an exhaustive list of resources on how to set this up, but there are some issues you may run into. More details about the distribution are available here. You can choose the type of installation, that is, either native pip or conda. I prefer conda, as it saves me a ton of hassle and lets me focus on ML rather than on solving compatibility issues between my libraries.

1) Download and install Anaconda

You can download Anaconda from here. Their website lists all the steps to install Anaconda on Windows, Ubuntu, and macOS, and they are easy to follow.

2) Set up Intel Python in your Anaconda distribution

This step is where it usually gets tricky. It is preferable to create a virtual environment for the Intel distribution so that you can always add or change your optimized libraries in one place. Let’s create a new virtual environment named “intel”.

conda create -n intel -c intel intelpython3_full

Here, -c specifies the channel, so instead of permanently adding Intel to your channel list, we reference it once via -c. The intelpython3_full package will automatically fetch the necessary libraries from Intel’s distribution and install them in your virtual environment. This command installs the following libraries.

The following NEW packages will be INSTALLED:
asn1crypto intel/win-64::asn1crypto-0.24.0-py36_3
bzip2 intel/win-64::bzip2-1.0.6-vc14_17
certifi intel/win-64::certifi-2018.1.18-py36_2
cffi intel/win-64::cffi-1.11.5-py36_3
chardet intel/win-64::chardet-3.0.4-py36_3
cryptography intel/win-64::cryptography-2.3-py36_1
cycler intel/win-64::cycler-0.10.0-py36_7
cython intel/win-64::cython-0.29.3-py36_1
daal intel/win-64::daal-2019.3-intel_203
daal4py intel/win-64::daal4py-2019.3-py36h7b7c402_6
freetype intel/win-64::freetype-2.9-vc14_3
funcsigs intel/win-64::funcsigs-1.0.2-py36_7
icc_rt intel/win-64::icc_rt-2019.3-intel_203
idna intel/win-64::idna-2.6-py36_3
impi_rt intel/win-64::impi_rt-2019.3-intel_203
intel-openmp intel/win-64::intel-openmp-2019.3-intel_203
intelpython intel/win-64::intelpython-2019.3-0
intelpython3_core intel/win-64::intelpython3_core-2019.3-0
intelpython3_full intel/win-64::intelpython3_full-2019.3-0
kiwisolver intel/win-64::kiwisolver-1.0.1-py36_2
libpng intel/win-64::libpng-1.6.36-vc14_2
llvmlite intel/win-64::llvmlite-0.27.1-py36_0
matplotlib intel/win-64::matplotlib-3.0.1-py36_1
menuinst intel/win-64::menuinst-1.4.1-py36_6
mkl intel/win-64::mkl-2019.3-intel_203
mkl-service intel/win-64::mkl-service-1.0.0-py36_7
mkl_fft intel/win-64::mkl_fft-1.0.11-py36h7b7c402_0
mkl_random intel/win-64::mkl_random-1.0.2-py36h7b7c402_4
mpi4py intel/win-64::mpi4py-3.0.0-py36_3
numba intel/win-64::numba-0.42.1-np116py36_0
numexpr intel/win-64::numexpr-2.6.8-py36_2
numpy intel/win-64::numpy-1.16.1-py36h7b7c402_3
numpy-base intel/win-64::numpy-base-1.16.1-py36_3
openssl intel/win-64::openssl-1.0.2r-vc14_0
pandas intel/win-64::pandas-0.24.1-py36_3
pip intel/win-64::pip-10.0.1-py36_0
pycosat intel/win-64::pycosat-0.6.3-py36_3
pycparser intel/win-64::pycparser-2.18-py36_2
pyopenssl intel/win-64::pyopenssl-17.5.0-py36_2
pyparsing intel/win-64::pyparsing-2.2.0-py36_2
pysocks intel/win-64::pysocks-1.6.7-py36_1
python intel/win-64::python-3.6.8-6
python-dateutil intel/win-64::python-dateutil-2.6.0-py36_12
pytz intel/win-64::pytz-2018.4-py36_3
pyyaml intel/win-64::pyyaml-4.1-py36_3
requests intel/win-64::requests-2.20.1-py36_1
ruamel_yaml intel/win-64::ruamel_yaml-0.11.14-py36_4
scikit-learn intel/win-64::scikit-learn-0.20.2-py36h7b7c402_2
scipy intel/win-64::scipy-1.2.0-py36_3
setuptools intel/win-64::setuptools-39.0.1-py36_0
six intel/win-64::six-1.11.0-py36_3
sqlite intel/win-64::sqlite-3.27.2-vc14_2
tbb intel/win-64::tbb-2019.4-vc14_intel_203
tbb4py intel/win-64::tbb4py-2019.4-py36_intel_0
tcl intel/win-64::tcl-8.6.4-vc14_22
tk intel/win-64::tk-8.6.4-vc14_28
urllib3 intel/win-64::urllib3-1.24.1-py36_2
vc intel/win-64::vc-14.0-2
vs2015_runtime intel/win-64::vs2015_runtime-14.0.25420-intel_2
wheel intel/win-64::wheel-0.31.0-py36_3
win_inet_pton intel/win-64::win_inet_pton-1.0.1-py36_4
wincertstore intel/win-64::wincertstore-0.2-py36_3
xz intel/win-64::xz-5.2.3-vc14_2
zlib intel/win-64::zlib-1.2.11-vc14h21ff451_5

You can see that each package’s description starts with “intel/…”, which signifies that the library is being downloaded from Intel’s distribution channel. Once you confirm the installation, the packages will start downloading and installing.

This step is where the first issue can come up. Sometimes these libraries fail to download and the installation stalls, or the command exits with an SSL error. The issue may even surface later: everything installs fine now, but when you try to add a new library down the line, the prompt throws SSL errors. There is an easy fix, which needs to be applied before creating the Intel virtual environment as described above.

In your shell or command prompt, turn off Anaconda’s default SSL verification via the following command.

conda config --set ssl_verify false

Once SSL verification is turned off, you can repeat step 2 by deleting the previously created environment and starting fresh.
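Note that disabling SSL verification is a workaround; once your environment is set up, you may want to turn verification back on with the corresponding standard conda command:

conda config --set ssl_verify true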

3) Set up TensorFlow

Congratulations! You have now set up Intel’s Python distribution on your PC/laptop. It’s time to enter the ML pipeline.

Intel has provided optimizations for TensorFlow via all the major distribution channels, and it is very smooth to set up. You can read more about it here. Intel’s optimized Math Kernel Library (MKL) speeds up the underlying mathematical operations and provides the required speed-up, so we will install tensorflow-mkl as follows.

conda install tensorflow-mkl

Or with pip, one can set it up as follows.

pip install intel-tensorflow

Voila! TensorFlow is now up and running on your system with the necessary optimizations. And if you are a Keras fan, you can set it up with a simple command:

conda install keras -c intel
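To quickly verify both installs, you can try importing the libraries from the activated environment (activation is covered in step 5). These are illustrative one-liners; the Keras import should also report which backend it is using:

python -c "import tensorflow as tf; print(tf.__version__)"
python -c "import keras; print(keras.__version__)"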

4) Set up Jupyter

Since we created a new virtual environment, it will not come with Spyder or Jupyter notebooks by default. However, these are straightforward to set up. With a single line, we can do wonders.

conda install jupyter -c intel
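Once installed, you can launch the notebook server from the activated environment as usual:

jupyter notebook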

5) Activate the Environment and start Experimenting

Now that we have set up everything, it’s time to get our hands dirty as we start coding and experimenting with various ML and DL approaches on our optimized CPU systems. First, before executing any code, make sure you are using the right environment: you need to activate the virtual environment before you can use the libraries installed in it. You will need to activate it every time you open a new prompt, but it is effortless. Write the following command in your Anaconda prompt, and you’re good to go.

conda activate intel

To make sanity checks on your environment, type the following in the command prompt/shell once the environment is activated.

python

Once you press enter after typing python, the following text should appear in your command prompt. Make sure it says “Intel Corporation” between the pipes and includes the message “Intel(R) Distribution for Python is brought to you by Intel Corporation.” These confirm a correct installation of Intel’s Python distribution.

Python 3.6.8 |Intel Corporation| (default, Feb 27 2019, 19:55:17) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
Intel(R) Distribution for Python is brought to you by Intel Corporation.
Please check out: https://software.intel.com/en-us/python-distribution
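As an additional sanity check, you can confirm that the NumPy in this environment is linked against MKL. np.__config__.show() is a standard NumPy call that prints the build configuration; an MKL-linked build lists MKL entries such as blas_mkl_info:

>>> import numpy as np
>>> np.__config__.show()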

Now you can use the command line to experiment, or write your scripts elsewhere and save them with the .py extension. These files can then be run by navigating to the file’s location via the “cd” command and invoking the script:

(intel) C:\Users\User>python script.py

By following steps 1 to 4, you will have your system ready at the “Intel xyz*” level mentioned in the performance benchmark charts above. These setups are still not thread-optimized for multi-core processors. Below, I discuss how to achieve further optimization for your multi-core CPU.

Multi-Core Optimization

To further optimize your multi-core system, you can add the following lines of code to your .py file, and it will execute the scripts accordingly. Here, NUM_PARALLEL_EXEC_UNITS represents the number of cores you have; I have a quad-core i7, hence the number is 4. On Windows, you can check the core count in your Task Manager by navigating to Task Manager -> Performance -> CPU -> Cores.

import os
import tensorflow as tf
from keras import backend as K

NUM_PARALLEL_EXEC_UNITS = 4  # set this to the number of physical cores on your CPU

# Configure TensorFlow's thread pools (TensorFlow 1.x API)
config = tf.ConfigProto(intra_op_parallelism_threads=NUM_PARALLEL_EXEC_UNITS,
                        inter_op_parallelism_threads=2,
                        allow_soft_placement=True,
                        device_count={'CPU': NUM_PARALLEL_EXEC_UNITS})
session = tf.Session(config=config)
K.set_session(session)

# OpenMP/MKL thread settings
os.environ["OMP_NUM_THREADS"] = "4"
os.environ["KMP_BLOCKTIME"] = "30"
os.environ["KMP_SETTINGS"] = "1"
os.environ["KMP_AFFINITY"] = "granularity=fine,verbose,compact,1,0"
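Broadly, intra_op_parallelism_threads controls how many threads a single operation may use, inter_op_parallelism_threads controls how many operations can run concurrently, KMP_BLOCKTIME sets how long (in milliseconds) an OpenMP thread waits for more work before sleeping, and KMP_AFFINITY pins threads to cores so the scheduler does not shuffle them around. If you are unsure of your core count, a quick cross-check from Python (note that multiprocessing.cpu_count() reports logical processors, which may be double the physical core count on Hyper-Threaded CPUs):

import multiprocessing

# Logical processors; halve this if Hyper-Threading is enabled to get physical cores
print(multiprocessing.cpu_count())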

If you’re not using Keras and prefer core TensorFlow, the script remains almost the same; just remove the following two lines.

from keras import backend as K

K.set_session(session)

After adding these lines to your code, the speed-up should be comparable to the Intel xyz(O) entries in the performance charts above.

If you have a GPU in your system and it conflicts with the current set of libraries or throws a cuDNN error, you can add the following lines near the top of your code (before TensorFlow is imported) to disable the GPU.

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

Conclusion

That’s it. You now have an optimized pipeline for testing and developing machine learning projects and ideas. This opens up a lot of opportunities for students involved in academic research to carry on their work with whatever system they have. The pipeline also removes the privacy worries around the sensitive data a practitioner might be working with.

It is also worth noting that with proper fine-tuning, one can obtain a 3.45x speed-up in their workflow, which means that if you are experimenting with your ideas, you can now work roughly three times as fast as before.
