Optimize your CPU for Deep Learning

Param Popat
Towards Data Science
10 min read · May 8, 2019

In the last few years, Deep Learning has picked up pace in academia as well as industry. Every company is now looking for AI-based solutions to its problems. This boom has its own merits and demerits, but that’s for another article, another day. The surge in Machine Learning practitioners has also reached deep into academia, and almost every student in every domain now has access to AI and ML knowledge via courses, MOOCs, books, articles, and, of course, papers.

This rise was, however, bottlenecked by the availability of hardware resources. It has been suggested and demonstrated that a Graphics Processing Unit (GPU) is one of the best devices you can have to perform your ML tasks at pace. But a good high-performance GPU carries a price tag that can go up to $20,449.00 for a single NVIDIA Tesla V100 32GB with server-grade compute capabilities, and even a consumer laptop with a decent GPU such as a 1050Ti or 1080Ti costs around $2,000. To ease the pain, Google, Kaggle, Intel, and Nvidia provide cloud-based high-compute systems for free, with restrictions on space, compute capability, memory, or time. But these online services have their drawbacks, including managing the data (upload/download), data privacy, and so on. These issues lead to the main point of my article: “Why not optimize our CPUs to attain a speed-up in Deep Learning tasks?”

Intel has provided optimizations for Python, TensorFlow, PyTorch, etc., along with a whole range of Intel-optimized support libraries such as NumPy and scikit-learn. These are freely available to download and set up, and they provide a speed-up of anywhere from 2x to 5x even on a CPU like the Intel Core i7, which is not a high-performance CPU like the Xeon series. In the remainder of the article, I will demonstrate how to set up Intel’s optimizations on your PC/laptop and will present the speed-up data that I observed.

Performance Boost Obtained

For the four experiments listed below, I will present the time and utilization improvements that I observed.

  1. 10-layer deep CNN for CIFAR-100 image classification.
  2. 3-layer deep LSTM for IMDB sentiment analysis.
  3. 6-layer deep dense ANN for MNIST image classification.
  4. 9-layer deep fully convolutional auto-encoder for MNIST.

These tasks were coded in Keras with the TensorFlow backend, and the datasets reside on the same drive as the code and executable libraries. The drive used is an SSD.
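For concreteness, here is a minimal sketch of what task 3 (the dense ANN for MNIST) might look like in Keras. The article does not list the exact architectures, so the layer sizes and batch size below are illustrative assumptions:

from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Flatten
from keras.utils import to_categorical

# Load MNIST and normalize pixel values to [0, 1]
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0
y_train = to_categorical(y_train, 10)
y_test = to_categorical(y_test, 10)

# Illustrative 6-layer dense network; the actual layer sizes used in the
# benchmarks are not given in this article
model = Sequential([
    Flatten(input_shape=(28, 28)),
    Dense(512, activation='relu'),
    Dense(256, activation='relu'),
    Dense(128, activation='relu'),
    Dense(64, activation='relu'),
    Dense(32, activation='relu'),
    Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=50, batch_size=128, validation_data=(x_test, y_test))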

We will consider six combinations of optimizations, as follows.

  1. Intel(R) Core (TM) i7.
  2. Intel(R) Xeon(R) CPU E3–1535M v6.
  3. Intel(R) Core (TM) i7 with Intel Python (Intel i7*).
  4. Intel(R) Xeon(R) CPU E3–1535M v6 with Intel Python (Intel Xeon*).
  5. Intel(R) Core (TM) i7 with Intel Python and Processor Thread optimization (Intel i7(O)).
  6. Intel(R) Xeon(R) CPU E3–1535M v6 with Intel Python and Processor Thread optimization (Intel Xeon(O)).

For each task, the number of epochs was fixed at 50. In the chart below we can see that for an Intel(R) Core(TM) i7–7700HQ CPU @ 2.80GHz, the average time per epoch is nearly 4.67 seconds, and it drops to 1.48 seconds with proper optimization, a 3.2x speed-up. For an Intel(R) Xeon(R) CPU E3–1535M v6 @ 3.10GHz, the average time per epoch is nearly 2.21 seconds, and it drops to 0.64 seconds with proper optimization, a 3.45x speed-up.

Average time per epoch

The gains are not limited to time: the optimized distribution also lowers CPU utilization, which leads to better heat management, and your laptop won’t get as hot as it used to while training a deep neural network.

Utilization

We can see that without any optimization, CPU utilization while training maxes out at 100%, slowing down all other processes and heating up the system. With proper optimizations, however, utilization drops to 70% for the i7 and 65% for the Xeon, while still providing a performance gain in terms of time.

These two metrics can be summarized in relative terms as follows.

In the above graph, lower is better; in relative terms, the Intel Xeon with all the optimizations stands as the benchmark, and the Intel Core i7 takes almost twice as much time per epoch as the Xeon, even after optimization. The graph clearly shows the bright side of Intel’s Python optimizations in terms of both training time and CPU usage.

Setting Up Intel’s Python Distribution

Intel Software has provided an exhaustive list of resources on how to set this up, but there are some issues you may run into. More details about the distribution are available here. You can choose the type of installation, that is, either native pip or conda. I prefer conda, as it saves me a ton of hassle and lets me focus on ML rather than on solving compatibility issues between my libraries.

1) Download and install Anaconda

You can download Anaconda from here. Their website lists all the steps to install Anaconda on Windows, Ubuntu, and macOS, and they are easy to follow.

2) Set up Intel Python in your Anaconda distribution

This step is where it usually gets tricky. It is preferable to create a virtual environment for the Intel distribution so that you can always add or change your optimized libraries in one place. Let’s create a new virtual environment named “intel”.

conda create -n intel -c intel intelpython3_full

Here, -c specifies the channel, so instead of permanently adding Intel to your channel list, we reference it once via -c. The intelpython3_full package will automatically fetch the necessary libraries from Intel’s distribution and install them in your virtual environment. This command installs the following libraries.

The following NEW packages will be INSTALLED:
asn1crypto intel/win-64::asn1crypto-0.24.0-py36_3
bzip2 intel/win-64::bzip2-1.0.6-vc14_17
certifi intel/win-64::certifi-2018.1.18-py36_2
cffi intel/win-64::cffi-1.11.5-py36_3
chardet intel/win-64::chardet-3.0.4-py36_3
cryptography intel/win-64::cryptography-2.3-py36_1
cycler intel/win-64::cycler-0.10.0-py36_7
cython intel/win-64::cython-0.29.3-py36_1
daal intel/win-64::daal-2019.3-intel_203
daal4py intel/win-64::daal4py-2019.3-py36h7b7c402_6
freetype intel/win-64::freetype-2.9-vc14_3
funcsigs intel/win-64::funcsigs-1.0.2-py36_7
icc_rt intel/win-64::icc_rt-2019.3-intel_203
idna intel/win-64::idna-2.6-py36_3
impi_rt intel/win-64::impi_rt-2019.3-intel_203
intel-openmp intel/win-64::intel-openmp-2019.3-intel_203
intelpython intel/win-64::intelpython-2019.3-0
intelpython3_core intel/win-64::intelpython3_core-2019.3-0
intelpython3_full intel/win-64::intelpython3_full-2019.3-0
kiwisolver intel/win-64::kiwisolver-1.0.1-py36_2
libpng intel/win-64::libpng-1.6.36-vc14_2
llvmlite intel/win-64::llvmlite-0.27.1-py36_0
matplotlib intel/win-64::matplotlib-3.0.1-py36_1
menuinst intel/win-64::menuinst-1.4.1-py36_6
mkl intel/win-64::mkl-2019.3-intel_203
mkl-service intel/win-64::mkl-service-1.0.0-py36_7
mkl_fft intel/win-64::mkl_fft-1.0.11-py36h7b7c402_0
mkl_random intel/win-64::mkl_random-1.0.2-py36h7b7c402_4
mpi4py intel/win-64::mpi4py-3.0.0-py36_3
numba intel/win-64::numba-0.42.1-np116py36_0
numexpr intel/win-64::numexpr-2.6.8-py36_2
numpy intel/win-64::numpy-1.16.1-py36h7b7c402_3
numpy-base intel/win-64::numpy-base-1.16.1-py36_3
openssl intel/win-64::openssl-1.0.2r-vc14_0
pandas intel/win-64::pandas-0.24.1-py36_3
pip intel/win-64::pip-10.0.1-py36_0
pycosat intel/win-64::pycosat-0.6.3-py36_3
pycparser intel/win-64::pycparser-2.18-py36_2
pyopenssl intel/win-64::pyopenssl-17.5.0-py36_2
pyparsing intel/win-64::pyparsing-2.2.0-py36_2
pysocks intel/win-64::pysocks-1.6.7-py36_1
python intel/win-64::python-3.6.8-6
python-dateutil intel/win-64::python-dateutil-2.6.0-py36_12
pytz intel/win-64::pytz-2018.4-py36_3
pyyaml intel/win-64::pyyaml-4.1-py36_3
requests intel/win-64::requests-2.20.1-py36_1
ruamel_yaml intel/win-64::ruamel_yaml-0.11.14-py36_4
scikit-learn intel/win-64::scikit-learn-0.20.2-py36h7b7c402_2
scipy intel/win-64::scipy-1.2.0-py36_3
setuptools intel/win-64::setuptools-39.0.1-py36_0
six intel/win-64::six-1.11.0-py36_3
sqlite intel/win-64::sqlite-3.27.2-vc14_2
tbb intel/win-64::tbb-2019.4-vc14_intel_203
tbb4py intel/win-64::tbb4py-2019.4-py36_intel_0
tcl intel/win-64::tcl-8.6.4-vc14_22
tk intel/win-64::tk-8.6.4-vc14_28
urllib3 intel/win-64::urllib3-1.24.1-py36_2
vc intel/win-64::vc-14.0-2
vs2015_runtime intel/win-64::vs2015_runtime-14.0.25420-intel_2
wheel intel/win-64::wheel-0.31.0-py36_3
win_inet_pton intel/win-64::win_inet_pton-1.0.1-py36_4
wincertstore intel/win-64::wincertstore-0.2-py36_3
xz intel/win-64::xz-5.2.3-vc14_2
zlib intel/win-64::zlib-1.2.11-vc14h21ff451_5

You can see that each package’s description starts with “intel/…”, which signifies that the library is being downloaded from Intel’s distribution channel. Once you confirm the installation, the packages will start downloading and installing.

This step is where the first issue can come up. Sometimes these libraries fail to download and the installation stalls, or the command exits with an SSL error. The issue may even surface later: everything installs fine now, but when you try to add a new library down the line, the prompt throws SSL errors. There is an easy fix, which needs to be applied before creating the Intel virtual environment as described above.

In your shell or command prompt, turn off Anaconda’s default SSL verification via the following command.

conda config --set ssl_verify false

Once SSL verification is turned off, you can repeat step 2 by deleting the previously created environment and starting fresh.
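Note that disabling SSL verification is a workaround; once your environment is set up, you may want to turn verification back on with the corresponding standard conda command:

conda config --set ssl_verify true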

3) Set up TensorFlow

Congratulations! You have now set up Intel’s Python distribution on your PC/laptop. It’s time to enter the ML pipeline.

Intel has provided optimizations for TensorFlow via all the major distribution channels, and it is very smooth to set up. You can read more about it here. Intel’s optimized Math Kernel Library (MKL) speeds up the underlying mathematical operations and provides the required speed-up, so we will install tensorflow-mkl as follows.

conda install tensorflow-mkl

Or with pip, one can set it up as follows.

pip install intel-tensorflow

Voila! TensorFlow is now up and running on your system with the necessary optimizations. And if you are a Keras fan, you can set it up with a simple command:

conda install keras -c intel
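To quickly verify both installs, you can try importing the libraries from the activated environment (activation is covered in step 5). These are illustrative one-liners; the Keras import should also report which backend it is using:

python -c "import tensorflow as tf; print(tf.__version__)"
python -c "import keras; print(keras.__version__)"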

4) Set up Jupyter

Since we created a new virtual environment, it will not come with Spyder or Jupyter notebooks by default. However, these are straightforward to set up. With a single line, we can do wonders.

conda install jupyter -c intel
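Once installed, you can launch the notebook server from the activated environment as usual:

jupyter notebook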

5) Activate the Environment and start Experimenting

Now that we have set up everything, it’s time to get our hands dirty as we start coding and experimenting with various ML and DL approaches on our optimized CPU systems. First, before executing any code, make sure you are using the right environment: you need to activate the virtual environment before you can use the libraries installed in it. You will need to activate it every time you open a new prompt, but it is effortless. Write the following command in your Anaconda prompt, and you’re good to go.

conda activate intel

To make sanity checks on your environment, type the following in the command prompt/shell once the environment is activated.

python

Once you press enter after typing python, the following text should appear in your command prompt. Make sure it says “Intel Corporation” between the pipes and includes the message “Intel(R) Distribution for Python is brought to you by Intel Corporation.” These confirm a correct installation of Intel’s Python distribution.

Python 3.6.8 |Intel Corporation| (default, Feb 27 2019, 19:55:17) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
Intel(R) Distribution for Python is brought to you by Intel Corporation.
Please check out: https://software.intel.com/en-us/python-distribution
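As an additional sanity check, you can confirm that the NumPy in this environment is linked against MKL. np.__config__.show() is a standard NumPy call that prints the build configuration; an MKL-linked build lists MKL entries such as blas_mkl_info:

>>> import numpy as np
>>> np.__config__.show()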

Now you can use the command line to experiment, or write your scripts elsewhere and save them with the .py extension. These files can then be run by navigating to the file’s location via the “cd” command and invoking the script:

(intel) C:\Users\User>python script.py

By following steps 1 to 4, you will have your system ready at the “Intel xyz*” level mentioned in the performance benchmark charts above. These setups are still not thread-optimized for multi-core processors. Below, I discuss how to achieve further optimization for your multi-core CPU.

Multi-Core Optimization

To further optimize your multi-core system, you can add the following lines of code to your .py file, and it will execute the scripts accordingly. Here, NUM_PARALLEL_EXEC_UNITS represents the number of cores you have; I have a quad-core i7, hence the number is 4. On Windows, you can check the core count in your Task Manager by navigating to Task Manager -> Performance -> CPU -> Cores.

import os
import tensorflow as tf
from keras import backend as K

NUM_PARALLEL_EXEC_UNITS = 4  # set this to the number of physical cores on your CPU

# Configure TensorFlow's thread pools (TensorFlow 1.x API)
config = tf.ConfigProto(intra_op_parallelism_threads=NUM_PARALLEL_EXEC_UNITS,
                        inter_op_parallelism_threads=2,
                        allow_soft_placement=True,
                        device_count={'CPU': NUM_PARALLEL_EXEC_UNITS})
session = tf.Session(config=config)
K.set_session(session)

# OpenMP/MKL thread settings
os.environ["OMP_NUM_THREADS"] = "4"
os.environ["KMP_BLOCKTIME"] = "30"
os.environ["KMP_SETTINGS"] = "1"
os.environ["KMP_AFFINITY"] = "granularity=fine,verbose,compact,1,0"
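Broadly, intra_op_parallelism_threads controls how many threads a single operation may use, inter_op_parallelism_threads controls how many operations can run concurrently, KMP_BLOCKTIME sets how long (in milliseconds) an OpenMP thread waits for more work before sleeping, and KMP_AFFINITY pins threads to cores so the scheduler does not shuffle them around. If you are unsure of your core count, a quick cross-check from Python (note that multiprocessing.cpu_count() reports logical processors, which may be double the physical core count on Hyper-Threaded CPUs):

import multiprocessing

# Logical processors; halve this if Hyper-Threading is enabled to get physical cores
print(multiprocessing.cpu_count())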

If you’re not using Keras and prefer core TensorFlow, the script remains almost the same; just remove the following two lines.

from keras import backend as K

K.set_session(session)

After adding these lines to your code, the speed-up should be comparable to the Intel xyz(O) entries in the performance charts above.

If you have a GPU in your system and it conflicts with the current set of libraries or throws a cuDNN error, you can add the following lines near the top of your code (before TensorFlow is imported) to disable the GPU.

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

Conclusion

That’s it. You now have an optimized pipeline for testing and developing machine learning projects and ideas. This opens up a lot of opportunities for students involved in academic research to carry on their work with whatever system they have. The pipeline also removes the privacy worries around the sensitive data a practitioner might be working with.

It is also worth noting that with proper fine-tuning, one can obtain a 3.45x speed-up in their workflow, which means that if you are experimenting with your ideas, you can now work roughly three times as fast as before.
