A convolutional neural network (CNN) is a type of deep neural network that has had great success in image classification problems. It is primarily used in object recognition: it takes images as input and classifies them into categories. A major advantage of CNNs is that they learn the filters that, in traditional algorithms, had to be hand-engineered, so they require less human effort.

To understand the architecture of a CNN, we must first know about the input it takes. So, what is an image?

An image is a collection of a large number of small squares called pixels. Pixels are the building blocks of an image and determine its color, contrast, brightness, and sharpness. For a black and white image, we can take a pixel with value “1” for black and a pixel with value “0” for white, so an image with these two types of pixel looks like P(1). In a grayscale image, there are not just two values (0 and 1) but a whole range from 0 to 255, as shown in P(2).

But when we talk about colored images, we basically have 3 layers of red, green, and blue, and in each layer a pixel takes a value in the range 0-255, where “0” means none of that base color and “255” means its full intensity. These three layers together form a colored image, in which each pixel can be defined in RGB terms. Three layers with pixels ranging from 0-255 are shown below:
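As a minimal sketch of this representation (using NumPy; the tiny 2×2 size is purely illustrative), a colored image is just a 3D array whose last dimension holds the three color layers:

```python
import numpy as np

# A tiny 2x2 colored image: each pixel holds three channel
# intensities (red, green, blue), each in the range 0-255.
image = np.array([
    [[255,   0,   0], [  0, 255,   0]],   # red pixel, green pixel
    [[  0,   0, 255], [255, 255, 255]],   # blue pixel, white pixel
], dtype=np.uint8)

print(image.shape)   # (2, 2, 3): height, width, 3 color layers
print(image[0, 0])   # [255   0   0] -> the red pixel
```

A grayscale image would simply drop the last dimension and keep a single 0-255 value per pixel.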

The Architecture

There are basically 3 components in a basic convolutional network:

  1. The Convolutional layer
  2. The Pooling layer
  3. The output layer

The Convolutional layer

In this layer, we take the input image and convert it from pixel form into matrix form. Let us assume that the input image is black and white and has 25 pixels (5×5); then there will be a 5×5 input matrix with each element’s value “0” or “1”. Now take a 3×3 matrix and slide that 3×3 window around the input matrix.

In the gif, the green 5×5 matrix is the input matrix, and the yellow 3×3 matrix floating on it is called the kernel (or filter). Each time we place the kernel over a portion of the input matrix and take the scalar product of that portion with the kernel, we get one new value. We perform the scalar product a total of 9 times, producing a 3×3 matrix called the convolutional matrix.
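The sliding-window scalar product can be sketched in a few lines of NumPy (the particular 0/1 input matrix and kernel below are illustrative, chosen to match the 5×5 and 3×3 sizes in the example):

```python
import numpy as np

def convolve2d(inp, kernel):
    """Slide `kernel` over `inp` (stride 1, no padding) and take the
    scalar product (element-wise product-sum) at every position."""
    kh, kw = kernel.shape
    oh = inp.shape[0] - kh + 1
    ow = inp.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(inp[i:i+kh, j:j+kw] * kernel)
    return out

# 5x5 black-and-white input matrix (values 0 or 1) and a 3x3 kernel.
inp = np.array([[1, 1, 1, 0, 0],
                [0, 1, 1, 1, 0],
                [0, 0, 1, 1, 1],
                [0, 0, 1, 1, 0],
                [0, 1, 1, 0, 0]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])
print(convolve2d(inp, kernel))   # 3x3 convolutional matrix, 9 scalar products
```

For instance, the top-left output value is the product-sum of the kernel with the top-left 3×3 corner of the input, which here works out to 4.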
There is also a concept of stride and padding in this method. Both padding and stride affect the size of the output. Whether we need to preserve the data size usually depends on the task and is part of the network design/architecture; without padding, the data size shrinks with every layer.
In the above example, the kernel moved across the entire image one pixel at a time; if the weight matrix moves 1 pixel at a time, we call it a stride of 1. Similarly, with a stride of 2, the kernel moves across the image two pixels at a time. The size of the output (the convolutional matrix) keeps reducing as we increase the stride, so padding the input image with zeros around its border can solve this problem. We can also add more than one layer of zeros around the image in case of higher stride values or larger kernels.
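The effect of stride and padding on the output size follows the standard formula (n + 2p − k)/s + 1, which we can sketch and check against the 5×5 example above:

```python
def conv_output_size(n, k, stride=1, padding=0):
    """Spatial size of the convolution output for an n x n input,
    a k x k kernel, a given stride, and zero-padding on each side."""
    return (n + 2 * padding - k) // stride + 1

print(conv_output_size(5, 3))             # stride 1, no padding -> 3
print(conv_output_size(5, 3, stride=2))   # stride 2 shrinks it  -> 2
print(conv_output_size(5, 3, padding=1))  # padding keeps it     -> 5
```

With one layer of zero-padding, the 5×5 input stays 5×5 after convolution, which is why padding is used when the data size must be preserved.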

Now the next step in this layer is activation. It works here exactly as in other neural networks: each value is passed through a function that squashes it into a range. Some commonly used activations are identity, binary step, logistic (sigmoid), tanh, and ReLU.
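A quick sketch of how these element-wise squashing functions behave on a few sample values (the input vector is arbitrary):

```python
import numpy as np

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])

# A few common activations, applied element-wise.
logistic = 1 / (1 + np.exp(-x))    # squashes into (0, 1)
tanh     = np.tanh(x)              # squashes into (-1, 1)
relu     = np.maximum(0, x)        # clips negatives to 0
binary   = (x >= 0).astype(float)  # binary step: 0 or 1

print(relu)   # [0.  0.  0.  0.5 2. ]
```

In modern CNNs, ReLU is by far the most common choice after a convolutional layer, as the pipeline later in this post shows.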

The pooling layer

This layer is optional. Pooling works very much like convolution: we take a kernel and move it over the image. The only difference is that the function applied to the image window under the kernel isn’t a linear scalar product.

Max pooling and Average pooling are the most common pooling functions. Max pooling takes the largest value from the window of the image currently covered by the kernel, while average pooling takes the average of all values in the window.
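Both pooling functions can be sketched with the same sliding-window loop as convolution, swapping the scalar product for max or mean (the 4×4 input values are illustrative):

```python
import numpy as np

def pool2d(inp, size=2, stride=2, fn=np.max):
    """Slide a size x size window over `inp` and apply `fn`
    (np.max for max pooling, np.mean for average pooling)."""
    oh = (inp.shape[0] - size) // stride + 1
    ow = (inp.shape[1] - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            r, c = i * stride, j * stride
            out[i, j] = fn(inp[r:r+size, c:c+size])
    return out

inp = np.array([[1, 3, 2, 4],
                [5, 6, 7, 8],
                [3, 2, 1, 0],
                [1, 2, 3, 4]])
print(pool2d(inp))              # max pooling -> [[6, 8], [3, 4]]
print(pool2d(inp, fn=np.mean))  # average pooling -> [[3.75, 5.25], [2, 2]]
```

Note how a 2×2 pool with stride 2 halves each spatial dimension, which is how pooling reduces the number of parameters flowing into later layers.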

The Output Layer

After multiple layers of convolution and pooling, we need the output in the form of a class. The convolution and pooling layers only extract features and reduce the number of parameters from the original images. To generate the final output, we apply a fully connected layer whose output size equals the number of classes we need; it is tough to reach that number with just the convolution layers, since they generate 3D activation maps while we just need to know whether or not an image belongs to a particular class. The output layer has a loss function, such as categorical cross-entropy, to compute the error in prediction. Once the forward pass is complete, backpropagation begins to update the weights and biases to reduce the error and loss.
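A minimal sketch of this final step, assuming the conv/pool layers produced a 3D activation map of shape (4, 4, 8) and a 10-class task (both sizes, and the random weights, are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend output of the conv/pool layers: a 3D activation map.
activation_map = rng.standard_normal((4, 4, 8))

flat = activation_map.reshape(-1)              # flatten to a 128-value vector
W = rng.standard_normal((10, flat.size)) * 0.01
b = np.zeros(10)

logits = W @ flat + b                          # fully connected layer, one logit per class
probs = np.exp(logits) / np.exp(logits).sum()  # softmax -> class probabilities

target = 3                                     # (assumed) true class index
loss = -np.log(probs[target])                  # categorical cross-entropy
print(probs.shape, float(loss))
```

The flatten step is exactly the bridge described above: it turns the 3D activation map into a vector the fully connected layer can map to one score per class.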

The whole working

So first we have an input image, which goes to a ConvLayer and then ReLU to extract relevant features to pass further; then pooling layers are added to further reduce the number of parameters; this repeats (ConvLayer → ReLU → MaxPooling); and finally the output is generated through the output layer and compared to the true label to measure the error. A loss function, such as categorical cross-entropy, is defined in the fully connected output layer to compute the loss. The gradient of the error is then calculated, and the error is backpropagated to update the filter and bias values.
One training cycle is completed in a single forward and backward pass.
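The whole forward pass can be sketched end to end (a toy 8×8 grayscale input, one random 3×3 filter, and a 2-class output layer; all sizes and values are illustrative, and the backward pass is only indicated in comments):

```python
import numpy as np

def conv(inp, k):
    """Stride-1, no-padding convolution: scalar product at each position."""
    kh, kw = k.shape
    oh, ow = inp.shape[0] - kh + 1, inp.shape[1] - kw + 1
    return np.array([[np.sum(inp[i:i+kh, j:j+kw] * k)
                      for j in range(ow)] for i in range(oh)])

def maxpool(inp, s=2):
    """Non-overlapping s x s max pooling."""
    o = inp.shape[0] // s
    return np.array([[inp[i*s:(i+1)*s, j*s:(j+1)*s].max()
                      for j in range(o)] for i in range(o)])

rng = np.random.default_rng(1)
image  = rng.random((8, 8))           # toy 8x8 grayscale input
kernel = rng.standard_normal((3, 3))  # one learnable filter

x = conv(image, kernel)               # ConvLayer  -> 6x6
x = np.maximum(0, x)                  # ReLU
x = maxpool(x)                        # MaxPooling -> 3x3
x = x.reshape(-1)                     # flatten for the output layer

W = rng.standard_normal((2, x.size)) * 0.1  # 2-class toy output layer
p = np.exp(W @ x); p /= p.sum()             # softmax probabilities
loss = -np.log(p[0])                        # loss vs. (assumed) true class 0
# In training, this loss would now be backpropagated to update the
# kernel, W, and any biases, completing one training cycle.
print(p, float(loss))
```

Repeating this forward-and-backward loop over many images is what gradually turns the random filter into a feature detector.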

Summary

We learned about the architecture of CNN. We learned how a computer looks at an image, then we learned about the convolutional matrix. We tried to understand the convolutional, pooling, and output layers of a CNN. Read my follow-up post Handwritten Digit Recognition with CNN.
