Hyper-parameter Tuning Techniques in Deep Learning

Javaid Nabi
Towards Data Science
Mar 16, 2019


The process of setting the hyper-parameters requires expertise and extensive trial and error. There are no simple and easy ways to set hyper-parameters — specifically, learning rate, batch size, momentum, and weight decay.


Deep learning models are full of hyper-parameters, and finding the best configuration for these parameters in such a high-dimensional space is not a trivial task.

Before discussing ways to find the optimal hyper-parameters, let us first understand them: learning rate, batch size, momentum, and weight decay. These hyper-parameters act as knobs that can be tweaked during the training of the model. For our model to provide the best results, we need to find the optimal values of these hyper-parameters.

Gradient Descent

Gradient descent is an optimization technique commonly used to train machine learning algorithms. The main aim of training is to adjust the weights w so as to minimize the loss or cost. This cost is a measure of how well our model is doing; we represent it by J(w). Thus, by minimizing the cost function we can find the optimal parameters that yield the best model performance [1].

A typical loss function for a regression problem is bowl-shaped: convex, with a single minimum.

In the gradient descent algorithm, we start with random model parameters, calculate the error on each learning iteration, and keep updating the parameters to move closer to the values that result in minimum cost (please refer to my post for details). Gradient descent multiplies the gradient (slope) by a scalar known as the learning rate (or step size) to determine the next point. This parameter tells how far to move the weights in the direction of the gradient.

If we denote by dW and db the gradients of the cost with respect to our parameters W and b, the gradient descent updates with learning rate α are:

W = W − α · dW
b = b − α · db

If the learning rate is small, then training is more reliable, but it will take a lot of time because steps towards the minimum of the loss function are tiny.

If the learning rate is high, training may not converge or may even diverge. Weight changes can be so big that the optimizer overshoots the minimum and makes the loss worse. Thus our aim is to find an optimal learning rate that reaches the minimum loss quickly.
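To make the update rule concrete, here is a minimal sketch of vanilla gradient descent on a toy linear-regression cost (the data, variable names, and values are illustrative, not from the original post):

```python
import numpy as np

# Toy data for J(W, b) = mean((W*x + b - y)^2); the true parameters are (3.0, 1.0).
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + 1.0 + 0.1 * rng.normal(size=100)

W, b = 0.0, 0.0
learning_rate = 0.1  # the step-size hyper-parameter discussed above

for step in range(200):
    error = W * x + b - y
    dW = 2 * np.mean(error * x)  # gradient of the cost w.r.t. W
    db = 2 * np.mean(error)      # gradient of the cost w.r.t. b
    W -= learning_rate * dW      # W = W - alpha * dW
    b -= learning_rate * db      # b = b - alpha * db

print(W, b)  # converges near (3.0, 1.0) for a well-chosen learning rate
```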


More generally, you can think of gradient descent as a ball rolling down a valley: we want it to settle at the deepest point, but it is easy to see how things can go wrong.


Depending on where the ball starts rolling, it may come to rest at the bottom of a valley, but not the lowest one. This is called a local minimum. The way we initialize our model weights may lead it to rest in a local minimum. To reduce that risk, we initialize the weight vectors with values from a random distribution.

Picture the loss surface in 2-D as a contour plot, with a red dot at the center marking the global minimum we want to reach. Using gradient descent, the updates trace a zig-zag path towards that point.

With each iteration of gradient descent, we move towards the optimum with up-and-down oscillations. With a larger learning rate, the vertical oscillations have higher magnitude; these oscillations slow down gradient descent and prevent us from using a much larger learning rate. At the same time, too small a learning rate makes gradient descent slower still.

We want slower learning in the vertical direction and faster learning in the horizontal direction, which will help us reach the global minimum much faster.

To help us achieve that, we use Gradient Descent with Momentum [2].

We start from the plain gradient descent updates above. With momentum, instead of applying dW and db directly at each step, we take exponentially weighted averages of them:

Vᵈʷ = β · Vᵈʷ + (1 − β) · dW
Vᵈᵇ = β · Vᵈᵇ + (1 − β) · db

Here beta ‘β’ is another hyper-parameter, called momentum, ranging from 0 to 1. It sets the weight between the average of the previous values and the current value when calculating the new weighted average.

After calculating the exponentially weighted averages, we update our parameters:

W = W − α · Vᵈʷ
b = b − α · Vᵈᵇ

By using the exponentially weighted averages of dW and db, we average the oscillations in the vertical direction out towards zero, whereas in the horizontal direction all the derivatives point the same way, so the average stays large. This lets the algorithm take a straighter path towards the optimum and damps out the vertical oscillations. For this reason, the algorithm ends up at the optimum in fewer iterations.

For an intuition of how this works, consider the example of a ball rolling down a hill: Vᵈʷ and Vᵈᵇ provide velocity to the ball and make it move faster. We do not want the ball to speed up so much that it overshoots the global minimum, so β acts as friction.
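Here is a minimal sketch of the same toy regression problem with momentum added; V_dW and V_db are the exponentially weighted averages, and the data and values are again illustrative:

```python
import numpy as np

# Same toy regression data as in the earlier sketch.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + 1.0 + 0.1 * rng.normal(size=100)

W, b = 0.0, 0.0
V_dW, V_db = 0.0, 0.0
learning_rate, beta = 0.1, 0.9  # beta is the momentum hyper-parameter

for step in range(200):
    error = W * x + b - y
    dW = 2 * np.mean(error * x)
    db = 2 * np.mean(error)
    V_dW = beta * V_dW + (1 - beta) * dW  # exponentially weighted average of dW
    V_db = beta * V_db + (1 - beta) * db  # exponentially weighted average of db
    W -= learning_rate * V_dW             # update with the smoothed gradient
    b -= learning_rate * V_db
```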

There are three ways of doing gradient descent:

Batch gradient descent:

  • all examples at once: Uses all of the training instances to update the model parameters in each iteration.
  • converges slowly with accurate estimates of the error gradient.

Stochastic Gradient Descent (SGD):

  • one example at a time: Updates the parameters using only a single training instance in each iteration. The training instance is usually selected randomly.
  • converges fast with noisy estimates of the error gradient.

Mini-batch Gradient Descent:

  • ‘b’ examples at a time: Instead of using all examples, mini-batch gradient descent divides the training set into smaller batches of size ‘b’, and a single mini-batch is used to update the model parameters in each iteration.

Mini-batch gradient descent seeks to find a balance between the robustness of stochastic gradient descent and the efficiency of batch gradient descent.

Mini-batch gradient descent is the most common implementation of gradient descent used in the field of deep learning. The downside of mini-batch is that it adds an additional hyper-parameter, the “batch size” ‘b’, to the learning algorithm.
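As a sketch, the mini-batch loop for the toy regression problem might look like the following, assuming a batch size of 32 (all names and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 3.0 * x + 1.0 + 0.1 * rng.normal(size=1000)

W, bias = 0.0, 0.0
learning_rate, batch_size = 0.1, 32  # batch_size is the extra hyper-parameter 'b'

for epoch in range(10):
    order = rng.permutation(len(x))  # reshuffle the training set every epoch
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]  # one mini-batch
        error = W * x[idx] + bias - y[idx]
        W -= learning_rate * 2 * np.mean(error * x[idx])
        bias -= learning_rate * 2 * np.mean(error)
```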

Approaches of searching for the best configuration: Grid Search & Random Search

Grid Search

In grid search [3], we try every possible configuration of the parameters.

Steps:

  • Define a grid on n dimensions, where each dimension maps to a hyper-parameter, e.g. n = (learning_rate, batch_size)
  • For each dimension, define the range of possible values, e.g. batch_size = [4, 8, 16, 32], learning_rate = [0.1, 0.01, 0.0001]
  • Train a model for every possible configuration and wait for the results to establish the best one, e.g. C1 = (0.1, 4) -> acc = 92%, C2 = (0.01, 4) -> acc = 92.3%, etc.

As we can see, the more dimensions we add, the more the search explodes in time complexity. It is common to use this approach only when there are four or fewer dimensions. Though it is guaranteed to find the best configuration within the grid, it is still not preferable. Instead, it is better to use Random Search.
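For reference, the exhaustive loop just described can be sketched with itertools.product; train_and_evaluate is a hypothetical stand-in for your training routine that returns a validation accuracy:

```python
from itertools import product

learning_rates = [0.1, 0.01, 0.0001]
batch_sizes = [4, 8, 16, 32]

def train_and_evaluate(lr, bs):
    # Hypothetical placeholder: train a model with (lr, bs), return accuracy.
    return 0.9 - abs(lr - 0.01) - 0.001 * bs

best_config, best_acc = None, -1.0
for lr, bs in product(learning_rates, batch_sizes):  # every combination: 3 x 4 = 12 runs
    acc = train_and_evaluate(lr, bs)
    if acc > best_acc:
        best_config, best_acc = (lr, bs), acc

print(best_config, best_acc)
```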

Random Search

Random Search [4] changes step 1: instead of enumerating a grid, it picks each point randomly from the configuration space. The intuition for why it works better is that random search explores the hyper-parameter space more widely, especially along the more important variables, which helps us find a good configuration in fewer iterations. For example, compare a grid layout with a random layout of nine trials:

In the grid layout, even though we train 9 models (n = 3), we use only 3 distinct values per variable. With the random layout, it is extremely unlikely that we will select the same value more than once, so we end up training 9 models with 9 different values for each variable. For a detailed analysis of grid versus random search, please refer to the Bergstra and Bengio paper [4].
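In code, the only change from the grid-search sketch is how configurations are drawn: each trial samples the hyper-parameters at random (a log-uniform distribution is a common choice for the learning rate). train_and_evaluate is again a hypothetical stand-in:

```python
import random

def train_and_evaluate(lr, bs):
    # Hypothetical placeholder: train a model with (lr, bs), return accuracy.
    return 0.9 - abs(lr - 0.01) - 0.001 * bs

random.seed(0)
best_config, best_acc = None, -1.0
for trial in range(9):                 # same budget as the 9-model grid above
    lr = 10 ** random.uniform(-4, -1)  # log-uniform sample in [1e-4, 1e-1]
    bs = random.choice([4, 8, 16, 32])
    acc = train_and_evaluate(lr, bs)
    if acc > best_acc:
        best_config, best_acc = (lr, bs), acc

print(best_config, best_acc)
```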

Even though random search performs better than grid search, both approaches are still computationally expensive and time consuming. In 2018, Leslie N. Smith published a detailed report on various approaches to identifying optimal hyper-parameters in his classic paper [5]. We will quickly go through Smith's approach, which is based on finding the balance between underfitting and overfitting by examining the test/validation loss during training for clues, in order to strive for the optimal set of hyper-parameters.

The hyper-parameter tuning process is a tightrope walk to achieve a balance between underfitting and overfitting.

Underfitting is when the machine learning model is unable to reduce the error for either the test or training set. An underfitting model is not powerful enough to fit the underlying complexities of the data distributions.

Overfitting happens when the machine learning model is so powerful that it fits the training set too well, and the generalization error increases.

Approach

  1. Observe and understand the clues available during training by monitoring the validation/test loss early in training; tune your architecture and hyper-parameters with short runs of a few epochs.
  2. Signs of underfitting or overfitting of the test or validation loss early in the training process are useful for tuning the hyper-parameters.

Model complexity refers to the capacity of the machine learning model; the optimal capacity falls between underfitting and overfitting.

Finding Optimal Hyper-parameters

Learning Rate (LR)

If the learning rate (LR) is too small, overfitting can occur. Large learning rates help to regularize the training, but if the learning rate is too large, training will diverge. A grid search of short runs can find learning rates that converge or diverge, but there is an easier approach: “cyclical learning rates” (CLR), also due to Leslie N. Smith [6].

Leslie's experiments show that varying the learning rate during training is beneficial overall, so he proposes changing it cyclically within a band of values instead of setting it to a fixed value. The essence of this learning rate policy comes from the observation that increasing the learning rate might have a short-term negative effect and yet achieve a longer-term beneficial one. This observation leads to the idea of letting the learning rate vary within a range of values rather than adopting a stepwise fixed or exponentially decreasing value. That is, one sets minimum and maximum boundaries, and the learning rate cyclically varies between these bounds.
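As a concrete sketch, the simplest “triangular” form of this policy from the CLR paper [6] can be computed per training iteration; base_lr, max_lr, and step_size here are illustrative values you would choose yourself:

```python
import numpy as np

def triangular_lr(iteration, base_lr=0.001, max_lr=0.006, step_size=2000):
    # One cycle spans 2 * step_size iterations: LR ramps linearly up to
    # max_lr, then linearly back down to base_lr.
    cycle = np.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x)

for it in [0, 1000, 2000, 3000, 4000]:  # first full cycle
    print(it, triangular_lr(it))        # 0.001 -> 0.006 -> 0.001
```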


How can one estimate reasonable minimum and maximum boundary values?

LR range test: run your model for several epochs while letting the learning rate increase linearly between low and high LR values. This test is enormously valuable whenever you are facing a new architecture or dataset. For a shallow 3-layer architecture, a large learning rate might be 0.01, while for a ResNet it can be as large as 3.0; you might try more than one maximum.

From my previous post: using the fast.ai library to run an LR range test.
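Library details aside, the logic of the range test can be sketched as follows; one_training_step is a hypothetical stand-in for a single mini-batch update that returns the training loss at a given learning rate:

```python
import numpy as np

def one_training_step(lr):
    # Hypothetical placeholder: in practice this runs one mini-batch update
    # at learning rate lr and returns the resulting training loss.
    return (lr - 0.05) ** 2 + 0.001 * np.random.rand()

lrs = np.linspace(1e-5, 1.0, num=500)  # linearly increasing learning rates
losses = [one_training_step(lr) for lr in lrs]

# Pick the maximum LR just before the loss starts climbing; a common
# heuristic is a value slightly below the loss minimum on this curve.
best = int(np.argmin(losses))
print("loss bottoms out near lr =", lrs[best])
```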

Using the 1-cycle LR policy with a maximum learning rate determined from an LR range test, a minimum learning rate as a tenth of the maximum appears to work well [6].

Batch Size

Unlike the learning rate, whose value doesn't affect the computation time per step, the batch size must be examined in conjunction with the execution time of training. The batch size is limited by your hardware's memory, while the learning rate is not. Leslie recommends using a batch size that fits in your hardware's memory, which enables using larger learning rates.

If your server has multiple GPUs, the total batch size is the batch size on a GPU multiplied by the number of GPUs. If the architecture is small or your hardware permits very large batch sizes, then you might compare performance of different batch sizes. In addition, recall that small batch sizes add regularization while large batch sizes add less, so utilize this while balancing the proper amount of regularization. It is often better to use a larger batch size so a larger learning rate can be used.

Cyclical Momentum

Momentum and learning rate are closely related: the optimal learning rate depends on the momentum, and the momentum depends on the learning rate. Since the learning rate is regarded as the most important hyper-parameter to tune, momentum matters as well. As with learning rates, it is valuable to set momentum as large as possible without causing instabilities during training.

Procedure for finding the Learning Rate & Momentum combination

  • Using Cyclical Learning Rate: The optimal training procedure is a combination of an increasing cyclical learning rate, where an initial small learning rate permits convergence to begin, and a decreasing cyclical momentum, where the decreasing momentum allows the learning rate to become larger in the early to middle parts of training. Using a decreasing cyclical momentum when the learning rate increases provides faster initial convergence and stabilizes the training enough to allow larger learning rates.

Cyclical momentum is useful for starting with a large momentum and decreasing momentum while the learning rate is increasing because it improves the test accuracy and makes the training more robust to large learning rates.

The plot below, from my post, shows how the learning rate and momentum typically change during one cycle (one epoch) of training.

Left: learning rate for one cycle. Right: momentum for one cycle.
  • Using Constant Learning Rate: If one is using a cyclical learning rate, a cyclical momentum in the opposite direction makes sense, but what is the best momentum when the learning rate is constant? Here cyclical momentum is not better than a good constant value. If a constant learning rate is used, then a large constant momentum (i.e., 0.9–0.99) will act like a pseudo-increasing learning rate and will speed up training. However, too large a momentum causes poor training results that are visible early in training, and this can be quickly tested.

With either a cyclical or a constant learning rate, a good procedure is to test momentum values in the range 0.9 to 0.99 and choose the value that performs best.
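Putting the two schedules together, here is a minimal sketch of a linear 1-cycle policy using the ranges suggested above (real implementations may use other shapes, e.g. cosine annealing; all values are illustrative):

```python
def one_cycle(iteration, total_iters, max_lr=0.1, max_mom=0.95, min_mom=0.85):
    # First half: LR rises from max_lr/10 to max_lr while momentum falls.
    # Second half: both move back in the opposite directions.
    half = total_iters / 2
    if iteration <= half:
        frac = iteration / half
    else:
        frac = (total_iters - iteration) / half
    lr = max_lr / 10 + (max_lr - max_lr / 10) * frac
    momentum = max_mom - (max_mom - min_mom) * frac
    return lr, momentum

for it in [0, 250, 500, 750, 1000]:
    print(it, one_cycle(it, 1000))  # LR: 0.01 -> 0.1 -> 0.01; momentum: 0.95 -> 0.85 -> 0.95
```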

Weight Decay

Weight decay is one form of regularization, and it plays an important role in training, so its value needs to be set properly [7]. Weight decay is applied by shrinking each weight a little at every gradient descent update: each weight is multiplied by a factor slightly less than 1, controlled by a coefficient λ (0 < λ < 1).

Leslie's experiments show that weight decay behaves differently from learning rate and momentum: the best value should remain constant throughout training (i.e., cyclical weight decay is not useful).

If you have no idea of a reasonable value for weight decay, test 10⁻³, 10⁻⁴, 10⁻⁵, and 0. Smaller datasets and architectures seem to require larger values for weight decay, while larger datasets and deeper architectures seem to require smaller values. Our hypothesis is that complex data provides its own regularization, so other regularization should be reduced.

The optimal weight decay is different if you search with a constant learning rate versus using a learning rate range. This aligns with our intuition, because larger learning rates provide regularization, so a smaller weight decay value is optimal.
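As a sketch, weight decay folded into a plain gradient descent update looks like this; grad is a stand-in gradient and wd plays the role of the coefficient λ discussed above:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=5)
learning_rate, wd = 0.1, 1e-4  # try wd in {1e-3, 1e-4, 1e-5, 0} as suggested above

for step in range(100):
    grad = 2 * w  # stand-in gradient, e.g. of a simple quadratic loss
    w -= learning_rate * (grad + wd * w)  # gradient step plus shrinkage toward zero
```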

Summary of Key Findings

Learning rate (LR):

  • Perform a learning rate range test to identify a “large” learning rate.
  • Using the 1-cycle LR policy with a maximum learning rate determined from an LR range test, set a minimum learning rate as a tenth of the maximum.

Momentum:

  • Test with short runs of momentum values 0.99, 0.97, 0.95, and 0.9 to get the best value for momentum.
  • If using the 1-cycle learning rate schedule, it is better to use a cyclical momentum (CM) that starts at this maximum momentum value and decreases with increasing learning rate to a value of 0.8 or 0.85.

Batch Size:

  • Use as large a batch size as fits in your hardware's memory, then compare the performance of different batch sizes.
  • Small batch sizes add regularization while large batch sizes add less, so utilize this while balancing the proper amount of regularization.
  • It is often better to use a larger batch size so a larger learning rate can be used.

Weight decay:

  • Use a grid search to determine the proper magnitude; weight decay usually does not require more than one significant figure of accuracy.
  • A more complex dataset requires less regularization, so test smaller weight decay values, such as 10⁻⁴, 10⁻⁵, 10⁻⁶, and 0.
  • A shallow architecture requires more regularization, so test larger weight decay values, such as 10⁻², 10⁻³, 10⁻⁴.

Thank you for reading.

References:

[1] https://www.jeremyjordan.me/gradient-descent/

[2] https://engmrk.com/gradient-descent-with-momentum/

[3] https://blog.floydhub.com/guide-to-hyperparameters-search-for-deep-learning-models/

[4] http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf

[5] https://arxiv.org/pdf/1803.09820.pdf

[6] https://arxiv.org/pdf/1506.01186.pdf

[7] https://papers.nips.cc/paper/563-a-simple-weight-decay-can-improve-generalization.pdf

