A deep learning model takes input data and uses a learning algorithm to generalize from it, so that it can make predictions on unseen data. The input data has two components: features and labels. The features are fed into the model, while the labels are what the model learns to predict.
The labels used while training machine learning and deep learning models are called “actual values,” and the labels that the models predict are referred to as “predicted values.”
The underlying goal of any machine learning or deep learning algorithm is to reduce the difference between the actual and predicted values. The function used to measure how close the predicted values are to the actual values is called the “loss function.”
If the value of the loss function is small, the actual values and the model’s predictions are close to each other, and such a model can be expected to classify unseen data reasonably well. The challenge lies in how to minimize the loss function.
This is where optimization techniques come in. An optimization technique tries to minimize the value of the loss function by adjusting attributes of the neural network, such as its weights and biases, often in combination with hyperparameters like the learning rate.
1. Gradient Descent
Gradient descent is the most popular optimization algorithm. It serves as a baseline for many other optimization algorithms.
The main goal of gradient descent is to find a (local) minimum of a differentiable loss function. The learnable parameters, the weights and biases, are what the model adjusts to fit the data.
Gradient descent differentiates the loss function with respect to these learnable parameters, the weights and biases. Their values are then updated repeatedly, step by step, so that the loss function decreases.
In the update rule θ := θ − α · ∂J(θ)/∂θ, the term θ represents a weight or bias, J(θ) is the loss function that needs to be minimized, and α is the learning rate.
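The update rule can be sketched on a toy example. The loss function below, J(θ) = (θ − 3)², is a hypothetical stand-in for a real model’s loss; its gradient is 2(θ − 3) and its minimum sits at θ = 3.

```python
# Minimal gradient descent sketch: minimize J(theta) = (theta - 3)^2.
# The gradient is dJ/dtheta = 2 * (theta - 3); the minimum is at theta = 3.

def gradient_descent(theta=0.0, alpha=0.1, steps=100):
    """Repeatedly apply the update theta := theta - alpha * dJ/dtheta."""
    for _ in range(steps):
        grad = 2 * (theta - 3)      # derivative of the loss at the current theta
        theta = theta - alpha * grad
    return theta

print(gradient_descent())  # converges close to 3.0
```

In a real network, θ is a vector of all weights and biases and the gradient is computed by backpropagation, but the update itself has exactly this form.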
2. Stochastic Gradient Descent
Stochastic gradient descent, also called SGD, is a simple extension of gradient descent. Gradient descent takes all n data points, sums their losses, and differentiates the total with respect to the weights and biases. SGD instead takes a single point at a time and differentiates its loss with respect to the weights and biases.
For this reason, gradient descent requires a large amount of memory per step, since the loss over the entire training set must be computed before a single update. Many gradient steps are needed to trace out a convergence curve and properly evaluate the optimizer’s performance. SGD’s steps are much cheaper and faster; however, because SGD uses a single point at a time and updates the weights and biases after each one, its path toward the minimum oscillates.
Gradient descent, by contrast, takes all n points at a time and updates the weights and biases only after summing up the overall loss. The main disadvantage of both techniques is that they may get stuck in local minima, as no momentum term is present.
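The contrast between the two can be sketched on a toy linear model. The data and the model y = w · x below are hypothetical; the point is only to show one update per full pass versus one update per data point.

```python
import random

# Sketch contrasting full-batch gradient descent with SGD on a toy
# linear model y = w * x with squared loss. Hypothetical data; the
# true weight that generated it is 2.0.
data = [(x, 2.0 * x) for x in range(1, 11)]

def batch_gd(w=0.0, alpha=0.001, epochs=200):
    # One update per pass: gradient summed over ALL data points.
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data)
        w -= alpha * grad
    return w

def sgd(w=0.0, alpha=0.001, steps=200, seed=0):
    # One update PER POINT: each step is cheap but noisy, so the
    # path toward the minimum oscillates.
    rng = random.Random(seed)
    for _ in range(steps):
        x, y = rng.choice(data)
        grad = 2 * (w * x - y) * x
        w -= alpha * grad
    return w

print(batch_gd())  # close to the true weight 2.0
print(sgd())       # also near 2.0, via a noisier path
```

Note that `batch_gd` touches all ten points per update, while `sgd` touches one; on a dataset of millions of points, that difference in memory and per-step cost is what makes SGD practical.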
This is where the concept of the adaptive learning rate comes in. The learning rate is one of the most important hyperparameters for training a machine learning model, as it directly scales every gradient descent step.
Choosing the learning rate well helps with problems like getting stuck in local minima and long convergence times. So, fixing the right learning rate solves these issues. But what is the right learning rate?
The best learning rate is data-dependent, and there is no universal rule for choosing it. A very high learning rate makes the updates overshoot the minimum, so the model may never settle and may not classify data correctly. A very small learning rate, on the other hand, makes convergence so slow that the model may never reach the minimum loss in a practical number of steps.
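Both failure modes can be sketched on the toy loss J(θ) = θ², whose gradient is 2θ and whose minimum is at θ = 0; the three learning rates below are illustrative, not recommendations.

```python
# Sketch of how the learning rate affects convergence on J(theta) = theta^2.
# Gradient: dJ/dtheta = 2 * theta; the minimum is at theta = 0.

def run(alpha, theta=1.0, steps=50):
    for _ in range(steps):
        theta -= alpha * 2 * theta
    return theta

print(run(alpha=0.1))   # small enough: converges toward 0
print(run(alpha=1.1))   # too large: every step overshoots, theta diverges
print(run(alpha=1e-4))  # too small: barely moves after 50 steps
```

Each update multiplies θ by (1 − 2α), so α = 0.1 shrinks θ steadily, α = 1.1 flips the sign and grows it, and α = 1e-4 shrinks it too slowly to matter.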
To resolve this dilemma, the learning rate must be tuned, and adaptive methods can help. Popular adaptive optimization algorithms include AdaGrad, AdaDelta, RMSProp, and Adam.
Among these, Adam is arguably the most popular. By adapting the effective step size for each parameter, it makes finding a workable learning rate easier and decreases the chances of getting stuck in local minima.
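The Adam update can be sketched from scratch on the same kind of toy loss as before. The loss J(θ) = (θ − 3)² is hypothetical; the hyperparameter defaults (β₁ = 0.9, β₂ = 0.999, ε = 1e-8) are the ones commonly used with Adam.

```python
import math

# From-scratch sketch of the Adam update rule on the toy loss
# J(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
def adam(grad_fn, theta=0.0, alpha=0.02, beta1=0.9, beta2=0.999,
         eps=1e-8, steps=2000):
    m, v = 0.0, 0.0                            # first/second moment estimates
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g        # momentum-like running mean
        v = beta2 * v + (1 - beta2) * g * g    # running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction for warm-up
        v_hat = v / (1 - beta2 ** t)
        theta -= alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta

print(adam(lambda th: 2 * (th - 3)))  # settles near theta = 3
```

Dividing by the root of the squared-gradient average rescales each step per parameter, which is what makes the method far less sensitive to the raw learning rate than plain gradient descent.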
The Value of Optimization
Optimization techniques are an integral part of any deep learning algorithm, and several deep learning platforms help with tuning such hyperparameters.
While the approaches covered in this article, gradient descent, stochastic gradient descent, and Adam, are some of the most widely used, more tools, applications, and techniques are sure to arrive in the coming years, furthering deep learning as an exciting field for future technology.