Optimization Algorithms in Deep Learning

Understanding Gradient Descent, Momentum, and ADAM

1. Gradient Descent

Gradient Descent is the foundational optimization algorithm for minimizing a cost function (or loss function) \(J(\theta)\): it iteratively moves the parameters in the direction of steepest descent, i.e., along the negative gradient. The learning rate \(\alpha\) determines the size of each step.

Imagine you are on a mountain and want to reach the lowest point. Gradient descent is like taking small steps downhill in the direction where the slope is steepest, with the learning rate as your step size. If it is too small, you will take a very long time to reach the bottom; if it is too large, you may overshoot the minimum and bounce back and forth, or even diverge.
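
To make the "too small / too large" trade-off concrete, here is a minimal sketch (separate from the main example below) that minimizes the one-dimensional function \(J(\theta) = \theta^2\), whose gradient is \(2\theta\), with three step sizes:

import numpy as np

# Minimal 1-D illustration: minimize J(theta) = theta**2 (gradient = 2*theta).
# The minimum is at theta = 0; watch how the step size changes the outcome.
for lr in (0.01, 0.1, 1.1):  # small, moderate, and too-large learning rates
    theta = 5.0
    for _ in range(50):
        theta -= lr * 2 * theta  # one gradient descent step
    print(f"lr={lr}: theta after 50 steps = {theta:.6f}")
# lr=0.01 creeps slowly toward 0, lr=0.1 converges, lr=1.1 diverges.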

Mathematical Formulation:

$$ \theta_{t+1} = \theta_t - \alpha \nabla J(\theta_t) $$

Where:

  - \(\theta_t\): the parameter vector at iteration \(t\)
  - \(\alpha\): the learning rate (step size)
  - \(\nabla J(\theta_t)\): the gradient of the cost function with respect to the parameters

Python/NumPy Snippet:


import numpy as np

# Example cost function (e.g., for linear regression)
def cost_function(X, y, theta):
    m = len(y)
    predictions = X.dot(theta)
    cost = (1/(2*m)) * np.sum(np.square(predictions - y))
    return cost

# Example gradient calculation (for linear regression)
def gradient(X, y, theta):
    m = len(y)
    predictions = X.dot(theta)
    grad = (1/m) * X.T.dot(predictions - y)
    return grad

def gradient_descent(X, y, theta_initial, learning_rate, n_iterations):
    """
    Performs gradient descent to learn theta.
    
    Args:
        X (np.array): Feature matrix (add intercept term if needed).
        y (np.array): Target vector.
        theta_initial (np.array): Initial parameter values.
        learning_rate (float): Step size for each iteration.
        n_iterations (int): Number of iterations to perform.
        
    Returns:
        theta (np.array): Learned parameters.
        cost_history (list): Cost at each iteration.
    """
    theta = theta_initial.copy() # Ensure we don't modify the original
    cost_history = []

    for i in range(n_iterations):
        grad = gradient(X, y, theta)
        theta = theta - learning_rate * grad
        current_cost = cost_function(X, y, theta)
        cost_history.append(current_cost)
        
        if i % 100 == 0: # Print cost every 100 iterations
            print(f"Iteration {i}: Cost = {current_cost}")
            
    return theta, cost_history

# Example Usage (assuming X_train, y_train are defined)
# X_train_b = np.c_[np.ones((len(X_train), 1)), X_train] # Add intercept term
# initial_theta = np.random.randn(X_train_b.shape[1], 1) 
# learning_rate = 0.01
# iterations = 1000
# theta_final, costs = gradient_descent(X_train_b, y_train, initial_theta, learning_rate, iterations)
# print("Final theta:", theta_final)
            

Note: The example usage is commented out. You'd need to define X_train and y_train with your actual data to run it.
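
If you want a runnable end-to-end example, here is a sketch that generates synthetic linear-regression data (so theta should come out near [4, 3]); the variable names match the placeholders above:

import numpy as np

# Synthetic data: y = 4 + 3x + Gaussian noise.
np.random.seed(0)
X_train = 2 * np.random.rand(100, 1)
y_train = 4 + 3 * X_train + np.random.randn(100, 1)

X_train_b = np.c_[np.ones((len(X_train), 1)), X_train]  # add intercept term
initial_theta = np.random.randn(X_train_b.shape[1], 1)

theta_final, costs = gradient_descent(X_train_b, y_train, initial_theta,
                                      learning_rate=0.1, n_iterations=1000)
print("Final theta:", theta_final.ravel())  # expect roughly [4, 3]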

2. Gradient Descent with Momentum

Gradient Descent with Momentum helps accelerate (stochastic) gradient descent in the relevant direction and dampens oscillations. It does this by adding a fraction \(\beta\) (the momentum coefficient) of the previous update vector to the current one, mimicking momentum in physics: a ball rolling down a hill gathers speed and keeps moving in the same direction.

This is particularly useful when the cost surface has ravines (steep in some directions, shallow in others) or noisy gradients. By smoothing the updates, momentum can carry the optimizer past shallow local minima and keeps it from slowing down excessively in flat regions.

Mathematical Formulation:

The update rule involves a "velocity" term \(v_t\):

$$ v_t = \beta v_{t-1} + (1-\beta) \nabla J(\theta_t) $$

$$ \theta_{t+1} = \theta_t - \alpha v_t $$

(Here the gradient is weighted by \(1-\beta\), making \(v_t\) an exponentially weighted moving average of the gradients; in some presentations this factor is absorbed into the learning rate.)

A more common formulation directly uses the learning rate \(\alpha\) with the gradient term:

$$ v_t = \beta v_{t-1} + \alpha \nabla J(\theta_t) $$

$$ \theta_{t+1} = \theta_t - v_t $$

The two forms differ only by a rescaling of the velocity (and hence of the effective learning rate); the snippet below implements the second.

Where:

  - \(v_t\): the velocity (accumulated update) at iteration \(t\), initialized to zero
  - \(\beta\): the momentum coefficient, typically around 0.9
  - \(\alpha\): the learning rate

Python/NumPy Snippet:


import numpy as np

# Assuming cost_function and gradient functions are defined as in Gradient Descent section

def gradient_descent_with_momentum(X, y, theta_initial, learning_rate, n_iterations, beta):
    """Gradient descent with momentum; beta is the momentum coefficient (e.g., 0.9)."""
    theta = theta_initial.copy()
    cost_history = []
    v = np.zeros_like(theta)  # velocity, initialized to zero

    for i in range(n_iterations):
        grad = gradient(X, y, theta)
        v = beta * v + learning_rate * grad  # accumulate velocity
        theta = theta - v
        current_cost = cost_function(X, y, theta)
        cost_history.append(current_cost)
        if i % 100 == 0:
            print(f"Iteration {i}: Cost = {current_cost}")
    return theta, cost_history
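
Using the same synthetic data and initial parameters as in the gradient descent example (assumed to be defined), a quick run might look like this; beta = 0.9 is a common default:

# Assumes X_train_b, y_train, initial_theta from the earlier example.
theta_m, costs_m = gradient_descent_with_momentum(
    X_train_b, y_train, initial_theta,
    learning_rate=0.1, n_iterations=1000, beta=0.9)
print("Final theta (momentum):", theta_m.ravel())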
            

3. ADAM (Adaptive Moment Estimation)

ADAM is an optimization algorithm that combines the ideas of Momentum and RMSprop (Root Mean Square Propagation). It computes adaptive learning rates for each parameter. In addition to storing an exponentially decaying average of past squared gradients like RMSprop, ADAM also keeps an exponentially decaying average of past gradients, similar to momentum.

It is computationally efficient, has modest memory requirements, is invariant to diagonal rescaling of gradients, and is well suited to problems that are large in terms of data and/or parameters. It is often the default optimizer for deep learning applications.
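
For reference, the RMSprop-style update that ADAM's second moment borrows from looks like this (a sketch of RMSprop itself, not ADAM; here \(\beta_2\) and \(\epsilon\) play the same roles as in ADAM below):

$$ s_t = \beta_2 s_{t-1} + (1-\beta_2) g_t^2 \qquad \theta_{t+1} = \theta_t - \alpha \frac{g_t}{\sqrt{s_t} + \epsilon} $$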

Mathematical Formulation:

ADAM involves the following steps at each iteration \(t\) (for \(t=1, 2, \dots\)):

  1. Compute gradient: \(g_t = \nabla J(\theta_{t-1})\)
  2. Update biased first moment estimate (momentum):
    $$ m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t $$
  3. Update biased second raw moment estimate (like RMSprop):
    $$ v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2 $$
    (Note: \(g_t^2\) is element-wise square)
  4. Compute bias-corrected first moment estimate:
    $$ \hat{m}_t = \frac{m_t}{1 - \beta_1^t} $$
  5. Compute bias-corrected second raw moment estimate:
    $$ \hat{v}_t = \frac{v_t}{1 - \beta_2^t} $$
  6. Update parameters:
    $$ \theta_t = \theta_{t-1} - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} $$

Where:

  - \(g_t\): the gradient at iteration \(t\)
  - \(m_t, v_t\): exponentially decaying averages of the gradient and the squared gradient, both initialized to zero
  - \(\beta_1, \beta_2\): decay rates for the moment estimates (common defaults: 0.9 and 0.999)
  - \(\epsilon\): a small constant for numerical stability (e.g., \(10^{-8}\))
  - \(\alpha\): the learning rate
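
To see why the bias correction in steps 4 and 5 matters, consider the first iteration with \(m_0 = 0\) and, say, \(\beta_1 = 0.9\): the raw estimate \(m_1 = (1-\beta_1) g_1 = 0.1\, g_1\) underestimates the gradient tenfold, while the corrected estimate recovers it exactly:

$$ \hat{m}_1 = \frac{m_1}{1 - \beta_1^1} = \frac{(1-\beta_1) g_1}{1-\beta_1} = g_1 $$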

Python/NumPy Snippet:


import numpy as np

# Assuming cost_function and gradient functions are defined as in Gradient Descent section

def adam_optimizer(X, y, theta_initial, learning_rate, n_iterations, beta1, beta2, epsilon):
    """ADAM optimizer; common defaults are beta1=0.9, beta2=0.999, epsilon=1e-8."""
    theta = theta_initial.copy()
    cost_history = []
    m_t = np.zeros_like(theta)  # first moment estimate (mean of gradients)
    v_t = np.zeros_like(theta)  # second moment estimate (mean of squared gradients)
    t = 0                       # timestep counter for bias correction

    for i in range(n_iterations):
        t += 1
        grad = gradient(X, y, theta)
        m_t = beta1 * m_t + (1 - beta1) * grad       # update first moment
        v_t = beta2 * v_t + (1 - beta2) * (grad**2)  # update second moment
        m_hat = m_t / (1 - beta1**t)                 # bias-corrected first moment
        v_hat = v_t / (1 - beta2**t)                 # bias-corrected second moment
        theta = theta - learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)
        current_cost = cost_function(X, y, theta)
        cost_history.append(current_cost)
        if i % 100 == 0:
            print(f"Iteration {i} (t={t}): Cost = {current_cost}")
    return theta, cost_history
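
As before, a quick run on the synthetic data from the gradient descent example (assumed to be defined) might look like this, using the standard default hyperparameters from the ADAM paper:

# Assumes X_train_b, y_train, initial_theta from the earlier example.
theta_adam, costs_adam = adam_optimizer(
    X_train_b, y_train, initial_theta,
    learning_rate=0.01, n_iterations=1000,
    beta1=0.9, beta2=0.999, epsilon=1e-8)
print("Final theta (ADAM):", theta_adam.ravel())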