    Optimization: Gradient Descent

    Gradient descent is one of the most widely used algorithms for optimizing a function. Optimizing a function means finding the parameter values that give us the best possible outcome.

    Gradient descent has broad applications, but in this text we will focus on its use in Machine Learning to minimize the loss of a linear model.

    Cost function

    Before optimizing anything, we first need a way to measure performance. This is where the loss function (also called the cost function) comes in. In simple terms, the loss function tells us how far off our model’s predictions are from the actual values. It’s a crucial guide for improving our model.

    Some of the most commonly used loss functions include Mean Squared Error (MSE) for regression tasks and cross-entropy for classification problems.

    Mean Squared Error

    The Mean Squared Error (MSE) is the average of the squared differences between the observed (actual) values and the values predicted by the model. By squaring the differences, we ensure that negative errors don’t cancel out positive ones. You might wonder why we don’t simply take the absolute differences instead. The reason is that squaring the errors penalizes larger mistakes more heavily than smaller ones, which can be useful in emphasizing and correcting significant prediction errors. Mean Squared Error is represented by the following equation:

        \[\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\]

    Where \hat{y}_i is the predicted value and y_i is the observed value.
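
    As a quick sketch, here is how we might compute MSE in plain Python (the function name and sample values are our own, just for illustration):

        def mse(y_true, y_pred):
            """Mean Squared Error: the average of the squared differences."""
            n = len(y_true)
            return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / n

        # Two predictions, each off by 0.5: MSE = (0.25 + 0.25) / 2 = 0.25
        print(mse([3.0, 5.0], [2.5, 5.5]))  # 0.25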

    Illustrative example

    Imagine we have a linear model, represented by a red line, and we want to fit it to a set of data points, shown as black dots. Try to find the best parameters, specifically the slope and the bias, that minimize the model’s error. This error is measured using Mean Squared Error (MSE), which we can visualize as blue squares representing the squared distances between the predicted values on the red line and the actual data points. The smaller these blue squares, the better our model fits the data.

    [Interactive demo showing the line’s equation (y = 1.00x + 1.00) and its MSE (0).]

    What is the gradient?

    Let’s say we want to optimize a loss function defined by the equation y = x^2, where y represents the loss value and x is the parameter we want to optimize. If we set x = 2, the loss becomes y = 4.

    While it’s easy to see that the optimal value of x is 0—since we know the shape of the function—in real-world scenarios, the loss function is often much more complex and not visually obvious. So, how can we figure out whether to increase or decrease x to reduce the loss?

    Spoiler alert: we can do this by calculating the gradient.

    By taking the derivative of the loss function with respect to x, we obtain the gradient at that point. For x=2, the derivative of y = x^2 is y'=2x, so the gradient is 2 \times 2=4. This gradient represents the slope of the tangent line at that point.

    To minimize the loss, we want to move in the direction that reduces the loss the fastest—that is, in the opposite direction of the gradient. Our goal is to reach a point where the gradient is 0 (or very close to it), which indicates that we’ve found a minimum in the loss function.
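
    To see why a raw gradient step can be too aggressive, suppose we simply subtracted the entire gradient from x:

        \[x_{\text{new}} = x - y'(x) = 2 - 4 = -2\]

    At x = -2 the loss is again y = (-2)^2 = 4, so we jumped right over the minimum and made no progress at all. This is exactly the problem the learning rate, introduced next, is meant to solve.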

    Learning rate

    However, there are situations where we might start far from the optimal value of our parameter. In such cases, the gradient tends to be steeper—the farther we are from the minimum, the larger the gradient usually is. This suggests that we might want to take larger steps when the gradient is large, and smaller steps when the gradient is small.

    This is where the learning rate comes into play. The learning rate is a small positive value that controls how much of the gradient we actually use to update our parameter. Instead of subtracting the full gradient from our parameter, we subtract only a fraction of it. This helps us move in the direction of lower loss without overshooting the minimum.
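
    Putting the gradient and the learning rate together, a minimal sketch of gradient descent on y = x^2 might look like this (the starting point, step count, and learning rate are arbitrary choices of ours):

        # Gradient descent on y = x^2, whose derivative is y' = 2x
        x = 2.0             # starting point
        learning_rate = 0.1

        for step in range(25):
            gradient = 2 * x                   # slope of the loss at the current x
            x -= learning_rate * gradient      # move against the gradient

        print(x)  # ~0.0076, very close to the true minimum at x = 0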

    Gradient descent step by step

    For the following dataset, let’s build a linear model to make some predictions:

        \[(x_1, y_1) = (0.5, 0.8)\]

        \[(x_2, y_2) = (2.0, 1.0)\]

        \[(x_3, y_3) = (1.0, 1.5)\]

    In a linear regression model, we have two parameters: the slope and the intercept. Therefore, we need to calculate the gradient for both of them.

        \[f(x) = mx + b\]

    Where:

    • m is the slope,
    • b is the intercept.

    Step 1: Initialize parameters

    Before starting the optimization process, we assign an initial value to both variables so we have something to improve upon.

    • m=0
    • b=0

    Step 2: Calculate Loss

    We’ll use Mean Squared Error (MSE):

        \[\text{Loss}=\frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i))^2 =\frac{1}{n} \sum_{i=1}^{n} (y_i -(mx_i + b))^2 \]

    Predictions with m=0, b=0

    f(x_i)=0 for all x values.
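
    Plugging these predictions into the loss gives our starting point:

        \[\text{Loss} = \frac{1}{3}\left[(0.8 - 0)^2 + (1.0 - 0)^2 + (1.5 - 0)^2\right] = \frac{3.89}{3} \approx 1.30\]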

    Step 3: Compute gradients

    We calculate the partial derivatives of the loss with respect to m and b:

    Gradient with respect to m:

        \[\frac{\partial \text{Loss}}{\partial m} = -\frac{2}{n} \sum_{i=1}^{n} x_i \left( y_i - (m x_i + b) \right)\]

        \[\frac{\partial \text{Loss}}{\partial m} = -\frac{2}{3} \left[ 0.5 \cdot (0.8 - 0) + 2 \cdot (1 - 0) + 1 \cdot (1.5 - 0) \right] = -\frac{2}{3} (3.9) = -2.6\]

    Gradient with respect to b:

        \[\frac{\partial \text{Loss}}{\partial b} = -\frac{2}{n} \sum_{i=1}^{n} \left( y_i - (m x_i + b) \right)\]

        \[\frac{\partial \text{Loss}}{\partial b} = -\frac{2}{3} \left[ (0.8 - 0) + (1 - 0) + (1.5 - 0) \right] = -\frac{2}{3} (3.3) = -2.2\]
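
    If you want to sanity-check these analytic gradients, a finite-difference approximation (a quick numerical sketch of ours, not part of the algorithm itself) lands on the same numbers:

        # Finite-difference check of the gradients at m = 0, b = 0
        xs, ys = [0.5, 2.0, 1.0], [0.8, 1.0, 1.5]

        def loss(m, b):
            return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)

        eps = 1e-6
        grad_m = (loss(eps, 0) - loss(-eps, 0)) / (2 * eps)
        grad_b = (loss(0, eps) - loss(0, -eps)) / (2 * eps)
        print(round(grad_m, 4), round(grad_b, 4))  # -2.6 -2.2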

    Step 4: Update parameters

    Using a learning rate \alpha = 0.1, we update m:

        \[m_{\text{new}} = m_{\text{old}} - \alpha \cdot \frac{\partial \text{Loss}}{\partial m}\]

        \[m_{\text{new}} = 0 - 0.1 \cdot (-2.6) = 0.26\]

    Next, we update b:

        \[b_{\text{new}} = b_{\text{old}} - \alpha \cdot \frac{\partial \text{Loss}}{\partial b}\]

        \[b_{\text{new}} = 0 - 0.1 \cdot (-2.2) = 0.22\]

    Note: We subtract the gradient from the current value because we want to move in the opposite direction of the gradient, as explained earlier.

    Now, we plug these updated values for m and b into our linear function:

        \[f(x) = 0.26x + 0.22\]

    Plotting the updated model, we can see that it has significantly improved compared to the initial version.

    After two more iterations, we arrive at the following values:

    m ≈ 0.427

    b ≈ 0.400

    These values yield the model f(x) = 0.427x + 0.400.

    Note: The learning rate \alpha = 0.1 is not an ideal value. Typically, smaller learning rates are used, as larger values may cause the model to have trouble converging.
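
    To tie the four steps together, here is a minimal end-to-end sketch in Python (variable names are our own; the dataset and learning rate are the ones used above):

        # Gradient descent for f(x) = m*x + b on the three-point dataset
        xs = [0.5, 2.0, 1.0]
        ys = [0.8, 1.0, 1.5]
        m, b = 0.0, 0.0   # Step 1: initialize parameters
        lr = 0.1          # learning rate (alpha)
        n = len(xs)

        for step in range(1, 4):
            # Step 2 uses these residuals: y_i - (m*x_i + b)
            errors = [y - (m * x + b) for x, y in zip(xs, ys)]
            # Step 3: partial derivatives of the MSE loss
            grad_m = -2 / n * sum(x * e for x, e in zip(xs, errors))
            grad_b = -2 / n * sum(errors)
            # Step 4: move against the gradient, scaled by the learning rate
            m -= lr * grad_m
            b -= lr * grad_b
            print(f"step {step}: m = {m:.3f}, b = {b:.3f}")

        # step 1: m = 0.260, b = 0.220
        # step 2: m = 0.378, b = 0.335
        # step 3: m = 0.427, b = 0.400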

    Linear regression loss

    Loss is the cost we incur when a model makes an imperfect prediction. In simple terms, it is the difference between the observed and predicted values, L(y, \hat{y}) = y - \hat{y}.

    For linear models, we could add up the loss for each prediction to find the overall loss value. However, this is not good practice, since positive errors (where predictions overshoot the observed value) and negative errors (where predictions undershoot the observed value) can offset each other and give us an artificially low loss.

    To prevent the errors from offsetting each other, we can take the absolute value of the loss (the L_1 loss) or square it (the L_2 loss).

        \[L_1(y, \hat{y}) = \sum |y - \hat{y}|\]

        \[L_2(y, \hat{y}) = \sum (y - \hat{y})^2\]

    However, these sums are not very descriptive on their own; they grow with the number of observations and do not give us a good sense of how good or bad the model is. So we divide by the number of observations N to get the average loss.

        \[\text{Mean Absolute Error (MAE)} = \frac{1}{N} \sum |y - \hat{y}|\]

        \[\text{Mean Squared Error (MSE)} = \frac{1}{N} \sum (y - \hat{y})^2\]
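
    To see how the two metrics behave differently, here is a small comparison in Python (the sample values are our own); notice how the single large error dominates the MSE:

        y_true = [1.0, 2.0, 3.0, 4.0]
        y_pred = [1.1, 1.9, 3.1, 7.0]   # three small errors and one large one

        errors = [y - y_hat for y, y_hat in zip(y_true, y_pred)]
        N = len(errors)

        mae = sum(abs(e) for e in errors) / N   # L1-based average
        mse = sum(e ** 2 for e in errors) / N   # L2-based average

        print(f"MAE = {mae:.3f}")  # 0.825
        print(f"MSE = {mse:.3f}")  # 2.258, driven almost entirely by the 3.0 error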