05 - A Unified View of Loss Functions in Supervised Learning

Class: CSCE-421


Notes:

Intro

A Unified View of Loss Functions: Binary-Class Classifier

The Math

(1) Given a dataset $X = [x_1, x_2, \ldots, x_n]$ with the corresponding labels $Y = [y_1, y_2, \ldots, y_n]$, where $x_i \in \mathbb{R}^{d+1}$ and $y_i \in \{+1, -1\}$.

(2) For a given sample xi, a linear classifier computes the linear score si as

$$s_i = w^T x_i$$

(3) We study the relation between loss values and $y_i s_i$:

$$y_i s_i = y_i (w^T x_i)$$
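To make the notation concrete, here is a minimal sketch in plain Python (the weights and sample below are made-up values, not from the slides):

```python
def score(w, x):
    """Linear score s_i = w^T x_i (dot product)."""
    return sum(wj * xj for wj, xj in zip(w, x))

# Hypothetical weights and sample; the leading 1.0 in x is the bias feature,
# matching x_i in R^(d+1).
w = [0.5, -1.0, 2.0]
x = [1.0, 0.2, 0.7]
y = +1  # true label

s = score(w, x)    # 0.5*1.0 + (-1.0)*0.2 + 2.0*0.7 = 1.7
agreement = y * s  # positive, so this prediction is correct
```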

Notes:

The Meaning

1. The Magic Metric: $y_i s_i$ (The Agreement Score). Up until now, we have looked at different loss functions for different problems (Least Squares for Regression, Cross-Entropy for Logistic Regression). To create a "unified view" of all classification loss functions, we need a single mathematical way to represent how right or how wrong a model's prediction is.

We do this by combining two things:

- The true label $y_i \in \{+1, -1\}$, which says which side of the boundary the sample should be on.
- The raw score $s_i = w^T x_i$, whose sign is the model's prediction and whose magnitude is its confidence.

When we multiply these together to get $y_i s_i$, it tells us everything we need to know about the prediction: it is positive when the prediction is correct, negative when it is wrong, and its magnitude measures how confident (or how badly violated) the prediction is.

Zero-One Loss

The Math

(1) The prediction is correct if $y_i s_i > 0$.

(2) The zero-one loss aims at measuring the number of prediction errors:

$$L_{0/1}(y_i, s_i) = \begin{cases} 1 & \text{if } y_i s_i < 0 \\ 0 & \text{otherwise} \end{cases}$$

![[Pasted image 20260205093720.png|350]]

(3) The loss for the entire training data is $\frac{1}{n}\sum_{i=1}^{n} L_{0/1}(y_i, s_i)$.
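A direct translation of the zero-one loss and its average over the training data (a sketch; the function names are my own):

```python
def zero_one_loss(y, s):
    """1 for a misclassified sample (y*s < 0), 0 otherwise."""
    return 1 if y * s < 0 else 0

def average_loss(ys, ss, loss):
    """Average a per-sample loss over the whole dataset."""
    return sum(loss(y, s) for y, s in zip(ys, ss)) / len(ys)

# Two of the three predictions agree in sign with their labels,
# so the average zero-one loss is 1/3 (the misclassification rate).
ys = [+1, -1, +1]
ss = [0.8, -0.3, -1.2]
```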

Notes:

The Meaning

2. The Zero-One Loss ($L_{0/1}$). The Zero-One loss is the simplest, most intuitive way to measure error. It acts like a very strict teacher grading a test: any wrong answer costs one full point, with no partial credit.

To find the total error of your model on the whole dataset, you just average these penalties: $\frac{1}{n}\sum_{i=1}^{n} L_{0/1}(y_i, s_i)$.

The total Zero-One loss is literally just the percentage of misclassified examples in your dataset.

3. The Problem with Zero-One Loss Your final note highlights the most important takeaway: "This loss function is not commonly used because it is not continuous, you will have to deal with discrete optimization".

Why is this bad? Remember the analogy of Gradient Descent: a ball rolling down a smooth hill to find the minimum. If you graph the Zero-One loss, it is not a smooth hill. It is a perfectly flat line at 0, and then a sudden, sharp vertical cliff jumping up to 1. Because the line is completely flat everywhere except at the cliff, the gradient (slope) is exactly zero almost everywhere. If the computer tries to find the slope to figure out which way to step, it gets 0, so it doesn't know how to adjust the weights.

To use Gradient Descent, we must replace this "stair-step" cliff with a smooth, curved approximation (like the Cross-Entropy Log Loss) that a computer can easily roll a ball down.

Perceptron loss

The Math

(1) The zero-one loss incurs the same loss value of 1 for all wrong predictions, no matter how far a wrong prediction is from the hyperplane.

(2) The perceptron loss addresses this by penalizing each wrong prediction in proportion to the extent of the violation. The perceptron loss function is defined as $\frac{1}{n}\sum_{i=1}^{n} L_p(y_i, s_i)$, where $L_p$ is the perceptron loss:

$$L_p(y_i, s_i) = \max(0, -y_i s_i).$$

(3) Note that the loss is 0 when the input example is correctly classified. When the input example is incorrectly classified, the loss equals the extent of the violation, $-y_i s_i$.

![[Pasted image 20260205094913.png|350]]
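The perceptron loss in code; note how the penalty now grows linearly with the violation instead of being a flat 1 (a sketch, names my own):

```python
def perceptron_loss(y, s):
    """0 if correctly classified; otherwise the extent of violation, -y*s."""
    return max(0.0, -y * s)

# Barely wrong vs. badly wrong now receive different penalties:
# perceptron_loss(+1, -0.1) == 0.1, perceptron_loss(+1, -5.0) == 5.0
```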

Notes:

The Meaning

1. Perceptron Loss: Measuring "How Wrong" You Are In the previous slide, we saw that the Zero-One loss is flawed because it gives a flat penalty of 1 for any mistake, whether the model was barely wrong or completely wrong. It also forms a "cliff" that prevents computers from using Gradient Descent (since the slope is 0 everywhere else).

The Perceptron Loss fixes this by penalizing the model based on the extent of the violation.

Because it forms a continuous ramp instead of a sharp cliff, the computer can now measure a slope and adjust the weights, making it a functional approximation of the Zero-One loss. However, while it is convex, it is not strictly convex (it doesn't form a perfect rounded bowl), which can sometimes make optimization slightly less smooth than other methods.

Square loss

The Math

(1) The square loss function is commonly used for regression problems.

(2) It can also be used for binary classification problems as

$$\frac{1}{n}\sum_{i=1}^{n} L_s(y_i, s_i),$$

where $L_s$ is the square loss, defined as

$$L_s(y_i, s_i) = (1 - y_i s_i)^2$$

(3) Note that the square loss tends to penalize wrong predictions excessively. In addition, when the value of $y_i s_i$ is large and the classifier is making correct predictions, the square loss still incurs a large loss value.

![[Pasted image 20260205095706.png|350]]
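A quick numerical check of the "too right" problem (a sketch; the example margins are made up):

```python
def square_loss(y, s):
    """(1 - y*s)^2: the square loss applied to classification."""
    return (1.0 - y * s) ** 2

# A very confident CORRECT prediction (y*s = 5) is penalized far more
# than a mildly WRONG one (y*s = -0.5):
confident_correct = square_loss(+1, 5.0)   # (1 - 5)^2   = 16.0
mildly_wrong      = square_loss(+1, -0.5)  # (1 + 0.5)^2 = 2.25
```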

Notes:

The Meaning

Square Loss: The Danger of Being "Too Right". Next, the slide introduces the Square Loss: $L_s(y_i, s_i) = (1 - y_i s_i)^2$. You already know this function: it is the standard error measure used for Linear Regression. However, this slide explores what happens if we blindly try to use it for classification.

This is the fatal flaw of using Square Loss for classification: it severely penalizes the model for being "too correct". Because it is a parabola (a U-shape), the penalty shoots up equally on both sides. The model will waste its time trying to pull its highly confident correct predictions back down to exactly 1, rather than focusing on fixing its actual mistakes.

While Square Loss is perfectly smooth and strictly convex (making the math very easy), this over-penalization of correct predictions makes it a poor choice for classification tasks.

Log loss (cross entropy)

The Math

(1) Logistic regression employs the log loss (cross entropy) to train classifiers.

(2) The loss function used in logistic regression can be expressed as

$$\frac{1}{n}\sum_{i=1}^{n} L_{\log}(y_i, s_i),$$

where $L_{\log}$ is the log loss, defined as

$$L_{\log}(y_i, s_i) = \log(1 + e^{-y_i s_i}).$$

![[Pasted image 20260205095836.png|350]]
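The log loss in code; even a correct but low-confidence prediction incurs a small, nonzero penalty (a sketch):

```python
import math

def log_loss(y, s):
    """log(1 + e^(-y*s)): smooth everywhere and strictly positive."""
    return math.log(1.0 + math.exp(-y * s))

# Correct but unconfident (y*s = 0.05): small but nonzero loss,
# so gradient descent keeps pushing toward a more confident score.
# Correct and confident (y*s = 10): loss is nearly zero.
```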

Notes:

The Meaning

1. Log Loss (Cross Entropy) is perfectly smooth and everywhere differentiable. It forms a perfect, rounded bowl (it is strictly convex), which is exactly why we love using it for Gradient Descent.

2. Log Loss: Never satisfied, always pushing. You asked in your notes: "Should there be some loss close to 0 or not?" The answer is yes! If $y_i s_i$ is positive but very small (e.g., 0.05), the model guessed correctly but is extremely unconfident, so the log loss still assigns a small penalty and keeps pushing the model toward a more confident score.

Hinge loss (support vector machines)

The Math

(1) The support vector machines employ hinge loss to obtain a classifier with "maximum-margin".

(2) The loss function in support vector machines is defined as follows:

$$\frac{1}{n}\sum_{i=1}^{n} L_h(y_i, s_i),$$

where $L_h$ is the hinge loss:

$$L_h(y_i, s_i) = \max(0, 1 - y_i s_i).$$

(3) Different from the zero-one loss and perceptron loss, a data sample may be penalized even if it is predicted correctly (when $0 < y_i s_i < 1$).

![[Pasted image 20260205100458.png|350]]
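The hinge loss in code; the shift by 1 means correct predictions inside the margin still pay a penalty (a sketch):

```python
def hinge_loss(y, s):
    """max(0, 1 - y*s): zero only once the margin y*s >= 1 is met."""
    return max(0.0, 1.0 - y * s)

# Correct but inside the margin (0 < y*s < 1): still penalized.
# Correct and beyond the margin (y*s >= 1): zero loss.
```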

Notes:

The Meaning

3. Hinge Loss: Demanding a "Margin" of Safety. The Support Vector Machine (SVM) uses Hinge Loss: $L_h(y_i, s_i) = \max(0, 1 - y_i s_i)$. You perfectly noticed that the only difference between this and the perceptron loss is the shift by 1. This 1 has a massive impact.

4. Support vs. Non-Support Vectors. Because the hinge loss becomes exactly zero once $y_i s_i \geq 1$, an interesting phenomenon occurs: only the samples on or inside the margin ($y_i s_i \leq 1$) contribute to the loss and its gradient. These are the support vectors; every other sample could be removed without changing the learned boundary.

5. "Why is it 1? Can we use 5?" Your intuition here is incredibly sharp and correct! If you used $\max(0, 5 - y_i s_i)$, the algorithm would just scale the entire weight vector $w$ up by a factor of 5 to achieve the exact same geometric separating line. The margin would look the same relative to the data. We use 1 simply because it is the cleanest mathematical standard to anchor the scale.

6. Why deep learning prefers Log Loss Your note suggests Hinge Loss isn't used in Deep Learning because it only works for two classes. While multi-class adaptations of Hinge Loss do exist, the real reason Deep Learning universally prefers Log Loss is back to your first note: Differentiability and Probabilities. Log loss is completely smooth (easy for backpropagation/gradient descent) and pairs perfectly with the Softmax function to output clean percentages (probabilities), whereas Hinge Loss just outputs raw, hard boundary scores.

Exponential Loss

The Math

(1) The log term in the log loss encourages the loss to grow slowly for negative values, making it less sensitive to wrong predictions.

(2) There is a more aggressive loss function, known as the exponential loss, which grows exponentially for negative values and is thus very sensitive to wrong predictions. The AdaBoost algorithm employs the exponential loss to train the models.

(3) The exponential loss function can be expressed as $\frac{1}{n}\sum_{i=1}^{n} L_{\exp}(y_i, s_i)$, where $L_{\exp}$ is the exponential loss, defined as

$$L_{\exp}(y_i, s_i) = e^{-y_i s_i}.$$

![[Pasted image 20260205101656.png|350]]
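A numeric comparison of how hard each loss punishes the same bad prediction, computed in plain Python (a sketch; the margin value is an illustrative choice):

```python
import math

margin = -5.0  # a badly wrong prediction: y_i * s_i = -5

perceptron  = max(0.0, -margin)                  # 5.0: linear growth
logistic    = math.log(1.0 + math.exp(-margin))  # ~5.007: barely more
exponential = math.exp(-margin)                  # ~148.4: "aggressive"
```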

Notes:

The Meaning

Exponential Loss: The "Aggressive" Metric. The formula for Exponential Loss is $L_{\exp}(y_i, s_i) = e^{-y_i s_i}$. To understand why your slide calls it "aggressive," compare it to the Perceptron and Log losses when the model makes a very wrong prediction (e.g., $y_i s_i = -5$): the perceptron loss is 5, the log loss is $\log(1 + e^5) \approx 5.01$, but the exponential loss is $e^5 \approx 148$.

As you can see, Exponential Loss punishes wrong predictions astronomically! If even a single data point is misclassified by a large margin, the error skyrockets. This forces the learning algorithm to focus intensely on fixing its biggest mistakes. This is the exact mathematical engine that powers AdaBoost (a famous algorithm that builds classifiers by obsessively focusing on the hardest-to-classify data points).

Convexity

The Math

(1) Mathematically, a function $f(\cdot)$ is convex if

$$f(t x_1 + (1-t) x_2) \leq t f(x_1) + (1-t) f(x_2), \quad \text{for } t \in [0, 1].$$

(2) A function $f(\cdot)$ is strictly convex if

$$f(t x_1 + (1-t) x_2) < t f(x_1) + (1-t) f(x_2), \quad \text{for } t \in (0, 1),\ x_1 \neq x_2.$$

(3) Intuitively, a function is convex if the line segment between any two points on the function is not below the function.

(4) A function is strictly convex if the line segment between any two distinct points on the function is strictly above the function, except for the two points on the function itself.

![[Pasted image 20260205103945.png|400]]

(5) In the zero-one loss, if a data sample is predicted correctly ($y_i s_i > 0$), it incurs zero penalty; otherwise, the penalty is one. Every misclassified sample receives the same loss, regardless of how wrong it is.

(6) For the perceptron loss, the penalty for each wrong prediction is proportional to the extent of the violation. For the other losses, a data sample can still incur a penalty even if it is classified correctly.

(7) The log loss is similar to the hinge loss but it is a smooth function which can be optimized with the gradient descent method.

(8) While log loss grows slowly for negative values, exponential loss and square loss are more aggressive.

(9) Note that, among all of these loss functions, the square loss is the only one that penalizes correct predictions severely when the value of $y_i s_i$ is large.

(10) In addition, zero-one loss is not convex while the other loss functions are convex. Note that the hinge loss and perceptron loss are not strictly convex.
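Point (10) can be checked numerically: the hinge loss satisfies the convexity inequality, but with equality on its flat region, so it is convex yet not strictly convex (a sketch, writing the hinge as a function of the margin $z = y_i s_i$):

```python
def hinge(z):
    """Hinge loss as a function of the margin z = y*s."""
    return max(0.0, 1.0 - z)

t = 0.5

# On the flat region (z >= 1) the chord lies ON the function:
# equality with z1 != z2 shows hinge is NOT strictly convex.
z1, z2 = 2.0, 5.0
flat_lhs = hinge(t * z1 + (1 - t) * z2)         # hinge(3.5) = 0.0
flat_rhs = t * hinge(z1) + (1 - t) * hinge(z2)  # 0.0

# Across the kink the chord lies strictly above (convexity still holds):
z1, z2 = -1.0, 2.0
kink_lhs = hinge(t * z1 + (1 - t) * z2)         # hinge(0.5) = 0.5
kink_rhs = t * hinge(z1) + (1 - t) * hinge(z2)  # 0.5*2 + 0.5*0 = 1.0
```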

Notes:

The Meaning

1. Convexity: The Mathematics of a "Bowl". You already know intuitively that a convex function looks like a bowl, which is great for Gradient Descent. The slide provides the formal mathematical definition of what a "bowl" is: $$f(t x_1 + (1-t) x_2) \leq t f(x_1) + (1-t) f(x_2)$$ Let's translate this math into English: pick any two points on the function's graph and draw the straight line segment (the "chord") between them. The left side is the function's value at a point between them; the right side is the chord's height at that same point. Convexity says the chord never dips below the function, so the bowl has no hidden dents for Gradient Descent to get stuck in.

2. Convex vs. Strictly Convex. Why do we care about "strictly" convex? A strictly convex loss (like the log or square loss) forms a perfect rounded bowl with a single, unique minimum, so optimization converges to one answer. A loss that is convex but not strictly convex (like the hinge or perceptron loss) can have flat regions where many different weight vectors achieve the same minimal loss.

Summary

![[Pasted image 20260205103254.png|450]]

The Grand Summary of Loss Functions. The end of your slide brilliantly summarizes everything you need to know for choosing a loss function: the zero-one loss is intuitive but non-convex and gradient-free; the perceptron and hinge losses are convex (but not strictly convex) ramps; the log loss is smooth and strictly convex, which is why Gradient Descent loves it; and the square and exponential losses are strictly convex but aggressive, with the square loss even punishing confident correct predictions.