05 - A Unified View of Loss Functions in Supervised Learning

Class: CSCE-421


Notes:

Intro

A Unified View of Loss Functions: Binary-Class Classifier

(1) Given a dataset $X = [x_1, x_2, \dots, x_n]$ with the corresponding labels $Y = [y_1, y_2, \dots, y_n]$, where $x_i \in \mathbb{R}^{d+1}$ and $y_i \in \{+1, -1\}$.

(2) For a given sample $x_i$, a linear classifier computes the linear score $s_i$ as

$$s_i = w^T x_i$$

(3) We study the relations between loss values and $y_i s_i$:

$$y_i s_i = y_i (w^T x_i)$$
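As a quick illustration (my own sketch, not from the lecture), the following NumPy snippet computes the linear scores $s_i$ and the quantities $y_i s_i$ for a made-up toy dataset; the variable names and values are assumptions chosen only for this example.

```python
import numpy as np

# Toy data (made up): 3 samples in R^{d+1}, where the last feature is the constant bias term 1.
X = np.array([[ 1.0,  2.0, 1.0],
              [ 0.5, -1.0, 1.0],
              [-2.0,  0.3, 1.0]])
y = np.array([1, -1, 1])           # labels in {+1, -1}
w = np.array([0.4, -0.2, 0.1])     # arbitrary weight vector (includes the bias weight)

s = X @ w          # linear scores s_i = w^T x_i
margins = y * s    # y_i * s_i; positive means the prediction is correct
print(margins)
```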

Notes:

Zero-One Loss

(1) A prediction is correct if $y_i s_i > 0$.

(2) The zero-one loss aims at measuring the number of prediction errors

$$L_{0/1}(y_i, s_i) = \begin{cases} 1 & \text{if } y_i s_i < 0 \\ 0 & \text{otherwise} \end{cases}$$

![[Pasted image 20260205093720.png|350]]

(3) The loss for the entire training data is $\frac{1}{n}\sum_{i=1}^{n} L_{0/1}(y_i, s_i)$.
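A minimal sketch of the zero-one loss (my own code, not part of the notes), assuming the values $y_i s_i$ have already been computed as in the earlier snippet; the example margin values are made up.

```python
import numpy as np

def zero_one_loss(margins):
    """Zero-one loss: 1 where y_i * s_i < 0, else 0."""
    return (margins < 0).astype(float)

margins = np.array([0.8, -0.3, 1.5, -2.0])   # example values of y_i * s_i
print(zero_one_loss(margins))                # [0. 1. 0. 1.]
print(zero_one_loss(margins).mean())         # average loss over the data: 0.5
```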

Notes:

Perceptron loss

(1) The zero-one loss incurs the same loss value of 1 for all wrong predictions, no matter how far a wrong prediction is from the hyperplane.

(2) The perceptron loss addresses this by penalizing each wrong prediction by the extent of violation. The perceptron loss function is defined as $\frac{1}{n}\sum_{i=1}^{n} L_p(y_i, s_i)$, where the per-sample perceptron loss $L_p$ is

$$L_p(y_i, s_i) = \max(0, -y_i s_i).$$

(3) Note that the loss is 0 when the input example is correctly classified. When the input example is incorrectly classified, the loss is proportional to the extent of violation, $-y_i s_i$.

![[Pasted image 20260205094913.png|350]]
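A minimal sketch of the perceptron loss (my own, not from the notes), again assuming precomputed margins $y_i s_i$ and using made-up example values:

```python
import numpy as np

def perceptron_loss(margins):
    """Perceptron loss: max(0, -y_i * s_i)."""
    return np.maximum(0.0, -margins)

margins = np.array([0.8, -0.3, 1.5, -2.0])
print(perceptron_loss(margins))   # [0.  0.3 0.  2. ] -- the penalty grows with the violation
```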

Notes:

Square loss

(1) The square loss function is commonly used for regression problems.

(2) It can also be used for binary classification problems as

$$\frac{1}{n}\sum_{i=1}^{n} L_s(y_i, s_i),$$

where Ls is the square loss, defined as

$$L_s(y_i, s_i) = (1 - y_i s_i)^2$$

(3) Note that the square loss tends to penalize wrong predictions excessively. In addition, when the value of $y_i s_i$ is large and the classifier is making a confident correct prediction, the square loss still incurs a large loss value.

![[Pasted image 20260205095706.png|350]]
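A minimal sketch of the square loss on made-up margin values (my own code, not from the notes):

```python
import numpy as np

def square_loss(margins):
    """Square loss: (1 - y_i * s_i)^2."""
    return (1.0 - margins) ** 2

margins = np.array([0.8, -0.3, 1.5, 3.0])
print(square_loss(margins))   # [0.04 1.69 0.25 4.  ] -- large loss even at the confident correct margin 3.0
```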

Notes:

Log loss (cross entropy)

(1) Logistic regression employs the log loss (cross entropy) to train classifiers.

(2) The loss function used in logistic regression can be expressed as

$$\frac{1}{n}\sum_{i=1}^{n} L_{\log}(y_i, s_i),$$

where $L_{\log}$ is the log loss, defined as

$$L_{\log}(y_i, s_i) = \log(1 + e^{-y_i s_i}).$$

![[Pasted image 20260205095836.png|350]]
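A minimal sketch of the log loss on made-up margin values (my own code, not from the notes); `np.log1p` is used for a slightly more stable evaluation of $\log(1 + e^{-y_i s_i})$:

```python
import numpy as np

def log_loss(margins):
    """Log loss (cross entropy): log(1 + exp(-y_i * s_i))."""
    return np.log1p(np.exp(-margins))

margins = np.array([0.8, -0.3, 1.5, -2.0])
print(log_loss(margins))   # decreases smoothly as the margin y_i * s_i grows
```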

Notes:

Hinge loss (support vector machines)

(1) Support vector machines employ the hinge loss to obtain a "maximum-margin" classifier.

(2) The loss function in support vector machines is defined as follows:

$$\frac{1}{n}\sum_{i=1}^{n} L_h(y_i, s_i),$$

where $L_h$ is the hinge loss:

$$L_h(y_i, s_i) = \max(0, 1 - y_i s_i).$$

(3) Unlike the zero-one loss and the perceptron loss, a data sample may be penalized even if it is predicted correctly (when $0 < y_i s_i < 1$).

![[Pasted image 20260205100458.png|350]]
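A minimal sketch of the hinge loss on made-up margin values (my own code, not from the notes):

```python
import numpy as np

def hinge_loss(margins):
    """Hinge loss: max(0, 1 - y_i * s_i)."""
    return np.maximum(0.0, 1.0 - margins)

margins = np.array([0.8, -0.3, 1.5, -2.0])
print(hinge_loss(margins))   # [0.2 1.3 0.  3. ] -- margin 0.8 is correct but still penalized
```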

Notes:

Exponential Loss

(1) The log term in the log loss encourages the loss to grow slowly for negative values of $y_i s_i$, making it less sensitive to wrong predictions.

(2) There is a more aggressive loss function, known as the exponential loss, which grows exponentially for negative values and is thus very sensitive to wrong predictions. The AdaBoost algorithm employs the exponential loss to train the models.

(3) The exponential loss function can be expressed as $\frac{1}{n}\sum_{i=1}^{n} L_{\exp}(y_i, s_i)$, where $L_{\exp}$ is the exponential loss, defined as

$$L_{\exp}(y_i, s_i) = e^{-y_i s_i}.$$

![[Pasted image 20260205101656.png|350]]
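A minimal sketch of the exponential loss on made-up margin values (my own code, not from the notes):

```python
import numpy as np

def exponential_loss(margins):
    """Exponential loss: exp(-y_i * s_i)."""
    return np.exp(-margins)

margins = np.array([0.8, -0.3, 1.5, -2.0])
print(exponential_loss(margins))   # grows very fast for negative margins (about 7.39 at -2.0)
```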

Notes:

Convexity

(1) Mathematically, a function $f(\cdot)$ is convex if

$$f(t x_1 + (1-t) x_2) \le t f(x_1) + (1-t) f(x_2), \quad \text{for } t \in [0, 1].$$

(2) A function $f(\cdot)$ is strictly convex if

$$f(t x_1 + (1-t) x_2) < t f(x_1) + (1-t) f(x_2), \quad \text{for } t \in (0, 1), \; x_1 \ne x_2.$$

(3) Intuitively, a function is convex if the line segment between any two points on the function is not below the function.

(4) A function is strictly convex if the line segment between any two distinct points on the function lies strictly above the function, except at the two endpoints themselves.

![[Pasted image 20260205103945.png|400]]
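As a rough illustration (my own sketch, not from the notes), the convexity inequality above can be checked numerically for the hinge loss at a few random points; the helper name `hinge` and the random values are assumptions made only for this example.

```python
import numpy as np

def hinge(z):
    """Hinge loss as a function of z = y_i * s_i."""
    return np.maximum(0.0, 1.0 - z)

rng = np.random.default_rng(0)
for _ in range(5):
    z1, z2 = rng.uniform(-3, 3, size=2)
    t = rng.uniform(0, 1)
    lhs = hinge(t * z1 + (1 - t) * z2)
    rhs = t * hinge(z1) + (1 - t) * hinge(z2)
    print(bool(lhs <= rhs + 1e-12))   # convexity inequality: prints True every time
```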

(5) In the zero-one loss, if a data sample is predicted correctly ($y_i s_i > 0$), it incurs zero penalty; otherwise, there is a penalty of one. Every data sample that is not predicted correctly receives the same loss.

(6) For the perceptron loss, the penalty for each wrong prediction is proportional to the extent of violation. For the other losses, a data sample can still incur a penalty even if it is classified correctly.

(7) The log loss is similar to the hinge loss, but it is a smooth function that can be optimized with gradient descent.

(8) While log loss grows slowly for negative values, exponential loss and square loss are more aggressive.

(9) Note that, among all of these loss functions, the square loss is the only one that penalizes correct predictions severely when the value of $y_i s_i$ is large.

(10) In addition, zero-one loss is not convex while the other loss functions are convex. Note that the hinge loss and perceptron loss are not strictly convex.
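To make the comparison in the points above concrete, here is a small plotting sketch (my own, assuming matplotlib is available) that draws each loss as a function of $y_i s_i$ using the definitions given earlier:

```python
import numpy as np
import matplotlib.pyplot as plt

z = np.linspace(-3, 3, 400)   # z stands for y_i * s_i
losses = {
    "zero-one":    (z < 0).astype(float),
    "perceptron":  np.maximum(0.0, -z),
    "square":      (1.0 - z) ** 2,
    "log":         np.log1p(np.exp(-z)),
    "hinge":       np.maximum(0.0, 1.0 - z),
    "exponential": np.exp(-z),
}
for name, values in losses.items():
    plt.plot(z, values, label=name)
plt.xlabel("y_i * s_i")
plt.ylabel("loss")
plt.ylim(0, 5)
plt.legend()
plt.show()
```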

Notes:

Summary

![[Pasted image 20260205103254.png|450]]