03 - Logistic Regression - From Binary to Multi-Class
Class: CSCE-421
Notes:
Multi-Class Classification
- Given a dataset $\{(x_n, y_n)\}_{n=1}^{N}$, where $x_n \in \mathbb{R}^d$ and $y_n \in \{1, \dots, K\}$ - K is the number of classes and N is the number of samples.
- We need to estimate the probability of x belonging to each of the K classes, $P(y = k \mid x)$ for $k = 1, \dots, K$.
- No ordinal relationship between the classes.
Example:
- For K = 10 (10 classes), each prediction will be a vector of 10 dimensions.
- The prediction is K-dimensional, but a raw K-dimensional output can have positive/negative entries, so it is not yet a probability distribution.
Multi-Class Logistic Regression
- We need K weight vectors $w_k \in \mathbb{R}^d$, $k = 1, \dots, K$ - one weight vector for each class.
- With these K weight vectors, we need to get to K class probabilities.
- Compute K linear signals by the dot product between the input x and each $w_k$, as $w_1^T x, \dots, w_K^T x$.
- We need to map the K outputs (a vector in $\mathbb{R}^K$) to the K probabilities (a probability distribution among the K classes).
Softmax
A function that is very useful in general, even outside of ML.
- Given a K-dimensional vector $s = (s_1, \dots, s_K)$,
- the output of the softmax function is a vector of probabilities: $\mathrm{softmax}(s)_k = \dfrac{e^{s_k}}{\sum_{j=1}^{K} e^{s_j}}$.
- The softmax maps a vector in $\mathbb{R}^K$ to a probability distribution over the K classes.
  - All elements in the output vector sum to 1.
Why is it called softmax?
- Note that the exponential applied to each component is a monotonic function; this means the ordering of the inputs is preserved in the outputs.
- Whichever component is largest in the input will still be the largest in the output, thanks to this property of the exponential term.
- Softmax = a "soft" version of the max: whichever entry is larger in the input will be larger in the output (see the sketch below).
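A minimal NumPy sketch of the softmax function (not from the lecture; the function name and the max-subtraction trick are my own additions, and the trick is justified by the shift-invariance property discussed later in these notes):

```python
import numpy as np

def softmax(s):
    """Map a K-dimensional score vector s to a probability vector."""
    z = np.exp(s - np.max(s))   # subtracting max(s) avoids overflow and does not change the output
    return z / np.sum(z)

s = np.array([2.0, 1.0, -1.0])
p = softmax(s)
print(p)         # the largest input stays the largest in the output
print(p.sum())   # 1.0
```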
Multi-Class Logistic Regression
- The multi-class prediction function can be written as $P(y = k \mid x) = \dfrac{e^{w_k^T x}}{\sum_{j=1}^{K} e^{w_j^T x}}$, i.e. the softmax applied to the K linear signals.
- The softmax output vector tells you the probability that x belongs to each of the K classes.
- This classifier can be used for any classification problem, since it works for an arbitrary number of classes K (multi-class).
  - Most other models only work for two-class classification.
- Although there are K vectors $w_k$, $k = 1, \dots, K$, only K−1 of them are independent, due to the sum-to-one constraint.
  - That is why we only need one weight vector for binary (two-class) problems.
- When K = 2, there is an equivalence relationship between softmax and θ(·) (see the sketch below and the proof later in these notes).
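A sketch of the multi-class prediction function, assuming the K weight vectors are stacked as the rows of a matrix W (the variable names and example numbers are mine, not the lecture's):

```python
import numpy as np

def predict_proba(W, x):
    """W: (K, d) matrix whose k-th row is w_k; x: (d,) input vector.

    Returns the K class probabilities softmax(w_1^T x, ..., w_K^T x).
    """
    s = W @ x                    # K linear signals w_k^T x
    z = np.exp(s - np.max(s))
    return z / z.sum()

W = np.array([[ 1.0, -0.5],
              [ 0.2,  0.8],
              [-1.0,  0.3]])    # K = 3 classes, d = 2 features
x = np.array([0.5, 2.0])
p = predict_proba(W, x)
print(p, p.argmax())            # class probabilities and the predicted class
```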
Training with Cross Entropy Loss
- A loss function measures the error between predictions and the true class labels.
- We need to measure the distance between two probability distributions.
- The cross entropy for a single sample is defined as $H(p, q) = -\sum_{k=1}^{K} p_k \log q_k$, where p is the true (one-hot) distribution and q is the predicted distribution.
  - This is the general definition of cross entropy.
  - Since p is one-hot, there is only one non-zero term: the term for the correct class, so the loss reduces to $-\log q_c$ for the true class c (see the sketch below).
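A minimal sketch of the cross entropy between a one-hot distribution p and a predicted distribution q (the function name and example numbers are my own):

```python
import numpy as np

def cross_entropy(p, q):
    """H(p, q) = -sum_k p_k * log(q_k)."""
    return -np.sum(p * np.log(q))

p = np.array([0.0, 1.0, 0.0])   # one-hot: the true class is the second one
q = np.array([0.2, 0.7, 0.1])   # predicted probabilities
print(cross_entropy(p, q), -np.log(0.7))   # both ≈ 0.357: only the correct-class term survives
```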
A Neural Network View
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260127101916.png)
- You have a d-dimensional input vector $x$.
- Then we compute $w_1^T x$, $w_2^T x$, and $w_3^T x$ (for 3 classes, so K = 3).
  - Look at the different colored edges: each color represents the connections belonging to one of $w_1$, $w_2$, or $w_3$.
  - If you have K classes, you need K units in this layer.
  - You are basically connecting every input to every output.
- This 3-dimensional vector (a vector with 3 numbers) is passed on to the softmax function, which outputs the probability that $x$ belongs to each class. This is our prediction.
  - Note that the softmax function has a normalization term; it couples all the terms together.
- Now let's run through a training step using the cross entropy loss function (see the sketch after this list).
  - A one-hot vector has exactly one element equal to 1 and all other elements equal to 0. It encodes the true class of the input $x$ and is fed into the cross entropy loss function.
  - The loss measures the difference between the true one-hot vector and the probability vector output by the softmax function.
  - For example, with K = 3 and true class 2, the one-hot vector looks like $[0, 1, 0]$.
  - Minimizing the cross entropy loss is equivalent to maximizing the predicted probability of the true class.
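A sketch of this forward pass plus the cross entropy loss for one training sample, tying the layers together (K = 3 classes; all names and numbers are illustrative, not from the lecture):

```python
import numpy as np

d, K = 4, 3
rng = np.random.default_rng(0)
W = rng.normal(size=(K, d))            # one weight vector (row) per class
x = rng.normal(size=d)                 # input vector
t = np.array([0.0, 1.0, 0.0])          # one-hot true label: class 2

s = W @ x                              # K linear signals (the colored edges)
y = np.exp(s - s.max()); y /= y.sum()  # softmax -> predicted probability vector
loss = -np.sum(t * np.log(y))          # cross entropy against the one-hot label
print(y, loss)
```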
Example exam question:
- Let's say you are given a predicted distribution and the true one-hot label:
  - q = [0.5, 0.4, 0.1] and p = [0, 1, 0]
  - The distribution of the probability over the other classes does not matter.
- What is the loss of this sample?
  - Just plug into the cross entropy loss function: $H(p, q) = -\sum_k p_k \log q_k$.
  - You just need to compute $-\log(0.4)$.
  - Answer: $-\log(0.4) \approx 0.92$ (see the check below).
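A quick numerical check of this example (plugging the vectors into the definition):

```python
import numpy as np

q = np.array([0.5, 0.4, 0.1])   # predicted probabilities
p = np.array([0.0, 1.0, 0.0])   # one-hot true label
print(-np.sum(p * np.log(q)))   # -log(0.4) ≈ 0.916
```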
Loss Function
Now the loss for a training sample x in class c is given by
$$E(w) = -\sum_{k=1}^{K} t_k \log P(y = k \mid x) = -\log P(y = c \mid x),$$
where $t_k$ is equal to 1 only for the true class.
- Summing this over all N training samples gives the total loss, for when you need the loss of multiple samples (see the sketch below).
- You have the weight matrix (all K weight vectors) as input and end up with a single number.
- When you compute the derivative of this output (a number) with respect to the weights, it becomes a matrix.
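A sketch of the total loss over N samples, taking the stacked weight matrix as input and returning a single number (it assumes the labels are given as integer class indices; names are mine):

```python
import numpy as np

def total_loss(W, X, labels):
    """W: (K, d) weights, X: (N, d) inputs, labels: (N,) integer class indices.

    Returns the summed cross entropy loss, a single scalar.
    """
    S = X @ W.T                                            # (N, K) linear signals
    S -= S.max(axis=1, keepdims=True)                      # harmless shift (see shift-invariance)
    Y = np.exp(S) / np.exp(S).sum(axis=1, keepdims=True)   # row-wise softmax
    return -np.sum(np.log(Y[np.arange(len(labels)), labels]))

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
labels = np.array([0, 1, 2])
W = np.zeros((3, 2))
print(total_loss(W, X, labels))   # 3*log(3) ≈ 3.296: the loss of uniform predictions
```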
Derivative of Loss Function
The notes (Logistic Regression: From Binary to Multi-Class) contain
details on the derivative of the cross entropy loss function, which is necessary for your homework. All you need are (see the sketch after this list):
- Univariate calculus
- Chain rule
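The full derivation is in the official notes; as a reference point, this is a sketch of the standard gradient of the softmax + cross entropy loss, $\partial E / \partial w_k = \sum_n (y_{nk} - t_{nk})\, x_n$, written in matrix form (variable names are mine):

```python
import numpy as np

def loss_grad(W, X, labels):
    """Gradient of the summed cross entropy loss with respect to W, shape (K, d).

    Standard softmax + cross entropy result: dE/dW = (Y - T)^T X,
    where Y holds the softmax outputs and T the one-hot labels.
    """
    N, K = X.shape[0], W.shape[0]
    S = X @ W.T
    S -= S.max(axis=1, keepdims=True)
    Y = np.exp(S) / np.exp(S).sum(axis=1, keepdims=True)   # (N, K) predicted probabilities
    T = np.zeros((N, K))
    T[np.arange(N), labels] = 1.0                          # (N, K) one-hot targets
    return (Y - T).T @ X
```

You can sanity-check a gradient like this against finite differences of `total_loss` before relying on it.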
Shift-invariance in Parameters
The softmax function in multi-class LR has an invariance property when shifting the parameters. Given the weights $w_k$, $k = 1, \dots, K$, if we subtract the same vector $u$ from every $w_k$, the output of the softmax function will remain the same.
Notes:
- If we subtract a vector u from each of the K weights $w_k$
  - (u is a vector of the same dimension as each of the $w_k$),
- the outputs of softmax will be equivalent.
- This is also why only K − 1 of the weight vectors are effectively independent: because of the normalization, one of them can always be shifted to zero.
Proof
To prove this, let us denote $\tilde{w}_k = w_k - u$. Then, for every k,
$$\frac{e^{\tilde{w}_k^T x}}{\sum_{j=1}^{K} e^{\tilde{w}_j^T x}} = \frac{e^{w_k^T x}\, e^{-u^T x}}{\sum_{j=1}^{K} e^{w_j^T x}\, e^{-u^T x}} = \frac{e^{w_k^T x}}{\sum_{j=1}^{K} e^{w_j^T x}},$$
which completes the proof. (A numerical check follows below.)
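A quick numerical check of the shift-invariance: subtracting the same vector u from every $w_k$ shifts every score by the same scalar $u^T x$, so the softmax output is unchanged (all names and numbers are mine):

```python
import numpy as np

def softmax(s):
    z = np.exp(s - s.max())
    return z / z.sum()

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 4))   # K = 3 weight vectors as rows
x = rng.normal(size=4)
u = rng.normal(size=4)        # arbitrary shift vector

print(np.allclose(softmax(W @ x), softmax((W - u) @ x)))   # True
```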
Equivalence to Sigmoid
Once we have proved the shift-invariance, we are able to show that when K = 2, the softmax-based multi-class LR is equivalent to the sigmoid-based binary LR. In particular, the hypotheses of the two LRs are equivalent.
Proof
By shift-invariance we can subtract $w_2$ from both weight vectors, so
$$P(y = 1 \mid x) = \frac{e^{w_1^T x}}{e^{w_1^T x} + e^{w_2^T x}} = \frac{e^{(w_1 - w_2)^T x}}{e^{(w_1 - w_2)^T x} + e^{0}} = \frac{1}{1 + e^{-(w_1 - w_2)^T x}} = \theta(w^T x),$$
where $w = w_1 - w_2$.
Notes:
- Note how, after subtracting $w_2$, one of the exponents becomes 0, so that exponential term gives 1.
- Note the expression in the third step: it is exactly the binary LR sigmoid θ(·).
- You can do a change of variables $w = w_1 - w_2$ (see the check below).
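A quick check that for K = 2 the softmax prediction equals the sigmoid of $(w_1 - w_2)^T x$ (random vectors, names are mine):

```python
import numpy as np

rng = np.random.default_rng(2)
w1, w2, x = rng.normal(size=(3, 5))   # two weight vectors and one input, all 5-dimensional

p_softmax = np.exp(w1 @ x) / (np.exp(w1 @ x) + np.exp(w2 @ x))
p_sigmoid = 1.0 / (1.0 + np.exp(-(w1 - w2) @ x))   # theta((w1 - w2)^T x)
print(np.isclose(p_softmax, p_sigmoid))            # True
```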
Equivalence of Loss Function
- Now we show that minimizing the logistic regression loss is equivalent to minimizing the cross-entropy loss with binary outcomes.
- The equivalence between logistic regression loss and the cross-entropy loss, as shown below, proves that we always obtain identical weights w by minimizing the two losses. The equivalence between the losses, together with the equivalence between sigmoid and softmax, leads to the conclusion that the binary logistic regression is a particular case of multi-class logistic regression when K= 2.
Proof
Recall the binary LR loss with labels $y_n \in \{+1, -1\}$:
$$E(w) = \sum_{n=1}^{N} \log\left(1 + e^{-y_n w^T x_n}\right) = -\sum_{n=1}^{N} \log \theta(y_n w^T x_n).$$
Splitting the sum into the $y_n = +1$ and $y_n = -1$ cases and writing the prediction as $\hat{y}_n = \theta(w^T x_n)$,
$$E(w) = -\sum_{n:\, y_n = +1} \log \hat{y}_n \;-\; \sum_{n:\, y_n = -1} \log\left(1 - \hat{y}_n\right),$$
where the last expression is exactly the cross entropy loss with binary outcomes.
Notes:
- Remember the sigmoid: $\theta(z) = \frac{1}{1 + e^{-z}}$, so $\log(1 + e^{-z}) = -\log \theta(z)$.
  - In this case $z = y_n w^T x_n$, with $y_n \in \{+1, -1\}$.
  - Then the loss becomes $-\sum_n \log \theta(y_n w^T x_n)$.
- Then we write the two cases separately.
  - We want to write the +1 and −1 cases separately, i.e. split the sum over the samples with $y_n = +1$ and the samples with $y_n = -1$.
- Then you simply rewrite $\theta(w^T x_n)$ as the prediction function $\hat{y}_n$. This is just the prediction represented as a function; for the $y_n = -1$ case, use $\theta(-w^T x_n) = 1 - \hat{y}_n$.
  - Now you can see the pattern: identify which factor plays the role of the true label and which the prediction, and that is how you get to the last two forms.
- Finally you just need to recognize the cross entropy loss (see the check below).
  - Note that the definition of the cross entropy loss includes a negative sign; that is why there is no extra negative in front of the second log - it becomes positive when combined with the minus sign in the definition.
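A numerical check that the binary logistic loss $\sum_n \log(1 + e^{-y_n w^T x_n})$ with labels in {+1, −1} equals the cross entropy against $\hat{y}_n = \theta(w^T x_n)$ (random data, names are mine):

```python
import numpy as np

rng = np.random.default_rng(3)
N, d = 8, 4
X = rng.normal(size=(N, d))
y = rng.choice([-1.0, 1.0], size=N)   # binary labels in {+1, -1}
w = rng.normal(size=d)

z = X @ w
logistic_loss = np.sum(np.log1p(np.exp(-y * z)))

yhat = 1.0 / (1.0 + np.exp(-z))       # theta(w^T x_n)
t = (y + 1) / 2                       # map {+1, -1} -> {1, 0}
cross_entropy = -np.sum(t * np.log(yhat) + (1 - t) * np.log(1 - yhat))

print(np.isclose(logistic_loss, cross_entropy))   # True
```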
Homework tip:
- You need to take the derivative of E(w) with respect to a single $w_k$; in this case you consider all the other weight vectors as constants.
  - Stacking these derivatives over all K weight vectors, the full derivative will be a matrix.
- The termination condition in your code is therefore based on a matrix (the gradient), as in the sketch below.
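A sketch of how these tips might look in code: gradient descent on the full weight matrix, stopping when the gradient matrix is small (it reuses the hypothetical `loss_grad` from the sketch above; the learning rate, tolerance, and norm-based stopping rule are my own assumptions, not the course's):

```python
import numpy as np

def train(X, labels, K, lr=0.1, tol=1e-4, max_iters=1000):
    """Gradient descent on the cross entropy loss E(W)."""
    W = np.zeros((K, X.shape[1]))
    for _ in range(max_iters):
        G = loss_grad(W, X, labels)    # the gradient is a (K, d) matrix
        if np.linalg.norm(G) < tol:    # termination test on that matrix
            break
        W -= lr * G
    return W
```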
Learning Invariant Representations
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260129103048.png)
Notes:
- If you want to classify images, one challenge is that if an image is rotated, the pixels get completely shuffled, and this will confuse the model.
  - In the vanilla version of convolutional networks, your feature vector changes if you rotate the same image by some degrees.
- We need some transformations so that we can turn each of these images into a vector that remains the same if the image is rotated.
- Then, once we have a fixed-length vector, all we need to do to classify it is create a layer with K units and connect every element of the vector to every unit.
  - If you have 1000 classes, you get a 1000-dimensional output vector.
- The next layer will be the softmax, which then gives you a probability for each of the K classes.
- Then, on top of that, during training you stack a cross entropy loss.
From Logistic Regression to Deep Learning
How to learn X automatically?
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260129103544.png)
- We have some input (e.g. images, text, natural language).
- How do we compute the fixed-length feature vector?
- The next layer computes the K linear signals $w_k^T x$.
  - Each unit in the feature vector is connected to each unit in the output vector (every unit is connected to every unit).
- Then softmax (to get a predicted probability).
- Then training (cross entropy loss).