03 - Logistic Regression - From Binary to Multi-Class
Class: CSCE-421
Notes:
Multi-Class Classification
- Given a dataset $\{(x_n, y_n)\}_{n=1}^{N}$, where $x_n \in \mathbb{R}^d$ and $y_n \in \{1, \dots, K\}$ - K is the number of classes and N is the number of samples.
- We need to estimate the probability of x belonging to each of the K classes, $P(y = k \mid x)$ for $k = 1, \dots, K$.
- No ordinal relationship between the classes.
Example:
- For K = 10 (10 classes), each prediction will be a vector of 10 dimensions.
- The prediction is K-dimensional, but a raw K-dimensional output can have positive/negative entries, so it is not yet a probability distribution.
Multi-Class Logistic Regression
- We need K weight vectors $w_k \in \mathbb{R}^d$, $k = 1, \dots, K$ - one weight vector for each class.
- With these K weight vectors, we need to get to K class probabilities.
- Compute K linear signals by the dot product between the input x and each $w_k$, as $w_1^T x, \dots, w_K^T x$.
- We need to map the K outputs (a vector in $\mathbb{R}^K$) to the K probabilities (a probability distribution among the K classes).
Softmax
A function that is very useful in general, even outside of ML.
- Given a K-dimensional vector $s = (s_1, \dots, s_K)$,
- the output of the softmax function is a vector of probabilities: $\mathrm{softmax}(s)_k = \dfrac{e^{s_k}}{\sum_{j=1}^{K} e^{s_j}}$.
- The softmax maps a vector in $\mathbb{R}^K$ to a probability distribution over the K classes.
  - All elements in the output vector sum to 1.
Why is it called softmax?
- Note that the exponential applied to each component is a monotonic function; this means the ordering of the inputs is preserved in the outputs.
- Whichever component is largest in the input will still be the largest in the output, thanks to this property of the exponential term.
- Softmax = a "soft" version of the max: whichever entry is larger in the input will be larger in the output (see the sketch below).
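A minimal NumPy sketch of the softmax function (not from the lecture; the function name and the max-subtraction trick are my own additions, and the trick is justified by the shift-invariance property discussed later in these notes):

```python
import numpy as np

def softmax(s):
    """Map a K-dimensional score vector s to a probability vector."""
    z = np.exp(s - np.max(s))   # subtracting max(s) avoids overflow and does not change the output
    return z / np.sum(z)

s = np.array([2.0, 1.0, -1.0])
p = softmax(s)
print(p)         # the largest input stays the largest in the output
print(p.sum())   # 1.0
```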
Multi-Class Logistic Regression
- The multi-class prediction function can be written as $P(y = k \mid x) = \dfrac{e^{w_k^T x}}{\sum_{j=1}^{K} e^{w_j^T x}}$, i.e. the softmax applied to the K linear signals.
- The softmax output vector tells you the probability that x belongs to each of the K classes.
- This classifier can be used for any classification problem, since it works for an arbitrary number of classes K (multi-class).
  - Most other models only work for two-class classification.
- Although there are K vectors $w_k$, $k = 1, \dots, K$, only K−1 of them are independent, due to the sum-to-one constraint.
  - That is why we only need one weight vector for binary (two-class) problems.
- When K = 2, there is an equivalence relationship between softmax and θ(·) (see the sketch below and the proof later in these notes).
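A sketch of the multi-class prediction function, assuming the K weight vectors are stacked as the rows of a matrix W (the variable names and example numbers are mine, not the lecture's):

```python
import numpy as np

def predict_proba(W, x):
    """W: (K, d) matrix whose k-th row is w_k; x: (d,) input vector.

    Returns the K class probabilities softmax(w_1^T x, ..., w_K^T x).
    """
    s = W @ x                    # K linear signals w_k^T x
    z = np.exp(s - np.max(s))
    return z / z.sum()

W = np.array([[ 1.0, -0.5],
              [ 0.2,  0.8],
              [-1.0,  0.3]])    # K = 3 classes, d = 2 features
x = np.array([0.5, 2.0])
p = predict_proba(W, x)
print(p, p.argmax())            # class probabilities and the predicted class
```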
Training with Cross Entropy Loss
- A loss function measures the error between predictions and the true class labels.
- We need to measure the distance between two probability distributions.
- The cross entropy for a single sample is defined as $H(p, q) = -\sum_{k=1}^{K} p_k \log q_k$, where p is the true (one-hot) distribution and q is the predicted distribution.
  - This is the general definition of cross entropy.
  - Since p is one-hot, there is only one non-zero term: the term for the correct class, so the loss reduces to $-\log q_c$ for the true class c (see the sketch below).
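A minimal sketch of the cross entropy between a one-hot distribution p and a predicted distribution q (the function name and example numbers are my own):

```python
import numpy as np

def cross_entropy(p, q):
    """H(p, q) = -sum_k p_k * log(q_k)."""
    return -np.sum(p * np.log(q))

p = np.array([0.0, 1.0, 0.0])   # one-hot: the true class is the second one
q = np.array([0.2, 0.7, 0.1])   # predicted probabilities
print(cross_entropy(p, q), -np.log(0.7))   # both ≈ 0.357: only the correct-class term survives
```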
A Neural Network View
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260127101916.png)
- You have a d-dimensional input vector $x$.
- Then we compute $w_1^T x$, $w_2^T x$, and $w_3^T x$ (for 3 classes, so K = 3).
  - Look at the different colored edges: each color represents the connections belonging to one of $w_1$, $w_2$, or $w_3$.
  - If you have K classes, you need K units in this layer.
  - You are basically connecting every input to every output.
- This 3-dimensional vector (a vector with 3 numbers) is passed on to the softmax function, which outputs the probability that $x$ belongs to each class. This is our prediction.
  - Note that the softmax function has a normalization term; it couples all the terms together.
- Now let's run through a training step using the cross entropy loss function (see the sketch after this list).
  - A one-hot vector has exactly one element equal to 1 and all other elements equal to 0. It encodes the true class of the input $x$ and is fed into the cross entropy loss function.
  - The loss measures the difference between the true one-hot vector and the probability vector output by the softmax function.
  - For example, with K = 3 and true class 2, the one-hot vector looks like $[0, 1, 0]$.
  - Minimizing the cross entropy loss is equivalent to maximizing the predicted probability of the true class.
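A sketch of this forward pass plus the cross entropy loss for one training sample, tying the layers together (K = 3 classes; all names and numbers are illustrative, not from the lecture):

```python
import numpy as np

d, K = 4, 3
rng = np.random.default_rng(0)
W = rng.normal(size=(K, d))            # one weight vector (row) per class
x = rng.normal(size=d)                 # input vector
t = np.array([0.0, 1.0, 0.0])          # one-hot true label: class 2

s = W @ x                              # K linear signals (the colored edges)
y = np.exp(s - s.max()); y /= y.sum()  # softmax -> predicted probability vector
loss = -np.sum(t * np.log(y))          # cross entropy against the one-hot label
print(y, loss)
```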
Example exam question:
- Let's say you are given a predicted distribution and the true one-hot label:
  - q = [0.5, 0.4, 0.1] and p = [0, 1, 0]
  - The distribution of the probability over the other classes does not matter.
- What is the loss of this sample?
  - Just plug into the cross entropy loss function: $H(p, q) = -\sum_k p_k \log q_k$.
  - You just need to compute $-\log(0.4)$.
  - Answer: $-\log(0.4) \approx 0.92$ (see the check below).
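A quick numerical check of this example (plugging the vectors into the definition):

```python
import numpy as np

q = np.array([0.5, 0.4, 0.1])   # predicted probabilities
p = np.array([0.0, 1.0, 0.0])   # one-hot true label
print(-np.sum(p * np.log(q)))   # -log(0.4) ≈ 0.916
```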
Loss Function
Now the loss for a training sample x in class c is given by
$$E(w) = -\sum_{k=1}^{K} t_k \log P(y = k \mid x) = -\log P(y = c \mid x),$$
where $t_k$ is equal to 1 only for the true class.
- Summing this over all N training samples gives the total loss, for when you need the loss of multiple samples (see the sketch below).
- You have the weight matrix (all K weight vectors) as input and end up with a single number.
- When you compute the derivative of this output (a number) with respect to the weights, it becomes a matrix.
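A sketch of the total loss over N samples, taking the stacked weight matrix as input and returning a single number (it assumes the labels are given as integer class indices; names are mine):

```python
import numpy as np

def total_loss(W, X, labels):
    """W: (K, d) weights, X: (N, d) inputs, labels: (N,) integer class indices.

    Returns the summed cross entropy loss, a single scalar.
    """
    S = X @ W.T                                            # (N, K) linear signals
    S -= S.max(axis=1, keepdims=True)                      # harmless shift (see shift-invariance)
    Y = np.exp(S) / np.exp(S).sum(axis=1, keepdims=True)   # row-wise softmax
    return -np.sum(np.log(Y[np.arange(len(labels)), labels]))

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
labels = np.array([0, 1, 2])
W = np.zeros((3, 2))
print(total_loss(W, X, labels))   # 3*log(3) ≈ 3.296: the loss of uniform predictions
```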
Derivative of Loss Function
The notes (Logistic Regression: From Binary to Multi-Class) contain
details on the derivative of the cross entropy loss function, which is necessary for your homework. All you need are (see the sketch after this list):
- Univariate calculus
- Chain rule
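The full derivation is in the official notes; as a reference point, this is a sketch of the standard gradient of the softmax + cross entropy loss, $\partial E / \partial w_k = \sum_n (y_{nk} - t_{nk})\, x_n$, written in matrix form (variable names are mine):

```python
import numpy as np

def loss_grad(W, X, labels):
    """Gradient of the summed cross entropy loss with respect to W, shape (K, d).

    Standard softmax + cross entropy result: dE/dW = (Y - T)^T X,
    where Y holds the softmax outputs and T the one-hot labels.
    """
    N, K = X.shape[0], W.shape[0]
    S = X @ W.T
    S -= S.max(axis=1, keepdims=True)
    Y = np.exp(S) / np.exp(S).sum(axis=1, keepdims=True)   # (N, K) predicted probabilities
    T = np.zeros((N, K))
    T[np.arange(N), labels] = 1.0                          # (N, K) one-hot targets
    return (Y - T).T @ X
```

You can sanity-check a gradient like this against finite differences of `total_loss` before relying on it.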
Shift-invariance in Parameters
The softmax function in multi-class LR has an invariance property when shifting the parameters. Given the weights $w_k$, $k = 1, \dots, K$, if we subtract the same vector $u$ from every $w_k$, the output of the softmax function will remain the same.
Notes:
- If we subtract a vector u from each of the K weights $w_k$
  - (u is a vector of the same dimension as each of the $w_k$),
- the outputs of softmax will be equivalent.
- This is also why only K − 1 of the weight vectors are effectively independent: because of the normalization, one of them can always be shifted to zero.
Proof
To prove this, let us denote $\tilde{w}_k = w_k - u$. Then, for every k,
$$\frac{e^{\tilde{w}_k^T x}}{\sum_{j=1}^{K} e^{\tilde{w}_j^T x}} = \frac{e^{w_k^T x}\, e^{-u^T x}}{\sum_{j=1}^{K} e^{w_j^T x}\, e^{-u^T x}} = \frac{e^{w_k^T x}}{\sum_{j=1}^{K} e^{w_j^T x}},$$
which completes the proof. (A numerical check follows below.)
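A quick numerical check of the shift-invariance: subtracting the same vector u from every $w_k$ shifts every score by the same scalar $u^T x$, so the softmax output is unchanged (all names and numbers are mine):

```python
import numpy as np

def softmax(s):
    z = np.exp(s - s.max())
    return z / z.sum()

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 4))   # K = 3 weight vectors as rows
x = rng.normal(size=4)
u = rng.normal(size=4)        # arbitrary shift vector

print(np.allclose(softmax(W @ x), softmax((W - u) @ x)))   # True
```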
Equivalence to Sigmoid
Once we have proved the shift-invariance, we are able to show that when K = 2, the softmax-based multi-class LR is equivalent to the sigmoid-based binary LR. In particular, the hypotheses of the two LRs are equivalent.
Proof
By shift-invariance we can subtract $w_2$ from both weight vectors, so
$$P(y = 1 \mid x) = \frac{e^{w_1^T x}}{e^{w_1^T x} + e^{w_2^T x}} = \frac{e^{(w_1 - w_2)^T x}}{e^{(w_1 - w_2)^T x} + e^{0}} = \frac{1}{1 + e^{-(w_1 - w_2)^T x}} = \theta(w^T x),$$
where $w = w_1 - w_2$.
Notes:
- Note how, after subtracting $w_2$, one of the exponents becomes 0, so that exponential term gives 1.
- Note the expression in the third step: it is exactly the binary LR sigmoid θ(·).
- You can do a change of variables $w = w_1 - w_2$ (see the check below).
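A quick check that for K = 2 the softmax prediction equals the sigmoid of $(w_1 - w_2)^T x$ (random vectors, names are mine):

```python
import numpy as np

rng = np.random.default_rng(2)
w1, w2, x = rng.normal(size=(3, 5))   # two weight vectors and one input, all 5-dimensional

p_softmax = np.exp(w1 @ x) / (np.exp(w1 @ x) + np.exp(w2 @ x))
p_sigmoid = 1.0 / (1.0 + np.exp(-(w1 - w2) @ x))   # theta((w1 - w2)^T x)
print(np.isclose(p_softmax, p_sigmoid))            # True
```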
Equivalence of Loss Function
- Now we show that minimizing the logistic regression loss is equivalent to minimizing the cross-entropy loss with binary outcomes.
- The equivalence between logistic regression loss and the cross-entropy loss, as shown below, proves that we always obtain identical weights w by minimizing the two losses. The equivalence between the losses, together with the equivalence between sigmoid and softmax, leads to the conclusion that the binary logistic regression is a particular case of multi-class logistic regression when K= 2.
Proof
Recall the binary LR loss with labels $y_n \in \{+1, -1\}$:
$$E(w) = \sum_{n=1}^{N} \log\left(1 + e^{-y_n w^T x_n}\right) = -\sum_{n=1}^{N} \log \theta(y_n w^T x_n).$$
Splitting the sum into the $y_n = +1$ and $y_n = -1$ cases and writing the prediction as $\hat{y}_n = \theta(w^T x_n)$,
$$E(w) = -\sum_{n:\, y_n = +1} \log \hat{y}_n \;-\; \sum_{n:\, y_n = -1} \log\left(1 - \hat{y}_n\right),$$
where the last expression is exactly the cross entropy loss with binary outcomes.
Notes:
- Remember the sigmoid: $\theta(z) = \frac{1}{1 + e^{-z}}$, so $\log(1 + e^{-z}) = -\log \theta(z)$.
  - In this case $z = y_n w^T x_n$, with $y_n \in \{+1, -1\}$.
  - Then the loss becomes $-\sum_n \log \theta(y_n w^T x_n)$.
- Then we write the two cases separately.
  - We want to write the +1 and −1 cases separately, i.e. split the sum over the samples with $y_n = +1$ and the samples with $y_n = -1$.
- Then you simply rewrite $\theta(w^T x_n)$ as the prediction function $\hat{y}_n$. This is just the prediction represented as a function; for the $y_n = -1$ case, use $\theta(-w^T x_n) = 1 - \hat{y}_n$.
  - Now you can see the pattern: identify which factor plays the role of the true label and which the prediction, and that is how you get to the last two forms.
- Finally you just need to recognize the cross entropy loss (see the check below).
  - Note that the definition of the cross entropy loss includes a negative sign; that is why there is no extra negative in front of the second log - it becomes positive when combined with the minus sign in the definition.
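A numerical check that the binary logistic loss $\sum_n \log(1 + e^{-y_n w^T x_n})$ with labels in {+1, −1} equals the cross entropy against $\hat{y}_n = \theta(w^T x_n)$ (random data, names are mine):

```python
import numpy as np

rng = np.random.default_rng(3)
N, d = 8, 4
X = rng.normal(size=(N, d))
y = rng.choice([-1.0, 1.0], size=N)   # binary labels in {+1, -1}
w = rng.normal(size=d)

z = X @ w
logistic_loss = np.sum(np.log1p(np.exp(-y * z)))

yhat = 1.0 / (1.0 + np.exp(-z))       # theta(w^T x_n)
t = (y + 1) / 2                       # map {+1, -1} -> {1, 0}
cross_entropy = -np.sum(t * np.log(yhat) + (1 - t) * np.log(1 - yhat))

print(np.isclose(logistic_loss, cross_entropy))   # True
```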
Homework tip:
- You need to take the derivative of E(w) with respect to a single $w_k$; in this case you consider all the other weight vectors as constants.
  - Stacking these derivatives over all K weight vectors, the full derivative will be a matrix.
- The termination condition in your code is therefore based on a matrix (the gradient), as in the sketch below.
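A sketch of how these tips might look in code: gradient descent on the full weight matrix, stopping when the gradient matrix is small (it reuses the hypothetical `loss_grad` from the sketch above; the learning rate, tolerance, and norm-based stopping rule are my own assumptions, not the course's):

```python
import numpy as np

def train(X, labels, K, lr=0.1, tol=1e-4, max_iters=1000):
    """Gradient descent on the cross entropy loss E(W)."""
    W = np.zeros((K, X.shape[1]))
    for _ in range(max_iters):
        G = loss_grad(W, X, labels)    # the gradient is a (K, d) matrix
        if np.linalg.norm(G) < tol:    # termination test on that matrix
            break
        W -= lr * G
    return W
```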
Learning Invariant Representations
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260129103048.png)
Notes:
- If you want to classify images, one challenge is that if an image is rotated, the pixels get completely shuffled, and this will confuse the model.
  - In the vanilla version of convolutional networks, your feature vector changes if you rotate the same image by some degrees.
- We need some transformations so that we can turn each of these images into a vector that remains the same if the image is rotated.
- Then, once we have a fixed-length vector, all we need to do to classify it is create a layer with K units and connect every element of the vector to every unit.
  - If you have 1000 classes, you get a 1000-dimensional output vector.
- The next layer will be the softmax, which then gives you a probability for each of the K classes.
- Then, on top of that, during training you stack a cross entropy loss.
From Logistic Regression to Deep Learning
How to learn X automatically?
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260129103544.png)
- We have some input (e.g. images, text, natural language).
- How do we compute the fixed-length feature vector?
- The next layer computes the K linear signals $w_k^T x$.
  - Each unit in the feature vector is connected to each unit in the output vector (every unit is connected to every unit).
- Then softmax (to get a predicted probability).
- Then training (cross entropy loss).