03 - Logistic Regression - From Binary to Multi-Class
Class: CSCE-421
Notes:
Multi-Class Classification
The Math
- Given the dataset $\{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$, where $\mathbf{x}_n \in \mathbb{R}^{d+1}$ and $y_n \in \{1, \dots, K\}$ - K is the number of classes and N is the number of samples.
- We need to estimate the probability of x belonging to each of the K classes, $P(y = k \mid \mathbf{x})$ for $k = 1, \dots, K$.
- No ordinal relationship between the classes.
Example:
- For K = 10 (10 classes), each prediction will be a vector of 10 dimensions.
- The raw score vector is K-dimensional, but its entries can be positive or negative (not yet probabilities).
The Meaning
1. The Big Picture: Beyond "Yes or No". Up until now, you have been dealing with Binary Classification problems: questions with only two answers (e.g., "Will they get a heart attack? Yes or No" or "Is this digit a 1 or a 5?"). Now, we are moving to Multi-Class Classification, where there are K possible answers (e.g., classifying a digit as any of 0 through 9).
2. The Data and "No Ordinal Relationship"
- Your input $\mathbf{x}$ is still the same (a vector of features like intensity, symmetry, etc., plus the dummy coordinate).
- Your output $y$ is now a number from $1$ to $K$ representing the class.
- Crucial concept: Your notes highlight that there is no ordinal relationship between these classes. This means class 2 is not "greater" or "better" than class 1. The numbers are just meaningless nametags (like "apple", "banana", "orange"). You cannot do standard math on the label $y$ anymore, because predicting a 4 when the answer is 2 is not "twice as wrong" as predicting a 3.
3. The Prediction Goal: Because we cannot just output a single number, the model outputs a probability for every class:
- "I am 10% sure this is a 0."
- "I am 85% sure this is a 1."
- "I am 5% sure this is a 2...", and so on.
Multi-Class Logistic Regression
The Math
- We need K weight vectors $\mathbf{w}_k$, $k = 1, \dots, K$.
    - We need a vector for each class.
    - Stacking the K sub-vectors gives a weight matrix $W$ whose rows are the $\mathbf{w}_k$.
- Compute K linear signals by the dot product between the input $\mathbf{x}$ and each $\mathbf{w}_k$, as $s_k = \mathbf{w}_k^\top \mathbf{x}$.
- We need to map the K outputs (as a vector in $\mathbb{R}^K$) to the K probabilities (as a probability distribution among the K classes).
The Meaning
1. How We Build It:
- We create K entirely different weight vectors: $\mathbf{w}_1, \dots, \mathbf{w}_K$. $\mathbf{w}_1$ is specifically trained to look for patterns that identify class 1, $\mathbf{w}_2$ looks for class 2, etc.
2. The Raw Signals: When a new data point $\mathbf{x}$ arrives, $\mathbf{w}_1^\top \mathbf{x}$ gives us the raw score for class 1, $\mathbf{w}_2^\top \mathbf{x}$ gives us the raw score for class 2, and so on.
- This results in a stacked column of K raw scores (the vector in $\mathbb{R}^K$ shown at the bottom of your slide).
- The cliffhanger: Because these are raw dot products, the numbers could be anything: negative, huge, or tiny. They are not probabilities yet.
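As a quick sketch of this step (all weights and inputs below are made-up illustration values, not from the slides), the raw signals are just K dot products, and nothing constrains them to look like probabilities:

```python
# Hypothetical 3-class example; weights and input are made-up numbers.
w = [[0.5, -1.2, 2.0],   # w_1: score detector for class 1
     [-0.3, 0.8, -1.5],  # w_2: score detector for class 2
     [1.1, 0.2, 0.4]]    # w_3: score detector for class 3
x = [1.0, 2.0, -1.0]     # input vector (first entry plays the dummy coordinate)

# K raw linear signals s_k = w_k . x; nothing forces them into [0, 1]
signals = [sum(wi * xi for wi, xi in zip(wk, x)) for wk in w]
print(signals)  # mixes negative and positive values: not probabilities yet
```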
Softmax
A function that is very useful in general, even outside of ML.
The Math
- Given a K-dimensional vector $\mathbf{s} = (s_1, \dots, s_K)^\top$
- The output of the softmax function is a vector of probabilities:
    $\mathrm{softmax}(\mathbf{s})_k = \frac{e^{s_k}}{\sum_{j=1}^{K} e^{s_j}}, \quad k = 1, \dots, K$
- The softmax maps a vector in $\mathbb{R}^K$ to a probability distribution over the K classes.
- All elements in the output vector sum to 1.
Why is it called softmax?
- Note each exponential component is a monotonic function; this means the ordering of the inputs is preserved in the outputs.
- Whichever component is largest in the input will still be the largest in the output, thanks to this property of the exponential term.
- Softmax = whichever is larger in the input will be larger in the output.
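The definition above can be sketched as a small function (the max-subtraction line is a standard numerical-stability trick, and it works precisely because of the shift-invariance property covered later in these notes):

```python
import math

def softmax(s):
    """Map K raw scores to K probabilities that are positive and sum to 1."""
    m = max(s)                           # subtracting the max is a standard
    exps = [math.exp(v - m) for v in s]  # stability trick (shift-invariance)
    total = sum(exps)
    return [v / total for v in exps]

scores = [2.0, 1.0, 0.1]  # made-up raw scores
probs = softmax(scores)
print(probs)                    # all positive, sums to 1
print(probs.index(max(probs)))  # 0: the largest input stays the largest output
```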
The Meaning
1. The Problem: Turning Raw Scores into Probabilities. In the previous slide, we built K raw scores. To make sense of these numbers, we need to convert them into a valid probability distribution, which has two strict rules:
- Every individual probability must be between $0$ and $1$ ($0\%$ to $100\%$).
- All the probabilities must add up to exactly $1$ ($100\%$).
2. The Solution: The Softmax Function The Softmax function takes any vector of raw numbers and perfectly transforms it to follow those two rules. Here is how the math does it step-by-step:
- Step 1 (Exponentials, $e^{s_k}$): First, it takes the number $e$ (Euler's number, $\approx 2.718$) and raises it to the power of each raw score. Because $e$ to any power is always a positive number, this trick instantly eliminates all negative raw scores.
- Step 2 (Normalization, $\sum_j e^{s_j}$): Next, it adds up all those new positive numbers to find a total sum. Then, it divides each individual positive number by that total sum. Mathematically, if you divide parts by their whole, they are guaranteed to add up to exactly $1$.
3. Clarifying: Why is it called "Softmax"? Your notes point out that the exponential function ($e^x$) is monotonic, so it preserves the size order of the raw scores.
- The "Max" part: Because it preserves this size order, the class that had the highest raw score will end up with the highest probability.
- The "Soft" part: A standard "Hard Max" function would look at the scores and give 100% to the absolute biggest and 0% to everything else. "Softmax" is a softer version. It highlights the biggest number by giving it the largest percentage, but it still assigns small fractional probabilities to the others depending on how relatively large they were.
Multi-Class Logistic Regression
The Math
- The multi-class prediction function can be written as $h(\mathbf{x}) = \mathrm{softmax}(W\mathbf{x})$
- The softmax output vector tells you the probability that $\mathbf{x}$ belongs to each of the K classes.
- This is a classifier you can use for any classification problem, since it works for as many classes as K (multi-class).
    - Most other models only work for two-class classification.
- Although there are K vectors $\mathbf{w}_k$, $k = 1, \dots, K$, only K−1 of them are independent, due to the sum-to-one constraint.
    - That is why we only need one vector for binary class problems.
- When K = 2, there is an equivalence relationship between softmax and the sigmoid $\theta(\cdot)$.
The Meaning
1. Multi-Class Logistic Regression Output: To finalize our model, we just plug our K raw scores into the Softmax function.
2. The K−1 Independence
- Because the Softmax output must sum to $1$, the probabilities are deeply linked. If you have 3 classes and you know that Class 1 has, say, a 20% probability and Class 2 a 50% probability, you automatically know Class 3 must be 30%.
- Because the last class is automatically determined by the others, only K−1 of the weight vectors are truly mathematically independent.
- Connecting back to Binary: If we only have 2 classes (K = 2), K−1 equals exactly 1. This explains why in standard Binary Logistic Regression (predicting Heart Attack vs. No Heart Attack), we only ever trained one weight vector $\mathbf{w}$!
- Furthermore, if you plug K = 2 into the Softmax formula, the math simplifies perfectly into the exact same S-curve (Sigmoid) function we learned in the previous chapter. Multi-class Logistic Regression is just the generalized version of Binary Logistic Regression.
Training with Cross Entropy Loss
The Math
- A loss function measures the error between predictions and the true class labels.
- We need to measure the distance between two probability distributions.
- The cross entropy for a single sample is defined as
    $e(\mathbf{p}, \mathbf{q}) = -\sum_{k=1}^{K} p_k \log q_k$
- This is the general definition of cross entropy.
- There is only one term that is non-zero; that term corresponds to the correct class.
- $\mathbf{p}$ is the one-hot vector of the true class.
- $\mathbf{q}$ is the predicted probability vector from the softmax.
The Meaning
1. The Goal: Measuring the Distance between Probabilities In the previous slides, we used the Softmax function to make our model output a list of probabilities (e.g., "I am 10% sure this is class A, 70% sure it's class B, and 20% sure it's class C"). To train the model, we need an error function (a loss function) that measures how "far away" our predicted probabilities are from the absolute truth.
2. The One-Hot Vector ($\mathbf{p}$)
- It is called "one-hot" because only one element is "hot" (set to 1), and all the rest are cold (set to 0). This vector is our $\mathbf{p}$ (the true probability distribution). Our model's prediction from the Softmax is our $\mathbf{q}$.
3. The General Cross-Entropy Formula: To measure the distance between $\mathbf{p}$ and $\mathbf{q}$, we compute $e(\mathbf{p}, \mathbf{q}) = -\sum_{k} p_k \log q_k$.
4. The Beautiful Simplification: Your notes point out a fantastic shortcut: "There is only one term that is non-zero." Because our true vector $\mathbf{p}$ is one-hot (e.g., $[0, 1, 0]$), almost every $p_k$ is zero, so the whole sum collapses to a single term: $-\log q_c$, where $c$ is the correct class.
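A quick numerical check of this collapse, with a made-up prediction vector:

```python
import math

p = [0, 1, 0]        # one-hot truth: the correct class is index 1
q = [0.1, 0.7, 0.2]  # made-up softmax output

# Full cross entropy: -sum_k p_k * log(q_k) ...
full = -sum(p_k * math.log(q_k) for p_k, q_k in zip(p, q))
# ... collapses to a single term: -log of the true class's predicted probability
collapsed = -math.log(q[p.index(1)])
print(full, collapsed)  # both equal -log(0.7)
```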
A Neural Network View
The Math
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260127101916.png)
- You have a d-dimensional vector ($\mathbf{x}$) -> this is the input vector.
- Then we compute $\mathbf{w}_1^\top \mathbf{x}$, $\mathbf{w}_2^\top \mathbf{x}$, and $\mathbf{w}_3^\top \mathbf{x}$ (for 3 classes) -> K = 3.
    - Look at the different colored edges; each color represents the connections feeding one of the $\mathbf{w}_k^\top \mathbf{x}$ units.
    - If you have K classes, you need K units in this layer.
    - You are basically connecting every input to every output.
- This 3-dimensional vector (a vector with 3 numbers) is passed on to the softmax function, which outputs the probability that $\mathbf{x}$ belongs to each class. This is our prediction.
    - Note that the softmax function contains a normalization term; it couples all the terms together.
- Now let's run through a training session using the cross entropy loss function.
    - A one-hot vector has exactly one element equal to 1 and every other element equal to 0. It is the true value that represents which class the vector $\mathbf{x}$ really belongs to, and it is the second input to the cross entropy loss function.
    - The loss measures the difference between the true class and the probability vector output by the softmax function.
    - Note how the one-hot vector would look, e.g., $[0, 1, 0]$ when the true class is the second of three.
    - Minimizing the cross entropy loss is equivalent to maximizing the predicted probability of the correct class.
Example exam question:
- Let's say you got this prediction and truth for one sample:
    - q = `[0.5 0.4 0.1]` and p = `[0 1 0]`
    - The distribution of probability over the other classes does not matter; only the entry for the true class enters the loss.
- What is the loss of this sample?
    - Just plug into the cross entropy loss function.
    - You just need to compute $-\log(0.4)$.
- Answer: -0.4 (but see the correction in The Meaning below: the loss should be positive).
The Meaning
1. The Neural Network View (The Pipeline): The slide ties everything together into a "Neural Network" pipeline:
- Step 1 (Input): You start with your input data vector $\mathbf{x}$.
- Step 2 (Linear Signals): You branch out to K different "units". Each unit has its own weights ($\mathbf{w}_k$) and calculates a raw dot-product score ($\mathbf{w}_k^\top \mathbf{x}$).
- Step 3 (Softmax): You pass all K raw scores into the Softmax function. The Softmax normalizes them, forcing them to become valid probabilities that add up to 1 (this is the $\mathbf{q}$ vector).
- Step 4 (Loss): You take the $\mathbf{q}$ vector and the true one-hot vector $\mathbf{p}$, and plug them into the Cross Entropy Loss function to calculate your error. During training, the computer tweaks the weights ($\mathbf{w}_k$) to make this error as small as possible.
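The four steps above can be sketched as one function (the weights, input, and the names `forward_and_loss` and `true_class` are made up for this sketch):

```python
import math

def forward_and_loss(W, x, true_class):
    """One pass through the pipeline: signals -> softmax -> cross entropy."""
    signals = [sum(wi * xi for wi, xi in zip(wk, x)) for wk in W]  # Step 2
    m = max(signals)
    exps = [math.exp(s - m) for s in signals]
    total = sum(exps)
    q = [e / total for e in exps]                                  # Step 3
    return -math.log(q[true_class])                                # Step 4

# Made-up weights and input for a 3-class problem (Step 1)
W = [[0.2, -0.1], [0.5, 0.3], [-0.4, 0.8]]
x = [1.0, 2.0]
loss = forward_and_loss(W, x, true_class=1)
print(loss)  # a positive penalty; training tweaks W to shrink it
```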
2. Correcting your Exam Question Note! Let's look at your example question (q = [0.5, 0.4, 0.1], p = [0, 1, 0]).
Here is the error in your notes: you wrote that the final answer is -0.4. Loss is basically a penalty, and it must always be a positive number here. Because probabilities are decimals between 0 and 1, taking the logarithm of a decimal always results in a negative number. The formula puts a negative sign at the very front specifically to cancel that out and make the final loss positive!
- If your professor uses $\log_{10}$, the math is: $-\log_{10}(0.4) \approx 0.398$ (which rounds to $0.4$).
- If your professor uses the natural log ($\ln$), the math is: $-\ln(0.4) \approx 0.916$.
- Make sure you write down a positive number on your exam!
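Both answers can be verified directly:

```python
import math

q = [0.5, 0.4, 0.1]  # predicted probability vector
p = [0, 1, 0]        # one-hot truth: the class at index 1 is correct

# Only the true class's term survives: loss = -log(q[1]) = -log(0.4)
loss_ln    = -math.log(q[1])     # natural log
loss_log10 = -math.log10(q[1])   # base-10 log
print(round(loss_ln, 3), round(loss_log10, 3))  # 0.916 0.398, both positive
```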
Multi-Class Neural Network Pipeline:
- Input vector $\mathbf{x}$.
- Compute K linear signals: $s_k = \mathbf{w}_k^\top \mathbf{x}$.
- Apply Softmax to convert the signals into the probability vector $\mathbf{q}$.
- Compute the Cross Entropy Loss between $\mathbf{q}$ and the one-hot truth $\mathbf{p}$.
- Minimize the loss to maximize the predicted probability of the correct class.
Loss Function
The Math
Now the loss for a training sample $\mathbf{x}$ in class $c$ is given by

$e(\mathbf{x}) = -\log q_c = -\log \frac{e^{\mathbf{w}_c^\top \mathbf{x}}}{\sum_{j=1}^{K} e^{\mathbf{w}_j^\top \mathbf{x}}}$

and, summed over the dataset,

$E(W) = -\sum_{n=1}^{N} \sum_{k=1}^{K} \mathbb{1}[y_n = k] \, \log \frac{e^{\mathbf{w}_k^\top \mathbf{x}_n}}{\sum_{j=1}^{K} e^{\mathbf{w}_j^\top \mathbf{x}_n}}$

where:
- This is the form for when you need the loss of multiple samples.
- The indicator $\mathbb{1}[y_n = k]$ is equal to 1 only for the true class.
- You have the weight matrix $W$ as input and end up with a number (a scalar).
- When you compute the derivative of this scalar with respect to $W$, the output becomes a matrix.
The Meaning
1. The Loss Function (Putting it all together): In the previous slides, we learned that the cross-entropy loss for a single data point collapses to $-\log q_c$, where $c$ is the true class.
- Single Sample: $e(\mathbf{x}) = -\log q_c$. Substituting the prediction $q_c$ with the actual Softmax formula gives $-\log \left( e^{\mathbf{w}_c^\top \mathbf{x}} / \sum_{j} e^{\mathbf{w}_j^\top \mathbf{x}} \right)$.
- Multiple Samples ($N$ samples): To get the total loss for the entire training dataset, we just add up the individual losses for all samples.
    - The formula uses a double summation: $E(W) = -\sum_{n=1}^{N} \sum_{k=1}^{K} \mathbb{1}[y_n = k] \log q_{n,k}$.
    - The term $\mathbb{1}[y_n = k]$ is called an Indicator Function. It acts exactly like your one-hot vector: it equals $1$ if sample $n$'s true class is $k$, and $0$ otherwise. This mathematically ensures that for every sample, we only calculate the log-loss for its true class.
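The double summation can be sketched on a tiny made-up dataset (the inner loop mirrors the indicator literally, even though only one term per sample survives):

```python
import math

def softmax(s):
    m = max(s)
    exps = [math.exp(v - m) for v in s]
    total = sum(exps)
    return [v / total for v in exps]

def total_loss(W, X, y, K):
    """E(W) = -sum_n sum_k 1[y_n == k] * log(q_{n,k})."""
    E = 0.0
    for x_n, y_n in zip(X, y):
        q = softmax([sum(wi * xi for wi, xi in zip(wk, x_n)) for wk in W])
        for k in range(K):
            indicator = 1 if y_n == k else 0  # the indicator 1[y_n = k]
            if indicator:
                E -= math.log(q[k])           # only the true class contributes
    return E

# Tiny made-up dataset: 2 samples, 2 features, 3 classes
W = [[0.1, 0.2], [0.0, -0.1], [0.3, 0.0]]
X = [[1.0, 0.5], [1.0, -1.0]]
y = [0, 2]
E = total_loss(W, X, y, K=3)
print(E)  # a single positive number summarizing the whole dataset
```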
Derivative of Loss Function
The notes (Logistic Regression: From Binary to Multi-Class) contain the details of the derivative of the cross entropy loss function, which is necessary for your homework. All you need are:
- Univariate calculus
- Chain rule
Taking the Derivative (The Calculus Note) To train the model, we need to use Gradient Descent, which means taking the derivative of this giant loss function to find the slope.
- Clarifying your note: You wrote "You have the w matrix as input and end up with a number. When you compute the derivative, this output (a number) becomes a matrix." You are absolutely correct! The loss function takes in a matrix of weights ($W$) and calculates a single number (a scalar representing the total error). When you take the derivative of a single number with respect to a matrix, the result is a matrix of slopes (the gradient). This gradient matrix tells the computer exactly how to tweak every single weight simultaneously to decrease the loss.
- To calculate this for your homework, you have to use the Chain Rule (working from the outside of the log function, to the inside of the softmax fraction, down to the dot product).
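The derivation itself is left for the homework; for reference, the standard chain-rule result for this loss is, per sample, $\partial e / \partial \mathbf{w}_k = (q_k - \mathbb{1}[k = c])\,\mathbf{x}$. The sketch below (made-up numbers) checks that claim against a finite difference:

```python
import math

def softmax(s):
    m = max(s)
    exps = [math.exp(v - m) for v in s]
    total = sum(exps)
    return [v / total for v in exps]

def loss(W, x, c):
    """Cross entropy loss of one sample x with true class c."""
    q = softmax([sum(wi * xi for wi, xi in zip(wk, x)) for wk in W])
    return -math.log(q[c])

# One made-up sample with 3 classes; true class is c = 1
W = [[0.2, -0.1], [0.5, 0.3], [-0.4, 0.8]]
x, c = [1.0, 2.0], 1

# Chain-rule result: dE/dw_k = (q_k - 1[k == c]) * x, one row per class
q = softmax([sum(wi * xi for wi, xi in zip(wk, x)) for wk in W])
grad = [[(q[k] - (1 if k == c else 0)) * xi for xi in x] for k in range(3)]

# Sanity-check one entry with a finite difference
eps = 1e-6
W_pert = [row[:] for row in W]
W_pert[0][1] += eps
numeric = (loss(W_pert, x, c) - loss(W, x, c)) / eps
print(abs(numeric - grad[0][1]) < 1e-4)  # True
```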
Shift-invariance in Parameters
The Math
The softmax function in multi-class LR has an invariance property when shifting the parameters: given the weights $\mathbf{w}_1, \dots, \mathbf{w}_K$, if we subtract the same vector $\mathbf{u}$ from each of them, the output of the softmax function will remain the same.
Notes:
- If we subtract a $\mathbf{u}$ from each of the K weights $\mathbf{w}_k$
    - ($\mathbf{u}$ is a vector of the same dimension as each of the $\mathbf{w}_k$)
- The outputs of softmax will be equivalent.
- The reason why the output is K − 1 dimensional is because of normalization?
Proof
To prove this, let us denote the shifted weights $\tilde{\mathbf{w}}_k = \mathbf{w}_k - \mathbf{u}$. Then
$\frac{e^{\tilde{\mathbf{w}}_k^\top \mathbf{x}}}{\sum_{j=1}^{K} e^{\tilde{\mathbf{w}}_j^\top \mathbf{x}}} = \frac{e^{\mathbf{w}_k^\top \mathbf{x}} \, e^{-\mathbf{u}^\top \mathbf{x}}}{e^{-\mathbf{u}^\top \mathbf{x}} \sum_{j=1}^{K} e^{\mathbf{w}_j^\top \mathbf{x}}} = \frac{e^{\mathbf{w}_k^\top \mathbf{x}}}{\sum_{j=1}^{K} e^{\mathbf{w}_j^\top \mathbf{x}}},$
which completes the proof.
The Meaning
1. Shift-Invariance in Parameters (A Cool Math Trick): This is a fascinating and highly useful property of the Softmax function. It states that if you take every single weight vector, for every single class, and subtract the exact same vector $\mathbf{u}$ from each one, the output probabilities do not change at all.
2. Walking through the Proof: The algebra perfectly shows why this happens:
- Let's replace our old weights $\mathbf{w}_k$ with our shifted weights: $\mathbf{w}_k - \mathbf{u}$.
- Plug this into the Softmax numerator: $e^{(\mathbf{w}_k - \mathbf{u})^\top \mathbf{x}}$.
- Using vector algebra, we distribute the $\mathbf{x}$: $e^{\mathbf{w}_k^\top \mathbf{x} - \mathbf{u}^\top \mathbf{x}}$.
- Using the rules of exponents ($e^{a-b} = e^a e^{-b}$), we can split this into two parts multiplied together: $e^{\mathbf{w}_k^\top \mathbf{x}} \cdot e^{-\mathbf{u}^\top \mathbf{x}}$.
- The Magic: If we do this for the denominator as well, notice that the term $e^{-\mathbf{u}^\top \mathbf{x}}$ has absolutely no $k$ in it. It is a constant, identical factor for every single class.
- Because it is identical for every item in the denominator's sum, we can factor it completely out of the summation.
- Now we have $e^{-\mathbf{u}^\top \mathbf{x}}$ on the top and $e^{-\mathbf{u}^\top \mathbf{x}}$ on the bottom. They perfectly cancel each other out, leaving us with the exact original Softmax formula!
3. Answering your Note's Question: You asked: "The reason why the output is K - 1 dimension is because of normalization?" Yes, exactly! Because the probabilities must normalize (sum to 1), knowing the probabilities of any K−1 classes automatically determines the last one. Shift-invariance lets us exploit this: choose $\mathbf{u} = \mathbf{w}_K$.
- The new weights for the last class become $\mathbf{w}_K - \mathbf{w}_K = \mathbf{0}$ (a vector of all zeros).
- This proves that we don't actually need to train K different weight vectors. We only need to train K−1 vectors, because we can always just lock the last class's weight vector to $\mathbf{0}$ and the math will still perfectly work.
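Both claims can be checked numerically, choosing $\mathbf{u} = \mathbf{w}_3$ (the weights and input below are made up):

```python
import math

def softmax(s):
    m = max(s)
    exps = [math.exp(v - m) for v in s]
    total = sum(exps)
    return [v / total for v in exps]

w = [[0.5, 1.0], [-0.3, 0.2], [0.9, -0.7]]  # made-up weights, K = 3
u = w[2]                                     # shift by w_3: last class gets zeroed
x = [1.0, 2.0]

shifted = [[wi - ui for wi, ui in zip(wk, u)] for wk in w]
print(shifted[2])  # [0.0, 0.0] -- the last class's weights are "locked" to zero

before = softmax([sum(wi * xi for wi, xi in zip(wk, x)) for wk in w])
after  = softmax([sum(wi * xi for wi, xi in zip(wk, x)) for wk in shifted])
print(all(abs(b - a) < 1e-12 for b, a in zip(before, after)))  # True
```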
Equivalence to Sigmoid
Once we have proved shift-invariance, we are able to show that when K = 2, the softmax-based multi-class LR is equivalent to the sigmoid-based binary LR. In particular, the hypotheses of the two LRs are equivalent.
Proof
For K = 2, use shift-invariance with $\mathbf{u} = \mathbf{w}_2$ and define $\mathbf{w} = \mathbf{w}_1 - \mathbf{w}_2$. Then
$P(y = 1 \mid \mathbf{x}) = \frac{e^{\mathbf{w}_1^\top \mathbf{x}}}{e^{\mathbf{w}_1^\top \mathbf{x}} + e^{\mathbf{w}_2^\top \mathbf{x}}} = \frac{e^{(\mathbf{w}_1 - \mathbf{w}_2)^\top \mathbf{x}}}{e^{(\mathbf{w}_1 - \mathbf{w}_2)^\top \mathbf{x}} + e^{\mathbf{0}^\top \mathbf{x}}} = \frac{e^{\mathbf{w}^\top \mathbf{x}}}{e^{\mathbf{w}^\top \mathbf{x}} + 1} = \frac{1}{1 + e^{-\mathbf{w}^\top \mathbf{x}}} = \theta(\mathbf{w}^\top \mathbf{x}),$
where $\theta(\cdot)$ is the sigmoid function from binary logistic regression.
Notes:
- Note how the exponent cancels and gives $e^{\mathbf{0}^\top \mathbf{x}} = 1$ for the shifted second class.
- Note the numerator in the third step; it is exactly the binary sigmoid form: $\theta(s) = \frac{e^s}{e^s + 1} = \frac{1}{1 + e^{-s}}$.
- You can do a change of variables: $\mathbf{w} = \mathbf{w}_1 - \mathbf{w}_2$.
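A numerical check of the equivalence, with made-up weight vectors for the two classes:

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

w1, w2 = [0.7, -0.4], [0.1, 0.9]  # made-up weights for class 1 and class 2
x = [1.0, 2.0]

s1 = sum(a * b for a, b in zip(w1, x))
s2 = sum(a * b for a, b in zip(w2, x))

# Softmax probability of class 1 when K = 2
p_softmax = math.exp(s1) / (math.exp(s1) + math.exp(s2))
# Sigmoid after the change of variables w = w1 - w2
p_sigmoid = sigmoid(s1 - s2)
print(abs(p_softmax - p_sigmoid) < 1e-12)  # True
```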
Equivalence of Loss Function
- Now we show that minimizing the logistic regression loss is equivalent to minimizing the cross-entropy loss with binary outcomes.
- The equivalence between logistic regression loss and the cross-entropy loss, as shown below, proves that we always obtain identical weights w by minimizing the two losses. The equivalence between the losses, together with the equivalence between sigmoid and softmax, leads to the conclusion that the binary logistic regression is a particular case of multi-class logistic regression when K= 2.
Proof
With $y_n \in \{+1, -1\}$ and $h(\mathbf{x}) = \theta(\mathbf{w}^\top \mathbf{x})$, the two cases are
$P(y = +1 \mid \mathbf{x}) = \theta(\mathbf{w}^\top \mathbf{x}) = h(\mathbf{x})$ and $P(y = -1 \mid \mathbf{x}) = \theta(-\mathbf{w}^\top \mathbf{x}) = 1 - h(\mathbf{x})$,
so the per-sample logistic loss satisfies
$\ln\left(1 + e^{-y_n \mathbf{w}^\top \mathbf{x}_n}\right) = -\ln \theta(y_n \mathbf{w}^\top \mathbf{x}_n) = \begin{cases} -\ln h(\mathbf{x}_n) & y_n = +1 \\ -\ln\left(1 - h(\mathbf{x}_n)\right) & y_n = -1 \end{cases}$
which is exactly the cross-entropy loss with binary outcomes,
where $\theta(s) = \frac{1}{1 + e^{-s}}$ is the sigmoid.
Notes:
- Remember: $\theta(-s) = 1 - \theta(s)$.
- In this case $y_n \in \{+1, -1\}$.
- Then we write the two cases separately:
    - We want to write the +1 and -1 cases separately; this means $P(y = +1 \mid \mathbf{x}) = \theta(\mathbf{w}^\top \mathbf{x})$ and $P(y = -1 \mid \mathbf{x}) = \theta(-\mathbf{w}^\top \mathbf{x})$.
- Then you simply rewrite $\theta(\mathbf{w}^\top \mathbf{x})$ as the prediction function $h(\mathbf{x})$. This is just the prediction represented as a function.
    - Now you can see a pattern and can identify input from output; that is how you get to the last two forms.
- Finally you just need to compute the cross entropy loss.
    - Note that the definition of the cross entropy loss includes a negative sign; that is why there is no negative on the second log: it became positive when combined with the negative out front.
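A numerical check that the per-sample logistic loss $\ln(1 + e^{-y\,\mathbf{w}^\top \mathbf{x}})$ matches the binary cross entropy for both label cases (weights and sample are made up):

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

w = [0.3, -0.8]  # made-up weight vector
x = [1.0, 1.5]   # made-up sample
s = sum(a * b for a, b in zip(w, x))  # the signal w . x

ok = []
for y in (+1, -1):
    logistic = math.log(1.0 + math.exp(-y * s))  # logistic regression loss
    p = 1 if y == +1 else 0                      # binary "one-hot" truth
    h = sigmoid(s)                               # the prediction h(x)
    cross_ent = -(p * math.log(h) + (1 - p) * math.log(1.0 - h))
    ok.append(abs(logistic - cross_ent) < 1e-12)
print(ok)  # [True, True]
```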
Homework tip:
- You need to take the derivative of E(w) with respect to a single $\mathbf{w}_k$; in this case you will consider all other $\mathbf{w}_j$'s as constants.
- The full derivative (the gradient) will be a matrix.
- The termination condition in your code therefore checks a matrix (e.g., that the norm of the gradient matrix is small enough).
Learning Invariant Representations
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260129103048.png)
Notes:
- If you want to classify images, one challenge is that if an image is rotated, its pixels get completely shuffled, and this will confuse the model.
    - In the vanilla version of convolutional networks, your feature vector $\mathbf{x}$ changes if you rotate the same image by some degrees.
- We need some transformations so that we can transform each of these images into a vector that remains the same if the image is turned around.
- Then, once we have a fixed-length vector $\mathbf{x}$, all we need to do to classify it is create a layer with K units and connect every element of $\mathbf{x}$ to every unit.
    - If you have 1000 classes, you get a 1000-dimensional vector.
- Then the next layer is the softmax, which gives you a probability for each of the K classes.
- Then on top, during training, you stack a cross entropy loss.
From Logistic Regression to Deep Learning
How to learn x automatically?
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260129103544.png)
- We have some input (e.g. images, text, natural language).
- How do we compute the fixed-length vector $\mathbf{x}$?
- The next layer computes the K linear signals $s_k = \mathbf{w}_k^\top \mathbf{x}$.
    - Each unit in the signal vector is connected to each unit in the $\mathbf{x}$ vector. (Every unit is connected to every unit.)
- Then softmax (to get a predicted probability)
- Then training (cross entropy loss)