03 - Logistic Regression - From Binary to Multi-Class

Class: CSCE-421


Notes:

Multi-Class Classification

The Math
$$h_{\mathbf{w}}(\mathbf{x})=\begin{bmatrix} P(y=1\mid\mathbf{x};\mathbf{w}) \\ P(y=2\mid\mathbf{x};\mathbf{w}) \\ \vdots \\ P(y=K\mid\mathbf{x};\mathbf{w}) \end{bmatrix}$$

Example:

The Meaning

1. The Big Picture: Beyond "Yes or No" Up until now, you have been dealing with Binary Classification problems—questions with only two answers (e.g., "Will they get a heart attack? Yes or No" or "Is this digit a 1 or a 5?"). Now, we are moving to Multi-Class Classification, where there are K different possible categories. For example, if you want the computer to recognize handwritten digits, the image could be any number from 0 to 9, meaning there are K=10 classes.

2. The Data and "No Ordinal Relationship" The labels now come from y ∈ {1, 2, ..., K}, but these numbers are just names for categories: class 3 is not "bigger" or "better" than class 1. Because there is no ordinal relationship among the labels, we cannot treat the class index as a numeric target the way we would in regression.

3. The Prediction Goal Because we cannot just output a single +1 or −1 anymore, our model needs to output a vector (a list) of probabilities. If we have 10 classes, the model will output a list of 10 percentages:

Multi-Class Logistic Regression

The Math
$$\begin{bmatrix} \mathbf{w}_1^T\mathbf{x} \\ \mathbf{w}_2^T\mathbf{x} \\ \vdots \\ \mathbf{w}_K^T\mathbf{x} \end{bmatrix}$$
The Meaning

1. How We Build It: K Sets of Weights To generate these K different predictions, we cannot just use one set of weights (w). Instead, we need a separate "expert" for each class.

2. The Raw Signals When a new data point x comes in, we have every single "expert" evaluate it. We compute the dot product of x with each weight vector:

The cliffhanger: Because these are raw dot products, the numbers could be anything: +50, −12, or 0.5. They do not look like probabilities yet (they don't add up to 100%). As your slide notes, we now need a mathematical way to map these wild raw scores into a neat, valid probability distribution. (Spoiler: the next slide introduces the "Softmax" function to do exactly this!)
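The "raw signals" step can be sketched in a few lines of numpy. The sizes and random values here are made up for illustration: each row of `W` plays the role of one class's weight vector ("expert").

```python
import numpy as np

# Hypothetical numbers: K = 3 classes, d = 4 features.
# W stacks one weight vector ("expert") per row.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))   # w_1, w_2, w_3 as rows
x = rng.normal(size=4)        # one input point

raw_scores = W @ x            # raw_scores[k] = w_k . x
print(raw_scores)             # arbitrary reals, not yet probabilities
```

Note that the scores can be negative and do not sum to 1, which is exactly the cliffhanger above.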

Softmax

A function that is very useful in general, even outside of ML

The Math
$$\mathrm{softmax}(\mathbf{v})=\frac{1}{\sum_{k=1}^{K} e^{v_k}}\begin{bmatrix} e^{v_1} \\ e^{v_2} \\ \vdots \\ e^{v_K} \end{bmatrix}$$

- The output of the softmax function is a vector of probabilities


Why is it called softmax?

The Meaning

1. The Problem: Turning Raw Scores into Probabilities In the previous slide, we built K different "experts" (weight vectors) to calculate a raw score for each of the K classes using dot products ($\mathbf{w}_k^T\mathbf{x}$). The problem is that these raw scores can be any wild numbers, like +42.5 for Class 1, −12.0 for Class 2, and +0.5 for Class 3.

To make sense of these numbers, we need to convert them into a valid probability distribution, which has two strict rules: every entry must lie between 0 and 1, and all the entries must sum to 1.

2. The Solution: The Softmax Function The Softmax function takes any vector of raw numbers and transforms it to follow those two rules. Here is how the math does it step-by-step: first it exponentiates every score ($e^{v_k}$), which forces every value to be positive; then it divides each result by the sum of all of them, which forces the values to sum to 1.

3. Clarifying: Why is it called "Softmax"? Your notes point out that the exponential function ($e^x$) is monotonic (strictly increasing). This means that if raw score A is bigger than raw score B, then $e^A$ will always be bigger than $e^B$. Softmax therefore preserves the ranking of the scores while pushing most of the probability mass onto the largest one, so it acts like a smooth ("soft") version of the hard max/argmax function.
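The two-step recipe (exponentiate, then normalize) is easy to implement. A minimal sketch, with the shift-by-the-max trick added for numerical stability (the example scores are made up):

```python
import numpy as np

def softmax(v):
    """Map raw scores v to a probability vector (numerically stable)."""
    # Subtracting max(v) keeps exp() from overflowing on large scores;
    # as shown later in these notes, shifting the inputs by a constant
    # does not change the softmax output.
    e = np.exp(v - np.max(v))
    return e / e.sum()

scores = np.array([50.0, -12.0, 0.5])   # wild raw scores
probs = softmax(scores)
print(probs)           # entries in [0, 1] that sum to 1
print(probs.argmax())  # same winner as the raw scores: index 0
```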

Multi-Class Logistic Regression

The Math
  1. The multi-class prediction function can be written as
$$h_{\mathbf{w}}(\mathbf{x})=\begin{bmatrix} P(y=1\mid\mathbf{x};\mathbf{w}) \\ \vdots \\ P(y=K\mid\mathbf{x};\mathbf{w}) \end{bmatrix}=\mathrm{softmax}\begin{bmatrix} \mathbf{w}_1^T\mathbf{x} \\ \vdots \\ \mathbf{w}_K^T\mathbf{x} \end{bmatrix}=\frac{1}{\sum_{k=1}^{K} e^{\mathbf{w}_k^T\mathbf{x}}}\begin{bmatrix} e^{\mathbf{w}_1^T\mathbf{x}} \\ \vdots \\ e^{\mathbf{w}_K^T\mathbf{x}} \end{bmatrix}$$

- The softmax output vector tells you the probability that x belongs to each of the classes
- This is the classifier to reach for in general classification problems, since it works for any number of classes K (multi-class)
- Most other models only work for two-class (binary) classification

  1. Although there are K vectors $\mathbf{w}_k$, $k=1,\dots,K$, only K−1 of them are independent, due to the sum-to-one constraint
  2. That is why we only need one vector for binary class problems
  3. When K = 2, there is an equivalence relationship between softmax and θ(·).
The Meaning

1. Multi-Class Logistic Regression Output To finalize our model, we just plug our K linear signals ($\mathbf{w}_1^T\mathbf{x}, \mathbf{w}_2^T\mathbf{x}, \dots, \mathbf{w}_K^T\mathbf{x}$) into the Softmax function. The output is a neat vector of K probabilities, telling us exactly how likely the computer thinks the input x belongs to each class.

2. The K−1 Independence Rule & The Binary Connection Your notes end with a profound mathematical realization about how many weight vectors we actually need: because the K probabilities must sum to one, fixing K−1 of them determines the last, so only K−1 weight vectors are truly independent.
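Putting the pieces together, the full hypothesis is just softmax applied to the K raw scores. A minimal sketch with a hypothetical 3-class, 2-feature weight matrix:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

def h(W, x):
    """Multi-class LR hypothesis: softmax of the K raw scores w_k^T x."""
    return softmax(W @ x)

# Hypothetical weights: one row per class
W = np.array([[ 1.0,  0.0],
              [ 0.0,  1.0],
              [-1.0, -1.0]])
x = np.array([2.0, 0.5])

probs = h(W, x)        # vector of K = 3 probabilities
print(probs)
print(probs.argmax())  # predicted class: index 0
```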

Training with Cross Entropy Loss

The Math
$$H(\boldsymbol{P},\boldsymbol{Q})=-\sum_{i=1}^{K} p_i \log(q_i)$$

- This is the general definition of cross-entropy
- Only one term in the sum is non-zero, and that term corresponds to the correct class

- $\boldsymbol{P}=(p_1,\dots,p_K)^T$ is the one-hot vector representing the true class
- $\boldsymbol{Q}=(q_1,\dots,q_K)^T$ is the predicted probability vector output by softmax

The Meaning

1. The Goal: Measuring the Distance between Probabilities In the previous slides, we used the Softmax function to make our model output a list of probabilities (e.g., "I am 10% sure this is class A, 70% sure it's class B, and 20% sure it's class C"). To train the model, we need an error function (a loss function) that measures how "far away" our predicted probabilities are from the absolute truth.

2. The One-Hot Vector (P) To compare our prediction to the truth, we need to express the truth as a probability distribution too. In reality, we are 100% sure of the correct answer. If we have 3 classes and the true answer is Class 2, we represent this as a One-Hot Vector: $\boldsymbol{P}=[0,1,0]^T$.

3. The General Cross-Entropy Formula To measure the distance between P and Q, we use the Cross-Entropy formula: $$H(\boldsymbol{P}, \boldsymbol{Q})=-\sum_{i=1}^K p_i \log \left(q_i\right)$$ This works like a loop that multiplies the true probability ($p_i$) by the log of the predicted probability ($\log q_i$) for every single class, and adds them all up.

4. The Beautiful Simplification Your notes point out a fantastic shortcut: "There is only one term that is non-zero." Because our true vector P is one-hot (e.g., [0,1,0]), almost every $p_i$ in the equation is 0. Anything multiplied by 0 is 0. Therefore, the model's predictions for the wrong classes are completely wiped out of the equation! The only term that survives is the one where $p_i=1$ (the correct class). So, if the true class is c, the entire summation collapses into a single, simple formula: $-\log(q_c)$ (where $q_c$ is simply the probability your model assigned to the correct class).
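The collapse can be verified numerically. A small sketch (the prediction vector `q` is made up) comparing the full sum against the one-term shortcut:

```python
import numpy as np

# True class is index 1 (one-hot) and q is a hypothetical softmax output
p = np.array([0.0, 1.0, 0.0])       # one-hot truth
q = np.array([0.2, 0.7, 0.1])       # predicted probabilities

full_sum = -np.sum(p * np.log(q))   # general cross-entropy H(P, Q)
shortcut = -np.log(q[1])            # only the correct-class term survives
print(full_sum, shortcut)           # identical values
```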

A Neural Network View

The Math

![[Pasted image 20260127101916.png|600]]


Example exam question:


The Meaning

1. The Neural Network View (The Pipeline) The slide ties everything together into a "Neural Network" pipeline: the input x feeds into a linear layer that computes the K dot products $\mathbf{w}_k^T\mathbf{x}$, a softmax layer turns those scores into probabilities, and the cross-entropy loss compares the probabilities to the one-hot truth.

2. Correcting your Exam Question Note! Let's look at your example question: $\boldsymbol{q}=[0.5,0.4,0.1]$ and $\boldsymbol{p}=[0,1,0]$. You plug it into the formula: $-(0\cdot\log(0.5)+1\cdot\log(0.4)+0\cdot\log(0.1))$. This correctly simplifies to: $-\log(0.4)$.

Here is the error in your notes: you wrote the final answer as −0.4. Loss is a penalty, and it must always be a positive number. Because probabilities are decimals between 0 and 1, taking the logarithm of a decimal always gives a negative number. The formula puts a negative sign at the very front specifically to cancel that out and make the final loss positive: the correct answer is $-\log(0.4)\approx 0.916$.
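A quick numerical check of the exam-question numbers confirms the sign:

```python
import numpy as np

q = np.array([0.5, 0.4, 0.1])   # predicted probabilities from the question
p = np.array([0.0, 1.0, 0.0])   # one-hot truth (class 2)

loss = -np.sum(p * np.log(q))
print(loss)   # -log(0.4), a positive penalty, not -0.4
```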

Loss Function

The Math

Now the loss for a training sample x in class c is given by

$$\mathrm{loss}(\mathbf{x}, y; \mathbf{w})=H(\mathbf{y}, \hat{\mathbf{y}})=-\sum_k y_k \log \hat{y}_k=-\log \hat{y}_c=-\log \frac{e^{\mathbf{w}_c^T\mathbf{x}}}{\sum_{k=1}^K e^{\mathbf{w}_k^T\mathbf{x}}}$$

where $\mathbf{y}$ denotes the one-hot vector and $\hat{\mathbf{y}}$ is the predicted distribution $h(\mathbf{x}_i)$. And the loss on all samples $\{(\mathbf{x}_i, y_i)\}_{i=1}^N$ is

$$\mathrm{loss}(X, Y; \mathbf{w})=-\sum_{i=1}^N \sum_{c=1}^K \mathbb{I}[y_i=c] \log \frac{e^{\mathbf{w}_c^T\mathbf{x}_i}}{\sum_{k=1}^K e^{\mathbf{w}_k^T\mathbf{x}_i}}$$
The Meaning

1. The Loss Function (Putting it all together) In the previous slides, we learned that the Cross-Entropy loss for a single data point just collapses to $-\log(q_c)$, where $q_c$ is the predicted probability of the correct class. This slide writes that exact same concept out using full, formal mathematical notation.
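The dataset-level formula (a sum over all N samples of the per-sample collapsed loss) can be sketched directly. The tiny dataset here is hypothetical:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

def total_loss(W, X, y):
    """Cross-entropy loss summed over all N samples.

    X: (N, d) inputs; y: (N,) integer class labels in {0, ..., K-1}.
    """
    loss = 0.0
    for x_i, c in zip(X, y):
        probs = softmax(W @ x_i)
        loss += -np.log(probs[c])   # only the true-class term survives
    return loss

# Tiny hypothetical dataset: 2 classes, 2 features
W = np.array([[1.0, 0.0], [0.0, 1.0]])
X = np.array([[3.0, 0.0], [0.0, 3.0]])
y = np.array([0, 1])
print(total_loss(W, X, y))
```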

Derivative of Loss Function

The notes (Logistic Regression: From Binary to Multi-Class) contain details on the derivative of the cross-entropy loss function, which is necessary for your homework. All you need are:

  1. Univariate calculus
  2. Chain rule
  3. $\frac{\partial (\mathbf{w}^T\mathbf{b})}{\partial \mathbf{w}}=\mathbf{b}$

Taking the Derivative (The Calculus Note) To train the model, we need to use Gradient Descent, which means taking the derivative of this giant loss function to find the slope.
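The standard result of that derivation is that the gradient of the per-sample loss with respect to each $\mathbf{w}_k$ is $(q_k - \mathbb{I}[k=c])\,\mathbf{x}$, i.e. "softmax output minus one-hot truth, times the input." A sketch (with made-up dimensions) that checks the analytic gradient against a finite-difference approximation:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

def loss(W, x, c):
    return -np.log(softmax(W @ x)[c])

def grad(W, x, c):
    """Analytic gradient: dloss/dw_k = (q_k - 1[k == c]) * x."""
    q = softmax(W @ x)
    q[c] -= 1.0               # subtract the one-hot truth
    return np.outer(q, x)     # row k is the gradient w.r.t. w_k

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 4))
x = rng.normal(size=4)
c = 2

# Finite-difference check of one weight entry
eps = 1e-6
W2 = W.copy(); W2[0, 1] += eps
numeric = (loss(W2, x, c) - loss(W, x, c)) / eps
print(grad(W, x, c)[0, 1], numeric)   # should agree closely
```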

Shift-invariance in Parameters

The Math

The softmax function in multi-class LR has an invariance property when shifting the parameters. Given the weights $\mathbf{w}=(\mathbf{w}_1,\dots,\mathbf{w}_K)$, suppose we subtract the same vector $\mathbf{u}$ from each of the K weight vectors; the outputs of the softmax function will remain the same.

Notes:

$$\mathrm{softmax}\begin{bmatrix} \mathbf{w}_1^T\mathbf{x} \\ \mathbf{w}_2^T\mathbf{x} \\ \vdots \\ \mathbf{w}_K^T\mathbf{x} \end{bmatrix}=\mathrm{softmax}\begin{bmatrix} \mathbf{v}_1^T\mathbf{x} \\ \mathbf{v}_2^T\mathbf{x} \\ \vdots \\ \mathbf{v}_K^T\mathbf{x} \end{bmatrix}, \qquad \mathbf{v}_i=\mathbf{w}_i-\mathbf{u}$$
Proof

To prove this, let us denote $\mathbf{w}'=\{\mathbf{w}_i'\}_{i=1}^K$ where $\mathbf{w}_i'=\mathbf{w}_i-\mathbf{u}$. We have

$$P(y=k\mid\mathbf{x};\mathbf{w}')=\frac{e^{(\mathbf{w}_k-\mathbf{u})^T\mathbf{x}}}{\sum_{i=1}^K e^{(\mathbf{w}_i-\mathbf{u})^T\mathbf{x}}}=\frac{e^{\mathbf{w}_k^T\mathbf{x}}\, e^{-\mathbf{u}^T\mathbf{x}}}{\sum_{i=1}^K e^{\mathbf{w}_i^T\mathbf{x}}\, e^{-\mathbf{u}^T\mathbf{x}}}=\frac{e^{\mathbf{w}_k^T\mathbf{x}}\, e^{-\mathbf{u}^T\mathbf{x}}}{\left(\sum_{i=1}^K e^{\mathbf{w}_i^T\mathbf{x}}\right) e^{-\mathbf{u}^T\mathbf{x}}}=\frac{e^{\mathbf{w}_k^T\mathbf{x}}}{\sum_{i=1}^K e^{\mathbf{w}_i^T\mathbf{x}}}=P(y=k\mid\mathbf{x};\mathbf{w}),$$

which completes the proof.

The Meaning

1. Shift-Invariance in Parameters (A Cool Math Trick) This is a fascinating and highly useful property of the Softmax function. It states that if you take every single weight vector for every single class, and subtract the exact same vector u from all of them, the final probabilities will not change at all.

2. Walking through the Proof The algebra perfectly proves why this happens:

  1. Let's replace our old weights $\mathbf{w}_k$ with our shifted weights: $\mathbf{w}_k'=\mathbf{w}_k-\mathbf{u}$.
  2. Plug this into the Softmax numerator: $e^{(\mathbf{w}_k-\mathbf{u})^T\mathbf{x}}$.
  3. Using vector algebra, we distribute the $\mathbf{x}$: $e^{\mathbf{w}_k^T\mathbf{x}-\mathbf{u}^T\mathbf{x}}$.
  4. Using the rules of exponents ($e^{A-B}=e^A e^{-B}$), we can split this into two parts multiplied together: $e^{\mathbf{w}_k^T\mathbf{x}}\, e^{-\mathbf{u}^T\mathbf{x}}$.
  5. The Magic: If we do this for the denominator as well, notice that the term $e^{-\mathbf{u}^T\mathbf{x}}$ has absolutely no k in it. It is an identical constant factor for every single class.
  6. Because it is identical for every item in the denominator's sum, we can factor it completely out of the summation.
  7. Now we have $e^{-\mathbf{u}^T\mathbf{x}}$ on the top and $e^{-\mathbf{u}^T\mathbf{x}}$ on the bottom. They perfectly cancel each other out, leaving us with the exact original Softmax formula!

3. Answering your Note's Question You asked: "The reason why the output is K−1 dimensional is because of normalization?" Yes, exactly! Because the probabilities must normalize (sum to 1), knowing the probability of K−1 classes automatically tells you the probability of the last class. The shift-invariance property gives us the mathematical proof of this: because we can subtract any vector $\mathbf{u}$ from our weights without changing the output, we can purposely choose to subtract the weights of the last class ($\mathbf{u}=\mathbf{w}_K$), which zeroes out the last weight vector and leaves only K−1 free weight vectors.
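The shift-invariance property is easy to check numerically. A sketch with hypothetical sizes and a random shift vector u:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

rng = np.random.default_rng(2)
W = rng.normal(size=(4, 5))   # 4 classes, 5 features
x = rng.normal(size=5)
u = rng.normal(size=5)        # arbitrary shift vector

before = softmax(W @ x)
after = softmax((W - u) @ x)  # subtract the same u from every w_k
print(np.allclose(before, after))   # probabilities are unchanged
```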

Equivalence to Sigmoid

Once we have proved the shift-invariance, we are able to show that when K = 2, the softmax-based multi-class LR is equivalent to the sigmoid-based binary LR. In particular, the hypotheses of the two LRs are equivalent.

Proof
$$h_{\mathbf{w}}(\mathbf{x})=\frac{1}{e^{\mathbf{w}_1^T\mathbf{x}}+e^{\mathbf{w}_2^T\mathbf{x}}}\begin{bmatrix} e^{\mathbf{w}_1^T\mathbf{x}} \\ e^{\mathbf{w}_2^T\mathbf{x}} \end{bmatrix}=\frac{1}{e^{(\mathbf{w}_1-\mathbf{w}_1)^T\mathbf{x}}+e^{(\mathbf{w}_2-\mathbf{w}_1)^T\mathbf{x}}}\begin{bmatrix} e^{(\mathbf{w}_1-\mathbf{w}_1)^T\mathbf{x}} \\ e^{(\mathbf{w}_2-\mathbf{w}_1)^T\mathbf{x}} \end{bmatrix}=\begin{bmatrix} \frac{1}{1+e^{(\mathbf{w}_2-\mathbf{w}_1)^T\mathbf{x}}} \\ \frac{e^{(\mathbf{w}_2-\mathbf{w}_1)^T\mathbf{x}}}{1+e^{(\mathbf{w}_2-\mathbf{w}_1)^T\mathbf{x}}} \end{bmatrix}=\begin{bmatrix} \frac{1}{1+e^{-\hat{\mathbf{w}}^T\mathbf{x}}} \\ \frac{e^{-\hat{\mathbf{w}}^T\mathbf{x}}}{1+e^{-\hat{\mathbf{w}}^T\mathbf{x}}} \end{bmatrix}=\begin{bmatrix} \frac{1}{1+e^{-\hat{\mathbf{w}}^T\mathbf{x}}} \\ 1-\frac{1}{1+e^{-\hat{\mathbf{w}}^T\mathbf{x}}} \end{bmatrix}=\begin{bmatrix} h_{\hat{\mathbf{w}}}(\mathbf{x}) \\ 1-h_{\hat{\mathbf{w}}}(\mathbf{x}) \end{bmatrix},$$

where $\hat{\mathbf{w}}=\mathbf{w}_1-\mathbf{w}_2$. This completes the proof.

Notes:

$$\theta(\mathbf{w}^T\mathbf{x})=\frac{1}{1+e^{-\mathbf{w}^T\mathbf{x}}}$$

- You can do a change of variables $\hat{\mathbf{w}}=\mathbf{w}_1-\mathbf{w}_2$ so that the first softmax output becomes exactly the sigmoid (the subtraction turns the exponent negative)
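The K = 2 equivalence can be demonstrated numerically: the first softmax output equals the sigmoid of $\hat{\mathbf{w}}^T\mathbf{x}$ with $\hat{\mathbf{w}}=\mathbf{w}_1-\mathbf{w}_2$. A sketch with random hypothetical weights:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(3)
w1, w2 = rng.normal(size=4), rng.normal(size=4)
x = rng.normal(size=4)

two_class = softmax(np.array([w1 @ x, w2 @ x]))
w_hat = w1 - w2                 # change of variables
binary = sigmoid(w_hat @ x)     # P(y = 1 | x) from binary LR
print(two_class[0], binary)     # identical values
```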

Equivalence of Loss Function

  1. Now we show that minimizing the logistic regression loss is equivalent to minimizing the cross-entropy loss with binary outcomes.
  2. The equivalence between the logistic regression loss and the cross-entropy loss, as shown below, proves that we always obtain identical weights w by minimizing the two losses. Together with the equivalence between sigmoid and softmax, this leads to the conclusion that binary logistic regression is a particular case of multi-class logistic regression when K = 2.
Proof
$$\begin{aligned}
\arg\min_{\mathbf{w}} E_{\mathrm{in}}(\mathbf{w}) &= \arg\min_{\mathbf{w}} \frac{1}{N}\sum_{n=1}^N \ln\left(1+e^{-y_n \mathbf{w}^T\mathbf{x}_n}\right) \\
&= \arg\min_{\mathbf{w}} \frac{1}{N}\sum_{n=1}^N \ln\frac{1}{\theta(y_n \mathbf{w}^T\mathbf{x}_n)} \\
&= \arg\min_{\mathbf{w}} \frac{1}{N}\sum_{n=1}^N \ln\frac{1}{P(y_n\mid\mathbf{x}_n)} \\
&= \arg\min_{\mathbf{w}} \frac{1}{N}\sum_{n=1}^N \mathbb{I}[y_n=+1]\ln\frac{1}{P(y_n\mid\mathbf{x}_n)}+\mathbb{I}[y_n=-1]\ln\frac{1}{P(y_n\mid\mathbf{x}_n)} \\
&= \arg\min_{\mathbf{w}} \frac{1}{N}\sum_{n=1}^N \mathbb{I}[y_n=+1]\ln\frac{1}{h(\mathbf{x}_n)}+\mathbb{I}[y_n=-1]\ln\frac{1}{1-h(\mathbf{x}_n)} \\
&= \arg\min_{\mathbf{w}} \, p\log\frac{1}{q}+(1-p)\log\frac{1}{1-q} \\
&= \arg\min_{\mathbf{w}} H(\{p, 1-p\},\{q, 1-q\})
\end{aligned}$$

where $p=\mathbb{I}[y_n=+1]$ and $q=h(\mathbf{x}_n)$. This completes the proof.

Notes:

$$\theta(s)=\frac{1}{1+e^{-s}}$$

- In this case $y_n\mathbf{w}^T\mathbf{x}_n$ is our $s$.

$$P(y\mid\mathbf{x})=\theta(y\,\mathbf{w}^T\mathbf{x})$$
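The key step of the proof, $\ln(1+e^{-s})=\ln\frac{1}{\theta(s)}$, can be spot-checked numerically for a few margins s (the sample values are arbitrary):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

# For any signed margin s = y_n * w^T x_n, the logistic loss
# ln(1 + e^{-s}) equals the cross-entropy form ln(1 / sigmoid(s)).
for s in [-3.0, -0.5, 0.0, 1.2, 4.0]:
    lhs = np.log(1.0 + np.exp(-s))
    rhs = np.log(1.0 / sigmoid(s))
    print(s, lhs, rhs)   # lhs and rhs agree for every s
```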

Homework tip:

Learning Invariant Representations

![[Pasted image 20260129103048.png|400]]

Notes:

From Logistic Regression to Deep Learning

How to learn X automatically?

![[Pasted image 20260129103544.png|600]]