03 - Logistic Regression - From Binary to Multi-Class

Class: CSCE-421


Notes:

Multi-Class Classification

$$h_w(x) = \begin{bmatrix} P(y=1 \mid x; w) \\ P(y=2 \mid x; w) \\ \vdots \\ P(y=K \mid x; w) \end{bmatrix}$$

Example:

Multi-Class Logistic Regression

$$\begin{bmatrix} w_1^T x \\ w_2^T x \\ \vdots \\ w_K^T x \end{bmatrix}$$

Softmax

A function that is very useful in general, even outside of ML

$$\mathrm{softmax}(v) = \frac{1}{\sum_{k=1}^K e^{v_k}} \begin{bmatrix} e^{v_1} \\ e^{v_2} \\ \vdots \\ e^{v_K} \end{bmatrix}$$

- The output of the softmax function is a vector of probabilities (non-negative entries that sum to one)


Why is it called softmax?
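
Not from the lecture, but a minimal NumPy sketch of softmax; the max-subtraction trick relies on the shift-invariance property proved later in these notes. The scaled example also hints at the name: softmax is a smooth ("soft") version of the hard argmax.

```python
import numpy as np

def softmax(v):
    """Map a score vector v to a probability vector (non-negative, sums to 1)."""
    # Subtracting max(v) does not change the output (shift-invariance, see below)
    # but avoids overflow in exp for large scores.
    e = np.exp(v - np.max(v))
    return e / e.sum()

v = np.array([2.0, 1.0, 0.1])
print(softmax(v))          # ~[0.659, 0.242, 0.099] -- sums to 1
# Scaling the scores up pushes the output toward a one-hot argmax vector:
print(softmax(10 * v))     # ~[1, 0, 0], i.e. a "soft" max
```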

Multi-Class Logistic Regression

  1. The multi-class prediction function can be written as
$$h_w(x) = \begin{bmatrix} P(y=1 \mid x; w) \\ P(y=2 \mid x; w) \\ \vdots \\ P(y=K \mid x; w) \end{bmatrix} = \mathrm{softmax}\begin{bmatrix} w_1^T x \\ w_2^T x \\ \vdots \\ w_K^T x \end{bmatrix} = \frac{1}{\sum_{k=1}^K e^{w_k^T x}} \begin{bmatrix} e^{w_1^T x} \\ e^{w_2^T x} \\ \vdots \\ e^{w_K^T x} \end{bmatrix}$$

- The softmax output vector gives the probability that $x$ belongs to each of the classes, as sketched in the code below
- This is the default classifier to reach for in any classification problem, since it handles an arbitrary number of classes $K$ (multi-class)
- Most other models only handle two-class (binary) classification
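
A minimal sketch of this hypothesis in NumPy (stacking the $K$ weight vectors as rows of a matrix `W` is my notational choice, not necessarily the lecture's):

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

def h(W, x):
    """Multi-class LR hypothesis: one probability per class.

    W has shape (K, d): row k is w_k.  Returns softmax([w_1^T x, ..., w_K^T x]).
    """
    return softmax(W @ x)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))    # K = 3 classes, d = 4 features
x = rng.normal(size=4)
probs = h(W, x)
print(probs, probs.sum())      # K probabilities summing to 1
print(probs.argmax())          # predicted class
```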

  1. Although there are $K$ vectors $w_k$, $k = 1, \dots, K$, only $K-1$ of them are independent, due to the sum-to-one constraint
  2. That is why we only need one weight vector for binary (two-class) problems
  3. When $K = 2$, there is an equivalence relationship between softmax and the sigmoid $\theta(\cdot)$.

Training with Cross Entropy Loss

$$H(P, Q) = -\sum_{i=1}^K p_i \log(q_i),$$

- This is the general definition of cross entropy
- Since $P$ is one-hot, only one term in the sum is non-zero: the term for the correct class (see the numeric example below)

- $P = (p_1, \dots, p_K)^T$ is the one-hot vector representing the true class
- $Q = (q_1, \dots, q_K)^T$ is the predicted probability vector output by softmax
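
A small numerical sketch (a made-up 3-class example, not from the lecture) showing that with a one-hot $P$ only the correct-class term survives:

```python
import numpy as np

def cross_entropy(P, Q):
    """H(P, Q) = -sum_i p_i * log(q_i)."""
    return -np.sum(P * np.log(Q))

P = np.array([0.0, 1.0, 0.0])   # one-hot: true class is the second class
Q = np.array([0.2, 0.7, 0.1])   # predicted probabilities (softmax output)

# Only the entry with p_i = 1 contributes, so H(P, Q) = -log(q_2) = -log(0.7)
print(cross_entropy(P, Q))      # ~0.357
print(-np.log(0.7))             # same value
```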

A Neural Network View

![[Pasted image 20260127101916.png|600]]


Example exam question:


Loss Function

Now the loss for a training sample $x$ in class $c$ is given by

$$\mathrm{loss}(x, y; w) = H(y, \hat{y}) = -\sum_k y_k \log \hat{y}_k = -\log \hat{y}_c = -\log \frac{e^{w_c^T x}}{\sum_{k=1}^K e^{w_k^T x}}$$

where $y$ denotes the one-hot vector and $\hat{y}$ is the predicted distribution
$h(x_i)$. And the loss on all samples $\{(x_i, y_i)\}_{i=1}^N$ is

$$\mathrm{loss}(X, Y; w) = -\sum_{i=1}^N \sum_{c=1}^K \mathbb{I}[y_i = c] \log \frac{e^{w_c^T x_i}}{\sum_{k=1}^K e^{w_k^T x_i}}$$
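
A hedged NumPy sketch of this total loss, assuming the labels `y` are given as integer class indices (0-based) rather than one-hot vectors:

```python
import numpy as np

def softmax_rows(S):
    # Row-wise softmax with max-subtraction for numerical stability.
    E = np.exp(S - S.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def cross_entropy_loss(W, X, y):
    """Total cross-entropy loss over the dataset.

    W: (K, d) weight matrix, row k is w_k
    X: (N, d) samples, y: (N,) integer class labels in {0, ..., K-1}
    """
    P = softmax_rows(X @ W.T)                          # (N, K) predicted probabilities
    return -np.sum(np.log(P[np.arange(len(y)), y]))    # -sum_i log p_{i, y_i}

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
y = rng.integers(0, 3, size=5)
W = rng.normal(size=(3, 4))
print(cross_entropy_loss(W, X, y))
```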

Derivative of Loss Function

The notes (Logistic Regression: From Binary to Multi-Class) contain the details of the derivative of the cross-entropy loss function, which is necessary for your homework. All you need are the following (a sketch of the resulting gradient follows the list):

  1. Univariate calculus
  2. Chain rule
  3. $\frac{\partial (w^T b)}{\partial w} = b$
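
As a hedged sketch of where these ingredients lead (the full derivation is in the notes; verify against them before using it for the homework), applying the chain rule to the per-sample loss gives

$$\frac{\partial\, \mathrm{loss}(x, y; w)}{\partial w_k} = \left(\hat{y}_k - y_k\right) x, \qquad \hat{y}_k = \frac{e^{w_k^T x}}{\sum_{j=1}^K e^{w_j^T x}},$$

i.e. for each class $k$, the gradient is the predicted probability minus the one-hot target, scaled by the input vector $x$.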

Shift-invariance in Parameters

The softmax function in multi-class LR has an invariance property when shifting the parameters. Given the weights $w = (w_1, \dots, w_K)$, suppose we subtract the same vector $u$ from each of the $K$ weight vectors; the outputs of the softmax function will remain the same.

Notes:

$$\mathrm{softmax}\begin{bmatrix} w_1^T x \\ w_2^T x \\ \vdots \\ w_K^T x \end{bmatrix} = \mathrm{softmax}\begin{bmatrix} v_1^T x \\ v_2^T x \\ \vdots \\ v_K^T x \end{bmatrix}, \qquad v_i = w_i - u$$

Proof

To prove this, let us denote $w' = \{w_i'\}_{i=1}^K$, where $w_i' = w_i - u$. We have

$$P(y = k \mid x; w') = \frac{e^{(w_k - u)^T x}}{\sum_{i=1}^K e^{(w_i - u)^T x}} = \frac{e^{w_k^T x} e^{-u^T x}}{\sum_{i=1}^K e^{w_i^T x} e^{-u^T x}} = \frac{e^{w_k^T x} e^{-u^T x}}{\left(\sum_{i=1}^K e^{w_i^T x}\right) e^{-u^T x}} = \frac{e^{w_k^T x}}{\sum_{i=1}^K e^{w_i^T x}} = P(y = k \mid x; w),$$

which completes the proof.
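
A quick numerical check of this claim (a sketch with made-up random weights), comparing the softmax output before and after subtracting the same vector $u$ from every weight vector:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))    # K = 4 weight vectors of dimension d = 3 (rows)
x = rng.normal(size=3)
u = rng.normal(size=3)         # same shift applied to every w_k

before = softmax(W @ x)
after  = softmax((W - u) @ x)  # broadcasting subtracts u from each row
print(np.allclose(before, after))   # True: the softmax output is unchanged
```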

Equivalence to Sigmoid

Once we have proved the shift-invariance, we are able to show that when $K = 2$, the softmax-based multi-class LR is equivalent to the sigmoid-based binary LR. In particular, the hypotheses of the two LRs are equivalent.

Proof

$$h_w(x) = \frac{1}{e^{w_1^T x} + e^{w_2^T x}} \begin{bmatrix} e^{w_1^T x} \\ e^{w_2^T x} \end{bmatrix} = \frac{1}{e^{(w_1 - w_1)^T x} + e^{(w_2 - w_1)^T x}} \begin{bmatrix} e^{(w_1 - w_1)^T x} \\ e^{(w_2 - w_1)^T x} \end{bmatrix} = \begin{bmatrix} \frac{1}{1 + e^{(w_2 - w_1)^T x}} \\[4pt] \frac{e^{(w_2 - w_1)^T x}}{1 + e^{(w_2 - w_1)^T x}} \end{bmatrix} = \begin{bmatrix} \frac{1}{1 + e^{-\hat{w}^T x}} \\[4pt] \frac{e^{-\hat{w}^T x}}{1 + e^{-\hat{w}^T x}} \end{bmatrix} = \begin{bmatrix} \frac{1}{1 + e^{-\hat{w}^T x}} \\[4pt] 1 - \frac{1}{1 + e^{-\hat{w}^T x}} \end{bmatrix} = \begin{bmatrix} h_{\hat{w}}(x) \\ 1 - h_{\hat{w}}(x) \end{bmatrix},$$

where $\hat{w} = w_1 - w_2$. This completes the proof.

Notes:

$$\theta(w^T x) = \frac{1}{1 + e^{-w^T x}}$$

- You can do a change of variables $\hat{w} = w_1 - w_2$ so that the first softmax entry becomes exactly the sigmoid (it turns the exponent negative); this is checked numerically below
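
A numerical sanity check of this equivalence (a sketch with made-up weight values): the first entry of the $K = 2$ softmax hypothesis matches the sigmoid with $\hat{w} = w_1 - w_2$.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(0)
w1, w2 = rng.normal(size=3), rng.normal(size=3)
x = rng.normal(size=3)

two_class = softmax(np.array([w1 @ x, w2 @ x]))   # softmax-based LR with K = 2
w_hat = w1 - w2                                   # change of variables
print(np.allclose(two_class[0], sigmoid(w_hat @ x)))       # True
print(np.allclose(two_class[1], 1 - sigmoid(w_hat @ x)))   # True
```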

Equivalence of Loss Function

  1. Now we show that minimizing the logistic regression loss is equivalent to minimizing the cross-entropy loss with binary outcomes.
  2. The equivalence between logistic regression loss and the cross-entropy loss, as shown below, proves that we always obtain identical weights $w$ by minimizing the two losses. The equivalence between the losses, together with the equivalence between sigmoid and softmax, leads to the conclusion that binary logistic regression is a particular case of multi-class logistic regression when $K = 2$.

Proof

$$\begin{aligned}
\arg\min_w E_{\mathrm{in}}(w) &= \arg\min_w \frac{1}{N} \sum_{n=1}^N \ln\left(1 + e^{-y_n w^T x_n}\right) \\
&= \arg\min_w \frac{1}{N} \sum_{n=1}^N \ln\frac{1}{\theta(y_n w^T x_n)} \\
&= \arg\min_w \frac{1}{N} \sum_{n=1}^N \ln\frac{1}{P(y_n \mid x_n)} \\
&= \arg\min_w \frac{1}{N} \sum_{n=1}^N \mathbb{I}[y_n = +1]\ln\frac{1}{P(y_n \mid x_n)} + \mathbb{I}[y_n = -1]\ln\frac{1}{P(y_n \mid x_n)} \\
&= \arg\min_w \frac{1}{N} \sum_{n=1}^N \mathbb{I}[y_n = +1]\ln\frac{1}{h(x_n)} + \mathbb{I}[y_n = -1]\ln\frac{1}{1 - h(x_n)} \\
&= \arg\min_w \; p\log\frac{1}{q} + (1-p)\log\frac{1}{1-q} \\
&= \arg\min_w H(\{p, 1-p\}, \{q, 1-q\})
\end{aligned}$$

where $p = \mathbb{I}[y_n = +1]$ and $q = h(x_n)$. This completes the proof.

Notes:

$$\theta(s) = \frac{1}{1 + e^{-s}}$$

- In this case $y_n w^T x_n$ is our $s$.

$$P(y \mid x) = \theta(y\, w^T x)$$
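
A numerical check of the loss equivalence (a sketch with a made-up sample; labels in $\{-1, +1\}$): the per-sample logistic loss $\ln(1 + e^{-y w^T x})$ equals the binary cross entropy with $p = \mathbb{I}[y = +1]$ and $q = h(x)$.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(0)
w = rng.normal(size=3)
x = rng.normal(size=3)
y = -1                                   # label in {-1, +1}

logistic_loss = np.log(1 + np.exp(-y * (w @ x)))

p = 1.0 if y == +1 else 0.0              # p = I[y = +1]
q = sigmoid(w @ x)                       # q = h(x) = P(y = +1 | x)
cross_entropy = -(p * np.log(q) + (1 - p) * np.log(1 - q))

print(np.allclose(logistic_loss, cross_entropy))   # True
```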

Homework tip:

Learning Invariant Representations

![[Pasted image 20260129103048.png|400]]

Notes:

From Logistic Regression to Deep Learning

How do we learn X automatically?

![[Pasted image 20260129103544.png|600]]