03 - Logistic Regression - From Binary to Multi-Class
Class: CSCE-421
Notes:
Multi-Class Classification
The Math
- Given the dataset $\{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$, where $\mathbf{x}_n \in \mathbb{R}^{d+1}$ and $y_n \in \{1, \dots, K\}$ - K is the number of classes and N is the number of samples.
- We need to estimate the probability of x belonging to each of the K classes, $P(y = k \mid \mathbf{x})$ for $k = 1, \dots, K$.
- No ordinal relationship between the classes.
Example:
- For K = 10 (10 classes), each prediction will be a vector of 10 dimensions.
- The raw score vector is K-dimensional, but its entries can be positive or negative (not yet probabilities).
The Meaning
1. The Big Picture: Beyond "Yes or No". Up until now, you have been dealing with Binary Classification problems: questions with only two answers (e.g., "Will they get a heart attack? Yes or No" or "Is this digit a 1 or a 5?"). Now, we are moving to Multi-Class Classification, where there are K possible answers (e.g., classifying a digit as any of 0 through 9).
2. The Data and "No Ordinal Relationship"
- Your input $\mathbf{x}$ is still the same (a vector of features like intensity, symmetry, etc., plus the dummy coordinate).
- Your output $y$ is now a number from $1$ to $K$ representing the class.
- Crucial concept: Your notes highlight that there is no ordinal relationship between these classes. This means class 2 is not "greater" or "better" than class 1. The numbers are just meaningless nametags (like "apple", "banana", "orange"). You cannot do standard math on the label $y$ anymore, because predicting a 4 when the answer is 2 is not "twice as wrong" as predicting a 3.
3. The Prediction Goal: Because we cannot just output a single number, the model outputs a probability for every class:
- "I am 10% sure this is a 0."
- "I am 85% sure this is a 1."
- "I am 5% sure this is a 2...", and so on.
Multi-Class Logistic Regression
The Math
- We need K weight vectors $\mathbf{w}_k$, $k = 1, \dots, K$.
    - We need a vector for each class.
    - Stacking the K sub-vectors gives a weight matrix $W$ whose rows are the $\mathbf{w}_k$.
- Compute K linear signals by the dot product between the input $\mathbf{x}$ and each $\mathbf{w}_k$, as $s_k = \mathbf{w}_k^\top \mathbf{x}$.
- We need to map the K outputs (as a vector in $\mathbb{R}^K$) to the K probabilities (as a probability distribution among the K classes).
The Meaning
1. How We Build It:
- We create K entirely different weight vectors: $\mathbf{w}_1, \dots, \mathbf{w}_K$. $\mathbf{w}_1$ is specifically trained to look for patterns that identify class 1, $\mathbf{w}_2$ looks for class 2, etc.
2. The Raw Signals: When a new data point $\mathbf{x}$ arrives, $\mathbf{w}_1^\top \mathbf{x}$ gives us the raw score for class 1, $\mathbf{w}_2^\top \mathbf{x}$ gives us the raw score for class 2, and so on.
- This results in a stacked column of K raw scores (the vector in $\mathbb{R}^K$ shown at the bottom of your slide).
- The cliffhanger: Because these are raw dot products, the numbers could be anything: negative, huge, or tiny. They are not probabilities yet.
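As a quick sketch of this step (all weights and inputs below are made-up illustration values, not from the slides), the raw signals are just K dot products, and nothing constrains them to look like probabilities:

```python
# Hypothetical 3-class example; weights and input are made-up numbers.
w = [[0.5, -1.2, 2.0],   # w_1: score detector for class 1
     [-0.3, 0.8, -1.5],  # w_2: score detector for class 2
     [1.1, 0.2, 0.4]]    # w_3: score detector for class 3
x = [1.0, 2.0, -1.0]     # input vector (first entry plays the dummy coordinate)

# K raw linear signals s_k = w_k . x; nothing forces them into [0, 1]
signals = [sum(wi * xi for wi, xi in zip(wk, x)) for wk in w]
print(signals)  # mixes negative and positive values: not probabilities yet
```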
Softmax
A function that is very useful in general, even outside of ML.
The Math
- Given a K-dimensional vector $\mathbf{s} = (s_1, \dots, s_K)^\top$
- The output of the softmax function is a vector of probabilities:
    $\mathrm{softmax}(\mathbf{s})_k = \frac{e^{s_k}}{\sum_{j=1}^{K} e^{s_j}}, \quad k = 1, \dots, K$
- The softmax maps a vector in $\mathbb{R}^K$ to a probability distribution over the K classes.
- All elements in the output vector sum to 1.
Why is it called softmax?
- Note each exponential component is a monotonic function; this means the ordering of the inputs is preserved in the outputs.
- Whichever component is largest in the input will still be the largest in the output, thanks to this property of the exponential term.
- Softmax = whichever is larger in the input will be larger in the output.
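The definition above can be sketched as a small function (the max-subtraction line is a standard numerical-stability trick, and it works precisely because of the shift-invariance property covered later in these notes):

```python
import math

def softmax(s):
    """Map K raw scores to K probabilities that are positive and sum to 1."""
    m = max(s)                           # subtracting the max is a standard
    exps = [math.exp(v - m) for v in s]  # stability trick (shift-invariance)
    total = sum(exps)
    return [v / total for v in exps]

scores = [2.0, 1.0, 0.1]  # made-up raw scores
probs = softmax(scores)
print(probs)                    # all positive, sums to 1
print(probs.index(max(probs)))  # 0: the largest input stays the largest output
```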
The Meaning
1. The Problem: Turning Raw Scores into Probabilities. In the previous slide, we built K raw scores. To make sense of these numbers, we need to convert them into a valid probability distribution, which has two strict rules:
- Every individual probability must be between $0$ and $1$ ($0\%$ to $100\%$).
- All the probabilities must add up to exactly $1$ ($100\%$).
2. The Solution: The Softmax Function The Softmax function takes any vector of raw numbers and perfectly transforms it to follow those two rules. Here is how the math does it step-by-step:
- Step 1 (Exponentials, $e^{s_k}$): First, it takes the number $e$ (Euler's number, $\approx 2.718$) and raises it to the power of each raw score. Because $e$ to any power is always a positive number, this trick instantly eliminates all negative raw scores.
- Step 2 (Normalization, $\sum_j e^{s_j}$): Next, it adds up all those new positive numbers to find a total sum. Then, it divides each individual positive number by that total sum. Mathematically, if you divide parts by their whole, they are guaranteed to add up to exactly $1$.
3. Clarifying: Why is it called "Softmax"? Your notes point out that the exponential function ($e^x$) is monotonic, so it preserves the size order of the raw scores.
- The "Max" part: Because it preserves this size order, the class that had the highest raw score will end up with the highest probability.
- The "Soft" part: A standard "Hard Max" function would look at the scores and give 100% to the absolute biggest and 0% to everything else. "Softmax" is a softer version. It highlights the biggest number by giving it the largest percentage, but it still assigns small fractional probabilities to the others depending on how relatively large they were.
Multi-Class Logistic Regression
The Math
- The multi-class prediction function can be written as $h(\mathbf{x}) = \mathrm{softmax}(W\mathbf{x})$
- The softmax output vector tells you the probability that $\mathbf{x}$ belongs to each of the K classes.
- This is a classifier you can use for any classification problem, since it works for as many classes as K (multi-class).
    - Most other models only work for two-class classification.
- Although there are K vectors $\mathbf{w}_k$, $k = 1, \dots, K$, only K−1 of them are independent, due to the sum-to-one constraint.
    - That is why we only need one vector for binary class problems.
- When K = 2, there is an equivalence relationship between softmax and the sigmoid $\theta(\cdot)$.
The Meaning
1. Multi-Class Logistic Regression Output: To finalize our model, we just plug our K raw scores into the Softmax function.
2. The K−1 Independence
- Because the Softmax output must sum to $1$, the probabilities are deeply linked. If you have 3 classes and you know that Class 1 has, say, a 20% probability and Class 2 a 50% probability, you automatically know Class 3 must be 30%.
- Because the last class is automatically determined by the others, only K−1 of the weight vectors are truly mathematically independent.
- Connecting back to Binary: If we only have 2 classes (K = 2), K−1 equals exactly 1. This explains why in standard Binary Logistic Regression (predicting Heart Attack vs. No Heart Attack), we only ever trained one weight vector $\mathbf{w}$!
- Furthermore, if you plug K = 2 into the Softmax formula, the math simplifies perfectly into the exact same S-curve (Sigmoid) function we learned in the previous chapter. Multi-class Logistic Regression is just the generalized version of Binary Logistic Regression.
Training with Cross Entropy Loss
The Math
- A loss function measures the error between predictions and the true class labels.
- We need to measure the distance between two probability distributions.
- The cross entropy for a single sample is defined as
    $e(\mathbf{p}, \mathbf{q}) = -\sum_{k=1}^{K} p_k \log q_k$
- This is the general definition of cross entropy.
- There is only one term that is non-zero; that term corresponds to the correct class.
- $\mathbf{p}$ is the one-hot vector of the true class.
- $\mathbf{q}$ is the predicted probability vector from the softmax.
The Meaning
1. The Goal: Measuring the Distance between Probabilities In the previous slides, we used the Softmax function to make our model output a list of probabilities (e.g., "I am 10% sure this is class A, 70% sure it's class B, and 20% sure it's class C"). To train the model, we need an error function (a loss function) that measures how "far away" our predicted probabilities are from the absolute truth.
2. The One-Hot Vector ($\mathbf{p}$)
- It is called "one-hot" because only one element is "hot" (set to 1), and all the rest are cold (set to 0). This vector is our $\mathbf{p}$ (the true probability distribution). Our model's prediction from the Softmax is our $\mathbf{q}$.
3. The General Cross-Entropy Formula: To measure the distance between $\mathbf{p}$ and $\mathbf{q}$, we compute $e(\mathbf{p}, \mathbf{q}) = -\sum_{k} p_k \log q_k$.
4. The Beautiful Simplification: Your notes point out a fantastic shortcut: "There is only one term that is non-zero." Because our true vector $\mathbf{p}$ is one-hot (e.g., $[0, 1, 0]$), almost every $p_k$ is zero, so the whole sum collapses to a single term: $-\log q_c$, where $c$ is the correct class.
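A quick numerical check of this collapse, with a made-up prediction vector:

```python
import math

p = [0, 1, 0]        # one-hot truth: the correct class is index 1
q = [0.1, 0.7, 0.2]  # made-up softmax output

# Full cross entropy: -sum_k p_k * log(q_k) ...
full = -sum(p_k * math.log(q_k) for p_k, q_k in zip(p, q))
# ... collapses to a single term: -log of the true class's predicted probability
collapsed = -math.log(q[p.index(1)])
print(full, collapsed)  # both equal -log(0.7)
```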
A Neural Network View
The Math
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260127101916.png)
- You have a d-dimensional vector ($\mathbf{x}$) -> this is the input vector.
- Then we compute $\mathbf{w}_1^\top \mathbf{x}$, $\mathbf{w}_2^\top \mathbf{x}$, and $\mathbf{w}_3^\top \mathbf{x}$ (for 3 classes) -> K = 3.
    - Look at the different colored edges; each color represents the connections feeding one of the $\mathbf{w}_k^\top \mathbf{x}$ units.
    - If you have K classes, you need K units in this layer.
    - You are basically connecting every input to every output.
- This 3-dimensional vector (a vector with 3 numbers) is passed on to the softmax function, which outputs the probability that $\mathbf{x}$ belongs to each class. This is our prediction.
    - Note that the softmax function contains a normalization term; it couples all the terms together.
- Now let's run through a training session using the cross entropy loss function.
    - A one-hot vector has exactly one element equal to 1 and every other element equal to 0. It is the true value that represents which class the vector $\mathbf{x}$ really belongs to, and it is the second input to the cross entropy loss function.
    - The loss measures the difference between the true class and the probability vector output by the softmax function.
    - Note how the one-hot vector would look, e.g., $[0, 1, 0]$ when the true class is the second of three.
    - Minimizing the cross entropy loss is equivalent to maximizing the predicted probability of the correct class.
Example exam question:
- Let's say you got this prediction and truth for one sample:
    - q = `[0.5 0.4 0.1]` and p = `[0 1 0]`
    - The distribution of probability over the other classes does not matter; only the entry for the true class enters the loss.
- What is the loss of this sample?
    - Just plug into the cross entropy loss function.
    - You just need to compute $-\log(0.4)$.
- Answer: -0.4 (but see the correction in The Meaning below: the loss should be positive).
The Meaning
1. The Neural Network View (The Pipeline): The slide ties everything together into a "Neural Network" pipeline:
- Step 1 (Input): You start with your input data vector $\mathbf{x}$.
- Step 2 (Linear Signals): You branch out to K different "units". Each unit has its own weights ($\mathbf{w}_k$) and calculates a raw dot-product score ($\mathbf{w}_k^\top \mathbf{x}$).
- Step 3 (Softmax): You pass all K raw scores into the Softmax function. The Softmax normalizes them, forcing them to become valid probabilities that add up to 1 (this is the $\mathbf{q}$ vector).
- Step 4 (Loss): You take the $\mathbf{q}$ vector and the true one-hot vector $\mathbf{p}$, and plug them into the Cross Entropy Loss function to calculate your error. During training, the computer tweaks the weights ($\mathbf{w}_k$) to make this error as small as possible.
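The four steps above can be sketched as one function (the weights, input, and the names `forward_and_loss` and `true_class` are made up for this sketch):

```python
import math

def forward_and_loss(W, x, true_class):
    """One pass through the pipeline: signals -> softmax -> cross entropy."""
    signals = [sum(wi * xi for wi, xi in zip(wk, x)) for wk in W]  # Step 2
    m = max(signals)
    exps = [math.exp(s - m) for s in signals]
    total = sum(exps)
    q = [e / total for e in exps]                                  # Step 3
    return -math.log(q[true_class])                                # Step 4

# Made-up weights and input for a 3-class problem (Step 1)
W = [[0.2, -0.1], [0.5, 0.3], [-0.4, 0.8]]
x = [1.0, 2.0]
loss = forward_and_loss(W, x, true_class=1)
print(loss)  # a positive penalty; training tweaks W to shrink it
```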
2. Correcting your Exam Question Note! Let's look at your example question (q = [0.5, 0.4, 0.1], p = [0, 1, 0]).
Here is the error in your notes: you wrote that the final answer is -0.4. Loss is basically a penalty, and it must always be a positive number here. Because probabilities are decimals between 0 and 1, taking the logarithm of a decimal always results in a negative number. The formula puts a negative sign at the very front specifically to cancel that out and make the final loss positive!
- If your professor uses $\log_{10}$, the math is: $-\log_{10}(0.4) \approx 0.398$ (which rounds to $0.4$).
- If your professor uses the natural log ($\ln$), the math is: $-\ln(0.4) \approx 0.916$.
- Make sure you write down a positive number on your exam!
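Both answers can be verified directly:

```python
import math

q = [0.5, 0.4, 0.1]  # predicted probability vector
p = [0, 1, 0]        # one-hot truth: the class at index 1 is correct

# Only the true class's term survives: loss = -log(q[1]) = -log(0.4)
loss_ln    = -math.log(q[1])     # natural log
loss_log10 = -math.log10(q[1])   # base-10 log
print(round(loss_ln, 3), round(loss_log10, 3))  # 0.916 0.398, both positive
```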
Multi-Class Neural Network Pipeline:
- Input vector $\mathbf{x}$.
- Compute K linear signals: $s_k = \mathbf{w}_k^\top \mathbf{x}$.
- Apply Softmax to convert the signals into the probability vector $\mathbf{q}$.
- Compute the Cross Entropy Loss between $\mathbf{q}$ and the one-hot truth $\mathbf{p}$.
- Minimize the loss to maximize the predicted probability of the correct class.
Loss Function
The Math
Now the loss for a training sample $\mathbf{x}$ in class $c$ is given by

$e(\mathbf{x}) = -\log q_c = -\log \frac{e^{\mathbf{w}_c^\top \mathbf{x}}}{\sum_{j=1}^{K} e^{\mathbf{w}_j^\top \mathbf{x}}}$

and, summed over the dataset,

$E(W) = -\sum_{n=1}^{N} \sum_{k=1}^{K} \mathbb{1}[y_n = k] \, \log \frac{e^{\mathbf{w}_k^\top \mathbf{x}_n}}{\sum_{j=1}^{K} e^{\mathbf{w}_j^\top \mathbf{x}_n}}$

where:
- This is the form for when you need the loss of multiple samples.
- The indicator $\mathbb{1}[y_n = k]$ is equal to 1 only for the true class.
- You have the weight matrix $W$ as input and end up with a number (a scalar).
- When you compute the derivative of this scalar with respect to $W$, the output becomes a matrix.
The Meaning
1. The Loss Function (Putting it all together): In the previous slides, we learned that the cross-entropy loss for a single data point collapses to $-\log q_c$, where $c$ is the true class.
- Single Sample: $e(\mathbf{x}) = -\log q_c$. Substituting the prediction $q_c$ with the actual Softmax formula gives $-\log \left( e^{\mathbf{w}_c^\top \mathbf{x}} / \sum_{j} e^{\mathbf{w}_j^\top \mathbf{x}} \right)$.
- Multiple Samples ($N$ samples): To get the total loss for the entire training dataset, we just add up the individual losses for all samples.
    - The formula uses a double summation: $E(W) = -\sum_{n=1}^{N} \sum_{k=1}^{K} \mathbb{1}[y_n = k] \log q_{n,k}$.
    - The term $\mathbb{1}[y_n = k]$ is called an Indicator Function. It acts exactly like your one-hot vector: it equals $1$ if sample $n$'s true class is $k$, and $0$ otherwise. This mathematically ensures that for every sample, we only calculate the log-loss for its true class.
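The double summation can be sketched on a tiny made-up dataset (the inner loop mirrors the indicator literally, even though only one term per sample survives):

```python
import math

def softmax(s):
    m = max(s)
    exps = [math.exp(v - m) for v in s]
    total = sum(exps)
    return [v / total for v in exps]

def total_loss(W, X, y, K):
    """E(W) = -sum_n sum_k 1[y_n == k] * log(q_{n,k})."""
    E = 0.0
    for x_n, y_n in zip(X, y):
        q = softmax([sum(wi * xi for wi, xi in zip(wk, x_n)) for wk in W])
        for k in range(K):
            indicator = 1 if y_n == k else 0  # the indicator 1[y_n = k]
            if indicator:
                E -= math.log(q[k])           # only the true class contributes
    return E

# Tiny made-up dataset: 2 samples, 2 features, 3 classes
W = [[0.1, 0.2], [0.0, -0.1], [0.3, 0.0]]
X = [[1.0, 0.5], [1.0, -1.0]]
y = [0, 2]
E = total_loss(W, X, y, K=3)
print(E)  # a single positive number summarizing the whole dataset
```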
Derivative of Loss Function
The notes (Logistic Regression: From Binary to Multi-Class) contain the details of the derivative of the cross entropy loss function, which is necessary for your homework. All you need are:
- Univariate calculus
- Chain rule
Taking the Derivative (The Calculus Note) To train the model, we need to use Gradient Descent, which means taking the derivative of this giant loss function to find the slope.
- Clarifying your note: You wrote "You have the w matrix as input and end up with a number. When you compute the derivative, this output (a number) becomes a matrix." You are absolutely correct! The loss function takes in a matrix of weights ($W$) and calculates a single number (a scalar representing the total error). When you take the derivative of a single number with respect to a matrix, the result is a matrix of slopes (the gradient). This gradient matrix tells the computer exactly how to tweak every single weight simultaneously to decrease the loss.
- To calculate this for your homework, you have to use the Chain Rule (working from the outside of the log function, to the inside of the softmax fraction, down to the dot product).
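The derivation itself is left for the homework; for reference, the standard chain-rule result for this loss is, per sample, $\partial e / \partial \mathbf{w}_k = (q_k - \mathbb{1}[k = c])\,\mathbf{x}$. The sketch below (made-up numbers) checks that claim against a finite difference:

```python
import math

def softmax(s):
    m = max(s)
    exps = [math.exp(v - m) for v in s]
    total = sum(exps)
    return [v / total for v in exps]

def loss(W, x, c):
    """Cross entropy loss of one sample x with true class c."""
    q = softmax([sum(wi * xi for wi, xi in zip(wk, x)) for wk in W])
    return -math.log(q[c])

# One made-up sample with 3 classes; true class is c = 1
W = [[0.2, -0.1], [0.5, 0.3], [-0.4, 0.8]]
x, c = [1.0, 2.0], 1

# Chain-rule result: dE/dw_k = (q_k - 1[k == c]) * x, one row per class
q = softmax([sum(wi * xi for wi, xi in zip(wk, x)) for wk in W])
grad = [[(q[k] - (1 if k == c else 0)) * xi for xi in x] for k in range(3)]

# Sanity-check one entry with a finite difference
eps = 1e-6
W_pert = [row[:] for row in W]
W_pert[0][1] += eps
numeric = (loss(W_pert, x, c) - loss(W, x, c)) / eps
print(abs(numeric - grad[0][1]) < 1e-4)  # True
```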
Shift-invariance in Parameters
The Math
The softmax function in multi-class LR has an invariance property when shifting the parameters: given the weights $\mathbf{w}_1, \dots, \mathbf{w}_K$, if we subtract the same vector $\mathbf{u}$ from each of them, the output of the softmax function will remain the same.
Notes:
- If we subtract a $\mathbf{u}$ from each of the K weights $\mathbf{w}_k$
    - ($\mathbf{u}$ is a vector of the same dimension as each of the $\mathbf{w}_k$)
- The outputs of softmax will be equivalent.
- The reason why the output is K − 1 dimensional is because of normalization?
Proof
To prove this, let us denote the shifted weights $\tilde{\mathbf{w}}_k = \mathbf{w}_k - \mathbf{u}$. Then
$\frac{e^{\tilde{\mathbf{w}}_k^\top \mathbf{x}}}{\sum_{j=1}^{K} e^{\tilde{\mathbf{w}}_j^\top \mathbf{x}}} = \frac{e^{\mathbf{w}_k^\top \mathbf{x}} \, e^{-\mathbf{u}^\top \mathbf{x}}}{e^{-\mathbf{u}^\top \mathbf{x}} \sum_{j=1}^{K} e^{\mathbf{w}_j^\top \mathbf{x}}} = \frac{e^{\mathbf{w}_k^\top \mathbf{x}}}{\sum_{j=1}^{K} e^{\mathbf{w}_j^\top \mathbf{x}}},$
which completes the proof.
The Meaning
1. Shift-Invariance in Parameters (A Cool Math Trick): This is a fascinating and highly useful property of the Softmax function. It states that if you take every single weight vector, for every single class, and subtract the exact same vector $\mathbf{u}$ from each one, the output probabilities do not change at all.
2. Walking through the Proof: The algebra perfectly shows why this happens:
- Let's replace our old weights $\mathbf{w}_k$ with our shifted weights: $\mathbf{w}_k - \mathbf{u}$.
- Plug this into the Softmax numerator: $e^{(\mathbf{w}_k - \mathbf{u})^\top \mathbf{x}}$.
- Using vector algebra, we distribute the $\mathbf{x}$: $e^{\mathbf{w}_k^\top \mathbf{x} - \mathbf{u}^\top \mathbf{x}}$.
- Using the rules of exponents ($e^{a-b} = e^a e^{-b}$), we can split this into two parts multiplied together: $e^{\mathbf{w}_k^\top \mathbf{x}} \cdot e^{-\mathbf{u}^\top \mathbf{x}}$.
- The Magic: If we do this for the denominator as well, notice that the term $e^{-\mathbf{u}^\top \mathbf{x}}$ has absolutely no $k$ in it. It is a constant, identical factor for every single class.
- Because it is identical for every item in the denominator's sum, we can factor it completely out of the summation.
- Now we have $e^{-\mathbf{u}^\top \mathbf{x}}$ on the top and $e^{-\mathbf{u}^\top \mathbf{x}}$ on the bottom. They perfectly cancel each other out, leaving us with the exact original Softmax formula!
3. Answering your Note's Question: You asked: "The reason why the output is K - 1 dimension is because of normalization?" Yes, exactly! Because the probabilities must normalize (sum to 1), knowing the probabilities of any K−1 classes automatically determines the last one. Shift-invariance lets us exploit this: choose $\mathbf{u} = \mathbf{w}_K$.
- The new weights for the last class become $\mathbf{w}_K - \mathbf{w}_K = \mathbf{0}$ (a vector of all zeros).
- This proves that we don't actually need to train K different weight vectors. We only need to train K−1 vectors, because we can always just lock the last class's weight vector to $\mathbf{0}$ and the math will still perfectly work.
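Both claims can be checked numerically, choosing $\mathbf{u} = \mathbf{w}_3$ (the weights and input below are made up):

```python
import math

def softmax(s):
    m = max(s)
    exps = [math.exp(v - m) for v in s]
    total = sum(exps)
    return [v / total for v in exps]

w = [[0.5, 1.0], [-0.3, 0.2], [0.9, -0.7]]  # made-up weights, K = 3
u = w[2]                                     # shift by w_3: last class gets zeroed
x = [1.0, 2.0]

shifted = [[wi - ui for wi, ui in zip(wk, u)] for wk in w]
print(shifted[2])  # [0.0, 0.0] -- the last class's weights are "locked" to zero

before = softmax([sum(wi * xi for wi, xi in zip(wk, x)) for wk in w])
after  = softmax([sum(wi * xi for wi, xi in zip(wk, x)) for wk in shifted])
print(all(abs(b - a) < 1e-12 for b, a in zip(before, after)))  # True
```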
Equivalence to Sigmoid
Once we have proved shift-invariance, we are able to show that when K = 2, the softmax-based multi-class LR is equivalent to the sigmoid-based binary LR. In particular, the hypotheses of the two LRs are equivalent.
Proof
For K = 2, use shift-invariance with $\mathbf{u} = \mathbf{w}_2$ and define $\mathbf{w} = \mathbf{w}_1 - \mathbf{w}_2$. Then
$P(y = 1 \mid \mathbf{x}) = \frac{e^{\mathbf{w}_1^\top \mathbf{x}}}{e^{\mathbf{w}_1^\top \mathbf{x}} + e^{\mathbf{w}_2^\top \mathbf{x}}} = \frac{e^{(\mathbf{w}_1 - \mathbf{w}_2)^\top \mathbf{x}}}{e^{(\mathbf{w}_1 - \mathbf{w}_2)^\top \mathbf{x}} + e^{\mathbf{0}^\top \mathbf{x}}} = \frac{e^{\mathbf{w}^\top \mathbf{x}}}{e^{\mathbf{w}^\top \mathbf{x}} + 1} = \frac{1}{1 + e^{-\mathbf{w}^\top \mathbf{x}}} = \theta(\mathbf{w}^\top \mathbf{x}),$
where $\theta(\cdot)$ is the sigmoid function from binary logistic regression.
Notes:
- Note how the exponent cancels and gives $e^{\mathbf{0}^\top \mathbf{x}} = 1$ for the shifted second class.
- Note the numerator in the third step; it is exactly the binary sigmoid form: $\theta(s) = \frac{e^s}{e^s + 1} = \frac{1}{1 + e^{-s}}$.
- You can do a change of variables: $\mathbf{w} = \mathbf{w}_1 - \mathbf{w}_2$.
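A numerical check of the equivalence, with made-up weight vectors for the two classes:

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

w1, w2 = [0.7, -0.4], [0.1, 0.9]  # made-up weights for class 1 and class 2
x = [1.0, 2.0]

s1 = sum(a * b for a, b in zip(w1, x))
s2 = sum(a * b for a, b in zip(w2, x))

# Softmax probability of class 1 when K = 2
p_softmax = math.exp(s1) / (math.exp(s1) + math.exp(s2))
# Sigmoid after the change of variables w = w1 - w2
p_sigmoid = sigmoid(s1 - s2)
print(abs(p_softmax - p_sigmoid) < 1e-12)  # True
```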
Equivalence of Loss Function
- Now we show that minimizing the logistic regression loss is equivalent to minimizing the cross-entropy loss with binary outcomes.
- The equivalence between logistic regression loss and the cross-entropy loss, as shown below, proves that we always obtain identical weights w by minimizing the two losses. The equivalence between the losses, together with the equivalence between sigmoid and softmax, leads to the conclusion that the binary logistic regression is a particular case of multi-class logistic regression when K= 2.
Proof
With $y_n \in \{+1, -1\}$ and $h(\mathbf{x}) = \theta(\mathbf{w}^\top \mathbf{x})$, the two cases are
$P(y = +1 \mid \mathbf{x}) = \theta(\mathbf{w}^\top \mathbf{x}) = h(\mathbf{x})$ and $P(y = -1 \mid \mathbf{x}) = \theta(-\mathbf{w}^\top \mathbf{x}) = 1 - h(\mathbf{x})$,
so the per-sample logistic loss satisfies
$\ln\left(1 + e^{-y_n \mathbf{w}^\top \mathbf{x}_n}\right) = -\ln \theta(y_n \mathbf{w}^\top \mathbf{x}_n) = \begin{cases} -\ln h(\mathbf{x}_n) & y_n = +1 \\ -\ln\left(1 - h(\mathbf{x}_n)\right) & y_n = -1 \end{cases}$
which is exactly the cross-entropy loss with binary outcomes,
where $\theta(s) = \frac{1}{1 + e^{-s}}$ is the sigmoid.
Notes:
- Remember: $\theta(-s) = 1 - \theta(s)$.
- In this case $y_n \in \{+1, -1\}$.
- Then we write the two cases separately:
    - We want to write the +1 and -1 cases separately; this means $P(y = +1 \mid \mathbf{x}) = \theta(\mathbf{w}^\top \mathbf{x})$ and $P(y = -1 \mid \mathbf{x}) = \theta(-\mathbf{w}^\top \mathbf{x})$.
- Then you simply rewrite $\theta(\mathbf{w}^\top \mathbf{x})$ as the prediction function $h(\mathbf{x})$. This is just the prediction represented as a function.
    - Now you can see a pattern and can identify input from output; that is how you get to the last two forms.
- Finally you just need to compute the cross entropy loss.
    - Note that the definition of the cross entropy loss includes a negative sign; that is why there is no negative on the second log: it became positive when combined with the negative out front.
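A numerical check that the per-sample logistic loss $\ln(1 + e^{-y\,\mathbf{w}^\top \mathbf{x}})$ matches the binary cross entropy for both label cases (weights and sample are made up):

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

w = [0.3, -0.8]  # made-up weight vector
x = [1.0, 1.5]   # made-up sample
s = sum(a * b for a, b in zip(w, x))  # the signal w . x

ok = []
for y in (+1, -1):
    logistic = math.log(1.0 + math.exp(-y * s))  # logistic regression loss
    p = 1 if y == +1 else 0                      # binary "one-hot" truth
    h = sigmoid(s)                               # the prediction h(x)
    cross_ent = -(p * math.log(h) + (1 - p) * math.log(1.0 - h))
    ok.append(abs(logistic - cross_ent) < 1e-12)
print(ok)  # [True, True]
```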
Homework tip:
- You need to take the derivative of E(w) with respect to a single $\mathbf{w}_k$; in this case you will consider all other $\mathbf{w}_j$'s as constants.
- The full derivative (the gradient) will be a matrix.
- The termination condition in your code therefore checks a matrix (e.g., that the norm of the gradient matrix is small enough).
Learning Invariant Representations
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260129103048.png)
Notes:
- If you want to classify images, one challenge is that if an image is rotated, its pixels get completely shuffled, and this will confuse the model.
    - In the vanilla version of convolutional networks, your feature vector $\mathbf{x}$ changes if you rotate the same image by some degrees.
- We need some transformations so that we can transform each of these images into a vector that remains the same if the image is turned around.
- Then, once we have a fixed-length vector $\mathbf{x}$, all we need to do to classify it is create a layer with K units and connect every element of $\mathbf{x}$ to every unit.
    - If you have 1000 classes, you get a 1000-dimensional vector.
- Then the next layer is the softmax, which gives you a probability for each of the K classes.
- Then on top, during training, you stack a cross entropy loss.
From Logistic Regression to Deep Learning
How to learn x automatically?
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260129103544.png)
- We have some input (e.g. images, text, natural language).
- How do we compute the fixed-length vector $\mathbf{x}$?
- The next layer computes the K linear signals $s_k = \mathbf{w}_k^\top \mathbf{x}$.
    - Each unit in the signal vector is connected to each unit in the $\mathbf{x}$ vector. (Every unit is connected to every unit.)
- Then softmax (to get a predicted probability)
- Then training (cross entropy loss)