02 - Logistic Regression For Binary Classification

Class: CSCE-421


Notes:

Digits Data

Pasted image 20260120093107.png350

Each digit is a 16 × 16 image.

[-1 -1 -1 -1 -1 -1 -1 -0.63 0.86 -0.17 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -0.99 0.3 1 0.31 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ... -0.94 -1 -1 -1 -1 -1 -1 -1 -0.97 -0.42 0.30 0.82 1 0.48 -0.47 -0.99 -1 -1 -1 -1]
$$\left.\begin{aligned} x &= (1, x_1, \ldots, x_{256}) && \text{input} \\ w &= (w_0, w_1, \ldots, w_{256}) && \text{linear model} \end{aligned}\right\} \dim = 257$$

Image Representations

Pasted image 20260120093331.png600

Transformations on Images

Pasted image 20260120093804.png600

Learning Invariant Representations

Pasted image 20260120094522.png400

1. The Human vs. Computer Perspective If you look at a picture of a number "8" and then turn that picture upside down or shift it to the left, you instantly still recognize it as an "8". Humans naturally understand that rotating or moving an object doesn't change what the object actually is. We want our machine learning models to have this same "invariance" (meaning the prediction does not vary when the image is slightly transformed).

2. The Problem with Raw Data To a computer, an image is not a shape; it is just a giant grid of numbers representing pixel colors or darkness. To feed an image into our models, we usually flatten this grid into one long 1D vector (our input x).

Here is the problem: if you rotate the image by 90 degrees, the actual object is the same, but the order of the pixels in that long vector gets completely shuffled. Because our model makes predictions by multiplying specific weights by specific pixel locations (wTx), shuffling the pixels completely changes the math. The dot product changes, and the model might suddenly predict it is a different number.
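This pixel-shuffling effect is easy to demonstrate numerically. A minimal sketch (synthetic pixel values and weights, not real digit data): rotating the flattened image is just a fixed permutation of its 256 entries, yet the linear signal $w^Tx$ changes completely.

```python
import numpy as np

rng = np.random.default_rng(0)

# A flattened 16x16 "image" with pixel intensities in [-1, 1], plus weights
x = rng.uniform(-1.0, 1.0, size=256)
w = rng.normal(size=256)

# Rotating the image 90 degrees is just a fixed permutation of the 256 pixels
x_rotated = np.rot90(x.reshape(16, 16)).reshape(256)

# Same object, same weights -- but the linear signal w^T x changes completely
print(w @ x, w @ x_rotated)
```

The rotated vector contains exactly the same 256 values, only reordered, which is why the dot product against position-specific weights comes out different.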

3. The Solution: Invariant Representations To fix this, the slide says "your x should somehow remain the same". But how can x remain the same if the pixels moved?

The answer is that we should not use the raw pixels directly as our x. Instead, we need to design a mathematical transformation (a feature extractor) that processes the image and outputs a new vector of traits that do not change when the image rotates.

A great example from your class: Think about the "Intensity" and "Symmetry" features mentioned earlier in your notes for digit classification.

Intensity and Symmetry Features

Feature: an important property of the input that you think is useful for classification.

Pasted image 20260120094850.png400

$$\left.\begin{aligned} x &= (1, x_1, x_2) && \text{input} \\ w &= (w_0, w_1, w_2) && \text{linear model} \end{aligned}\right\} \dim = 3$$
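A sketch of how a 256-pixel image could collapse into this dim-3 input. The exact feature definitions here are an assumption (a common convention, not necessarily the course's formulas): intensity as the average pixel value, symmetry as the negative mean absolute difference from the mirrored image.

```python
import numpy as np

def intensity_symmetry(img):
    """Two digit features (a common convention, not the course's exact ones):
    intensity = average pixel value ("amount of ink");
    symmetry  = -mean |img - mirror(img)|  (0 means perfectly symmetric)."""
    intensity = img.mean()
    symmetry = -np.abs(img - np.fliplr(img)).mean()
    return intensity, symmetry

# A crude 16x16 "digit": background -1, one bright vertical stroke
img = np.full((16, 16), -1.0)
img[:, 8] = 1.0

inten, sym = intensity_symmetry(img)
x = np.array([1.0, inten, sym])   # the dim-3 input (1, x1, x2) from the notes
print(x)
```

Whatever the precise definitions, the point is that these two numbers change far less under small shifts and rotations than the raw 256-pixel vector does.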

Linear Classification and Regression

 The linear signal: $s = w^T x$

00 - TAMU Brain/6th Semester (Spring 26)/CSCE-421/Ex1/Visual Aids/image-1.png650

Logistic Regression: Predict Probabilities

The Math

Will someone have a heart attack over the next year?

age 62 years
blood sugar 120 mg/dL
HDL 50
LDL 120
Mass 190lbs
Height 5′10′′
... ...

Logistic Regression: Predict the probability of heart attack: $y \in [0, 1]$

$$h(x) = \theta\left(\sum_{i=0}^{d} w_i x_i\right) = \theta(w^T x), \qquad \theta(s) = \frac{1}{1 + e^{-s}}$$

Pasted image 20260120095341.png250

The Meaning

1. The Goal: Predicting a Probability So far, you have learned how to make a hard "Yes/No" decision (Linear Classification) and how to predict an exact continuous number (Linear Regression). Logistic Regression sits perfectly in the middle. We want to predict the probability of a binary event happening. For example, instead of just saying "Yes, this person will have a heart attack," we want the model to say, "There is an 85% chance this person will have a heart attack." Because it is a probability, the output must strictly be a number between 0 and 1.

2. The Magic Function: θ(s) To achieve this, we start with the exact same raw linear score we used before: $s = w^T x$ (the sum of the inputs multiplied by their weights). However, a raw score can be any number from $-\infty$ to $+\infty$.

To force this raw score to act like a probability, we pass it through a mathematical "squashing" function called the Logistic Function (or Sigmoid function, denoted as θ). The formula is: $\theta(s) = \frac{1}{1 + e^{-s}}$.

3. Clarifying a typo in your notes You wrote: "Theta is the number of inputs/outputs". This is a slight misunderstanding. θ is the mathematical function (the S-curve) that squashes the signal. The number of inputs is d (the dimension of your input vector x).

4. The Labels: Why use +1 and −1? Your notes emphasize using $y = \pm 1$ instead of 0 and 1 for the training data labels. Even though our prediction is a probability between 0 and 1, the historical data we use to train the model consists of absolute facts: a person either did have a heart attack ($+1$) or did not ($-1$). Using $+1$ and $-1$ makes the math much cleaner for the learning algorithm, especially when calculating the error (which you will see in the next slides).

5. The Properties of θ
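The key properties of θ can be checked numerically. A minimal sketch: θ maps any real signal into (0, 1), gives 0.5 at zero signal, and satisfies the symmetry θ(−s) = 1 − θ(s) used later to combine both labels into one formula.

```python
import math

def theta(s: float) -> float:
    """The logistic (sigmoid) function: squashes any real s into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-s))

print(theta(0))                   # 0.5: zero signal means a 50/50 guess
print(theta(100), theta(-100))    # saturates toward 1 and 0 for extreme signals
print(theta(-3), 1 - theta(3))    # symmetry: theta(-s) = 1 - theta(s)
```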

The Data is Still Binary, ±1

$$D = (x_1, y_1 = \pm 1), \ldots, (x_N, y_N = \pm 1)$$

$x_n$: a person's health information
$y_n = \pm 1$: did they have a heart attack or not

Setting

We are trying to learn the target function

$$f(x) = P[y = +1 \mid x]$$

The data does not give us the value of f explicitly. Rather, it gives us samples generated by this probability:

$$P(y = +1 \mid x) = f(x), \qquad P(y = -1 \mid x) = 1 - f(x)$$

To learn from such data, we need to define a proper error measure that gauges how close a given hypothesis h is to f in terms of these examples.


What is the data telling us? In Logistic Regression, we are trying to learn a target function f(x) that represents a true probability, specifically: "What is the true probability that y=+1 given the input x?".

Here is the fundamental challenge: the universe knows the exact probability (say, an 80% chance of a heart attack), but the data does not give us that percentage. We do not have a dataset that says [Patient A: 80%]. Instead, we only see the final outcome of that probability. We see [Patient A: +1 (Had a heart attack)] or [Patient B: -1 (Did not)]. We have to reverse-engineer the hidden probability using only these hard "+1 / -1" examples.
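A tiny simulation makes this concrete (the 0.8 probability is a hypothetical value, not from the notes): the dataset only ever contains ±1 outcomes, yet with enough samples the hidden probability shows through as a frequency.

```python
import numpy as np

rng = np.random.default_rng(5)

f_x = 0.8      # hypothetical hidden probability P[y=+1 | x] for one fixed x
N = 10000

# The dataset never contains 0.8; it only contains outcomes drawn from it
y = np.where(rng.uniform(size=N) < f_x, +1, -1)

# With enough samples, the hidden probability can be reverse-engineered
print(np.mean(y == +1))   # close to 0.8
```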

What Makes an h Good?

The Math

'Fitting' the data means finding a good $h$.

$h$ is good if:

$$h(x_n) \approx 1 \text{ whenever } y_n = +1; \qquad h(x_n) \approx 0 \text{ whenever } y_n = -1.$$

A simple error measure that captures this:

$$E_{\text{in}}(h) = \frac{1}{N} \sum_{n=1}^{N} \left(h(x_n) - \frac{1}{2}(1 + y_n)\right)^2$$
The Meaning

1. What Makes a Hypothesis (h) Good? Since our model h(x) is trying to predict this probability, we want its output (a number between 0 and 1) to align with reality.

2. The Simple Error Measure (The Math Trick) To train the model, we need an equation that measures how far off our predictions are from the ideal 1 or 0. The slide introduces this preliminary error formula: $$ E_{\text{in}}(h)=\frac{1}{N} \sum_{n=1}^N\left(h\left(\mathrm{x}_n\right)-\frac{1}{2}\left(1+y_n\right)\right)^2 $$ This looks confusing, but it contains a brilliant little math trick: $\frac{1}{2}(1+y_n)$. Because our dataset labels are strictly $y=+1$ or $y=-1$, plugging them in gives $\frac{1}{2}(1+1)=1$ and $\frac{1}{2}(1-1)=0$.

So, this formula perfectly converts our "+1 / -1" labels into the "1 or 0" probabilities we want our model to match! The error function simply takes our prediction h(x), subtracts the ideal target (1 or 0), squares the difference, and averages it. (Note: While this formula works intuitively, the professor will introduce a much better one called "Cross-Entropy" in the next slide, but this serves to build your understanding!)
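The label-to-target trick and the resulting error measure can be sketched in a few lines (the prediction vectors are made-up illustration values):

```python
import numpy as np

y = np.array([+1, -1, +1])

# The trick: (1/2)(1 + y) maps label +1 -> target 1 and label -1 -> target 0
targets = 0.5 * (1 + y)
print(targets)                     # [1. 0. 1.]

def simple_error(h_vals, y):
    """E_in(h) = (1/N) * sum (h(x_n) - (1/2)(1 + y_n))^2."""
    return np.mean((h_vals - 0.5 * (1 + y)) ** 2)

h_good = np.array([0.9, 0.1, 0.8])   # probabilities near the ideal targets
h_bad = np.array([0.1, 0.9, 0.2])    # probabilities far from the targets
print(simple_error(h_good, y) < simple_error(h_bad, y))   # True
```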

3. Why we can't solve it like Linear Regression Your notes point out a major difference between Linear and Logistic regression: Linear Regression has an analytic solution (the pseudoinverse, from setting the gradient to zero), but in Logistic Regression the sigmoid makes the equations nonlinear in $w$, so there is no closed-form solution to jump to.

4. The Iterative Solution Because we cannot jump straight to the answer, we must start with random weights (w) and use an iterative procedure (taking tiny steps to adjust the weights over time). Your notes at the bottom translate what these adjustments are trying to achieve.

The Cross Entropy Error Measure

The Math
$$E_{\text{in}}(w) = \frac{1}{N} \sum_{n=1}^{N} \ln\left(1 + e^{-y_n w^T x_n}\right)$$

It looks complicated and ugly (ln, e^(·), ...),

But,

Verify:

The Meaning

1. The "Ugly" Formula that Works Beautifully In the previous slide, we looked at a preliminary way to measure mistakes. Now, the professor introduces the actual formula used in the real world to train Logistic Regression: the Cross-Entropy Error (often just called Log Loss).

The formula is: $$ E_{\text {in }}(w)=\frac{1}{N} \sum_{n=1}^N \ln \left(1+e^{-y_n \cdot w^{\mathrm{T}} x_n}\right) $$ (And you are completely correct to note that the last x must be xn because we are evaluating it for each specific person/data point n in the dataset!)

Let's break down why this formula is actually very intuitive:

2. Why is it "Mathematically Friendly"? Even though the equation looks like an absolute nightmare of logs and exponents, it has a beautiful property: it is strictly convex. Imagine dropping a marble into a bowl. No matter where you drop it, it will always roll down to the exact same spot at the very bottom. The Cross-Entropy error forms a perfect mathematical bowl (only one valley). This means when we use our "iterative procedure" to tweak the weights step-by-step, the computer is guaranteed to eventually find the absolute best possible weights (the bottom of the valley) without getting stuck.

3. Verification (Sanity Check) The slide finishes by proving that minimizing this formula naturally forces the model to do exactly what we want it to do:
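A quick numerical sanity check of the single-point cross-entropy term: when the signal $s = w^Tx$ confidently agrees with the label, the error is nearly zero; when it confidently disagrees, the error is large (the signal values are illustrative).

```python
import numpy as np

def ce_point(y, s):
    """Single-point cross-entropy error ln(1 + e^{-y s}), where s = w^T x."""
    return np.log1p(np.exp(-y * s))

# Confident and correct (y and s agree, |s| large): error ~ 0
# Confident and wrong  (y and s disagree):          error is huge
print(ce_point(+1, 10.0))
print(ce_point(+1, -10.0))
```

Note the symmetry in $y \cdot s$: predicting $-1$ with signal $-3$ is exactly as good as predicting $+1$ with signal $+3$.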

The Probabilistic Interpretation

Suppose that $h(x) = \theta(w^T x)$ closely captures $P[+1 \mid x]$:

$$P(y \mid x) = \begin{cases} \theta(w^T x) & \text{for } y = +1 \\ 1 - \theta(w^T x) & \text{for } y = -1 \end{cases}$$

Given:

$$\theta(s) = \frac{1}{1 + e^{-s}}$$

we have

$$\theta(-s) = 1 - \theta(s)$$

So, more compactly,

$$P(y \mid x) = \theta(y \cdot w^T x)$$

The Likelihood

The Math
$$P(y \mid x) = \theta(y \cdot w^T x)$$

- "The product of each individual probability"
- Why do we use product instead of summation?
  - We want each of these individual probabilities to be large, so we maximize their product
  - A summation won't work; the product ties the data points together and provides the necessary weight (a kind of balance between the data points)
- Essentially we want to adjust $w$ so that the product is as large as possible
- Maximizing the likelihood is the same as maximizing the product of individual probabilities
- The likelihood measures the probability that the data were generated if $f$ were $h$.

The Meaning

1. The Magic Single Formula: $P(y|x) = \theta(y \cdot w^Tx)$ In the previous slides, we established that we want to predict a probability. If a person had a heart attack ($y = +1$), the probability is $\theta(w^Tx)$. If they did not ($y = -1$), the probability is $1 - \theta(w^Tx)$. Because of the neat symmetry of the logistic function, $1 - \theta(w^Tx)$ is exactly equal to $\theta(-w^Tx)$. This allows us to combine both cases into one elegant formula: $P(y|x) = \theta(y \cdot w^Tx)$.

- If $y = +1$, it becomes $\theta(w^Tx)$.
- If $y = -1$, it becomes $\theta(-w^Tx)$.

This is what your note means by "Probability of y given x written in the same formula". It mathematically expresses: "The probability that the model's prediction matches the true reality."

2. Answering your question: Why use the product instead of summation? Your notes guess that multiplication provides a "necessary weight" or balance. While creative, the real reason is a fundamental rule of statistics: Independence.

- Imagine flipping a coin. The chance of getting Heads is 50% (0.5). What is the chance of getting Heads twice in a row? You multiply them: $0.5 \times 0.5 = 0.25$.
- In machine learning, we assume every patient in our dataset is entirely independent of the others.

Therefore, to find the total probability of our entire dataset occurring, we must multiply the individual probabilities of every single person's outcome together. This total multiplied probability is called the Likelihood.

Maximizing The Likelihood

The Math

$$\max_w \prod_{n=1}^{N} P(y_n \mid x_n) \Leftrightarrow \max_w \ln\left(\prod_{n=1}^{N} P(y_n \mid x_n)\right)$$

Note: $\max f(x)$ is equivalent (not equal) to $\max \ln f(x)$ because $\log$ is monotonically increasing.

$$\equiv \max_w \sum_{n=1}^{N} \ln P(y_n \mid x_n)$$

Note: the $\log$ of a product equals the summation of $\log$s (the $\log$ moved inside). This is because $\ln(ab) = \ln a + \ln b$ (log properties). The $\log$ is also useful because a change in input does not mean a great change in the output (for large inputs more than small ones); this is called the saturating property of logs. This is equivalent to the line above because the first is the product and the second is the summation.

$$\equiv \min_w \frac{1}{N} \sum_{n=1}^{N} -\ln P(y_n \mid x_n) \equiv \min_w \frac{1}{N} \sum_{n=1}^{N} \ln \frac{1}{P(y_n \mid x_n)} = \min_w \frac{1}{N} \sum_{n=1}^{N} \ln \frac{1}{\theta(y_n w^T x_n)} = \min_w \frac{1}{N} \sum_{n=1}^{N} \ln\left(1 + e^{-y_n w^T x_n}\right)$$

$$E_{\text{in}}(w) = \frac{1}{N} \sum_{n=1}^{N} \ln\left(1 + e^{-y_n w^T x_n}\right)$$
The Meaning

3. Maximizing the Likelihood (The Goal) We want to find the weights (w) that make our dataset as highly probable to occur as possible. This means we want to maximize the Likelihood product. However, multiplying thousands of probabilities (which are decimals between 0 and 1) together will result in a microscopic number (like 0.0000000001). A computer cannot handle numbers this small and will round them to zero.
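The underflow problem is easy to reproduce (synthetic signals and labels, only for illustration): the raw product of thousands of per-point probabilities collapses to exactly 0.0 in floating point, while the sum of logs stays finite.

```python
import numpy as np

def theta(s):
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(1)

# Synthetic per-point signals s_n = w^T x_n and labels that mostly agree
s = rng.normal(loc=1.0, size=5000)
y = np.ones(5000)

per_point = theta(y * s)            # P(y_n | x_n) = theta(y_n * w^T x_n)
print(np.prod(per_point))           # raw likelihood: underflows to 0.0
print(np.sum(np.log(per_point)))    # log-likelihood: finite and usable
```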

4. The Math Derivation (Step-by-Step) To fix this, we apply a brilliant mathematical trick. We take the Natural Logarithm (ln) of the whole equation.

Here is how we transform the equation step-by-step:


Cross Entropy View

$$CE(g, q) = -\sum_i g_i \log q_i$$

A Neural Network View

Pasted image 20260122100430.png600

How To Minimize Ein(w)

Recall:

$$E_{\text{in}}(w) = \frac{1}{N} \sum_{n=1}^{N} \ln\left(1 + e^{-y_n w^T x_n}\right)$$

Regression - pseudoinverse (analytic), from solving $\nabla_w E_{\text{in}}(w) = 0$.

Logistic Regression - analytic won't work.

Numerically/iteratively set $\nabla_w E_{\text{in}}(w) \to 0$.

Finding The Best Weights - Hill Descent

Ball on a complicated hilly terrain rolls down to a local valley (a local minimum).

Pasted image 20260122103143.png200

Questions:

Our Ein Has Only One Valley

Pasted image 20260122103509.png400

How to "Roll Down"?

The Math

Assume you are at weights $w(t)$ and you take a step of size $\eta$ in the direction $\hat{v}$.

$$w(t+1) = w(t) + \eta \hat{v}$$

  1. $\eta$: a scalar step size
  2. $\hat{v}$: a unit vector giving the direction

We get to pick $\hat{v}$ ← what's the best direction to take the step?

Pick $\hat{v}$ to make $E_{\text{in}}(w(t+1))$ as small as possible.

Pasted image 20260127094259.png200


The Meaning

The Setup (Standing on the Hill) Imagine you are standing somewhere on a foggy, hilly landscape. Your goal is to reach the absolute lowest point of the valley, because the "height" of the ground represents your model's error (Ein). The lower you go, the better your model gets.

Because it is foggy, you cannot see the bottom of the valley. You can only feel the slope of the ground directly under your feet. To get to the bottom, you have to take it one step at a time. Every step you take is defined by this formula: $$ \mathrm{w}(t+1) = \mathrm{w}(t) + \eta \hat{\mathrm{v}} $$

The Gradient is the Fastest Way to Roll Down

The Math

The Taylor series expansion of a multivariate function $f(x)$ around a point $a$, up to second order, is

$$f(x) \approx f(a) + \nabla f(a)^T (x - a) + \frac{1}{2}(x - a)^T H_f(a)(x - a)$$

Expanding $E_{\text{in}}(w(t+1))$ around $E_{\text{in}}(w(t))$ gives

$$\Delta E_{\text{in}} = E_{\text{in}}(w(t+1)) - E_{\text{in}}(w(t)) = E_{\text{in}}(w(t) + \eta\hat{v}) - E_{\text{in}}(w(t)) = \eta \nabla E_{\text{in}}(w(t))^T \hat{v} + O(\eta^2) \quad (\text{Taylor's approximation})$$

minimized at $\hat{v} = -\dfrac{\nabla E_{\text{in}}(w(t))}{\|\nabla E_{\text{in}}(w(t))\|}$

The best (steepest) direction to move is the negative gradient:

$$\hat{v} = -\frac{\nabla E_{\text{in}}(w(t))}{\|\nabla E_{\text{in}}(w(t))\|}$$
The Meaning

1. The Goal: Which way do we step? Since we want to go down into the valley, we want our new error, Ein(w(t+1)), to be as small as possible compared to our old error, Ein(w(t)). To figure out the best direction mathematically, the slide uses a Taylor Series Expansion.

Taylor's approximation is a calculus trick that lets you estimate the value of a function near your current location. It tells us that the change in our error (ΔEin) after taking a tiny step is approximately: $$ \Delta E_{in} \approx \eta \nabla E_{in}(w(t))^T \hat{v} $$

2. Finding the Fastest Way Down We want ΔEin to be as negative as possible (we want to drop in height as much as we can). Because η is just a positive step size, the only thing we can control is v^.

In linear algebra, if you have a vector $A$ (the gradient) and a vector $B$ (your direction), their dot product $A^T B$ is most negative when they point in exactly opposite directions.

To make $\hat{v}$ a unit vector (length of 1) pointing straight downhill, we take the negative gradient and divide it by its own length ($\|\nabla E_{\text{in}}\|$): $$ \hat{\mathrm{v}}=-\frac{\nabla E_{\text{in}}(\mathrm{w}(t))}{\left\|\nabla E_{\text{in}}(\mathrm{w}(t))\right\|} $$
3. Important Correction to your Notes! At the bottom of your notes, you wrote: "Given Ein a vector of input produces a vector of output. To make this clear: Ein takes a vector input, and outputs one vector. This Ein is the vector loss."

This is incorrect, and it is crucial to fix this for your exam: $E_{\text{in}}$ takes a vector input $w$ but outputs a single scalar (the average error). It is the gradient, $\nabla E_{\text{in}}$, that takes a vector input and outputs a vector (the direction of steepest ascent).
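The steepest-descent claim can be verified numerically. A sketch using a simple convex quadratic as a stand-in for $E_{\text{in}}$ (the surface and point are made up for illustration): no random unit direction decreases the error faster than the normalized negative gradient.

```python
import numpy as np

def E(w):
    """A smooth stand-in for E_in: a simple convex quadratic surface."""
    return 0.5 * w @ w + np.array([1.0, -2.0]) @ w

def grad_E(w):
    return w + np.array([1.0, -2.0])

w = np.array([3.0, 1.0])
g = grad_E(w)
v_star = -g / np.linalg.norm(g)          # unit vector along the negative gradient
eta = 1e-4

drop_star = E(w + eta * v_star) - E(w)   # change in error along -gradient

# No random unit direction drops the error faster than the negative gradient
rng = np.random.default_rng(2)
for _ in range(200):
    v = rng.normal(size=2)
    v /= np.linalg.norm(v)
    assert E(w + eta * v) - E(w) >= drop_star - 1e-12
print(drop_star)   # negative: stepping against the gradient goes downhill
```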

"Rolling Down ≡ Iterating the Negative Gradient"

Pasted image 20260127094740.png450

The 'Goldilocks' Step Size

Pasted image 20260122104847.png450


The "Goldilocks" Problem: How big of a step do we take? In the previous slide, we figured out the direction to step to get down the hill (the negative gradient). Now we need to figure out the size of the step, denoted by η (Eta, or the "learning rate").

Fixed Learning Rate Gradient Descent

The Math
$$\eta_t = \eta\, \|\nabla E_{\text{in}}(w(t))\| \to 0 \text{ when closer to the minimum.}$$

$$\eta_t \hat{v} = \eta \|\nabla E_{\text{in}}(w(t))\| \left(-\frac{\nabla E_{\text{in}}(w(t))}{\|\nabla E_{\text{in}}(w(t))\|}\right) = -\eta \nabla E_{\text{in}}(w(t))$$

Pasted image 20260127093606.png400

Gradient descent can minimize any smooth function, for example

$$E_{\text{in}}(w) = \frac{1}{N} \sum_{n=1}^{N} \ln\left(1 + e^{-y_n w^T x_n}\right)$$

(logistic regression)

The Meaning

1. The Brilliant Solution: Let the hill do the work How does the computer know if it is far away or close to the bottom? It looks at the steepness of the slope!

2. The Math Trick (Simplifying the Formula) In the previous slide, we forced our step direction to be a "unit vector" (a vector with a length of exactly 1) by dividing the negative gradient by its own length: $\hat{v} = -\nabla E_{\text{in}} / \|\nabla E_{\text{in}}\|$.

When we combine our new dynamic step size (ηt) with this unit vector (v^), something beautiful happens. We multiply them together to get our final step: $$ \text{Step} = \eta_t \cdot \hat{v} = \left(\eta \cdot ||\nabla E_{in}|| \right) \cdot \left( - \frac{\nabla E_{in}}{||\nabla E_{in}||} \right) $$
Because $\|\nabla E_{\text{in}}\|$ appears in both the numerator and the denominator, the two completely cancel each other out!

3. The Final Update Rule Once the lengths cancel out, we are left with a beautifully simple formula for our step: ηEin. This means we no longer have to worry about calculating unit vectors or manually shrinking the step size. We just pick a single, fixed constant number for η (often around 0.1), and subtract η multiplied by the raw gradient. The final standard update loop for Gradient Descent is: $$ w(t+1) = w(t) - \eta \nabla E_{in}(w(t)) $$
4. Why Logistic Regression? The slide ends by pointing out that this Gradient Descent method can minimize any "smooth" function. We use it for Logistic Regression because its Cross-Entropy error function is perfectly smooth and strictly convex. This means it forms a perfect single bowl shape with only one valley, guaranteeing that our ball will eventually roll down to the absolute best global minimum without getting stuck.
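The full fixed-learning-rate loop can be sketched end to end on synthetic data (the dataset, rate, and iteration count are illustrative choices, not from the notes):

```python
import numpy as np

def cross_entropy(w, X, y):
    """E_in(w) = (1/N) * sum ln(1 + exp(-y_n w^T x_n))."""
    return np.mean(np.log1p(np.exp(-y * (X @ w))))

def gradient(w, X, y):
    """Gradient of E_in: -(1/N) * sum y_n x_n / (1 + exp(y_n w^T x_n))."""
    return -(X.T @ (y / (1.0 + np.exp(y * (X @ w))))) / len(y)

# Synthetic, nearly separable data: x = (1, x1, x2), labels from a noisy rule
rng = np.random.default_rng(3)
N = 200
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, 2))])
y = np.where(X @ np.array([0.5, 2.0, -1.0]) + 0.1 * rng.normal(size=N) > 0, 1.0, -1.0)

w = np.zeros(3)
eta = 0.1                               # fixed learning rate
for t in range(1000):
    w = w - eta * gradient(w, X, y)     # w(t+1) = w(t) - eta * grad E_in(w(t))

# Error drops well below the starting point E_in(0) = ln 2
print(cross_entropy(w, X, y), np.log(2))
```

Because the cross-entropy surface is convex, restarting from any initial $w$ drives the error toward the same minimum.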

Stochastic Gradient Descent (SGD)

A variation of GD that considers only the error on one data point.

$$E_{\text{in}}(w) = \frac{1}{N} \sum_{n=1}^{N} \ln\left(1 + e^{-y_n w^T x_n}\right) = \frac{1}{N} \sum_{n=1}^{N} e(w, x_n, y_n)$$

$$w(t+1) \leftarrow w(t) - \eta \nabla_w e(w, x, y)$$

Logistic Regression:

$$w(t+1) \leftarrow w(t) + y\, x \left(\frac{\eta}{1 + e^{y\, w^T x}}\right)$$

Advantages:

  1. The ’average’ move is the same as GD;
  2. Computation: each step costs a fraction ($1/N$) of a full GD step;
  3. Stochastic: helps escape local minima;
  4. Simple;

Pasted image 20260127100026.png400

Comparison

| Method | Updates Per Iteration | Convergence Speed | Stability | Computation Cost |
| --- | --- | --- | --- | --- |
| Batch GD | After full dataset | Slow | Stable | High |
| Stochastic GD | After each sample | Fast | Noisy (unstable) | Low |
| Mini-Batch GD | After each mini-batch | Moderate | Balanced | Moderate |

Summary and explanation of different GD techniques

1. The Problem with Standard (Batch) Gradient Descent In the previous slide, we learned that Gradient Descent updates the model's weights by calculating the gradient (the multi-dimensional slope) of the error surface. The total error, Ein(w), is the average of the errors across the entire dataset. If you have 1 million patients in your dataset, standard Gradient Descent—often called Batch Gradient Descent (BGD)—must calculate the prediction and the error for all 1 million patients just to take one single step down the hill. This is incredibly slow and computationally expensive.

2. The Solution: Stochastic Gradient Descent (SGD) SGD takes a radically different approach. Instead of looking at the whole dataset, it picks one single random data point (x,y). It calculates the gradient of the error for just that one point, and immediately updates the weights.

3. The Logistic Regression SGD Formula Your slide shows a specific, ready-to-use formula for SGD applied to Logistic Regression: $$ \mathrm{w}(t+1) \leftarrow \mathrm{w}(t)+y_* \mathrm{x}_*\left(\frac{\eta}{1+e^{y_* \mathrm{w}^{\mathrm{T}} \mathrm{x}_*}}\right) $$ Where does this come from? If you take the derivative of the single-point Cross-Entropy error $e=\ln(1+e^{-y w^{\mathrm{T}} x})$, the negative sign of the gradient cancels with the minus in the update rule $w - \eta \nabla e$, leaving you with that exact formula.

4. Why does SGD work? (The Advantages) You might wonder: "If we only look at one point, aren't we going to step in the wrong direction?" Yes, frequently! Stepping based on one point is very noisy and "wiggly". However:

5. The Compromise: Mini-Batch Gradient Descent (MBGD) In reality, picking 1 point at a time (SGD) is too noisy, and picking all points (BGD) is too slow. Mini-Batch GD is the perfect "Goldilocks" compromise used in modern Deep Learning. You divide your dataset into small batches (e.g., 32 or 64 points). You calculate the average gradient of those 32 points, take a step, and move to the next batch. It balances speed and stability perfectly.

6. What is an "Epoch"? This is a crucial term to memorize. An Epoch means the algorithm has looked at every single data point in the entire dataset exactly once.
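SGD and the epoch idea together fit in a short sketch (synthetic separable data; the learning rate, epoch count, and dataset are illustrative assumptions): each epoch is one shuffled pass in which every point triggers exactly one update using the slide's rule.

```python
import numpy as np

def sgd_logistic(X, y, eta=0.1, epochs=5, seed=0):
    """SGD: update on one point at a time; one epoch = one full shuffled pass."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for n in rng.permutation(len(y)):                 # visit every point once
            s = y[n] * (X[n] @ w)
            w = w + eta * y[n] * X[n] / (1.0 + np.exp(s)) # slide's update rule
    return w

rng = np.random.default_rng(4)
N = 300
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, 2))])
y = np.where(X @ np.array([0.0, 1.0, -1.0]) > 0, 1.0, -1.0)

w = sgd_logistic(X, y)
print(np.mean(np.sign(X @ w) == y))   # fraction correctly classified
```

Swapping the inner loop to slices of 32 shuffled indices, averaging their updates, would turn this into the mini-batch variant from the comparison table.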