02 - Logistic Regression For Binary Classification

Class: CSCE-421


Notes:

Digits Data

Pasted image 20260120093107.png350

Each digit is a 16 × 16 image.

[-1 -1 -1 -1 -1 -1 -1 -0.63 0.86 -0.17 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -0.99 0.3 1 0.31 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ... -0.94 -1 -1 -1 -1 -1 -1 -1 -0.97 -0.42 0.30 0.82 1 0.48 -0.47 -0.99 -1 -1 -1 -1]
$$\left.\begin{aligned} x &= (1, x_1, \ldots, x_{256}) && \text{input} \\ w &= (w_0, w_1, \ldots, w_{256}) && \text{linear model} \end{aligned}\right\} \dim = 257$$

Image Representations

Pasted image 20260120093331.png600

Transformations on Images

Pasted image 20260120093804.png600

Learning Invariant Representations

Pasted image 20260120094522.png400

1. The Human vs. Computer Perspective If you look at a picture of a number "8" and then turn that picture upside down or shift it to the left, you instantly still recognize it as an "8". Humans naturally understand that rotating or moving an object doesn't change what the object actually is. We want our machine learning models to have this same "invariance" (meaning the prediction does not vary when the image is slightly transformed).

2. The Problem with Raw Data To a computer, an image is not a shape; it is just a giant grid of numbers representing pixel colors or darkness. To feed an image into our models, we usually flatten this grid into one long 1D vector (our input x).

Here is the problem: if you rotate the image by 90 degrees, the actual object is the same, but the order of the pixels in that long vector gets completely shuffled. Because our model makes predictions by multiplying specific weights by specific pixel locations (wTx), shuffling the pixels completely changes the math. The dot product changes, and the model might suddenly predict it is a different number.
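This pixel-shuffling effect is easy to demonstrate numerically. A minimal sketch (synthetic pixel values and weights, not real digit data): rotating the flattened image is just a fixed permutation of its 256 entries, yet the linear signal $w^Tx$ changes completely.

```python
import numpy as np

rng = np.random.default_rng(0)

# A flattened 16x16 "image" with pixel intensities in [-1, 1], plus weights
x = rng.uniform(-1.0, 1.0, size=256)
w = rng.normal(size=256)

# Rotating the image 90 degrees is just a fixed permutation of the 256 pixels
x_rotated = np.rot90(x.reshape(16, 16)).reshape(256)

# Same object, same weights -- but the linear signal w^T x changes completely
print(w @ x, w @ x_rotated)
```

The rotated vector contains exactly the same 256 values, only reordered, which is why the dot product against position-specific weights comes out different.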

3. The Solution: Invariant Representations To fix this, the slide says "your x should somehow remain the same". But how can x remain the same if the pixels moved?

The answer is that we should not use the raw pixels directly as our x. Instead, we need to design a mathematical transformation (a feature extractor) that processes the image and outputs a new vector of traits that do not change when the image rotates.

A great example from your class: Think about the "Intensity" and "Symmetry" features mentioned earlier in your notes for digit classification.

Intensity and Symmetry Features

Feature: an important property of the input that you think is useful for classification.

Pasted image 20260120094850.png400

$$\left.\begin{aligned} x &= (1, x_1, x_2) && \text{input} \\ w &= (w_0, w_1, w_2) && \text{linear model} \end{aligned}\right\} \dim = 3$$
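A sketch of how a 256-pixel image could collapse into this dim-3 input. The exact feature definitions here are an assumption (a common convention, not necessarily the course's formulas): intensity as the average pixel value, symmetry as the negative mean absolute difference from the mirrored image.

```python
import numpy as np

def intensity_symmetry(img):
    """Two digit features (a common convention, not the course's exact ones):
    intensity = average pixel value ("amount of ink");
    symmetry  = -mean |img - mirror(img)|  (0 means perfectly symmetric)."""
    intensity = img.mean()
    symmetry = -np.abs(img - np.fliplr(img)).mean()
    return intensity, symmetry

# A crude 16x16 "digit": background -1, one bright vertical stroke
img = np.full((16, 16), -1.0)
img[:, 8] = 1.0

inten, sym = intensity_symmetry(img)
x = np.array([1.0, inten, sym])   # the dim-3 input (1, x1, x2) from the notes
print(x)
```

Whatever the precise definitions, the point is that these two numbers change far less under small shifts and rotations than the raw 256-pixel vector does.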

Linear Classification and Regression

 The linear signal: $s = w^T x$

00 - TAMU Brain/6th Semester (Spring 26)/CSCE-421/Ex1/Visual Aids/image-1.png650

Logistic Regression: Predict Probabilities

The Math

Will someone have a heart attack over the next year?

age 62 years
blood sugar 120 mg/dL
HDL 50
LDL 120
Mass 190lbs
Height 5′10′′
... ...

Logistic Regression: Predict the probability of heart attack: $y \in [0, 1]$

$$h(x) = \theta\left(\sum_{i=0}^{d} w_i x_i\right) = \theta(w^T x), \qquad \theta(s) = \frac{1}{1 + e^{-s}}$$

Pasted image 20260120095341.png250

The Meaning

1. The Goal: Predicting a Probability So far, you have learned how to make a hard "Yes/No" decision (Linear Classification) and how to predict an exact continuous number (Linear Regression). Logistic Regression sits perfectly in the middle. We want to predict the probability of a binary event happening. For example, instead of just saying "Yes, this person will have a heart attack," we want the model to say, "There is an 85% chance this person will have a heart attack." Because it is a probability, the output must strictly be a number between 0 and 1.

2. The Magic Function: θ(s) To achieve this, we start with the exact same raw linear score we used before: $s = w^T x$ (the sum of the inputs multiplied by their weights). However, a raw score can be any number from $-\infty$ to $+\infty$.

To force this raw score to act like a probability, we pass it through a mathematical "squashing" function called the Logistic Function (or Sigmoid function, denoted as θ). The formula is: $\theta(s) = \frac{1}{1 + e^{-s}}$.

3. Clarifying a typo in your notes You wrote: "Theta is the number of inputs/outputs". This is a slight misunderstanding. θ is the mathematical function (the S-curve) that squashes the signal. The number of inputs is d (the dimension of your input vector x).

4. The Labels: Why use +1 and −1? Your notes emphasize using $y = \pm 1$ instead of 0 and 1 for the training data labels. Even though our prediction is a probability between 0 and 1, the historical data we use to train the model consists of absolute facts: a person either did have a heart attack ($+1$) or did not ($-1$). Using $+1$ and $-1$ makes the math much cleaner for the learning algorithm, especially when calculating the error (which you will see in the next slides).

5. The Properties of θ
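The key properties of θ can be checked numerically. A minimal sketch: θ maps any real signal into (0, 1), gives 0.5 at zero signal, and satisfies the symmetry θ(−s) = 1 − θ(s) used later to combine both labels into one formula.

```python
import math

def theta(s: float) -> float:
    """The logistic (sigmoid) function: squashes any real s into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-s))

print(theta(0))                   # 0.5: zero signal means a 50/50 guess
print(theta(100), theta(-100))    # saturates toward 1 and 0 for extreme signals
print(theta(-3), 1 - theta(3))    # symmetry: theta(-s) = 1 - theta(s)
```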

The Data is Still Binary, ±1

$$D = (x_1, y_1 = \pm 1), \ldots, (x_N, y_N = \pm 1)$$

$x_n$: a person's health information
$y_n = \pm 1$: did they have a heart attack or not

Setting

We are trying to learn the target function

$$f(x) = P[y = +1 \mid x]$$

The data does not give us the value of f explicitly. Rather, it gives us samples generated by this probability:

$$P(y = +1 \mid x) = f(x), \qquad P(y = -1 \mid x) = 1 - f(x)$$

To learn from such data, we need to define a proper error measure that gauges how close a given hypothesis h is to f in terms of these examples.


What is the data telling us? In Logistic Regression, we are trying to learn a target function f(x) that represents a true probability, specifically: "What is the true probability that y=+1 given the input x?".

Here is the fundamental challenge: the universe knows the exact probability (say, an 80% chance of a heart attack), but the data does not give us that percentage. We do not have a dataset that says [Patient A: 80%]. Instead, we only see the final outcome of that probability. We see [Patient A: +1 (Had a heart attack)] or [Patient B: -1 (Did not)]. We have to reverse-engineer the hidden probability using only these hard "+1 / -1" examples.
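A tiny simulation makes this concrete (the 0.8 probability is a hypothetical value, not from the notes): the dataset only ever contains ±1 outcomes, yet with enough samples the hidden probability shows through as a frequency.

```python
import numpy as np

rng = np.random.default_rng(5)

f_x = 0.8      # hypothetical hidden probability P[y=+1 | x] for one fixed x
N = 10000

# The dataset never contains 0.8; it only contains outcomes drawn from it
y = np.where(rng.uniform(size=N) < f_x, +1, -1)

# With enough samples, the hidden probability can be reverse-engineered
print(np.mean(y == +1))   # close to 0.8
```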

What Makes an h Good?

The Math

'Fitting' the data means finding a good $h$.

$h$ is good if:

$$h(x_n) \approx 1 \text{ whenever } y_n = +1; \qquad h(x_n) \approx 0 \text{ whenever } y_n = -1.$$

A simple error measure that captures this:

$$E_{\text{in}}(h) = \frac{1}{N} \sum_{n=1}^{N} \left(h(x_n) - \frac{1}{2}(1 + y_n)\right)^2$$
The Meaning

1. What Makes a Hypothesis (h) Good? Since our model h(x) is trying to predict this probability, we want its output (a number between 0 and 1) to align with reality.

2. The Simple Error Measure (The Math Trick) To train the model, we need an equation that measures how far off our predictions are from the ideal 1 or 0. The slide introduces this preliminary error formula: $$ E_{\text{in}}(h)=\frac{1}{N} \sum_{n=1}^N\left(h\left(\mathrm{x}_n\right)-\frac{1}{2}\left(1+y_n\right)\right)^2 $$ This looks confusing, but it contains a brilliant little math trick: $\frac{1}{2}(1+y_n)$. Because our dataset labels are strictly $y=+1$ or $y=-1$, plugging them in gives $\frac{1}{2}(1+1)=1$ and $\frac{1}{2}(1-1)=0$.

So, this formula perfectly converts our "+1 / -1" labels into the "1 or 0" probabilities we want our model to match! The error function simply takes our prediction h(x), subtracts the ideal target (1 or 0), squares the difference, and averages it. (Note: While this formula works intuitively, the professor will introduce a much better one called "Cross-Entropy" in the next slide, but this serves to build your understanding!)
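The label-to-target trick and the resulting error measure can be sketched in a few lines (the prediction vectors are made-up illustration values):

```python
import numpy as np

y = np.array([+1, -1, +1])

# The trick: (1/2)(1 + y) maps label +1 -> target 1 and label -1 -> target 0
targets = 0.5 * (1 + y)
print(targets)                     # [1. 0. 1.]

def simple_error(h_vals, y):
    """E_in(h) = (1/N) * sum (h(x_n) - (1/2)(1 + y_n))^2."""
    return np.mean((h_vals - 0.5 * (1 + y)) ** 2)

h_good = np.array([0.9, 0.1, 0.8])   # probabilities near the ideal targets
h_bad = np.array([0.1, 0.9, 0.2])    # probabilities far from the targets
print(simple_error(h_good, y) < simple_error(h_bad, y))   # True
```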

3. Why we can't solve it like Linear Regression Your notes point out a major difference between Linear and Logistic regression: Linear Regression has an analytic solution (the pseudoinverse, from setting the gradient to zero), but in Logistic Regression the sigmoid makes the equations nonlinear in $w$, so there is no closed-form solution to jump to.

4. The Iterative Solution Because we cannot jump straight to the answer, we must start with random weights (w) and use an iterative procedure (taking tiny steps to adjust the weights over time). Your notes at the bottom translate what these adjustments are trying to achieve.

The Cross Entropy Error Measure

The Math
$$E_{\text{in}}(w) = \frac{1}{N} \sum_{n=1}^{N} \ln\left(1 + e^{-y_n w^T x_n}\right)$$

It looks complicated and ugly (ln, e^(·), ...),

But,

Verify:

The Meaning

1. The "Ugly" Formula that Works Beautifully In the previous slide, we looked at a preliminary way to measure mistakes. Now, the professor introduces the actual formula used in the real world to train Logistic Regression: the Cross-Entropy Error (often just called Log Loss).

The formula is: $$ E_{\text {in }}(w)=\frac{1}{N} \sum_{n=1}^N \ln \left(1+e^{-y_n \cdot w^{\mathrm{T}} x_n}\right) $$ (And you are completely correct to note that the last x must be xn because we are evaluating it for each specific person/data point n in the dataset!)

Let's break down why this formula is actually very intuitive:

2. Why is it "Mathematically Friendly"? Even though the equation looks like an absolute nightmare of logs and exponents, it has a beautiful property: it is strictly convex. Imagine dropping a marble into a bowl. No matter where you drop it, it will always roll down to the exact same spot at the very bottom. The Cross-Entropy error forms a perfect mathematical bowl (only one valley). This means when we use our "iterative procedure" to tweak the weights step-by-step, the computer is guaranteed to eventually find the absolute best possible weights (the bottom of the valley) without getting stuck.

3. Verification (Sanity Check) The slide finishes by proving that minimizing this formula naturally forces the model to do exactly what we want it to do:
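A quick numerical sanity check of the single-point cross-entropy term: when the signal $s = w^Tx$ confidently agrees with the label, the error is nearly zero; when it confidently disagrees, the error is large (the signal values are illustrative).

```python
import numpy as np

def ce_point(y, s):
    """Single-point cross-entropy error ln(1 + e^{-y s}), where s = w^T x."""
    return np.log1p(np.exp(-y * s))

# Confident and correct (y and s agree, |s| large): error ~ 0
# Confident and wrong  (y and s disagree):          error is huge
print(ce_point(+1, 10.0))
print(ce_point(+1, -10.0))
```

Note the symmetry in $y \cdot s$: predicting $-1$ with signal $-3$ is exactly as good as predicting $+1$ with signal $+3$.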

The Probabilistic Interpretation

Suppose that $h(x) = \theta(w^T x)$ closely captures $P[+1 \mid x]$:

$$P(y \mid x) = \begin{cases} \theta(w^T x) & \text{for } y = +1 \\ 1 - \theta(w^T x) & \text{for } y = -1 \end{cases}$$

Given:

$$\theta(s) = \frac{1}{1 + e^{-s}}$$

we have

$$\theta(-s) = 1 - \theta(s)$$

So, more compactly,

$$P(y \mid x) = \theta(y \cdot w^T x)$$

The Likelihood

The Math
$$P(y \mid x) = \theta(y \cdot w^T x)$$

- "The product of each individual probability"
- Why do we use product instead of summation?
  - We want each of these individual probabilities to be large, so we maximize their product
  - A summation won't work; the product ties the data points together and provides the necessary weight (a kind of balance between the data points)
- Essentially we want to adjust $w$ so that the product is as large as possible
- Maximizing the likelihood is the same as maximizing the product of individual probabilities
- The likelihood measures the probability that the data were generated if $f$ were $h$.

The Meaning

1. The Magic Single Formula: $P(y|x) = \theta(y \cdot w^Tx)$ In the previous slides, we established that we want to predict a probability. If a person had a heart attack ($y = +1$), the probability is $\theta(w^Tx)$. If they did not ($y = -1$), the probability is $1 - \theta(w^Tx)$. Because of the neat symmetry of the logistic function, $1 - \theta(w^Tx)$ is exactly equal to $\theta(-w^Tx)$. This allows us to combine both cases into one elegant formula: $P(y|x) = \theta(y \cdot w^Tx)$.

- If $y = +1$, it becomes $\theta(w^Tx)$.
- If $y = -1$, it becomes $\theta(-w^Tx)$.

This is what your note means by "Probability of y given x written in the same formula". It mathematically expresses: "The probability that the model's prediction matches the true reality."

2. Answering your question: Why use the product instead of summation? Your notes guess that multiplication provides a "necessary weight" or balance. While creative, the real reason is a fundamental rule of statistics: Independence.

- Imagine flipping a coin. The chance of getting Heads is 50% (0.5). What is the chance of getting Heads twice in a row? You multiply them: $0.5 \times 0.5 = 0.25$.
- In machine learning, we assume every patient in our dataset is entirely independent of the others.

Therefore, to find the total probability of our entire dataset occurring, we must multiply the individual probabilities of every single person's outcome together. This total multiplied probability is called the Likelihood.

Maximizing The Likelihood

The Math

$$\max_w \prod_{n=1}^{N} P(y_n \mid x_n) \Leftrightarrow \max_w \ln\left(\prod_{n=1}^{N} P(y_n \mid x_n)\right)$$

Note: $\max f(x)$ is equivalent (not equal) to $\max \ln f(x)$ because $\log$ is monotonically increasing.

$$\equiv \max_w \sum_{n=1}^{N} \ln P(y_n \mid x_n)$$

Note: the $\log$ of a product equals the summation of $\log$s (the $\log$ moved inside). This is because $\ln(ab) = \ln a + \ln b$ (log properties). The $\log$ is also useful because a change in input does not mean a great change in the output (for large inputs more than small ones); this is called the saturating property of logs. This is equivalent to the line above because the first is the product and the second is the summation.

$$\equiv \min_w \frac{1}{N} \sum_{n=1}^{N} -\ln P(y_n \mid x_n) \equiv \min_w \frac{1}{N} \sum_{n=1}^{N} \ln \frac{1}{P(y_n \mid x_n)} = \min_w \frac{1}{N} \sum_{n=1}^{N} \ln \frac{1}{\theta(y_n w^T x_n)} = \min_w \frac{1}{N} \sum_{n=1}^{N} \ln\left(1 + e^{-y_n w^T x_n}\right)$$

$$E_{\text{in}}(w) = \frac{1}{N} \sum_{n=1}^{N} \ln\left(1 + e^{-y_n w^T x_n}\right)$$
The Meaning

3. Maximizing the Likelihood (The Goal) We want to find the weights (w) that make our dataset as highly probable to occur as possible. This means we want to maximize the Likelihood product. However, multiplying thousands of probabilities (which are decimals between 0 and 1) together will result in a microscopic number (like 0.0000000001). A computer cannot handle numbers this small and will round them to zero.
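The underflow problem is easy to reproduce (synthetic signals and labels, only for illustration): the raw product of thousands of per-point probabilities collapses to exactly 0.0 in floating point, while the sum of logs stays finite.

```python
import numpy as np

def theta(s):
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(1)

# Synthetic per-point signals s_n = w^T x_n and labels that mostly agree
s = rng.normal(loc=1.0, size=5000)
y = np.ones(5000)

per_point = theta(y * s)            # P(y_n | x_n) = theta(y_n * w^T x_n)
print(np.prod(per_point))           # raw likelihood: underflows to 0.0
print(np.sum(np.log(per_point)))    # log-likelihood: finite and usable
```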

4. The Math Derivation (Step-by-Step) To fix this, we apply a brilliant mathematical trick. We take the Natural Logarithm (ln) of the whole equation.

Here is how we transform the equation step-by-step:


Cross Entropy View

$$CE(g, q) = -\sum_i g_i \log q_i$$

A Neural Network View

Pasted image 20260122100430.png600

How To Minimize Ein(w)

Recall:

$$E_{\text{in}}(w) = \frac{1}{N} \sum_{n=1}^{N} \ln\left(1 + e^{-y_n w^T x_n}\right)$$

Regression - pseudoinverse (analytic), from solving $\nabla_w E_{\text{in}}(w) = 0$.

Logistic Regression - analytic won't work.

Numerically/iteratively set $\nabla_w E_{\text{in}}(w) \to 0$.

Finding The Best Weights - Hill Descent

Ball on a complicated hilly terrain rolls down to a local valley (a local minimum).

Pasted image 20260122103143.png200

Questions:

Our Ein Has Only One Valley

Pasted image 20260122103509.png400

How to "Roll Down"?

The Math

Assume you are at weights $w(t)$ and you take a step of size $\eta$ in the direction $\hat{v}$.

$$w(t+1) = w(t) + \eta \hat{v}$$

  1. $\eta$: a scalar step size
  2. $\hat{v}$: a unit vector giving the direction

We get to pick $\hat{v}$ ← what's the best direction to take the step?

Pick $\hat{v}$ to make $E_{\text{in}}(w(t+1))$ as small as possible.

Pasted image 20260127094259.png200


The Meaning

The Setup (Standing on the Hill) Imagine you are standing somewhere on a foggy, hilly landscape. Your goal is to reach the absolute lowest point of the valley, because the "height" of the ground represents your model's error (Ein). The lower you go, the better your model gets.

Because it is foggy, you cannot see the bottom of the valley. You can only feel the slope of the ground directly under your feet. To get to the bottom, you have to take it one step at a time. Every step you take is defined by this formula: $$ \mathrm{w}(t+1) = \mathrm{w}(t) + \eta \hat{\mathrm{v}} $$

The Gradient is the Fastest Way to Roll Down

The Math

The Taylor series expansion of a multivariate function $f(x)$ around a point $a$, up to second order, is

$$f(x) \approx f(a) + \nabla f(a)^T (x - a) + \frac{1}{2}(x - a)^T H_f(a)(x - a)$$

Expanding $E_{\text{in}}(w(t+1))$ around $E_{\text{in}}(w(t))$ gives

$$\Delta E_{\text{in}} = E_{\text{in}}(w(t+1)) - E_{\text{in}}(w(t)) = E_{\text{in}}(w(t) + \eta\hat{v}) - E_{\text{in}}(w(t)) = \eta \nabla E_{\text{in}}(w(t))^T \hat{v} + O(\eta^2) \quad (\text{Taylor's approximation})$$

minimized at $\hat{v} = -\dfrac{\nabla E_{\text{in}}(w(t))}{\|\nabla E_{\text{in}}(w(t))\|}$

The best (steepest) direction to move is the negative gradient:

$$\hat{v} = -\frac{\nabla E_{\text{in}}(w(t))}{\|\nabla E_{\text{in}}(w(t))\|}$$
The Meaning

1. The Goal: Which way do we step? Since we want to go down into the valley, we want our new error, Ein(w(t+1)), to be as small as possible compared to our old error, Ein(w(t)). To figure out the best direction mathematically, the slide uses a Taylor Series Expansion.

Taylor's approximation is a calculus trick that lets you estimate the value of a function near your current location. It tells us that the change in our error (ΔEin) after taking a tiny step is approximately: $$ \Delta E_{in} \approx \eta \nabla E_{in}(w(t))^T \hat{v} $$

2. Finding the Fastest Way Down We want ΔEin to be as negative as possible (we want to drop in height as much as we can). Because η is just a positive step size, the only thing we can control is v^.

In linear algebra, if you have a vector $A$ (the gradient) and a vector $B$ (your direction), their dot product $A^T B$ is most negative when they point in exactly opposite directions.

To make $\hat{v}$ a unit vector (length of 1) pointing straight downhill, we take the negative gradient and divide it by its own length ($\|\nabla E_{\text{in}}\|$): $$ \hat{\mathrm{v}}=-\frac{\nabla E_{\text{in}}(\mathrm{w}(t))}{\left\|\nabla E_{\text{in}}(\mathrm{w}(t))\right\|} $$
3. Important Correction to your Notes! At the bottom of your notes, you wrote: "Given Ein a vector of input produces a vector of output. To make this clear: Ein takes a vector input, and outputs one vector. This Ein is the vector loss."

This is incorrect, and it is crucial to fix this for your exam: $E_{\text{in}}$ takes a vector input $w$ but outputs a single scalar (the average error). It is the gradient, $\nabla E_{\text{in}}$, that takes a vector input and outputs a vector (the direction of steepest ascent).
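The steepest-descent claim can be verified numerically. A sketch using a simple convex quadratic as a stand-in for $E_{\text{in}}$ (the surface and point are made up for illustration): no random unit direction decreases the error faster than the normalized negative gradient.

```python
import numpy as np

def E(w):
    """A smooth stand-in for E_in: a simple convex quadratic surface."""
    return 0.5 * w @ w + np.array([1.0, -2.0]) @ w

def grad_E(w):
    return w + np.array([1.0, -2.0])

w = np.array([3.0, 1.0])
g = grad_E(w)
v_star = -g / np.linalg.norm(g)          # unit vector along the negative gradient
eta = 1e-4

drop_star = E(w + eta * v_star) - E(w)   # change in error along -gradient

# No random unit direction drops the error faster than the negative gradient
rng = np.random.default_rng(2)
for _ in range(200):
    v = rng.normal(size=2)
    v /= np.linalg.norm(v)
    assert E(w + eta * v) - E(w) >= drop_star - 1e-12
print(drop_star)   # negative: stepping against the gradient goes downhill
```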

"Rolling Down ≡ Iterating the Negative Gradient"

Pasted image 20260127094740.png450

The 'Goldilocks' Step Size

Pasted image 20260122104847.png450


The "Goldilocks" Problem: How big of a step do we take? In the previous slide, we figured out the direction to step to get down the hill (the negative gradient). Now we need to figure out the size of the step, denoted by η (Eta, or the "learning rate").

Fixed Learning Rate Gradient Descent

The Math
$$\eta_t = \eta\, \|\nabla E_{\text{in}}(w(t))\| \to 0 \text{ when closer to the minimum.}$$

$$\eta_t \hat{v} = \eta \|\nabla E_{\text{in}}(w(t))\| \left(-\frac{\nabla E_{\text{in}}(w(t))}{\|\nabla E_{\text{in}}(w(t))\|}\right) = -\eta \nabla E_{\text{in}}(w(t))$$

Pasted image 20260127093606.png400

Gradient descent can minimize any smooth function, for example

$$E_{\text{in}}(w) = \frac{1}{N} \sum_{n=1}^{N} \ln\left(1 + e^{-y_n w^T x_n}\right)$$

(logistic regression)

The Meaning

1. The Brilliant Solution: Let the hill do the work How does the computer know if it is far away or close to the bottom? It looks at the steepness of the slope!

2. The Math Trick (Simplifying the Formula) In the previous slide, we forced our step direction to be a "unit vector" (a vector with a length of exactly 1) by dividing the negative gradient by its own length: $\hat{v} = -\nabla E_{\text{in}} / \|\nabla E_{\text{in}}\|$.

When we combine our new dynamic step size (ηt) with this unit vector (v^), something beautiful happens. We multiply them together to get our final step: $$ \text{Step} = \eta_t \cdot \hat{v} = \left(\eta \cdot ||\nabla E_{in}|| \right) \cdot \left( - \frac{\nabla E_{in}}{||\nabla E_{in}||} \right) $$
Because $\|\nabla E_{\text{in}}\|$ appears in both the numerator and the denominator, the two completely cancel each other out!

3. The Final Update Rule Once the lengths cancel out, we are left with a beautifully simple formula for our step: ηEin. This means we no longer have to worry about calculating unit vectors or manually shrinking the step size. We just pick a single, fixed constant number for η (often around 0.1), and subtract η multiplied by the raw gradient. The final standard update loop for Gradient Descent is: $$ w(t+1) = w(t) - \eta \nabla E_{in}(w(t)) $$
4. Why Logistic Regression? The slide ends by pointing out that this Gradient Descent method can minimize any "smooth" function. We use it for Logistic Regression because its Cross-Entropy error function is perfectly smooth and strictly convex. This means it forms a perfect single bowl shape with only one valley, guaranteeing that our ball will eventually roll down to the absolute best global minimum without getting stuck.
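The full fixed-learning-rate loop can be sketched end to end on synthetic data (the dataset, rate, and iteration count are illustrative choices, not from the notes):

```python
import numpy as np

def cross_entropy(w, X, y):
    """E_in(w) = (1/N) * sum ln(1 + exp(-y_n w^T x_n))."""
    return np.mean(np.log1p(np.exp(-y * (X @ w))))

def gradient(w, X, y):
    """Gradient of E_in: -(1/N) * sum y_n x_n / (1 + exp(y_n w^T x_n))."""
    return -(X.T @ (y / (1.0 + np.exp(y * (X @ w))))) / len(y)

# Synthetic, nearly separable data: x = (1, x1, x2), labels from a noisy rule
rng = np.random.default_rng(3)
N = 200
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, 2))])
y = np.where(X @ np.array([0.5, 2.0, -1.0]) + 0.1 * rng.normal(size=N) > 0, 1.0, -1.0)

w = np.zeros(3)
eta = 0.1                               # fixed learning rate
for t in range(1000):
    w = w - eta * gradient(w, X, y)     # w(t+1) = w(t) - eta * grad E_in(w(t))

# Error drops well below the starting point E_in(0) = ln 2
print(cross_entropy(w, X, y), np.log(2))
```

Because the cross-entropy surface is convex, restarting from any initial $w$ drives the error toward the same minimum.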

Stochastic Gradient Descent (SGD)

A variation of GD that considers only the error on one data point.

$$E_{\text{in}}(w) = \frac{1}{N} \sum_{n=1}^{N} \ln\left(1 + e^{-y_n w^T x_n}\right) = \frac{1}{N} \sum_{n=1}^{N} e(w, x_n, y_n)$$

$$w(t+1) \leftarrow w(t) - \eta \nabla_w e(w, x, y)$$

Logistic Regression:

$$w(t+1) \leftarrow w(t) + y\, x \left(\frac{\eta}{1 + e^{y\, w^T x}}\right)$$

Advantages:

  1. The ’average’ move is the same as GD;
  2. Computation: each step costs a fraction ($1/N$) of a full GD step;
  3. Stochastic: helps escape local minima;
  4. Simple;

Pasted image 20260127100026.png400

Comparison

| Method | Updates Per Iteration | Convergence Speed | Stability | Computation Cost |
| --- | --- | --- | --- | --- |
| Batch GD | After full dataset | Slow | Stable | High |
| Stochastic GD | After each sample | Fast | Noisy (unstable) | Low |
| Mini-Batch GD | After each mini-batch | Moderate | Balanced | Moderate |

Summary and explanation of different GD techniques

1. The Problem with Standard (Batch) Gradient Descent In the previous slide, we learned that Gradient Descent updates the model's weights by calculating the gradient (the multi-dimensional slope) of the error surface. The total error, Ein(w), is the average of the errors across the entire dataset. If you have 1 million patients in your dataset, standard Gradient Descent—often called Batch Gradient Descent (BGD)—must calculate the prediction and the error for all 1 million patients just to take one single step down the hill. This is incredibly slow and computationally expensive.

2. The Solution: Stochastic Gradient Descent (SGD) SGD takes a radically different approach. Instead of looking at the whole dataset, it picks one single random data point (x,y). It calculates the gradient of the error for just that one point, and immediately updates the weights.

3. The Logistic Regression SGD Formula Your slide shows a specific, ready-to-use formula for SGD applied to Logistic Regression: $$ \mathrm{w}(t+1) \leftarrow \mathrm{w}(t)+y_* \mathrm{x}_*\left(\frac{\eta}{1+e^{y_* \mathrm{w}^{\mathrm{T}} \mathrm{x}_*}}\right) $$ Where does this come from? If you take the derivative of the single-point Cross-Entropy error $e=\ln(1+e^{-y w^{\mathrm{T}} x})$, the negative sign of the gradient cancels with the minus in the update rule $w - \eta \nabla e$, leaving you with that exact formula.

4. Why does SGD work? (The Advantages) You might wonder: "If we only look at one point, aren't we going to step in the wrong direction?" Yes, frequently! Stepping based on one point is very noisy and "wiggly". However:

5. The Compromise: Mini-Batch Gradient Descent (MBGD) In reality, picking 1 point at a time (SGD) is too noisy, and picking all points (BGD) is too slow. Mini-Batch GD is the perfect "Goldilocks" compromise used in modern Deep Learning. You divide your dataset into small batches (e.g., 32 or 64 points). You calculate the average gradient of those 32 points, take a step, and move to the next batch. It balances speed and stability perfectly.

6. What is an "Epoch"? This is a crucial term to memorize. An Epoch means the algorithm has looked at every single data point in the entire dataset exactly once.
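SGD and the epoch idea together fit in a short sketch (synthetic separable data; the learning rate, epoch count, and dataset are illustrative assumptions): each epoch is one shuffled pass in which every point triggers exactly one update using the slide's rule.

```python
import numpy as np

def sgd_logistic(X, y, eta=0.1, epochs=5, seed=0):
    """SGD: update on one point at a time; one epoch = one full shuffled pass."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for n in rng.permutation(len(y)):                 # visit every point once
            s = y[n] * (X[n] @ w)
            w = w + eta * y[n] * X[n] / (1.0 + np.exp(s)) # slide's update rule
    return w

rng = np.random.default_rng(4)
N = 300
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, 2))])
y = np.where(X @ np.array([0.0, 1.0, -1.0]) > 0, 1.0, -1.0)

w = sgd_logistic(X, y)
print(np.mean(np.sign(X @ w) == y))   # fraction correctly classified
```

Swapping the inner loop to slices of 32 shuffled indices, averaging their updates, would turn this into the mini-batch variant from the comparison table.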