02 - Logistic Regression For Binary Classification
Class: CSCE-421
Notes:
Digits Data
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260120093107.png)
- Numbers on an envelope that represent a zip code
Each digit is a 16 * 16 image.
[-1 -1 -1 -1 -1 -1 -1 -0.63 0.86 -0.17 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -0.99 0.3 1 0.31 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ... -0.94 -1 -1 -1 -1 -1 -1 -1 -0.97 -0.42 0.30 0.82 1 0.48 -0.47 -0.99 -1 -1 -1 -1]
- Basically flattening the 16 * 16 picture into one long vector of 256 values
- We are trying to predict 0-9 numbers (10 classes)
- Logistic Regression only works for two classes (binary data)
- If you want to make this work for multiple classes you have to somehow simulate a binary property (see the one-vs-rest sketch below)
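One common way to reduce the 10-class digit problem to binary classifiers is one-vs-rest: train one binary logistic regression per digit and pick the most confident one. A minimal sketch, where `fit_binary` and `score_fn` are hypothetical stand-ins for any binary trainer and scorer:

```python
import numpy as np

def one_vs_rest_train(X, y, fit_binary, num_classes=10):
    """Train one binary classifier per class; fit_binary(X, y_pm1) -> model (hypothetical)."""
    models = []
    for c in range(num_classes):
        y_binary = np.where(y == c, 1.0, -1.0)     # class c vs. everything else
        models.append(fit_binary(X, y_binary))
    return models

def one_vs_rest_predict(X, models, score_fn):
    """score_fn(X, model) -> one confidence per row (hypothetical); pick the most confident class."""
    scores = np.column_stack([score_fn(X, m) for m in models])   # shape (N, num_classes)
    return np.argmax(scores, axis=1)
```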
Image Representations
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260120093331.png)
- For the computer this is a two dimensional number array
Transformations on Images
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260120093804.png)
- The field of view changed but the object remains the same; the vector, however, changes (the object moved a little bit)
- The predictions should remain the same but the result of the dot product will be different
Learning Invariant Representations
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260120094522.png)
- If you rotate an image by 90 degrees, or if you just change it a little bit, predictions must stay the same
- Your prediction should somehow remain the same, but this is not easy to achieve, because any change in the input can change the operations performed to reach a prediction
- You have to design this property so that if you rotate the image by some degrees it will return the same prediction
Intensity and Symmetry Features
Feature: an important property of the input that you think is useful for classification.
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260120094850.png)
- This only works for 1s and 5s
- Intensity: represents how much ink is in the image (see the feature sketch below)
...
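A minimal sketch of how these two features might be computed from a flattened 16 * 16 digit, assuming the common definitions for this dataset (intensity = average pixel value, symmetry = negative average absolute difference between the image and its left-right flip); the exact definitions in the slides may differ:

```python
import numpy as np

def intensity_symmetry(flat_digit):
    """flat_digit: length-256 vector with values in [-1, 1] (row-major 16x16 image)."""
    img = np.asarray(flat_digit).reshape(16, 16)
    intensity = img.mean()                            # more ink -> higher average value
    symmetry = -np.abs(img - np.fliplr(img)).mean()   # perfectly symmetric image -> 0
    return intensity, symmetry
```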
Logistic Regression: Predict Probabilities
Will someone have a heart attack over the next year?
| Attribute | Value |
|---|---|
| age | 62 years |
| blood sugar | 120 mg/dL |
| HDL | 50 |
| LDL | 120 |
| Mass | 190 lbs |
| Height | 5′10′′ |
| ... | ... |
Logistic Regression: Predict the probability of heart attack:
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260120095341.png)
- The prediction is $h(x) = \theta(w^T x)$, where $\theta$ maps the linear signal of the inputs to a probability output; for logistic regression $\theta(s) = \frac{e^s}{1 + e^s}$ (the logistic / sigmoid function)
- Use $y = \pm 1$, do not use $y = 0, 1$
- Making predictions with logistic regression is easy: $w$ is given and $x$ is given, so all you need to do is compute $\theta(w^T x)$ and it will just give a number
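A minimal NumPy sketch of that prediction step (the weight and input vectors below are made-up placeholders, not learned values):

```python
import numpy as np

def predict_proba(x, w):
    """Probability that input x belongs to the +1 class: theta(w^T x)."""
    s = np.dot(w, x)                     # the linear signal
    return 1.0 / (1.0 + np.exp(-s))      # theta(s) = e^s / (1 + e^s), written in a stable form

w = np.array([0.5, -1.2, 0.3])           # given weights (placeholder)
x = np.array([1.0, 0.2, 0.7])            # given input (placeholder)
print(predict_proba(x, w))               # a single number between 0 and 1
```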
Properties
- Probabilities are bounded by 0 and 1
- $s = w^T x$ is the linear signal, and in order to get a prediction we pass it through $\theta$ and threshold the output
- Is logistic regression still a linear model?
    - This is a two-class case
    - You are feeding a "linear" signal into a non-linear function ($\theta$). Does that change linearity?
    - The function $\theta$ is monotonically increasing
    - Still, each $s$ value is attached to a unique $\theta(s)$ value, so using a threshold on $\theta(s)$ would have an equivalent threshold on $s = w^T x$
    - If the function were not monotonic we would have two or more thresholds in $s$ for a single threshold in $\theta(s)$
    - Conclusion: logistic regression is still a linear model (see the check below)
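A quick check of that monotonicity argument (classify as +1 when the predicted probability is at least one half):
$$
\theta(w^T x) \ge \tfrac{1}{2}
\;\Longleftrightarrow\;
\frac{e^{w^T x}}{1 + e^{w^T x}} \ge \tfrac{1}{2}
\;\Longleftrightarrow\;
w^T x \ge 0,
$$
so the decision boundary $w^T x = 0$ is still linear in $x$.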
The Data is Still Binary, ±1
- We cannot measure a probability.
- We can only see the occurrence of an event and try to infer a probability.
Setting
We are trying to learn the target function $f(x) = P(y = +1 \mid x)$.
The data does not give us the value of $f$ explicitly. Rather, it gives us samples generated by this probability: $y = +1$ with probability $f(x)$, and $y = -1$ with probability $1 - f(x)$.
To learn from such data, we need to define a proper error measure that gauges how close a given hypothesis $h$ is to $f$ in terms of these examples.
What Makes an $h$ Good?
The Math
'Fitting' the data means finding a good $h$.
$h$ is good if $h(x_n) \approx 1$ whenever $y_n = +1$, and $h(x_n) \approx 0$ whenever $y_n = -1$.
A simple error measure that captures this is developed below (the cross entropy error).
- In linear regression we computed the derivative, set it to zero, and solved for $w$ in closed form
- With logistic regression it is similar; the only thing is that you cannot solve for the root analytically
    - We would have some initial guess of $w$; you have to use an iterative procedure to compute the final $w$
    - Suppose we have a data point with $y_n = +1$. We need $w$ to be adjusted to fit this data
        - The output $\theta(w^T x_n)$ is the probability that this input $x_n$ belongs to the +1 class
        - What should we do in order to do this?
        - We need $w$ to make $w^T x_n$ as large as necessary (so the probability goes toward 1)
        - If we added a data point with $y_n = -1$, we would need $w^T x_n$ to be as small as possible
The Cross Entropy Error Measure
The Math
$$
E_{\text{in}}(w) = \frac{1}{N} \sum_{n=1}^{N} \ln\left(1 + e^{-y_n w^T x_n}\right)
$$
- We want the term in parentheses, $1 + e^{-y_n w^T x_n}$, to be as small as possible
- $\ln$ is a monotonic function, so minimizing the term in parentheses also minimizes the logarithm of it
- This is not the only error measure we could minimize
It looks complicated and ugly (ln, e^(·), ...),
But,
- it is based on an intuitive probabilistic interpretation of h.
- it is very convenient and mathematically friendly (’easy’ to minimize).
Verify: $y_n = +1$ encourages $w^T x_n \gg 0$, so $\theta(w^T x_n) \approx 1$; $y_n = -1$ encourages $w^T x_n \ll 0$, so $\theta(w^T x_n) \approx 0$.
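A minimal NumPy sketch of this error measure, assuming `X` stacks the inputs as rows and `y` holds the ±1 labels (names are placeholders):

```python
import numpy as np

def cross_entropy_error(w, X, y):
    """E_in(w) = (1/N) * sum ln(1 + exp(-y_n * w^T x_n))."""
    signals = X @ w                                    # linear signals w^T x_n, shape (N,)
    return np.mean(np.log(1.0 + np.exp(-y * signals)))
```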
The Probabilistic Interpretation
The Math
Suppose that $h(x) = \theta(w^T x)$ is indeed the target function.
Given:
$$
P(y \mid x) = \begin{cases} h(x) & \text{for } y = +1 \\ 1 - h(x) & \text{for } y = -1 \end{cases}
$$
we have $P(y \mid x) = \theta(w^T x)$ for $y = +1$ and $P(y \mid x) = \theta(-w^T x)$ for $y = -1$, using $\theta(-s) = 1 - \theta(s)$.
So, more compactly, $P(y \mid x) = \theta(y\, w^T x)$.
- Note this is purely for notational purposes
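The identity used above is quick to check:
$$
1 - \theta(s) = 1 - \frac{e^{s}}{1 + e^{s}} = \frac{1}{1 + e^{s}} = \frac{e^{-s}}{1 + e^{-s}} = \theta(-s).
$$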
The Likelihood
The Math
- "Probability of y given x written in the same formula"
- We will need to use this when we derive our maximum likelihood
- This is a probability (is between 0 and 1)
- If we plugin each of our data points, each output will be somewhere close to 1.
, we want as large as possible , we want as small as possible
- Recall:
are independently generated
Likelihood:
- The probability of getting the
in D from the corresponding
$$
\begin{aligned}
& \max_w \prod_{n=1}^N P\left(y_n \mid x_n\right) \\
\Leftrightarrow\ & \max_w \ln \left(\prod_{n=1}^N P\left(y_n \mid x_n\right)\right) \\
\equiv\ & \max_w \sum_{n=1}^N \ln P\left(y_n \mid x_n\right) \\
\equiv\ & \min_w -\frac{1}{N} \sum_{n=1}^N \ln P\left(y_n \mid x_n\right) \\
=\ & \min_w \frac{1}{N} \sum_{n=1}^N \ln \frac{1}{P\left(y_n \mid x_n\right)} \\
=\ & \min_w \frac{1}{N} \sum_{n=1}^N \ln \frac{1}{\theta\left(y_n w^T x_n\right)} \\
=\ & \min_w \frac{1}{N} \sum_{n=1}^N \ln\left(1 + e^{-y_n w^T x_n}\right) \;=\; \min_w E_{\text{in}}(w)
\end{aligned}
$$
- We added a negative and changed max to min
- Note this became equivalent (not equal) because we are also dividing by $N$ (the number of data points)
- Note again, $-\ln a = \ln \frac{1}{a}$ because of log properties
    - We basically just moved the negative inside the log
- We specialize to our "model" here: $P(y_n \mid x_n) = \theta(y_n w^T x_n)$
    - We basically made $w$ explicit by rewriting the likelihood
- Finally, just use the definition of $\theta(s) = \frac{1}{1 + e^{-s}}$, so $\ln \frac{1}{\theta(y_n w^T x_n)} = \ln\left(1 + e^{-y_n w^T x_n}\right)$
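A small numerical sanity check (made-up data; names are placeholders) that the negative average log-likelihood and the cross entropy error above are the same quantity:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 5, 3
X = rng.normal(size=(N, d))          # made-up inputs
y = rng.choice([-1.0, 1.0], size=N)  # made-up +-1 labels
w = rng.normal(size=d)               # arbitrary weights

theta = lambda s: 1.0 / (1.0 + np.exp(-s))   # logistic function

neg_avg_log_likelihood = -np.mean(np.log(theta(y * (X @ w))))
cross_entropy = np.mean(np.log(1.0 + np.exp(-y * (X @ w))))

print(np.isclose(neg_avg_log_likelihood, cross_entropy))   # True
```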
View: look at the true label as a one-hot vector over the two classes
- If the true label is +1, then the one-hot vector is $[1, 0]$
- If the true label is -1, then the one-hot vector is $[0, 1]$
- The model's corresponding probability vector is $[h(x),\ 1 - h(x)]$
Cross Entropy
- CE tries to maximize the log probability of the correct class:
$$
\mathrm{CE}(y, p) = -\sum_{i} y_i \ln p_i
$$
where $y$ is the one-hot vector (each $y_i$ is one of its elements) and each $p_i$ is a predicted class probability
- Why does this make sense?
    - Only one term in the one-hot vector is not zero
    - That term picks out our model's (log) probability for the correct class
    - We want to maximize a log probability; putting a negative in front turns it into something we minimize, and that is the reason the negative is there
A Neural Network View
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260122100430.png)
- When we try to make predictions, each input $x_i$ will be multiplied with its weight $w_i$
- The summation of all these multiplications yields a single number (the signal $s = w^T x$), which we then pass through $\theta$ to get the final probability that the input belongs to the +1 class
- Note there is an input layer (the $x$ vector) and there is an output layer (the summation), with nothing in between; later, a more complex neural network will add layers in between
How To Minimize
Recall:
$$
E_{\text{in}}(w) = \frac{1}{N} \sum_{n=1}^{N} \ln\left(1 + e^{-y_n w^T x_n}\right)
$$
- The only difference between this and linear regression is that in linear regression we are minimizing the squared error $\frac{1}{N} \sum_{n} (w^T x_n - y_n)^2$
- Linear regression is easy because we can just take the derivative (the error is a degree-2 polynomial in $w$), set it to zero, and end up with a closed-form answer
- But when we try to take the derivative of $E_{\text{in}}(w)$ it is much harder, because the $\ln$ and $e^{(\cdot)}$ do not disappear completely
- And of course the derivative will be a vector (the gradient)

Linear regression: pseudoinverse (analytic), from solving $\nabla E_{\text{in}}(w) = 0$.
Logistic regression: analytic won't work.
Numerically/iteratively set $\nabla E_{\text{in}}(w) = 0$.
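For reference, a term-by-term differentiation of $E_{\text{in}}$ (chain rule; worth re-deriving yourself) gives the gradient that the iterative methods below will need:
$$
\nabla E_{\text{in}}(w) = -\frac{1}{N} \sum_{n=1}^{N} \frac{y_n x_n}{1 + e^{\,y_n w^T x_n}}.
$$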
Finding The Best Weights - Hill Descent
Ball on a complicated hilly terrain rolls down to a local valley (a local minimum)
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260122103143.png)
- This is a one-dimensional problem ($w$ on one axis vs. the loss on the other)
- Note you can only reach the true minimum in a probabilistic sense; it is not guaranteed that you actually reach that point
Questions:
- How to get to the bottom of the deepest valley?
- How to do this when we don’t have gravity?
Our $E_{\text{in}}$ Has Only One Valley
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260122103509.png)
- Convex: a function where, if you pick any two different points on its graph and connect them with a straight line, the function between those points always lies on or below that line (so there is only one valley)
How to "Roll Down"?
The Math
Assume you are at weights $w(t)$ and you take a step of size $\eta$ in the direction of a unit vector $\hat{v}$:
$$
w(t+1) = w(t) + \eta \hat{v}
$$
- $\eta$: a scalar step size
- $\hat{v}$: a unit vector of direction

We get to pick $\hat{v}$.
Pick $\hat{v}$ to make $E_{\text{in}}(w(t+1))$ as small as possible.
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260127094259.png)
- We are moving down the error surface
The Meaning
- We write a loop, and then set a termination condition
- Note you are adding a vector times a scalar
The Gradient is the Fastest Way to Roll Down
The Taylor series expansion of a multivariate function:
$$
E_{\text{in}}(w(t) + \eta \hat{v}) \approx E_{\text{in}}(w(t)) + \eta\, \nabla E_{\text{in}}(w(t))^T \hat{v}
$$
Expanding the change in error:
$$
\Delta E_{\text{in}} \approx \eta\, \nabla E_{\text{in}}(w(t))^T \hat{v} \;\ge\; -\eta\, \lVert \nabla E_{\text{in}}(w(t)) \rVert
$$
The best (steepest) direction to move is the negative gradient:
$$
\hat{v} = -\frac{\nabla E_{\text{in}}(w(t))}{\lVert \nabla E_{\text{in}}(w(t)) \rVert}
$$
- You only need to compute the derivative (gradient) once per step
- $w(t)$ is a fixed vector this time; next time $w(t+1)$ is a new fixed vector, and so on
- Given $w$, a vector of input, $\nabla E_{\text{in}}$ produces a vector of output
- To make this clear: $\nabla E_{\text{in}}$ takes a vector input and outputs one vector, the gradient of the loss
"Rolling Down ≡ Iterating the Negative Gradient"
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260127094740.png)
The 'Goldilocks' Step Size
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260122104847.png)
- Initially we use a larger step size, and then you make it smaller and smaller as you go
- Some people call this the 'learning rate'
- At the optimal point the gradient is 0, so you can use the size of the current gradient (and hence the current step) to measure how close you are to this optimal point
Fixed Learning Rate Gradient Descent
- Initially (far from the minimum) we want the step size to be large, so choose $\eta_t = \eta\, \lVert \nabla E_{\text{in}}(w(t)) \rVert$
- These two $\lVert \nabla E_{\text{in}}(w(t)) \rVert$ factors cancel and we end up with the fixed learning rate update:
$$
w(t+1) = w(t) - \eta\, \nabla E_{\text{in}}(w(t))
$$
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260127093606.png)
Gradient descent can minimize any smooth function, for example
$$
E_{\text{in}}(w) = \frac{1}{N} \sum_{n=1}^{N} \ln\left(1 + e^{-y_n w^T x_n}\right) \quad \text{(logistic regression)}
$$
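A minimal sketch of fixed learning rate gradient descent for this error. The function and variable names, initialization, and stopping rule are my own choices, not from the slides:

```python
import numpy as np

def gradient(w, X, y):
    """Gradient of E_in(w) = (1/N) sum ln(1 + exp(-y_n w^T x_n))."""
    s = y * (X @ w)                                   # y_n * w^T x_n for every n
    return -(X * (y / (1.0 + np.exp(s)))[:, None]).mean(axis=0)

def gradient_descent(X, y, eta=0.1, max_iters=10000, tol=1e-6):
    """Fixed learning rate GD: w(t+1) = w(t) - eta * grad E_in(w(t))."""
    w = np.zeros(X.shape[1])                          # initial guess
    for _ in range(max_iters):
        g = gradient(w, X, y)
        w = w - eta * g
        if np.linalg.norm(g) < tol:                   # gradient (almost) zero -> near the minimum
            break
    return w
```

With inputs stacked as rows of `X` and labels `y` in $\{-1, +1\}$, `gradient_descent(X, y)` returns the learned weight vector.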
The Meaning
- How do we know when to stop? We set a termination condition, for example a maximum number of iterations or a small enough gradient
Stochastic Gradient Descent (SGD)
A variation of GD that considers only the error on one data point.
- Pick a random data point $(x_n, y_n)$
- Run an iteration of GD on the error of that single point

Logistic Regression: the single-point gradient is
$$
\nabla e_n(w) = -\frac{y_n x_n}{1 + e^{\,y_n w^T x_n}}
$$
Advantages:
- The ’average’ move is the same as GD;
- Computation: fraction 1/N cheaper per step;
- Stochastic: helps escape local minima;
- Simple;
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260127100026.png)
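A sketch of SGD using that single-point gradient (the epoch loop, seeding, and names are assumptions, not from the slides):

```python
import numpy as np

def sgd(X, y, eta=0.1, epochs=100, seed=0):
    """SGD: update w immediately after each individual (x_n, y_n)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for n in rng.permutation(N):          # one full pass over the data = one epoch
            x_n, y_n = X[n], y[n]
            grad_n = -y_n * x_n / (1.0 + np.exp(y_n * np.dot(w, x_n)))
            w = w - eta * grad_n
    return w
```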
Comparison
- Batch Gradient Descent (BGD): Uses the entire dataset to compute the gradient of the loss function. Takes a single update step after processing all training samples.
- If you pass through the whole dataset exactly once, that is called an "epoch"
- Stochastic Gradient Descent (SGD): Updates model parameters after processing each individual training example. Does not wait for the entire dataset; updates are made immediately.
- You compute the gradient of a single data point, then update, and so on
- It is a special case of MBGD
- Mini-Batch Gradient Descent (MBGD): A compromise between Batch GD and Stochastic GD. Divides the dataset into small batches and updates parameters after processing each batch.
- Divide your training set in subsets and update parameters after processing each subset
| Method | Updates Per Iteration | Convergence Speed | Stability | Computation Cost |
|---|---|---|---|---|
| Batch GD | After full dataset | Slow | Stable | High |
| Stochastic GD | After each sample | Fast | Noisy (unstable) | Low |
| Mini-Batch GD | After each mini-batch | Moderate | Balanced | Moderate |
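A sketch of the mini-batch variant from the table; batch size N recovers Batch GD and batch size 1 recovers SGD (batch size, epochs, and names here are my own choices):

```python
import numpy as np

def minibatch_gd(X, y, eta=0.1, batch_size=32, epochs=100, seed=0):
    """Mini-batch GD: update w after each small batch of examples."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(N)
        for start in range(0, N, batch_size):
            batch = order[start:start + batch_size]
            s = y[batch] * (X[batch] @ w)
            grad = -(X[batch] * (y[batch] / (1.0 + np.exp(s)))[:, None]).mean(axis=0)
            w = w - eta * grad
    return w
```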