02 - Logistic Regression For Binary Classification
Class: CSCE-421
Notes:
Digits Data
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260120093107.png)
- Numbers on an envelope that represent a zip code
Each digit is a 16 * 16 image.
[-1 -1 -1 -1 -1 -1 -1 -0.63 0.86 -0.17 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -0.99 0.3 1 0.31 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ... -0.94 -1 -1 -1 -1 -1 -1 -1 -0.97 -0.42 0.30 0.82 1 0.48 -0.47 -0.99 -1 -1 -1 -1]
- Basically flattening the 16 * 16 picture into one long vector of 256 values
- We are trying to predict 0-9 numbers (10 classes)
- Logistic Regression only works for two classes (binary data)
- If you want to make this work for multiple classes you have to somehow simulate a binary property (see the one-vs-rest sketch below)
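One common way to reduce the 10-class digit problem to binary classifiers is one-vs-rest: train one binary logistic regression per digit and pick the most confident one. A minimal sketch, where `fit_binary` and `score_fn` are hypothetical stand-ins for any binary trainer and scorer:

```python
import numpy as np

def one_vs_rest_train(X, y, fit_binary, num_classes=10):
    """Train one binary classifier per class; fit_binary(X, y_pm1) -> model (hypothetical)."""
    models = []
    for c in range(num_classes):
        y_binary = np.where(y == c, 1.0, -1.0)     # class c vs. everything else
        models.append(fit_binary(X, y_binary))
    return models

def one_vs_rest_predict(X, models, score_fn):
    """score_fn(X, model) -> one confidence per row (hypothetical); pick the most confident class."""
    scores = np.column_stack([score_fn(X, m) for m in models])   # shape (N, num_classes)
    return np.argmax(scores, axis=1)
```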
Image Representations
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260120093331.png)
- For the computer this is a two dimensional number array
Transformations on Images
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260120093804.png)
- The field of view changed but the object remains the same; the vector, however, changes (the object moved a little bit)
- The predictions should remain the same but the result of the dot product will be different
Learning Invariant Representations
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260120094522.png)
- If you rotate an image by 90 degrees, or if you just change it a little bit, predictions must stay the same
- Your prediction should somehow remain the same, but this is not easy to achieve, because any change in the input can change the operations performed to reach a prediction
- You have to design this property so that if you rotate the image by some degrees it will return the same prediction
Intensity and Symmetry Features
Feature: an important property of the input that you think is useful for classification.
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260120094850.png)
- This only works for 1s and 5s
- Intensity: represents how much ink is in the image (see the feature sketch below)
...
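A minimal sketch of how these two features might be computed from a flattened 16 * 16 digit, assuming the common definitions for this dataset (intensity = average pixel value, symmetry = negative average absolute difference between the image and its left-right flip); the exact definitions in the slides may differ:

```python
import numpy as np

def intensity_symmetry(flat_digit):
    """flat_digit: length-256 vector with values in [-1, 1] (row-major 16x16 image)."""
    img = np.asarray(flat_digit).reshape(16, 16)
    intensity = img.mean()                            # more ink -> higher average value
    symmetry = -np.abs(img - np.fliplr(img)).mean()   # perfectly symmetric image -> 0
    return intensity, symmetry
```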
Logistic Regression: Predict Probabilities
Will someone have a heart attack over the next year?
| Attribute | Value |
|---|---|
| age | 62 years |
| blood sugar | 120 mg/dL |
| HDL | 50 |
| LDL | 120 |
| Mass | 190 lbs |
| Height | 5′10′′ |
| ... | ... |
Logistic Regression: Predict the probability of heart attack:
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260120095341.png)
- The prediction is $h(x) = \theta(w^T x)$, where $\theta$ maps the linear signal of the inputs to a probability output; for logistic regression $\theta(s) = \frac{e^s}{1 + e^s}$ (the logistic / sigmoid function)
- Use $y = \pm 1$, do not use $y = 0, 1$
- Making predictions with logistic regression is easy: $w$ is given and $x$ is given, so all you need to do is compute $\theta(w^T x)$ and it will just give a number
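A minimal NumPy sketch of that prediction step (the weight and input vectors below are made-up placeholders, not learned values):

```python
import numpy as np

def predict_proba(x, w):
    """Probability that input x belongs to the +1 class: theta(w^T x)."""
    s = np.dot(w, x)                     # the linear signal
    return 1.0 / (1.0 + np.exp(-s))      # theta(s) = e^s / (1 + e^s), written in a stable form

w = np.array([0.5, -1.2, 0.3])           # given weights (placeholder)
x = np.array([1.0, 0.2, 0.7])            # given input (placeholder)
print(predict_proba(x, w))               # a single number between 0 and 1
```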
Properties
- Probabilities are bounded by 0 and 1
- $s = w^T x$ is the linear signal, and in order to get a prediction we pass it through $\theta$ and threshold the output
- Is logistic regression still a linear model?
    - This is a two-class case
    - You are feeding a "linear" signal into a non-linear function ($\theta$). Does that change linearity?
    - The function $\theta$ is monotonically increasing
    - Still, each $s$ value is attached to a unique $\theta(s)$ value, so using a threshold on $\theta(s)$ would have an equivalent threshold on $s = w^T x$
    - If the function were not monotonic we would have two or more thresholds in $s$ for a single threshold in $\theta(s)$
    - Conclusion: logistic regression is still a linear model (see the check below)
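A quick check of that monotonicity argument (classify as +1 when the predicted probability is at least one half):
$$
\theta(w^T x) \ge \tfrac{1}{2}
\;\Longleftrightarrow\;
\frac{e^{w^T x}}{1 + e^{w^T x}} \ge \tfrac{1}{2}
\;\Longleftrightarrow\;
w^T x \ge 0,
$$
so the decision boundary $w^T x = 0$ is still linear in $x$.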
The Data is Still Binary, ±1
- We cannot measure a probability.
- We can only see the occurrence of an event and try to infer a probability.
Setting
We are trying to learn the target function $f(x) = P(y = +1 \mid x)$.
The data does not give us the value of $f$ explicitly. Rather, it gives us samples generated by this probability: $y = +1$ with probability $f(x)$, and $y = -1$ with probability $1 - f(x)$.
To learn from such data, we need to define a proper error measure that gauges how close a given hypothesis $h$ is to $f$ in terms of these examples.
What Makes an $h$ Good?
The Math
'Fitting' the data means finding a good $h$.
$h$ is good if $h(x_n) \approx 1$ whenever $y_n = +1$, and $h(x_n) \approx 0$ whenever $y_n = -1$.
A simple error measure that captures this is developed below (the cross entropy error).
- In linear regression we computed the derivative, set it to zero, and solved for $w$ in closed form
- With logistic regression it is similar; the only thing is that you cannot solve for the root analytically
    - We would have some initial guess of $w$; you have to use an iterative procedure to compute the final $w$
    - Suppose we have a data point with $y_n = +1$. We need $w$ to be adjusted to fit this data
        - The output $\theta(w^T x_n)$ is the probability that this input $x_n$ belongs to the +1 class
        - What should we do in order to do this?
        - We need $w$ to make $w^T x_n$ as large as necessary (so the probability goes toward 1)
        - If we added a data point with $y_n = -1$, we would need $w^T x_n$ to be as small as possible
The Cross Entropy Error Measure
The Math
$$
E_{\text{in}}(w) = \frac{1}{N} \sum_{n=1}^{N} \ln\left(1 + e^{-y_n w^T x_n}\right)
$$
- We want the term in parentheses, $1 + e^{-y_n w^T x_n}$, to be as small as possible
- $\ln$ is a monotonic function, so minimizing the term in parentheses also minimizes the logarithm of it
- This is not the only error measure we could minimize
It looks complicated and ugly (ln, e^(·), ...),
But,
- it is based on an intuitive probabilistic interpretation of h.
- it is very convenient and mathematically friendly (’easy’ to minimize).
Verify: $y_n = +1$ encourages $w^T x_n \gg 0$, so $\theta(w^T x_n) \approx 1$; $y_n = -1$ encourages $w^T x_n \ll 0$, so $\theta(w^T x_n) \approx 0$.
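A minimal NumPy sketch of this error measure, assuming `X` stacks the inputs as rows and `y` holds the ±1 labels (names are placeholders):

```python
import numpy as np

def cross_entropy_error(w, X, y):
    """E_in(w) = (1/N) * sum ln(1 + exp(-y_n * w^T x_n))."""
    signals = X @ w                                    # linear signals w^T x_n, shape (N,)
    return np.mean(np.log(1.0 + np.exp(-y * signals)))
```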
The Probabilistic Interpretation
The Math
Suppose that $h(x) = \theta(w^T x)$ is indeed the target function.
Given:
$$
P(y \mid x) = \begin{cases} h(x) & \text{for } y = +1 \\ 1 - h(x) & \text{for } y = -1 \end{cases}
$$
we have $P(y \mid x) = \theta(w^T x)$ for $y = +1$ and $P(y \mid x) = \theta(-w^T x)$ for $y = -1$, using $\theta(-s) = 1 - \theta(s)$.
So, more compactly, $P(y \mid x) = \theta(y\, w^T x)$.
- Note this is purely for notational purposes
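The identity used above is quick to check:
$$
1 - \theta(s) = 1 - \frac{e^{s}}{1 + e^{s}} = \frac{1}{1 + e^{s}} = \frac{e^{-s}}{1 + e^{-s}} = \theta(-s).
$$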
The Likelihood
The Math
- "Probability of y given x written in the same formula"
- We will need to use this when we derive our maximum likelihood
- This is a probability (is between 0 and 1)
- If we plugin each of our data points, each output will be somewhere close to 1.
, we want as large as possible , we want as small as possible
- Recall:
are independently generated
Likelihood:
- The probability of getting the
in D from the corresponding
$$
\begin{aligned}
& \max_w \prod_{n=1}^N P\left(y_n \mid x_n\right) \\
\Leftrightarrow\ & \max_w \ln \left(\prod_{n=1}^N P\left(y_n \mid x_n\right)\right) \\
\equiv\ & \max_w \sum_{n=1}^N \ln P\left(y_n \mid x_n\right) \\
\equiv\ & \min_w -\frac{1}{N} \sum_{n=1}^N \ln P\left(y_n \mid x_n\right) \\
=\ & \min_w \frac{1}{N} \sum_{n=1}^N \ln \frac{1}{P\left(y_n \mid x_n\right)} \\
=\ & \min_w \frac{1}{N} \sum_{n=1}^N \ln \frac{1}{\theta\left(y_n w^T x_n\right)} \\
=\ & \min_w \frac{1}{N} \sum_{n=1}^N \ln\left(1 + e^{-y_n w^T x_n}\right) \;=\; \min_w E_{\text{in}}(w)
\end{aligned}
$$
- We added a negative and changed max to min
- Note this became equivalent (not equal) because we are also dividing by $N$ (the number of data points)
- Note again, $-\ln a = \ln \frac{1}{a}$ because of log properties
    - We basically just moved the negative inside the log
- We specialize to our "model" here: $P(y_n \mid x_n) = \theta(y_n w^T x_n)$
    - We basically made $w$ explicit by rewriting the likelihood
- Finally, just use the definition of $\theta(s) = \frac{1}{1 + e^{-s}}$, so $\ln \frac{1}{\theta(y_n w^T x_n)} = \ln\left(1 + e^{-y_n w^T x_n}\right)$
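A small numerical sanity check (made-up data; names are placeholders) that the negative average log-likelihood and the cross entropy error above are the same quantity:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 5, 3
X = rng.normal(size=(N, d))          # made-up inputs
y = rng.choice([-1.0, 1.0], size=N)  # made-up +-1 labels
w = rng.normal(size=d)               # arbitrary weights

theta = lambda s: 1.0 / (1.0 + np.exp(-s))   # logistic function

neg_avg_log_likelihood = -np.mean(np.log(theta(y * (X @ w))))
cross_entropy = np.mean(np.log(1.0 + np.exp(-y * (X @ w))))

print(np.isclose(neg_avg_log_likelihood, cross_entropy))   # True
```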
View: look at the true label as a one-hot vector over the two classes
- If the true label is +1, then the one-hot vector is $[1, 0]$
- If the true label is -1, then the one-hot vector is $[0, 1]$
- The model's corresponding probability vector is $[h(x),\ 1 - h(x)]$
Cross Entropy
- CE tries to maximize the log probability of the correct class:
$$
\mathrm{CE}(y, p) = -\sum_{i} y_i \ln p_i
$$
where $y$ is the one-hot vector (each $y_i$ is one of its elements) and each $p_i$ is a predicted class probability
- Why does this make sense?
    - Only one term in the one-hot vector is not zero
    - That term picks out our model's (log) probability for the correct class
    - We want to maximize a log probability; putting a negative in front turns it into something we minimize, and that is the reason the negative is there
A Neural Network View
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260122100430.png)
- When we try to make predictions, each input $x_i$ will be multiplied with its weight $w_i$
- The summation of all these multiplications yields a single number (the signal $s = w^T x$), which we then pass through $\theta$ to get the final probability that the input belongs to the +1 class
- Note there is an input layer (the $x$ vector) and there is an output layer (the summation), with nothing in between; later, a more complex neural network will add layers in between
How To Minimize
Recall:
$$
E_{\text{in}}(w) = \frac{1}{N} \sum_{n=1}^{N} \ln\left(1 + e^{-y_n w^T x_n}\right)
$$
- The only difference between this and linear regression is that in linear regression we are minimizing the squared error $\frac{1}{N} \sum_{n} (w^T x_n - y_n)^2$
- Linear regression is easy because we can just take the derivative (the error is a degree-2 polynomial in $w$), set it to zero, and end up with a closed-form answer
- But when we try to take the derivative of $E_{\text{in}}(w)$ it is much harder, because the $\ln$ and $e^{(\cdot)}$ do not disappear completely
- And of course the derivative will be a vector (the gradient)

Linear regression: pseudoinverse (analytic), from solving $\nabla E_{\text{in}}(w) = 0$.
Logistic regression: analytic won't work.
Numerically/iteratively set $\nabla E_{\text{in}}(w) = 0$.
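For reference, a term-by-term differentiation of $E_{\text{in}}$ (chain rule; worth re-deriving yourself) gives the gradient that the iterative methods below will need:
$$
\nabla E_{\text{in}}(w) = -\frac{1}{N} \sum_{n=1}^{N} \frac{y_n x_n}{1 + e^{\,y_n w^T x_n}}.
$$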
Finding The Best Weights - Hill Descent
Ball on a complicated hilly terrain rolls down to a local valley (a local minimum)
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260122103143.png)
- This is a one-dimensional problem ($w$ on one axis vs. the loss on the other)
- Note you can only reach the true minimum in a probabilistic sense; it is not guaranteed that you actually reach that point
Questions:
- How to get to the bottom of the deepest valley?
- How to do this when we don’t have gravity?
Our $E_{\text{in}}$ Has Only One Valley
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260122103509.png)
- Convex: a function where, if you pick any two different points on its graph and connect them with a straight line, the function between those points always lies on or below that line (so there is only one valley)
How to "Roll Down"?
The Math
Assume you are at weights $w(t)$ and you take a step of size $\eta$ in the direction of a unit vector $\hat{v}$:
$$
w(t+1) = w(t) + \eta \hat{v}
$$
- $\eta$: a scalar step size
- $\hat{v}$: a unit vector of direction

We get to pick $\hat{v}$.
Pick $\hat{v}$ to make $E_{\text{in}}(w(t+1))$ as small as possible.
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260127094259.png)
- We are moving down the error surface
The Meaning
- We write a loop, and then set a termination condition
- Note you are adding a vector times a scalar
The Gradient is the Fastest Way to Roll Down
The Taylor series expansion of a multivariate function:
$$
E_{\text{in}}(w(t) + \eta \hat{v}) \approx E_{\text{in}}(w(t)) + \eta\, \nabla E_{\text{in}}(w(t))^T \hat{v}
$$
Expanding the change in error:
$$
\Delta E_{\text{in}} \approx \eta\, \nabla E_{\text{in}}(w(t))^T \hat{v} \;\ge\; -\eta\, \lVert \nabla E_{\text{in}}(w(t)) \rVert
$$
The best (steepest) direction to move is the negative gradient:
$$
\hat{v} = -\frac{\nabla E_{\text{in}}(w(t))}{\lVert \nabla E_{\text{in}}(w(t)) \rVert}
$$
- You only need to compute the derivative (gradient) once per step
- $w(t)$ is a fixed vector this time; next time $w(t+1)$ is a new fixed vector, and so on
- Given $w$, a vector of input, $\nabla E_{\text{in}}$ produces a vector of output
- To make this clear: $\nabla E_{\text{in}}$ takes a vector input and outputs one vector, the gradient of the loss
"Rolling Down ≡ Iterating the Negative Gradient"
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260127094740.png)
The 'Goldilocks' Step Size
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260122104847.png)
- Initially we use a larger step size, and then you make it smaller and smaller as you go
- Some people call this the 'learning rate'
- At the optimal point the gradient is 0, so you can use the size of the current gradient (and hence the current step) to measure how close you are to this optimal point
Fixed Learning Rate Gradient Descent
- Initially (far from the minimum) we want the step size to be large, so choose $\eta_t = \eta\, \lVert \nabla E_{\text{in}}(w(t)) \rVert$
- These two $\lVert \nabla E_{\text{in}}(w(t)) \rVert$ factors cancel and we end up with the fixed learning rate update:
$$
w(t+1) = w(t) - \eta\, \nabla E_{\text{in}}(w(t))
$$
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260127093606.png)
Gradient descent can minimize any smooth function, for example
$$
E_{\text{in}}(w) = \frac{1}{N} \sum_{n=1}^{N} \ln\left(1 + e^{-y_n w^T x_n}\right) \quad \text{(logistic regression)}
$$
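A minimal sketch of fixed learning rate gradient descent for this error. The function and variable names, initialization, and stopping rule are my own choices, not from the slides:

```python
import numpy as np

def gradient(w, X, y):
    """Gradient of E_in(w) = (1/N) sum ln(1 + exp(-y_n w^T x_n))."""
    s = y * (X @ w)                                   # y_n * w^T x_n for every n
    return -(X * (y / (1.0 + np.exp(s)))[:, None]).mean(axis=0)

def gradient_descent(X, y, eta=0.1, max_iters=10000, tol=1e-6):
    """Fixed learning rate GD: w(t+1) = w(t) - eta * grad E_in(w(t))."""
    w = np.zeros(X.shape[1])                          # initial guess
    for _ in range(max_iters):
        g = gradient(w, X, y)
        w = w - eta * g
        if np.linalg.norm(g) < tol:                   # gradient (almost) zero -> near the minimum
            break
    return w
```

With inputs stacked as rows of `X` and labels `y` in $\{-1, +1\}$, `gradient_descent(X, y)` returns the learned weight vector.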
The Meaning
- How do we know when to stop? We set a termination condition, for example a maximum number of iterations or a small enough gradient
Stochastic Gradient Descent (SGD)
A variation of GD that considers only the error on one data point.
- Pick a random data point $(x_n, y_n)$
- Run an iteration of GD on the error of that single point

Logistic Regression: the single-point gradient is
$$
\nabla e_n(w) = -\frac{y_n x_n}{1 + e^{\,y_n w^T x_n}}
$$
Advantages:
- The ’average’ move is the same as GD;
- Computation: fraction 1/N cheaper per step;
- Stochastic: helps escape local minima;
- Simple;
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260127100026.png)
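A sketch of SGD using that single-point gradient (the epoch loop, seeding, and names are assumptions, not from the slides):

```python
import numpy as np

def sgd(X, y, eta=0.1, epochs=100, seed=0):
    """SGD: update w immediately after each individual (x_n, y_n)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for n in rng.permutation(N):          # one full pass over the data = one epoch
            x_n, y_n = X[n], y[n]
            grad_n = -y_n * x_n / (1.0 + np.exp(y_n * np.dot(w, x_n)))
            w = w - eta * grad_n
    return w
```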
Comparison
- Batch Gradient Descent (BGD): Uses the entire dataset to compute the gradient of the loss function. Takes a single update step after processing all training samples.
- If you pass through the whole dataset exactly once, that is called an "epoch"
- Stochastic Gradient Descent (SGD): Updates model parameters after processing each individual training example. Does not wait for the entire dataset; updates are made immediately.
- You compute the gradient of a single data point, then update, and so on
- It is a special case of MBGD
- Mini-Batch Gradient Descent (MBGD): A compromise between Batch GD and Stochastic GD. Divides the dataset into small batches and updates parameters after processing each batch.
- Divide your training set in subsets and update parameters after processing each subset
| Method | Updates Per Iteration | Convergence Speed | Stability | Computation Cost |
|---|---|---|---|---|
| Batch GD | After full dataset | Slow | Stable | High |
| Stochastic GD | After each sample | Fast | Noisy (unstable) | Low |
| Mini-Batch GD | After each mini-batch | Moderate | Balanced | Moderate |
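A sketch of the mini-batch variant from the table; batch size N recovers Batch GD and batch size 1 recovers SGD (batch size, epochs, and names here are my own choices):

```python
import numpy as np

def minibatch_gd(X, y, eta=0.1, batch_size=32, epochs=100, seed=0):
    """Mini-batch GD: update w after each small batch of examples."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(N)
        for start in range(0, N, batch_size):
            batch = order[start:start + batch_size]
            s = y[batch] * (X[batch] @ w)
            grad = -(X[batch] * (y[batch] / (1.0 + np.exp(s)))[:, None]).mean(axis=0)
            w = w - eta * grad
    return w
```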