02 - Logistic Regression For Binary Classification
Class: CSCE-421
Notes:
Digits Data
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260120093107.png)
- Numbers on an envelope that represent a zip code
Each digit is a 16 * 16 image.
[-1 -1 -1 -1 -1 -1 -1 -0.63 0.86 -0.17 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -0.99 0.3 1 0.31 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ... -0.94 -1 -1 -1 -1 -1 -1 -1 -0.97 -0.42 0.30 0.82 1 0.48 -0.47 -0.99 -1 -1 -1 -1]
- Basically writing a 16 * 16 picture into a long vector
- We are trying to predict 0-9 numbers (10 classes)
- Logistic Regression only works for two classes (binary data)
- If you want to make this work for multiple other classes you have to somehow simulate a binary property
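A quick sketch of the flattening step described above: the 16 × 16 grid and the [-1, 1] value range come from the notes, while the random image itself is made up for illustration.

```python
import numpy as np

# Hypothetical 16x16 grayscale digit, pixel values in [-1, 1] as in the zip-code data.
image = np.random.uniform(-1.0, 1.0, size=(16, 16))

# Flatten the grid row by row into one long 256-dimensional input vector x.
x = image.flatten()

print(x.shape)  # (256,)
```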
Image Representations
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260120093331.png)
- For the computer this is a two dimensional number array
Transformations on Images
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260120093804.png)
- The field of view changed while the object stayed the same, but the pixel vector changes (the object moved a little bit)
- The predictions should remain the same but the result of the dot product will be different
Learning Invariant Representations
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260120094522.png)
- If you rotate an image by 90 degrees, or if you just change it a little bit, predictions must stay the same
- Your prediction should somehow remain the same, but this is not easy to achieve, because any change in the input can change the operations performed to reach a prediction.
- You have to design this property so that if you rotate the image by some degrees it will return the same prediction.
1. The Human vs. Computer Perspective If you look at a picture of a number "8" and then turn that picture upside down or shift it to the left, you instantly still recognize it as an "8". Humans naturally understand that rotating or moving an object doesn't change what the object actually is. We want our machine learning models to have this same "invariance" (meaning the prediction does not vary when the image is slightly transformed).
2. The Problem with Raw Data To a computer, an image is not a shape; it is just a giant grid of numbers representing pixel colors or darkness. To feed an image into our models, we usually flatten this grid into one long 1D vector (our input $x$).
Here is the problem: if you rotate the image by 90 degrees, the actual object is the same, but the order of the pixels in that long vector gets completely shuffled. Because our model makes predictions by multiplying specific weights by specific pixel locations ($w^\mathrm{T}x$), the shuffled vector produces a completely different score even though the object did not change.
3. The Solution: Invariant Representations To fix this, the slide says your representation of the image should remain the same under these transformations.
The answer is that we should not use the raw pixels directly as our input $x$; instead, we should compute features that do not change when the image is transformed.
A great example from your class: Think about the "Intensity" and "Symmetry" features mentioned earlier in your notes for digit classification.
- Intensity is just the total amount of dark ink in the image. If you take a picture of a "5" and rotate it 90 degrees, the pixels move around, but the total amount of ink stays exactly the same.
- By extracting these "invariant" features, the input vector $x$ that we feed into our logistic regression model remains constant, meaning the model's prediction will stay the same.
Intensity and Symmetry Features
Feature: an important property of the input that you think is useful for classification.
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260120094850.png)
- This only works for 1s and 5s
- Intensity: measures the total amount of ink in the image
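A minimal sketch of these two features. The exact definitions used in class may differ: the symmetry measure here (comparing the image to its left-right mirror) is an assumption, and the toy image is random. The point is that intensity is rotation-invariant, as argued above.

```python
import numpy as np

def intensity(img):
    # Average pixel value: a proxy for the total amount of "ink" in the image.
    return img.mean()

def symmetry(img):
    # Negative mean absolute difference between the image and its left-right
    # mirror: a perfectly symmetric image scores 0, asymmetric ones score lower.
    return -np.abs(img - np.fliplr(img)).mean()

img = np.random.uniform(-1.0, 1.0, size=(16, 16))
rotated = np.rot90(img)  # rotate by 90 degrees

# Rotation just moves the ink around, so the intensity feature is unchanged.
print(np.isclose(intensity(img), intensity(rotated)))  # True
print(symmetry(img))
```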
Linear Classification and Regression
![](/CSCE-421/Ex1/Visual%20Aids/image-1.png)
Logistic Regression: Predict Probabilities
The Math
Will someone have a heart attack over the next year?
| age | 62 years |
|---|---|
| blood sugar | 120 mg/dL |
| HDL | 50 |
| LDL | 120 |
| Mass | 190lbs |
| Height | 5′10′′ |
| ... | ... |
Logistic Regression: Predict the probability of heart attack:
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260120095341.png)
- $\theta$ is the logistic (sigmoid) function applied to the linear signal $s = w^\mathrm{T}x$ (not "the number of inputs/outputs"; see the clarification below).
- Use $y = \pm 1$. Do not use $y = 0, 1$.
- Making predictions with logistic regression is easy: $w$ is given, $x$ is given, all you need to do is compute $\theta(w^\mathrm{T}x)$ and it will just give a number.
Properties
- Probabilities are bounded by 0 and 1.
- $s = w^\mathrm{T}x$ is the linear signal; to get a prediction we pass it through $\theta$ and threshold the output.
- Is logistic regression still a linear model?
    - This is a two-class case.
    - You are feeding a "linear" signal into a nonlinear function ($\theta$). Does that change linearity?
    - The function $\theta$ is monotonically increasing.
    - Still, each $s$ value is attached to exactly one $\theta(s)$ value, so using a threshold on $\theta(s)$ has an equivalent threshold on $s$.
    - If the function were not monotonic, a single threshold on $\theta(s)$ could correspond to two or more thresholds on $s$.
    - Conclusion: Logistic regression is still a linear model.
The Meaning
1. The Goal: Predicting a Probability So far, you have learned how to make a hard "Yes/No" decision (Linear Classification) and how to predict an exact continuous number (Linear Regression). Logistic Regression sits perfectly in the middle. We want to predict the probability of a binary event happening. For example, instead of just saying "Yes, this person will have a heart attack," we want the model to say, "There is an 85% chance this person will have a heart attack." Because it is a probability, the output must strictly be a number between 0 and 1.
2. The Magic Function: $\theta$
To force this raw score to act like a probability, we pass it through a mathematical "squashing" function called the Logistic Function (or Sigmoid function, denoted as $\theta$): $$\theta(s) = \frac{1}{1 + e^{-s}}$$
- If your raw score $s$ is a huge positive number, $e^{-s}$ becomes close to 0, so $\theta(s)$ becomes close to $1$ (100% probability).
- If your raw score $s$ is a huge negative number, $e^{-s}$ becomes huge, making the denominator huge, so $\theta(s)$ gets close to 0 (0% probability).
- If your raw score is exactly 0, $\theta(0) = \frac{1}{2}$ (a 50/50 coin toss).
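These limiting cases can be checked directly with a small sigmoid implementation. A sketch, not the course's code; the test scores ±10 are arbitrary.

```python
import numpy as np

def theta(s):
    # Logistic (sigmoid) function: squashes any real score into (0, 1).
    return 1.0 / (1.0 + np.exp(-s))

print(theta(10.0))   # close to 1: huge positive score -> ~100% probability
print(theta(-10.0))  # close to 0: huge negative score -> ~0% probability
print(theta(0.0))    # 0.5: a 50/50 coin toss

# Monotonicity means a probability threshold equals a raw-score threshold:
print(theta(0.1) >= 0.5, theta(-0.1) >= 0.5)  # True False
```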
3. Clarifying a typo in your notes You wrote: "Theta is the number of inputs/outputs". This is a slight misunderstanding: $\theta$ denotes the logistic (sigmoid) function itself, $\theta(s) = \frac{1}{1+e^{-s}}$; the number of inputs is the dimension of $x$ (and of the weight vector $w$).
4. The Labels: Why use $y = \pm 1$? With $y \in \{-1, +1\}$ and the symmetry property $1 - \theta(s) = \theta(-s)$, both classes can be written with one compact formula, $P(y \mid x) = \theta(y\,w^\mathrm{T}x)$, which is exactly what the likelihood derivation later relies on.
5. The Properties of $\theta$
- Bounded: As explained above, it strictly outputs between $0$ and $1$.
- Symmetry: $\theta(-s) = 1 - \theta(s)$. This makes logical sense: if the probability of having a heart attack is 80% ($\theta(s) = 0.8$), then the probability of not having a heart attack must be 20% ($\theta(-s) = 0.2$).
- Is it still a linear model? Yes! Even though the $\theta$ curve looks like an "S" and is not a straight line, the model is still considered linear. Why? Because the underlying signal $s = w^\mathrm{T}x$ is linear, and the $\theta$ function is monotonically increasing (it strictly goes up and never dips back down). This means if you pick a probability threshold (e.g., "Approve if $\theta(s) \ge 0.8$"), it perfectly translates to a single threshold on the raw score $s$. There is no weird warping of the decision boundary.
The Data Is Still Binary: $y = \pm 1$
- We cannot measure a probability.
- We can only see the occurrence of an event and try to infer a probability.
The Setting
We are trying to learn the target function $f(x) = P(y = +1 \mid x)$.
The data does not give us the value of $f$ explicitly. Rather, it gives us samples $(x_n, y_n)$ generated by this probability.
To learn from such data, we need to define a proper error measure that gauges how close a given hypothesis $h$ is to $f$ in terms of these examples.
What is the data telling us? In Logistic Regression, we are trying to learn a target function $f(x) = P(y = +1 \mid x)$: the true, hidden probability of the event.
Here is the fundamental challenge: the universe knows the exact probability (say, an 80% chance of a heart attack), but the data does not give us that percentage. We do not have a dataset that says [Patient A: 80%]. Instead, we only see the final outcome of that probability. We see [Patient A: +1 (Had a heart attack)] or [Patient B: -1 (Did not)]. We have to reverse-engineer the hidden probability using only these hard "+1 / -1" examples.
What Makes an $h$ Good?
The Math
"Fitting" the data means finding a good $h$.
$h$ is good if $h(x_n) \approx 1$ when $y_n = +1$ and $h(x_n) \approx 0$ when $y_n = -1$.
A simple error measure that captures this: $$E_{\text{in}}(h) = \frac{1}{N} \sum_{n=1}^{N} \left(h(x_n) - \frac{1}{2}(1 + y_n)\right)^2$$
- In linear regression we computed the derivative and solved for its root analytically.
- Logistic regression is similar; the only difference is that you cannot solve for the root in closed form.
- We would start from some initial $w$. You have to use an iterative procedure to compute the final $w$.
    - Suppose we have a data point with $y_n = +1$. We need $w$ to be adjusted to fit this data.
    - The output $\theta(w^\mathrm{T}x_n)$ is the probability that this input $x_n$ belongs to the +1 class.
    - What should we do in order to fit it?
        - We need the weights to make $w^\mathrm{T}x_n$ as large as necessary.
        - If instead the label were $y_n = -1$, we would need $w^\mathrm{T}x_n$ to be as small (as negative) as possible.
The Meaning
1. What Makes a Hypothesis ($h$) Good?
- If a patient actually had a heart attack in the real world ($y_n = +1$), our model's predicted probability $h(x_n)$ should ideally be as close to $1$ (100%) as possible.
- If a patient did not have a heart attack ($y_n = -1$), our model's predicted probability should be as close to $0$ (0%) as possible.
2. The Simple Error Measure (The Math Trick) To train the model, we need an equation that measures how far off our predictions are from the ideal 1 or 0. The slide introduces this preliminary error formula: $$ E_{\text {in }}(h)=\frac{1}{N} \sum_{n=1}^N\left(h\left(\mathrm{x}_n\right)-\frac{1}{2}\left(1+y_n\right)\right)^2 $$ This looks confusing, but it contains a brilliant little math trick:
- If $y_n = +1$, the term $\frac{1}{2}(1 + y_n)$ becomes $1$.
- If $y_n = -1$, it becomes $0$.
So, this formula perfectly converts our "+1 / -1" labels into the "1 or 0" targets we want our model to match! The error function simply takes our prediction $h(x_n)$, subtracts that target, and squares the difference.
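The label-conversion trick and the preliminary error measure can be verified in a few lines. The labels and the "perfect hypothesis" values are toy numbers, not course data.

```python
import numpy as np

# The trick (1 + y) / 2 converts {-1, +1} labels into {0, 1} targets.
y = np.array([+1, -1, +1, -1])
targets = (1 + y) / 2
print(targets)  # [1. 0. 1. 0.]

# Preliminary squared-error measure: zero when the predictions h(x_n)
# exactly match the converted targets.
h = np.array([1.0, 0.0, 1.0, 0.0])  # a perfect hypothesis on these points
E_in = np.mean((h - targets) ** 2)
print(E_in)  # 0.0
```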
3. Why we can't solve it like Linear Regression Your notes point out a major difference between Linear and Logistic regression.
- In Linear Regression, the math was a clean bowl shape, and we could easily take the derivative, set it to zero, and instantly find the perfect weights (the "root" or analytic solution).
- In Logistic Regression, because we pass our signal
through that complex S-curve function , the math gets messy. You cannot just use a formula to find the perfect weights in one step.
4. The Iterative Solution Because we cannot jump straight to the answer, we must start with some initial weights ($w$) and improve them step by step:
- If we have a data point $x_n$ with a label $y_n = +1$, we want our output $\theta(w^\mathrm{T}x_n)$ to be close to $1$. To get an S-curve to output 1, the raw linear signal $w^\mathrm{T}x_n$ must be as large and positive as necessary. So the algorithm tweaks $w$ to make the dot product bigger.
- If we have a data point $x_n$ with a label $y_n = -1$, we want our output to be close to $0$. To get an S-curve to output 0, the raw linear signal must be as small (large and negative) as possible. The algorithm tweaks $w$ to push the dot product into the negatives.
The Cross Entropy Error Measure
The Math
- We want the loss (the thing in parentheses) to be as small as possible.
- $\ln$ is a monotonic function, so minimizing the logarithm also minimizes its argument.
- This is not the only way we can achieve minimization.
- The last factor on the slide has a small typo.
It looks complicated and ugly (ln, e^(·), ...),
But,
- it is based on an intuitive probabilistic interpretation of h.
- it is very convenient and mathematically friendly (’easy’ to minimize).
Verify: minimizing the error with $y_n = +1$ encourages $w^\mathrm{T}x_n \to +\infty$, so $\theta(w^\mathrm{T}x_n) \to 1$; with $y_n = -1$ it encourages $w^\mathrm{T}x_n \to -\infty$, so $\theta(w^\mathrm{T}x_n) \to 0$.
The Meaning
1. The "Ugly" Formula that Works Beautifully In the previous slide, we looked at a preliminary way to measure mistakes. Now, the professor introduces the actual formula used in the real world to train Logistic Regression: the Cross-Entropy Error (often just called Log Loss).
The formula is: $$ E_{\text {in }}(w)=\frac{1}{N} \sum_{n=1}^N \ln \left(1+e^{-y_n \cdot w^{\mathrm{T}} x_n}\right) $$ (And you are completely correct to note that the last factor on the slide has a small typo.)
Let's break down why this formula is actually very intuitive:
- The Agreement Check ($y_n \cdot w^\mathrm{T}x_n$): Remember that our true label $y_n$ is either $+1$ or $-1$. Our raw linear score $w^\mathrm{T}x_n$ can be any positive or negative number. If you multiply them together, a positive result means they share the same sign (the model is correct), and a negative result means they have opposite signs (the model is wrong).
- The Exponential Penalty ($e^{-y_n w^\mathrm{T}x_n}$): If the model is correct and very confident, $y_n w^\mathrm{T}x_n$ is a large positive number. The negative sign in the exponent turns this into $e^{-\text{large}}$, which is a tiny number close to $0$. If the model is very wrong, the exponent becomes positive, and $e^{(\cdot)}$ becomes a huge penalty!
- The Logarithm ($\ln$): The natural logarithm simply acts as a dampener. Because $\ln$ is a monotonic function (meaning it strictly goes up; if $a > b$, then $\ln a > \ln b$), minimizing the whole function is the exact same thing as minimizing the stuff inside the parentheses.
2. Why is it "Mathematically Friendly"? Even though the equation looks like an absolute nightmare of logs and exponents, it has a beautiful property: it is strictly convex. Imagine dropping a marble into a bowl. No matter where you drop it, it will always roll down to the exact same spot at the very bottom. The Cross-Entropy error forms a perfect mathematical bowl (only one valley). This means when we use our "iterative procedure" to tweak the weights step-by-step, the computer is guaranteed to eventually find the absolute best possible weights (the bottom of the valley) without getting stuck.
3. Verification (Sanity Check) The slide finishes by proving that minimizing this formula naturally forces the model to do exactly what we want it to do:
- When a patient actually had a heart attack ($y_n = +1$): To make the error as close to $0$ as possible, the algorithm will push the raw score $w^\mathrm{T}x_n$ to be a huge positive number ($\to +\infty$). If we plug a huge positive number into our S-curve $\theta$, it outputs a probability of $1$ (100%). This matches the reality!
- When a patient did not have a heart attack ($y_n = -1$): To minimize the error, the algorithm will push the raw score $w^\mathrm{T}x_n$ to be a huge negative number ($\to -\infty$). If we plug a huge negative number into our S-curve $\theta$, it outputs a probability of $0$ (0%). This also perfectly matches reality!
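The cross-entropy formula translates almost directly into code. A sketch on a made-up two-point dataset, with the bias folded into the first coordinate of each $x_n$.

```python
import numpy as np

def cross_entropy_error(w, X, y):
    # E_in(w) = (1/N) * sum_n ln(1 + exp(-y_n * w^T x_n))
    scores = X @ w                       # raw linear signals w^T x_n
    return np.mean(np.log1p(np.exp(-y * scores)))

# Toy data: two points in 2-D, labels in {-1, +1}.
X = np.array([[1.0,  2.0],
              [1.0, -2.0]])
y = np.array([+1.0, -1.0])
w = np.array([0.0, 1.0])                 # a hypothesis that agrees with both labels

err = cross_entropy_error(w, X, y)
print(err)  # small, since both points are classified correctly
```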
The Probabilistic Interpretation
Suppose that $h(x) = \theta(w^\mathrm{T}x)$ fits the data well.
Given: $$P(y \mid x) = \begin{cases} h(x) & \text{for } y = +1 \\ 1 - h(x) & \text{for } y = -1 \end{cases}$$
we have the symmetry $1 - \theta(s) = \theta(-s)$.
So, more compactly, $P(y \mid x) = \theta(y\, w^\mathrm{T} x)$.
- Note this is purely for notational purposes
The Likelihood
The Math
- $P(y \mid x) = \theta(y\,w^\mathrm{T}x)$: "Probability of $y$ given $x$ written in the same formula."
- We will need to use this when we derive our maximum likelihood.
- This is a probability (it is between 0 and 1).
- If we plug in each of our data points, each output should come out somewhere close to 1.
    - For $y_n = +1$, we want $w^\mathrm{T}x_n$ as large as possible; for $y_n = -1$, we want it as small as possible.
- Recall: the $(x_n, y_n)$ are independently generated.
- Likelihood: The probability of getting the $y_n$'s in $D$ from the corresponding $x_n$'s: $\prod_{n=1}^N P(y_n \mid x_n)$.
$$\begin{aligned}
& \max_w \prod_{n=1}^N P\left(y_n \mid x_n\right) \\
\Leftrightarrow\ & \max_w \ln \left(\prod_{n=1}^N P\left(y_n \mid x_n\right)\right) \\
\equiv\ & \max_w \sum_{n=1}^N \ln P\left(y_n \mid x_n\right) \\
\equiv\ & \min_w -\frac{1}{N} \sum_{n=1}^N \ln P\left(y_n \mid x_n\right) \\
=\ & \min_w \frac{1}{N} \sum_{n=1}^N \ln \frac{1}{P\left(y_n \mid x_n\right)} \\
=\ & \min_w \frac{1}{N} \sum_{n=1}^N \ln \frac{1}{\theta\left(y_n w^{\mathrm{T}} x_n\right)} \\
=\ & \min_w \frac{1}{N} \sum_{n=1}^N \ln \left(1+e^{-y_n w^{\mathrm{T}} x_n}\right)
\end{aligned}$$
- We added a negative and changed max to min.
- Note this became equivalent (not equal) because we are also dividing by the number of values $N$.
- Note again the next step follows from log properties ($-\ln a = \ln \frac{1}{a}$).
- We basically just moved the negative inside; this turns the log of a value into the log of its reciprocal.
- We specialize to our "model" here.
- We basically made $w$ explicit by rewriting the likelihood with $P(y_n \mid x_n) = \theta(y_n w^\mathrm{T} x_n)$.
- The last step basically just uses the definition of $\theta$: $\frac{1}{\theta(s)} = 1 + e^{-s}$.
The Meaning
3. Maximizing the Likelihood (The Goal) We want to find the weights ($w$) that make the data we actually observed as probable as possible: maximize $\prod_{n=1}^N P(y_n \mid x_n)$. The catch is that multiplying many probabilities, each less than 1, produces an astronomically tiny number that is awkward to optimize directly.
4. The Math Derivation (Step-by-Step) To fix this, we apply a brilliant mathematical trick. We take the Natural Logarithm ($\ln$) of the entire expression.
Here is how we transform the equation step-by-step:
- Step 1: Apply Logarithm ($\ln$). We change $\max \prod_n P(y_n \mid x_n)$ to $\max \ln\left(\prod_n P(y_n \mid x_n)\right)$. Why is this allowed? Because logarithms are strictly monotonically increasing. This means if $a > b$, then $\ln a > \ln b$. Taking the log changes the value of the number, but it does not change where the maximum is. (Note: To answer your note about the "saturating property", you are actually thinking of the logarithmic product rule, which states that $\ln(ab) = \ln a + \ln b$. This is the true magic: it turns our messy multiplication problem into a simple addition problem!)
- Step 2: Convert Product to Sum. Using that rule, $\ln \prod_{n=1}^N P(y_n \mid x_n)$ becomes $\sum_{n=1}^N \ln P(y_n \mid x_n)$.
- Step 3: Flip from Max to Min. In machine learning, we like to minimize error rather than maximize probability. If you want to maximize a number, that is the exact same thing as minimizing the negative of that number. So we add a negative sign, change "max" to "min", and divide by $N$ to find the average error: $\min -\frac{1}{N}\sum_{n=1}^N \ln P(y_n \mid x_n)$.
- Step 4: Move the Negative Inside. Another log rule is $-\ln a = \ln \frac{1}{a}$. So we move the negative inside to get: $\min \frac{1}{N}\sum_{n=1}^N \ln \frac{1}{P(y_n \mid x_n)}$.
- Step 5: Substitute the Model. Replace $P(y_n \mid x_n)$ with $\theta(y_n w^\mathrm{T} x_n)$: $\min \frac{1}{N}\sum_{n=1}^N \ln \frac{1}{\theta(y_n w^\mathrm{T} x_n)}$.
- Step 6: The Final Simplification. Recall that $\theta(s) = \frac{1}{1+e^{-s}}$. If we take the reciprocal of that ($\frac{1}{\theta(s)}$), it simply flips the fraction upside down, giving us $1 + e^{-s}$. If we plug that into our formula, we arrive exactly at the Cross-Entropy Error: $$ \min \frac{1}{N} \sum_{n=1}^N \ln \left(1+e^{-y_n w^{\mathrm{T}} x_n}\right) $$
- Deriving Cross-Entropy from Likelihood:
    - Start: $\max_w \prod_{n=1}^N P(y_n \mid x_n)$.
    - Take $\ln$: $\max_w \sum_{n=1}^N \ln P(y_n \mid x_n)$ (since $\ln$ is monotonic, the maximum location is preserved; $\ln$ turns products into sums).
    - Max to Min: $\min_w -\frac{1}{N}\sum_{n=1}^N \ln P(y_n \mid x_n)$ (maximizing a value is equivalent to minimizing its negative; divide by $N$ for average loss).
    - Invert Log: $\min_w \frac{1}{N}\sum_{n=1}^N \ln \frac{1}{P(y_n \mid x_n)}$ (using the log property $-\ln a = \ln \frac{1}{a}$).
    - Substitute $P(y_n \mid x_n) = \theta(y_n w^\mathrm{T} x_n)$: $\min_w \frac{1}{N}\sum_{n=1}^N \ln \frac{1}{\theta(y_n w^\mathrm{T} x_n)}$.
    - Final Form: Since $\frac{1}{\theta(s)} = 1 + e^{-s}$, this reduces directly to the Cross-Entropy Loss: $\min_w \frac{1}{N}\sum_{n=1}^N \ln\left(1 + e^{-y_n w^\mathrm{T} x_n}\right)$.
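The key identity behind this derivation, $\ln\left(1 + e^{-ys}\right) = -\ln \theta(ys)$, can be checked numerically. A sketch; the test scores are arbitrary.

```python
import numpy as np

def theta(s):
    # Logistic (sigmoid) function.
    return 1.0 / (1.0 + np.exp(-s))

# For any label y in {-1, +1} and any raw score s, the per-point cross-entropy
# loss ln(1 + e^{-y s}) equals the negative log-likelihood -ln P(y|x) = -ln theta(y s).
for y in (-1.0, 1.0):
    for s in (-3.0, 0.0, 2.5):
        lhs = np.log1p(np.exp(-y * s))
        rhs = -np.log(theta(y * s))
        assert np.isclose(lhs, rhs)

print("cross-entropy = negative log-likelihood, point by point")
```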
One-Hot View
- Encode the label as a one-hot vector over the two classes:
    - If the true label is +1, the entry for the +1 class is 1 and the entry for the -1 class is 0.
    - If the true label is -1, the entry for the -1 class is 1 and the other entry is 0.
Cross Entropy
- CE tries to maximize the log probability of the correct class.
- $p$ is a vector in which each element is the probability of one class.
- Why does this make sense?
    - Only one term in the one-hot vector is not zero.
    - That term picks out the model's probability for the correct class.
    - Whenever we work with log probabilities we minimize rather than maximize; that is the reason we put the negative there.
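A tiny numeric illustration of the one-hot view. The 80/20 probabilities and the ordering of the vector entries (class -1 first, class +1 second) are made-up conventions for this example.

```python
import numpy as np

# Model's predicted class probabilities: [P(class -1), P(class +1)].
p = np.array([0.2, 0.8])  # model says 80% chance of class +1

t_pos = np.array([0.0, 1.0])  # one-hot encoding of a true label of +1
t_neg = np.array([1.0, 0.0])  # one-hot encoding of a true label of -1

# Cross entropy -sum_k t_k ln(p_k): the one-hot vector zeroes out every term
# except the log probability of the correct class.
ce_pos = -np.sum(t_pos * np.log(p))
ce_neg = -np.sum(t_neg * np.log(p))
print(ce_pos)  # = -ln(0.8), small: model was confident and right
print(ce_neg)  # = -ln(0.2), larger: model put little mass on the true class
```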
A Neural Network View
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260122100430.png)
- When we try to make predictions, each input $x_i$ will be multiplied with its weight $w_i$.
- The summation of all these multiplications yields a single number, which we then pass through $\theta$ to get the final probability that the input belongs to the +1 class.
- Note there is an input layer (the $x$ vector) and an output layer (the summation); there is nothing in between, which a more complex neural network will later add.
How To Minimize
Recall: $E_{\text{in}}(w) = \frac{1}{N}\sum_{n=1}^N \ln\left(1 + e^{-y_n w^\mathrm{T} x_n}\right)$
- The only difference between this and linear regression is that in linear regression we are minimizing the squared error $\frac{1}{N}\sum_{n=1}^N \left(w^\mathrm{T}x_n - y_n\right)^2$.
- Linear regression is easy because we can just take the derivative (it is a degree-2 polynomial) and solve it in closed form.
- But when we try to take the derivative of $E_{\text{in}}$ it is much harder, because the derivatives of $\ln$ and $e^{(\cdot)}$ do not disappear completely.
- And of course the derivative will be a vector (the gradient).
Linear regression: pseudoinverse (analytic), from solving $\nabla E_{\text{in}}(w) = 0$.
Logistic Regression: analytic won’t work.
Numerically/iteratively set $\nabla E_{\text{in}}(w) \approx 0$.
Finding The Best Weights - Hill Descent
Ball on a complicated hilly terrain rolls down to a local valley (a local minimum)
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260122103143.png)
- This is a one-dimensional problem (error vs. $w$)
- Note: you can only expect to reach the true minimum in the limit; there is no guarantee of landing exactly on that point
Questions:
- How to get to the bottom of the deepest valley?
- How to do this when we don’t have gravity?
Our $E_{\text{in}}$ Has Only One Valley
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260122103509.png)
- Convex: a function where, if you pick any two different points on the curve and connect them with a line segment, the function between them always lies below that segment
How to "Roll Down"?
The Math
Assume you are at weights $w(t)$ and take a step of size $\eta$ in the direction of a unit vector $\hat{v}$: $$w(t+1) = w(t) + \eta\hat{v}$$
- $\eta$: a scalar step size. $\hat{v}$: a unit vector giving the direction.
- We get to pick $\eta$ and $\hat{v}$.
- Pick $\hat{v}$ so that $E_{\text{in}}(w(t+1))$ decreases as much as possible.
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260127094259.png)
- We are moving down the plane
The Meaning
The Setup (Standing on the Hill) Imagine you are standing somewhere on a foggy, hilly landscape. Your goal is to reach the absolute lowest point of the valley, because the "height" of the ground represents your model's error ($E_{\text{in}}$).
Because it is foggy, you cannot see the bottom of the valley. You can only feel the slope of the ground directly under your feet. To get to the bottom, you have to take it one step at a time. Every step you take is defined by this formula: $$ \mathrm{w}(t+1) = \mathrm{w}(t) + \eta \hat{\mathrm{v}} $$
- $w(t)$: Where you are currently standing (your current weights).
- $\eta$ (Eta): How big of a step you are taking (a single number, called a scalar).
- $\hat{v}$: Which direction you choose to step (a unit vector, meaning its length is exactly 1).
The Gradient is the Fastest Way to Roll Down
The Math
The Taylor series expansion of a multivariate function lets us expand $E_{\text{in}}$ around the current weights for a small step $\eta\hat{v}$: $$E_{\text{in}}(w(t) + \eta\hat{v}) \approx E_{\text{in}}(w(t)) + \eta\, \nabla E_{\text{in}}(w(t))^\mathrm{T} \hat{v}$$
The best (steepest) direction to move is the negative gradient: $$\hat{v} = -\frac{\nabla E_{\text{in}}(w(t))}{\lVert \nabla E_{\text{in}}(w(t)) \rVert}$$
- You only need to take the derivative once per step.
- $w(t)$ is a fixed vector this time; next time $w(t+1)$ is a new fixed vector, and so on.
- To make this clear: $E_{\text{in}}$ takes a vector input $w$ and outputs a single number, the scalar loss; its gradient $\nabla E_{\text{in}}(w)$ is a vector (one partial derivative per weight).
The Meaning
1. The Goal: Which way do we step? Since we want to go down into the valley, we want our new error, $E_{\text{in}}(w(t+1))$, to be smaller than our current error, $E_{\text{in}}(w(t))$.
Taylor's approximation is a calculus trick that lets you estimate the value of a function near your current location. It tells us that the change in our error ($\Delta E_{\text{in}}$) is approximately $\eta\, \nabla E_{\text{in}}(w(t))^\mathrm{T} \hat{v}$:
- $\nabla E_{\text{in}}$ (The Gradient): This is the multi-dimensional slope of the ground under your feet.
- The Dot Product ($\nabla E_{\text{in}}^\mathrm{T} \hat{v}$): This multiplies the slope by your step direction.
2. Finding the Fastest Way Down We want $\Delta E_{\text{in}}$ to be as negative as possible.
In linear algebra, if you have a fixed vector, the unit vector that makes the dot product with it most negative is the one pointing in exactly the opposite direction.
- The Gradient ($\nabla E_{\text{in}}$) always points in the direction of the steepest ascent (straight up the hill).
- Therefore, the exact opposite direction, the negative gradient, points in the direction of the steepest descent (straight down the hill).
To make the error drop fastest, we choose $\hat{v} = -\frac{\nabla E_{\text{in}}(w(t))}{\lVert \nabla E_{\text{in}}(w(t)) \rVert}$.
3. Scalar vs. Vector (crucial for the exam!)
$E_{\text{in}}(w)$ is a SCALAR (a single number): it takes a vector input (the weights $w$) and calculates a single number representing the total error of your model. $\nabla E_{\text{in}}(w)$ (the Gradient) is a VECTOR: when you take the derivative of that scalar error with respect to every single weight, the result is a vector of slopes. Remember: Error is a single number (height on the hill). The Gradient is a vector (the slope under your feet).
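For the cross-entropy error, the gradient works out to $\nabla E_{\text{in}}(w) = -\frac{1}{N}\sum_n y_n x_n\, \theta(-y_n w^\mathrm{T}x_n)$. A sketch with a finite-difference sanity check; the toy data and weights are made up.

```python
import numpy as np

def theta(s):
    return 1.0 / (1.0 + np.exp(-s))

def E(w, X, y):
    # Scalar cross-entropy error: a single number for a whole weight vector.
    return np.mean(np.log1p(np.exp(-y * (X @ w))))

def gradient(w, X, y):
    # Vector gradient: -(1/N) sum_n y_n x_n theta(-y_n w^T x_n).
    s = y * (X @ w)                                  # agreement y_n * w^T x_n
    return -(X * (y * theta(-s))[:, None]).mean(axis=0)

X = np.array([[1.0, 2.0], [1.0, -1.5], [1.0, 0.5]])
y = np.array([1.0, -1.0, 1.0])
w = np.array([0.1, -0.3])

g = gradient(w, X, y)
# Finite-difference check: the analytic gradient matches numerical slopes.
eps = 1e-6
num = np.array([(E(w + eps * np.eye(2)[i], X, y) -
                 E(w - eps * np.eye(2)[i], X, y)) / (2 * eps) for i in range(2)])
print(np.allclose(g, num, atol=1e-5))  # True
```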
"Rolling Down ≡ Iterating the Negative Gradient"
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260127094740.png)
The 'Goldilocks' Step Size
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260122104847.png)
- Initially we use a larger step size, and every time you make it smaller and smaller
- Some people call this the 'learning rate'
- At the optimal point the gradient is 0, so you can use the size of the current gradient to measure how close you are to this optimal point
The "Goldilocks" Problem: How big of a step do we take? In the previous slide, we figured out the direction to step to get down the hill (the negative gradient). Now we need to figure out the size of the step, denoted by $\eta$ (the learning rate).
- If your step size is too small, you will take tiny baby steps. It will be computationally inefficient and take forever to reach the bottom of the valley.
- If your step size is too big, you might accidentally step entirely over the valley and land higher up on the other side, causing the algorithm to bounce around and actually make the error worse.
- The "Goldilocks" (Just Right) approach: Ideally, you want to take massive steps when you are high up on the hill and far away from the bottom. Then, as you get closer to the bottom, you want to take smaller, more careful steps so you don't overshoot.
Fixed Learning Rate Gradient Descent
The Math
- Initially we want the step size to be large and to shrink near the minimum, so set $\eta_t = \eta\, \lVert \nabla E_{\text{in}}(w(t)) \rVert$
- The two norms $\lVert \nabla E_{\text{in}} \rVert$ will cancel and we will end up with: $$w(t+1) = w(t) - \eta\, \nabla E_{\text{in}}(w(t))$$
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260127093606.png)
Gradient descent can minimize any smooth function, for example the cross-entropy $E_{\text{in}}$ of logistic regression.
The Meaning
1. The Brilliant Solution: Let the hill do the work How does the computer know if it is far away or close to the bottom? It looks at the steepness of the slope!
- Far away from the minimum, the hill is usually steep, meaning the size (or "norm") of the gradient $\lVert \nabla E_{\text{in}} \rVert$ is a large number.
- At the very bottom of the valley (the optimal point), the ground is perfectly flat, meaning the gradient is exactly $0$.
- Therefore, we can dynamically link our step size to the steepness of the hill. We define a variable step size $\eta_t = \eta\, \lVert \nabla E_{\text{in}} \rVert$. As the slope flattens out, the step size naturally shrinks to zero.
2. The Math Trick (Simplifying the Formula) In the previous slide, we forced our step direction to be a "unit vector" (a vector with a length of exactly 1) by dividing the negative gradient by its own length: $\hat{v} = -\frac{\nabla E_{\text{in}}}{\lVert \nabla E_{\text{in}} \rVert}$.
When we combine our new dynamic step size ($\eta_t = \eta\, \lVert \nabla E_{\text{in}} \rVert$) with that unit direction, the norm in the step size and the norm in the denominator cancel.
Because of that cancellation, the update no longer involves any vector lengths at all.
3. The Final Update Rule Once the lengths cancel out, we are left with a beautifully simple formula for our step: $$w(t+1) = w(t) - \eta\, \nabla E_{\text{in}}(w(t))$$
4. Why Logistic Regression? The slide ends by pointing out that this Gradient Descent method can minimize any "smooth" function. We use it for Logistic Regression because its Cross-Entropy error function is perfectly smooth and strictly convex. This means it forms a perfect single bowl shape with only one valley, guaranteeing that our ball will eventually roll down to the absolute best global minimum without getting stuck.
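Putting the update rule together into a full fixed-learning-rate loop. A sketch on a made-up, linearly separable toy set; the choices of eta, the step count, and the zero initialization are arbitrary.

```python
import numpy as np

def theta(s):
    return 1.0 / (1.0 + np.exp(-s))

def gradient_descent(X, y, eta=0.5, steps=2000):
    # Fixed-learning-rate gradient descent on the cross-entropy error:
    #   w(t+1) = w(t) - eta * grad E_in(w(t))
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        s = y * (X @ w)
        grad = -(X * (y * theta(-s))[:, None]).mean(axis=0)
        w = w - eta * grad
    return w

# Toy data: bias folded into the first coordinate; label follows the 2nd feature's sign.
X = np.array([[1.0, 2.0], [1.0, 1.0], [1.0, -1.0], [1.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = gradient_descent(X, y)
probs = theta(X @ w)
print(probs)  # high for the +1 points, low for the -1 points
```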
Stochastic Gradient Descent (SGD)
A variation of GD that considers only the error on one data point.
- Pick a random data point ($x_*$, $y_*$)
- Run an iteration of GD on the single-point error $e(w) = \ln\left(1 + e^{-y_* w^\mathrm{T} x_*}\right)$
Logistic Regression: $$w(t+1) \leftarrow w(t) + y_* x_* \left(\frac{\eta}{1 + e^{y_* w^\mathrm{T} x_*}}\right)$$
Advantages:
- The ’average’ move is the same as GD;
- Computation: each step costs only a fraction $1/N$ of a full GD step;
- Stochastic: helps escape local minima;
- Simple;
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260127100026.png)
Comparison
- Batch Gradient Descent (BGD): Uses the entire dataset to compute the gradient of the loss function. Takes a single update step after processing all training samples.
- If you pass through the whole data set exactly once, that is called "Epoch"
- Stochastic Gradient Descent (SGD): Updates model parameters after processing each individual training example. Does not wait for the entire dataset; updates are made immediately.
- You compute the gradient of a single data point, then update, and so on
- It is a special case of MBGD
- Mini-Batch Gradient Descent (MBGD): A compromise between Batch GD and Stochastic GD. Divides the dataset into small batches and updates parameters after processing each batch.
- Divide your training set in subsets and update parameters after processing each subset
| Method | Updates Per Iteration | Convergence Speed | Stability | Computation Cost |
|---|---|---|---|---|
| Batch GD | After full dataset | Slow | Stable | High |
| Stochastic GD | After each sample | Fast | Noisy (unstable) | Low |
| Mini-Batch GD | After each mini-batch | Moderate | Balanced | Moderate |
Summary and explanation of different GD techniques
1. The Problem with Standard (Batch) Gradient Descent In the previous slide, we learned that Gradient Descent updates the model's weights by calculating the gradient (the multi-dimensional slope) of the error surface. The total error, $E_{\text{in}}$, is an average over all $N$ data points, so every single update step requires touching the entire dataset, which is expensive when $N$ is large.
2. The Solution: Stochastic Gradient Descent (SGD) SGD takes a radically different approach. Instead of looking at the whole dataset, it picks one single random data point $(x_*, y_*)$ and updates the weights using only that point's error.
- The Math: The update rule is $w(t+1) = w(t) - \eta\, \nabla e_*(w)$, where $e_*$ is the error on the single chosen point. It looks exactly like standard GD, but the little $*$ means we are only taking the derivative of the single-point error.
3. The Logistic Regression SGD Formula Your slide shows a specific, ready-to-use formula for SGD applied to Logistic Regression: $$ \mathrm{w}(t+1) \leftarrow \mathrm{w}(t)+y_* \mathrm{x}_*\left(\frac{\eta}{1+e^{y_* \mathrm{w}^{\mathrm{T}} \mathrm{x}_*}}\right) $$ Where does this come from? If you take the derivative of the single-point Cross-Entropy error $\ln\left(1 + e^{-y_* w^\mathrm{T} x_*}\right)$ with respect to $w$, this update is exactly what pops out.
- Intuition: Look at the term $y_* w^\mathrm{T} x_*$. If the model is very right, this number is a large positive; $e^{y_* w^\mathrm{T} x_*}$ becomes huge, making the fraction close to 0. The model barely updates. If the model is wrong, $y_* w^\mathrm{T} x_*$ is negative; $e^{y_* w^\mathrm{T} x_*}$ is tiny, making the fraction close to its maximum $\eta$. The model takes a large step ($\approx \eta\, y_* x_*$) to aggressively fix its mistake!
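The single-point update can be written as a one-line step function. A sketch; eta = 0.1 and the toy point are arbitrary choices.

```python
import numpy as np

def sgd_step(w, x, y, eta=0.1):
    # Single-point logistic-regression SGD update:
    #   w <- w + y * x * eta / (1 + e^{y w^T x})
    return w + y * x * (eta / (1.0 + np.exp(y * (w @ x))))

w = np.zeros(2)
x = np.array([1.0, 2.0])

# At w = 0 the model is maximally uncertain (w.x = 0), so the fraction is
# eta / 2 and the step moves w in the direction of y * x.
w_new = sgd_step(w, x, +1.0)
print(w_new)  # moves toward +1 * x: [0.05, 0.1]
```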
4. Why does SGD work? (The Advantages) You might wonder: "If we only look at one point, aren't we going to step in the wrong direction?" Yes, frequently! Stepping based on one point is very noisy and "wiggly". However:
- The 'average' move is the same: If you take 1 million wiggly steps based on random individual points, mathematically, the average direction you travel is exactly the same as the true downward slope calculated by Batch GD.
- Computationally cheaper: Calculating the slope for 1 point is $N$ times cheaper than calculating it for all $N$ points. By the time Batch GD takes 1 step, SGD has taken $N$ steps and might already be at the bottom of the valley!
- Helps escape local minima: Because the steps are noisy and random (stochastic), SGD acts like a bouncing ball. If it accidentally rolls into a shallow, fake valley (a local minimum), the randomness can bounce it right back out.
5. The Compromise: Mini-Batch Gradient Descent (MBGD) In reality, picking 1 point at a time (SGD) is too noisy, and picking all points (BGD) is too slow. Mini-Batch GD is the perfect "Goldilocks" compromise used in modern Deep Learning. You divide your dataset into small batches (e.g., 32 or 64 points). You calculate the average gradient of those 32 points, take a step, and move to the next batch. It balances speed and stability perfectly.
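A sketch of the mini-batch loop described above. The batch size, eta, epoch count, RNG seed, and toy data are all made-up choices for illustration.

```python
import numpy as np

def theta(s):
    return 1.0 / (1.0 + np.exp(-s))

def minibatch_gd(X, y, batch_size=2, eta=0.5, epochs=200, seed=0):
    # Mini-batch GD: shuffle each epoch, then take one gradient step per batch.
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    N = len(y)
    for _ in range(epochs):
        order = rng.permutation(N)
        for start in range(0, N, batch_size):
            idx = order[start:start + batch_size]
            s = y[idx] * (X[idx] @ w)
            grad = -(X[idx] * (y[idx] * theta(-s))[:, None]).mean(axis=0)
            w = w - eta * grad
    return w

# Same toy set as before: bias in column 0, label follows the sign of feature 2.
X = np.array([[1.0, 2.0], [1.0, 1.0], [1.0, -1.0], [1.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = minibatch_gd(X, y)
print(w)  # weight on feature 2 comes out positive, matching the labels
```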
6. What is an "Epoch"? This is a crucial term to memorize. An Epoch means the algorithm has looked at every single data point in the entire dataset exactly once.
- In Batch GD, 1 Epoch = 1 weight update step (because it looks at all data at once).
- In SGD with 1,000 data points, 1 Epoch = 1,000 weight update steps (because it updates after every single point).
- Logistic Regression SGD Update Rule: $$\mathrm{w}(t+1) \leftarrow \mathrm{w}(t)+y_* \mathrm{x}_*\left(\frac{\eta}{1+e^{y_* \mathrm{w}^{\mathrm{T}} \mathrm{x}_*}}\right)$$ (Correct predictions yield near-0 updates; wrong predictions yield large corrective updates.)
SGD Advantages:
- Average move mathematically equals the true Batch GD gradient.
- Cheaper computation: $1/N$ of the cost per step compared to Batch GD.
- Stochasticity (Noise): The "wiggly" path helps the algorithm bounce out of bad local minima.
- Simple to implement and great for online/streaming data.
- Terminology: Epoch = One full pass through the entire training dataset.
- Comparison of GD Variants:
- Batch GD (BGD): Uses whole dataset per step. Updates: 1 per epoch. Speed: Slow. Stability: Stable downward path. Cost: High.
- Stochastic GD (SGD): Uses 1 sample per step. Updates: $N$ per epoch. Speed: Fast. Stability: Noisy/Unstable. Cost: Low.
- Mini-Batch GD (MBGD): Uses a small subset (batch) per step. Updates: $N/B$ per epoch for batch size $B$. Speed: Moderate. Stability: Balanced. Cost: Moderate. (Standard for Deep Learning.)