04 - Overfitting and regularization

Class: CSCE-421


Notes:

Recap

Case Study Polynomial Curve Fitting

The Math

Suppose we observe a real-valued input variable x and we wish to use this observation to predict the value of a real-valued target variable t.

![Pasted image 20260203091602.png\|350](/img/user/00%20-%20TAMU%20Brain/6th%20Semester%20(Spring%2026)/CSCE-421/Visual%20Aids/Pasted%20image%2020260203091602.png)

polynomial function:

$$
y(x, \mathbf{w})=w_0+w_1 x+w_2 x^2+\ldots+w_M x^M=\sum_{j=0}^M w_j x^j
$$

Notes: If we use a linear model, our input is a single number (1 dimension); we need to add a 1 (the threshold term), and then that becomes the input we can multiply with the $[w_0, w_1]$ vector. The polynomial transformation takes the $[1, x]$ vector and turns it into a higher-dimensional vector.

$$
\left[\begin{array}{c}
1 \\
x
\end{array}\right]
\xrightarrow{\;f\;}
\left[\begin{array}{c}
1 \\
x \\
x^2 \\
\vdots \\
x^M
\end{array}\right]
$$

Now we have:

$$
[w_0, w_1, \ldots, w_M]
\left[\begin{array}{c}
1 \\
x \\
x^2 \\
\vdots \\
x^M
\end{array}\right]
$$
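The feature map above can be sketched in a few lines of numpy. This is a minimal illustration, not course-provided code; the names `poly_features` and `predict` are mine.

```python
import numpy as np

def poly_features(x, M):
    """Map a scalar input x to the feature vector [1, x, x^2, ..., x^M]."""
    return np.array([x**j for j in range(M + 1)])

def predict(x, w):
    """y(x, w) = w . [1, x, ..., x^M] -- linear in w, polynomial in x."""
    M = len(w) - 1
    return w @ poly_features(x, M)

print(poly_features(2.0, 3))                      # [1. 2. 4. 8.]
print(predict(2.0, np.array([1.0, 0.0, 1.0])))    # 1 + 0*2 + 1*4 = 5.0
```

Note that the model is still a plain dot product between a weight vector and an input vector; only the input has been expanded.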

- We are just applying the threshold term to make our vector bigger

##### The Meaning

**1. The Problem: Straight Lines Aren't Always Enough**

Up until now, our models have been purely linear. If we only have one input variable ($x$) and we want to predict a continuous target number ($t$, which we've previously called $y$), Linear Regression will draw a perfectly straight line through the data: $y = w_0 \cdot 1 + w_1 \cdot x$. However, in the real world, data often curves. If your data looks like a U-shape or a wave, a straight line will do a terrible job of predicting the target. We need a way to draw curves.

**2. The Brilliant "Trick": Polynomial Transformation**

To draw a curve, we need a polynomial equation (like $y = w_0 + w_1x + w_2x^2 + w_3x^3$). Here is the brilliant trick of machine learning: we do not need to invent a brand new "Non-Linear Regression" algorithm to solve this. Instead, we can just trick our existing Linear Regression algorithm into doing it for us. How? By creating artificial features.

- We start with our single real piece of data: **$x$**.
- We add the dummy coordinate **$1$** (which, as noted, is for the threshold/bias term $w_0$). So our input vector is $[1, x]$.
- Now, we use math to stretch this vector out and make it much larger. We manually calculate $x^2$, $x^3$, all the way up to $x^M$ (where $M$ is the highest degree of the polynomial we want).
- Our new, transformed input vector becomes: **$[1, x, x^2, \dots, x^M]$**.

**3. Why does this work?**

Once we have this giant vector, we hand it over to our standard Linear Regression algorithm. The algorithm doesn't know that $x^2$ is just $x$ multiplied by itself. It just treats $x^2$ as if it were a completely independent, brand new feature (like "salary" or "debt"). Because the equation $y = w_0(1) + w_1(x) + w_2(x^2) + \dots + w_M(x^M)$ is just multiplying weights by inputs and adding them up, **it is still perfectly linear with respect to the weights ($w$)**.
The computer can use the exact same matrix math (the pseudo-inverse) from Chapter 3 to instantly find the perfect weights! But when you take those weights and plot the final prediction on a graph against your original $x$, it magically draws a curve.

**4. Clarifying your last note!**

You wrote: _"We are just applying the threshold term to make our vector bigger."_ This is slightly incorrect.

- Adding the dummy coordinate / threshold term (the **$1$**) is what allows us to multiply with $w_0$.
- What makes the vector _bigger_ (higher dimensional) is the **Feature Transform**. We are applying a _polynomial transformation_ to expand the vector from 2 dimensions ($[1, x]$) to $M+1$ dimensions ($[1, x, x^2, \dots, x^M]$).

### Sum-of-Squares Error Function

The values of the coefficients will be determined by fitting the polynomial to the training data. This can be done by minimizing an error function that measures the misfit between the function $y(x,\mathbf{w})$, for any given value of $\mathbf{w}$, and the training set data points.

![Pasted image 20260203093247.png\|350](/img/user/00%20-%20TAMU%20Brain/6th%20Semester%20(Spring%2026)/CSCE-421/Visual%20Aids/Pasted%20image%2020260203093247.png)

The sum-of-squares error function:

$$
E(\mathbf{w})=\frac{1}{2} \sum_{n=1}^N\left\{y\left(x_n, \mathbf{w}\right)-t_n\right\}^2
$$
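The error function above can be computed directly. A minimal numpy sketch (the helper name `sum_of_squares_error` is mine, not from the course):

```python
import numpy as np

def sum_of_squares_error(w, x, t):
    """E(w) = 1/2 * sum_n (y(x_n, w) - t_n)^2 for a polynomial model.

    w : coefficients [w_0, ..., w_M]; x, t : arrays of inputs/targets.
    """
    # np.polyval expects the highest-degree coefficient first, so reverse w.
    y = np.polyval(w[::-1], x)
    return 0.5 * np.sum((y - t) ** 2)

x = np.array([0.0, 1.0, 2.0])
t = np.array([0.0, 1.0, 2.0])
print(sum_of_squares_error(np.array([0.0, 1.0]), x, t))  # y = x fits perfectly: 0.0
```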

**Notes**:
- $y(x_n, \mathbf{w})$ is your prediction.
- This is how you get the sum-of-squares error.
- This is the exact same "Least Squares" error we learned for Linear Regression! We take the difference between our prediction and the target, square it, and sum it up for all $N$ data points.
- _Why the $\frac{1}{2}$ at the front?_ It is just a convenient math trick. Because we squared the error, when we eventually take the derivative (to find the minimum using gradient descent), the exponent $2$ drops down to the front. The $\frac{1}{2}$ simply cancels out that $2$, making the final calculus formula cleaner.

### How to choose the order M?

##### The Math

![Pasted image 20260203093353.png\|600](/img/user/00%20-%20TAMU%20Brain/6th%20Semester%20(Spring%2026)/CSCE-421/Visual%20Aids/Pasted%20image%2020260203093353.png)

- $M=0$ means your prediction is simply $w_0$ (a single number), so your model is simply a line parallel to the $x$ axis.
    - The only degree of freedom is that your prediction equals a constant (you can move that line up and down).
    - In this case, minimizing the error consists of moving the line up or down so that it sits at a good distance from every point.
- As you increase $M$, your model gains degrees of freedom.
- How do you choose the order $M$?
    - If you choose it too low, the model will be very simple.
    - If you choose it too high, the model will be very complex.
    - The latter is essentially overfitting (the model is so complicated/has so many parameters that it overfits your data).

##### Observations

- The constant ($M=0$) and first order ($M=1$) polynomials give rather poor fits to the data.
- The third order ($M=3$) polynomial seems to give the best fit to the data.
- Using a much higher order polynomial ($M=9$), we obtain an excellent fit to the training data. However, the fitted curve oscillates wildly and gives a very poor representation. This leads to <span style="color:rgb(159, 239, 0)">over-fitting</span>.

##### The Meaning

**1. How to choose the order $M$? (Degrees of Freedom)**

Remember that $M$ is the highest power in your polynomial (e.g., $M=2$ means $w_0 + w_1x + w_2x^2$). The value of $M$ dictates the **complexity** of your model, which we often call its "degrees of freedom."

- **$M=0$ (Too Simple):** The formula is just $y = w_0$. This is a completely flat horizontal line. The only "freedom" the model has is to shift this flat line up or down to find the average height of the data points. It cannot tilt or curve. This is called **Underfitting**, because it is too simple to capture the real trend.
- **$M=1$ (Still Simple):** The formula is $y = w_0 + w_1x$. This is a standard straight diagonal line. If your data naturally curves, a straight line will be a poor fit.
- **$M=3$ (Just Right):** This gives you a cubic curve ($w_0 + w_1x + w_2x^2 + w_3x^3$). It has enough flexibility to gently curve and follow the true underlying path of the data points without going crazy.
- **$M=9$ (Overfitting):** This is a massive polynomial with 10 different weights ($w_0$ through $w_9$). Because it has so many degrees of freedom, the algorithm will use them to force the curve to pass through _every single training point exactly_.

**2. The Danger of Over-fitting ($M=9$)**

When you look at the $M=9$ model, the training error ($E_{in}$) is basically zero because the line perfectly connects all the dots. However, to mathematically force a line through all those points, the curve has to **oscillate wildly** (whip up and down) between the points. This is the very definition of **Over-fitting**. The model has completely memorized the training data (and any random noise in it), but it has entirely lost the true, smooth shape of the target. If you tested this $M=9$ model on a brand new data point that falls in between your training points, the wild curve would output a massively incorrect prediction, resulting in a terrible Test Error ($E_{out}$).
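The effect described above is easy to reproduce numerically. The sketch below (my own illustration; the noisy-sine setup mirrors the classic curve-fitting example) fits polynomials of increasing order by least squares and prints the training error, which keeps shrinking as $M$ grows even though the high-order fit generalizes badly.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(10)  # noisy sine data

def fit(M):
    """Least-squares fit of an order-M polynomial; returns (w, training error)."""
    Phi = np.vander(x, M + 1, increasing=True)   # rows [1, x, x^2, ..., x^M]
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    E = 0.5 * np.sum((Phi @ w - t) ** 2)
    return w, E

for M in (0, 1, 3, 9):
    w, E = fit(M)
    # Training error shrinks monotonically; at M=9 the 10 weights can
    # interpolate all 10 points, driving E to (numerically) zero.
    print(f"M={M}  E_train={E:.6f}  max|w|={np.abs(w).max():.1f}")
```

Run it and you will also see the weight magnitudes explode at $M=9$, foreshadowing the regularization discussion below.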
### Over-fitting

![Pasted image 20260203093503.png\|350](/img/user/00%20-%20TAMU%20Brain/6th%20Semester%20(Spring%2026)/CSCE-421/Visual%20Aids/Pasted%20image%2020260203093503.png)

- poor generalization

Root-Mean-Square (RMS) Error:

$$
E_{\text{RMS}}=\sqrt{2 E\left(\mathbf{w}^{\star}\right) / N}
$$
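As a quick worked example of the formula (the helper name `rms_error` is mine): dividing by $N$ lets us compare datasets of different sizes on the same footing, and the square root puts the error back on the same scale as the target $t$.

```python
import numpy as np

def rms_error(E, N):
    """E_RMS = sqrt(2 E(w*) / N), with E the sum-of-squares error at the
    fitted weights w* and N the number of data points."""
    return np.sqrt(2.0 * E / N)

# E.g. a total sum-of-squares error of 25 over 50 points:
print(rms_error(25.0, 50))   # sqrt(2*25/50) = 1.0
```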

**Notes**:
- Note how when you increase $M$, your training error will keep decreasing (your model has more and more parameters).
- What we care about is the Test line. As you increase $M$, the test error will also decrease at first, but afterwards it will increase again; this is a sign that your model has become too complicated (your model overfits your data).

### Polynomial Coefficients

![Pasted image 20260203093619.png\|400](/img/user/00%20-%20TAMU%20Brain/6th%20Semester%20(Spring%2026)/CSCE-421/Visual%20Aids/Pasted%20image%2020260203093619.png)

**Notes**:
- You can see the trend: when the model overfits the data, the optimal $\mathbf{w}$ tends to have a very large magnitude; this is a sign of overfitting.
- We need a way to constrain the magnitude of $\mathbf{w}$ (we need it small).

### Regularization

##### The Math

- One technique that is often used to control the over-fitting phenomenon in such cases is that of regularization, which involves adding a penalty term to the error function in order to discourage the coefficients from reaching large values.

$$
\widetilde{E}(\mathbf{w})=\frac{1}{2} \sum_{n=1}^N\left\{y\left(x_n, \mathbf{w}\right)-t_n\right\}^2+\frac{\lambda}{2}\|\mathbf{w}\|^2
$$

- The coefficient $\lambda$ governs the relative importance of the regularization term compared with the sum-of-squares error term.

**Notes**:
- This is just the squared loss we previously talked about, plus a penalty term.
- Regularization also means that we want the magnitude of $\mathbf{w}$ to be small.
- $\lambda$ is nonnegative.
- We have two ways to control complexity:
    - Choose $M$ directly, or
    - Fix $M$ to some value and instead choose $\lambda$; note that $\lambda$ is continuous while $M$ is discrete.
    - These two forms are somehow equivalent.
- Our hyperparameter will always be either $M$ or $\lambda$.
- In deep learning, the $\frac{\lambda}{2}\|\mathbf{w}\|^2$ term is called **weight decay**.
- Think of [[01 - Linear Regression#Linear Regression Algorithm]]:
    - $X^TX + \lambda I$ is nonsingular. Why?
    - Proving this matrix is nonsingular is equivalent to proving it is positive definite, and since it is symmetric, positive definiteness suffices.
    - The only thing you need is the definition: for any $\mathbf{y} \neq 0$,
      $\mathbf{y}^T(X^TX+\lambda I)\mathbf{y} = \|X\mathbf{y}\|^2 + \lambda\|\mathbf{y}\|^2 > 0$.
    - So we can always compute: $(X^TX+ \lambda I)^{-1}X^T$.
    - Why does this matter?

##### The Meaning

**1. The Problem: Wild Curves Require Huge Weights**

In the previous slides, we saw that when a model overfits (like a 9th-degree polynomial trying to fit 10 noisy points), the curve whips up and down wildly to perfectly hit every single data point. Mathematically, the only way a polynomial can generate these extreme, wild oscillations is by having **massive weight coefficients**.

- _Symptom of overfitting:_ The magnitude of the weights ($\|\mathbf{w}\|$) becomes incredibly large.

**2. The Solution: Regularization**

If huge weights cause wild overfitting, let's punish the model for having huge weights! We do this by changing the "rules" of the training phase. Instead of just telling the computer to minimize the standard Sum-of-Squares Error ($E_{in}$), we add a **penalty term** to the error formula.
The new "Augmented Error" formula becomes:

$$
\widetilde{E}(\mathbf{w}) = E_{in}(\mathbf{w}) + \frac{\lambda}{2}\|\mathbf{w}\|^2
$$

**3. "Soft" vs. "Hard" Complexity ($M$ vs. $\lambda$)** Your notes point out that $\lambda$ and $M$ (the polynomial degree) are both hyperparameters (settings you choose before running the model to control its complexity).

**4. "Weight Decay":** As the algorithm trains, this penalty constantly forces the weights to "decay" toward zero, creating a smoother, less volatile hypothesis.
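The name "weight decay" comes directly from the gradient-descent update: the penalty's gradient is $\lambda\mathbf{w}$, so each step multiplies the weights by a factor slightly below 1. A minimal sketch (my own illustration; `gd_step` and its arguments are hypothetical names):

```python
import numpy as np

def gd_step(w, grad_E, lam, eta):
    """One gradient-descent step on E(w) + (lam/2)||w||^2.

    The full gradient is grad_E + lam*w, so the update rearranges to
        w <- (1 - eta*lam) * w - eta * grad_E
    i.e. the weights are multiplicatively "decayed" toward zero each step.
    """
    return (1.0 - eta * lam) * w - eta * grad_E

w = np.array([10.0, -10.0])
# With a zero data gradient, the weights just shrink geometrically (x0.9/step):
for _ in range(3):
    w = gd_step(w, grad_E=np.zeros(2), lam=1.0, eta=0.1)
print(w)   # ~[7.29, -7.29] = 10 * 0.9^3
```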

**5. Linear Regression Matrix Math (Why does $(X^TX+\lambda I)$ matter?)** In standard Linear Regression, the magic formula to find the best weights is $\mathbf{w}=(X^TX)^{-1}X^T\mathbf{y}$. However, if your data is weird or you have more features than data points, the matrix $X^TX$ might be "singular" (meaning it is impossible to invert), and the math crashes.
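This is easy to demonstrate: a design matrix with two identical columns makes $X^TX$ singular, yet $X^TX+\lambda I$ remains invertible for any $\lambda > 0$. A minimal sketch (the name `ridge_fit` is mine):

```python
import numpy as np

def ridge_fit(X, t, lam):
    """w = (X^T X + lam*I)^{-1} X^T t  -- regularized least squares."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ t)

# Two identical feature columns: X^T X = [[14, 14], [14, 14]] has
# determinant 0, so plain least squares has no unique solution.
X = np.array([[1.0, 1.0],
              [2.0, 2.0],
              [3.0, 3.0]])
t = np.array([1.0, 2.0, 3.0])

# np.linalg.inv(X.T @ X) would raise LinAlgError here, but adding lam*I
# makes the matrix positive definite (y^T(X^TX+lam*I)y = ||Xy||^2 + lam||y||^2 > 0):
w = ridge_fit(X, t, lam=0.1)
print(w)   # the penalty splits the weight evenly across the duplicate features
```

Notice that regularization not only fights overfitting; it also picks a sensible answer (equal weights) when the data alone cannot decide.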

### Effect of Regularization

![Pasted image 20260203093801.png\|500](/img/user/00%20-%20TAMU%20Brain/6th%20Semester%20(Spring%2026)/CSCE-421/Visual%20Aids/Pasted%20image%2020260203093801.png)

### Polynomial Coefficients

![Pasted image 20260203093838.png\|400](/img/user/00%20-%20TAMU%20Brain/6th%20Semester%20(Spring%2026)/CSCE-421/Visual%20Aids/Pasted%20image%2020260203093838.png)

Notes:

### Regularization: $E_{RMS}$ vs. $\ln \lambda$

![Pasted image 20260203102357.png\|350](/img/user/00%20-%20TAMU%20Brain/6th%20Semester%20(Spring%2026)/CSCE-421/Visual%20Aids/Pasted%20image%2020260203102357.png)

### Model selection for machine learning

Model selection: Estimation of the optimal value of the regularization parameter. In practice, cross validation is commonly applied for model selection.

![Pasted image 20260203102422.png\|500](/img/user/00%20-%20TAMU%20Brain/6th%20Semester%20(Spring%2026)/CSCE-421/Visual%20Aids/Pasted%20image%2020260203102422.png)

Notes:

### Model selection for deep learning

![Pasted image 20260203102754.png\|400](/img/user/00%20-%20TAMU%20Brain/6th%20Semester%20(Spring%2026)/CSCE-421/Visual%20Aids/Pasted%20image%2020260203102754.png)

Notes:

### Model selection

![Pasted image 20260203104050.png\|400](/img/user/00%20-%20TAMU%20Brain/6th%20Semester%20(Spring%2026)/CSCE-421/Visual%20Aids/Pasted%20image%2020260203104050.png)

Notes:


##### The Meaning

**1. The Three Buckets of Data** To build a machine learning model, you must divide your data into three distinct buckets, each with a very specific purpose:

- **Training Set:** used to fit the model's weights.
- **Validation Set:** used to choose hyperparameters (like $M$ or $\lambda$) and compare candidate models.
- **Test Set:** locked away until the very end, used once to estimate real-world performance ($E_{out}$).
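The three-way split can be sketched in a few lines of numpy. This is an illustrative helper of my own (the name `train_val_test_split` and the 60/20/20 fractions are assumptions, not from the course):

```python
import numpy as np

def train_val_test_split(X, t, val_frac=0.2, test_frac=0.2, seed=0):
    """Randomly partition (X, t) into the three 'buckets'.

    Train: fit the weights.  Validation: choose hyperparameters.
    Test: held out until the very end for one final evaluation.
    """
    N = len(X)
    idx = np.random.default_rng(seed).permutation(N)
    n_test = int(N * test_frac)
    n_val = int(N * val_frac)
    test, val, train = idx[:n_test], idx[n_test:n_test + n_val], idx[n_test + n_val:]
    return (X[train], t[train]), (X[val], t[val]), (X[test], t[test])

X = np.arange(100.0)
t = X ** 2
train, val, test = train_val_test_split(X, t)
print(len(train[0]), len(val[0]), len(test[0]))   # 60 20 20
```

Shuffling before splitting matters: if the data is ordered (e.g. by time or by class), a naive contiguous split gives unrepresentative buckets.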

**2. The Cardinal Rule: Avoid "Data Snooping"** Your notes emphasize: "You cannot use the test data to validate your model." If you test 10 different models on the Test Set, and pick the one that gets the highest score, you have committed a machine learning crime called Data Snooping. By making a decision based on the Test Set, you have inadvertently turned it into a Validation Set. Because you picked the model specifically because it did well on that exact data, the score is now optimistically biased. When you deploy the model in the real world, it will perform significantly worse.

**3. The Validation Tradeoff and Cross-Validation** As you noted, setting aside data for a Validation Set creates a painful tradeoff: data is expensive, and taking data away from the Training Set means your model won't be as smart. Furthermore, what if the 20% of data you randomly selected for validation happens to be really weird or unrepresentative?

To solve this, we use Cross-Validation (often called V-fold or K-fold cross-validation): split the training data into $K$ folds, train on $K-1$ of them, validate on the held-out fold, rotate through all $K$ folds, and average the $K$ validation errors. Every point gets used for both training and validation, so no single unlucky split can mislead you.
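The cross-validation procedure can be sketched for the polynomial-order selection problem from earlier. This is my own minimal numpy illustration (function names and the noisy-sine data are assumptions); it picks $M$ by cross-validated error instead of training error:

```python
import numpy as np

def kfold_cv_error(x, t, M, K=5):
    """Estimate generalization error of an order-M polynomial by K-fold CV.

    Split the data into K folds; for each fold, fit on the other K-1 folds
    and measure RMS error on the held-out fold; return the average.
    """
    folds = np.array_split(np.random.default_rng(0).permutation(len(x)), K)
    errs = []
    for k in range(K):
        val = folds[k]
        trn = np.concatenate([folds[j] for j in range(K) if j != k])
        Phi = np.vander(x[trn], M + 1, increasing=True)
        w, *_ = np.linalg.lstsq(Phi, t[trn], rcond=None)
        y = np.vander(x[val], M + 1, increasing=True) @ w
        errs.append(np.sqrt(np.mean((y - t[val]) ** 2)))
    return np.mean(errs)

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 30)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(30)

# Pick the order with the lowest cross-validated (not training!) error:
best_M = min(range(10), key=lambda M: kfold_cv_error(x, t, M))
print(best_M)
```

Unlike the training error, the CV error goes back up once $M$ is too large, so the minimum lands at a moderate order rather than at $M=9$.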

**4. Resolving your "Retraining Dilemma"** In your notes for Deep Learning model selection, you wrote: "What you do is to train another model that is able to use the validation section? Downside: you do not know if it is better because you have no validation set to test it."

Here is the exact textbook solution to this dilemma: You should retrain on the validation data! The purpose of the validation phase is purely to pick the best hyperparameters (e.g., finding out that λ=0.01 is the best choice). Once you have proven that λ=0.01 is the optimal architecture, you no longer need the validation set. You take λ=0.01, combine your Training Set and your Validation Set back into one giant dataset, and retrain the model from scratch on all of it. Because machine learning models get better when they have more data (the learning curve), this final model is mathematically expected to be even better than the one you validated. You then use your locked-away Test Set to verify its final performance.

**5. Inductive vs. Transductive Learning & Snooping with X** Can you use the Test Set's x values (the inputs, without the labels y) to help train the model?