04 - Overfitting and regularization

Class: CSCE-421


Notes:

Recap

Case Study Polynomial Curve Fitting

The Math

Suppose we observe a real-valued input variable x and we wish to use this observation to predict the value of a real-valued target variable t.

![Pasted image 20260203091602.png\|350](/img/user/00%20-%20TAMU%20Brain/6th%20Semester%20(Spring%2026)/CSCE-421/Visual%20Aids/Pasted%20image%2020260203091602.png)

polynomial function:

$$
y(x, \mathbf{w})=w_0+w_1 x+w_2 x^2+\ldots+w_M x^M=\sum_{j=0}^M w_j x^j
$$

Notes: If we use a linear model, our input is a single number (1 dimension); we need to add a 1 (the threshold term), and then that becomes the input we can multiply with the $[w_0, w_1]$ vector. The polynomial transformation takes the $[1, x]$ vector and turns it into a higher-dimensional vector.

$$
\left[\begin{array}{c}
1 \\
x
\end{array}\right]
\xrightarrow{\;f\;}
\left[\begin{array}{c}
1 \\
x \\
x^2 \\
\vdots \\
x^M
\end{array}\right]
$$

Now we have:

$$
[w_0, w_1, \ldots, w_M]
\left[\begin{array}{c}
1 \\
x \\
x^2 \\
\vdots \\
x^M
\end{array}\right]
$$
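The feature map above can be sketched in a few lines of numpy. This is a minimal illustration, not course-provided code; the names `poly_features` and `predict` are mine.

```python
import numpy as np

def poly_features(x, M):
    """Map a scalar input x to the feature vector [1, x, x^2, ..., x^M]."""
    return np.array([x**j for j in range(M + 1)])

def predict(x, w):
    """y(x, w) = w . [1, x, ..., x^M] -- linear in w, polynomial in x."""
    M = len(w) - 1
    return w @ poly_features(x, M)

print(poly_features(2.0, 3))                      # [1. 2. 4. 8.]
print(predict(2.0, np.array([1.0, 0.0, 1.0])))    # 1 + 0*2 + 1*4 = 5.0
```

Note that the model is still a plain dot product between a weight vector and an input vector; only the input has been expanded.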

- We are just applying the threshold term to make our vector bigger

##### The Meaning

**1. The Problem: Straight Lines Aren't Always Enough**

Up until now, our models have been purely linear. If we only have one input variable ($x$) and we want to predict a continuous target number ($t$, which we've previously called $y$), Linear Regression will draw a perfectly straight line through the data: $y = w_0 \cdot 1 + w_1 \cdot x$. However, in the real world, data often curves. If your data looks like a U-shape or a wave, a straight line will do a terrible job of predicting the target. We need a way to draw curves.

**2. The Brilliant "Trick": Polynomial Transformation**

To draw a curve, we need a polynomial equation (like $y = w_0 + w_1x + w_2x^2 + w_3x^3$). Here is the brilliant trick of machine learning: we do not need to invent a brand new "Non-Linear Regression" algorithm to solve this. Instead, we can just trick our existing Linear Regression algorithm into doing it for us. How? By creating artificial features.

- We start with our single real piece of data: **$x$**.
- We add the dummy coordinate **$1$** (which, as noted, is for the threshold/bias term $w_0$). So our input vector is $[1, x]$.
- Now, we use math to stretch this vector out and make it much larger. We manually calculate $x^2$, $x^3$, all the way up to $x^M$ (where $M$ is the highest degree of the polynomial we want).
- Our new, transformed input vector becomes: **$[1, x, x^2, \dots, x^M]$**.

**3. Why does this work?**

Once we have this giant vector, we hand it over to our standard Linear Regression algorithm. The algorithm doesn't know that $x^2$ is just $x$ multiplied by itself. It just treats $x^2$ as if it were a completely independent, brand new feature (like "salary" or "debt"). Because the equation $y = w_0(1) + w_1(x) + w_2(x^2) + \dots + w_M(x^M)$ is just multiplying weights by inputs and adding them up, **it is still perfectly linear with respect to the weights ($w$)**.
The computer can use the exact same matrix math (the pseudo-inverse) from Chapter 3 to instantly find the perfect weights! But when you take those weights and plot the final prediction on a graph against your original $x$, it magically draws a curve.

**4. Clarifying your last note!**

You wrote: _"We are just applying the threshold term to make our vector bigger."_ This is slightly incorrect.

- Adding the dummy coordinate / threshold term (the **$1$**) is what allows us to multiply with $w_0$.
- What makes the vector _bigger_ (higher dimensional) is the **Feature Transform**. We are applying a _polynomial transformation_ to expand the vector from 2 dimensions ($[1, x]$) to $M+1$ dimensions ($[1, x, x^2, \dots, x^M]$).

### Sum-of-Squares Error Function

The values of the coefficients will be determined by fitting the polynomial to the training data. This can be done by minimizing an error function that measures the misfit between the function $y(x,\mathbf{w})$, for any given value of $\mathbf{w}$, and the training set data points.

![Pasted image 20260203093247.png\|350](/img/user/00%20-%20TAMU%20Brain/6th%20Semester%20(Spring%2026)/CSCE-421/Visual%20Aids/Pasted%20image%2020260203093247.png)

The sum-of-squares error function:

$$
E(\mathbf{w})=\frac{1}{2} \sum_{n=1}^N\left\{y\left(x_n, \mathbf{w}\right)-t_n\right\}^2
$$
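The error function above can be computed directly. A minimal numpy sketch (the helper name `sum_of_squares_error` is mine, not from the course):

```python
import numpy as np

def sum_of_squares_error(w, x, t):
    """E(w) = 1/2 * sum_n (y(x_n, w) - t_n)^2 for a polynomial model.

    w : coefficients [w_0, ..., w_M]; x, t : arrays of inputs/targets.
    """
    # np.polyval expects the highest-degree coefficient first, so reverse w.
    y = np.polyval(w[::-1], x)
    return 0.5 * np.sum((y - t) ** 2)

x = np.array([0.0, 1.0, 2.0])
t = np.array([0.0, 1.0, 2.0])
print(sum_of_squares_error(np.array([0.0, 1.0]), x, t))  # y = x fits perfectly: 0.0
```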

**Notes**:
- $y(x_n, \mathbf{w})$ is your prediction.
- This is how you get the sum-of-squares error.
- This is the exact same "Least Squares" error we learned for Linear Regression! We take the difference between our prediction and the target, square it, and sum it up for all $N$ data points.
- _Why the $\frac{1}{2}$ at the front?_ It is just a convenient math trick. Because we squared the error, when we eventually take the derivative (to find the minimum using gradient descent), the exponent $2$ drops down to the front. The $\frac{1}{2}$ simply cancels out that $2$, making the final calculus formula cleaner.

### How to choose the order M?

##### The Math

![Pasted image 20260203093353.png\|600](/img/user/00%20-%20TAMU%20Brain/6th%20Semester%20(Spring%2026)/CSCE-421/Visual%20Aids/Pasted%20image%2020260203093353.png)

- $M=0$ means your prediction is simply $w_0$ (a single number), so your model is simply a line parallel to the $x$ axis.
    - The only degree of freedom is that your prediction equals a constant (you can move that line up and down).
    - In this case, minimizing the error consists of moving the line up or down so that it sits at a good distance from every point.
- As you increase $M$, your model gains degrees of freedom.
- How do you choose the order $M$?
    - If you choose it too low, the model will be very simple.
    - If you choose it too high, the model will be very complex.
    - The latter is essentially overfitting (the model is so complicated/has so many parameters that it overfits your data).

##### Observations

- The constant ($M=0$) and first order ($M=1$) polynomials give rather poor fits to the data.
- The third order ($M=3$) polynomial seems to give the best fit to the data.
- Using a much higher order polynomial ($M=9$), we obtain an excellent fit to the training data. However, the fitted curve oscillates wildly and gives a very poor representation. This leads to <span style="color:rgb(159, 239, 0)">over-fitting</span>.

##### The Meaning

**1. How to choose the order $M$? (Degrees of Freedom)**

Remember that $M$ is the highest power in your polynomial (e.g., $M=2$ means $w_0 + w_1x + w_2x^2$). The value of $M$ dictates the **complexity** of your model, which we often call its "degrees of freedom."

- **$M=0$ (Too Simple):** The formula is just $y = w_0$. This is a completely flat horizontal line. The only "freedom" the model has is to shift this flat line up or down to find the average height of the data points. It cannot tilt or curve. This is called **Underfitting**, because it is too simple to capture the real trend.
- **$M=1$ (Still Simple):** The formula is $y = w_0 + w_1x$. This is a standard straight diagonal line. If your data naturally curves, a straight line will be a poor fit.
- **$M=3$ (Just Right):** This gives you a cubic curve ($w_0 + w_1x + w_2x^2 + w_3x^3$). It has enough flexibility to gently curve and follow the true underlying path of the data points without going crazy.
- **$M=9$ (Overfitting):** This is a massive polynomial with 10 different weights ($w_0$ through $w_9$). Because it has so many degrees of freedom, the algorithm will use them to force the curve to pass through _every single training point exactly_.

**2. The Danger of Over-fitting ($M=9$)**

When you look at the $M=9$ model, the training error ($E_{in}$) is basically zero because the line perfectly connects all the dots. However, to mathematically force a line through all those points, the curve has to **oscillate wildly** (whip up and down) between the points. This is the very definition of **Over-fitting**. The model has completely memorized the training data (and any random noise in it), but it has entirely lost the true, smooth shape of the target. If you tested this $M=9$ model on a brand new data point that falls in between your training points, the wild curve would output a massively incorrect prediction, resulting in a terrible Test Error ($E_{out}$).
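The effect described above is easy to reproduce numerically. The sketch below (my own illustration; the noisy-sine setup mirrors the classic curve-fitting example) fits polynomials of increasing order by least squares and prints the training error, which keeps shrinking as $M$ grows even though the high-order fit generalizes badly.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(10)  # noisy sine data

def fit(M):
    """Least-squares fit of an order-M polynomial; returns (w, training error)."""
    Phi = np.vander(x, M + 1, increasing=True)   # rows [1, x, x^2, ..., x^M]
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    E = 0.5 * np.sum((Phi @ w - t) ** 2)
    return w, E

for M in (0, 1, 3, 9):
    w, E = fit(M)
    # Training error shrinks monotonically; at M=9 the 10 weights can
    # interpolate all 10 points, driving E to (numerically) zero.
    print(f"M={M}  E_train={E:.6f}  max|w|={np.abs(w).max():.1f}")
```

Run it and you will also see the weight magnitudes explode at $M=9$, foreshadowing the regularization discussion below.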
### Over-fitting

![Pasted image 20260203093503.png\|350](/img/user/00%20-%20TAMU%20Brain/6th%20Semester%20(Spring%2026)/CSCE-421/Visual%20Aids/Pasted%20image%2020260203093503.png)

- poor generalization

Root-Mean-Square (RMS) Error:

$$
E_{\text{RMS}}=\sqrt{2 E\left(\mathbf{w}^{\star}\right) / N}
$$
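As a quick worked example of the formula (the helper name `rms_error` is mine): dividing by $N$ lets us compare datasets of different sizes on the same footing, and the square root puts the error back on the same scale as the target $t$.

```python
import numpy as np

def rms_error(E, N):
    """E_RMS = sqrt(2 E(w*) / N), with E the sum-of-squares error at the
    fitted weights w* and N the number of data points."""
    return np.sqrt(2.0 * E / N)

# E.g. a total sum-of-squares error of 25 over 50 points:
print(rms_error(25.0, 50))   # sqrt(2*25/50) = 1.0
```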

**Notes**:
- Note how when you increase $M$, your training error will keep decreasing (your model has more and more parameters).
- What we care about is the Test line. As you increase $M$, the test error will also decrease at first, but afterwards it will increase again; this is a sign that your model has become too complicated (your model overfits your data).

### Polynomial Coefficients

![Pasted image 20260203093619.png\|400](/img/user/00%20-%20TAMU%20Brain/6th%20Semester%20(Spring%2026)/CSCE-421/Visual%20Aids/Pasted%20image%2020260203093619.png)

**Notes**:
- You can see the trend: when the model overfits the data, the optimal $\mathbf{w}$ tends to have a very large magnitude; this is a sign of overfitting.
- We need a way to constrain the magnitude of $\mathbf{w}$ (we need it small).

### Regularization

##### The Math

- One technique that is often used to control the over-fitting phenomenon in such cases is that of regularization, which involves adding a penalty term to the error function in order to discourage the coefficients from reaching large values.

$$
\widetilde{E}(\mathbf{w})=\frac{1}{2} \sum_{n=1}^N\left\{y\left(x_n, \mathbf{w}\right)-t_n\right\}^2+\frac{\lambda}{2}\|\mathbf{w}\|^2
$$

- The coefficient $\lambda$ governs the relative importance of the regularization term compared with the sum-of-squares error term.

**Notes**:
- This is just the squared loss we previously talked about, plus a penalty term.
- Regularization also means that we want the magnitude of $\mathbf{w}$ to be small.
- $\lambda$ is nonnegative.
- We have two ways to control complexity:
    - Choose $M$ directly, or
    - Fix $M$ to some value and instead choose $\lambda$; note that $\lambda$ is continuous while $M$ is discrete.
    - These two forms are somehow equivalent.
- Our hyperparameter will always be either $M$ or $\lambda$.
- In deep learning, the $\frac{\lambda}{2}\|\mathbf{w}\|^2$ term is called **weight decay**.
- Think of [[01 - Linear Regression#Linear Regression Algorithm]]:
    - $X^TX + \lambda I$ is nonsingular. Why?
    - Proving this matrix is nonsingular is equivalent to proving it is positive definite, and since it is symmetric, positive definiteness suffices.
    - The only thing you need is the definition: for any $\mathbf{y} \neq 0$,
      $\mathbf{y}^T(X^TX+\lambda I)\mathbf{y} = \|X\mathbf{y}\|^2 + \lambda\|\mathbf{y}\|^2 > 0$.
    - So we can always compute: $(X^TX+ \lambda I)^{-1}X^T$.
    - Why does this matter?

##### The Meaning

**1. The Problem: Wild Curves Require Huge Weights**

In the previous slides, we saw that when a model overfits (like a 9th-degree polynomial trying to fit 10 noisy points), the curve whips up and down wildly to perfectly hit every single data point. Mathematically, the only way a polynomial can generate these extreme, wild oscillations is by having **massive weight coefficients**.

- _Symptom of overfitting:_ The magnitude of the weights ($\|\mathbf{w}\|$) becomes incredibly large.

**2. The Solution: Regularization**

If huge weights cause wild overfitting, let's punish the model for having huge weights! We do this by changing the "rules" of the training phase. Instead of just telling the computer to minimize the standard Sum-of-Squares Error ($E_{in}$), we add a **penalty term** to the error formula.
The new "Augmented Error" formula becomes:

$$
\widetilde{E}(\mathbf{w}) = E_{in}(\mathbf{w}) + \frac{\lambda}{2}\|\mathbf{w}\|^2
$$

**3. "Soft" vs. "Hard" Complexity ($M$ vs. $\lambda$)** Your notes point out that $\lambda$ and $M$ (the polynomial degree) are both hyperparameters (settings you choose before running the model to control its complexity).

**4. "Weight Decay":** As the algorithm trains, this penalty constantly forces the weights to "decay" toward zero, creating a smoother, less volatile hypothesis.
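The name "weight decay" comes directly from the gradient-descent update: the penalty's gradient is $\lambda\mathbf{w}$, so each step multiplies the weights by a factor slightly below 1. A minimal sketch (my own illustration; `gd_step` and its arguments are hypothetical names):

```python
import numpy as np

def gd_step(w, grad_E, lam, eta):
    """One gradient-descent step on E(w) + (lam/2)||w||^2.

    The full gradient is grad_E + lam*w, so the update rearranges to
        w <- (1 - eta*lam) * w - eta * grad_E
    i.e. the weights are multiplicatively "decayed" toward zero each step.
    """
    return (1.0 - eta * lam) * w - eta * grad_E

w = np.array([10.0, -10.0])
# With a zero data gradient, the weights just shrink geometrically (x0.9/step):
for _ in range(3):
    w = gd_step(w, grad_E=np.zeros(2), lam=1.0, eta=0.1)
print(w)   # ~[7.29, -7.29] = 10 * 0.9^3
```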

**5. Linear Regression Matrix Math (Why does $(X^TX+\lambda I)$ matter?)** In standard Linear Regression, the magic formula to find the best weights is $\mathbf{w}=(X^TX)^{-1}X^T\mathbf{y}$. However, if your data is weird or you have more features than data points, the matrix $X^TX$ might be "singular" (meaning it is impossible to invert), and the math crashes.
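This is easy to demonstrate: a design matrix with two identical columns makes $X^TX$ singular, yet $X^TX+\lambda I$ remains invertible for any $\lambda > 0$. A minimal sketch (the name `ridge_fit` is mine):

```python
import numpy as np

def ridge_fit(X, t, lam):
    """w = (X^T X + lam*I)^{-1} X^T t  -- regularized least squares."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ t)

# Two identical feature columns: X^T X = [[14, 14], [14, 14]] has
# determinant 0, so plain least squares has no unique solution.
X = np.array([[1.0, 1.0],
              [2.0, 2.0],
              [3.0, 3.0]])
t = np.array([1.0, 2.0, 3.0])

# np.linalg.inv(X.T @ X) would raise LinAlgError here, but adding lam*I
# makes the matrix positive definite (y^T(X^TX+lam*I)y = ||Xy||^2 + lam||y||^2 > 0):
w = ridge_fit(X, t, lam=0.1)
print(w)   # the penalty splits the weight evenly across the duplicate features
```

Notice that regularization not only fights overfitting; it also picks a sensible answer (equal weights) when the data alone cannot decide.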

### Effect of Regularization

![Pasted image 20260203093801.png\|500](/img/user/00%20-%20TAMU%20Brain/6th%20Semester%20(Spring%2026)/CSCE-421/Visual%20Aids/Pasted%20image%2020260203093801.png)

### Polynomial Coefficients

![Pasted image 20260203093838.png\|400](/img/user/00%20-%20TAMU%20Brain/6th%20Semester%20(Spring%2026)/CSCE-421/Visual%20Aids/Pasted%20image%2020260203093838.png)

Notes:

### Regularization: $E_{RMS}$ vs. $\ln \lambda$

![Pasted image 20260203102357.png\|350](/img/user/00%20-%20TAMU%20Brain/6th%20Semester%20(Spring%2026)/CSCE-421/Visual%20Aids/Pasted%20image%2020260203102357.png)

### Model selection for machine learning

Model selection: Estimation of the optimal value of the regularization parameter. In practice, cross validation is commonly applied for model selection.

![Pasted image 20260203102422.png\|500](/img/user/00%20-%20TAMU%20Brain/6th%20Semester%20(Spring%2026)/CSCE-421/Visual%20Aids/Pasted%20image%2020260203102422.png)

Notes:

### Model selection for deep learning

![Pasted image 20260203102754.png\|400](/img/user/00%20-%20TAMU%20Brain/6th%20Semester%20(Spring%2026)/CSCE-421/Visual%20Aids/Pasted%20image%2020260203102754.png)

Notes:

### Model selection

![Pasted image 20260203104050.png\|400](/img/user/00%20-%20TAMU%20Brain/6th%20Semester%20(Spring%2026)/CSCE-421/Visual%20Aids/Pasted%20image%2020260203104050.png)

Notes:


##### The Meaning

**1. The Three Buckets of Data** To build a machine learning model, you must divide your data into three distinct buckets, each with a very specific purpose:

- **Training Set:** used to fit the model's weights.
- **Validation Set:** used to choose hyperparameters (like $M$ or $\lambda$) and compare candidate models.
- **Test Set:** locked away until the very end, used once to estimate real-world performance ($E_{out}$).
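The three-way split can be sketched in a few lines of numpy. This is an illustrative helper of my own (the name `train_val_test_split` and the 60/20/20 fractions are assumptions, not from the course):

```python
import numpy as np

def train_val_test_split(X, t, val_frac=0.2, test_frac=0.2, seed=0):
    """Randomly partition (X, t) into the three 'buckets'.

    Train: fit the weights.  Validation: choose hyperparameters.
    Test: held out until the very end for one final evaluation.
    """
    N = len(X)
    idx = np.random.default_rng(seed).permutation(N)
    n_test = int(N * test_frac)
    n_val = int(N * val_frac)
    test, val, train = idx[:n_test], idx[n_test:n_test + n_val], idx[n_test + n_val:]
    return (X[train], t[train]), (X[val], t[val]), (X[test], t[test])

X = np.arange(100.0)
t = X ** 2
train, val, test = train_val_test_split(X, t)
print(len(train[0]), len(val[0]), len(test[0]))   # 60 20 20
```

Shuffling before splitting matters: if the data is ordered (e.g. by time or by class), a naive contiguous split gives unrepresentative buckets.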

**2. The Cardinal Rule: Avoid "Data Snooping"** Your notes emphasize: "You cannot use the test data to validate your model." If you test 10 different models on the Test Set, and pick the one that gets the highest score, you have committed a machine learning crime called Data Snooping. By making a decision based on the Test Set, you have inadvertently turned it into a Validation Set. Because you picked the model specifically because it did well on that exact data, the score is now optimistically biased. When you deploy the model in the real world, it will perform significantly worse.

**3. The Validation Tradeoff and Cross-Validation** As you noted, setting aside data for a Validation Set creates a painful tradeoff: data is expensive, and taking data away from the Training Set means your model won't be as smart. Furthermore, what if the 20% of data you randomly selected for validation happens to be really weird or unrepresentative?

To solve this, we use Cross-Validation (often called V-fold or K-fold cross-validation): split the training data into $K$ folds, train on $K-1$ of them, validate on the held-out fold, rotate through all $K$ folds, and average the $K$ validation errors. Every point gets used for both training and validation, so no single unlucky split can mislead you.
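The cross-validation procedure can be sketched for the polynomial-order selection problem from earlier. This is my own minimal numpy illustration (function names and the noisy-sine data are assumptions); it picks $M$ by cross-validated error instead of training error:

```python
import numpy as np

def kfold_cv_error(x, t, M, K=5):
    """Estimate generalization error of an order-M polynomial by K-fold CV.

    Split the data into K folds; for each fold, fit on the other K-1 folds
    and measure RMS error on the held-out fold; return the average.
    """
    folds = np.array_split(np.random.default_rng(0).permutation(len(x)), K)
    errs = []
    for k in range(K):
        val = folds[k]
        trn = np.concatenate([folds[j] for j in range(K) if j != k])
        Phi = np.vander(x[trn], M + 1, increasing=True)
        w, *_ = np.linalg.lstsq(Phi, t[trn], rcond=None)
        y = np.vander(x[val], M + 1, increasing=True) @ w
        errs.append(np.sqrt(np.mean((y - t[val]) ** 2)))
    return np.mean(errs)

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 30)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(30)

# Pick the order with the lowest cross-validated (not training!) error:
best_M = min(range(10), key=lambda M: kfold_cv_error(x, t, M))
print(best_M)
```

Unlike the training error, the CV error goes back up once $M$ is too large, so the minimum lands at a moderate order rather than at $M=9$.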

**4. Resolving your "Retraining Dilemma"** In your notes for Deep Learning model selection, you wrote: "What you do is to train another model that is able to use the validation section? Downside: you do not know if it is better because you have no validation set to test it."

Here is the exact textbook solution to this dilemma: You should retrain on the validation data! The purpose of the validation phase is purely to pick the best hyperparameters (e.g., finding out that λ=0.01 is the best choice). Once you have proven that λ=0.01 is the optimal architecture, you no longer need the validation set. You take λ=0.01, combine your Training Set and your Validation Set back into one giant dataset, and retrain the model from scratch on all of it. Because machine learning models get better when they have more data (the learning curve), this final model is mathematically expected to be even better than the one you validated. You then use your locked-away Test Set to verify its final performance.

**5. Inductive vs. Transductive Learning & Snooping with X** Can you use the Test Set's x values (the inputs, without the labels y) to help train the model?