04 - Overfitting and regularization

Class: CSCE-421


Notes:

Recap

Case Study Polynomial Curve Fitting

Suppose we observe a real-valued input variable x and we wish to use this observation to predict the value of a real-valued target variable t.

![Pasted image 20260203091602.png|350](/img/user/00%20-%20TAMU%20Brain/6th%20Semester%20(Spring%2026)/CSCE-421/Visual%20Aids/Pasted%20image%2020260203091602.png)

Polynomial function:

$$
y(x, \mathbf{w})=w_0+w_1 x+w_2 x^2+\ldots+w_M x^M=\sum_{j=0}^M w_j x^j
$$

**Notes**: If we use a linear model, our input is a single number (1 dimension); we need to add a 1 (the threshold term), and then that becomes the input that we can multiply with the $[w_0, w_1]$ vector. The polynomial feature map takes the $[1, x]$ vector and turns it into a higher-dimensional vector.

$$
\left[\begin{array}{c}
1 \\
x
\end{array}\right]
\xrightarrow{\;f\;}
\left[\begin{array}{c}
1 \\
x \\
x^2 \\
\vdots \\
x^M
\end{array}\right]
$$

Now we have:

$$
[w_0, w_1, \ldots, w_M]
\left[\begin{array}{c}
1 \\
x \\
x^2 \\
\vdots \\
x^M
\end{array}\right]
$$
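As a minimal sketch (NumPy assumed; the helper name `poly_features` and the sample numbers are made up for illustration, not from the lecture), this is what the feature expansion and the inner product with $[w_0, \ldots, w_M]$ look like:

```python
import numpy as np

def poly_features(x, M):
    """Map a scalar input x to the feature vector [1, x, x^2, ..., x^M]."""
    return np.array([x ** j for j in range(M + 1)])

M = 3
w = np.array([0.5, -1.0, 2.0, 0.3])   # [w_0, w_1, ..., w_M], arbitrary example values
x = 0.7

phi = poly_features(x, M)             # higher-dimensional feature vector
y = w @ phi                           # y(x, w) = sum_j w_j * x^j
print(phi, y)
```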

- We are just applying the threshold term to make our vector bigger.

### Sum-of-Squares Error Function

The values of the coefficients will be determined by fitting the polynomial to the training data. This can be done by minimizing an error function that measures the misfit between the function $y(x, \mathbf{w})$, for any given value of $\mathbf{w}$, and the training set data points.

![Pasted image 20260203093247.png|350](/img/user/00%20-%20TAMU%20Brain/6th%20Semester%20(Spring%2026)/CSCE-421/Visual%20Aids/Pasted%20image%2020260203093247.png)

The sum-of-squares error function:

$$
E(\mathbf{w})=\frac{1}{2} \sum_{n=1}^N\left\{y\left(x_n, \mathbf{w}\right)-t_n\right\}^2
$$
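A hedged sketch of evaluating $E(\mathbf{w})$ and minimizing it with ordinary least squares (NumPy assumed; the synthetic $\sin(2\pi x)$ data and variable names are illustrative, not from the lecture):

```python
import numpy as np

def sum_of_squares_error(w, x, t):
    """E(w) = 1/2 * sum_n ( y(x_n, w) - t_n )^2 for a polynomial of order len(w) - 1."""
    Phi = np.vander(x, len(w), increasing=True)   # row n is [1, x_n, x_n^2, ..., x_n^M]
    return 0.5 * np.sum((Phi @ w - t) ** 2)

# Synthetic training set: noisy samples of sin(2*pi*x).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(x.size)

M = 3
Phi = np.vander(x, M + 1, increasing=True)
w_star, *_ = np.linalg.lstsq(Phi, t, rcond=None)  # least squares minimizes E(w)
print(sum_of_squares_error(w_star, x, t))
```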

**Notes**:

- $y(x_n, \mathbf{w})$ is your prediction.
- This is how you get the sum-of-squares error.

### How to choose the order M?

![Pasted image 20260203093353.png|600](/img/user/00%20-%20TAMU%20Brain/6th%20Semester%20(Spring%2026)/CSCE-421/Visual%20Aids/Pasted%20image%2020260203093353.png)

- M=0 means your prediction is simply $w_0$ (a single number), which means your model is simply a line parallel to the $x$ axis.
	- The only degree of freedom is that your prediction is equal to a constant (you can move that line up and down).
	- In this case, minimizing the error consists of moving the line up or down so that it sits at a good distance from every point.
- As you increase M, your model gains degrees of freedom.
- How do you choose the order M?
	- If you choose it too low, the model will be very simple.
	- If you choose it too high, the model will be very complex.
	- This is essentially overfitting (the model is so complicated/has so many parameters that it overfits your data).

### Observations

- The constant ($M=0$) and first order ($M=1$) polynomials give rather poor fits to the data.
- The third order ($M=3$) polynomial seems to give the best fit to the data.
- Using a much higher order polynomial ($M=9$), we obtain an excellent fit to the training data. However, the fitted curve oscillates wildly and gives a very poor representation. This leads to <span style="color:rgb(159, 239, 0)">over-fitting</span>.

### Over-fitting

![Pasted image 20260203093503.png|350](/img/user/00%20-%20TAMU%20Brain/6th%20Semester%20(Spring%2026)/CSCE-421/Visual%20Aids/Pasted%20image%2020260203093503.png)

- Poor generalization.

Root-Mean-Square (RMS) Error:

$$
E_{\mathrm{RMS}}=\sqrt{2 E\left(\mathbf{w}^{\star}\right) / N}
$$
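A sketch of how $E_{\mathrm{RMS}}$ behaves on training vs. test data as $M$ grows (NumPy assumed; the data is synthetic, so the exact numbers are only illustrative). The pattern described in the notes below, where training error keeps falling while test error eventually rises, should show up around $M=9$:

```python
import numpy as np

def fit_poly(x, t, M):
    """Least-squares fit of an order-M polynomial."""
    Phi = np.vander(x, M + 1, increasing=True)
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return w

def rms_error(w, x, t):
    """E_RMS = sqrt(2 * E(w*) / N), i.e. the root of the mean squared residual."""
    Phi = np.vander(x, len(w), increasing=True)
    return np.sqrt(np.mean((Phi @ w - t) ** 2))

rng = np.random.default_rng(1)
def make_data(n):
    x = rng.uniform(0.0, 1.0, n)
    return x, np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(n)

x_train, t_train = make_data(10)
x_test, t_test = make_data(100)

for M in (0, 1, 3, 9):
    w = fit_poly(x_train, t_train, M)
    print(M, rms_error(w, x_train, t_train), rms_error(w, x_test, t_test))
```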

**Notes**:

- Note how, when you increase M, your training error will keep decreasing (your model has more and more parameters).
- What we care about is the test curve. As you increase M, the test error will also decrease at first, but afterwards it will increase again; this is a sign that your model has become too complicated (your model overfits your data).

### Polynomial Coefficients

![Pasted image 20260203093619.png|400](/img/user/00%20-%20TAMU%20Brain/6th%20Semester%20(Spring%2026)/CSCE-421/Visual%20Aids/Pasted%20image%2020260203093619.png)

**Notes**:

- You can see the trend: when the model overfits the data, the optimal $\mathbf{w}$ will tend to have a very large magnitude; this is a sign of overfitting.
- We need a way to constrain the magnitude of $\mathbf{w}$ (we need it small).

### Regularization

- One technique that is often used to control the over-fitting phenomenon in such cases is that of regularization, which involves adding a penalty term to the error function in order to discourage the coefficients from reaching large values.

$$
\widetilde{E}(\mathbf{w})=\frac{1}{2} \sum_{n=1}^N\left\{y\left(x_n, \mathbf{w}\right)-t_n\right\}^2+\frac{\lambda}{2}\|\mathbf{w}\|^2
$$
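A sketch of minimizing the regularized error in closed form, using the $(X^TX+\lambda I)^{-1}X^T\mathbf{t}$ solution discussed in the notes below (NumPy assumed; the data and the $\lambda$ values are illustrative):

```python
import numpy as np

def fit_ridge(x, t, M, lam):
    """Minimize the regularized error: w = (Phi^T Phi + lambda * I)^(-1) Phi^T t."""
    Phi = np.vander(x, M + 1, increasing=True)     # design matrix, row n is [1, x_n, ..., x_n^M]
    A = Phi.T @ Phi + lam * np.eye(M + 1)          # positive definite (hence nonsingular) for lambda > 0
    return np.linalg.solve(A, Phi.T @ t)

rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(x.size)

w_unreg = fit_ridge(x, t, M=9, lam=0.0)            # lambda = 0: plain (ill-conditioned) least squares, huge coefficients
w_reg = fit_ridge(x, t, M=9, lam=np.exp(-18))      # even a tiny lambda shrinks the coefficients a lot
print(np.abs(w_unreg).max(), np.abs(w_reg).max())
```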

- The coefficient $\lambda$ governs the relative importance of the regularization term compared with the sum-of-squares error term.

**Notes**:

- This is just the squared loss we previously talked about.
- Regularization also means that we want to make the magnitude of $\mathbf{w}$ small.
- $\lambda$ is nonnegative.
- We have choices for M:
	- Choose M, or
	- Use this form, set M to a fixed value, and instead choose $\lambda$ (which, unlike M, is a continuous value rather than a discrete one).
	- These two forms are somehow equivalent.
- Our hyperparameter will always be either $M$ or $\lambda$.
- In deep learning the $\frac{\lambda}{2}\|\mathbf{w}\|^2$ term is called weight decay.
- Think of [[01 - Linear Regression#Linear Regression Algorithm]]:
	- $X^TX + \lambda I$ is nonsingular. Why?
	- Proving it is nonsingular is equivalent to proving that the matrix is positive definite; since it is a symmetric matrix, that is all we need.
	- The only thing you need is the definition:
		- $y^T(X^TX+\lambda I)y = \|Xy\|^2 + \lambda\|y\|^2 > 0$ for any $y \neq 0$.
	- So we can get the closed-form solution $(X^TX+\lambda I)^{-1}X^T\mathbf{t}$.
	- Why does this matter?

### Effect of Regularization

![Pasted image 20260203093801.png|500](/img/user/00%20-%20TAMU%20Brain/6th%20Semester%20(Spring%2026)/CSCE-421/Visual%20Aids/Pasted%20image%2020260203093801.png)

- The effect of $\lambda$ is monotonic:
	- If you choose a smaller $\lambda$, your model will fit the data more closely; if you choose a larger $\lambda$, your model becomes simpler.
	- This is like saying: choosing a smaller $\lambda$ is equivalent to choosing a larger M, and so on.

### Polynomial Coefficients

![Pasted image 20260203093838.png|400](/img/user/00%20-%20TAMU%20Brain/6th%20Semester%20(Spring%2026)/CSCE-421/Visual%20Aids/Pasted%20image%2020260203093838.png)

**Notes**:

- Whether it is a linear model or a non-linear one, you always have some sort of hyperparameter; in this case you need to either choose $\lambda$ or choose $M$.
- A hyperparameter means you need to give the model some sort of setting that controls the complexity of the model.

### Regularization: $E_{\mathrm{RMS}}$ vs. $\ln \lambda$

![Pasted image 20260203102357.png|350](/img/user/00%20-%20TAMU%20Brain/6th%20Semester%20(Spring%2026)/CSCE-421/Visual%20Aids/Pasted%20image%2020260203102357.png)

### Model selection for machine learning

![Pasted image 20260203102422.png|500](/img/user/00%20-%20TAMU%20Brain/6th%20Semester%20(Spring%2026)/CSCE-421/Visual%20Aids/Pasted%20image%2020260203102422.png)

**Notes**:

- You pretend the test data is never given to you (you hold onto it).
- But you use the training data to play with your model.
- Before you deliver the model, you want to test it on data that was not used for training.
- A lot of ML papers make mistakes here:
	- You cannot train 10 different models on the same training data, test them all on the test data, and pick the best one.
	- Analogy: it is like taking an exam and choosing only the questions that give you the most points.
	- You have to divide your training data again into some kind of train and validation split.

### Model selection for deep learning

![Pasted image 20260203102754.png|400](/img/user/00%20-%20TAMU%20Brain/6th%20Semester%20(Spring%2026)/CSCE-421/Visual%20Aids/Pasted%20image%2020260203102754.png)

**Notes**:

- We further divide the training data into training and validation, which helps us measure the performance/accuracy of our model before submitting it and testing it with the test data.
- This is because sometimes a test dataset is not available.
- For example, the professor may provide the training/validation data but won't provide the test labels (so we won't be able to actually test our model ourselves).
- In practice, people train different models using this kind of selection.
- Then your final model will depend on this particular split.
	- The validation section may not be representative of the whole dataset, which can be a problem.
	- What we do in this case is make several different train/validation splits and see which split performs best.
	- This is called the *cross-validation procedure* (see the sketch at the end of these notes).
	- If you do 5 folds, you are training 5 different models.
		- But notice that training is very expensive.
- There is still something that is not perfect and that we have not discussed!
	- Imagine you have 10 different $\lambda$ values, so you have 10 different models, and you test each of them on the held-out split.
	- Generally, if you have more training data, your model will be able to perform better; but here you are prohibiting your model from accessing the validation data, which is a loss.
	- What you can do is train another model that is also able to use the validation portion.
		- Downside: you do not know if it is better, because you have no validation set left to test it on.
	- This is the tradeoff:
		- Either we hold out a validation set and test the performance of our model before submission,
		- Or we let our model fit more data, because generally more data = better predictions (though that may not necessarily be true).

### Model selection

![Pasted image 20260203104050.png|400](/img/user/00%20-%20TAMU%20Brain/6th%20Semester%20(Spring%2026)/CSCE-421/Visual%20Aids/Pasted%20image%2020260203104050.png)

**Notes**:

- You have some hyperparameters.
- You do some cross-validation procedure.
- You obtain better hyperparameters.
- You build a new model and train it using these parameters.
- Key takeaway: *You cannot use the test data to validate your model.*
	- You cannot use the test $y$ to test your hyperparameters.
	- If you do, this is a mistake.
	- We do not want our model to depend on the test $y$.
- So if we cannot use $y$, can we use the test $x$ to somehow tune/test our model?
	- No, you should not, because then you can overfit your data!
- In machine learning there are two different settings:
	- **Inductive learning**
		- A training process where we do not use the test $x$.
	- **Transductive learning**
		- Here the test $x$ can be used to train your model.
		- This is useful when the only thing you need is to make good predictions on one given data set, and that's it.
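A sketch of the cross-validation procedure described above, used to choose $\lambda$ without ever touching the test data (NumPy assumed; the fold count, candidate $\lambda$ values, and synthetic data are illustrative, not from the lecture):

```python
import numpy as np

def fit_ridge(x, t, M, lam):
    """Regularized least squares: w = (Phi^T Phi + lambda * I)^(-1) Phi^T t."""
    Phi = np.vander(x, M + 1, increasing=True)
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(M + 1), Phi.T @ t)

def rms_error(w, x, t):
    Phi = np.vander(x, len(w), increasing=True)
    return np.sqrt(np.mean((Phi @ w - t) ** 2))

def cross_validate(x, t, M, lam, k=5, seed=0):
    """k-fold cross-validation: train on k-1 folds, validate on the held-out fold, average the RMS."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    scores = []
    for val_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(x)), val_idx)
        w = fit_ridge(x[train_idx], t[train_idx], M, lam)
        scores.append(rms_error(w, x[val_idx], t[val_idx]))
    return np.mean(scores)

rng = np.random.default_rng(3)
x = rng.uniform(0.0, 1.0, 30)
t = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(x.size)

# Pick the lambda with the best average validation score; the test data is never used here.
candidates = [np.exp(p) for p in range(-20, 1, 4)]
best_lam = min(candidates, key=lambda lam: cross_validate(x, t, M=9, lam=lam))
print(best_lam)
```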