The Penalty: The term is just the sum of all your weights squared, Σⱼ wⱼ². If the weights get big, this penalty gets huge.
The Dial (λ): λ (lambda) is a nonnegative number that dictates how much you care about the penalty. If λ is huge, the model will be terrified of having large weights and will keep them very close to zero, resulting in a smooth, simple curve. If λ = 0, there is no penalty at all (which is just standard regression).
3. "Soft" vs. "Hard" Complexity (λ vs. M) Your notes point out that λ and M (the polynomial degree) are both hyperparameters (settings you choose before running the model to control its complexity).
Hard Constraint (M): If you choose M = 2, you are strictly forcing the model to be a parabola. It has exactly 3 weights (w₀, w₁, w₂), and the rest are explicitly zero.
Soft Constraint (λ): Alternatively, you can use a massive polynomial like M = 10, but apply a large λ. The model technically has 11 weights, but the penalty "squashes" them down so the curve acts smoothly like a lower-degree polynomial. This is often vastly superior to just picking a low M.
4. "Weight Decay": As the algorithm trains, this penalty constantly forces the weights to "decay" toward zero, creating a smoother, less volatile hypothesis.
5. Linear Regression Matrix Math (Why does λ matter?) In standard Linear Regression, the magic formula to find the best weights is w = (XᵀX)⁻¹Xᵀy. However, if your data is weird or you have more features than data points, the matrix XᵀX might be "singular" (meaning it is impossible to invert), and the math crashes.
Why adding λI makes it nonsingular: In linear algebra, XᵀX is positive semi-definite (it can have eigenvalues of 0, making it singular). By adding λI (a diagonal matrix with positive λ on the diagonal), you shift all the eigenvalues up by λ, making the matrix strictly positive definite.
Why this matters: A positive definite matrix is mathematically guaranteed to be invertible (nonsingular). Therefore, the regularized formula w = (XᵀX + λI)⁻¹Xᵀy will always work and never crash, giving you a highly stable, unique mathematical solution every single time!
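The regularized closed-form solution can be sketched in a few lines of numpy. This is a minimal illustration, assuming synthetic data (a noisy sine curve) and an arbitrary λ value, not the course's exact code:

```python
import numpy as np

# Synthetic data: a noisy sine curve (illustrative assumption)
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=20)
y = np.sin(np.pi * x) + rng.normal(0, 0.1, size=20)

M = 9                                      # high-degree polynomial
X = np.vander(x, M + 1, increasing=True)   # design matrix: [1, x, x^2, ..., x^M]

lam = 1e-3
# w = (X^T X + lambda*I)^{-1} X^T y  -- invertible for any lam > 0
w = np.linalg.solve(X.T @ X + lam * np.eye(M + 1), X.T @ y)
print(w.shape)  # (10,)
```

Note that `np.linalg.solve` is used instead of explicitly inverting the matrix, which is the numerically stable way to evaluate the formula.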
Effect of Regularization
The training error is a monotonic function of λ (it can only increase as λ grows)
If you choose a smaller lambda, your model will fit the training data more closely; if you choose a larger lambda, your model becomes simpler
This is like saying: choosing a smaller lambda is roughly equivalent to choosing a larger M, and vice versa.
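The "dial" behavior above can be demonstrated directly: as λ grows, the norm of the fitted weight vector shrinks, which is what makes the curve simpler. A small sketch, assuming synthetic data and a degree-9 polynomial (both illustrative choices):

```python
import numpy as np

# Synthetic noisy data (illustrative assumption)
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=15)
y = np.sin(np.pi * x) + rng.normal(0, 0.2, size=15)
X = np.vander(x, 10, increasing=True)   # degree-9 polynomial features

def ridge(lam):
    # Regularized closed-form solution w = (X^T X + lam*I)^{-1} X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# The weight norm shrinks monotonically as lambda grows
norms = [np.linalg.norm(ridge(lam)) for lam in (1e-6, 1e-3, 1.0, 100.0)]
print(norms)
```

Printing `norms` shows a strictly decreasing sequence: large λ "squashes" the 10 coefficients toward zero, exactly the soft-constraint effect described above.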
Polynomial Coefficients
Notes:
Whether the model is linear or non-linear, you always have some sort of hyperparameter; in this case you need to either choose M or choose λ
A hyperparameter is a setting you give the model before training that controls its level of complexity
Regularization:
Model selection for machine learning
Model selection: Estimation of the optimal value of the regularization parameter. In practice, cross validation is commonly applied for model selection.
Notes:
You pretend the test data is never given to you (you hold onto it)
But you use the training data to play with your model
Before you deliver the model, you want to test it on data that was not used for training.
A lot of ML papers make a lot of mistakes:
First, you cannot train 10 different models on the same training data, evaluate all of them on the test data, and then pick the best one
Analogy: it is like choosing, on an exam, only the questions that give you the most points
You have to divide your training data again into some kind of validation and train
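Carving a validation split out of the training data can be sketched as follows. This is a minimal illustration with random placeholder data and an assumed 80/20 split ratio:

```python
import numpy as np

# Placeholder training data (illustrative assumption)
rng = np.random.default_rng(42)
X_train_full = rng.normal(size=(100, 3))
y_train_full = rng.normal(size=100)

idx = rng.permutation(len(X_train_full))   # shuffle before splitting
n_val = len(idx) // 5                      # hold out 20% for validation
val_idx, tr_idx = idx[:n_val], idx[n_val:]

X_tr, y_tr = X_train_full[tr_idx], y_train_full[tr_idx]
X_val, y_val = X_train_full[val_idx], y_train_full[val_idx]
print(len(X_tr), len(X_val))  # 80 20
```

Shuffling before splitting matters: if the data file is ordered (e.g., by class or by time), a contiguous slice would give an unrepresentative validation set.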
Model selection for deep learning
Notes:
We further divide the training data into training and validation, which will help us measure the performance/accuracy of our model before submitting it and testing it with the test data.
This is because sometimes a test dataset is not available.
For example, the professor may provide training/validation data but won't provide the test labels (so we won't be able to actually test our model ourselves)
In practice people train different models using this kind of selection
Then your final model will depend on this particular split
The validation section may not be representative of the whole dataset, which can be a problem
What we do in this case is make several different train/validation splits and evaluate the model on each of them
This is called cross-validation procedure
If you do 5 folds, this is training 5 different models
But notice that training is very expensive
There is still something that is not perfect and we have not discussed!
Imagine you have 10 different λ values, and therefore 10 different models, and you evaluate each of them on the validation set
Generally, if you have more training data, your model will be able to perform better; but here you are prohibiting your model from accessing the validation data, which is a loss.
What if you train another model that is also able to use the validation section?
Downside: you do not know if it is better, because you have no validation set left to test it on.
This is the tradeoff:
Either we validate with a validation set and test the performance of our model before submission
Or let our model fit more data, because generally more data = better predictions (though this may not necessarily be true).
Model selection
Notes:
You have some hyperparameters
You do some cross-validation procedure
You obtain better hyperparameters
Build a new model and train it using these hyperparameters
Key takeaway: You cannot use the test data to validate your model
You cannot use the test data to tune your hyperparameters
If you do, this is a mistake.
We do not want our model to depend on the test data.
So if we cannot use the test labels, can we use the test inputs (the x values) to somehow tune/test our model?
No, you should not, because then you can overfit your data!
In Machine learning there are two different settings:
Inductive learning
A training process in which we do not use the test set
Transductive learning
In this type, the (unlabeled) test data can be used to train your model
This is useful when the only thing you need is to make good predictions on a given data set and that's it.
Meaning
1. The Three Buckets of Data To build a machine learning model, you must divide your data into three distinct buckets, each with a very specific purpose:
Training Set (The Homework): The data the algorithm actually uses to learn the weights (w) and minimize the training error.
Validation Set (The Practice Exam): A set of data held back from the training process. You use this to test different models (e.g., a linear model vs. a 10th-degree polynomial) or to tune hyperparameters (like λ). Because the model didn't "see" this data during training, its error on this set gives you a great estimate of how the model will perform in the real world.
Test Set (The Final Exam): A set of data locked in a vault. It is only used at the very end of the entire project to report the final, unbiased accuracy to your boss or customer.
2. The Cardinal Rule: Avoid "Data Snooping" Your notes emphasize: "You cannot use the test data to validate your model." If you test 10 different models on the Test Set, and pick the one that gets the highest score, you have committed a machine learning crime called Data Snooping. By making a decision based on the Test Set, you have inadvertently turned it into a Validation Set. Because you picked the model specifically because it did well on that exact data, the score is now optimistically biased. When you deploy the model in the real world, it will perform significantly worse.
3. The Validation Tradeoff and Cross-Validation As you noted, setting aside data for a Validation Set creates a painful tradeoff: data is expensive, and taking data away from the Training Set means your model won't be as smart. Furthermore, what if the 20% of data you randomly selected for validation happens to be really weird or unrepresentative?
To solve this, we use Cross-Validation (often called V-fold or K-fold cross-validation):
Instead of splitting the data once, you chop your training data into several equal blocks (e.g., 5 folds).
You train 5 different models. Each time, you hold out a different block as the validation set, and train on the other 4 blocks.
You average the 5 validation scores together. This gives you a highly reliable estimate of the model's performance without permanently sacrificing a single validation set. As your notes point out, the only downside is that training 5 models is computationally expensive!
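The V-fold procedure above can be sketched end-to-end: for each candidate λ, train on 4 blocks, validate on the held-out block, and average. The data and the ridge solver here are illustrative assumptions, not the course's exact code:

```python
import numpy as np

# Synthetic linear data with 4 features (illustrative assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(0, 0.1, size=50)

def cv_error(lam, k=5):
    folds = np.array_split(rng.permutation(len(X)), k)  # k roughly equal blocks
    errs = []
    for i in range(k):
        val = folds[i]                                   # held-out block
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        # Ridge closed form on the k-1 training blocks
        w = np.linalg.solve(X[tr].T @ X[tr] + lam * np.eye(X.shape[1]),
                            X[tr].T @ y[tr])
        errs.append(np.mean((X[val] @ w - y[val]) ** 2))
    return np.mean(errs)  # average validation error over the k folds

# Pick the lambda with the lowest average cross-validation error
best_lam = min((1e-3, 1e-1, 10.0), key=cv_error)
print(best_lam)
```

Note the cost your notes flag: evaluating one λ already trains k models, so a grid of 10 λ values with 5 folds means 50 training runs.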
4. Resolving your "Retraining Dilemma" In your notes for Deep Learning model selection, you asked whether you could train another model that also uses the validation section, with the downside that you would not know if it is better, because there is no validation set left to test it on.
Here is the exact textbook solution to this dilemma: You should retrain on the validation data! The purpose of the validation phase is purely to pick the best hyperparameters (e.g., finding out which λ is the best choice). Once you have proven which configuration is optimal, you no longer need the validation set. You take that λ, combine your Training Set and your Validation Set back into one giant dataset, and retrain the model from scratch on all of it. Because machine learning models get better when they have more data (the learning curve), this final model is expected to be even better than the one you validated. You then use your locked-away Test Set to verify its final performance.
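The retraining step can be sketched as follows. The data is placeholder random data, and `best_lam` stands in for the value cross-validation would have picked (both are assumptions for illustration):

```python
import numpy as np

# Placeholder train and validation buckets (illustrative assumption)
rng = np.random.default_rng(3)
X_tr, y_tr = rng.normal(size=(80, 4)), rng.normal(size=80)
X_val, y_val = rng.normal(size=(20, 4)), rng.normal(size=20)

best_lam = 0.1  # assumed: chosen via cross-validation in an earlier step

# Merge the two buckets back into one dataset and retrain from scratch
X_all = np.vstack([X_tr, X_val])
y_all = np.concatenate([y_tr, y_val])
w_final = np.linalg.solve(X_all.T @ X_all + best_lam * np.eye(4),
                          X_all.T @ y_all)
print(w_final.shape)  # (4,)
```

Only the hyperparameter λ carries over from the validation phase; the weights themselves are re-learned on all 100 points, and the Test Set is still untouched for the final report.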
5. Inductive vs. Transductive Learning & Snooping with the Test Inputs Can you use the Test Set's x values (the inputs, without the labels y) to help train the model?
In standard Inductive Learning (building a general model to predict any future data), the answer is generally NO. If you look at the test x values to calculate the mean and variance to normalize your training data, you are data snooping. The test data has leaked into your training process, and your results will be artificially high.
In Transductive Learning, the answer is YES. Transduction is a special, rare case where you only care about predicting the exact specific test points you have been given, and you will never use the model again for anything else. In this unique scenario, algorithms can safely use the unlabeled test data to understand the shape of the data distribution.