01 - Linear Regression
Class: CSCE-421
Notes:
Learning from Data
- Training a program by showing examples with desired outputs.
- Tweak the parameter values if the output is wrong
You essentially need a training data set labeled with the correct answers that the model should output (+1/-1)
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260113101815.png)
- If we train the model on this data set of examples with known answers, it will be able to make more accurate predictions on new inputs
- We want this model to be accurate
    - If this model makes lots of mistakes on training data it won't be able to predict correctly on new data
- If our data contains any mistakes/noise, for example an image with a wrong label, this might mislead our model, but we still need the model to be reasonably accurate
- This is the idea of overfitting
    - Your model can memorize all the training data and yet make more mistakes on new data
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260113102208.png)
- The test process is more intuitive than the training process: during training you first have to run a test to see what the prediction is, then adjust the model to correct its mistakes
Paradigms of Machine Learning
- Supervised Learning: x → y
- (given x, we try to predict y)
- Classification: y is discrete (binary or multi-class)
- Regression: y is continuous
- This is by far the most important machine learning technique (it is the underlying technique of ChatGPT, which tries to predict your next word/token)
- This is called next token prediction
- Unsupervised Learning: x
- Clustering, density estimation, etc.
- The model is only given x, it analyzes x and makes some decision based on it
- Reinforcement Learning
- Sequential decision making
- Useful in training LLMs
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260113102606.png)
Three Learning Problems
Given x, we try to predict a y.
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260113102649.png)
- The linear model is perhaps the most fundamental model.
- The linear model is the first model to try.
- Logistic regression deals with classification problems - wrong name!
- In logistic regression, training data have discrete y, but it makes probabilistic predictions
- Logistic regression is a linear model, though this may not be obvious
- Logistic regression is a model that solves classification; it is not a continuous regression
- It has nothing to do with regression, it will just give you a probability
Linear Models
The Math
"The simplest model, but by far the most important"
- Input vector
| age | 32 years |
|---|---|
| salary | 40,000 |
| debt | 26,000 |
| years in job | 1 year |
| years at home | 3 years |
| . . . | . . . |
- You are assigning a weight to each of these numbers
- Give importance weights to the different inputs and compute a "Credit Score": $\sum_{i=1}^{d} w_i x_i$
- A linear model is built from a linear combination of the inputs
- Approve credit if the "Credit Score" is acceptable:
    - Approve credit if $\sum_{i=1}^{d} w_i x_i > \text{threshold}$
    - Deny credit if $\sum_{i=1}^{d} w_i x_i < \text{threshold}$
- Can be written formally as:
  $$h(x) = \mathrm{sign}\left(\left(\sum_{i=1}^{d} w_i x_i\right) - \text{threshold}\right)$$
  ![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260114220419.png)
- We do not want to write the two cases every time; this formula puts them together
- If you take the sign of (score $-$ threshold), you get the prediction: Prediction = sign( some function )
- Since the summation is an inner product we could write something like $h(x) = \mathrm{sign}(w^T x)$, with the threshold absorbed into the weights
How to choose the importance weights $w_i$?
- Input $x_i$ is important ⇒ large weight $|w_i|$
- Input $x_i$ beneficial for credit ⇒ positive weight $w_i > 0$
- Input $x_i$ detrimental for credit ⇒ negative weight $w_i < 0$
- The "bias weight" $w_0$ corresponds to the threshold. (How?)
The Meaning
- The Core Concept: Weighted Inputs
    - A linear model essentially acts like a scorecard. It makes decisions by looking at a list of input variables (like age, salary, or debt) and assigning an "importance weight" to each one.
    - The Input ($x_i$): These are the features of the data you are analyzing. For example, in a credit application, $x_1$ might be salary and $x_2$ might be years in a job.
    - The Weights ($w_i$): The model learns which inputs are important.
        - If an input is important, it gets a large weight.
        - If an input is beneficial (like high salary), it gets a positive weight.
        - If an input is detrimental (like high debt), it gets a negative weight.
    - The model calculates a "score" by multiplying each input by its weight and summing them up ($\sum_{i=1}^{d} w_i x_i$).
- Making a Decision: Thresholds and Bias
    - Once the model calculates the total weighted score, it needs to make a decision (e.g., Approve or Deny credit).
    - The Threshold: The model compares the score to a specific threshold. If the score is higher than the threshold, the credit is approved; if lower, it is denied.
    - The Bias ($w_0$): To make the math cleaner, the slides show that the threshold is moved to the other side of the equation and treated as a "bias weight" ($w_0$). Instead of checking if Score > Threshold, the model checks if Score + Bias > 0.
    ![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260114220419.png)
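To make the bias trick concrete, here is a minimal numpy sketch (the weights, features, and threshold below are invented for illustration, not taken from the slides) showing that "Score > Threshold" and "$w_0 + w^T x > 0$ with $w_0 = -\text{threshold}$" give the same decision:

```python
# Minimal sketch of the bias trick with made-up numbers (not from the slides).
import numpy as np

x = np.array([32, 40_000, 26_000, 1, 3], dtype=float)  # hypothetical applicant features
w = np.array([0.1, 0.002, -0.003, 5.0, 2.0])            # hypothetical importance weights
threshold = 20.0                                         # hypothetical approval threshold

score = w @ x                                 # weighted "credit score"
decision_threshold = score > threshold        # check Score > Threshold

w0 = -threshold                               # bias weight absorbs the threshold
x_aug = np.concatenate(([1.0], x))            # dummy input x0 = 1
w_aug = np.concatenate(([w0], w))             # augmented weight vector
decision_bias = (w_aug @ x_aug) > 0           # check Score + Bias > 0

print(decision_threshold, decision_bias)      # both print the same decision
```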
The Perceptron Hypothesis Set
The Math
We have defined a hypothesis set $\mathcal{H} = \{h\}$ with
$$h(x) = \mathrm{sign}(w^T x)$$
where $x \in \{1\} \times \mathbb{R}^d$ and $w \in \mathbb{R}^{d+1}$.
This hypothesis set is called the perceptron or linear separator.
The Meaning
- The "Hypothesis Set" (
) - A "hypothesis" (
) is just one specific guess at a formula that might distinguish between "Yes" and "No" outputs. The Hypothesis Set is the collection of all possible linear formulas (lines or planes) that the model could potentially use. The learning algorithm's job is to search through this set to find the single "best" one that matches the data.
- A "hypothesis" (
- The Formula:
- The slide condenses the decision-making process into a compact linear algebra formula. Here is how to read it:
(The Score): This represents the "signal" or score. It is calculated by multiplying each input (x) by its importance weight (w) and summing them up. - sign(…) (The Decision): This function looks at the total score.
- If the score is positive (>0), the output is +1 (e.g., Approve Credit).
- If the score is negative (<0), the output is −1 (e.g., Deny Credit).
- The slide condenses the decision-making process into a compact linear algebra formula. Here is how to read it:
- What are $x$, $w$, and $d$?
    - What is $d$? $d$ represents the number of features in your raw data. For example, if you are predicting credit limits, your features might be "Age," "Salary," and "Years in Job." In this case, d=3.
    - Why is $x \in \{1\} \times \mathbb{R}^d$?
        - In the slides, the raw input vector is defined as $x = (x_1, \ldots, x_d)$. This contains just the features from the data set. However, to make the math work for the linear model, we have to modify this vector slightly in the next step.
    - Why is $w \in \mathbb{R}^{d+1}$?
        - The linear model uses a threshold (or bias) to make decisions. The slides explain that the threshold is moved into the weight vector as a "bias weight" labeled $w_0$.
        - To account for this new $w_0$, we add a "dummy" input $x_0 = 1$ to the input vector $x$.
        - The weight vector $w$ must match the size of this new "augmented" input vector so they can be multiplied.
        - Therefore, $w$ includes $w_0$ plus the $d$ weights for the features, making it size $d+1$.
- The "Augmented" Vectors ($x$ and $w$)
    - The slide shows vectors $x$ and $w$ that look slightly different from the previous slide. This is a mathematical trick to handle the threshold:
        - The Bias ($w_0$): Instead of checking if a score is greater than a threshold (e.g., Score > 50), the algebra is easier if we move the threshold to the left side (e.g., Score - 50 > 0). This -50 becomes a new weight called the bias ($w_0$).
        - The Dummy Input ($x_0 = 1$): To include the bias in the standard multiplication formula ($w^T x$), the slide adds a fixed input of 1 at the top of the input vector $x$.
        - Now, $w_0$ is simply added to the total score automatically.
- The "Linear Separator"
- The slide refers to this hypothesis set as the "linear separator".
- Geometrically, this formula represents a straight line (in 2D) or a flat plane (in higher dimensions) that cuts the space in half.
- Everything on one side of the line is classified as +1, and everything on the other side is -1. The model "learns" by wiggling this line around until it separates the data points correctly.
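As a quick geometric illustration (a sketch assuming numpy, with invented weights and points), $h(x) = \mathrm{sign}(w^T x)$ labels each point according to which side of the line $w_0 + w_1 x_1 + w_2 x_2 = 0$ it falls on:

```python
# Sketch of a 2-D linear separator with invented weights and points (not from the slides).
import numpy as np

w = np.array([-1.0, 2.0, 1.0])        # hypothetical [w0, w1, w2]; boundary is -1 + 2*x1 + x2 = 0
points = np.array([[0.0, 0.0],
                   [1.0, 1.0],
                   [-2.0, 0.5],
                   [0.5, -0.2]])      # a few invented 2-D inputs

X = np.column_stack([np.ones(len(points)), points])  # prepend the dummy coordinate x0 = 1
labels = np.sign(X @ w)               # +1 on one side of the line, -1 on the other

for p, label in zip(points, labels):
    print(p, "->", int(label))
```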
The Linear Signal
The Math
- The linear signal: $s = w^T x$
    - linear in $x$: gives the line/hyperplane separator
    - linear in $w$: makes the algorithms work
    - $x$ is the augmented vector: $x = (1, x_1, \ldots, x_d)$
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260115094635.png)
- $h(x) = \mathrm{sign}(s)$, $h(x) = s$, and $h(x) = \theta(s)$ refer to classification, regression, and logistic regression respectively
The Meaning
- The Signal Formula: $s = w^T x$
    - The "signal" ($s$) is the raw score the model calculates for a specific input.
        - $x$ (Input): The data (e.g., salary, debt).
        - $w$ (Weights): The parameters the model learns (importance of salary vs. debt).
        - $s$ (Signal): The result of multiplying inputs by weights and adding them up (the dot product).
    - Think of $s$ as a credit score calculated by the bank. Before the bank decides what to do with you, they first simply calculate this number.
- One Signal, Three Decisions
    - The slide shows that once this signal $s$ is calculated, it can be passed through three different "functions" to produce three different types of answers (outputs):
        - Classification (The Step Function):
- The Action: The model looks at the signal and asks, "Is it positive or negative?"
- The Math: h(x) = sign(s).
- The Output: A simple Yes/No (+1 or −1).
- Example: Approve or Deny the credit application.
- Linear Regression (The Identity Function):
- The Action: The model uses the signal exactly as it is, without changing it.
- The Math: h(x) = s.
- The Output: A Real Number (R).
- Example: Determine the specific amount of credit line to give (e.g., $5,000).
- Logistic Regression (The S-Curve):
- The Action: The model squashes the signal into a range between 0 and 1 using a curve (often denoted as θ).
- The Math: h(x) = θ(s).
- The Output: A Probability.
- Example: Calculate the probability that the customer will default on the loan.
- Why is it called "Linear"?
    - Linear in $x$ (Geometry): When you plot this signal, it creates a straight line (or a flat plane) that separates the data. This acts as the boundary between "Yes" and "No".
    - Linear in $w$ (The Math): The relationship between the weights is simple. This simplicity is crucial because it allows the computer to use efficient algorithms (like minimizing squared errors) to find the best weights easily.
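The three output functions can be sketched in a few lines of numpy (the weights and input below are invented; I assume $\theta$ is the usual logistic function, which the notes only name abstractly):

```python
# Sketch of one signal s = w^T x feeding three different output functions.
import numpy as np

def sigmoid(s):
    """Logistic function theta(s) = 1 / (1 + e^(-s)); squashes the signal into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-s))

x = np.array([1.0, 0.8, -1.3])      # augmented input (x0 = 1 plus two invented features)
w = np.array([0.5, 2.0, 1.0])       # invented weight vector

s = w @ x                           # the linear signal s = w^T x

print("classification   h(x) = sign(s) ->", np.sign(s))   # +1 / -1
print("linear regression h(x) = s      ->", s)             # a real number
print("logistic regression h(x) = θ(s) ->", sigmoid(s))    # a probability in (0, 1)
```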
Linear Regression
| age | 32 years |
|---|---|
| gender | male |
| salary | 40,000 |
| debt | 26,000 |
| years in job | 1 year |
| years at home | 3 years |
| . . . | . . . |
- Classification: Approve/Deny
- Regression: Credit Line (dollar amount)
- regression ≡ real-valued output ($y \in \mathbb{R}$), so the hypothesis $h(x) = w^T x$ is used directly as the prediction
Least Squares Linear Regression
The Math
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260115100601.png)
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260115100620.png)
The Meaning
- The Goal: Fitting a Line
- The objective of Linear Regression is to find a line (or a hyperplane in higher dimensions) that passes through your data points as closely as possible.
- The Hypothesis: The model assumes the relationship between the input and output is linear. The formula for the prediction, $h(x)$, is a weighted sum of the inputs: $h(x) = w^T x$. Here, $w^T x$ is the dot product of the weights and the input features.
- "Least Squares": Measuring the Error
- To find the "best" line, we need a way to measure how bad a specific line is. We do this by calculating the In-sample Error (
). - The Residual: For every data point, the model looks at the difference between the prediction (
) and the actual target value ( ). - Squaring: It squares this difference. This ensures that the error is always positive and penalizes large errors more heavily than small ones.
- Averaging: The total error is the average of these squared differences over all
data points: This method is called "Least Squares" because the goal is to find the weights that result in the least sum of squared errors.
- The Residual: For every data point, the model looks at the difference between the prediction (
- To find the "best" line, we need a way to measure how bad a specific line is. We do this by calculating the In-sample Error (
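As a sketch of this definition (assuming numpy and a tiny invented data set), the in-sample error is just the average squared residual:

```python
# Sketch: E_in(w) = (1/N) * sum_n (w^T x_n - y_n)^2 on toy data (not from the slides).
import numpy as np

X = np.array([[1.0, 2.0],
              [1.0, 3.0],
              [1.0, 5.0]])          # each row is an augmented input (x0 = 1, one feature)
y = np.array([4.1, 6.2, 9.8])       # toy target values
w = np.array([0.0, 2.0])            # a candidate weight vector

residuals = X @ w - y               # h(x_n) - y_n for every point
E_in = np.mean(residuals ** 2)      # average of the squared residuals
print(E_in)
```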
Using Matrices for Linear Regression
The Math
$N$ = number of samples in the training data
- The $(d + 1)$ in the data matrix refers to the input vectors, which have dimension $d + 1$ (the $d$ features plus the dummy coordinate $x_0 = 1$).
- We are trying to figure out what term will get us to the weight vector $w$.
- We want to find $w$'s that get the predictions closer to the targets (we want each $w^T x_n$ to be closer to the actual target value $y_n$).
    - This is the training process.
    - $y$ is given, $X$ is given; the only thing we do not know is $w$.
![](/CSCE-421/Visual%20Aids/Pasted%20image%2020260115095555.png)
- $E_{in}(w)$ is the error on your training data and is a function of $w$. What we want is to minimize this error value.
    - We are doing both at the same time: getting closer to the target vector and minimizing the error.
- Note the "loss" is measured as a sum of squared differences.
- Now we have only one job to do: find the best $w$. As soon as we solve this, we are done with linear regression.
- The summation is an element-wise difference, but we do not want to write it like that every time.
    - The squared norm $\|v\|^2$ of a vector $v$ is just $\sum_i v_i^2$ (the square of each component of the vector, summed together).
- Now we write $E_{in}$ in terms of $X$ and $y$:
  $$E_{in}(w) = \frac{1}{N}\|Xw - y\|^2$$
    - In this way, the dependency of $E_{in}$ on $w$ is now explicit.
- The quantity $Xw - y$ is a vector, which we can rewrite as an inner product with itself:
  $$\|Xw - y\|^2 = (Xw - y)^T (Xw - y)$$
- Now we move the $^T$ (transpose) inside the first term:
    - $= (w^T X^T - y^T)(Xw - y)$
- Now we just need to expand this multiplication:
    - $= w^T X^T X w - w^T X^T y - y^T X w + y^T y$
- Note that taking the transpose of a product switches the terms: $(w^T X^T y)^T = y^T X w$.
    - The reason why you can combine them is that each of these terms simplifies to a single number; otherwise we would not be able to combine two different vectors.
    - So we can just write $-w^T X^T y - y^T X w = -2 w^T X^T y$.
- $$E_{in}(w) = \frac{1}{N}\left(w^T X^T X w - 2 w^T X^T y + y^T y\right)$$
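A quick numeric check of this expansion (a sketch with random toy data, assuming numpy) confirms that the matrix form equals the expanded form and that the two cross terms really are the same single number:

```python
# Sketch: verify (1/N)||Xw - y||^2 == (1/N)(w^T X^T X w - 2 w^T X^T y + y^T y) on toy data.
import numpy as np

rng = np.random.default_rng(0)
N, d = 6, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, d))])  # toy data matrix with x0 = 1
y = rng.normal(size=N)                                       # toy targets
w = rng.normal(size=d + 1)                                   # arbitrary weight vector

norm_form = np.sum((X @ w - y) ** 2) / N
expanded = (w @ X.T @ X @ w - 2 * w @ X.T @ y + y @ y) / N

print(np.isclose(norm_form, expanded))      # True: the expansion matches
print(np.isclose(w @ X.T @ y, y @ X @ w))   # True: the two cross terms are the same scalar
```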
The Meaning
Calculating the error for one data point at a time is inefficient. We can stack all the data together to calculate the best weights in a single mathematical step.
- A. The Data Matrix ($X$): Instead of handling $N$ different input vectors separately, we stack them on top of each other to form a large matrix $X$.
    - Every row is a different person (data point).
    - Every column is a different feature (including the first column, which is all 1s for the bias $w_0$).
    - The dimensions of this matrix are $N \times (d + 1)$.
- B. The Target Vector ($y$): We stack all the correct answers (the credit limits given by experts) into one long column vector.
- C. The Predictions ($Xw$): In one shot, we can predict the values for every single person in the database by multiplying the Data Matrix ($X$) by the weight vector ($w$).
- D. The In-Sample Error ($E_{in}$): We want to verify how close our predictions ($Xw$) are to the correct answers ($y$). We subtract them to get a "difference vector," take the length (norm) of that vector, and square it: $E_{in}(w) = \frac{1}{N}\|Xw - y\|^2$.
    - Expanded, this looks like: $$E_{in}(w) = \frac{1}{N} (w^T X^T X w - 2w^T X^T y + y^T y)$$
Linear Regression Solution
The Math
- Vector Calculus:
    - To minimize $E_{in}(w)$, we need two matrix-derivative rules: $\nabla_w (w^T a) = a$ and $\nabla_w (w^T A w) = 2Aw$ (for symmetric $A$).
- How do we take the derivative of the second term, $-\frac{2}{N} w^T X^T y$, with respect to $w$?
    - Think about it as $w^T a$.
    - Which you can write with $a = X^T y$, which is your constant vector. Note that $X^T y$ does not depend on $w$.
- We want to take the partial derivatives of each term. The respective partial derivative of each term comes out to be just a vector in terms of $w$ and the data:
    - Derivative of $\frac{1}{N} w^T X^T X w$: $\frac{2}{N} X^T X w$
    - Derivative of $-\frac{2}{N} w^T X^T y$: $-\frac{2}{N} X^T y$
    - Derivative of $\frac{1}{N} y^T y$: $0$
- We have taken the derivatives of each term and are able to write:
  $$\nabla E_{in}(w) = \frac{2}{N}\left(X^T X w - X^T y\right) = 0 \;\Rightarrow\; w = (X^T X)^{-1} X^T y \quad \text{when } X^T X \text{ is invertible}$$
- Note: eventually each of the terms in $E_{in}$ is simply a number, not a more complicated vector.
- Think about degree 1 and degree 2 polynomials: this is why linear regression can be reduced to a formula. Differentiating the degree-2 error turns it into a degree-1 (linear) equation in $w$, which can be solved exactly.
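One way to sanity-check the gradient formula $\nabla E_{in}(w) = \frac{2}{N}(X^T X w - X^T y)$ is to compare it against a finite-difference estimate on toy data (a sketch assuming numpy; the data are random and purely illustrative):

```python
# Sketch: compare the analytic gradient of E_in with a numerical estimate on toy data.
import numpy as np

rng = np.random.default_rng(1)
N, d = 8, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, d))])  # toy data matrix with x0 = 1
y = rng.normal(size=N)                                       # toy targets
w = rng.normal(size=d + 1)                                   # arbitrary weight vector

def E_in(w):
    return np.sum((X @ w - y) ** 2) / N

analytic = (2.0 / N) * (X.T @ X @ w - X.T @ y)               # gradient from the formula

eps = 1e-6
numeric = np.array([
    (E_in(w + eps * e) - E_in(w - eps * e)) / (2 * eps)      # central differences
    for e in np.eye(d + 1)
])

print(np.allclose(analytic, numeric, atol=1e-5))             # True
```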
The Meaning
- The Strategy: Calculus on Matrices
- To find the weights that produce the smallest error, the algorithm treats the error formula like a curve (or a bowl). The goal is to find the "bottom" of this bowl where the slope is zero. Since we are dealing with vectors and matrices instead of simple numbers, we use Vector Calculus.
- The slide starts with the expanded matrix error formula derived in the previous section.
- The Tools: Matrix Derivatives
    - To find the slope (gradient) of the error with respect to the weights $w$, the slide introduces two specific rules for differentiating matrices. These are similar to standard calculus rules (like the derivative of $x^2$ being $2x$):
        - Linear Rule: The derivative of a linear term $w^T a$ is just $a$.
        - Quadratic Rule: The derivative of a quadratic term $w^T A w$ involves the matrix: $2Aw$.
- Calculating the Gradient
    - The slide applies these rules to the error formula. It identifies the parts of the error formula that match the rules:
        - $A$ corresponds to $X^T X$ (which is the quadratic part).
        - $a$ corresponds to $X^T y$ (which is the linear part).
    - By applying the rules, the gradient (slope) of the error is calculated as:
      $$\nabla E_{in}(w) = \frac{2}{N}\left(X^T X w - X^T y\right)$$
- The Solution: Normal Equations
    - To minimize the error, we set the gradient to zero. This finds the point where the error stops decreasing and starts increasing (the bottom of the bowl).
        - Set Gradient to 0: $\frac{2}{N}\left(X^T X w - X^T y\right) = 0$
        - The Normal Equations: By removing the $\frac{2}{N}$ and moving the term to the other side, we get a famous equation in statistics: $$X^T X w = X^T y$$
- The Final Answer ($w$)
    - Finally, to isolate $w$ and find the optimal weights, we multiply both sides by the inverse of $X^T X$.
    - The Analytic Solution: $$w = (X^T X)^{-1} X^T y$$
    - This formula tells the computer exactly how to find the best-fitting line in a single step, provided that the matrix $X^T X$ can be inverted (which is usually true if you have enough data points).
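A short sketch (assuming numpy, with random toy data) of the normal equations in practice; solving the linear system $X^T X w = X^T y$ directly avoids forming the inverse explicitly, which is the numerically safer way to apply the same formula:

```python
# Sketch: solve the normal equations X^T X w = X^T y on toy data and check the gradient vanishes.
import numpy as np

rng = np.random.default_rng(2)
N, d = 10, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, d))])  # toy data matrix with x0 = 1
y = rng.normal(size=N)                                       # toy targets

w_star = np.linalg.solve(X.T @ X, X.T @ y)                   # solve the normal equations

# At the minimum, the gradient (2/N)(X^T X w - X^T y) should be zero.
gradient = (2.0 / N) * (X.T @ X @ w_star - X.T @ y)
print(np.allclose(gradient, np.zeros(d + 1)))                # True
```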
Linear Regression Algorithm
The Math
1. Construct the matrix $X$ and the vector $y$ from the data set.
2. Compute the pseudo-inverse $X^\dagger = (X^T X)^{-1} X^T$.
3. Return $w = X^\dagger y$.
The Meaning
Unlike the Perceptron algorithm, which learns by taking small steps and correcting mistakes one by one (iterative), the Linear Regression Algorithm finds the perfect answer in a single mathematical step. It uses a "closed-form" or "analytic" solution, meaning you just plug the numbers into a formula and get the result immediately.
- Construct the Matrices
    - First, the computer organizes all the training data into standard linear algebra structures.
    - The Inputs: You take all your input vectors $x_n$ and stack them to create the Data Matrix $X$.
        - Crucial Detail: You must ensure every input vector has the "bias coordinate" $x_0 = 1$ added to it before stacking them.
    - The Outputs: You take all the correct answers (targets) $y_n$ and stack them into a Target Vector $y$.
- Compute the Pseudo-Inverse ($X^\dagger$)
    - To solve for the weights, we ideally want to divide by the matrix $X$. However, because $X$ is rarely a square matrix (you usually have more data points $N$ than features $d + 1$), it cannot be inverted in the traditional sense.
    - Instead, the algorithm calculates the Pseudo-Inverse, denoted by the symbol $\dagger$ (dagger).
    - The slide provides the specific formula for this calculation, assuming the matrix $X^T X$ is invertible (which is usually true if you have enough data): $$X^\dagger = (X^T X)^{-1} X^T$$
- Summary
- The algorithm can be summarized in one line: You pack your data into a big matrix $X$ and a vector $y$, and then you run one formula to get the best weights: $$w = X^\dagger y = (X^T X)^{-1} X^T y$$
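Here is a minimal end-to-end sketch of the three-step algorithm (assuming numpy; the function name `linear_regression` and the toy data are mine, not from the slides):

```python
# Sketch of the linear regression algorithm: build X and y, compute the pseudo-inverse,
# return w = X† y. Toy data and names are invented for illustration.
import numpy as np

def linear_regression(inputs, targets):
    """Return the least-squares weights for raw (un-augmented) inputs."""
    N = len(inputs)
    X = np.column_stack([np.ones(N), inputs])   # step 1: add the bias coordinate x0 = 1
    y = np.asarray(targets, dtype=float)
    X_dagger = np.linalg.inv(X.T @ X) @ X.T     # step 2: pseudo-inverse, assuming X^T X invertible
    return X_dagger @ y                         # step 3: w = X† y

# Toy usage: targets generated from a known line plus a little noise.
rng = np.random.default_rng(3)
features = rng.uniform(0, 10, size=(50, 1))
targets = 3.0 + 2.0 * features[:, 0] + rng.normal(scale=0.1, size=50)

w = linear_regression(features, targets)
print(w)   # approximately [3.0, 2.0]

# Cross-check against numpy's built-in Moore-Penrose pseudo-inverse.
X_full = np.column_stack([np.ones(50), features])
print(np.allclose(w, np.linalg.pinv(X_full) @ targets))   # True
```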