01 - Linear Regression

Class: CSCE-421


Notes:

Learning from Data

  1. Training a program by showing examples with desired outputs.
  2. Tweak the parameter values if the output is wrong

You essentially need a training data set paired with the correct answers (+1/−1) that the model should output

Pasted image 20260113101815.png|400

Pasted image 20260113102208.png|500

Paradigms of Machine Learning

  1. Supervised Learning: x → y
    • (given x, we try to predict y)
    • Classification: y is discrete (binary or multi-class)
    • Regression: y is continuous
    • This is by far the most important machine learning paradigm (it is the underlying technique of ChatGPT, which tries to predict what your next word/token is)
      • This is called next token prediction
  2. Unsupervised Learning: x
    • Clustering, density estimation, etc.
    • The model is only given x, it analyzes x and makes some decision based on it
  3. Reinforcement Learning
    • Sequential decision making
    • Useful in training LLMs

Pasted image 20260113102606.png|500

Three Learning Problems

Given x, we try to predict a y.
Pasted image 20260113102649.png|500

Linear Models

The Math

"The simplest model, but by far the most important"

  1. Input vector $x = [x_1, \ldots, x_d]^T$

     age            32 years
     salary         40,000
     debt           26,000
     years in job   1 year
     years at home  3 years
     ...
  2. Give importance weights to the different inputs and compute a "Credit Score"

    $$\text{"Credit Score"} = \sum_{i=1}^{d} w_i x_i$$

    • A linear model is just a linear combination of the inputs
  3. Approve credit if the "Credit Score" is acceptable

    • Approve credit if $\sum_{i=1}^{d} w_i x_i > \text{threshold}$

    • Deny credit if $\sum_{i=1}^{d} w_i x_i < \text{threshold}$

    • Can be written formally as:

      Pasted image 20260114220419.png|300

      • We do not want to write the two cases every time; this formula combines them
      • Subtracting the threshold, the sign of $\sum_{i=1}^{d} w_i x_i - \text{threshold}$ tells you the decision
      • Prediction = sign(some function of the inputs)
        • Since the summation is an inner product, we can write it as $h(x) = \text{sign}(w^T x)$
  4. How to choose the importance weights $w_i$?

    • Input $x_i$ is important ⇒ large weight $|w_i|$
    • Input $x_i$ beneficial for credit ⇒ positive weight $w_i > 0$
    • Input $x_i$ detrimental for credit ⇒ negative weight $w_i < 0$
  5. The "bias weight" $w_0$ corresponds to the threshold. (How? See the sketch below.)


The Meaning

  1. The Core Concept: Weighted Inputs

    • A linear model essentially acts like a scorecard. It makes decisions by looking at a list of input variables (like age, salary, or debt) and assigning an "importance weight" to each one.
    • The Input (x): These are the features of the data you are analyzing. For example, in a credit application, $x_1$ might be salary and $x_2$ might be years in a job.
    • The Weights (w): The model learns which inputs are important.
      • If an input is important, it gets a large weight.
      • If an input is beneficial (like high salary), it gets a positive weight.
      • If an input is detrimental (like high debt), it gets a negative weight.
    • The model calculates a "score" by multiplying each input by its weight and summing them up ($\sum_i w_i x_i$).
  2. Making a Decision: Thresholds and Bias

    • Once the model calculates the total weighted score, it needs to make a decision (e.g., Approve or Deny credit).

    • The Threshold: The model compares the score to a specific threshold. If the score is higher than the threshold, the credit is approved; if lower, it is denied.

    • The Bias ($w_0$): To make the math cleaner, the slides show that the threshold is moved to the other side of the equation and treated as a "bias weight" ($w_0$). Instead of checking if Score > Threshold, the model checks if Score + Bias > 0 (a quick numeric check follows below).

      Pasted image 20260114220419.png|300
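
A quick numeric sanity check of the bias trick (made-up numbers): comparing the score to a threshold gives exactly the same decision as adding a bias $w_0 = -\text{threshold}$ and comparing to zero.

```python
score = 590.0        # hypothetical weighted score (sum of w_i * x_i)
threshold = 500.0    # hypothetical approval threshold
bias = -threshold    # the bias weight w0

# The two checks are equivalent and give the same decision
print(score > threshold)   # True  (Score > Threshold)
print(score + bias > 0)    # True  (Score + Bias > 0)
```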

The Perceptron Hypothesis Set

The Math

We have defined a Hypothesis set H

$$H = \{\, h(x) = \text{sign}(w^T x) \,\}$$

where

$$w = \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_d \end{bmatrix} \in \mathbb{R}^{d+1}, \qquad x = \begin{bmatrix} 1 \\ x_1 \\ \vdots \\ x_d \end{bmatrix} \in \{1\} \times \mathbb{R}^d$$

This hypothesis set is called the perceptron or linear separator
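
A minimal sketch of one hypothesis from this set, assuming numpy and made-up weights; the raw input is augmented with $x_0 = 1$ so the bias weight rides along inside $w$:

```python
import numpy as np

def perceptron_h(w, x):
    """One hypothesis from H: h(x) = sign(w^T x), with x augmented by x0 = 1.

    w has length d+1 (bias weight w0 first); x has length d (raw features).
    """
    x_aug = np.concatenate(([1.0], x))   # prepend the dummy coordinate x0 = 1
    return np.sign(w @ x_aug)            # +1 or -1

# Made-up example with d = 2 features
w = np.array([-1.0, 2.0, 0.5])                  # [w0, w1, w2]
print(perceptron_h(w, np.array([1.0, 0.2])))    # +1 side of the separator
print(perceptron_h(w, np.array([0.1, 0.4])))    # -1 side
```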


The Meaning

  1. The "Hypothesis Set" (H)
    • A "hypothesis" (h) is just one specific guess at a formula that might distinguish between "Yes" and "No" outputs. The Hypothesis Set is the collection of all possible linear formulas (lines or planes) that the model could potentially use. The learning algorithm's job is to search through this set to find the single "best" one that matches the data.
  2. The Formula: $h(x) = \text{sign}(w^T x)$
    • The slide condenses the decision-making process into a compact linear algebra formula. Here is how to read it:
      • $w^T x$ (The Score): This represents the "signal" or score. It is calculated by multiplying each input (x) by its importance weight (w) and summing them up.
      • sign(…) (The Decision): This function looks at the total score.
        • If the score is positive (> 0), the output is +1 (e.g., Approve Credit).
        • If the score is negative (< 0), the output is −1 (e.g., Deny Credit).
  3. What are $d$, $x$, and $w$?
    • What is $d$?
      • $d$ represents the number of features in your raw data. For example, if you are predicting credit limits, your features might be "Age," "Salary," and "Years in Job." In this case, $d = 3$.
    • Why is $x \in \mathbb{R}^d$?
      • In the slides, the raw input vector is defined as $x = [x_1, \ldots, x_d]^T$. This contains just the features from the data set. However, to make the math work for the linear model, we have to modify this vector slightly in the next step.
    • Why is $w \in \mathbb{R}^{d+1}$?
      • The linear model uses a threshold (or bias) to make decisions. The slides explain that the threshold is moved into the weight vector as a "bias weight" labeled $w_0$.
        • To account for this new $w_0$, we add a "dummy" input $x_0 = 1$ to the input vector $x$.
        • The weight vector $w$ must match the size of this new "augmented" input vector so they can be multiplied.
        • Therefore, $w$ includes $w_0$ plus the $d$ weights for the features, making it size $d+1$.
  4. The "Augmented" Vectors (x0​ and w0​)
    • The slide shows vectors x and w that look slightly different from the previous slide. This is a mathematical trick to handle the threshold:
    • The Bias (w0​): Instead of checking if a score is greater than a threshold (e.g., Score > 50), the algebra is easier if we move the threshold to the left side (e.g., Score - 50 > 0). This -50 becomes a new weight called the bias (w0​).
    • The Dummy Input (x0=1): To include the bias in the standard multiplication formula (wTx), the slide adds a fixed input of 1 at the top of the input vector x.
      • Now, w0×1 is simply added to the total score automatically.
  5. The "Linear Separator"
    • The slide refers to this hypothesis set as the "linear separator".
    • Geometrically, this formula represents a straight line (in 2D) or a flat plane (in higher dimensions) that cuts the space in half.
    • Everything on one side of the line is classified as +1, and everything on the other side is -1. The model "learns" by wiggling this line around until it separates the data points correctly.
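
To see the separator geometry, here is a small sketch (made-up line, not from the slides) that classifies a few 2D points by which side of the boundary $w_0 + w_1 x_1 + w_2 x_2 = 0$ they fall on:

```python
import numpy as np

# Made-up separator in 2D: the boundary is the line x2 = x1 (where w^T x = 0)
w = np.array([0.0, 1.0, -1.0])   # [w0, w1, w2]

points = np.array([
    [2.0, 1.0],   # below the line x2 = x1  -> +1
    [1.0, 3.0],   # above the line          -> -1
    [0.5, 0.1],   # below the line          -> +1
])

# Augment each point with x0 = 1, then classify with sign(w^T x)
X_aug = np.hstack([np.ones((len(points), 1)), points])
print(np.sign(X_aug @ w))   # [ 1. -1.  1.]
```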

The Linear Signal

The Math

$$s = w^T x$$

Pasted image 20260115094635.png|400


The Meaning

  1. The Signal Formula: $s = w^T x$
    • The "signal" (s) is the raw score the model calculates for a specific input.
      • x (Input): The data (e.g., salary, debt).
      • w (Weights): The parameters the model learns (importance of salary vs. debt).
      • s (Signal): The result of multiplying inputs by weights and adding them up (the dot product).
    • Think of s as a credit score calculated by the bank. Before the bank decides what to do with you, they first simply calculate this number.
  2. One Signal, Three Decisions
    • The slide shows that once this signal s is calculated, it can be passed through three different "functions" to produce three different types of answers (outputs):
      • Classification (The Step Function):
        • The Action: The model looks at the signal and asks, "Is it positive or negative?"
        • The Math: h(x) = sign(s).
        • The Output: A simple Yes/No (+1 or −1).
        • Example: Approve or Deny the credit application.
      • Linear Regression (The Identity Function):
        • The Action: The model uses the signal exactly as it is, without changing it.
        • The Math: h(x) = s.
        • The Output: A Real Number (R).
        • Example: Determine the specific amount of credit line to give (e.g., $5,000).
      • Logistic Regression (The S-Curve):
        • The Action: The model squashes the signal into a range between 0 and 1 using a curve (often denoted as θ).
        • The Math: h(x) = θ(s).
        • The Output: A Probability.
        • Example: Calculate the probability that the customer will default on the loan.
  3. Why is it called "Linear"?
    • Linear in x (Geometry): When you plot this signal, it creates a straight line (or a flat plane) that separates the data. This acts as the boundary between "Yes" and "No".
    • Linear in w (The Math): The signal is also a linear function of the weights w. This simplicity is crucial because it allows the computer to use efficient algorithms (like minimizing squared errors) to find the best weights easily.
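
A sketch (made-up weights) of how the same linear signal feeds all three models; the "S-curve" θ is assumed here to be the standard sigmoid $1/(1+e^{-s})$:

```python
import numpy as np

def sigmoid(s):
    # A common choice for the S-curve theta; squashes s into (0, 1)
    return 1.0 / (1.0 + np.exp(-s))

# Made-up weights and an augmented input (x0 = 1)
w = np.array([-0.5, 0.03, -0.02])
x = np.array([1.0, 40.0, 26.0])

s = w @ x                      # the linear signal s = w^T x

classification = np.sign(s)    # +1 / -1        (e.g., approve or deny)
regression     = s             # real number    (e.g., dollar credit line)
probability    = sigmoid(s)    # value in (0,1) (e.g., chance of default)

print(s, classification, regression, probability)
```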

Linear Regression

age            32 years
gender         male
salary         40,000
debt           26,000
years in job   1 year
years at home  3 years
...

$$h(x) = \sum_{i=0}^{d} w_i x_i = w^T x$$
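
Note that categorical features like gender have to be encoded as numbers before they can enter the weighted sum. A minimal sketch with made-up encodings and weights:

```python
import numpy as np

# Hypothetical applicant: x0, age, gender, salary, debt, years in job, years at home.
# Categorical features (gender) are encoded numerically, e.g. male = 1, female = 0.
x = np.array([1.0, 32.0, 1.0, 40_000.0, 26_000.0, 1.0, 3.0])   # leading 1.0 is x0

w = np.array([-2.0, 0.01, 0.1, 0.0004, -0.0005, 0.3, 0.05])    # made-up weights

# Linear regression hypothesis: h(x) = sum_{i=0}^{d} w_i x_i = w^T x
print(w @ x)   # a real-valued output, e.g. a credit line amount
```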

Least Squares Linear Regression

The Math

Pasted image 20260115100601.png|400

Pasted image 20260115100620.png|500


The Meaning

  1. The Goal: Fitting a Line
    • The objective of Linear Regression is to find a line (or a hyperplane in higher dimensions) that passes through your data points as closely as possible.
    • The Hypothesis: The model assumes the relationship between the input and output is linear. The formula for the prediction, $h(x)$, is a weighted sum of the inputs: $h(x) = \sum_{i=0}^{d} w_i x_i = w^T x$. Here, $w^T x$ is the dot product of the weights and the input features.
  2. "Least Squares": Measuring the Error
    • To find the "best" line, we need a way to measure how bad a specific line is. We do this by calculating the In-sample Error (Ein ).
      • The Residual: For every data point, the model looks at the difference between the prediction (h(xn)) and the actual target value (yn​).
      • Squaring: It squares this difference. This ensures that the error is always positive and penalizes large errors more heavily than small ones.
      • Averaging: The total error is the average of these squared differences over all N data points: Ein(h)=1Nn=1N(h(xn)yn)2 This method is called "Least Squares" because the goal is to find the weights w that result in the least sum of squared errors.
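
A sketch of the in-sample error on a toy data set (made-up numbers), measuring how badly one particular weight vector fits:

```python
import numpy as np

def in_sample_error(w, X, y):
    """E_in(w) = (1/N) * sum_n (w^T x_n - y_n)^2, with X holding the x_n as rows."""
    residuals = X @ w - y
    return np.mean(residuals ** 2)

# Toy data set: N = 4 points, d = 1 feature, each row is [x0 = 1, x1]
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
y = np.array([1.2, 1.9, 3.2, 3.9])

print(in_sample_error(np.array([0.0, 1.0]), X, y))   # a decent line: small E_in
print(in_sample_error(np.array([0.0, 2.0]), X, y))   # a worse line: much bigger E_in
```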

Using Matrices for Linear Regression

The Math

$$X = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{bmatrix} \ \text{(data matrix, } N \times (d+1)\text{)}, \qquad
y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} \ \text{(target vector)}, \qquad
\hat{y} = \begin{bmatrix} \hat{y}_1 \\ \hat{y}_2 \\ \vdots \\ \hat{y}_N \end{bmatrix} = \begin{bmatrix} w^T x_1 \\ w^T x_2 \\ \vdots \\ w^T x_N \end{bmatrix} = Xw \ \text{(in-sample predictions)}$$

The Meaning

Calculating the error for one data point at a time is inefficient. We can stack all the data together and calculate the best weights in a single mathematical step.
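
In code, the stacking is just building the $N \times (d+1)$ matrix X and the length-N vector y; all in-sample predictions then come out of one matrix-vector product (a sketch with made-up data):

```python
import numpy as np

# Raw inputs (d = 2 features each) and targets for N = 3 data points (made up)
raw_X = np.array([[2.0, 1.0],
                  [1.0, 3.0],
                  [4.0, 0.5]])
y = np.array([5.0, 8.0, 6.0])

# Data matrix X: prepend the x0 = 1 column, giving shape N x (d+1)
X = np.hstack([np.ones((raw_X.shape[0], 1)), raw_X])

# With any weight vector w, all in-sample predictions at once: y_hat = X w
w = np.array([1.0, 1.0, 2.0])   # made-up weights
y_hat = X @ w
print(y_hat)                    # one prediction per data point
```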

Linear Regression Solution

The Math

$$E_{in}(w) = \frac{1}{N}\left( w^T X^T X w - 2\, w^T X^T y + y^T y \right)$$

The Meaning

  1. The Strategy: Calculus on Matrices
    • To find the weights that produce the smallest error, the algorithm treats the error formula like a curve (or a bowl). The goal is to find the "bottom" of this bowl where the slope is zero. Since we are dealing with vectors and matrices instead of simple numbers, we use Vector Calculus.
    • The slide starts with the expanded matrix error formula derived in the previous section.
  2. The Tools: Matrix Derivatives
    • To find the slope (gradient) of the error with respect to the weights w, the slide introduces two specific rules for differentiating matrices. These are similar to standard calculus rules (like the derivative of $x^2$ is $2x$):
    • Linear Rule: The derivative of a linear term $w^T b$ is just $b$.
      • $\nabla_w (w^T b) = b$
    • Quadratic Rule: The derivative of a quadratic term $w^T A w$ involves the matrix $A$.
      • $\nabla_w (w^T A w) = (A + A^T)\, w$
  3. Calculating the Gradient
    • The slide applies these rules to the error formula. It identifies the parts of the error formula that match the rules:
      • $A$ corresponds to $X^T X$ (which is the quadratic part).
      • $b$ corresponds to $X^T y$ (which is the linear part).
    • By applying the rules, the gradient (slope) of the error is calculated as:
      • $\nabla E_{in}(w) = \frac{2}{N}\left( X^T X w - X^T y \right)$
  4. The Solution: Normal Equations
    • To minimize the error, we set the gradient to zero. This finds the point where the error stops decreasing and starts increasing (the bottom of the bowl).
      • Set Gradient to 0:
        • $\frac{2}{N}\left( X^T X w - X^T y \right) = 0$
      • The Normal Equations: By removing the 2/N and moving the y term to the other side, we get a famous equation in statistics: $X^T X w = X^T y$
  5. The Final Answer ($w_{lin}$)
    • Finally, to isolate w and find the optimal weights, we multiply both sides by the inverse of $X^T X$.
    • The Analytic Solution:
      • $w_{lin} = (X^T X)^{-1} X^T y$
        • This formula tells the computer exactly how to find the best-fitting line in a single step, provided that the matrix $X^T X$ can be inverted (which is usually true if you have enough data points).
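
A sketch that mirrors the derivation on toy data (made-up numbers): compute the gradient at a candidate w, then solve the normal equations $X^T X w = X^T y$ directly (np.linalg.solve avoids forming the explicit inverse):

```python
import numpy as np

# Toy data (made up): X already includes the x0 = 1 column
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
y = np.array([1.2, 1.9, 3.2, 3.9])
N = len(y)

def gradient(w):
    # grad E_in(w) = (2/N) (X^T X w - X^T y)
    return (2.0 / N) * (X.T @ X @ w - X.T @ y)

# Solve the normal equations  X^T X w = X^T y  for the optimal weights
w_lin = np.linalg.solve(X.T @ X, X.T @ y)

print(w_lin)             # best-fit weights
print(gradient(w_lin))   # ~[0, 0]: the gradient vanishes at the minimum
```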

Linear Regression Algorithm

The Math

Construct the matrix X and the vector y from the data set $(x_1, y_1), \ldots, (x_N, y_N)$, where each x includes the $x_0 = 1$ coordinate,

$$X = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{bmatrix} \ \text{(data matrix)}, \qquad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} \ \text{(target vector)}.$$

Compute the pseudo-inverse $X^\dagger$ of the matrix $X$. If $X^T X$ is invertible,

$$X^\dagger = (X^T X)^{-1} X^T$$

Return $w_{lin} = X^\dagger y$.


The Meaning

Unlike the Perceptron algorithm, which learns by taking small steps and correcting mistakes one by one (iterative), the Linear Regression Algorithm finds the perfect answer in a single mathematical step. It uses a "closed-form" or "analytic" solution, meaning you just plug the numbers into a formula and get the result immediately.

  1. Construct the Matrices
    • First, the computer organizes all the training data into standard linear algebra structures.
    • The Inputs: You take all your input vectors $x_1, \ldots, x_N$ and stack them to create the Data Matrix X.
      • Crucial Detail: You must ensure every input vector has the "bias coordinate" $x_0 = 1$ added to it before stacking them.
    • The Outputs: You take all the correct answers (targets) $y_1, \ldots, y_N$ and stack them into a Target Vector y.

      $$X = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{bmatrix}, \qquad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}$$
  2. Compute the Pseudo-Inverse ($X^\dagger$)
    • To solve for the weights, we ideally want to divide by the matrix X. However, because X is rarely a square matrix (you usually have more data points N than features d), it cannot be inverted in the traditional sense.
      • Instead, the algorithm calculates the Pseudo-Inverse, denoted by the symbol $\dagger$ (dagger).
      • The slide provides the specific formula for this calculation, assuming the matrix $X^T X$ is invertible (which is usually true if you have enough data):

        $$X^\dagger = (X^T X)^{-1} X^T$$
  3. Summary
    • The algorithm can be summarized in one line: You pack your data into a big matrix X and a vector y, and then you run one formula to get the best weights: $w_{lin} = (X^T X)^{-1} X^T y$
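
Putting the whole algorithm together as one function (a sketch; numpy's np.linalg.pinv computes the pseudo-inverse and is more robust than forming $(X^T X)^{-1}$ explicitly):

```python
import numpy as np

def linear_regression(raw_X, y):
    """One-shot linear regression: returns w_lin = X_dagger @ y.

    raw_X: (N, d) array of raw feature vectors (no x0 column yet)
    y:     (N,)  array of targets
    """
    # Step 1: construct the data matrix X with the x0 = 1 bias coordinate
    X = np.hstack([np.ones((raw_X.shape[0], 1)), raw_X])

    # Step 2: pseudo-inverse X_dagger = (X^T X)^{-1} X^T (when X^T X is invertible)
    X_dagger = np.linalg.pinv(X)

    # Step 3: return w_lin = X_dagger @ y
    return X_dagger @ y

# Made-up example: fit y ~ 2*x + 1 from noisy points
rng = np.random.default_rng(0)
raw_X = rng.uniform(0, 10, size=(50, 1))
y = 2 * raw_X[:, 0] + 1 + rng.normal(0, 0.1, size=50)
print(linear_regression(raw_X, y))   # roughly [1, 2]
```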