01 - Linear Regression

Class: CSCE-421


Notes:

Learning from Data

  1. Training a program by showing examples with desired outputs.
  2. Tweak the parameter values if the output is wrong

You essentially need a training data set paired with the correct answers (+1/−1) that the model should output

Pasted image 20260113101815.png|400

Pasted image 20260113102208.png|500

Paradigms of Machine Learning

  1. Supervised Learning: x → y
    • (given x, we try to predict y)
    • Classification: y is discrete (binary or multi-class)
    • Regression: y is continuous
    • This is by far the most important machine learning paradigm (it is the underlying technique of ChatGPT, which tries to predict what your next word/token is)
      • This is called next token prediction
  2. Unsupervised Learning: x
    • Clustering, density estimation, etc.
    • The model is only given x, it analyzes x and makes some decision based on it
  3. Reinforcement Learning
    • Sequential decision making
    • Useful in training LLMs

Pasted image 20260113102606.png|500

Three Learning Problems

Given x, we try to predict a y.
Pasted image 20260113102649.png|500

Linear Models

The Math

"The simplest model, but by far the most important"

  1. Input vector $x = [x_1, \ldots, x_d]^T$

     age            32 years
     salary         40,000
     debt           26,000
     years in job   1 year
     years at home  3 years
     ...
  2. Give importance weights to the different inputs and compute a "Credit Score"

    $$\text{"Credit Score"} = \sum_{i=1}^{d} w_i x_i$$

    • A linear model is just a linear combination of the inputs
  3. Approve credit if the "Credit Score" is acceptable

    • Approve credit if $\sum_{i=1}^{d} w_i x_i > \text{threshold}$

    • Deny credit if $\sum_{i=1}^{d} w_i x_i < \text{threshold}$

    • Can be written formally as:

      Pasted image 20260114220419.png|300

      • We do not want to write the two cases every time; this formula combines them
      • Subtracting the threshold, the sign of $\sum_{i=1}^{d} w_i x_i - \text{threshold}$ tells you the decision
      • Prediction = sign(some function of the inputs)
        • Since the summation is an inner product, we can write it as $h(x) = \text{sign}(w^T x)$
  4. How to choose the importance weights $w_i$?

    • Input $x_i$ is important ⇒ large weight $|w_i|$
    • Input $x_i$ beneficial for credit ⇒ positive weight $w_i > 0$
    • Input $x_i$ detrimental for credit ⇒ negative weight $w_i < 0$
  5. The "bias weight" $w_0$ corresponds to the threshold. (How? See the sketch below.)


The Meaning

  1. The Core Concept: Weighted Inputs

    • A linear model essentially acts like a scorecard. It makes decisions by looking at a list of input variables (like age, salary, or debt) and assigning an "importance weight" to each one.
    • The Input (x): These are the features of the data you are analyzing. For example, in a credit application, $x_1$ might be salary and $x_2$ might be years in a job.
    • The Weights (w): The model learns which inputs are important.
      • If an input is important, it gets a large weight.
      • If an input is beneficial (like high salary), it gets a positive weight.
      • If an input is detrimental (like high debt), it gets a negative weight.
    • The model calculates a "score" by multiplying each input by its weight and summing them up ($\sum_i w_i x_i$).
  2. Making a Decision: Thresholds and Bias

    • Once the model calculates the total weighted score, it needs to make a decision (e.g., Approve or Deny credit).

    • The Threshold: The model compares the score to a specific threshold. If the score is higher than the threshold, the credit is approved; if lower, it is denied.

    • The Bias ($w_0$): To make the math cleaner, the slides show that the threshold is moved to the other side of the equation and treated as a "bias weight" ($w_0$). Instead of checking if Score > Threshold, the model checks if Score + Bias > 0 (a quick numeric check follows below).

      Pasted image 20260114220419.png|300
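
A quick numeric sanity check of the bias trick (made-up numbers): comparing the score to a threshold gives exactly the same decision as adding a bias $w_0 = -\text{threshold}$ and comparing to zero.

```python
score = 590.0        # hypothetical weighted score (sum of w_i * x_i)
threshold = 500.0    # hypothetical approval threshold
bias = -threshold    # the bias weight w0

# The two checks are equivalent and give the same decision
print(score > threshold)   # True  (Score > Threshold)
print(score + bias > 0)    # True  (Score + Bias > 0)
```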

The Perceptron Hypothesis Set

The Math

We have defined a Hypothesis set H

$$H = \{\, h(x) = \text{sign}(w^T x) \,\}$$

where

$$w = \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_d \end{bmatrix} \in \mathbb{R}^{d+1}, \qquad x = \begin{bmatrix} 1 \\ x_1 \\ \vdots \\ x_d \end{bmatrix} \in \{1\} \times \mathbb{R}^d$$

This hypothesis set is called the perceptron or linear separator
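
A minimal sketch of one hypothesis from this set, assuming numpy and made-up weights; the raw input is augmented with $x_0 = 1$ so the bias weight rides along inside $w$:

```python
import numpy as np

def perceptron_h(w, x):
    """One hypothesis from H: h(x) = sign(w^T x), with x augmented by x0 = 1.

    w has length d+1 (bias weight w0 first); x has length d (raw features).
    """
    x_aug = np.concatenate(([1.0], x))   # prepend the dummy coordinate x0 = 1
    return np.sign(w @ x_aug)            # +1 or -1

# Made-up example with d = 2 features
w = np.array([-1.0, 2.0, 0.5])                  # [w0, w1, w2]
print(perceptron_h(w, np.array([1.0, 0.2])))    # +1 side of the separator
print(perceptron_h(w, np.array([0.1, 0.4])))    # -1 side
```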


The Meaning

  1. The "Hypothesis Set" (H)
    • A "hypothesis" (h) is just one specific guess at a formula that might distinguish between "Yes" and "No" outputs. The Hypothesis Set is the collection of all possible linear formulas (lines or planes) that the model could potentially use. The learning algorithm's job is to search through this set to find the single "best" one that matches the data.
  2. The Formula: $h(x) = \text{sign}(w^T x)$
    • The slide condenses the decision-making process into a compact linear algebra formula. Here is how to read it:
      • $w^T x$ (The Score): This represents the "signal" or score. It is calculated by multiplying each input (x) by its importance weight (w) and summing them up.
      • sign(…) (The Decision): This function looks at the total score.
        • If the score is positive (> 0), the output is +1 (e.g., Approve Credit).
        • If the score is negative (< 0), the output is −1 (e.g., Deny Credit).
  3. What are $d$, $x$, and $w$?
    • What is $d$?
      • $d$ represents the number of features in your raw data. For example, if you are predicting credit limits, your features might be "Age," "Salary," and "Years in Job." In this case, $d = 3$.
    • Why is $x \in \mathbb{R}^d$?
      • In the slides, the raw input vector is defined as $x = [x_1, \ldots, x_d]^T$. This contains just the features from the data set. However, to make the math work for the linear model, we have to modify this vector slightly in the next step.
    • Why is $w \in \mathbb{R}^{d+1}$?
      • The linear model uses a threshold (or bias) to make decisions. The slides explain that the threshold is moved into the weight vector as a "bias weight" labeled $w_0$.
        • To account for this new $w_0$, we add a "dummy" input $x_0 = 1$ to the input vector $x$.
        • The weight vector $w$ must match the size of this new "augmented" input vector so they can be multiplied.
        • Therefore, $w$ includes $w_0$ plus the $d$ weights for the features, making it size $d+1$.
  4. The "Augmented" Vectors (x0​ and w0​)
    • The slide shows vectors x and w that look slightly different from the previous slide. This is a mathematical trick to handle the threshold:
    • The Bias (w0​): Instead of checking if a score is greater than a threshold (e.g., Score > 50), the algebra is easier if we move the threshold to the left side (e.g., Score - 50 > 0). This -50 becomes a new weight called the bias (w0​).
    • The Dummy Input (x0=1): To include the bias in the standard multiplication formula (wTx), the slide adds a fixed input of 1 at the top of the input vector x.
      • Now, w0×1 is simply added to the total score automatically.
  5. The "Linear Separator"
    • The slide refers to this hypothesis set as the "linear separator".
    • Geometrically, this formula represents a straight line (in 2D) or a flat plane (in higher dimensions) that cuts the space in half.
    • Everything on one side of the line is classified as +1, and everything on the other side is -1. The model "learns" by wiggling this line around until it separates the data points correctly.
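
To see the separator geometry, here is a small sketch (made-up line, not from the slides) that classifies a few 2D points by which side of the boundary $w_0 + w_1 x_1 + w_2 x_2 = 0$ they fall on:

```python
import numpy as np

# Made-up separator in 2D: the boundary is the line x2 = x1 (where w^T x = 0)
w = np.array([0.0, 1.0, -1.0])   # [w0, w1, w2]

points = np.array([
    [2.0, 1.0],   # below the line x2 = x1  -> +1
    [1.0, 3.0],   # above the line          -> -1
    [0.5, 0.1],   # below the line          -> +1
])

# Augment each point with x0 = 1, then classify with sign(w^T x)
X_aug = np.hstack([np.ones((len(points), 1)), points])
print(np.sign(X_aug @ w))   # [ 1. -1.  1.]
```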

The Linear Signal

The Math

$$s = w^T x$$

Pasted image 20260115094635.png|400


The Meaning

  1. The Signal Formula: $s = w^T x$
    • The "signal" (s) is the raw score the model calculates for a specific input.
      • x (Input): The data (e.g., salary, debt).
      • w (Weights): The parameters the model learns (importance of salary vs. debt).
      • s (Signal): The result of multiplying inputs by weights and adding them up (the dot product).
    • Think of s as a credit score calculated by the bank. Before the bank decides what to do with you, they first simply calculate this number.
  2. One Signal, Three Decisions
    • The slide shows that once this signal s is calculated, it can be passed through three different "functions" to produce three different types of answers (outputs):
      • Classification (The Step Function):
        • The Action: The model looks at the signal and asks, "Is it positive or negative?"
        • The Math: h(x) = sign(s).
        • The Output: A simple Yes/No (+1 or −1).
        • Example: Approve or Deny the credit application.
      • Linear Regression (The Identity Function):
        • The Action: The model uses the signal exactly as it is, without changing it.
        • The Math: h(x) = s.
        • The Output: A Real Number (R).
        • Example: Determine the specific amount of credit line to give (e.g., $5,000).
      • Logistic Regression (The S-Curve):
        • The Action: The model squashes the signal into a range between 0 and 1 using a curve (often denoted as θ).
        • The Math: h(x) = θ(s).
        • The Output: A Probability.
        • Example: Calculate the probability that the customer will default on the loan.
  3. Why is it called "Linear"?
    • Linear in x (Geometry): When you plot this signal, it creates a straight line (or a flat plane) that separates the data. This acts as the boundary between "Yes" and "No".
    • Linear in w (The Math): The signal is also a linear function of the weights w. This simplicity is crucial because it allows the computer to use efficient algorithms (like minimizing squared errors) to find the best weights easily.
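
A sketch (made-up weights) of how the same linear signal feeds all three models; the "S-curve" θ is assumed here to be the standard sigmoid $1/(1+e^{-s})$:

```python
import numpy as np

def sigmoid(s):
    # A common choice for the S-curve theta; squashes s into (0, 1)
    return 1.0 / (1.0 + np.exp(-s))

# Made-up weights and an augmented input (x0 = 1)
w = np.array([-0.5, 0.03, -0.02])
x = np.array([1.0, 40.0, 26.0])

s = w @ x                      # the linear signal s = w^T x

classification = np.sign(s)    # +1 / -1        (e.g., approve or deny)
regression     = s             # real number    (e.g., dollar credit line)
probability    = sigmoid(s)    # value in (0,1) (e.g., chance of default)

print(s, classification, regression, probability)
```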

Linear Regression

age            32 years
gender         male
salary         40,000
debt           26,000
years in job   1 year
years at home  3 years
...

$$h(x) = \sum_{i=0}^{d} w_i x_i = w^T x$$
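
Note that categorical features like gender have to be encoded as numbers before they can enter the weighted sum. A minimal sketch with made-up encodings and weights:

```python
import numpy as np

# Hypothetical applicant: x0, age, gender, salary, debt, years in job, years at home.
# Categorical features (gender) are encoded numerically, e.g. male = 1, female = 0.
x = np.array([1.0, 32.0, 1.0, 40_000.0, 26_000.0, 1.0, 3.0])   # leading 1.0 is x0

w = np.array([-2.0, 0.01, 0.1, 0.0004, -0.0005, 0.3, 0.05])    # made-up weights

# Linear regression hypothesis: h(x) = sum_{i=0}^{d} w_i x_i = w^T x
print(w @ x)   # a real-valued output, e.g. a credit line amount
```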

Least Squares Linear Regression

The Math

Pasted image 20260115100601.png|400

Pasted image 20260115100620.png|500


The Meaning

  1. The Goal: Fitting a Line
    • The objective of Linear Regression is to find a line (or a hyperplane in higher dimensions) that passes through your data points as closely as possible.
    • The Hypothesis: The model assumes the relationship between the input and output is linear. The formula for the prediction, $h(x)$, is a weighted sum of the inputs: $h(x) = \sum_{i=0}^{d} w_i x_i = w^T x$. Here, $w^T x$ is the dot product of the weights and the input features.
  2. "Least Squares": Measuring the Error
    • To find the "best" line, we need a way to measure how bad a specific line is. We do this by calculating the In-sample Error (Ein ).
      • The Residual: For every data point, the model looks at the difference between the prediction (h(xn)) and the actual target value (yn​).
      • Squaring: It squares this difference. This ensures that the error is always positive and penalizes large errors more heavily than small ones.
      • Averaging: The total error is the average of these squared differences over all N data points: Ein(h)=1Nn=1N(h(xn)yn)2 This method is called "Least Squares" because the goal is to find the weights w that result in the least sum of squared errors.
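
A sketch of the in-sample error on a toy data set (made-up numbers), measuring how badly one particular weight vector fits:

```python
import numpy as np

def in_sample_error(w, X, y):
    """E_in(w) = (1/N) * sum_n (w^T x_n - y_n)^2, with X holding the x_n as rows."""
    residuals = X @ w - y
    return np.mean(residuals ** 2)

# Toy data set: N = 4 points, d = 1 feature, each row is [x0 = 1, x1]
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
y = np.array([1.2, 1.9, 3.2, 3.9])

print(in_sample_error(np.array([0.0, 1.0]), X, y))   # a decent line: small E_in
print(in_sample_error(np.array([0.0, 2.0]), X, y))   # a worse line: much bigger E_in
```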

Using Matrices for Linear Regression

The Math

$$X = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{bmatrix} \ \text{(data matrix, } N \times (d+1)\text{)}, \qquad
y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} \ \text{(target vector)}, \qquad
\hat{y} = \begin{bmatrix} \hat{y}_1 \\ \hat{y}_2 \\ \vdots \\ \hat{y}_N \end{bmatrix} = \begin{bmatrix} w^T x_1 \\ w^T x_2 \\ \vdots \\ w^T x_N \end{bmatrix} = Xw \ \text{(in-sample predictions)}$$

The Meaning

Calculating the error for one data point at a time is inefficient. We can stack all the data together and calculate the best weights in a single mathematical step.
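
In code, the stacking is just building the $N \times (d+1)$ matrix X and the length-N vector y; all in-sample predictions then come out of one matrix-vector product (a sketch with made-up data):

```python
import numpy as np

# Raw inputs (d = 2 features each) and targets for N = 3 data points (made up)
raw_X = np.array([[2.0, 1.0],
                  [1.0, 3.0],
                  [4.0, 0.5]])
y = np.array([5.0, 8.0, 6.0])

# Data matrix X: prepend the x0 = 1 column, giving shape N x (d+1)
X = np.hstack([np.ones((raw_X.shape[0], 1)), raw_X])

# With any weight vector w, all in-sample predictions at once: y_hat = X w
w = np.array([1.0, 1.0, 2.0])   # made-up weights
y_hat = X @ w
print(y_hat)                    # one prediction per data point
```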

Linear Regression Solution

The Math

$$E_{in}(w) = \frac{1}{N}\left( w^T X^T X w - 2\, w^T X^T y + y^T y \right)$$

The Meaning

  1. The Strategy: Calculus on Matrices
    • To find the weights that produce the smallest error, the algorithm treats the error formula like a curve (or a bowl). The goal is to find the "bottom" of this bowl where the slope is zero. Since we are dealing with vectors and matrices instead of simple numbers, we use Vector Calculus.
    • The slide starts with the expanded matrix error formula derived in the previous section.
  2. The Tools: Matrix Derivatives
    • To find the slope (gradient) of the error with respect to the weights w, the slide introduces two specific rules for differentiating matrices. These are similar to standard calculus rules (like the derivative of $x^2$ is $2x$):
    • Linear Rule: The derivative of a linear term $w^T b$ is just $b$.
      • $\nabla_w (w^T b) = b$
    • Quadratic Rule: The derivative of a quadratic term $w^T A w$ involves the matrix $A$.
      • $\nabla_w (w^T A w) = (A + A^T)\, w$
  3. Calculating the Gradient
    • The slide applies these rules to the error formula. It identifies the parts of the error formula that match the rules:
      • $A$ corresponds to $X^T X$ (which is the quadratic part).
      • $b$ corresponds to $X^T y$ (which is the linear part).
    • By applying the rules, the gradient (slope) of the error is calculated as:
      • $\nabla E_{in}(w) = \frac{2}{N}\left( X^T X w - X^T y \right)$
  4. The Solution: Normal Equations
    • To minimize the error, we set the gradient to zero. This finds the point where the error stops decreasing and starts increasing (the bottom of the bowl).
      • Set Gradient to 0:
        • $\frac{2}{N}\left( X^T X w - X^T y \right) = 0$
      • The Normal Equations: By removing the 2/N and moving the y term to the other side, we get a famous equation in statistics: $X^T X w = X^T y$
  5. The Final Answer ($w_{lin}$)
    • Finally, to isolate w and find the optimal weights, we multiply both sides by the inverse of $X^T X$.
    • The Analytic Solution:
      • $w_{lin} = (X^T X)^{-1} X^T y$
        • This formula tells the computer exactly how to find the best-fitting line in a single step, provided that the matrix $X^T X$ can be inverted (which is usually true if you have enough data points).
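
A sketch that mirrors the derivation on toy data (made-up numbers): compute the gradient at a candidate w, then solve the normal equations $X^T X w = X^T y$ directly (np.linalg.solve avoids forming the explicit inverse):

```python
import numpy as np

# Toy data (made up): X already includes the x0 = 1 column
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
y = np.array([1.2, 1.9, 3.2, 3.9])
N = len(y)

def gradient(w):
    # grad E_in(w) = (2/N) (X^T X w - X^T y)
    return (2.0 / N) * (X.T @ X @ w - X.T @ y)

# Solve the normal equations  X^T X w = X^T y  for the optimal weights
w_lin = np.linalg.solve(X.T @ X, X.T @ y)

print(w_lin)             # best-fit weights
print(gradient(w_lin))   # ~[0, 0]: the gradient vanishes at the minimum
```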

Linear Regression Algorithm

The Math

Construct the matrix X and the vector y from the data set $(x_1, y_1), \ldots, (x_N, y_N)$, where each x includes the $x_0 = 1$ coordinate,

$$X = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{bmatrix} \ \text{(data matrix)}, \qquad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} \ \text{(target vector)}.$$

Compute the pseudo-inverse $X^\dagger$ of the matrix $X$. If $X^T X$ is invertible,

$$X^\dagger = (X^T X)^{-1} X^T$$

Return $w_{lin} = X^\dagger y$.


The Meaning

Unlike the Perceptron algorithm, which learns by taking small steps and correcting mistakes one by one (iterative), the Linear Regression Algorithm finds the perfect answer in a single mathematical step. It uses a "closed-form" or "analytic" solution, meaning you just plug the numbers into a formula and get the result immediately.

  1. Construct the Matrices
    • First, the computer organizes all the training data into standard linear algebra structures.
    • The Inputs: You take all your input vectors $x_1, \ldots, x_N$ and stack them to create the Data Matrix X.
      • Crucial Detail: You must ensure every input vector has the "bias coordinate" $x_0 = 1$ added to it before stacking them.
    • The Outputs: You take all the correct answers (targets) $y_1, \ldots, y_N$ and stack them into a Target Vector y.

      $$X = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{bmatrix}, \qquad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}$$
  2. Compute the Pseudo-Inverse ($X^\dagger$)
    • To solve for the weights, we ideally want to divide by the matrix X. However, because X is rarely a square matrix (you usually have more data points N than features d), it cannot be inverted in the traditional sense.
      • Instead, the algorithm calculates the Pseudo-Inverse, denoted by the symbol $\dagger$ (dagger).
      • The slide provides the specific formula for this calculation, assuming the matrix $X^T X$ is invertible (which is usually true if you have enough data):

        $$X^\dagger = (X^T X)^{-1} X^T$$
  3. Summary
    • The algorithm can be summarized in one line: You pack your data into a big matrix X and a vector y, and then you run one formula to get the best weights: $w_{lin} = (X^T X)^{-1} X^T y$
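
Putting the whole algorithm together as one function (a sketch; numpy's np.linalg.pinv computes the pseudo-inverse and is more robust than forming $(X^T X)^{-1}$ explicitly):

```python
import numpy as np

def linear_regression(raw_X, y):
    """One-shot linear regression: returns w_lin = X_dagger @ y.

    raw_X: (N, d) array of raw feature vectors (no x0 column yet)
    y:     (N,)  array of targets
    """
    # Step 1: construct the data matrix X with the x0 = 1 bias coordinate
    X = np.hstack([np.ones((raw_X.shape[0], 1)), raw_X])

    # Step 2: pseudo-inverse X_dagger = (X^T X)^{-1} X^T (when X^T X is invertible)
    X_dagger = np.linalg.pinv(X)

    # Step 3: return w_lin = X_dagger @ y
    return X_dagger @ y

# Made-up example: fit y ~ 2*x + 1 from noisy points
rng = np.random.default_rng(0)
raw_X = rng.uniform(0, 10, size=(50, 1))
y = 2 * raw_X[:, 0] + 1 + rng.normal(0, 0.1, size=50)
print(linear_regression(raw_X, y))   # roughly [1, 2]
```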