HW1 - Linear Models for Handwritten Digits Classification
Overview:
In this assignment, you will implement the binary logistic regression model and the multi-class logistic regression model on a partial dataset from MNIST. In this classification task, the model takes features extracted from a handwritten digit image as input and predicts the digit's class label.
File structure:
code/
DataReader.py
LogisticRegression.py
LRM.py
main.py
data/
test.npz
training.npz
1. Data Preprocessing
DataReader.py
Function: load_data(filename)
def load_data(filename):
"""Load a given txt file.
Args:
filename: A string.
Returns:
raw_data: An array of shape [n_samples, 256].
labels : An array of shape [n_samples,].
"""
data = np.load(filename)
x = data['x']
y = data['y']
return x, y
What does it do?
- This function is responsible for importing the raw data from a file into the program so the math operations (like those in Linear Regression) can be performed.
- np.load(filename): The code expects a NumPy-compatible file format (here an .npz archive). This file acts like a dictionary containing multiple arrays.
- x = data['x']: This extracts the Input Matrix. In the slides, this corresponds to the data vectors $x_n$. The shape [n_samples, 256] tells us that n_samples is the number of samples and 256 is the number of features (dimensions). Context from the source: the textbook discusses a "handwritten digit recognition" problem using 16×16 pixel images, which is where the 256 features come from.
- y = data['y']: This extracts the Target Vector. In the slides, this corresponds to the correct answers $y_n$ (e.g., +1 or -1 for classification).
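A quick, hypothetical sanity check of this function (assuming it is run from the code/ directory so that load_data is importable from DataReader and the archive sits at ../data/training.npz; adjust the path as needed):
import os
import numpy as np
from DataReader import load_data
raw_data, labels = load_data(os.path.join('..', 'data', 'training.npz'))
print(raw_data.shape)      # expected: (n_samples, 256)
print(labels.shape)        # expected: (n_samples,)
print(np.unique(labels))   # the digit classes stored in this file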
Function: train_valid_split(raw_data, labels, split_index)
def train_valid_split(raw_data, labels, split_index):
"""Split the original training data into a new training dataset
and a validation dataset.
n_samples = n_train_samples + n_valid_samples
Args:
raw_data: An array of shape [n_samples, 256].
labels : An array of shape [n_samples,].
split_index: An integer.
"""
return raw_data[:split_index], raw_data[split_index:], labels[:split_index], labels[split_index:]
What does it do?
- This function implements the concept of Validation. It takes the total available data and slices it into two separate sets: one for learning parameters and one for checking performance.
- The Slicing ([:split_index] and [split_index:]): The function uses the split_index integer to cut the arrays. Example: split_index = 800 means the first 800 samples go to training and the remaining samples go to validation.
- Training Set (:split_index): The data from the start up to the split point becomes the new training set. This is the set used to calculate the weights (minimizing $E_{in}$).
- Validation Set (split_index:): The data from the split point to the end becomes the validation set. This set is not used for training.
The slicing (this is the key part):
return raw_data[:split_index], raw_data[split_index:], \
labels[:split_index], labels[split_index:]
Let’s break that apart.
Features:
raw_data[:split_index] # training data
raw_data[split_index:] # validation data
Labels:
labels[:split_index] # training labels
labels[split_index:] # validation labels
So the return order is:
x_train, x_valid, y_train, y_valid
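A toy illustration of this slicing with made-up numbers (assuming train_valid_split from above is in scope):
import numpy as np
toy_X = np.arange(10).reshape(5, 2)      # 5 samples, 2 features each
toy_y = np.array([1, 2, 1, 2, 1])
x_train, x_valid, y_train, y_valid = train_valid_split(toy_X, toy_y, 3)
print(x_train.shape, x_valid.shape)      # (3, 2) (2, 2)
print(y_train, y_valid)                  # [1 2 1] [2 1]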
Theoretical Connections
- According to the text, "the validation set is a subset of the data that is not involved in the learning process and is used to evaluate the final hypothesis."
- This split allows you to estimate the out-of-sample error ($E_{out}$) and perform model selection (like choosing a learning rate or regularization parameter) without "snooping" on the test data.
Summary of Returns:
- New Training Inputs: Used to calculate gradients and update weights.
- New Validation Inputs: Used only to predict and check accuracy.
- New Training Labels: The correct answers for the training inputs.
- New Validation Labels: The correct answers for the validation inputs.
Why do we need this function?
1. Training Error ($E_{in}$) is Biased and Optimistic
When a model minimizes the error on the training set, it is specifically adapting to that data. Consequently, the in-sample error ($E_{in}$) is an optimistically biased estimate of performance on unseen data.
- The Validation Role: The validation set acts as a proxy for real-world data. Since this data was not used to calculate the weights, the error calculated on it ($E_{val}$) provides a nearly unbiased estimate of the out-of-sample error ($E_{out}$).
2. Model Selection (Tuning Hyperparameters)
You often need to make choices about the learning process that the training algorithm cannot determine on its own. For example:
- Which polynomial degree to use?
- Which regularization parameter ($\lambda$) to use?
- When to stop training (early stopping)?
If you choose these parameters based on which one gives the lowest Training Error, you will simply pick the most complex model possible, leading to overfitting.
- The Validation Role: You test different versions of the model (e.g., different $\lambda$ values) on the validation set and pick the one with the lowest Validation Error. This allows you to select the model that generalizes best, rather than the one that just memorized the training data.
3. Protecting the Test Set (Avoiding Data Snooping)
You might wonder, "Why not just use the Test set for this?"
- The Principle: If a data set affects any step in the learning process (including choosing parameters), it is no longer a valid test set.
- The Danger: If you use the test set to tune your model, you are effectively "snooping" on the final exam. The test set will no longer give you an honest assessment of how the model will perform in the real world. The validation set serves as a "sandbox" for tuning so that the test set remains pure and untouched until the very end.
Summary: You split the data so you can train on one part (the training set), tune and compare models on another (the validation set), and leave the test set untouched for the final, honest evaluation.
Function: prepare_X(raw_X)
def prepare_X(raw_X):
"""Extract features from raw_X as required.
Args:
raw_X: An array of shape [n_samples, 256].
Returns:
X: An array of shape [n_samples, n_features].
"""
raw_image = raw_X.reshape((-1, 16, 16))
Restore spatial structure of the images
raw_image = raw_X.reshape((-1, 16, 16))
- Context: The input data raw_X comes as a "flattened" matrix where every image is a single row of 256 numbers. To flip an image left-to-right, the computer needs to know which pixels are on the left and which are on the right.
- The Code:
  - 16, 16: This tells the computer that the 256 pixels actually represent a 16×16 grid.
  - -1: This is a NumPy trick that means "infer this dimension automatically." Here it becomes N, the total number of images (samples) in the dataset.
- Result: You go from a 2D matrix of shape (N, 256) to a 3D block of shape (N, 16, 16).
Feature 1: Symmetry
# Feature 1: Measure of Symmetry
# Flip images horizontally
flipped_image = np.flip(raw_image, axis=2)
# Since shape == (N, 16, 16), width is axis 2
# Formula for symmetry: - (Sum of |x - flip(x)|) / 256
symmetry = -np.sum(np.abs(raw_image - flipped_image), axis=(1, 2)) / 256
Explanation:
flipped_image = np.flip(raw_image, axis=2)
- This line creates a mirror image of every digit in the dataset.
- Context: To measure symmetry, we compare the original image to its horizontal reflection. If a digit is perfectly symmetric (like the number '1'), the reflection should look identical to the original.
- The Code:
axis=2: In the shape(N, height, width), axis 0 is the sample index, axis 1 is the height, and axis 2 is the width. Flipping along axis 2 performs a horizontal (left-right) flip.
Explanation:
symmetry = -np.sum(np.abs(raw_image - flipped_image), axis=(1, 2)) / 256
- This line calculates the final symmetry score using the formula provided in the assignment: $$F_{\text {symmetry }}=-\frac{\sum_{\text {pixel }}|x-f l i p(x)|}{256}$$
- raw_image - flipped_image: This calculates the difference between the original and the mirrored version pixel-by-pixel.
- np.abs(...): We care about the magnitude of the difference (e.g., a black pixel becoming white or vice versa), not the sign. This measures the asymmetry.
- np.sum(..., axis=(1, 2)): This sums up all the pixel differences for each image. By specifying axis=(1, 2), we collapse the height and width, resulting in a single total number for each of the N images.
- / 256: Dividing the sum by the total number of pixels (256) gives the average difference per pixel.
- - (the negative sign): This is crucial.
  - A large absolute difference means the image is asymmetric (like the digit '5').
  - A small absolute difference means the image is symmetric (like the digit '1').
  - By adding the negative sign, we flip this relationship: a highly symmetric image (difference near 0) gets a score close to 0 (the maximum), while an asymmetric image gets a large negative score. This creates a "Symmetry" feature rather than an "Asymmetry" feature.
Feature 2: Intensity
# Feature 2: Measure of Intensity
### YOUR CODE HERE
# Formula for intensity: (Sum of x) / 256
intensity = np.sum(raw_image, axis=(1,2)) / 256
Explanation:
intensity = np.sum(raw_image, axis=(1,2)) / 256
- The formula for intensity given in the assignment is: $$F_{\text {intensity }}=\frac{\sum_{\text {pixel }} x}{256}$$
- The code performs this calculation in two steps:
- The code performs this calculation in two steps:
  - np.sum(raw_image, axis=(1,2)): This calculates the numerator $\sum_{\text{pixel}} x$ by adding up all the pixel values within each individual image.
  - / 256: This divides the sum by the total number of pixels to get the average value.
- This matches the description in the textbook, where intensity is described as the average pixel value used to distinguish between digits like "1" (low intensity) and "5" (high intensity).
(extra) What does axis=(1,2) do?
To understand the axis parameter, you have to visualize how the raw_image data is structured after you reshape it.
The variable raw_image is a 3-Dimensional array with the shape (N, 16, 16).
- Axis 0 (N): The index of the image (e.g., Image #1, Image #2, ..., Image #N).
- Axis 1 (16): The height (rows) of the image.
- Axis 2 (16): The width (columns) of the image.
When you specify axis=(1, 2): You are telling NumPy to "collapse" or "squash" dimensions 1 and 2 by summing them up, while keeping Axis 0 separate.
- If you summed axis=0: You would add Image #1 + Image #2 + ..., which would result in one "super image" combining everyone's handwriting. This is incorrect here.
- If you sum axis=(1, 2): You are saying, "For each individual image (Axis 0), sum up all the rows (Axis 1) and all the columns (Axis 2)."
Visualizing the Operation
- Think of a single image as a spreadsheet with 16 rows and 16 columns (256 cells total).
- axis=1 (Rows): If you summed only this axis, you would squash the spreadsheet down to one row of column totals.
- axis=2 (Columns): If you summed only this axis, you would squash the spreadsheet into one column of row totals.
- axis=(1, 2) (Both): This tells the computer to squash both dimensions. It takes every number in that 16x16 spreadsheet and adds them all up into one single number.
- Summary: np.sum(..., axis=(1,2)) adds all 256 pixel values of each image together.
- Output: One single number per image (the total value).
- Next step: Divide by 256 to get the average value (intensity).
Visualizing the Result:
- Input: A 3D block of N images, each 16×16.
- Operation: Sum over the height and width.
- Output: A 1D vector of length N. Each number in this vector is the total intensity for that specific image.
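A small standalone demo of this collapsing behavior, using tiny 2×2 "images" instead of 16×16 purely for illustration:
import numpy as np
imgs = np.array([[[1, 2],
                  [3, 4]],
                 [[10, 20],
                  [30, 40]]])
print(imgs.shape)                   # (2, 2, 2): (N, height, width)
print(np.sum(imgs, axis=0))         # one "super image" combining both samples (not what we want)
print(np.sum(imgs, axis=(1, 2)))    # [ 10 100]: one total per image
print(np.sum(imgs, axis=(1, 2)) / imgs[0].size)   # [ 2.5 25. ]: per-image average (intensity)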
Feature 3: Bias Term. Always 1.
# Feature 3: Bias Term. Always 1.
# Vector of 1s with length == n_samples
bias = np.ones(raw_image.shape[0])
What the code does
bias = np.ones(raw_image.shape[0])
- raw_image.shape[0]: This retrieves N, the total number of images (samples) in your dataset.
- np.ones(...): This creates a NumPy array of length N where every single entry is 1.0.
Why do we need it? (The "Bias Trick")
In linear models, the decision boundary is determined by a weighted sum of the inputs plus a threshold (or offset).
- The Formula: The hypothesis is formally written as: $$h(x) = \text{sign}\left( \left( \sum_{i=1}^{d} w_i x_i \right) + w_0 \right)$$ Here, $\sum_{i=1}^{d} w_i x_i$ is the weighted score of your actual features (symmetry and intensity), and $w_0$ is the bias weight (threshold).
- The Matrix Problem: We want to calculate this entire score using a single neat matrix multiplication: $\mathbf{w}^T \mathbf{x}$. However, standard matrix multiplication only does the summing part ($\sum_{i=1}^{d} w_i x_i$); it doesn't automatically add the loose constant $w_0$.
- The Solution ($x_0 = 1$): To fix this, we use a mathematical trick called the augmented vector. We imagine there is a "dummy" feature $x_0$ that is always equal to 1: $$w_0 \cdot 1 + w_1 x_1 + w_2 x_2 = \mathbf{w}^T \mathbf{x}$$ By adding this column of 1s to your data matrix, the computer treats the bias just like any other weight. When it multiplies the weight $w_0$ by the feature $x_0 = 1$, it effectively adds the threshold to the score.
What happens if you skip this?
If you do not include this bias column (the column of 1s), you force $w_0 = 0$.
- Geometrically, this forces your separating line (or hyperplane) to pass directly through the origin $(0, 0)$.
- If your data is not centered at the origin (which is true for pixel intensity, as it is always positive), a line passing through the origin will likely fail to separate the digits "1" and "5" correctly. The bias allows the line to shift up, down, left, or right to find the best fit.
(extra) The concept of a threshold/bias weight
Think of a linear model as a scorecard. In the credit card example, the model calculates a weighted score based on your inputs (salary, years in job, etc.).
- The Score: $\sum_{i=1}^{d} w_i x_i$
- The Decision: To make a Yes/No decision, the bank needs a specific threshold. For example, "If the score is greater than 50, approve credit."
Mathematically, this looks like: $$ \sum_{i=1}^d w_i x_i > \text{threshold} $$
The Mathematical Trick: Turning Threshold into Bias:
To make the math and the code cleaner, machine learning experts move the threshold to the other side of the inequality. $$ \sum_{i=1}^d w_i x_i - \text{threshold} > 0 $$
The term $-\text{threshold}$ is renamed the bias weight $w_0$, so the condition becomes $\sum_{i=1}^{d} w_i x_i + w_0 > 0$.
The "Augmented" Vector ( )
This explains why our code has bias = np.ones(...).
We want to write the entire formula as a single matrix multiplication (dot product): $w^T x$.
- The Solution: We create a "dummy" feature, called $x_0$, which is always equal to 1.
- Now, $w_0 \cdot x_0$ is effectively $w_0$, the bias term.
This allows us to merge the bias into the weights and inputs perfectly: $$ \underbrace{w_0 \cdot 1}_{\text{bias}} + \sum_{i=1}^{d} w_i x_i = \sum_{i=0}^{d} w_i x_i = w^T x $$
This is referred to in the slides as the augmented vector.
Geometric Meaning
Without a bias term, a linear model (a line or hyperplane) is forced to pass through the origin.
- Without Bias: If all inputs $x_i$ are 0, the score is always 0. The line cannot shift left, right, up, or down; it can only rotate about the origin.
- With Bias: The line can shift away from the origin. This is crucial for data (like image pixel intensities) that might not be naturally centered around zero. The bias allows the separating line to "slide" to the correct position to separate the digits '1' and '5'.
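A minimal sketch of the bias trick with toy numbers (none of these values come from the assignment's data):
import numpy as np
features = np.array([[0.5, -1.0],
                     [2.0,  0.3],
                     [-1.5, 1.2]])                  # 3 samples, 2 features
w = np.array([0.7, 2.0, -1.0])                      # w[0] plays the role of the threshold/bias
X_aug = np.column_stack((np.ones(features.shape[0]), features))   # prepend x0 = 1
scores = X_aug @ w                                  # w0*1 + w1*x1 + w2*x2 for every sample
scores_explicit = features @ w[1:] + w[0]           # same thing with an explicit "+ bias"
print(np.allclose(scores, scores_explicit))         # True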
Stack features together
# Stack features together in the following order.
# [Feature 3, Feature 1, Feature 2]
# Final feature matrix with shape == (n_samples, 3)
X = np.column_stack((bias, symmetry, intensity))
return X
1. What the code does (The Mechanics)
The function np.column_stack takes a sequence of 1D arrays (vectors) and stacks them side-by-side as columns to create a single 2D matrix.
- Inputs: Three separate vectors, each of length N (the number of images):
  - bias: a vector of all 1s.
  - symmetry: the symmetry score for each image.
  - intensity: the intensity score for each image.
- Output: A single matrix $\mathbf{X}$ with dimensions $N \times 3$.
2. The Resulting Data Matrix ($\mathbf{X}$)
This line constructs the Data Matrix (often called the Design Matrix) described in the "Linear Regression Algorithm" slide. Each row represents one specific image (data point), and each column represents a specific feature of that image.
Visually, each row of the matrix $\mathbf{X}$ is one image, and:
- Column 0 (Bias): Corresponds to the bias weight $w_0$.
- Column 1 (Symmetry): Corresponds to the weight $w_1$.
- Column 2 (Intensity): Corresponds to the weight $w_2$.
3. Why do we stack them this way?
We stack them into this specific shape to utilize matrix multiplication for calculating predictions and errors efficiently.
As explained in the slides, the linear model calculates a "score" or "signal" for every data point using the formula: $$s = \mathbf{X}\mathbf{w}$$
This single line of linear algebra ($\mathbf{X}\mathbf{w}$) computes the score of every image at once.
- Without column_stack, you would have disjoint vectors and would have to write a loop to calculate the score for each image individually.
- By creating $\mathbf{X}$, you enable the use of the analytic solution for Linear Regression provided in your slides: $$ \mathbf{w}_{lin} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y} $$ This matrix $\mathbf{X}$ is the primary input required to solve for the best weights using the pseudo-inverse method.
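If you want to see that analytic solution in action, here is a hedged sketch (not part of the assignment's required code) that solves for the weights with NumPy's pseudo-inverse; the demo data is random placeholder data of the right shapes:
import numpy as np
def linear_regression_weights(X, y):
    # np.linalg.pinv(X) computes the pseudo-inverse, so this evaluates
    # (X^T X)^{-1} X^T y in a numerically stable way.
    return np.linalg.pinv(X) @ y
rng = np.random.default_rng(0)
X_demo = np.column_stack((np.ones(5), rng.normal(size=(5, 2))))   # shape (5, 3), like prepare_X output
y_demo = np.array([1, -1, 1, 1, -1])
print(linear_regression_weights(X_demo, y_demo))                  # a length-3 weight vector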
Function: prepare_y(raw_y)
def prepare_y(raw_y):
"""
Args:
raw_y: An array of shape [n_samples,].
Returns:
y: An array of shape [n_samples,].
idx:return idx for data label 1 and 2.
"""
y = raw_y
idx = np.where((raw_y==1) | (raw_y==2))
y[np.where(raw_y==0)] = 0
y[np.where(raw_y==1)] = 1
y[np.where(raw_y==2)] = 2
return y, idx
1. The Purpose of the Function
This function is designed to filter the dataset for a specific binary classification task. Even though the original dataset contains many digits (0, 1, 2, etc.), this function creates an index (idx) to isolate only the samples corresponding to the digits "1" and "2".
This aligns with the strategy mentioned in the textbook of decomposing a multiclass problem (recognizing ten digits) into smaller binary problems (e.g., separating digit 1 from digit 2).
2. Line-by-Line Explanation
- y = raw_y: This creates a variable y that references the input label array.
- idx = np.where((raw_y==1) | (raw_y==2)):
  - The Logic: This acts as a filter. It scans the raw_y array and finds the index locations (row numbers) where the label is either 1 or 2.
  - The Result: idx is a list of integers representing the specific examples you will use for training and testing. Any data point not in this list (like digit 0, 3, 4, ...) will effectively be ignored in later steps when you apply this filter.
- y[np.where(raw_y==0)] = 0, etc.:
  - Sanitization: These lines explicitly set the values at the matching indices to the integers 0, 1, and 2.
  - Why? In many starter codes, this is a safety measure. It ensures that the labels are clean integers and not floats (e.g., 1.0 instead of 1) or other formats, ensuring consistency before the mathematical operations begin.
3. Summary of Returns
- y: The array of labels (sanitized).
- idx: The critical piece of information. You will likely use this in your main.py or training loop to slice your data matrix:
  - Instead of using the full X, you will use X[idx] to get only the images of 1s and 2s.
  - Instead of using the full y, you will use y[idx] to get only the labels 1 and 2.
This function identifies samples belonging to classes 1 and 2 for binary classification, and these indices should be used to filter the data and convert the labels to ±1 if required by the learning algorithm.
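A tiny standalone illustration of how idx is used to filter both features and labels (toy values only):
import numpy as np
toy_labels = np.array([0, 1, 2, 2, 1, 0])
idx = np.where((toy_labels == 1) | (toy_labels == 2))
print(idx[0])            # [1 2 3 4]: positions of the 1s and 2s
toy_X = np.arange(12).reshape(6, 2)
print(toy_X[idx])        # only the rows belonging to digits 1 and 2
print(toy_labels[idx])   # [1 2 2 1]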
main.py
Function main()
def main():
# ------------Data Preprocessing------------
# Read data for training.
raw_data, labels = load_data(os.path.join(data_dir, train_filename))
raw_train, raw_valid, label_train, label_valid = train_valid_split(raw_data, labels, 2300)
##### Preprocess raw data to extract features
train_X_all = prepare_X(raw_train)
valid_X_all = prepare_X(raw_valid)
##### Preprocess labels for all data to 0,1,2 and return the idx for data from '1' and '2' class.
train_y_all, train_idx = prepare_y(label_train)
valid_y_all, val_idx = prepare_y(label_valid)
####### For binary case, only use data from '1' and '2'
train_X = train_X_all[train_idx]
train_y = train_y_all[train_idx]
####### Only use the first 1350 data examples for binary training.
train_X = train_X[0:1350]
train_y = train_y[0:1350]
valid_X = valid_X_all[val_idx]
valid_y = valid_y_all[val_idx]
####### Set labels to 1 and -1. Here convert label '2' to '-1', which means we treat digit '1' as the positive class.
### YOUR CODE HERE
# Convert label 2 to -1
train_y[train_y == 2] = -1
valid_y[valid_y == 2] = -1
# Explicitly convert label 1 to 1 (not strictly needed)
train_y[train_y == 1] = 1
valid_y[valid_y == 1] = 1
### END YOUR CODE
data_shape = train_y.shape[0]
# # Visualize training data.
visualize_features(train_X[:, 1:3], train_y)
1. Loading and Splitting Data
raw_data, labels = load_data(os.path.join(data_dir, train_filename))
raw_train, raw_valid, label_train, label_valid = train_valid_split(raw_data, labels, 2300)
- Action: It loads the full dataset and splits it into a Training Set (first 2300 examples) and a Validation Set (the rest).
- Why? As discussed in the textbook, the validation set is essential for model selection. It acts as a proxy for out-of-sample data (an estimate of $E_{out}$) to help you tune hyperparameters without "snooping" on the final test data.
2. Feature Extraction (prepare_X)
train_X_all = prepare_X(raw_train)
valid_X_all = prepare_X(raw_valid)
- Action: This calls the function we implemented earlier. It transforms the raw 256-pixel vectors into the Design Matrix $\mathbf{X}$ with 3 columns:
  - Bias: Always 1 (to handle the threshold $w_0$).
  - Symmetry: How symmetric the digit is.
  - Intensity: How black/bold the digit is.
- Context: This maps the input space from 256 dimensions down to 3 dimensions (bias, symmetry, intensity).
3. Filtering for Binary Classification (prepare_y)
train_y_all, train_idx = prepare_y(label_train)
valid_y_all, val_idx = prepare_y(label_valid)
- Action: This function (implemented earlier) scans the labels. It returns train_idx, a list of row numbers where the digit is either '1' or '2'.
- Why? The textbook describes decomposing a multi-class problem (10 digits) into binary problems (e.g., "is it a 1 or a 5?"). Your specific assignment is determining "is it a 1 or a 2?".
4. Applying the Filter
train_X = train_X_all[train_idx]
train_y = train_y_all[train_idx]
- Action: It uses the indices found in step 3 to discard all data related to digits 0, 3, 4, 5, 6, 7, 8, and 9. The arrays train_X and train_y now only contain information about digits 1 and 2.
5. Limiting Sample Size
train_X = train_X[0:1350]
train_y = train_y[0:1350]
- Action: It explicitly slices the data to keep only the first 1,350 examples.
- Why? This is likely an artificial constraint for this specific homework problem (e.g., to test how the model behaves with a fixed sample size N, or to ensure everyone gets the same results).
6. Label Conversion (The Code You Wrote)
train_y[train_y == 2] = -1
valid_y[valid_y == 2] = -1
# ... (and implicitly 1 remains 1)
- Action: This converts the class labels from {1, 2} to {+1, -1}.
- Mathematical Necessity: Linear classification models (like the Perceptron) operate on the sign of the signal: $h(x) = \text{sign}(w^T x)$, which outputs +1 or -1.
  - If you left the labels as 1 and 2, the math for minimizing the error (e.g., the cross-entropy loss, which assumes $y \in \{+1, -1\}$) would break, because the model is trying to output signs, not the integer "2".
- If you left the labels as 1 and 2, the math for minimizing error (e.g.,
7. Visualization
visualize_features(train_X[:, 1:3], train_y)
- Action: It sends the data to a plotting function.
- train_X[:, 1:3]: This selects all rows (:) but only columns 1 and 2.
  - Column 0 is the Bias (all 1s), which is not useful to plot.
- Column 1 is Symmetry.
- Column 2 is Intensity.
- Result: This will likely generate a 2D scatter plot similar to the one shown in the textbook (Figure 3.1 or Example 3.1), where you can see if the 1s and 2s form distinct clusters based on how symmetric and dark they are.
Function: visualize_features(X, y)
def visualize_features(X, y):
'''This function is used to plot a 2-D scatter plot of training features.
Args:
X: An array of shape [n_samples, 2].
y: An array of shape [n_samples,]. Only contains 1 or -1.
Returns:
No return. Save the plot to 'train_features.*' and include it
in submission.
'''
### YOUR CODE HERE
# Separate the data into 1 (digit 1) and -1 (digit 2) classes
X_pos = X[y == 1]
X_neg = X[y == -1]
# Initialize plot
plt.figure(figsize=(8, 6))
# Plot positives (1) with 'o'
plt.scatter(X_pos[:,0], X_pos[:, 1], color='blue', marker='o', label='Class +1')
# Plot negatives (-1) with 'x'
plt.scatter(X_neg[:, 0], X_neg[:, 1], color='red', marker='x', label='Class -1')
# Styling
plt.xlabel('Symmetry')
plt.ylabel('Intensity')
plt.title('Feature Visualization')
plt.legend()
plt.grid(True)
# Save plot
plt.savefig('train_features.png')
# Display
plt.show()
### END YOUR CODE
Just a simple training feature scatter plot, not very complex.
2. Cross-entropy loss
(a) Write the loss function for one training data sample $(x, y)$.
1. The Setup
- The Signal: Let $s = w^T x$.
- The Logistic Function: $\theta(s) = \frac{1}{1 + e^{-s}}$.
- The Probability: The model states that $P(y = +1 \mid x) = \theta(w^T x)$ and $P(y = -1 \mid x) = 1 - \theta(w^T x)$.
- The Loss: We want to minimize the negative log-likelihood, so the error for one point is $E(w) = -\ln P(y \mid x)$.
2. Identify the Probability Function
In Logistic Regression with binary labels $y \in \{+1, -1\}$, the probability is defined piecewise: $P(y|x) = \theta(w^T x)$ if $y = +1$, and $P(y|x) = 1 - \theta(w^T x)$ if $y = -1$.
The Property:
The logistic function is defined as $\theta(s) = \frac{e^s}{1 + e^s} = \frac{1}{1 + e^{-s}}$.
Now, look at $1 - \theta(s)$: $$1 - \theta(s) = 1 - \frac{e^s}{1 + e^s} = \frac{1}{1 + e^s}$$
Since both equal $\frac{1}{1 + e^s}$, we conclude that $1 - \theta(s) = \theta(-s)$.
Applying it to the Piecewise Function
Now we look at the two cases for $y$:
- Case 1 ($y = +1$): The formula is $P(y|x) = \theta(w^T x)$. Since $y = +1$, we can write this as $\theta(y\, w^T x)$.
- Case 2 ($y = -1$): The formula is $P(y|x) = 1 - \theta(w^T x)$. Using the property derived above, we swap $1 - \theta(w^T x)$ for $\theta(-w^T x)$: $$ 1 - \theta(w^T x) = \theta(-w^T x) $$ Since $y = -1$, we can write $-w^T x$ as $y\, w^T x$, which is equal to $\theta(y\, w^T x)$. So, $P(y|x) = \theta(y\, w^T x)$.
Summary:
By using the symmetry $1 - \theta(s) = \theta(-s)$, both cases collapse into the single compact formula $P(y|x) = \theta(y\, w^T x)$.
3. Apply the Negative Log-Likelihood
To find the optimal weights, the algorithm seeks to maximize the likelihood of the data. Mathematically, maximizing the likelihood is equivalent to minimizing the negative natural logarithm (negative log-likelihood) of that probability. Therefore, the loss for one point is $E(w) = -\ln P(y|x) = -\ln \theta(y\, w^T x)$.
The Substitution Step-by-Step
Step A: Substitute the probability $P(y|x) = \theta(y\, w^T x) = \frac{1}{1 + e^{-y\, w^T x}}$ into $E(w) = -\ln P(y|x)$, which gives $E(w) = -\ln\left(\frac{1}{1 + e^{-y\, w^T x}}\right)$.
Step B: Use $-\ln(1/a) = \ln(a)$. By substituting the definition of the logistic function into the equation above, we obtain the standard cross-entropy loss formula for a single training sample: $$E(w) = \ln\left(1 + e^{-y\, w^T x}\right)$$
(b) Compute the gradient $\nabla E(w)$. Please provide intermediate steps of the derivation.
Based on the loss function derived in the previous step and the gradient descent discussions in the Learning From Data textbook, here is the step-by-step derivation of the gradient $\nabla E(w)$.
The Loss Function
Recalling the Cross-Entropy loss for a single sample: $$E(w) = \ln\left(1 + e^{-y\, w^T x}\right)$$
Step-by-Step Derivation
Step 1: Identify the nested functions
Let us define a variable for each nested part of this expression so the chain rule can be applied cleanly (see Step 2).
Step 2: Apply the Chain Rule:
The Chain Rule states that if you have composite functions (functions inside of other functions), the derivative of the whole is the product of the derivatives of the parts.
To calculate the gradient, we view the error function $E(w) = \ln(1 + e^{-y\, w^T x})$ as three nested layers:
- Layer 1 (Inner Variable): Let $u$ be the exponent. $$u(w) = -y\, w^T x$$
- Layer 2 (Intermediate Term): Let $z$ be the term inside the logarithm. $$z(u) = 1 + e^u$$
- Layer 3 (Outer Function): The error function itself is the natural log of $z$. $$E(z) = \ln(z)$$
Apply the Chain Rule:
We want to find how $E$ changes when $w$ changes ($\nabla_w E$). Because $w$ is buried inside $u$, we must calculate how changes propagate through the layers:
- How $E$ changes as $z$ changes ($\partial E / \partial z$).
- How $z$ changes as $u$ changes ($\partial z / \partial u$).
- How $u$ changes as $w$ changes ($\nabla_w u$).
When we multiply these rates of change together, we get the total gradient: $$ \nabla E(w) = \underbrace{\frac{\partial E}{\partial z}}_{\text{Outer}} \cdot \underbrace{\frac{\partial z}{\partial u}}_{\text{Middle}} \cdot \underbrace{\nabla_w u}_{\text{Inner}} $$
If we substitute the definitions of $z$ back in:
- Replace $z$ with $1 + e^u$: $$ \frac{\partial E}{\partial z} \rightarrow \frac{\partial E}{\partial (1+e^u)} $$
- Replace $z$ with $1 + e^u$ in the second term: $$ \frac{\partial z}{\partial u} \rightarrow \frac{\partial (1+e^u)}{\partial u} $$
So the gradient is the product of the derivatives of the nested parts and is written as: $$\nabla E(w) = \frac{\partial E}{\partial (1+e^u)} \cdot \frac{\partial (1+e^u)}{\partial u} \cdot \nabla_w u$$
Step 3: Compute the individual derivatives
- Outer Logarithm: The derivative of $\ln(z)$ is $\frac{1}{z}$. $$ \frac{\partial E}{\partial (1+e^u)} = \frac{1}{1 + e^u} $$
- Exponential Term: The derivative of $1 + e^u$ with respect to $u$ is $e^u$. $$ \frac{\partial (1+e^u)}{\partial u} = e^u $$
- Linear Term: The gradient of $u = -y\, w^T x$ with respect to the vector $w$ is simply the coefficient vector $-y x$. $$ \nabla_w u = \nabla_w (-y\, w^T x) = -y x $$
Step 4: Combine and Substitute. Multiply the three parts together: $$\nabla E(w) = \frac{1}{1+e^u} \cdot e^u \cdot (-y x) = \frac{-y\, x\, e^u}{1 + e^u}$$
Substitute $u = -y\, w^T x$ back in: $$\nabla E(w) = \frac{-y\, x\, e^{-y\, w^T x}}{1 + e^{-y\, w^T x}}$$
Step 5: Simplify the Expression
To clean up the fraction, multiply the numerator and the denominator by $e^{y\, w^T x}$: $$\frac{e^{-y\, w^T x}}{1 + e^{-y\, w^T x}} = \frac{1}{1 + e^{y\, w^T x}}$$
Now, combine this with the term $-y\, x$:
Final Answer
The gradient of the loss function for a single sample is: $$\nabla E(w) = \frac{-y\, x}{1 + e^{y\, w^T x}}$$
This result matches the summation formula provided in Exercise 3.7 of the textbook, which defines the gradient contribution of a single point as $\frac{-y_n x_n}{1 + e^{y_n w^T x_n}}$.
Interpretation: The gradient is a scaled version of the input vector $x$. The scalar factor $\frac{-y}{1 + e^{y\, w^T x}}$ is tiny when the point is classified correctly with a large margin and approaches $-y$ when the point is badly misclassified, so misclassified points dominate the update.
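One way to convince yourself the formula is correct is a finite-difference check: perturb each weight slightly and compare the numerical slope of $E(w) = \ln(1 + e^{-y\, w^T x})$ with the analytic gradient. A minimal standalone sketch with made-up numbers:
import numpy as np
def loss(w, x, y):
    return np.log(1.0 + np.exp(-y * np.dot(w, x)))
def analytic_grad(w, x, y):
    return -y * x / (1.0 + np.exp(y * np.dot(w, x)))
w = np.array([0.2, -0.5, 1.0])
x = np.array([1.0, 0.3, -0.7])       # e.g., [bias, symmetry, intensity]
y = -1
eps = 1e-6
num_grad = np.array([(loss(w + eps * e, x, y) - loss(w - eps * e, x, y)) / (2 * eps)
                     for e in np.eye(len(w))])
print(np.allclose(num_grad, analytic_grad(w, x, y)))   # True (up to numerical error)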
(c) Once the optimal $w$ is obtained, it can be used to make predictions as follows:
$$
\text{Predicted class of } x = \begin{cases} 1 & \text{if } \theta\left(w^T x\right) \geq 0.5 \\ -1 & \text{if } \theta\left(w^T x\right) < 0.5 \end{cases}
$$
where the function $\theta(s) = \frac{1}{1 + e^{-s}}$ is the sigmoid (logistic) function.
However, this is not the most efficient way since the decision boundary is linear. Why? Explain it. When will we need to use the sigmoid function in prediction?
1. Why is checking $\theta(w^T x) \geq 0.5$ not the most efficient way?
Explanation: Using the sigmoid function to make a binary prediction is inefficient because it involves unnecessary and computationally expensive mathematical operations.
- The Math: The sigmoid function $\theta(s)$ is strictly monotonic: it constantly increases as the input increases. Crucially, $\theta(0) = 0.5$.
- The Equivalence: Because of this property, the inequality $\theta(w^T x) \geq 0.5$ is mathematically identical to checking whether the input signal is non-negative: $$ \theta(w^T x) \geq 0.5 \iff w^T x \geq 0 $$
- Computational Cost:
  - Computing $w^T x$ only requires basic multiplication and addition (a dot product).
  - Computing $\theta(w^T x)$ requires calculating the dot product plus an exponentiation ($e^{-w^T x}$) and a division. Exponentiation is computationally much more expensive than simple arithmetic.
Conclusion: Since the decision boundary is linear (defined by $w^T x = 0$), you can make exactly the same predictions by simply checking the sign of $w^T x$ and skipping the sigmoid entirely.
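A quick numerical illustration of this equivalence (toy signal values only):
import numpy as np
signals = np.array([-3.0, -0.1, 0.0, 0.1, 3.0])       # hypothetical values of w^T x
sigmoid_rule = (1 / (1 + np.exp(-signals))) >= 0.5    # threshold the probability
sign_rule = signals >= 0                              # threshold the raw signal
print(np.array_equal(sigmoid_rule, sign_rule))        # True: identical predictions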
2. When will we need to use the sigmoid function in prediction?
You need the sigmoid function when the goal is not just a hard classification (Yes/No), but an estimation of uncertainty or risk.
- Probability Estimation: In many applications, such as medical diagnosis (e.g., predicting a heart attack) or financial forecasting, it is crucial to know the likelihood of an event, not just the binary outcome. A patient with a 99% probability of a heart attack requires different treatment than one with a 51% probability, even though a hard classifier would treat them both as "Positive".
- Weighted Cost / Risk Matrices: If the "cost" of making a mistake is not symmetric (e.g., a false negative is much worse than a false positive, as in CIA security clearance), you should not use the standard 0.5 threshold.
  - As shown in the textbook problems regarding risk matrices, the optimal threshold might shift (e.g., to 0.1 or 0.9) depending on the costs of a false accept and a false reject.
  - To apply a shifted threshold (e.g., "predict positive only if the probability exceeds 0.9"), you must compute the actual probability value using the sigmoid function $\theta(w^T x)$.
(d) Is the decision boundary still linear if the prediction rule is changed to predicting class 1 when $\theta(w^T x) \geq 0.9$?
Based on the properties of the linear signal and the logistic function described in the "Linear Models" slides and the Learning From Data textbook, the answer is Yes, the decision boundary is still linear.
1. Identify the Decision Boundary Equation
The decision boundary is the "line" (or hyperplane) that separates the region where you predict +1 from the region where you predict -1. With the new rule, the boundary is the set of points where $\theta(w^T x) = 0.9$.
2. Substitute the Logistic Function
Recall the definition of the logistic function: $\theta(w^T x) = \frac{1}{1 + e^{-w^T x}}$. Setting this equal to 0.9 gives $\frac{1}{1 + e^{-w^T x}} = 0.9$.
3. Solve for the Linear Signal ($w^T x$)
To see the shape of the boundary in the input space, we need to isolate the linear signal $w^T x$:
- Invert both sides: $$ 1 + e^{-w^T x} = \frac{1}{0.9} = \frac{10}{9} $$
- Subtract 1 from both sides: $$ e^{-w^T x} = \frac{10}{9} - 1 = \frac{1}{9} $$
- Take the natural logarithm ($\ln$) of both sides: $$ -w^T x = \ln\left(\frac{1}{9}\right) $$
- Simplify the logarithm ($\ln(1/9) = -\ln 9$): $$ -w^T x = -\ln(9) \quad\Rightarrow\quad w^T x = \ln(9) $$
4. Conclusion: It is a Linear Boundary
The final equation for the decision boundary is: $$ w^T x - \ln(9) = 0 $$
Since $\ln(9)$ is just a constant, this has the same form as $w^T x = c$: it is still a linear equation in $x$.
- Original Boundary (Threshold 0.5): $w^T x = 0$
- New Boundary (Threshold 0.9): $w^T x = \ln(9) \approx 2.197$
Geometrically, the equation $w^T x = \ln(9)$ describes a hyperplane parallel to the original boundary; raising the threshold only shifts the linear boundary, it does not bend it.
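A one-line numerical check that the signal value ln(9) really corresponds to a probability of 0.9:
import numpy as np
s = np.log(9)                  # the shifted threshold on the linear signal
print(1 / (1 + np.exp(-s)))    # 0.9 (up to floating-point error)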
(e) What is the essential property of logistic regression that results in the linear decision boundary?
The essential property that results in a linear decision boundary is that Logistic Regression models the probability as a monotonic function of a linear signal.
Here is the breakdown of the components you should mention in your answer:
- The Linear Signal ($s$): The core of the model is the "signal" $s$, which is a linear combination of the input features: $$s = w^T x = \sum_i w_i x_i$$ This part is linear in the input $x$.
- The Monotonic Transformation ($\theta$): The probability is computed by passing this linear signal through the logistic (sigmoid) function: $$P(y=1|x) = \theta(s) = \frac{1}{1+e^{-s}}$$ Crucially, this function is strictly monotonic (it squashes the signal from the range $(-\infty, \infty)$ into $(0, 1)$ without changing the order of the values).
- The Resulting Boundary: Because the function is monotonic, applying a threshold to the probability (e.g., $\theta(s) \geq 0.5$) is mathematically equivalent to applying a corresponding threshold to the linear signal (e.g., $s \geq 0$).
  - Equation: $w^T x = 0$ (or $w^T x = c$ for a different probability threshold).
  - Since $w^T x = c$ defines a hyperplane (a line in 2D), the decision boundary is linear.
Summary for your answer: You should state that the decision boundary is linear because logistic regression relies on a linear signal ($s = w^T x$) passed through a strictly monotonic sigmoid, so thresholding the probability reduces to thresholding the linear signal.
3. Sigmoid Logistic Regression
LogisticRegression.py
Function: _gradient
Based on (b) in the last problem, implement the function _gradient.
def _gradient(self, _x, _y):
"""Compute the gradient of cross-entropy with respect to self.W
for one training sample (_x, _y). This function is used in fit_*.
Args:
_x: An array of shape [n_features,].
_y: An integer. 1 or -1.
Returns:
_g: An array of shape [n_features,]. The gradient of
cross-entropy with respect to self.W.
"""
### YOUR CODE HERE
### END YOUR CODE
Based on the derivation we performed in the previous step (part b), we found that the gradient of the cross-entropy loss for a single sample is $\nabla E(w) = \frac{-y\, x}{1 + e^{y\, w^T x}}$.
Here is the implementation for the _gradient function in LogisticRegression.py, followed by a detailed explanation.
Implementation
def _gradient(self, _x, _y):
"""Compute the gradient of cross-entropy with respect to self.W
for one training sample (_x, _y). This function is used in fit_*.
Args:
_x: An array of shape [n_features,].
_y: An integer. 1 or -1.
Returns:
_g: An array of shape [n_features,]. The gradient of
cross-entropy with respect to self.W.
"""
### YOUR CODE HERE
# 1. Compute the exponent term: y * (w^T * x)
# self.W is the weight vector (w)
exponent = _y * np.dot(self.W, _x)
# 2. Compute the scalar coefficient: -y / (1 + e^(exponent))
# This represents how much "correction" is needed based on the prediction error.
coefficient = -_y / (1 + np.exp(exponent))
# 3. Compute the gradient vector: coefficient * x
_g = coefficient * _x
### END YOUR CODE
return _g
Explanation of the Steps
1. The Dot Product (np.dot(self.W, _x))
- Math: This calculates the linear signal $w^T x$.
- Context: As described in the slides and textbook, the linear signal is the weighted sum of the inputs. This determines the "score" before it is squashed into a probability by the logistic function.
2. The Exponent (exponent = _y * ...)
- Math: This calculates $y\, (w^T x)$.
- Context: This term represents the "margin."
  - If $y$ and the signal have the same sign (correct classification), this value is positive.
  - If they have different signs (incorrect classification), this value is negative.
3. The Denominator (1 + np.exp(exponent))
- Math: This calculates the term $1 + e^{y\, w^T x}$.
- Context: This is the scaling factor from the logistic function. As we derived in the previous question, the gradient of the loss involves the derivative of the logarithm and the sigmoid function, which results in this term in the denominator.
4. The Final Gradient (_g)
- Math: $\_g = \frac{-y\, x}{1 + e^{y\, w^T x}}$.
- Context:
  - The gradient points in the direction of steepest ascent (how to increase error). Since we want to minimize error, we will eventually subtract this gradient (or add the negative gradient) during the weight update step (Gradient Descent).
  - Notice that coefficient is a scalar (a single number) and _x is a vector. Multiplying them scales the input vector _x by the magnitude of the error. If the model is very confident and correct (the exponent is a large positive number), the coefficient approaches 0 (little update). If the model is wrong (the exponent is a large negative number), the coefficient approaches $-y$ (the largest possible update).
Function: fit_BGD
def fit_BGD(self, X, y):
"""Train perceptron model on data (X,y) with Batch Gradient Descent.
Args:
X: An array of shape [n_samples, n_features].
y: An array of shape [n_samples,]. Only contains 1 or -1.
Returns:
self: Returns an instance of self.
"""
n_samples, n_features = X.shape
### YOUR CODE HERE
### END YOUR CODE
return self
Implementation
def fit_BGD(self, X, y):
"""Train perceptron model on data (X,y) with Batch Gradient Descent.
Args:
X: An array of shape [n_samples, n_features].
y: An array of shape [n_samples,]. Only contains 1 or -1.
Returns:
self: Returns an instance of self.
"""
n_samples, n_features = X.shape
### YOUR CODE HERE
# 1. Initialize weights (usually to zeros or small random values)
self.W = np.zeros(n_features)
# 2. Loop over epochs (iterations)
for _ in range(self.max_iter):
# Initialize a gradient accumulator for the batch
batch_grad = np.zeros(n_features)
# 3. Compute the gradient for the entire dataset
for i in range(n_samples):
# Accumulate the gradient contribution from each sample
batch_grad += self._gradient(X[i], y[i])
# 4. Average the gradient
# The in-sample error is the average over N samples, so we divide the sum by N
batch_grad /= n_samples
# 5. Update the weights
# Move in the opposite direction of the gradient (Descent)
self.W -= self.learning_rate * batch_grad
### END YOUR CODE
return self
Explanation of the Steps
1. Initialization (self.W = np.zeros(n_features))
- Concept: Before training begins, the weight vector $w$ must be initialized.
- Why: For logistic regression with cross-entropy error, the error surface is convex (it has only one global valley), so initializing the weights to zero is a safe and common practice that works well.
2. The Epoch Loop (for _ in range(self.max_iter))
- Concept: Gradient descent is an iterative algorithm. Each pass through the loop represents one step of moving the weights down the error surface.
- Why: We repeat the update max_iter times (or until the gradient is close to zero) to reach the minimum of the in-sample error.
3. Accumulating the Gradient (batch_grad += ...)
- Concept: Batch Gradient Descent (BGD) is defined by using all training examples to compute the gradient direction for a single update step.
- Why: The gradient of the total in-sample error $E_{in}(w)$ is the average of the gradients of the individual examples. We calculate this by summing self._gradient(X[i], y[i]) for every $i$ from 0 to $N-1$.
4. Averaging (batch_grad /= n_samples)
- Concept: The formal definition of in-sample error includes a factor of $\frac{1}{N}$.
- Math: $E_{in}(w) = \frac{1}{N}\sum_{n=1}^{N} \ln\left(1 + e^{-y_n w^T x_n}\right)$. Therefore, the full gradient is the sum of the per-sample gradients divided by $N$.
5. Weight Update (self.W -= self.learning_rate * batch_grad)
- Concept: We update the weights by taking a step of size $\eta$ (the learning rate) in the direction of the negative gradient ($-\nabla E_{in}$).
- Math: $w \leftarrow w - \eta\, \nabla E_{in}(w)$. This moves the weight vector "downhill" towards the minimum error.
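As an aside, the same batch update can be written without the inner Python loop by vectorizing the per-sample gradients. A hedged standalone sketch (learning_rate and max_iter here are placeholder arguments, not the class attributes):
import numpy as np
def fit_bgd_vectorized(X, y, learning_rate=0.5, max_iter=1000):
    """Sketch: batch gradient descent for logistic regression with labels in {+1, -1}."""
    n_samples, n_features = X.shape
    W = np.zeros(n_features)
    for _ in range(max_iter):
        margins = y * (X @ W)                    # y_n * w^T x_n for every sample
        coeffs = -y / (1.0 + np.exp(margins))    # per-sample scalar factors
        grad = (X.T @ coeffs) / n_samples        # average gradient, shape (n_features,)
        W -= learning_rate * grad
    return W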
Function: fit_miniBGD
def fit_miniBGD(self, X, y, batch_size):
"""Train perceptron model on data (X,y) with mini-Batch Gradient Descent.
Args:
X: An array of shape [n_samples, n_features].
y: An array of shape [n_samples,]. Only contains 1 or -1.
batch_size: An integer.
Returns:
self: Returns an instance of self.
"""
### YOUR CODE HERE
### END YOUR CODE
return self
Based on the description provided in the prompt and the concepts of Stochastic Gradient Descent (SGD) discussed in the Learning From Data textbook, Mini-Batch Gradient Descent is a middle ground between Batch Gradient Descent (BGD) and SGD.
In BGD, you update the weights once per epoch (using all data). In SGD, you update the weights after every single sample. In Mini-Batch Gradient Descent, you update the weights after each small batch of batch_size samples.
Implementation
def fit_miniBGD(self, X, y, batch_size):
"""Train perceptron model on data (X,y) with mini-Batch Gradient Descent.
Args:
X: An array of shape [n_samples, n_features].
y: An array of shape [n_samples,]. Only contains 1 or -1.
batch_size: An integer.
Returns:
self: Returns an instance of self.
"""
n_samples, n_features = X.shape
### YOUR CODE HERE
# 1. Initialize weights
self.W = np.zeros(n_features)
# 2. Loop over epochs
for _ in range(self.max_iter):
# 3. Shuffle the data at the start of each epoch
# This ensures the batches are different every time, preventing cycles.
indices = np.arange(n_samples)
np.random.shuffle(indices)
X_shuffled = X[indices]
y_shuffled = y[indices]
# 4. Loop over the data in chunks of 'batch_size'
for start_idx in range(0, n_samples, batch_size):
# Handle the last batch, which might be smaller than batch_size
end_idx = min(start_idx + batch_size, n_samples)
# Slice the current mini-batch
X_batch = X_shuffled[start_idx:end_idx]
y_batch = y_shuffled[start_idx:end_idx]
current_batch_size = end_idx - start_idx
# 5. Compute the gradient for this specific batch
batch_grad = np.zeros(n_features)
for i in range(current_batch_size):
batch_grad += self._gradient(X_batch[i], y_batch[i])
# Average the gradient over the batch size
batch_grad /= current_batch_size
# 6. Update the weights immediately
self.W -= self.learning_rate * batch_grad
### END YOUR CODE
return self
Step-by-Step Explanation
1. Initialization (self.W = np.zeros)
- Just like in Batch Gradient Descent, we start with zero weights.
2. The Epoch Loop (for _ in range(self.max_iter))
- An epoch represents one full pass through the entire dataset. In BGD, one epoch equals one weight update. In Mini-Batch, one epoch contains roughly n_samples / batch_size weight updates.
3. Shuffling (np.random.shuffle)
- The Logic: The prompt instructs to "reshuffle your samples before each epoch."
- Why: If you feed the batches in the exact same order every time, the model might get stuck in loops or learn the order of the data rather than the features. Shuffling adds the necessary randomness (stochasticity) that helps the algorithm escape local minima and converge faster.
4. The Batch Loop (range(0, n_samples, batch_size))
- This loop steps through the shuffled data.
- Indices: start_idx is the beginning of the batch and end_idx is the end. We use min to ensure we don't go out of bounds on the very last batch (e.g., if you have 105 samples and a batch size of 10, the last batch will only have 5 samples).
5. Accumulating Gradients (batch_grad += ...)
- Instead of summing gradients for the entire dataset (like BGD), we only sum the gradients for the samples in the current X_batch.
- We use the self._gradient function written earlier to calculate the contribution of each individual sample in the batch.
- Averaging: We divide by current_batch_size to get the average gradient direction for this specific group of points.
6. Weight Update (self.W -= ...)
- The Critical Difference: Notice that the indentation of this line is inside the batch loop, not the epoch loop.
- In Batch GD, you update weights once after seeing all samples.
- In Mini-Batch GD, you update weights immediately after seeing one batch. This causes the weights to "wiggle" towards the solution more frequently, often resulting in faster learning than BGD because you don't have to wait for the whole dataset to make a move.
Function: fit_SGD
def fit_SGD(self, X, y):
"""Train perceptron model on data (X,y) with Stochastic Gradient Descent.
Args:
X: An array of shape [n_samples, n_features].
y: An array of shape [n_samples,]. Only contains 1 or -1.
Returns:
self: Returns an instance of self.
"""
### YOUR CODE HERE
### END YOUR CODE
return self
Based on the definition of Stochastic Gradient Descent (SGD) provided in the Learning From Data textbook, specifically the section on "Stochastic Gradient Descent," here is the implementation and detailed explanation.
SGD differs from Batch Gradient Descent in that it updates the weights using the gradient of a single training example $(x_n, y_n)$ at a time, visited in a random order, rather than the average gradient over the entire dataset.
Implementation
def fit_SGD(self, X, y):
"""Train perceptron model on data (X,y) with Stochastic Gradient Descent.
Args:
X: An array of shape [n_samples, n_features].
y: An array of shape [n_samples,]. Only contains 1 or -1.
Returns:
self: Returns an instance of self.
"""
n_samples, n_features = X.shape
### YOUR CODE HERE
# 1. Initialize weights
self.W = np.zeros(n_features)
# 2. Loop over epochs
# An epoch is one full pass through the data.
for _ in range(self.max_iter):
# 3. Shuffle the data
# Essential in SGD to ensure the model sees samples in a random order
indices = np.arange(n_samples)
np.random.shuffle(indices)
# 4. Iterate through every single sample
for i in indices:
# 5. Compute gradient for ONE sample (X[i], y[i])
# This uses the gradient formula derived earlier: -y*x / (1 + e^(y*w^T*x))
grad = self._gradient(X[i], y[i])
# 6. Update weights immediately
# w = w - learning_rate * gradient
self.W -= self.learning_rate * grad
### END YOUR CODE
return self
Detailed Explanation of Each Step
1. Initialization (self.W = np.zeros)
- Concept: We start with an initial guess for the weights.
- Context: While random initialization is generally safer to avoid symmetric traps (especially in neural networks), initializing to zero is common and effective for logistic regression.
2. The Epoch Loop (for _ in range(self.max_iter))
- Concept: In SGD, an "iteration" often refers to a single weight update (seeing one sample), while an epoch refers to a full pass through the entire dataset.
- Why:
self.max_iterhere usually controls the number of epochs (passes over the data) to ensure the model sees the data enough times to converge.
3. Shuffling (np.random.shuffle)
- Concept: We randomize the order in which we visit the data points.
- Why: SGD relies on the randomness ("stochastic") of the sample selection. If you feed the data in a fixed order (e.g., all class +1, then all class -1), the weights might oscillate or drift in cycles rather than converging. Shuffling simulates picking a training example $(x_n, y_n)$ uniformly at random.
4. The Inner Loop (for i in indices)
- Concept: We iterate through the shuffled indices one by one.
- Difference from BGD: In Batch Gradient Descent, the weights are updated only once the gradient has been accumulated over the whole dataset $X$. In SGD, you must loop through the samples individually and update after each one.
5. Gradient Calculation (grad = self._gradient(X[i], y[i]))
- Math: You compute $\nabla e_n(w)$, the gradient of the error for just the $n$-th data point.
- Formula: As derived in previous steps, this is $\nabla e_n(w) = \frac{-y_n x_n}{1 + e^{y_n w^T x_n}}$.
6. Weight Update (self.W -= self.learning_rate * grad)
- Math: $w \leftarrow w - \eta\, \nabla e_n(w)$.
- Concept: You update the weights immediately after calculating the error for that single point.
- Behavior: This makes the path of the weights "wiggly." Instead of moving directly down the true gradient (like BGD), the weights move based on what is good for just one example. On average, these random fluctuations cancel out, and the weights move toward the global minimum.
Function: predict_proba
def predict_proba(self, X):
"""Predict class probabilities for samples in X.
Args:
X: An array of shape [n_samples, n_features].
Returns:
preds_proba: An array of shape [n_samples, 2].
Only contains floats between [0,1].
"""
### YOUR CODE HERE
### END YOUR CODE
This function computes the probability $P(y = +1 \mid x) = \theta(w^T x)$ for each sample and returns the probabilities of both classes.
def predict_proba(self, X):
"""Predict class probabilities for samples in X.
Args:
X: An array of shape [n_samples, n_features].
Returns:
preds_proba: An array of shape [n_samples, 2].
Only contains floats between [0, 1].
"""
### YOUR CODE HERE
# 1. Compute the linear signal s = w^T * x
signal = np.dot(X, self.W)
# 2. Compute P(y=1|x) using the sigmoid function
prob_pos = 1 / (1 + np.exp(-signal))
# 3. Compute P(y=-1|x)
# Since probabilities sum to 1: P(y=-1) = 1 - P(y=1)
prob_neg = 1 - prob_pos
# 4. Stack them into a matrix of shape [n_samples, 2]
# Column 0 corresponds to the negative class (-1)
# Column 1 corresponds to the positive class (1)
preds_proba = np.column_stack((prob_neg, prob_pos))
return preds_proba
### END YOUR CODE
Explanation:
- Signal: We calculate the "linear signal" $s = Xw$ for all samples at once.
- Sigmoid: We apply the logistic function to transform the signal into a probability between 0 and 1. This represents the likelihood that the label is +1, i.e., $P(y = +1 \mid x) = \theta(w^T x)$.
- Column Stack: The function returns a 2D array where the first column represents the probability of the negative class and the second column represents the probability of the positive class.
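One practical caveat (an aside, not required by the assignment): for very negative signals, np.exp(-signal) can overflow and print warnings. A hedged way to get the same probabilities more safely is to clip the signal, or to use scipy.special.expit if SciPy is available:
import numpy as np
signal = np.array([-1000.0, -2.0, 0.0, 2.0, 1000.0])          # example signal values
prob_pos = 1 / (1 + np.exp(-np.clip(signal, -500, 500)))      # clip so exp never overflows
print(prob_pos)                                               # values in [0, 1], no warnings
# Alternative, if SciPy is installed:
# from scipy.special import expit
# prob_pos = expit(signal)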
Function: predict
def predict(self, X):
"""Predict class labels for samples in X.
Args:
X: An array of shape [n_samples, n_features].
Returns:
preds: An array of shape [n_samples,]. Only contains 1 or -1.
"""
### YOUR CODE HERE
### END YOUR CODE
This function returns the binary class labels. As discussed in the slides, the decision boundary is linear. We assign the class based on which probability is higher, which is equivalent to checking whether the linear signal is non-negative ($w^T x \geq 0$).
def predict(self, X):
"""Predict class labels for samples in X.
Args:
X: An array of shape [n_samples, n_features].
Returns:
preds: An array of shape [n_samples,]. Only contains 1 or -1.
"""
### YOUR CODE HERE
# 1. Compute the linear signal s = w^T * x
signal = np.dot(X, self.W)
# 2. Apply the threshold
# If w^T * x >= 0, then P(y=1|x) >= 0.5, so we predict 1.
# Otherwise, we predict -1.
preds = np.where(signal >= 0, 1, -1)
return preds
### END YOUR CODE
Explanation:
- Linear Decision: Instead of computing the expensive exponential for the probability and then checking whether it is $\geq 0.5$, we check the sign of the signal directly: $w^T x \geq 0$ implies $\theta(w^T x) \geq 0.5$.
- np.where: This efficiently vectorizes the conditional logic, returning 1 where the condition is true and -1 otherwise.
Function: score
def score(self, X, y):
"""Returns the mean accuracy on the given test data and labels.
Args:
X: An array of shape [n_samples, n_features].
y: An array of shape [n_samples,]. Only contains 1 or -1.
Returns:
score: A float. Mean accuracy of self.predict(X) w.r.t. y.
"""
### YOUR CODE HERE
### END YOUR CODE
This function calculates the mean accuracy of the model, quantifying how well the predictions match the true labels.
def score(self, X, y):
"""Returns the mean accuracy on the given test data and labels.
Args:
X: An array of shape [n_samples, n_features].
y: An array of shape [n_samples,]. Only contains 1 or -1.
Returns:
score: A float. Mean accuracy of self.predict(X) w.r.t. y.
"""
### YOUR CODE HERE
# 1. Generate predictions using the trained model
predictions = self.predict(X)
# 2. Compare predictions to true labels
# (predictions == y) creates a boolean array (True for correct, False for incorrect)
# np.mean converts True to 1 and False to 0, then calculates the average
score = np.mean(predictions == y)
return score
### END YOUR CODE
Explanation:
- Accuracy: This implements the standard definition of classification accuracy: the fraction of correctly classified data points.
main.py
Function: visualize_result
def visualize_result(X, y, W):
'''This function is used to plot the sigmoid model after training.
Args:
X: An array of shape [n_samples, 2].
y: An array of shape [n_samples,]. Only contains 1 or -1.
W: An array of shape [n_features,].
Returns:
No return. Save the plot to 'train_result_sigmoid.*' and include it
in submission.
'''
### YOUR CODE HERE
### END YOUR CODE
This function visualizes the decision boundary learned by the Logistic Regression model. As established in the sources, the decision boundary for a linear model (including Logistic Regression) is defined where the "signal" equals zero: $w^T x = 0$.
In a 2D feature space with an added bias term ($x_0 = 1$), this becomes $w_0 + w_1 x_1 + w_2 x_2 = 0$.
To plot this as a line, we solve for the vertical-axis feature ($x_2$, intensity): $x_2 = -\frac{w_1 x_1 + w_0}{w_2}$.
Implementation
import numpy as np
import matplotlib.pyplot as plt
def visualize_result(X, y, W):
'''This function is used to plot the sigmoid model after training.
Args:
X: An array of shape [n_samples, 2]. (Symmetry, Intensity)
y: An array of shape [n_samples,]. Only contains 1 or -1.
W: An array of shape [n_features,]. (Bias, w_symmetry, w_intensity)
Returns:
No return. Save the plot to 'train_result_sigmoid.*' and include it
in submission.
'''
### YOUR CODE HERE
# 1. Plot the data points (Same as visualize_features)
plt.figure(figsize=(8, 6))
# Separate positive (Digit 1) and negative (Digit 2) classes
X_pos = X[y == 1]
X_neg = X[y == -1]
# Plot scatter points
plt.scatter(X_pos[:, 0], X_pos[:, 1], color='blue', marker='o', label='Digit 1 (+1)')
plt.scatter(X_neg[:, 0], X_neg[:, 1], color='red', marker='x', label='Digit 2 (-1)')
# 2. Calculate the Decision Boundary
# The boundary is where w^T * x = 0 => w0 + w1*x1 + w2*x2 = 0
# We solve for x2 (Intensity): x2 = -(w1*x1 + w0) / w2
# Define a range for x1 (Symmetry) based on the data
x1_min, x1_max = np.min(X[:, 0]), np.max(X[:, 0])
x1_values = np.linspace(x1_min, x1_max, 100)
# Calculate corresponding x2 (Intensity) values
# W[0] is the bias weight, W[1] is the Symmetry weight, W[2] is the Intensity weight
x2_values = -(W[1] * x1_values + W[0]) / W[2]
# 3. Plot the Decision Boundary
plt.plot(x1_values, x2_values, color='green', linewidth=2, label='Decision Boundary')
# 4. Add labels, legend, and save
plt.xlabel('Symmetry')
plt.ylabel('Intensity')
plt.title('Logistic Regression: Training Data and Decision Boundary')
plt.legend()
plt.ylim(np.min(X[:, 1]) - 0.5, np.max(X[:, 1]) + 0.5) # restrict y-axis to data range
plt.savefig('train_result_sigmoid.png')
# plt.show()
### END YOUR CODE
Step-by-Step Explanation:
- Scatter Plot: We plot the 1s and 2s just like in the previous visualization function to show the distribution of the training data.
- Deriving the Line:
  - W contains 3 weights: W[0] (bias), W[1] (symmetry), W[2] (intensity).
  - We create a range of x-coordinates (x1_values) spanning the width of the data.
  - We use the derived formula $x_2 = -\frac{w_1 x_1 + w_0}{w_2}$ to generate the y-coordinates (x2_values).
- Plotting: We plot this line in green. Any point on one side of this line has probability $\theta(w^T x) \geq 0.5$ (prediction +1), and any point on the other side has probability $< 0.5$ (prediction -1).
Function: main()
Based on the principles of Validation and Testing discussed in Chapter 4 of the Learning From Data textbook (specifically Section 4.3 on Model Selection), here is the implementation for the final section of main().
This code performs hyperparameter tuning using the validation set, visualizes the best result, and finally estimates the out-of-sample error ($E_{out}$) on the held-out test set.
Implementation
# Explore different hyper-parameters.
### YOUR CODE HERE
# 1. Define a search space for hyperparameters (e.g., learning rate)
learning_rates = [0.01, 0.05, 0.1, 0.5, 1.0]
best_acc = 0
best_logisticR = None
print("--- Hyperparameter Tuning ---")
# 2. Iterate through hyperparameters to find the best model
for lr in learning_rates:
# Initialize model
model = logistic_regression(learning_rate=lr, max_iter=1000)
# Train on Training Set (using BGD as a baseline)
model.fit_BGD(train_X, train_y)
# Evaluate on Validation Set (NOT Test Set - avoid data snooping)
val_acc = model.score(valid_X, valid_y)
print(f"Learning Rate: {lr}, Validation Accuracy: {val_acc}")
# Track the best model
if val_acc > best_acc:
best_acc = val_acc
best_logisticR = model
print(f"Best Validation Accuracy: {best_acc} with LR: {best_logisticR.learning_rate}")
### END YOUR CODE
# Visualize your 'best' model after training.
### YOUR CODE HERE
# Pass the features (columns 1 and 2) and the learned weights (W)
# get_params() presumably returns self.W
visualize_result(train_X[:, 1:3], train_y, best_logisticR.get_params())
### END YOUR CODE
# Use the 'best' model above to do testing. Note that the test data should be loaded and processed in the same way as the training data.
### YOUR CODE HERE
print("--- Testing ---")
# 1. Load the raw test data
# We assume 'test_filename' is defined in the imports/config section similar to 'train_filename'
raw_test_data, test_labels = load_data(os.path.join(data_dir, test_filename))
# 2. Extract Features (Symmetry and Intensity)
# Must use the exact same prepare_X function as training
test_X_all = prepare_X(raw_test_data)
# 3. Filter for Class 1 and 2
# prepare_y returns indices for these classes
test_y_all, test_idx = prepare_y(test_labels)
# 4. Apply filter to features and labels
test_X = test_X_all[test_idx]
test_y = test_y_all[test_idx]
# 5. Convert labels to binary {-1, 1}
# Map label '2' to '-1' (Digit 1 is positive class)
test_y[test_y == 2] = -1
# 6. Evaluate the best model on the Test Set
test_acc = best_logisticR.score(test_X, test_y)
print(f"Final Test Accuracy: {test_acc}")
### END YOUR CODE
Explanation of Steps
- Hyperparameter Tuning (Validation):
  - Concept: As described in Chapter 4 (Validation), we cannot use the Test Set to choose parameters (like the learning rate) because that constitutes data snooping.
  - Action: We loop through different learning rates, train on `train_X`, and evaluate on `valid_X`. The model that performs best on the validation set is selected as `best_logisticR`.
- Visualization:
  - Concept: We visualize the decision boundary to verify whether the linear model effectively separates the two classes in the feature space (Symmetry vs. Intensity).
  - Action: We call `visualize_result` with the best model's weights. We pass `train_X[:, 1:3]` because column 0 of `train_X` is the bias term ($x_0 = 1$), while the scatter plot only needs the two features ($x_1$ = Symmetry, $x_2$ = Intensity).
- Testing:
  - Concept: The Test Set is used only once at the very end to estimate $E_{out}$ (out-of-sample error). This provides an unbiased estimate of how the model will perform on unseen data.
  - Action:
    - Consistency: It is crucial that the test data undergoes the exact same preprocessing as the training data. We call `prepare_X` to extract features and `prepare_y` to filter for digits 1 and 2.
    - Label Mapping: We must convert the raw labels (1 and 2) to the binary format (+1 and -1) used by the logistic regression logic.
    - Scoring: We calculate the final accuracy using `score()`.
4. Softmax Logistic Regression
LRM.py
Function: _gradient
def _gradient(self, _x, _y):
"""Compute the gradient of cross-entropy with respect to self.W
for one training sample (_x, _y). This function is used in fit_*.
Args:
_x: An array of shape [n_features,].
_y: One_hot vector.
Returns:
_g: An array of shape [n_features,]. The gradient of
cross-entropy with respect to self.W.
"""
### YOUR CODE HERE
### END YOUR CODE
Although the provided docstring states the return shape is [n_features,], Softmax Regression typically requires a weight matrix of shape [n_features, n_classes] (one weight vector per class). Consequently, the gradient should also be a matrix of the same shape. I will implement it returning the matrix, as this is mathematically required for the one-hot vector input.
Implementation
def _gradient(self, _x, _y):
"""Compute the gradient of cross-entropy with respect to self.W
for one training sample (_x, _y). This function is used in fit_*.
Args:
_x: An array of shape [n_features,].
_y: One_hot vector.
Returns:
_g: An array of shape [n_features,]. The gradient of
cross-entropy with respect to self.W.
"""
### YOUR CODE HERE
# Assumes self.W has shape (n_features, n_classes)
# Compute the Softmax probabilities
probs = self.softmax(_x)
# Compute the error term (prediction prob - target)
error = probs - _y
# probs is the vector of class probabilities
# Compute the gradient (outer product of the input x and the error vector)
_g = np.outer(_x, error) # shape: (n_features, k)
return _g
Detailed Step-by-Step Explanation
1. Compute the Linear Signal ($z = W^T x$)
- Math: In Softmax regression, we have a separate set of weights $w_k$ for each class $k$. We compute a "score" (logit) for each class by taking the dot product of the input feature vector `_x` and the weight vector for that class.
- Code: this happens inside `self.softmax(_x)`, which computes `s = _x @ self.W`. Since `self.W` is a matrix where each column represents a class, this operation calculates the scores for all classes simultaneously.
2. Apply the Softmax Function
- Math: To convert the scores $z$ into probabilities (values between 0 and 1 that sum to 1), we use the Softmax function: $$ P(y=k|x) = \frac{e^{z_k}}{\sum_{j} e^{z_j}} $$
- Code (inside `softmax`): `s - np.max(s)` is a standard numerical stability trick. Subtracting the maximum value doesn't change the result of the softmax (it cancels out in the numerator and denominator) but prevents `np.exp()` from returning `inf` for large scores. `soft_probs = exp_s / np.sum(exp_s)` then calculates the probability vector.
- Look at 03 - Logistic Regression - From Binary to Multi-Class#Multi-Class Logistic Regression
3. Compute the Error ($\hat{y} - y$)
- Math: The goal of training is to minimize the Cross-Entropy Loss: $$ E = - \sum_{k} y_k \ln(\hat{y}_k) $$ When you take the derivative of this loss with respect to the scores $z$, the math simplifies beautifully to just the difference between the prediction and the target: $$ \frac{\partial E}{\partial z} = \hat{y} - y $$
- Code: `error = probs - _y`. Since `_y` is a one-hot vector and `probs` is the predicted distribution (e.g., `[0.1, 0.7, 0.2]`), this calculates the gradient for every class output.
4. Compute the Gradient w.r.t. the Weights ($\nabla W$)
- Math: Using the chain rule, $\frac{\partial E}{\partial w_k} = \frac{\partial E}{\partial z_k} \cdot \frac{\partial z_k}{\partial w_k}$. Since $z_k = w_k^T x$, the derivative $\frac{\partial z_k}{\partial w_k}$ is just $x$. Therefore, the gradient for the weights is the input multiplied by the error term: $$ \nabla W = x (\hat{y} - y)^T $$
- Code: `_g = np.outer(_x, error)`. This creates a matrix where column $k$ is the input vector `_x` scaled by the error for class $k$. It matches the shape of `self.W` and indicates the direction to adjust the weights to reduce the error. A quick numerical check of this formula follows below.
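As a quick sanity check (a standalone sketch, not part of `LRM.py`; all sample values below are made up), the analytic gradient $x(\hat{y} - y)^T$ can be compared against a finite-difference estimate of the cross-entropy loss:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(W, x, y):
    # loss for a single sample with one-hot target y
    return -np.sum(y * np.log(softmax(x @ W)))

rng = np.random.default_rng(0)
x = rng.normal(size=3)                 # 3 made-up features
y = np.array([0.0, 1.0, 0.0])          # one-hot target: class 1
W = rng.normal(size=(3, 3))            # (n_features, n_classes)

analytic = np.outer(x, softmax(x @ W) - y)

numeric = np.zeros_like(W)
eps = 1e-6
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        numeric[i, j] = (cross_entropy(Wp, x, y) - cross_entropy(Wm, x, y)) / (2 * eps)

print(np.abs(analytic - numeric).max())   # ~1e-9: analytic and numeric gradients agree
```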
Function: fit_miniBGD
def fit_miniBGD(self, X, labels, batch_size):
"""Train perceptron model on data (X,y) with mini-Batch GD.
Args:
X: An array of shape [n_samples, n_features].
labels: An array of shape [n_samples,]. Only contains 0,..,k-1.
batch_size: An integer.
Returns:
self: Returns an instance of self.
Hint: the labels should be converted to one-hot vectors, for example: 1----> [0,1,0]; 2---->[0,0,1].
"""
### YOUR CODE HERE
### END YOUR CODE
Implementation
def fit_miniBGD(self, X, labels, batch_size):
"""Train perceptron model on data (X,y) with mini-Batch GD.
Args:
X: An array of shape [n_samples, n_features].
labels: An array of shape [n_samples,]. Only contains 0,..,k-1.
batch_size: An integer.
Returns:
self: Returns an instance of self.
Hint: the labels should be converted to one-hot vectors, for example: 1----> [0,1,0]; 2----> [0,0,1].
"""
### YOUR CODE HERE
n_samples, n_features = X.shape
# 1. Initialize Weights
# Unlike binary logistic regression (vector), Softmax uses a weight matrix.
# Shape: (n_features, k) where k is the number of classes.
self.W = np.zeros((n_features, self.k))
# 2. One-Hot Encoding of Labels
# We need to convert integer labels (e.g., 2) into one-hot vectors (e.g., [0, 0, 1]).
# Create a matrix of zeros of shape (n_samples, k)
y_one_hot = np.zeros((n_samples, self.k))
# Set the index corresponding to the label to 1 for each sample
y_one_hot[np.arange(n_samples), labels] = 1
# 3. Loop over epochs
for _ in range(self.max_iter):
# 4. Shuffle data
# Essential for Mini-Batch/SGD to ensure batches are random representatives
perm_indices = np.random.permutation(n_samples)
X_shuffled = X[perm_indices]
y_shuffled = y_one_hot[perm_indices]
# 5. Iterate over mini-batches
for start_idx in range(0, n_samples, batch_size):
# Handle the last batch (might be smaller than batch_size)
end_idx = min(start_idx + batch_size, n_samples)
# Slice the batch
X_batch = X_shuffled[start_idx:end_idx]
y_batch = y_shuffled[start_idx:end_idx]
current_batch_size = end_idx - start_idx
# 6. Accumulate Gradients for this batch
# Initialize a zero gradient matrix of shape (n_features, k)
batch_grad_sum = np.zeros((n_features, self.k))
# Sum the gradients for every sample in the batch
for i in range(current_batch_size):
# Use the _gradient helper implemented previously
# It returns the gradient contribution of a single sample
batch_grad_sum += self._gradient(X_batch[i], y_batch[i])
# 7. Average the gradient
# The gradient of the loss is the average over the batch samples
avg_grad = batch_grad_sum / current_batch_size
# 8. Update Weights
# Move W in the opposite direction of the gradient
self.W -= self.learning_rate * avg_grad
### END YOUR CODE
return self
Detailed Explanation of Each Step
- Initialize Weights (`self.W = np.zeros(...)`):
  - Concept: In Softmax Regression, we have a separate weight vector for each class $k$. Therefore, `self.W` becomes a matrix of size `[n_features, k]`. Initializing to zeros is a standard starting point for convex problems like logistic regression.
- One-Hot Encoding:
  - Math: The mathematical derivation of the cross-entropy gradient relies on the target $y$ being a probability vector. If the true class is $c$, the target vector should have a 1 at index $c$ and 0 elsewhere.
  - Code: `y_one_hot[np.arange(n_samples), labels] = 1` efficiently creates this matrix. If sample 0 has label 2, this sets `y_one_hot[0, 2] = 1` (see the small sketch after this list).
- Shuffle (`np.random.permutation`):
  - Reason: As discussed in Chapter 3 regarding SGD, randomness is crucial. If we process the data in the same order every epoch, the model might learn the sequence of the data rather than the features, or get stuck in cycles. Shuffling ensures that each mini-batch is a random sample of the true population.
- Batch Loop:
  - We iterate through the dataset in chunks of size `batch_size`. The logic `min(start_idx + batch_size, n_samples)` ensures we don't crash when the dataset size isn't perfectly divisible by the batch size (handling the last "remainder" batch).
- Accumulate Gradients:
  - Math: The gradient of the error for a batch is the sum (or average) of the gradients of the individual samples in that batch.
  - Code: We loop through the current batch, call `self._gradient` (which you implemented in the previous step to calculate $x(\hat{y} - y)^T$), and add it to `batch_grad_sum`.
- Average Gradient:
  - Math: We divide by `current_batch_size`. This normalizes the update so that the magnitude of the step doesn't depend on how many samples are in the batch. This effectively estimates the "true" gradient of the expected error.
- Weight Update:
  - Math: $W \leftarrow W - \eta \, \nabla W$, where $\eta$ is the learning rate.
  - We subtract the scaled gradient to move down the error surface toward the minimum. Note that `self.W` and `avg_grad` are both matrices of shape `[n_features, k]`, so this updates all class weights simultaneously.
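The one-hot encoding trick referenced above, as a minimal standalone sketch (the labels here are made up):

```python
import numpy as np

labels = np.array([0, 2, 1, 2])              # made-up integer labels, k = 3 classes
one_hot = np.zeros((labels.size, 3))
one_hot[np.arange(labels.size), labels] = 1  # set a 1 at the column given by each label
print(one_hot)
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]
#  [0. 0. 1.]]
```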
Function: softmax
def softmax(self, x):
"""Compute softmax values for each sets of scores in x."""
### You must implement softmax by youself, otherwise you will not get credits for this part.
### YOUR CODE HERE
# Compute the linear scores: s = x @ W
# Single sample: (n_features,) @ (n_features, k) -> (k,)
# Batch:         (n_samples, n_features) @ (n_features, k) -> (n_samples, k)
s = x @ self.W
# Numerical stability: subtract the max score (per row) so np.exp cannot overflow;
# this does not change the softmax output.
s = s - np.max(s, axis=-1, keepdims=True)
# Softmax: exp(s) / sum(exp(s)), normalized over the class axis
exp_s = np.exp(s)
soft_probs = exp_s / np.sum(exp_s, axis=-1, keepdims=True)
return soft_probs
### END YOUR CODE
Given a $K$-dimensional score vector $z = W^T x$, the softmax function maps the scores to a probability vector:
$$ \text{softmax}(z)_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}} $$
The multi-class prediction function can be written as
$$ \hat{y} = \arg\max_{k} \ \text{softmax}(z)_k = \arg\max_{k} \ w_k^T x $$
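A quick standalone illustration of the stability trick (the scores here are made up): exponentiating large raw scores overflows, while subtracting the maximum first gives the same probabilities without overflow.

```python
import numpy as np

z = np.array([1000.0, 1001.0, 1002.0])    # made-up large scores
# np.exp(z) would overflow to inf, producing nan probabilities.
stable = np.exp(z - np.max(z)) / np.sum(np.exp(z - np.max(z)))
print(stable)                             # [0.09003057 0.24472847 0.66524096], sums to 1
```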
Function: predict
def predict(self, X):
"""Predict class labels for samples in X.
Args:
X: An array of shape [n_samples, n_features].
Returns:
preds: An array of shape [n_samples,]. Only contains 0,..,k-1.
"""
### YOUR CODE HERE
### END YOUR CODE
Implementation
def predict(self, X):
"""Predict class labels for samples in X.
Args:
X: An array of shape [n_samples, n_features].
Returns:
preds: An array of shape [n_samples,]. Only contains 0,..,k-1.
"""
### YOUR CODE HERE
# 1. Compute the linear scores (logits) for all classes
# X shape: (n_samples, n_features)
# self.W shape: (n_features, k)
# Result scores shape: (n_samples, k)
scores = np.dot(X, self.W)
# 2. Determine the predicted class
# The predicted class is the index of the highest score.
# We assume the index corresponds to the label (0, 1, ..., k-1).
# axis=1 means we find the max across columns (classes) for each row.
preds = np.argmax(scores, axis=1)
return preds
### END YOUR CODE
- Compute Scores (`np.dot(X, self.W)`):
  - Concept: As discussed in the "Linear Models" slides, the core of the classifier is the linear signal. In the multiclass case, we have $k$ signals, one for each class. We compute the score for every class simultaneously using matrix multiplication.
  - Why not Softmax?: While Softmax converts these scores into probabilities (0 to 1), the order of the values does not change because the exponential function is strictly monotonic. Therefore, the class with the highest probability is always the class with the highest linear score.
  - Efficiency: We skip the expensive `np.exp()` calculation required for Softmax because `argmax(scores)` yields the exact same prediction as `argmax(softmax(scores))`, but faster (a quick check follows below).
- Argmax (`np.argmax(scores, axis=1)`):
  - Concept: The decision rule for classification is to pick the most probable class.
  - Implementation: `scores` is a matrix where row $i$ contains the scores for sample $i$ across all classes. `np.argmax(..., axis=1)` returns the index of the maximum value in each row. Since the labels are `0, ..., k-1`, these indices are exactly the predicted class labels.
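A small standalone check of the argmax claim (the scores are made up): softmax preserves the ordering of the scores, so both argmax calls agree.

```python
import numpy as np

scores = np.array([[ 2.0, -1.0, 0.5],
                   [ 0.1,  0.3, 0.2]])     # made-up scores for 2 samples, 3 classes
soft = np.exp(scores) / np.sum(np.exp(scores), axis=1, keepdims=True)
print(np.argmax(scores, axis=1))           # [0 1]
print(np.argmax(soft, axis=1))             # [0 1]  -- identical predictions
```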
Function: score
def score(self, X, labels):
"""Returns the mean accuracy on the given test data and labels.
Args:
X: An array of shape [n_samples, n_features].
labels: An array of shape [n_samples,]. Only contains 0,..,k-1.
Returns:
score: An float. Mean accuracy of self.predict(X) wrt. labels.
"""
### YOUR CODE HERE
### END YOUR CODE
Implementation
def score(self, X, labels):
"""Returns the mean accuracy on the given test data and labels.
Args:
X: An array of shape [n_samples, n_features].
labels: An array of shape [n_samples,]. Only contains 0,..,k-1.
Returns:
score: An float. Mean accuracy of self.predict(X) wrt. labels.
"""
### YOUR CODE HERE
# 1. Generate predictions using the function implemented above
predictions = self.predict(X)
# 2. Calculate accuracy
# Compare predictions to ground truth labels.
# np.mean converts boolean (True/False) to float (1.0/0.0) and averages them.
score = np.mean(predictions == labels)
return score
### END YOUR CODE
- Generate Predictions:
  - We simply call `self.predict(X)` to get the vector of predicted labels for the test set.
- Calculate Accuracy (`np.mean(predictions == labels)`):
  - Concept: This implements the standard accuracy metric defined in the problem description and implied in the "Error Measures" section of the text. It represents the fraction of correctly classified examples.
  - Operation: `predictions == labels` creates a boolean array (e.g., `[True, False, True]`). `np.mean` treats `True` as 1 and `False` as 0, so the average is the fraction of correct answers (e.g., $2/3 \approx 0.667$), as in the toy example below.
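For instance (a made-up toy case):

```python
import numpy as np

predictions = np.array([1, 0, 2])
labels      = np.array([1, 2, 2])
print(np.mean(predictions == labels))   # 0.666...  (2 of 3 correct)
```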
main.py
Function: visualize_result_multi
def visualize_result_multi(X, y, W):
'''This function is used to plot the softmax model after training.
Args:
X: An array of shape [n_samples, 2].
y: An array of shape [n_samples,]. Only contains 0,1,2.
W: An array of shape [n_features, 3].
Returns:
No return. Save the plot to 'train_result_softmax.*' and include it
in submission.
'''
### YOUR CODE HERE
### END YOUR CODE
Based on the principles of Softmax Regression (a multiclass linear model) discussed in the course notes, the decision boundary is determined by comparing the "scores" of each class: a point $x$ is assigned to class $k$ when $w_k^T x$ is the largest of the scores.
Because there are 3 classes (0, 1, 2) and multiple linear boundaries interacting (Class 0 vs 1, 1 vs 2, 0 vs 2), the best way to visualize the result is to draw the decision regions. We do this by creating a fine grid of points over the plot area, predicting the class for every point on that grid, and coloring the background based on those predictions.
Here is the implementation and a detailed step-by-step explanation.
Implementation
import matplotlib.pyplot as plt
def visualize_result_multi(X, y, W):
'''This function is used to plot the softmax model after training.
Args:
X: An array of shape [n_samples, 2]. (Symmetry, Intensity)
y: An array of shape [n_samples,]. Only contains 0, 1, 2.
W: An array of shape [n_features, 3]. (Bias + 2 features x 3 classes)
Returns:
No return. Save the plot to 'train_result_softmax.png' and include it
in submission.
'''
### YOUR CODE HERE
plt.figure(figsize=(8, 6))
# 1. Create a Meshgrid
# Define the boundaries of the plot based on the data range, adding some padding.
x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
# Generate a grid of points with 0.02 distance between them
h = 0.02
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
# 2. Prepare the Grid for Prediction
# Flatten the grid matrices into a single list of (x, y) coordinates
grid_points = np.c_[xx.ravel(), yy.ravel()]
# Add the bias term (column of 1s) to match the shape of W (3 rows)
# grid_points becomes shape [n_grid_points, 3]
grid_points_bias = np.c_[np.ones(grid_points.shape[0]), grid_points]
# 3. Predict the Class for Every Point on the Grid
# Compute scores: Z = X * W. Shape: [n_grid_points, 3]
scores = np.dot(grid_points_bias, W)
# Select class with highest score. Shape: [n_grid_points,]
Z = np.argmax(scores, axis=1)
# Reshape the predictions back to the shape of the meshgrid
Z = Z.reshape(xx.shape)
# 4. Plot the Decision Regions (Contours)
# Use contourf to fill the regions with colors corresponding to classes
plt.contourf(xx, yy, Z, alpha=0.4, cmap=plt.cm.Paired)
# 5. Plot the Training Data Points
# Scatter plot the actual data points, colored by their true label 'y'
# We loop through classes to add a legend
colors = ['blue', 'red', 'green']
labels = ['Digit 0', 'Digit 1', 'Digit 2']
for i in range(3):
# Select points belonging to class i
idx = np.where(y == i)
plt.scatter(X[idx, 0], X[idx, 1], c=colors[i], label=labels[i], edgecolor='k', s=20)
# 6. Formatting and Saving
plt.xlabel('Symmetry')
plt.ylabel('Intensity')
plt.title('Softmax Regression Decision Boundaries')
plt.legend()
plt.savefig('train_result_softmax.png')
# plt.show()
### END YOUR CODE
Explanation of Each Step
- Create a Meshgrid (`np.meshgrid`):
  - Context: Unlike the binary case, where we could solve a single equation $w^T x = 0$ to get a line, here we have three boundaries. The region for Class 0 is where $w_0^T x > w_1^T x$ and $w_0^T x > w_2^T x$.
  - Action: Instead of calculating lines mathematically, we define a "mesh" (a fine net) covering the entire area where the data points sit. We will ask the model to classify every intersection point on this net.
- Prepare Grid for Prediction (`grid_points_bias`):
  - Context: The model weights `W` have shape `[3, 3]` (Bias, Feature1, Feature2 for 3 classes).
  - Action: The meshgrid gives us just the 2D coordinates ($x_1, x_2$). We flatten this into a long list of points and add a column of 1s ($x_0 = 1$) so that we can perform the matrix multiplication with `W`.
- Predict Class (`np.argmax`):
  - Math: For every point on the grid, we calculate 3 scores: $s_k = w_k^T x$ for $k = 0, 1, 2$.
  - Logic: The predicted class is the index of the maximum score.
  - Result: `Z` contains the predicted class (0, 1, or 2) for every pixel in our plot background.
- Plot Contours (`plt.contourf`):
  - Action: This function draws filled regions. It looks at the predictions in `Z` and colors the area blue where the prediction is 0, brown/red where it is 1, etc. This visualizes the decision surface and the linear boundaries separating the classes.
- Plot Training Data:
  - Action: Finally, we overlay the actual data points `X`. If the model is good, the blue points (Digit 0) should mostly fall inside the blue region, the red points (Digit 1) inside the red region, and so on. A shape walk-through of the grid pipeline is sketched below.
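A standalone shape walk-through of the grid pipeline (tiny grid, random stand-in weights; none of these values come from the trained model):

```python
import numpy as np

xx, yy = np.meshgrid(np.linspace(-1, 1, 4), np.linspace(-1, 1, 3))
grid = np.c_[xx.ravel(), yy.ravel()]              # (12, 2): every grid coordinate pair
grid_bias = np.c_[np.ones(grid.shape[0]), grid]   # (12, 3): bias column prepended
W = np.random.default_rng(0).normal(size=(3, 3))  # stand-in weight matrix
Z = np.argmax(grid_bias @ W, axis=1).reshape(xx.shape)
print(xx.shape, grid_bias.shape, Z.shape)         # (3, 4) (12, 3) (3, 4)
```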
Function: main()
# ------------Logistic Regression Multiple-class case, let k= 3------------
###### Use all data from '0' '1' '2' for training
train_X = train_X_all
train_y = train_y_all
valid_X = valid_X_all
valid_y = valid_y_all
######### miniBGD for multiclass Logistic Regression
logisticR_classifier_multiclass = logistic_regression_multiclass(learning_rate=0.5, max_iter=100, k= 3)
logisticR_classifier_multiclass.fit_miniBGD(train_X, train_y, 10)
print(logisticR_classifier_multiclass.get_params())
print(logisticR_classifier_multiclass.score(train_X, train_y))
# Explore different hyper-parameters.
### YOUR CODE HERE
### END YOUR CODE
# Visualize the your 'best' model after training.
# visualize_result_multi(train_X[:, 1:3], train_y, best_logistic_multi_R.get_params())
# Use the 'best' model above to do testing.
### YOUR CODE HERE
### END YOUR CODE
Implementation
# ------------Logistic Regression Multiple-class case, let k= 3------------
###### Use all data from '0' '1' '2' for training
# Unlike the binary case, we do not filter for specific digits here.
# We use the full dataset processed earlier.
train_X = train_X_all
train_y = train_y_all
valid_X = valid_X_all
valid_y = valid_y_all
######### miniBGD for multiclass Logistic Regression
# Initial test run to verify the code works
logisticR_classifier_multiclass = logistic_regression_multiclass(learning_rate=0.5, max_iter=100, k= 3)
logisticR_classifier_multiclass.fit_miniBGD(train_X, train_y, 10)
print("Initial weights:", logisticR_classifier_multiclass.get_params())
print("Initial Train Score:", logisticR_classifier_multiclass.score(train_X, train_y))
# Explore different hyper-parameters.
### YOUR CODE HERE
print("\n--- Multiclass Hyperparameter Tuning ---")
learning_rates = [0.01, 0.05, 0.1, 0.5, 1.0]
best_multi_acc = -1
best_logistic_multi_R = None
# 1. Grid Search for the best Hyperparameters
for lr in learning_rates:
# Initialize the multiclass model
# We assume k=3 for digits 0, 1, 2
model = logistic_regression_multiclass(learning_rate=lr, max_iter=1000, k=3)
# Train using Mini-Batch Gradient Descent
# Batch size is set to 10, but could also be tuned
model.fit_miniBGD(train_X, train_y, batch_size=10)
# Evaluate on the VALIDATION set
val_acc = model.score(valid_X, valid_y)
print(f"LR: {lr}, Validation Accuracy: {val_acc}")
# Select the best model
if val_acc > best_multi_acc:
best_multi_acc = val_acc
best_logistic_multi_R = model
print(f"Best Multiclass Model LR: {best_logistic_multi_R.learning_rate} with Val Acc: {best_multi_acc}")
### END YOUR CODE
# Visualize the your 'best' model after training.
# We pass the features (cols 1 and 2) and the learned weight matrix W.
visualize_result_multi(train_X[:, 1:3], train_y, best_logistic_multi_R.get_params())
# Use the 'best' model above to do testing.
### YOUR CODE HERE
print("\n--- Multiclass Testing ---")
# 1. Load the raw test data
raw_test_data, test_labels = load_data(os.path.join(data_dir, test_filename))
# 2. Extract Features
# Use the same function as training to get Bias, Symmetry, Intensity
test_X_all = prepare_X(raw_test_data)
# 3. Prepare Labels
# prepare_y usually returns the mapped labels and indices.
# For multiclass (0, 1, 2), we typically use all data returned.
test_y_all, _ = prepare_y(test_labels)
# 4. Score the best model on the Test Set
# This provides the unbiased estimate of E_out (Out-of-sample error)
test_acc = best_logistic_multi_R.score(test_X_all, test_y_all)
print(f"Final Multiclass Test Accuracy: {test_acc}")
### END YOUR CODE
Detailed Explanation of Each Step
- Data Setup (`train_X = train_X_all`):
  - Context: In the binary section, you likely filtered `train_X_all` to keep only digits 1 and 2.
  - Action: For the multiclass section, we revert to using `train_X_all`, which contains features for all available classes (digits 0, 1, and 2) as prepared by your `prepare_X` function.
- Hyperparameter Tuning (The Loop):
  - Concept: As discussed in Chapter 4 regarding Model Selection, we use the Validation Set (`valid_X`, `valid_y`) to select the best learning rate.
  - Action: We train multiple instances of `logistic_regression_multiclass` with different learning rates using `fit_miniBGD`. We select the one that achieves the highest accuracy on the validation set.
- Visualization (`visualize_result_multi`):
  - Concept: Visualizing the decision boundaries helps verify that the model has learned meaningful geometric separations between the classes.
  - Action: We call the function you implemented earlier.
    - `train_X[:, 1:3]`: The visualization function expects a 2D feature array (Symmetry vs. Intensity). Since `train_X` has a bias column of 1s at index 0, we slice it to pass only columns 1 and 2.
    - `best_logistic_multi_R.get_params()`: This passes the learned weight matrix $W$ (shape `[3, 3]`) to the visualizer so it can compute scores and draw the regions.
- Testing:
  - Concept: This step estimates $E_{out}$ (out-of-sample error). Per the "Three Learning Principles" (Chapter 5) and "Training versus Testing" (Chapter 2), the test set must be used only once at the very end.
  - Action:
    - Loading: We load the test data from the file.
    - Preprocessing: We must apply the exact same `prepare_X` and `prepare_y` transformations used on the training data. This ensures the test feature matrix has the same dimensions (bias + 2 features) and format as the model expects.
    - Scoring: We calculate the accuracy of our `best_logistic_multi_R` model on this fresh data. This number is your final performance metric.
Here is the implementation for the final part of your assignment.
This section compares Softmax Regression (configured for $k = 2$) with the binary Sigmoid Logistic Regression model on digits 1 and 2, to show that the two formulations are equivalent.
Implementation
# ------------Connection between sigmoid and softmax------------
############ Now set k=2, only use data from '1' and '2'
##### set labels to 0,1 for softmax classifer
train_X = train_X_all[train_idx]
train_y = train_y_all[train_idx]
train_X = train_X[0:1350]
train_y = train_y[0:1350]
valid_X = valid_X_all[val_idx]
valid_y = valid_y_all[val_idx]
# Softmax requires labels 0 and 1.
# Currently we have 1 and 2. Map 2 -> 0, and 1 -> 1 (implicit).
train_y[np.where(train_y==2)] = 0
valid_y[np.where(valid_y==2)] = 0
###### First, fit softmax classifer until convergence, and evaluate
##### Hint: we suggest to set the convergence condition as "np.linalg.norm(gradients*1./batch_size) < 0.0005" or max_iter=10000:
### YOUR CODE HERE
# Initialize Softmax model with k=2
# We use a large max_iter=10000 to ensure convergence as suggested in the hint
softmax_model = logistic_regression_multiclass(learning_rate=0.05, max_iter=10000, k=2)
# Train using Mini-Batch Gradient Descent
softmax_model.fit_miniBGD(train_X, train_y, batch_size=10)
# Evaluate
train_acc_soft = softmax_model.score(train_X, train_y)
valid_acc_soft = softmax_model.score(valid_X, valid_y)
print(f"Softmax (k=2) - Train Acc: {train_acc_soft:.4f}, Valid Acc: {valid_acc_soft:.4f}")
### END YOUR CODE
train_X = train_X_all[train_idx]
train_y = train_y_all[train_idx]
train_X = train_X[0:1350]
train_y = train_y[0:1350]
valid_X = valid_X_all[val_idx]
valid_y = valid_y_all[val_idx]
##### set lables to -1 and 1 for sigmoid classifer
### YOUR CODE HERE
# Sigmoid (Binary) requires labels -1 and 1.
# Currently we have 1 and 2. Map 2 -> -1.
train_y[train_y == 2] = -1
valid_y[valid_y == 2] = -1
### END YOUR CODE
###### Next, fit sigmoid classifer until convergence, and evaluate
##### Hint: we suggest to set the convergence condition as "np.linalg.norm(gradients*1./batch_size) < 0.0005" or max_iter=10000:
### YOUR CODE HERE
# Initialize Binary Logistic Regression model
sigmoid_model = logistic_regression(learning_rate=0.05, max_iter=10000)
# Train using Mini-Batch Gradient Descent (to match the Softmax comparison)
sigmoid_model.fit_miniBGD(train_X, train_y, batch_size=10)
# Evaluate
train_acc_sig = sigmoid_model.score(train_X, train_y)
valid_acc_sig = sigmoid_model.score(valid_X, valid_y)
print(f"Sigmoid (Binary) - Train Acc: {train_acc_sig:.4f}, Valid Acc: {valid_acc_sig:.4f}")
### END YOUR CODE
################Compare and report the observations/prediction accuracy
print("\n--- Observations ---")
print(f"Difference in Validation Accuracy: {abs(valid_acc_soft - valid_acc_sig):.6f}")
# Optional: Compare weights
# Softmax has 2 weight vectors (one for class 0, one for class 1).
# Sigmoid has 1 weight vector (distinguishing class 1 from -1).
# The decision boundary for Softmax is (w_1 - w_0)^T x = 0.
# The decision boundary for Sigmoid is w^T x = 0.
# Therefore, (w_1 - w_0) from Softmax should be proportional to w from Sigmoid.
softmax_W = softmax_model.get_params() # Shape (d, 2)
sigmoid_W = sigmoid_model.get_params() # Shape (d,)
# Calculate effective weight for softmax (w_class1 - w_class0)
# Note: In our mapping, label 1 is column 1, label 0 (old 2) is column 0.
effective_softmax_W = softmax_W[:, 1] - softmax_W[:, 0]
print("Shape of Softmax Weights:", softmax_W.shape)
print("Shape of Sigmoid Weights:", sigmoid_W.shape)
# Cosine similarity to check alignment
cosine_sim = np.dot(effective_softmax_W, sigmoid_W) / (np.linalg.norm(effective_softmax_W) * np.linalg.norm(sigmoid_W))
print(f"Alignment (Cosine Similarity) between models: {cosine_sim:.4f}")
# ------------End------------
Explanation of Observations
- Prediction Accuracy: You should observe that the validation accuracy for both models is identical (or extremely close, allowing for small numerical differences in optimization).
- Theoretical Connection:
  - Sigmoid: Predicts $P(y=+1 \mid x) = \theta(w^T x) = \frac{1}{1 + e^{-w^T x}}$. The decision boundary is $w^T x = 0$.
  - Softmax ($k = 2$): Predicts $P(y=1 \mid x) = \frac{e^{w_1^T x}}{e^{w_0^T x} + e^{w_1^T x}} = \theta\big((w_1 - w_0)^T x\big)$. The decision boundary is $(w_1 - w_0)^T x = 0$.
  - Conclusion: Softmax with $k = 2$ is mathematically equivalent to Sigmoid Logistic Regression. The effective weight vector in Softmax ($w_1 - w_0$) plays the same role as the single weight vector $w$ in Sigmoid. The small numeric check below illustrates this.
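A small numeric illustration of this equivalence (random weights and input, nothing taken from the trained models): the softmax probability of class 1 with two weight columns equals the sigmoid applied to the effective weight vector $w_1 - w_0$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)                          # made-up feature vector (bias included)
W = rng.normal(size=(3, 2))                     # softmax weights: columns w_0, w_1
z = x @ W
p_softmax = np.exp(z[1]) / np.sum(np.exp(z))    # P(class 1) under softmax
w_eff = W[:, 1] - W[:, 0]
p_sigmoid = 1.0 / (1.0 + np.exp(-(x @ w_eff)))  # sigmoid with the effective weights
print(p_softmax, p_sigmoid)                     # identical up to floating-point error
```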