HW1 - Linear Models for Handwritten Digits Classification
Overview:
In this assignment, you will implement the binary logistic regression model and the multi-class logistic regression model on a partial dataset from MNIST. In this classification task, the model takes features extracted from a handwritten digit image as input and predicts the digit's class label.
File structure:
code/
DataReader.py
LogisticRegression.py
LRM.py
main.py
data/
test.npz
training.npz
1. Data Preprocessing
DataReader.py
Function: load_data(filename)
def load_data(filename):
"""Load a given txt file.
Args:
filename: A string.
Returns:
raw_data: An array of shape [n_samples, 256].
labels : An array of shape [n_samples,].
"""
data = np.load(filename)
x = data['x']
y = data['y']
return x, y
What does it do?
- This function is responsible for importing the raw data from a file into the program so the math operations (like those in Linear Regression) can be performed.
- np.load(filename): The code expects a NumPy-compatible file format (here an .npz archive). This file acts like a dictionary containing multiple arrays.
- x = data['x']: This extracts the Input Matrix. In the slides, this corresponds to the data vectors $x_n$. The shape [n_samples, 256] tells us that n_samples is the number of samples and 256 is the number of features (dimensions). Context from the source: the textbook discusses a "handwritten digit recognition" problem using 16×16 pixel images, which is where the 256 features come from.
- y = data['y']: This extracts the Target Vector. In the slides, this corresponds to the correct answers $y_n$ (e.g., +1 or -1 for classification).
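A quick, hypothetical sanity check of this function (assuming it is run from the code/ directory so that load_data is importable from DataReader and the archive sits at ../data/training.npz; adjust the path as needed):
import os
import numpy as np
from DataReader import load_data
raw_data, labels = load_data(os.path.join('..', 'data', 'training.npz'))
print(raw_data.shape)      # expected: (n_samples, 256)
print(labels.shape)        # expected: (n_samples,)
print(np.unique(labels))   # the digit classes stored in this file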
Function: train_valid_split(raw_data, labels, split_index)
def train_valid_split(raw_data, labels, split_index):
"""Split the original training data into a new training dataset
and a validation dataset.
n_samples = n_train_samples + n_valid_samples
Args:
raw_data: An array of shape [n_samples, 256].
labels : An array of shape [n_samples,].
split_index: An integer.
"""
return raw_data[:split_index], raw_data[split_index:], labels[:split_index], labels[split_index:]
What does it do?
- This function implements the concept of Validation. It takes the total available data and slices it into two separate sets: one for learning parameters and one for checking performance.
- The Slicing ([:split_index] and [split_index:]): The function uses the split_index integer to cut the arrays. Example: split_index = 800 means the first 800 samples go to training and the remaining samples go to validation.
- Training Set (:split_index): The data from the start up to the split point becomes the new training set. This is the set used to calculate the weights (minimizing $E_{in}$).
- Validation Set (split_index:): The data from the split point to the end becomes the validation set. This set is not used for training.
The slicing (this is the key part):
return raw_data[:split_index], raw_data[split_index:], \
labels[:split_index], labels[split_index:]
Let’s break that apart.
Features:
raw_data[:split_index] # training data
raw_data[split_index:] # validation data
Labels:
labels[:split_index] # training labels
labels[split_index:] # validation labels
So the return order is:
x_train, x_valid, y_train, y_valid
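A toy illustration of this slicing with made-up numbers (assuming train_valid_split from above is in scope):
import numpy as np
toy_X = np.arange(10).reshape(5, 2)      # 5 samples, 2 features each
toy_y = np.array([1, 2, 1, 2, 1])
x_train, x_valid, y_train, y_valid = train_valid_split(toy_X, toy_y, 3)
print(x_train.shape, x_valid.shape)      # (3, 2) (2, 2)
print(y_train, y_valid)                  # [1 2 1] [2 1]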
Theoretical Connections
- According to the text, "the validation set is a subset of the data that is not involved in the learning process and is used to evaluate the final hypothesis."
- This split allows you to estimate the out-of-sample error ($E_{out}$) and perform model selection (like choosing a learning rate or regularization parameter) without "snooping" on the test data.
Summary of Returns:
- New Training Inputs: Used to calculate gradients and update weights.
- New Validation Inputs: Used only to predict and check accuracy.
- New Training Labels: The correct answers for the training inputs.
- New Validation Labels: The correct answers for the validation inputs.
Why do we need this function?
1. Training Error ($E_{in}$) is Biased and Optimistic
When a model minimizes the error on the training set, it is specifically adapting to that data. Consequently, the in-sample error ($E_{in}$) is an optimistically biased estimate of performance on unseen data.
- The Validation Role: The validation set acts as a proxy for real-world data. Since this data was not used to calculate the weights, the error calculated on it ($E_{val}$) provides a nearly unbiased estimate of the out-of-sample error ($E_{out}$).
2. Model Selection (Tuning Hyperparameters)
You often need to make choices about the learning process that the training algorithm cannot determine on its own. For example:
- Which polynomial degree to use?
- Which regularization parameter ($\lambda$) to use?
- When to stop training (early stopping)?
If you choose these parameters based on which one gives the lowest Training Error, you will simply pick the most complex model possible, leading to overfitting.
- The Validation Role: You test different versions of the model (e.g., different $\lambda$ values) on the validation set and pick the one with the lowest Validation Error. This allows you to select the model that generalizes best, rather than the one that just memorized the training data.
3. Protecting the Test Set (Avoiding Data Snooping)
You might wonder, "Why not just use the Test set for this?"
- The Principle: If a data set affects any step in the learning process (including choosing parameters), it is no longer a valid test set.
- The Danger: If you use the test set to tune your model, you are effectively "snooping" on the final exam. The test set will no longer give you an honest assessment of how the model will perform in the real world. The validation set serves as a "sandbox" for tuning so that the test set remains pure and untouched until the very end.
Summary: You split the data so you can train on one part (the training set), tune and compare models on another (the validation set), and leave the test set untouched for the final, honest evaluation.
Function: prepare_X(raw_X)
def prepare_X(raw_X):
"""Extract features from raw_X as required.
Args:
raw_X: An array of shape [n_samples, 256].
Returns:
X: An array of shape [n_samples, n_features].
"""
raw_image = raw_X.reshape((-1, 16, 16))
Restore spatial structure of the images
raw_image = raw_X.reshape((-1, 16, 16))
- Context: The input data raw_X comes as a "flattened" matrix where every image is a single row of 256 numbers. To flip an image left-to-right, the computer needs to know which pixels are on the left and which are on the right.
- The Code:
  - 16, 16: This tells the computer that the 256 pixels actually represent a 16×16 grid.
  - -1: This is a NumPy trick that means "infer this dimension automatically." Here it becomes N, the total number of images (samples) in the dataset.
- Result: You go from a 2D matrix of shape (N, 256) to a 3D block of shape (N, 16, 16).
Feature 1: Symmetry
# Feature 1: Measure of Symmetry
# Flip images horizontally
flipped_image = np.flip(raw_image, axis=2)
# Since shape == (N, 16, 16), width is axis 2
# Formula for symmetry: - (Sum of |x - flip(x)|) / 256
symmetry = -np.sum(np.abs(raw_image - flipped_image), axis=(1, 2)) / 256
Explanation:
flipped_image = np.flip(raw_image, axis=2)
- This line creates a mirror image of every digit in the dataset.
- Context: To measure symmetry, we compare the original image to its horizontal reflection. If a digit is perfectly symmetric (like the number '1'), the reflection should look identical to the original.
- The Code:
axis=2: In the shape(N, height, width), axis 0 is the sample index, axis 1 is the height, and axis 2 is the width. Flipping along axis 2 performs a horizontal (left-right) flip.
Explanation:
symmetry = -np.sum(np.abs(raw_image - flipped_image), axis=(1, 2)) / 256
- This line calculates the final symmetry score using the formula provided in the assignment: $$F_{\text {symmetry }}=-\frac{\sum_{\text {pixel }}|x-f l i p(x)|}{256}$$
- raw_image - flipped_image: This calculates the difference between the original and the mirrored version pixel-by-pixel.
- np.abs(...): We care about the magnitude of the difference (e.g., a black pixel becoming white or vice versa), not the sign. This measures the asymmetry.
- np.sum(..., axis=(1, 2)): This sums up all the pixel differences for each image. By specifying axis=(1, 2), we collapse the height and width, resulting in a single total number for each of the N images.
- / 256: Dividing the sum by the total number of pixels (256) gives the average difference per pixel.
- - (the negative sign): This is crucial.
  - A large absolute difference means the image is asymmetric (like the digit '5').
  - A small absolute difference means the image is symmetric (like the digit '1').
  - By adding the negative sign, we flip this relationship: a highly symmetric image (difference near 0) gets a score close to 0 (the maximum), while an asymmetric image gets a large negative score. This creates a "Symmetry" feature rather than an "Asymmetry" feature.
Feature 2: Intensity
# Feature 2: Measure of Intensity
### YOUR CODE HERE
# Formula for intensity: (Sum of x) / 256
intensity = np.sum(raw_image, axis=(1,2)) / 256
Explanation:
intensity = np.sum(raw_image, axis=(1,2)) / 256
- The formula for intensity given in the assignment is: $$F_{\text {intensity }}=\frac{\sum_{\text {pixel }} x}{256}$$
- The code performs this calculation in two steps:
- The code performs this calculation in two steps:
  - np.sum(raw_image, axis=(1,2)): This calculates the numerator $\sum_{\text{pixel}} x$ by adding up all the pixel values within each individual image.
  - / 256: This divides the sum by the total number of pixels to get the average value.
- This matches the description in the textbook, where intensity is described as the average pixel value used to distinguish between digits like "1" (low intensity) and "5" (high intensity).
(extra) What does axis=(1,2) do?
To understand the axis parameter, you have to visualize how the raw_image data is structured after you reshape it.
The variable raw_image is a 3-Dimensional array with the shape (N, 16, 16).
- Axis 0 (N): The index of the image (e.g., Image #1, Image #2, ..., Image #N).
- Axis 1 (16): The height (rows) of the image.
- Axis 2 (16): The width (columns) of the image.
When you specify axis=(1, 2): You are telling NumPy to "collapse" or "squash" dimensions 1 and 2 by summing them up, while keeping Axis 0 separate.
- If you summed axis=0: You would add Image #1 + Image #2 + ..., which would result in one "super image" combining everyone's handwriting. This is incorrect here.
- If you sum axis=(1, 2): You are saying, "For each individual image (Axis 0), sum up all the rows (Axis 1) and all the columns (Axis 2)."
Visualizing the Operation
- Think of a single image as a spreadsheet with 16 rows and 16 columns (256 cells total).
- axis=1 (Rows): If you summed only this axis, you would squash the spreadsheet down to one row of column totals.
- axis=2 (Columns): If you summed only this axis, you would squash the spreadsheet into one column of row totals.
- axis=(1, 2) (Both): This tells the computer to squash both dimensions. It takes every number in that 16x16 spreadsheet and adds them all up into one single number.
- Summary: np.sum(..., axis=(1,2)) adds all 256 pixel values of each image together.
- Output: One single number per image (the total value).
- Next step: Divide by 256 to get the average value (intensity).
Visualizing the Result:
- Input: A 3D block of N images, each 16×16.
- Operation: Sum over the height and width.
- Output: A 1D vector of length N. Each number in this vector is the total intensity for that specific image.
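A small standalone demo of this collapsing behavior, using tiny 2×2 "images" instead of 16×16 purely for illustration:
import numpy as np
imgs = np.array([[[1, 2],
                  [3, 4]],
                 [[10, 20],
                  [30, 40]]])
print(imgs.shape)                   # (2, 2, 2): (N, height, width)
print(np.sum(imgs, axis=0))         # one "super image" combining both samples (not what we want)
print(np.sum(imgs, axis=(1, 2)))    # [ 10 100]: one total per image
print(np.sum(imgs, axis=(1, 2)) / imgs[0].size)   # [ 2.5 25. ]: per-image average (intensity)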
Feature 3: Bias Term. Always 1.
# Feature 3: Bias Term. Always 1.
# Vector of 1s with length == n_samples
bias = np.ones(raw_image.shape[0])
What the code does
bias = np.ones(raw_image.shape[0])
- raw_image.shape[0]: This retrieves N, the total number of images (samples) in your dataset.
- np.ones(...): This creates a NumPy array of length N where every single entry is 1.0.
Why do we need it? (The "Bias Trick")
In linear models, the decision boundary is determined by a weighted sum of the inputs plus a threshold (or offset).
- The Formula: The hypothesis is formally written as: $$h(x) = \text{sign}\left( \left( \sum_{i=1}^{d} w_i x_i \right) + w_0 \right)$$ Here, $\sum_{i=1}^{d} w_i x_i$ is the weighted score of your actual features (symmetry and intensity), and $w_0$ is the bias weight (threshold).
- The Matrix Problem: We want to calculate this entire score using a single neat matrix multiplication: $\mathbf{w}^T \mathbf{x}$. However, standard matrix multiplication only does the summing part ($\sum_{i=1}^{d} w_i x_i$); it doesn't automatically add the loose constant $w_0$.
- The Solution ($x_0 = 1$): To fix this, we use a mathematical trick called the augmented vector. We imagine there is a "dummy" feature $x_0$ that is always equal to 1: $$w_0 \cdot 1 + w_1 x_1 + w_2 x_2 = \mathbf{w}^T \mathbf{x}$$ By adding this column of 1s to your data matrix, the computer treats the bias just like any other weight. When it multiplies the weight $w_0$ by the feature $x_0 = 1$, it effectively adds the threshold to the score.
What happens if you skip this?
If you do not include this bias column (the column of 1s), you force $w_0 = 0$.
- Geometrically, this forces your separating line (or hyperplane) to pass directly through the origin $(0, 0)$.
- If your data is not centered at the origin (which is true for pixel intensity, as it is always positive), a line passing through the origin will likely fail to separate the digits "1" and "5" correctly. The bias allows the line to shift up, down, left, or right to find the best fit.
(extra) The concept of a threshold/bias weight
Think of a linear model as a scorecard. In the credit card example, the model calculates a weighted score based on your inputs (salary, years in job, etc.).
- The Score: $\sum_{i=1}^{d} w_i x_i$
- The Decision: To make a Yes/No decision, the bank needs a specific threshold. For example, "If the score is greater than 50, approve credit."
Mathematically, this looks like: $$ \sum_{i=1}^d w_i x_i > \text{threshold} $$
The Mathematical Trick: Turning Threshold into Bias:
To make the math and the code cleaner, machine learning experts move the threshold to the other side of the inequality. $$ \sum_{i=1}^d w_i x_i - \text{threshold} > 0 $$
The term $-\text{threshold}$ is renamed the bias weight $w_0$, so the condition becomes $\sum_{i=1}^{d} w_i x_i + w_0 > 0$.
The "Augmented" Vector ( )
This explains why our code has bias = np.ones(...).
We want to write the entire formula as a single matrix multiplication (dot product): $w^T x$.
- The Solution: We create a "dummy" feature, called $x_0$, which is always equal to 1.
- Now, $w_0 \cdot x_0$ is effectively $w_0$, the bias term.
This allows us to merge the bias into the weights and inputs perfectly: $$ \underbrace{w_0 \cdot 1}_{\text{bias}} + \sum_{i=1}^{d} w_i x_i = \sum_{i=0}^{d} w_i x_i = w^T x $$
This is referred to in the slides as the augmented vector.
Geometric Meaning
Without a bias term, a linear model (a line or hyperplane) is forced to pass through the origin.
- Without Bias: If all inputs $x_i$ are 0, the score is always 0. The line cannot shift left, right, up, or down; it can only rotate about the origin.
- With Bias: The line can shift away from the origin. This is crucial for data (like image pixel intensities) that might not be naturally centered around zero. The bias allows the separating line to "slide" to the correct position to separate the digits '1' and '5'.
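A minimal sketch of the bias trick with toy numbers (none of these values come from the assignment's data):
import numpy as np
features = np.array([[0.5, -1.0],
                     [2.0,  0.3],
                     [-1.5, 1.2]])                  # 3 samples, 2 features
w = np.array([0.7, 2.0, -1.0])                      # w[0] plays the role of the threshold/bias
X_aug = np.column_stack((np.ones(features.shape[0]), features))   # prepend x0 = 1
scores = X_aug @ w                                  # w0*1 + w1*x1 + w2*x2 for every sample
scores_explicit = features @ w[1:] + w[0]           # same thing with an explicit "+ bias"
print(np.allclose(scores, scores_explicit))         # True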
Stack features together
# Stack features together in the following order.
# [Feature 3, Feature 1, Feature 2]
# Final feature matrix with shape == (n_samples, 3)
X = np.column_stack((bias, symmetry, intensity))
return X
1. What the code does (The Mechanics)
The function np.column_stack takes a sequence of 1D arrays (vectors) and stacks them side-by-side as columns to create a single 2D matrix.
- Inputs: Three separate vectors, each of length N (the number of images):
  - bias: a vector of all 1s.
  - symmetry: the symmetry score for each image.
  - intensity: the intensity score for each image.
- Output: A single matrix $\mathbf{X}$ with dimensions $N \times 3$.
2. The Resulting Data Matrix ($\mathbf{X}$)
This line constructs the Data Matrix (often called the Design Matrix) described in the "Linear Regression Algorithm" slide. Each row represents one specific image (data point), and each column represents a specific feature of that image.
Visually, each row of the matrix $\mathbf{X}$ is one image, and:
- Column 0 (Bias): Corresponds to the bias weight $w_0$.
- Column 1 (Symmetry): Corresponds to the weight $w_1$.
- Column 2 (Intensity): Corresponds to the weight $w_2$.
3. Why do we stack them this way?
We stack them into this specific shape to utilize matrix multiplication for calculating predictions and errors efficiently.
As explained in the slides, the linear model calculates a "score" or "signal" for every data point using the formula: $$s = \mathbf{X}\mathbf{w}$$
This single line of linear algebra ($\mathbf{X}\mathbf{w}$) computes the score of every image at once.
- Without column_stack, you would have disjoint vectors and would have to write a loop to calculate the score for each image individually.
- By creating $\mathbf{X}$, you enable the use of the analytic solution for Linear Regression provided in your slides: $$ \mathbf{w}_{lin} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y} $$ This matrix $\mathbf{X}$ is the primary input required to solve for the best weights using the pseudo-inverse method.
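If you want to see that analytic solution in action, here is a hedged sketch (not part of the assignment's required code) that solves for the weights with NumPy's pseudo-inverse; the demo data is random placeholder data of the right shapes:
import numpy as np
def linear_regression_weights(X, y):
    # np.linalg.pinv(X) computes the pseudo-inverse, so this evaluates
    # (X^T X)^{-1} X^T y in a numerically stable way.
    return np.linalg.pinv(X) @ y
rng = np.random.default_rng(0)
X_demo = np.column_stack((np.ones(5), rng.normal(size=(5, 2))))   # shape (5, 3), like prepare_X output
y_demo = np.array([1, -1, 1, 1, -1])
print(linear_regression_weights(X_demo, y_demo))                  # a length-3 weight vector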
Function: prepare_y(raw_y)
def prepare_y(raw_y):
"""
Args:
raw_y: An array of shape [n_samples,].
Returns:
y: An array of shape [n_samples,].
idx:return idx for data label 1 and 2.
"""
y = raw_y
idx = np.where((raw_y==1) | (raw_y==2))
y[np.where(raw_y==0)] = 0
y[np.where(raw_y==1)] = 1
y[np.where(raw_y==2)] = 2
return y, idx
1. The Purpose of the Function
This function is designed to filter the dataset for a specific binary classification task. Even though the original dataset contains many digits (0, 1, 2, etc.), this function creates an index (idx) to isolate only the samples corresponding to the digits "1" and "2".
This aligns with the strategy mentioned in the textbook of decomposing a multiclass problem (recognizing ten digits) into smaller binary problems (e.g., separating digit 1 from digit 2).
2. Line-by-Line Explanation
- y = raw_y: This creates a variable y that references the input label array.
- idx = np.where((raw_y==1) | (raw_y==2)):
  - The Logic: This acts as a filter. It scans the raw_y array and finds the index locations (row numbers) where the label is either 1 or 2.
  - The Result: idx is a list of integers representing the specific examples you will use for training and testing. Any data point not in this list (like digit 0, 3, 4, ...) will effectively be ignored in later steps when you apply this filter.
- y[np.where(raw_y==0)] = 0, etc.:
  - Sanitization: These lines explicitly set the values at the matching indices to the integers 0, 1, and 2.
  - Why? In many starter codes, this is a safety measure. It ensures that the labels are clean integers and not floats (e.g., 1.0 instead of 1) or other formats, ensuring consistency before the mathematical operations begin.
3. Summary of Returns
- y: The array of labels (sanitized).
- idx: The critical piece of information. You will likely use this in your main.py or training loop to slice your data matrix:
  - Instead of using the full X, you will use X[idx] to get only the images of 1s and 2s.
  - Instead of using the full y, you will use y[idx] to get only the labels 1 and 2.
This function identifies samples belonging to classes 1 and 2 for binary classification, and these indices should be used to filter the data and convert the labels to ±1 if required by the learning algorithm.
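A tiny standalone illustration of how idx is used to filter both features and labels (toy values only):
import numpy as np
toy_labels = np.array([0, 1, 2, 2, 1, 0])
idx = np.where((toy_labels == 1) | (toy_labels == 2))
print(idx[0])            # [1 2 3 4]: positions of the 1s and 2s
toy_X = np.arange(12).reshape(6, 2)
print(toy_X[idx])        # only the rows belonging to digits 1 and 2
print(toy_labels[idx])   # [1 2 2 1]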
main.py
Function main()
def main():
# ------------Data Preprocessing------------
# Read data for training.
raw_data, labels = load_data(os.path.join(data_dir, train_filename))
raw_train, raw_valid, label_train, label_valid = train_valid_split(raw_data, labels, 2300)
##### Preprocess raw data to extract features
train_X_all = prepare_X(raw_train)
valid_X_all = prepare_X(raw_valid)
##### Preprocess labels for all data to 0,1,2 and return the idx for data from '1' and '2' class.
train_y_all, train_idx = prepare_y(label_train)
valid_y_all, val_idx = prepare_y(label_valid)
####### For binary case, only use data from '1' and '2'
train_X = train_X_all[train_idx]
train_y = train_y_all[train_idx]
####### Only use the first 1350 data examples for binary training.
train_X = train_X[0:1350]
train_y = train_y[0:1350]
valid_X = valid_X_all[val_idx]
valid_y = valid_y_all[val_idx]
####### Set labels to 1 and -1. Here convert label '2' to '-1', which means we treat digit '1' as the positive class.
### YOUR CODE HERE
# Convert label 2 to -1
train_y[train_y == 2] = -1
valid_y[valid_y == 2] = -1
# Explicitly convert label 1 to 1 (not strictly needed)
train_y[train_y == 1] = 1
valid_y[valid_y == 1] = 1
### END YOUR CODE
data_shape = train_y.shape[0]
# # Visualize training data.
visualize_features(train_X[:, 1:3], train_y)
1. Loading and Splitting Data
raw_data, labels = load_data(os.path.join(data_dir, train_filename))
raw_train, raw_valid, label_train, label_valid = train_valid_split(raw_data, labels, 2300)
- Action: It loads the full dataset and splits it into a Training Set (first 2300 examples) and a Validation Set (the rest).
- Why? As discussed in the textbook, the validation set is essential for model selection. It acts as a proxy for out-of-sample data (an estimate of $E_{out}$) to help you tune hyperparameters without "snooping" on the final test data.
2. Feature Extraction (prepare_X)
train_X_all = prepare_X(raw_train)
valid_X_all = prepare_X(raw_valid)
- Action: This calls the function we implemented earlier. It transforms the raw 256-pixel vectors into the Design Matrix $\mathbf{X}$ with 3 columns:
  - Bias: Always 1 (to handle the threshold $w_0$).
  - Symmetry: How symmetric the digit is.
  - Intensity: How black/bold the digit is.
- Context: This maps the input space from 256 dimensions down to 3 dimensions (bias, symmetry, intensity).
3. Filtering for Binary Classification (prepare_y)
train_y_all, train_idx = prepare_y(label_train)
valid_y_all, val_idx = prepare_y(label_valid)
- Action: This function (implemented earlier) scans the labels. It returns train_idx, a list of row numbers where the digit is either '1' or '2'.
- Why? The textbook describes decomposing a multi-class problem (10 digits) into binary problems (e.g., "is it a 1 or a 5?"). Your specific assignment is determining "is it a 1 or a 2?".
4. Applying the Filter
train_X = train_X_all[train_idx]
train_y = train_y_all[train_idx]
- Action: It uses the indices found in step 3 to discard all data related to digits 0, 3, 4, 5, 6, 7, 8, and 9. The arrays train_X and train_y now only contain information about digits 1 and 2.
5. Limiting Sample Size
train_X = train_X[0:1350]
train_y = train_y[0:1350]
- Action: It explicitly slices the data to keep only the first 1,350 examples.
- Why? This is likely an artificial constraint for this specific homework problem (e.g., to test how the model behaves with a fixed sample size N, or to ensure everyone gets the same results).
6. Label Conversion (The Code You Wrote)
train_y[train_y == 2] = -1
valid_y[valid_y == 2] = -1
# ... (and implicitly 1 remains 1)
- Action: This converts the class labels from {1, 2} to {+1, -1}.
- Mathematical Necessity: Linear classification models (like the Perceptron) operate on the sign of the signal: $h(x) = \text{sign}(w^T x)$, which outputs +1 or -1.
  - If you left the labels as 1 and 2, the math for minimizing the error (e.g., the cross-entropy loss, which assumes $y \in \{+1, -1\}$) would break, because the model is trying to output signs, not the integer "2".
- If you left the labels as 1 and 2, the math for minimizing error (e.g.,
7. Visualization
visualize_features(train_X[:, 1:3], train_y)
- Action: It sends the data to a plotting function.
- train_X[:, 1:3]: This selects all rows (:) but only columns 1 and 2.
  - Column 0 is the Bias (all 1s), which is not useful to plot.
- Column 1 is Symmetry.
- Column 2 is Intensity.
- Result: This will likely generate a 2D scatter plot similar to the one shown in the textbook (Figure 3.1 or Example 3.1), where you can see if the 1s and 2s form distinct clusters based on how symmetric and dark they are.
Function: visualize_features(X, y)
def visualize_features(X, y):
'''This function is used to plot a 2-D scatter plot of training features.
Args:
X: An array of shape [n_samples, 2].
y: An array of shape [n_samples,]. Only contains 1 or -1.
Returns:
No return. Save the plot to 'train_features.*' and include it
in submission.
'''
### YOUR CODE HERE
# Separate the data into 1 (digit 1) and -1 (digit 2) classes
X_pos = X[y == 1]
X_neg = X[y == -1]
# Initialize plot
plt.figure(figsize=(8, 6))
# Plot positives (1) with 'o'
plt.scatter(X_pos[:,0], X_pos[:, 1], color='blue', marker='o', label='Class +1')
# Plot negatives (-1) with 'x'
plt.scatter(X_neg[:, 0], X_neg[:, 1], color='red', marker='x', label='Class -1')
# Styling
plt.xlabel('Symmetry')
plt.ylabel('Intensity')
plt.title('Feature Visualization')
plt.legend()
plt.grid(True)
# Save plot
plt.savefig('train_features.png')
# Display
plt.show()
### END YOUR CODE
Just a simple training feature scatter plot, not very complex.
2. Cross-entropy loss
(a) Write the loss function for one training data sample $(x, y)$.
1. The Setup
- The Signal: Let $s = w^T x$.
- The Logistic Function: $\theta(s) = \frac{1}{1 + e^{-s}}$.
- The Probability: The model states that $P(y = +1 \mid x) = \theta(w^T x)$ and $P(y = -1 \mid x) = 1 - \theta(w^T x)$.
- The Loss: We want to minimize the negative log-likelihood, so the error for one point is $E(w) = -\ln P(y \mid x)$.
2. Identify the Probability Function
In Logistic Regression with binary labels $y \in \{+1, -1\}$, the probability is defined piecewise: $P(y|x) = \theta(w^T x)$ if $y = +1$, and $P(y|x) = 1 - \theta(w^T x)$ if $y = -1$.
The Property:
The logistic function is defined as $\theta(s) = \frac{e^s}{1 + e^s} = \frac{1}{1 + e^{-s}}$.
Now, look at $1 - \theta(s)$: $$1 - \theta(s) = 1 - \frac{e^s}{1 + e^s} = \frac{1}{1 + e^s}$$
Since both equal $\frac{1}{1 + e^s}$, we conclude that $1 - \theta(s) = \theta(-s)$.
Applying it to the Piecewise Function
Now we look at the two cases for $y$:
- Case 1 ($y = +1$): The formula is $P(y|x) = \theta(w^T x)$. Since $y = +1$, we can write this as $\theta(y\, w^T x)$.
- Case 2 ($y = -1$): The formula is $P(y|x) = 1 - \theta(w^T x)$. Using the property derived above, we swap $1 - \theta(w^T x)$ for $\theta(-w^T x)$: $$ 1 - \theta(w^T x) = \theta(-w^T x) $$ Since $y = -1$, we can write $-w^T x$ as $y\, w^T x$, which is equal to $\theta(y\, w^T x)$. So, $P(y|x) = \theta(y\, w^T x)$.
Summary:
By using the symmetry $1 - \theta(s) = \theta(-s)$, both cases collapse into the single compact formula $P(y|x) = \theta(y\, w^T x)$.
3. Apply the Negative Log-Likelihood
To find the optimal weights, the algorithm seeks to maximize the likelihood of the data. Mathematically, maximizing the likelihood is equivalent to minimizing the negative natural logarithm (negative log-likelihood) of that probability. Therefore, the loss for one point is $E(w) = -\ln P(y|x) = -\ln \theta(y\, w^T x)$.
The Substitution Step-by-Step
Step A: Substitute the probability $P(y|x) = \theta(y\, w^T x) = \frac{1}{1 + e^{-y\, w^T x}}$ into $E(w) = -\ln P(y|x)$, which gives $E(w) = -\ln\left(\frac{1}{1 + e^{-y\, w^T x}}\right)$.
Step B: Use $-\ln(1/a) = \ln(a)$. By substituting the definition of the logistic function into the equation above, we obtain the standard cross-entropy loss formula for a single training sample: $$E(w) = \ln\left(1 + e^{-y\, w^T x}\right)$$
(b) Compute the gradient $\nabla E(w)$. Please provide intermediate steps of the derivation.
Based on the loss function derived in the previous step and the gradient descent discussions in the Learning From Data textbook, here is the step-by-step derivation of the gradient $\nabla E(w)$.
The Loss Function
Recalling the Cross-Entropy loss for a single sample: $$E(w) = \ln\left(1 + e^{-y\, w^T x}\right)$$
Step-by-Step Derivation
Step 1: Identify the nested functions
Let us define a variable for each nested part of this expression so the chain rule can be applied cleanly (see Step 2).
Step 2: Apply the Chain Rule:
The Chain Rule states that if you have composite functions (functions inside of other functions), the derivative of the whole is the product of the derivatives of the parts.
To calculate the gradient, we view the error function $E(w) = \ln(1 + e^{-y\, w^T x})$ as three nested layers:
- Layer 1 (Inner Variable): Let $u$ be the exponent. $$u(w) = -y\, w^T x$$
- Layer 2 (Intermediate Term): Let $z$ be the term inside the logarithm. $$z(u) = 1 + e^u$$
- Layer 3 (Outer Function): The error function itself is the natural log of $z$. $$E(z) = \ln(z)$$
Apply the Chain Rule:
We want to find how $E$ changes when $w$ changes ($\nabla_w E$). Because $w$ is buried inside $u$, we must calculate how changes propagate through the layers:
- How $E$ changes as $z$ changes ($\partial E / \partial z$).
- How $z$ changes as $u$ changes ($\partial z / \partial u$).
- How $u$ changes as $w$ changes ($\nabla_w u$).
When we multiply these rates of change together, we get the total gradient: $$ \nabla E(w) = \underbrace{\frac{\partial E}{\partial z}}_{\text{Outer}} \cdot \underbrace{\frac{\partial z}{\partial u}}_{\text{Middle}} \cdot \underbrace{\nabla_w u}_{\text{Inner}} $$
If we substitute the definitions of $z$ back in:
- Replace $z$ with $1 + e^u$: $$ \frac{\partial E}{\partial z} \rightarrow \frac{\partial E}{\partial (1+e^u)} $$
- Replace $z$ with $1 + e^u$ in the second term: $$ \frac{\partial z}{\partial u} \rightarrow \frac{\partial (1+e^u)}{\partial u} $$
So the gradient is the product of the derivatives of the nested parts and is written as: $$\nabla E(w) = \frac{\partial E}{\partial (1+e^u)} \cdot \frac{\partial (1+e^u)}{\partial u} \cdot \nabla_w u$$
Step 3: Compute the individual derivatives
- Outer Logarithm: The derivative of $\ln(z)$ is $\frac{1}{z}$. $$ \frac{\partial E}{\partial (1+e^u)} = \frac{1}{1 + e^u} $$
- Exponential Term: The derivative of $1 + e^u$ with respect to $u$ is $e^u$. $$ \frac{\partial (1+e^u)}{\partial u} = e^u $$
- Linear Term: The gradient of $u = -y\, w^T x$ with respect to the vector $w$ is simply the coefficient vector $-y x$. $$ \nabla_w u = \nabla_w (-y\, w^T x) = -y x $$
Step 4: Combine and Substitute. Multiply the three parts together: $$\nabla E(w) = \frac{1}{1+e^u} \cdot e^u \cdot (-y x) = \frac{-y\, x\, e^u}{1 + e^u}$$
Substitute $u = -y\, w^T x$ back in: $$\nabla E(w) = \frac{-y\, x\, e^{-y\, w^T x}}{1 + e^{-y\, w^T x}}$$
Step 5: Simplify the Expression
To clean up the fraction, multiply the numerator and the denominator by $e^{y\, w^T x}$: $$\frac{e^{-y\, w^T x}}{1 + e^{-y\, w^T x}} = \frac{1}{1 + e^{y\, w^T x}}$$
Now, combine this with the term $-y\, x$:
Final Answer
The gradient of the loss function for a single sample is: $$\nabla E(w) = \frac{-y\, x}{1 + e^{y\, w^T x}}$$
This result matches the summation formula provided in Exercise 3.7 of the textbook, which defines the gradient contribution of a single point as $\frac{-y_n x_n}{1 + e^{y_n w^T x_n}}$.
Interpretation: The gradient is a scaled version of the input vector $x$. The scalar factor $\frac{-y}{1 + e^{y\, w^T x}}$ is tiny when the point is classified correctly with a large margin and approaches $-y$ when the point is badly misclassified, so misclassified points dominate the update.
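One way to convince yourself the formula is correct is a finite-difference check: perturb each weight slightly and compare the numerical slope of $E(w) = \ln(1 + e^{-y\, w^T x})$ with the analytic gradient. A minimal standalone sketch with made-up numbers:
import numpy as np
def loss(w, x, y):
    return np.log(1.0 + np.exp(-y * np.dot(w, x)))
def analytic_grad(w, x, y):
    return -y * x / (1.0 + np.exp(y * np.dot(w, x)))
w = np.array([0.2, -0.5, 1.0])
x = np.array([1.0, 0.3, -0.7])       # e.g., [bias, symmetry, intensity]
y = -1
eps = 1e-6
num_grad = np.array([(loss(w + eps * e, x, y) - loss(w - eps * e, x, y)) / (2 * eps)
                     for e in np.eye(len(w))])
print(np.allclose(num_grad, analytic_grad(w, x, y)))   # True (up to numerical error)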
(c) Once the optimal $w$ is obtained, it can be used to make predictions as follows:
$$
\text{Predicted class of } x = \begin{cases} 1 & \text{if } \theta\left(w^T x\right) \geq 0.5 \\ -1 & \text{if } \theta\left(w^T x\right) < 0.5 \end{cases}
$$
where the function $\theta(s) = \frac{1}{1 + e^{-s}}$ is the sigmoid (logistic) function.
However, this is not the most efficient way since the decision boundary is linear. Why? Explain it. When will we need to use the sigmoid function in prediction?
1. Why is checking $\theta(w^T x) \geq 0.5$ not the most efficient way?
Explanation: Using the sigmoid function to make a binary prediction is inefficient because it involves unnecessary and computationally expensive mathematical operations.
- The Math: The sigmoid function $\theta(s)$ is strictly monotonic: it constantly increases as the input increases. Crucially, $\theta(0) = 0.5$.
- The Equivalence: Because of this property, the inequality $\theta(w^T x) \geq 0.5$ is mathematically identical to checking whether the input signal is non-negative: $$ \theta(w^T x) \geq 0.5 \iff w^T x \geq 0 $$
- Computational Cost:
  - Computing $w^T x$ only requires basic multiplication and addition (a dot product).
  - Computing $\theta(w^T x)$ requires calculating the dot product plus an exponentiation ($e^{-w^T x}$) and a division. Exponentiation is computationally much more expensive than simple arithmetic.
Conclusion: Since the decision boundary is linear (defined by $w^T x = 0$), you can make exactly the same predictions by simply checking the sign of $w^T x$ and skipping the sigmoid entirely.
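A quick numerical illustration of this equivalence (toy signal values only):
import numpy as np
signals = np.array([-3.0, -0.1, 0.0, 0.1, 3.0])       # hypothetical values of w^T x
sigmoid_rule = (1 / (1 + np.exp(-signals))) >= 0.5    # threshold the probability
sign_rule = signals >= 0                              # threshold the raw signal
print(np.array_equal(sigmoid_rule, sign_rule))        # True: identical predictions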
2. When will we need to use the sigmoid function in prediction?
You need the sigmoid function when the goal is not just a hard classification (Yes/No), but an estimation of uncertainty or risk.
- Probability Estimation: In many applications, such as medical diagnosis (e.g., predicting a heart attack) or financial forecasting, it is crucial to know the likelihood of an event, not just the binary outcome. A patient with a 99% probability of a heart attack requires different treatment than one with a 51% probability, even though a hard classifier would treat them both as "Positive".
- Weighted Cost / Risk Matrices: If the "cost" of making a mistake is not symmetric (e.g., a false negative is much worse than a false positive, as in CIA security clearance), you should not use the standard 0.5 threshold.
  - As shown in the textbook problems regarding risk matrices, the optimal threshold might shift (e.g., to 0.1 or 0.9) depending on the costs of a false accept and a false reject.
  - To apply a shifted threshold (e.g., "predict positive only if the probability exceeds 0.9"), you must compute the actual probability value using the sigmoid function $\theta(w^T x)$.
(d) Is the decision boundary still linear if the prediction rule is changed to predicting class 1 when $\theta(w^T x) \geq 0.9$?
Based on the properties of the linear signal and the logistic function described in the "Linear Models" slides and the Learning From Data textbook, the answer is Yes, the decision boundary is still linear.
1. Identify the Decision Boundary Equation
The decision boundary is the "line" (or hyperplane) that separates the region where you predict +1 from the region where you predict -1. With the new rule, the boundary is the set of points where $\theta(w^T x) = 0.9$.
2. Substitute the Logistic Function
Recall the definition of the logistic function: $\theta(w^T x) = \frac{1}{1 + e^{-w^T x}}$. Setting this equal to 0.9 gives $\frac{1}{1 + e^{-w^T x}} = 0.9$.
3. Solve for the Linear Signal ($w^T x$)
To see the shape of the boundary in the input space, we need to isolate the linear signal $w^T x$:
- Invert both sides: $$ 1 + e^{-w^T x} = \frac{1}{0.9} = \frac{10}{9} $$
- Subtract 1 from both sides: $$ e^{-w^T x} = \frac{10}{9} - 1 = \frac{1}{9} $$
- Take the natural logarithm ($\ln$) of both sides: $$ -w^T x = \ln\left(\frac{1}{9}\right) $$
- Simplify the logarithm ($\ln(1/9) = -\ln 9$): $$ -w^T x = -\ln(9) \quad\Rightarrow\quad w^T x = \ln(9) $$
4. Conclusion: It is a Linear Boundary
The final equation for the decision boundary is: $$ w^T x - \ln(9) = 0 $$
Since $\ln(9)$ is just a constant, this has the same form as $w^T x = c$: it is still a linear equation in $x$.
- Original Boundary (Threshold 0.5): $w^T x = 0$
- New Boundary (Threshold 0.9): $w^T x = \ln(9) \approx 2.197$
Geometrically, the equation $w^T x = \ln(9)$ describes a hyperplane parallel to the original boundary; raising the threshold only shifts the linear boundary, it does not bend it.
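A one-line numerical check that the signal value ln(9) really corresponds to a probability of 0.9:
import numpy as np
s = np.log(9)                  # the shifted threshold on the linear signal
print(1 / (1 + np.exp(-s)))    # 0.9 (up to floating-point error)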
(e) What is the essential property of logistic regression that results in the linear decision boundary?
The essential property that results in a linear decision boundary is that Logistic Regression models the probability as a monotonic function of a linear signal.
Here is the breakdown of the components you should mention in your answer:
- The Linear Signal ($s$): The core of the model is the "signal" $s$, which is a linear combination of the input features: $$s = w^T x = \sum_i w_i x_i$$ This part is linear in the input $x$.
- The Monotonic Transformation ($\theta$): The probability is computed by passing this linear signal through the logistic (sigmoid) function: $$P(y=1|x) = \theta(s) = \frac{1}{1+e^{-s}}$$ Crucially, this function is strictly monotonic (it squashes the signal from the range $(-\infty, \infty)$ into $(0, 1)$ without changing the order of the values).
- The Resulting Boundary: Because the function is monotonic, applying a threshold to the probability (e.g., $\theta(s) \geq 0.5$) is mathematically equivalent to applying a corresponding threshold to the linear signal (e.g., $s \geq 0$).
  - Equation: $w^T x = 0$ (or $w^T x = c$ for a different probability threshold).
  - Since $w^T x = c$ defines a hyperplane (a line in 2D), the decision boundary is linear.
Summary for your answer: You should state that the decision boundary is linear because logistic regression relies on a linear signal ($s = w^T x$) passed through a strictly monotonic sigmoid, so thresholding the probability reduces to thresholding the linear signal.
3. Sigmoid Logistic Regression
LogisticRegression.py
Function: _gradient
Based on (b) in the last problem, implement the function _gradient.
def _gradient(self, _x, _y):
"""Compute the gradient of cross-entropy with respect to self.W
for one training sample (_x, _y). This function is used in fit_*.
Args:
_x: An array of shape [n_features,].
_y: An integer. 1 or -1.
Returns:
_g: An array of shape [n_features,]. The gradient of
cross-entropy with respect to self.W.
"""
### YOUR CODE HERE
### END YOUR CODE
Based on the derivation we performed in the previous step (part b), we found that the gradient of the cross-entropy loss for a single sample is $\nabla E(w) = \frac{-y\, x}{1 + e^{y\, w^T x}}$.
Here is the implementation for the _gradient function in LogisticRegression.py, followed by a detailed explanation.
Implementation
def _gradient(self, _x, _y):
"""Compute the gradient of cross-entropy with respect to self.W
for one training sample (_x, _y). This function is used in fit_*.
Args:
_x: An array of shape [n_features,].
_y: An integer. 1 or -1.
Returns:
_g: An array of shape [n_features,]. The gradient of
cross-entropy with respect to self.W.
"""
### YOUR CODE HERE
# 1. Compute the exponent term: y * (w^T * x)
# self.W is the weight vector (w)
exponent = _y * np.dot(self.W, _x)
# 2. Compute the scalar coefficient: -y / (1 + e^(exponent))
# This represents how much "correction" is needed based on the prediction error.
coefficient = -_y / (1 + np.exp(exponent))
# 3. Compute the gradient vector: coefficient * x
_g = coefficient * _x
### END YOUR CODE
return _g
Explanation of the Steps
1. The Dot Product (np.dot(self.W, _x))
- Math: This calculates the linear signal $w^T x$.
- Context: As described in the slides and textbook, the linear signal is the weighted sum of the inputs. This determines the "score" before it is squashed into a probability by the logistic function.
2. The Exponent (exponent = _y * ...)
- Math: This calculates $y\, (w^T x)$.
- Context: This term represents the "margin."
  - If $y$ and the signal have the same sign (correct classification), this value is positive.
  - If they have different signs (incorrect classification), this value is negative.
3. The Denominator (1 + np.exp(exponent))
- Math: This calculates the term $1 + e^{y\, w^T x}$.
- Context: This is the scaling factor from the logistic function. As we derived in the previous question, the gradient of the loss involves the derivative of the logarithm and the sigmoid function, which results in this term in the denominator.
4. The Final Gradient (_g)
- Math: $\_g = \frac{-y\, x}{1 + e^{y\, w^T x}}$.
- Context:
  - The gradient points in the direction of steepest ascent (how to increase error). Since we want to minimize error, we will eventually subtract this gradient (or add the negative gradient) during the weight update step (Gradient Descent).
  - Notice that coefficient is a scalar (a single number) and _x is a vector. Multiplying them scales the input vector _x by the magnitude of the error. If the model is very confident and correct (the exponent is a large positive number), the coefficient approaches 0 (little update). If the model is wrong (the exponent is a large negative number), the coefficient approaches $-y$ (the largest possible update).
Function: fit_BGD
def fit_BGD(self, X, y):
"""Train perceptron model on data (X,y) with Batch Gradient Descent.
Args:
X: An array of shape [n_samples, n_features].
y: An array of shape [n_samples,]. Only contains 1 or -1.
Returns:
self: Returns an instance of self.
"""
n_samples, n_features = X.shape
### YOUR CODE HERE
### END YOUR CODE
return self
Implementation
def fit_BGD(self, X, y):
"""Train perceptron model on data (X,y) with Batch Gradient Descent.
Args:
X: An array of shape [n_samples, n_features].
y: An array of shape [n_samples,]. Only contains 1 or -1.
Returns:
self: Returns an instance of self.
"""
n_samples, n_features = X.shape
### YOUR CODE HERE
# 1. Initialize weights (usually to zeros or small random values)
self.W = np.zeros(n_features)
# 2. Loop over epochs (iterations)
for _ in range(self.max_iter):
# Initialize a gradient accumulator for the batch
batch_grad = np.zeros(n_features)
# 3. Compute the gradient for the entire dataset
for i in range(n_samples):
# Accumulate the gradient contribution from each sample
batch_grad += self._gradient(X[i], y[i])
# 4. Average the gradient
# The in-sample error is the average over N samples, so we divide the sum by N
batch_grad /= n_samples
# 5. Update the weights
# Move in the opposite direction of the gradient (Descent)
self.W -= self.learning_rate * batch_grad
### END YOUR CODE
return self
Explanation of the Steps
1. Initialization (self.W = np.zeros(n_features))
- Concept: Before training begins, the weight vector $w$ must be initialized.
- Why: For logistic regression with cross-entropy error, the error surface is convex (it has only one global valley), so initializing the weights to zero is a safe and common practice that works well.
2. The Epoch Loop (for _ in range(self.max_iter))
- Concept: Gradient descent is an iterative algorithm. Each pass through the loop represents one step of moving the weights down the error surface.
- Why: We repeat the update max_iter times (or until the gradient is close to zero) to reach the minimum of the in-sample error.
3. Accumulating the Gradient (batch_grad += ...)
- Concept: Batch Gradient Descent (BGD) is defined by using all training examples to compute the gradient direction for a single update step.
- Why: The gradient of the total in-sample error $E_{in}(w)$ is the average of the gradients of the individual examples. We calculate this by summing self._gradient(X[i], y[i]) for every $i$ from 0 to $N-1$.
4. Averaging (batch_grad /= n_samples)
- Concept: The formal definition of in-sample error includes a factor of $\frac{1}{N}$.
- Math: $E_{in}(w) = \frac{1}{N}\sum_{n=1}^{N} \ln\left(1 + e^{-y_n w^T x_n}\right)$. Therefore, the full gradient is the sum of the per-sample gradients divided by $N$.
5. Weight Update (self.W -= self.learning_rate * batch_grad)
- Concept: We update the weights by taking a step of size $\eta$ (the learning rate) in the direction of the negative gradient ($-\nabla E_{in}$).
- Math: $w \leftarrow w - \eta\, \nabla E_{in}(w)$. This moves the weight vector "downhill" towards the minimum error.
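As an aside, the same batch update can be written without the inner Python loop by vectorizing the per-sample gradients. A hedged standalone sketch (learning_rate and max_iter here are placeholder arguments, not the class attributes):
import numpy as np
def fit_bgd_vectorized(X, y, learning_rate=0.5, max_iter=1000):
    """Sketch: batch gradient descent for logistic regression with labels in {+1, -1}."""
    n_samples, n_features = X.shape
    W = np.zeros(n_features)
    for _ in range(max_iter):
        margins = y * (X @ W)                    # y_n * w^T x_n for every sample
        coeffs = -y / (1.0 + np.exp(margins))    # per-sample scalar factors
        grad = (X.T @ coeffs) / n_samples        # average gradient, shape (n_features,)
        W -= learning_rate * grad
    return W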
Function: fit_miniBGD
def fit_miniBGD(self, X, y, batch_size):
"""Train perceptron model on data (X,y) with mini-Batch Gradient Descent.
Args:
X: An array of shape [n_samples, n_features].
y: An array of shape [n_samples,]. Only contains 1 or -1.
batch_size: An integer.
Returns:
self: Returns an instance of self.
"""
### YOUR CODE HERE
### END YOUR CODE
return self
Based on the description provided in the prompt and the concepts of Stochastic Gradient Descent (SGD) discussed in the Learning From Data textbook, Mini-Batch Gradient Descent is a middle ground between Batch Gradient Descent (BGD) and SGD.
In BGD, you update the weights once per epoch (using all data). In SGD, you update the weights after every single sample. In Mini-Batch Gradient Descent, you update the weights after each small batch of batch_size samples.
Implementation
def fit_miniBGD(self, X, y, batch_size):
"""Train perceptron model on data (X,y) with mini-Batch Gradient Descent.
Args:
X: An array of shape [n_samples, n_features].
y: An array of shape [n_samples,]. Only contains 1 or -1.
batch_size: An integer.
Returns:
self: Returns an instance of self.
"""
n_samples, n_features = X.shape
### YOUR CODE HERE
# 1. Initialize weights
self.W = np.zeros(n_features)
# 2. Loop over epochs
for _ in range(self.max_iter):
# 3. Shuffle the data at the start of each epoch
# This ensures the batches are different every time, preventing cycles.
indices = np.arange(n_samples)
np.random.shuffle(indices)
X_shuffled = X[indices]
y_shuffled = y[indices]
# 4. Loop over the data in chunks of 'batch_size'
for start_idx in range(0, n_samples, batch_size):
# Handle the last batch, which might be smaller than batch_size
end_idx = min(start_idx + batch_size, n_samples)
# Slice the current mini-batch
X_batch = X_shuffled[start_idx:end_idx]
y_batch = y_shuffled[start_idx:end_idx]
current_batch_size = end_idx - start_idx
# 5. Compute the gradient for this specific batch
batch_grad = np.zeros(n_features)
for i in range(current_batch_size):
batch_grad += self._gradient(X_batch[i], y_batch[i])
# Average the gradient over the batch size
batch_grad /= current_batch_size
# 6. Update the weights immediately
self.W -= self.learning_rate * batch_grad
### END YOUR CODE
return self
Step-by-Step Explanation
1. Initialization (self.W = np.zeros)
- Just like in Batch Gradient Descent, we start with zero weights.
2. The Epoch Loop (for _ in range(self.max_iter))
- An epoch represents one full pass through the entire dataset. In BGD, one epoch equals one weight update. In Mini-Batch, one epoch contains roughly n_samples / batch_size weight updates.
3. Shuffling (np.random.shuffle)
- The Logic: The prompt instructs to "reshuffle your samples before each epoch."
- Why: If you feed the batches in the exact same order every time, the model might get stuck in loops or learn the order of the data rather than the features. Shuffling adds the necessary randomness (stochasticity) that helps the algorithm escape local minima and converge faster.
4. The Batch Loop (range(0, n_samples, batch_size))
- This loop steps through the shuffled data.
- Indices: start_idx is the beginning of the batch and end_idx is the end. We use min to ensure we don't go out of bounds on the very last batch (e.g., if you have 105 samples and a batch size of 10, the last batch will only have 5 samples).
5. Accumulating Gradients (batch_grad += ...)
- Instead of summing gradients for the entire dataset (like BGD), we only sum the gradients for the samples in the current X_batch.
- We use the self._gradient function written earlier to calculate the contribution of each individual sample in the batch.
- Averaging: We divide by current_batch_size to get the average gradient direction for this specific group of points.
6. Weight Update (self.W -= ...)
- The Critical Difference: Notice that the indentation of this line is inside the batch loop, not the epoch loop.
- In Batch GD, you update weights once after seeing all samples.
- In Mini-Batch GD, you update weights immediately after seeing one batch. This causes the weights to "wiggle" towards the solution more frequently, often resulting in faster learning than BGD because you don't have to wait for the whole dataset to make a move.
Function: fit_SGD
def fit_SGD(self, X, y):
"""Train perceptron model on data (X,y) with Stochastic Gradient Descent.
Args:
X: An array of shape [n_samples, n_features].
y: An array of shape [n_samples,]. Only contains 1 or -1.
Returns:
self: Returns an instance of self.
"""
### YOUR CODE HERE
### END YOUR CODE
return self
Based on the definition of Stochastic Gradient Descent (SGD) provided in the Learning From Data textbook, specifically the section on "Stochastic Gradient Descent," here is the implementation and detailed explanation.
SGD differs from Batch Gradient Descent in that it updates the weights using the gradient of a single training example $(x_n, y_n)$ at a time, visited in a random order, rather than the average gradient over the entire dataset.
Implementation
def fit_SGD(self, X, y):
"""Train perceptron model on data (X,y) with Stochastic Gradient Descent.
Args:
X: An array of shape [n_samples, n_features].
y: An array of shape [n_samples,]. Only contains 1 or -1.
Returns:
self: Returns an instance of self.
"""
n_samples, n_features = X.shape
### YOUR CODE HERE
# 1. Initialize weights
self.W = np.zeros(n_features)
# 2. Loop over epochs
# An epoch is one full pass through the data.
for _ in range(self.max_iter):
# 3. Shuffle the data
# Essential in SGD to ensure the model sees samples in a random order
indices = np.arange(n_samples)
np.random.shuffle(indices)
# 4. Iterate through every single sample
for i in indices:
# 5. Compute gradient for ONE sample (X[i], y[i])
# This uses the gradient formula derived earlier: -y*x / (1 + e^(y*w^T*x))
grad = self._gradient(X[i], y[i])
# 6. Update weights immediately
# w = w - learning_rate * gradient
self.W -= self.learning_rate * grad
### END YOUR CODE
return self
Detailed Explanation of Each Step
1. Initialization (self.W = np.zeros)
- Concept: We start with an initial guess for the weights.
- Context: While random initialization is generally safer to avoid symmetric traps (especially in neural networks), initializing to zero is common and effective for logistic regression.
2. The Epoch Loop (for _ in range(self.max_iter))
- Concept: In SGD, an "iteration" often refers to a single weight update (seeing one sample), while an epoch refers to a full pass through the entire dataset.
- Why:
self.max_iterhere usually controls the number of epochs (passes over the data) to ensure the model sees the data enough times to converge.
3. Shuffling (np.random.shuffle)
- Concept: We randomize the order in which we visit the data points.
- Why: SGD relies on the randomness ("stochastic") of the sample selection. If you feed the data in a fixed order (e.g., all class +1, then all class -1), the weights might oscillate or drift in cycles rather than converging. Shuffling simulates picking a training example $(x_n, y_n)$ uniformly at random.
4. The Inner Loop (for i in indices)
- Concept: We iterate through the shuffled indices one by one.
- Difference from BGD: In Batch Gradient Descent, the weights are updated only once the gradient has been accumulated over the whole dataset $X$. In SGD, you must loop through the samples individually and update after each one.
5. Gradient Calculation (grad = self._gradient(X[i], y[i]))
- Math: You compute $\nabla e_n(w)$, the gradient of the error for just the $n$-th data point.
- Formula: As derived in previous steps, this is $\nabla e_n(w) = \frac{-y_n x_n}{1 + e^{y_n w^T x_n}}$.
6. Weight Update (self.W -= self.learning_rate * grad)
- Math: $w \leftarrow w - \eta\, \nabla e_n(w)$.
- Concept: You update the weights immediately after calculating the error for that single point.
- Behavior: This makes the path of the weights "wiggly." Instead of moving directly down the true gradient (like BGD), the weights move based on what is good for just one example. On average, these random fluctuations cancel out, and the weights move toward the global minimum.
Function: predict_proba
def predict_proba(self, X):
"""Predict class probabilities for samples in X.
Args:
X: An array of shape [n_samples, n_features].
Returns:
preds_proba: An array of shape [n_samples, 2].
Only contains floats between [0,1].
"""
### YOUR CODE HERE
### END YOUR CODE
This function computes the probability $P(y = +1 \mid x) = \theta(w^T x)$ for each sample and returns the probabilities of both classes.
def predict_proba(self, X):
"""Predict class probabilities for samples in X.
Args:
X: An array of shape [n_samples, n_features].
Returns:
preds_proba: An array of shape [n_samples, 2].
Only contains floats between [0, 1].
"""
### YOUR CODE HERE
# 1. Compute the linear signal s = w^T * x
signal = np.dot(X, self.W)
# 2. Compute P(y=1|x) using the sigmoid function
prob_pos = 1 / (1 + np.exp(-signal))
# 3. Compute P(y=-1|x)
# Since probabilities sum to 1: P(y=-1) = 1 - P(y=1)
prob_neg = 1 - prob_pos
# 4. Stack them into a matrix of shape [n_samples, 2]
# Column 0 corresponds to the negative class (-1)
# Column 1 corresponds to the positive class (1)
preds_proba = np.column_stack((prob_neg, prob_pos))
return preds_proba
### END YOUR CODE
Explanation:
- Signal: We calculate the "linear signal" $s = Xw$ for all samples at once.
- Sigmoid: We apply the logistic function to transform the signal into a probability between 0 and 1. This represents the likelihood that the label is +1, i.e., $P(y = +1 \mid x) = \theta(w^T x)$.
- Column Stack: The function returns a 2D array where the first column represents the probability of the negative class and the second column represents the probability of the positive class.
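One practical caveat (an aside, not required by the assignment): for very negative signals, np.exp(-signal) can overflow and print warnings. A hedged way to get the same probabilities more safely is to clip the signal, or to use scipy.special.expit if SciPy is available:
import numpy as np
signal = np.array([-1000.0, -2.0, 0.0, 2.0, 1000.0])          # example signal values
prob_pos = 1 / (1 + np.exp(-np.clip(signal, -500, 500)))      # clip so exp never overflows
print(prob_pos)                                               # values in [0, 1], no warnings
# Alternative, if SciPy is installed:
# from scipy.special import expit
# prob_pos = expit(signal)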
Function: predict
def predict(self, X):
"""Predict class labels for samples in X.
Args:
X: An array of shape [n_samples, n_features].
Returns:
preds: An array of shape [n_samples,]. Only contains 1 or -1.
"""
### YOUR CODE HERE
### END YOUR CODE
This function returns the binary class labels. As discussed in the slides, the decision boundary is linear. We assign the class based on which probability is higher, which is equivalent to checking whether the linear signal is non-negative ($w^T x \geq 0$).
def predict(self, X):
"""Predict class labels for samples in X.
Args:
X: An array of shape [n_samples, n_features].
Returns:
preds: An array of shape [n_samples,]. Only contains 1 or -1.
"""
### YOUR CODE HERE
# 1. Compute the linear signal s = w^T * x
signal = np.dot(X, self.W)
# 2. Apply the threshold
# If w^T * x >= 0, then P(y=1|x) >= 0.5, so we predict 1.
# Otherwise, we predict -1.
preds = np.where(signal >= 0, 1, -1)
return preds
### END YOUR CODE
Explanation:
- Linear Decision: Instead of computing the expensive exponential for the probability and then checking whether it is $\geq 0.5$, we check the sign of the signal directly: $w^T x \geq 0$ implies $\theta(w^T x) \geq 0.5$.
- np.where: This efficiently vectorizes the conditional logic, returning 1 where the condition is true and -1 otherwise.
Function: score
def score(self, X, y):
"""Returns the mean accuracy on the given test data and labels.
Args:
X: An array of shape [n_samples, n_features].
y: An array of shape [n_samples,]. Only contains 1 or -1.
Returns:
score: A float. Mean accuracy of self.predict(X) w.r.t. y.
"""
### YOUR CODE HERE
### END YOUR CODE
This function calculates the mean accuracy of the model, quantifying how well the predictions match the true labels.
def score(self, X, y):
"""Returns the mean accuracy on the given test data and labels.
Args:
X: An array of shape [n_samples, n_features].
y: An array of shape [n_samples,]. Only contains 1 or -1.
Returns:
score: A float. Mean accuracy of self.predict(X) w.r.t. y.
"""
### YOUR CODE HERE
# 1. Generate predictions using the trained model
predictions = self.predict(X)
# 2. Compare predictions to true labels
# (predictions == y) creates a boolean array (True for correct, False for incorrect)
# np.mean converts True to 1 and False to 0, then calculates the average
score = np.mean(predictions == y)
return score
### END YOUR CODE
Explanation:
- Accuracy: This implements the standard definition of classification accuracy: the fraction of correctly classified data points.
main.py
Function: visualize_result
def visualize_result(X, y, W):
'''This function is used to plot the sigmoid model after training.
Args:
X: An array of shape [n_samples, 2].
y: An array of shape [n_samples,]. Only contains 1 or -1.
W: An array of shape [n_features,].
Returns:
No return. Save the plot to 'train_result_sigmoid.*' and include it
in submission.
'''
### YOUR CODE HERE
### END YOUR CODE
This function visualizes the decision boundary learned by the Logistic Regression model. As established in the sources, the decision boundary for a linear model (including Logistic Regression) is defined where the "signal" equals zero: $w^T x = 0$.
In a 2D feature space with an added bias term ($x_0 = 1$), this becomes $w_0 + w_1 x_1 + w_2 x_2 = 0$.
To plot this as a line, we solve for the vertical-axis feature ($x_2$, intensity): $x_2 = -\frac{w_1 x_1 + w_0}{w_2}$.
Implementation
import numpy as np
import matplotlib.pyplot as plt
def visualize_result(X, y, W):
'''This function is used to plot the sigmoid model after training.
Args:
X: An array of shape [n_samples, 2]. (Symmetry, Intensity)
y: An array of shape [n_samples,]. Only contains 1 or -1.
W: An array of shape [n_features,]. (Bias, w_symmetry, w_intensity)
Returns:
No return. Save the plot to 'train_result_sigmoid.*' and include it
in submission.
'''
### YOUR CODE HERE
# 1. Plot the data points (Same as visualize_features)
plt.figure(figsize=(8, 6))
# Separate positive (Digit 1) and negative (Digit 2) classes
X_pos = X[y == 1]
X_neg = X[y == -1]
# Plot scatter points
plt.scatter(X_pos[:, 0], X_pos[:, 1], color='blue', marker='o', label='Digit 1 (+1)')
plt.scatter(X_neg[:, 0], X_neg[:, 1], color='red', marker='x', label='Digit 2 (-1)')
# 2. Calculate the Decision Boundary
# The boundary is where w^T * x = 0 => w0 + w1*x1 + w2*x2 = 0
# We solve for x2 (Intensity): x2 = -(w1*x1 + w0) / w2
# Define a range for x1 (Symmetry) based on the data
x1_min, x1_max = np.min(X[:, 0]), np.max(X[:, 0])
x1_values = np.linspace(x1_min, x1_max, 100)
# Calculate corresponding x2 (Intensity) values
# W[0] is the bias weight, W[1] is the Symmetry weight, W[2] is the Intensity weight
x2_values = -(W[1] * x1_values + W[0]) / W[2]
# 3. Plot the Decision Boundary
plt.plot(x1_values, x2_values, color='green', linewidth=2, label='Decision Boundary')
# 4. Add labels, legend, and save
plt.xlabel('Symmetry')
plt.ylabel('Intensity')
plt.title('Logistic Regression: Training Data and Decision Boundary')
plt.legend()
plt.ylim(np.min(X[:, 1]) - 0.5, np.max(X[:, 1]) + 0.5) # restrict y-axis to data range
plt.savefig('train_result_sigmoid.png')
# plt.show()
### END YOUR CODE
Step-by-Step Explanation:
- Scatter Plot: We plot the 1s and 2s just like in the previous visualization function to show the distribution of the training data.
- Deriving the Line:
  - W contains 3 weights: W[0] (bias), W[1] (symmetry), W[2] (intensity).
  - We create a range of x-coordinates (x1_values) spanning the width of the data.
  - We use the derived formula $x_2 = -\frac{w_1 x_1 + w_0}{w_2}$ to generate the y-coordinates (x2_values).
- Plotting: We plot this line in green. Any point on one side of this line has probability $\theta(w^T x) \geq 0.5$ (prediction +1), and any point on the other side has probability $< 0.5$ (prediction -1).
Function: main()
Based on the principles of Validation and Testing discussed in Chapter 4 of the Learning From Data textbook (specifically Section 4.3 on Model Selection), here is the implementation for the final section of main().
This code performs hyperparameter tuning using the validation set, visualizes the best result, and finally estimates the out-of-sample error ($E_{out}$) on the held-out test set.
Implementation
# Explore different hyper-parameters.
### YOUR CODE HERE
# 1. Define a search space for hyperparameters (e.g., learning rate)
learning_rates = [0.01, 0.05, 0.1, 0.5, 1.0]
best_acc = 0
best_logisticR = None
print("--- Hyperparameter Tuning ---")
# 2. Iterate through hyperparameters to find the best model
for lr in learning_rates:
# Initialize model
model = logistic_regression(learning_rate=lr, max_iter=1000)
# Train on Training Set (using BGD as a baseline)
model.fit_BGD(train_X, train_y)
# Evaluate on Validation Set (NOT Test Set - avoid data snooping)
val_acc = model.score(valid_X, valid_y)
print(f"Learning Rate: {lr}, Validation Accuracy: {val_acc}")
# Track the best model
if val_acc > best_acc:
best_acc = val_acc
best_logisticR = model
print(f"Best Validation Accuracy: {best_acc} with LR: {best_logisticR.learning_rate}")
### END YOUR CODE
# Visualize your 'best' model after training.
### YOUR CODE HERE
# Pass the features (columns 1 and 2) and the learned weights (W)
# get_params() presumably returns self.W
visualize_result(train_X[:, 1:3], train_y, best_logisticR.get_params())
### END YOUR CODE
# Use the 'best' model above to do testing. Note that the test data should be loaded and processed in the same way as the training data.
### YOUR CODE HERE
print("--- Testing ---")
# 1. Load the raw test data
# We assume 'test_filename' is defined in the imports/config section similar to 'train_filename'
raw_test_data, test_labels = load_data(os.path.join(data_dir, test_filename))
# 2. Extract Features (Symmetry and Intensity)
# Must use the exact same prepare_X function as training
test_X_all = prepare_X(raw_test_data)
# 3. Filter for Class 1 and 2
# prepare_y returns indices for these classes
test_y_all, test_idx = prepare_y(test_labels)
# 4. Apply filter to features and labels
test_X = test_X_all[test_idx]
test_y = test_y_all[test_idx]
# 5. Convert labels to binary {-1, 1}
# Map label '2' to '-1' (Digit 1 is positive class)
test_y[test_y == 2] = -1
# 6. Evaluate the best model on the Test Set
test_acc = best_logisticR.score(test_X, test_y)
print(f"Final Test Accuracy: {test_acc}")
### END YOUR CODE
Explanation of Steps
- Hyperparameter Tuning (Validation):
  - Concept: As described in Chapter 4 (Validation), we cannot use the Test Set to choose parameters (like the learning rate) because that constitutes data snooping.
  - Action: We loop through different learning rates, train on `train_X`, and evaluate on `valid_X`. The model that performs best on the validation set is selected as `best_logisticR`.
- Visualization:
  - Concept: We visualize the decision boundary to verify whether the linear model effectively separates the two classes in the feature space (Symmetry vs. Intensity).
  - Action: We call `visualize_result` with the best model's weights. We pass `train_X[:, 1:3]` because column 0 of `train_X` is the bias term ($x_0 = 1$), while the scatter plot only needs the two features ($x_1$ = Symmetry, $x_2$ = Intensity).
- Testing:
  - Concept: The Test Set is used only once at the very end to estimate $E_{out}$ (out-of-sample error). This provides an unbiased estimate of how the model will perform on unseen data.
  - Action:
    - Consistency: It is crucial that the test data undergoes the exact same preprocessing as the training data. We call `prepare_X` to extract features and `prepare_y` to filter for digits 1 and 2.
    - Label Mapping: We must convert the raw labels (1 and 2) to the binary format (+1 and -1) used by the logistic regression logic.
    - Scoring: We calculate the final accuracy using `score()`.
4. Softmax Logistic Regression
LRM.py
Function: _gradient
def _gradient(self, _x, _y):
"""Compute the gradient of cross-entropy with respect to self.W
for one training sample (_x, _y). This function is used in fit_*.
Args:
_x: An array of shape [n_features,].
_y: One_hot vector.
Returns:
_g: An array of shape [n_features,]. The gradient of
cross-entropy with respect to self.W.
"""
### YOUR CODE HERE
### END YOUR CODE
Although the provided docstring states the return shape is [n_features,], Softmax Regression typically requires a weight matrix of shape [n_features, n_classes] (one weight vector per class). Consequently, the gradient should also be a matrix of the same shape. I will implement it returning the matrix, as this is mathematically required for the one-hot vector input.
Implementation
def _gradient(self, _x, _y):
"""Compute the gradient of cross-entropy with respect to self.W
for one training sample (_x, _y). This function is used in fit_*.
Args:
_x: An array of shape [n_features,].
_y: One_hot vector.
Returns:
_g: An array of shape [n_features,]. The gradient of
cross-entropy with respect to self.W.
"""
### YOUR CODE HERE
# Assumes self.W has shape (n_features, n_classes)
# Compute the Softmax probabilities
probs = self.softmax(_x)
# Compute the error term (prediction prob - target)
error = probs - _y
# probs is the vector of class probabilities
# Compute the gradient (outer product of the input x and the error vector)
_g = np.outer(_x, error) # shape: (n_features, k)
return _g
Detailed Step-by-Step Explanation
1. Compute the Linear Signal ($z = W^T x$)
- Math: In Softmax regression, we have a separate set of weights $w_k$ for each class $k$. We compute a "score" (logit) for each class by taking the dot product of the input feature vector `_x` and the weight vector for that class.
- Code: this happens inside `self.softmax(_x)`, which computes `s = _x @ self.W`. Since `self.W` is a matrix where each column represents a class, this operation calculates the scores for all classes simultaneously.
2. Apply the Softmax Function
- Math: To convert the scores $z$ into probabilities (values between 0 and 1 that sum to 1), we use the Softmax function: $$ P(y=k|x) = \frac{e^{z_k}}{\sum_{j} e^{z_j}} $$
- Code (inside `softmax`): `s - np.max(s)` is a standard numerical stability trick. Subtracting the maximum value doesn't change the result of the softmax (it cancels out in the numerator and denominator) but prevents `np.exp()` from returning `inf` for large scores. `soft_probs = exp_s / np.sum(exp_s)` then calculates the probability vector.
- Look at 03 - Logistic Regression - From Binary to Multi-Class#Multi-Class Logistic Regression
3. Compute the Error ($\hat{y} - y$)
- Math: The goal of training is to minimize the Cross-Entropy Loss: $$ E = - \sum_{k} y_k \ln(\hat{y}_k) $$ When you take the derivative of this loss with respect to the scores $z$, the math simplifies beautifully to just the difference between the prediction and the target: $$ \frac{\partial E}{\partial z} = \hat{y} - y $$
- Code: `error = probs - _y`. Since `_y` is a one-hot vector and `probs` is the predicted distribution (e.g., `[0.1, 0.7, 0.2]`), this calculates the gradient for every class output.
4. Compute the Gradient w.r.t. the Weights ($\nabla W$)
- Math: Using the chain rule, $\frac{\partial E}{\partial w_k} = \frac{\partial E}{\partial z_k} \cdot \frac{\partial z_k}{\partial w_k}$. Since $z_k = w_k^T x$, the derivative $\frac{\partial z_k}{\partial w_k}$ is just $x$. Therefore, the gradient for the weights is the input multiplied by the error term: $$ \nabla W = x (\hat{y} - y)^T $$
- Code: `_g = np.outer(_x, error)`. This creates a matrix where column $k$ is the input vector `_x` scaled by the error for class $k$. It matches the shape of `self.W` and indicates the direction to adjust the weights to reduce the error. A quick numerical check of this formula follows below.
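As a quick sanity check (a standalone sketch, not part of `LRM.py`; all sample values below are made up), the analytic gradient $x(\hat{y} - y)^T$ can be compared against a finite-difference estimate of the cross-entropy loss:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(W, x, y):
    # loss for a single sample with one-hot target y
    return -np.sum(y * np.log(softmax(x @ W)))

rng = np.random.default_rng(0)
x = rng.normal(size=3)                 # 3 made-up features
y = np.array([0.0, 1.0, 0.0])          # one-hot target: class 1
W = rng.normal(size=(3, 3))            # (n_features, n_classes)

analytic = np.outer(x, softmax(x @ W) - y)

numeric = np.zeros_like(W)
eps = 1e-6
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        numeric[i, j] = (cross_entropy(Wp, x, y) - cross_entropy(Wm, x, y)) / (2 * eps)

print(np.abs(analytic - numeric).max())   # ~1e-9: analytic and numeric gradients agree
```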
Function: fit_miniBGD
def fit_miniBGD(self, X, labels, batch_size):
"""Train perceptron model on data (X,y) with mini-Batch GD.
Args:
X: An array of shape [n_samples, n_features].
labels: An array of shape [n_samples,]. Only contains 0,..,k-1.
batch_size: An integer.
Returns:
self: Returns an instance of self.
Hint: the labels should be converted to one-hot vectors, for example: 1----> [0,1,0]; 2---->[0,0,1].
"""
### YOUR CODE HERE
### END YOUR CODE
Implementation
def fit_miniBGD(self, X, labels, batch_size):
"""Train perceptron model on data (X,y) with mini-Batch GD.
Args:
X: An array of shape [n_samples, n_features].
labels: An array of shape [n_samples,]. Only contains 0,..,k-1.
batch_size: An integer.
Returns:
self: Returns an instance of self.
Hint: the labels should be converted to one-hot vectors, for example: 1----> [0,1,0]; 2----> [0,0,1].
"""
### YOUR CODE HERE
n_samples, n_features = X.shape
# 1. Initialize Weights
# Unlike binary logistic regression (vector), Softmax uses a weight matrix.
# Shape: (n_features, k) where k is the number of classes.
self.W = np.zeros((n_features, self.k))
# 2. One-Hot Encoding of Labels
# We need to convert integer labels (e.g., 2) into one-hot vectors (e.g., [0, 0, 1]).
# Create a matrix of zeros of shape (n_samples, k)
y_one_hot = np.zeros((n_samples, self.k))
# Set the index corresponding to the label to 1 for each sample
y_one_hot[np.arange(n_samples), labels] = 1
# 3. Loop over epochs
for _ in range(self.max_iter):
# 4. Shuffle data
# Essential for Mini-Batch/SGD to ensure batches are random representatives
perm_indices = np.random.permutation(n_samples)
X_shuffled = X[perm_indices]
y_shuffled = y_one_hot[perm_indices]
# 5. Iterate over mini-batches
for start_idx in range(0, n_samples, batch_size):
# Handle the last batch (might be smaller than batch_size)
end_idx = min(start_idx + batch_size, n_samples)
# Slice the batch
X_batch = X_shuffled[start_idx:end_idx]
y_batch = y_shuffled[start_idx:end_idx]
current_batch_size = end_idx - start_idx
# 6. Accumulate Gradients for this batch
# Initialize a zero gradient matrix of shape (n_features, k)
batch_grad_sum = np.zeros((n_features, self.k))
# Sum the gradients for every sample in the batch
for i in range(current_batch_size):
# Use the _gradient helper implemented previously
# It returns the gradient contribution of a single sample
batch_grad_sum += self._gradient(X_batch[i], y_batch[i])
# 7. Average the gradient
# The gradient of the loss is the average over the batch samples
avg_grad = batch_grad_sum / current_batch_size
# 8. Update Weights
# Move W in the opposite direction of the gradient
self.W -= self.learning_rate * avg_grad
### END YOUR CODE
return self
Detailed Explanation of Each Step
- Initialize Weights (`self.W = np.zeros(...)`):
  - Concept: In Softmax Regression, we have a separate weight vector for each class $k$. Therefore, `self.W` becomes a matrix of size `[n_features, k]`. Initializing to zeros is a standard starting point for convex problems like logistic regression.
- One-Hot Encoding:
  - Math: The mathematical derivation of the cross-entropy gradient relies on the target $y$ being a probability vector. If the true class is $c$, the target vector should have a 1 at index $c$ and 0 elsewhere.
  - Code: `y_one_hot[np.arange(n_samples), labels] = 1` efficiently creates this matrix. If sample 0 has label 2, this sets `y_one_hot[0, 2] = 1` (see the small sketch after this list).
- Shuffle (`np.random.permutation`):
  - Reason: As discussed in Chapter 3 regarding SGD, randomness is crucial. If we process the data in the same order every epoch, the model might learn the sequence of the data rather than the features, or get stuck in cycles. Shuffling ensures that each mini-batch is a random sample of the true population.
- Batch Loop:
  - We iterate through the dataset in chunks of size `batch_size`. The logic `min(start_idx + batch_size, n_samples)` ensures we don't crash when the dataset size isn't perfectly divisible by the batch size (handling the last "remainder" batch).
- Accumulate Gradients:
  - Math: The gradient of the error for a batch is the sum (or average) of the gradients of the individual samples in that batch.
  - Code: We loop through the current batch, call `self._gradient` (which you implemented in the previous step to calculate $x(\hat{y} - y)^T$), and add it to `batch_grad_sum`.
- Average Gradient:
  - Math: We divide by `current_batch_size`. This normalizes the update so that the magnitude of the step doesn't depend on how many samples are in the batch. This effectively estimates the "true" gradient of the expected error.
- Weight Update:
  - Math: $W \leftarrow W - \eta \, \nabla W$, where $\eta$ is the learning rate.
  - We subtract the scaled gradient to move down the error surface toward the minimum. Note that `self.W` and `avg_grad` are both matrices of shape `[n_features, k]`, so this updates all class weights simultaneously.
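The one-hot encoding trick referenced above, as a minimal standalone sketch (the labels here are made up):

```python
import numpy as np

labels = np.array([0, 2, 1, 2])              # made-up integer labels, k = 3 classes
one_hot = np.zeros((labels.size, 3))
one_hot[np.arange(labels.size), labels] = 1  # set a 1 at the column given by each label
print(one_hot)
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]
#  [0. 0. 1.]]
```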
Function: softmax
def softmax(self, x):
"""Compute softmax values for each sets of scores in x."""
### You must implement softmax by youself, otherwise you will not get credits for this part.
### YOUR CODE HERE
# Compute the linear scores: s = x @ W
# Single sample: (n_features,) @ (n_features, k) -> (k,)
# Batch:         (n_samples, n_features) @ (n_features, k) -> (n_samples, k)
s = x @ self.W
# Numerical stability: subtract the max score (per row) so np.exp cannot overflow;
# this does not change the softmax output.
s = s - np.max(s, axis=-1, keepdims=True)
# Softmax: exp(s) / sum(exp(s)), normalized over the class axis
exp_s = np.exp(s)
soft_probs = exp_s / np.sum(exp_s, axis=-1, keepdims=True)
return soft_probs
### END YOUR CODE
Given a $K$-dimensional score vector $z = W^T x$, the softmax function maps the scores to a probability vector:
$$ \text{softmax}(z)_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}} $$
The multi-class prediction function can be written as
$$ \hat{y} = \arg\max_{k} \ \text{softmax}(z)_k = \arg\max_{k} \ w_k^T x $$
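A quick standalone illustration of the stability trick (the scores here are made up): exponentiating large raw scores overflows, while subtracting the maximum first gives the same probabilities without overflow.

```python
import numpy as np

z = np.array([1000.0, 1001.0, 1002.0])    # made-up large scores
# np.exp(z) would overflow to inf, producing nan probabilities.
stable = np.exp(z - np.max(z)) / np.sum(np.exp(z - np.max(z)))
print(stable)                             # [0.09003057 0.24472847 0.66524096], sums to 1
```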
Function: predict
def predict(self, X):
"""Predict class labels for samples in X.
Args:
X: An array of shape [n_samples, n_features].
Returns:
preds: An array of shape [n_samples,]. Only contains 0,..,k-1.
"""
### YOUR CODE HERE
### END YOUR CODE
Implementation
def predict(self, X):
"""Predict class labels for samples in X.
Args:
X: An array of shape [n_samples, n_features].
Returns:
preds: An array of shape [n_samples,]. Only contains 0,..,k-1.
"""
### YOUR CODE HERE
# 1. Compute the linear scores (logits) for all classes
# X shape: (n_samples, n_features)
# self.W shape: (n_features, k)
# Result scores shape: (n_samples, k)
scores = np.dot(X, self.W)
# 2. Determine the predicted class
# The predicted class is the index of the highest score.
# We assume the index corresponds to the label (0, 1, ..., k-1).
# axis=1 means we find the max across columns (classes) for each row.
preds = np.argmax(scores, axis=1)
return preds
### END YOUR CODE
- Compute Scores (`np.dot(X, self.W)`):
  - Concept: As discussed in the "Linear Models" slides, the core of the classifier is the linear signal. In the multiclass case, we have $k$ signals, one for each class. We compute the score for every class simultaneously using matrix multiplication.
  - Why not Softmax?: While Softmax converts these scores into probabilities (0 to 1), the order of the values does not change because the exponential function is strictly monotonic. Therefore, the class with the highest probability is always the class with the highest linear score.
  - Efficiency: We skip the expensive `np.exp()` calculation required for Softmax because `argmax(scores)` yields the exact same prediction as `argmax(softmax(scores))`, but faster (a quick check follows below).
- Argmax (`np.argmax(scores, axis=1)`):
  - Concept: The decision rule for classification is to pick the most probable class.
  - Implementation: `scores` is a matrix where row $i$ contains the scores for sample $i$ across all classes. `np.argmax(..., axis=1)` returns the index of the maximum value in each row. Since the labels are `0, ..., k-1`, these indices are exactly the predicted class labels.
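A small standalone check of the argmax claim (the scores are made up): softmax preserves the ordering of the scores, so both argmax calls agree.

```python
import numpy as np

scores = np.array([[ 2.0, -1.0, 0.5],
                   [ 0.1,  0.3, 0.2]])     # made-up scores for 2 samples, 3 classes
soft = np.exp(scores) / np.sum(np.exp(scores), axis=1, keepdims=True)
print(np.argmax(scores, axis=1))           # [0 1]
print(np.argmax(soft, axis=1))             # [0 1]  -- identical predictions
```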
Function: score
def score(self, X, labels):
"""Returns the mean accuracy on the given test data and labels.
Args:
X: An array of shape [n_samples, n_features].
labels: An array of shape [n_samples,]. Only contains 0,..,k-1.
Returns:
score: An float. Mean accuracy of self.predict(X) wrt. labels.
"""
### YOUR CODE HERE
### END YOUR CODE
Implementation
def score(self, X, labels):
"""Returns the mean accuracy on the given test data and labels.
Args:
X: An array of shape [n_samples, n_features].
labels: An array of shape [n_samples,]. Only contains 0,..,k-1.
Returns:
score: An float. Mean accuracy of self.predict(X) wrt. labels.
"""
### YOUR CODE HERE
# 1. Generate predictions using the function implemented above
predictions = self.predict(X)
# 2. Calculate accuracy
# Compare predictions to ground truth labels.
# np.mean converts boolean (True/False) to float (1.0/0.0) and averages them.
score = np.mean(predictions == labels)
return score
### END YOUR CODE
- Generate Predictions:
  - We simply call `self.predict(X)` to get the vector of predicted labels for the test set.
- Calculate Accuracy (`np.mean(predictions == labels)`):
  - Concept: This implements the standard accuracy metric defined in the problem description and implied in the "Error Measures" section of the text. It represents the fraction of correctly classified examples.
  - Operation: `predictions == labels` creates a boolean array (e.g., `[True, False, True]`). `np.mean` treats `True` as 1 and `False` as 0, so the average is the fraction of correct answers (e.g., $2/3 \approx 0.667$), as in the toy example below.
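For instance (a made-up toy case):

```python
import numpy as np

predictions = np.array([1, 0, 2])
labels      = np.array([1, 2, 2])
print(np.mean(predictions == labels))   # 0.666...  (2 of 3 correct)
```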
main.py
Function: visualize_result_multi
def visualize_result_multi(X, y, W):
'''This function is used to plot the softmax model after training.
Args:
X: An array of shape [n_samples, 2].
y: An array of shape [n_samples,]. Only contains 0,1,2.
W: An array of shape [n_features, 3].
Returns:
No return. Save the plot to 'train_result_softmax.*' and include it
in submission.
'''
### YOUR CODE HERE
### END YOUR CODE
Based on the principles of Softmax Regression (a multiclass linear model) discussed in the course notes, the decision boundary is determined by comparing the "scores" of each class: a point $x$ is assigned to class $k$ when $w_k^T x$ is the largest of the scores.
Because there are 3 classes (0, 1, 2) and multiple linear boundaries interacting (Class 0 vs 1, 1 vs 2, 0 vs 2), the best way to visualize the result is to draw the decision regions. We do this by creating a fine grid of points over the plot area, predicting the class for every point on that grid, and coloring the background based on those predictions.
Here is the implementation and a detailed step-by-step explanation.
Implementation
import matplotlib.pyplot as plt
def visualize_result_multi(X, y, W):
'''This function is used to plot the softmax model after training.
Args:
X: An array of shape [n_samples, 2]. (Symmetry, Intensity)
y: An array of shape [n_samples,]. Only contains 0, 1, 2.
W: An array of shape [n_features, 3]. (Bias + 2 features x 3 classes)
Returns:
No return. Save the plot to 'train_result_softmax.png' and include it
in submission.
'''
### YOUR CODE HERE
plt.figure(figsize=(8, 6))
# 1. Create a Meshgrid
# Define the boundaries of the plot based on the data range, adding some padding.
x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
# Generate a grid of points with 0.02 distance between them
h = 0.02
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
# 2. Prepare the Grid for Prediction
# Flatten the grid matrices into a single list of (x, y) coordinates
grid_points = np.c_[xx.ravel(), yy.ravel()]
# Add the bias term (column of 1s) to match the shape of W (3 rows)
# grid_points becomes shape [n_grid_points, 3]
grid_points_bias = np.c_[np.ones(grid_points.shape[0]), grid_points]
# 3. Predict the Class for Every Point on the Grid
# Compute scores: Z = X * W. Shape: [n_grid_points, 3]
scores = np.dot(grid_points_bias, W)
# Select class with highest score. Shape: [n_grid_points,]
Z = np.argmax(scores, axis=1)
# Reshape the predictions back to the shape of the meshgrid
Z = Z.reshape(xx.shape)
# 4. Plot the Decision Regions (Contours)
# Use contourf to fill the regions with colors corresponding to classes
plt.contourf(xx, yy, Z, alpha=0.4, cmap=plt.cm.Paired)
# 5. Plot the Training Data Points
# Scatter plot the actual data points, colored by their true label 'y'
# We loop through classes to add a legend
colors = ['blue', 'red', 'green']
labels = ['Digit 0', 'Digit 1', 'Digit 2']
for i in range(3):
# Select points belonging to class i
idx = np.where(y == i)
plt.scatter(X[idx, 0], X[idx, 1], c=colors[i], label=labels[i], edgecolor='k', s=20)
# 6. Formatting and Saving
plt.xlabel('Symmetry')
plt.ylabel('Intensity')
plt.title('Softmax Regression Decision Boundaries')
plt.legend()
plt.savefig('train_result_softmax.png')
# plt.show()
### END YOUR CODE
Explanation of Each Step
- Create a Meshgrid (`np.meshgrid`):
  - Context: Unlike the binary case, where we could solve a single equation $w^T x = 0$ to get a line, here we have three boundaries. The region for Class 0 is where $w_0^T x > w_1^T x$ and $w_0^T x > w_2^T x$.
  - Action: Instead of calculating lines mathematically, we define a "mesh" (a fine net) covering the entire area where the data points sit. We will ask the model to classify every intersection point on this net.
- Prepare Grid for Prediction (`grid_points_bias`):
  - Context: The model weights `W` have shape `[3, 3]` (Bias, Feature1, Feature2 for 3 classes).
  - Action: The meshgrid gives us just the 2D coordinates ($x_1, x_2$). We flatten this into a long list of points and add a column of 1s ($x_0 = 1$) so that we can perform the matrix multiplication with `W`.
- Predict Class (`np.argmax`):
  - Math: For every point on the grid, we calculate 3 scores: $s_k = w_k^T x$ for $k = 0, 1, 2$.
  - Logic: The predicted class is the index of the maximum score.
  - Result: `Z` contains the predicted class (0, 1, or 2) for every pixel in our plot background.
- Plot Contours (`plt.contourf`):
  - Action: This function draws filled regions. It looks at the predictions in `Z` and colors the area blue where the prediction is 0, brown/red where it is 1, etc. This visualizes the decision surface and the linear boundaries separating the classes.
- Plot Training Data:
  - Action: Finally, we overlay the actual data points `X`. If the model is good, the blue points (Digit 0) should mostly fall inside the blue region, the red points (Digit 1) inside the red region, and so on. A shape walk-through of the grid pipeline is sketched below.
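A standalone shape walk-through of the grid pipeline (tiny grid, random stand-in weights; none of these values come from the trained model):

```python
import numpy as np

xx, yy = np.meshgrid(np.linspace(-1, 1, 4), np.linspace(-1, 1, 3))
grid = np.c_[xx.ravel(), yy.ravel()]              # (12, 2): every grid coordinate pair
grid_bias = np.c_[np.ones(grid.shape[0]), grid]   # (12, 3): bias column prepended
W = np.random.default_rng(0).normal(size=(3, 3))  # stand-in weight matrix
Z = np.argmax(grid_bias @ W, axis=1).reshape(xx.shape)
print(xx.shape, grid_bias.shape, Z.shape)         # (3, 4) (12, 3) (3, 4)
```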
Function: main()
# ------------Logistic Regression Multiple-class case, let k= 3------------
###### Use all data from '0' '1' '2' for training
train_X = train_X_all
train_y = train_y_all
valid_X = valid_X_all
valid_y = valid_y_all
######### miniBGD for multiclass Logistic Regression
logisticR_classifier_multiclass = logistic_regression_multiclass(learning_rate=0.5, max_iter=100, k= 3)
logisticR_classifier_multiclass.fit_miniBGD(train_X, train_y, 10)
print(logisticR_classifier_multiclass.get_params())
print(logisticR_classifier_multiclass.score(train_X, train_y))
# Explore different hyper-parameters.
### YOUR CODE HERE
### END YOUR CODE
# Visualize the your 'best' model after training.
# visualize_result_multi(train_X[:, 1:3], train_y, best_logistic_multi_R.get_params())
# Use the 'best' model above to do testing.
### YOUR CODE HERE
### END YOUR CODE
Implementation
# ------------Logistic Regression Multiple-class case, let k= 3------------
###### Use all data from '0' '1' '2' for training
# Unlike the binary case, we do not filter for specific digits here.
# We use the full dataset processed earlier.
train_X = train_X_all
train_y = train_y_all
valid_X = valid_X_all
valid_y = valid_y_all
######### miniBGD for multiclass Logistic Regression
# Initial test run to verify the code works
logisticR_classifier_multiclass = logistic_regression_multiclass(learning_rate=0.5, max_iter=100, k= 3)
logisticR_classifier_multiclass.fit_miniBGD(train_X, train_y, 10)
print("Initial weights:", logisticR_classifier_multiclass.get_params())
print("Initial Train Score:", logisticR_classifier_multiclass.score(train_X, train_y))
# Explore different hyper-parameters.
### YOUR CODE HERE
print("\n--- Multiclass Hyperparameter Tuning ---")
learning_rates = [0.01, 0.05, 0.1, 0.5, 1.0]
best_multi_acc = -1
best_logistic_multi_R = None
# 1. Grid Search for the best Hyperparameters
for lr in learning_rates:
# Initialize the multiclass model
# We assume k=3 for digits 0, 1, 2
model = logistic_regression_multiclass(learning_rate=lr, max_iter=1000, k=3)
# Train using Mini-Batch Gradient Descent
# Batch size is set to 10, but could also be tuned
model.fit_miniBGD(train_X, train_y, batch_size=10)
# Evaluate on the VALIDATION set
val_acc = model.score(valid_X, valid_y)
print(f"LR: {lr}, Validation Accuracy: {val_acc}")
# Select the best model
if val_acc > best_multi_acc:
best_multi_acc = val_acc
best_logistic_multi_R = model
print(f"Best Multiclass Model LR: {best_logistic_multi_R.learning_rate} with Val Acc: {best_multi_acc}")
### END YOUR CODE
# Visualize the your 'best' model after training.
# We pass the features (cols 1 and 2) and the learned weight matrix W.
visualize_result_multi(train_X[:, 1:3], train_y, best_logistic_multi_R.get_params())
# Use the 'best' model above to do testing.
### YOUR CODE HERE
print("\n--- Multiclass Testing ---")
# 1. Load the raw test data
raw_test_data, test_labels = load_data(os.path.join(data_dir, test_filename))
# 2. Extract Features
# Use the same function as training to get Bias, Symmetry, Intensity
test_X_all = prepare_X(raw_test_data)
# 3. Prepare Labels
# prepare_y usually returns the mapped labels and indices.
# For multiclass (0, 1, 2), we typically use all data returned.
test_y_all, _ = prepare_y(test_labels)
# 4. Score the best model on the Test Set
# This provides the unbiased estimate of E_out (Out-of-sample error)
test_acc = best_logistic_multi_R.score(test_X_all, test_y_all)
print(f"Final Multiclass Test Accuracy: {test_acc}")
### END YOUR CODE
Detailed Explanation of Each Step
- Data Setup (`train_X = train_X_all`):
  - Context: In the binary section, you likely filtered `train_X_all` to keep only digits 1 and 2.
  - Action: For the multiclass section, we revert to using `train_X_all`, which contains features for all available classes (digits 0, 1, and 2) as prepared by your `prepare_X` function.
- Hyperparameter Tuning (The Loop):
  - Concept: As discussed in Chapter 4 regarding Model Selection, we use the Validation Set (`valid_X`, `valid_y`) to select the best learning rate.
  - Action: We train multiple instances of `logistic_regression_multiclass` with different learning rates using `fit_miniBGD`. We select the one that achieves the highest accuracy on the validation set.
- Visualization (`visualize_result_multi`):
  - Concept: Visualizing the decision boundaries helps verify that the model has learned meaningful geometric separations between the classes.
  - Action: We call the function you implemented earlier.
    - `train_X[:, 1:3]`: The visualization function expects a 2D feature array (Symmetry vs. Intensity). Since `train_X` has a bias column of 1s at index 0, we slice it to pass only columns 1 and 2.
    - `best_logistic_multi_R.get_params()`: This passes the learned weight matrix $W$ (shape `[3, 3]`) to the visualizer so it can compute scores and draw the regions.
- Testing:
  - Concept: This step estimates $E_{out}$ (out-of-sample error). Per the "Three Learning Principles" (Chapter 5) and "Training versus Testing" (Chapter 2), the test set must be used only once at the very end.
  - Action:
    - Loading: We load the test data from the file.
    - Preprocessing: We must apply the exact same `prepare_X` and `prepare_y` transformations used on the training data. This ensures the test feature matrix has the same dimensions (bias + 2 features) and format as the model expects.
    - Scoring: We calculate the accuracy of our `best_logistic_multi_R` model on this fresh data. This number is your final performance metric.
Here is the implementation for the final part of your assignment.
This section compares Softmax Regression (configured for $k = 2$) with the binary Sigmoid Logistic Regression model on digits 1 and 2, to show that the two formulations are equivalent.
Implementation
# ------------Connection between sigmoid and softmax------------
############ Now set k=2, only use data from '1' and '2'
##### set labels to 0,1 for softmax classifer
train_X = train_X_all[train_idx]
train_y = train_y_all[train_idx]
train_X = train_X[0:1350]
train_y = train_y[0:1350]
valid_X = valid_X_all[val_idx]
valid_y = valid_y_all[val_idx]
# Softmax requires labels 0 and 1.
# Currently we have 1 and 2. Map 2 -> 0, and 1 -> 1 (implicit).
train_y[np.where(train_y==2)] = 0
valid_y[np.where(valid_y==2)] = 0
###### First, fit softmax classifer until convergence, and evaluate
##### Hint: we suggest to set the convergence condition as "np.linalg.norm(gradients*1./batch_size) < 0.0005" or max_iter=10000:
### YOUR CODE HERE
# Initialize Softmax model with k=2
# We use a large max_iter=10000 to ensure convergence as suggested in the hint
softmax_model = logistic_regression_multiclass(learning_rate=0.05, max_iter=10000, k=2)
# Train using Mini-Batch Gradient Descent
softmax_model.fit_miniBGD(train_X, train_y, batch_size=10)
# Evaluate
train_acc_soft = softmax_model.score(train_X, train_y)
valid_acc_soft = softmax_model.score(valid_X, valid_y)
print(f"Softmax (k=2) - Train Acc: {train_acc_soft:.4f}, Valid Acc: {valid_acc_soft:.4f}")
### END YOUR CODE
train_X = train_X_all[train_idx]
train_y = train_y_all[train_idx]
train_X = train_X[0:1350]
train_y = train_y[0:1350]
valid_X = valid_X_all[val_idx]
valid_y = valid_y_all[val_idx]
##### set lables to -1 and 1 for sigmoid classifer
### YOUR CODE HERE
# Sigmoid (Binary) requires labels -1 and 1.
# Currently we have 1 and 2. Map 2 -> -1.
train_y[train_y == 2] = -1
valid_y[valid_y == 2] = -1
### END YOUR CODE
###### Next, fit sigmoid classifer until convergence, and evaluate
##### Hint: we suggest to set the convergence condition as "np.linalg.norm(gradients*1./batch_size) < 0.0005" or max_iter=10000:
### YOUR CODE HERE
# Initialize Binary Logistic Regression model
sigmoid_model = logistic_regression(learning_rate=0.05, max_iter=10000)
# Train using Mini-Batch Gradient Descent (to match the Softmax comparison)
sigmoid_model.fit_miniBGD(train_X, train_y, batch_size=10)
# Evaluate
train_acc_sig = sigmoid_model.score(train_X, train_y)
valid_acc_sig = sigmoid_model.score(valid_X, valid_y)
print(f"Sigmoid (Binary) - Train Acc: {train_acc_sig:.4f}, Valid Acc: {valid_acc_sig:.4f}")
### END YOUR CODE
################Compare and report the observations/prediction accuracy
print("\n--- Observations ---")
print(f"Difference in Validation Accuracy: {abs(valid_acc_soft - valid_acc_sig):.6f}")
# Optional: Compare weights
# Softmax has 2 weight vectors (one for class 0, one for class 1).
# Sigmoid has 1 weight vector (distinguishing class 1 from -1).
# The decision boundary for Softmax is (w_1 - w_0)^T x = 0.
# The decision boundary for Sigmoid is w^T x = 0.
# Therefore, (w_1 - w_0) from Softmax should be proportional to w from Sigmoid.
softmax_W = softmax_model.get_params() # Shape (d, 2)
sigmoid_W = sigmoid_model.get_params() # Shape (d,)
# Calculate effective weight for softmax (w_class1 - w_class0)
# Note: In our mapping, label 1 is column 1, label 0 (old 2) is column 0.
effective_softmax_W = softmax_W[:, 1] - softmax_W[:, 0]
print("Shape of Softmax Weights:", softmax_W.shape)
print("Shape of Sigmoid Weights:", sigmoid_W.shape)
# Cosine similarity to check alignment
cosine_sim = np.dot(effective_softmax_W, sigmoid_W) / (np.linalg.norm(effective_softmax_W) * np.linalg.norm(sigmoid_W))
print(f"Alignment (Cosine Similarity) between models: {cosine_sim:.4f}")
# ------------End------------
Explanation of Observations
- Prediction Accuracy: You should observe that the validation accuracy for both models is identical (or extremely close, allowing for small numerical differences in optimization).
- Theoretical Connection:
  - Sigmoid: Predicts $P(y=+1 \mid x) = \theta(w^T x) = \frac{1}{1 + e^{-w^T x}}$. The decision boundary is $w^T x = 0$.
  - Softmax ($k = 2$): Predicts $P(y=1 \mid x) = \frac{e^{w_1^T x}}{e^{w_0^T x} + e^{w_1^T x}} = \theta\big((w_1 - w_0)^T x\big)$. The decision boundary is $(w_1 - w_0)^T x = 0$.
  - Conclusion: Softmax with $k = 2$ is mathematically equivalent to Sigmoid Logistic Regression. The effective weight vector in Softmax ($w_1 - w_0$) plays the same role as the single weight vector $w$ in Sigmoid. The small numeric check below illustrates this.
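A small numeric illustration of this equivalence (random weights and input, nothing taken from the trained models): the softmax probability of class 1 with two weight columns equals the sigmoid applied to the effective weight vector $w_1 - w_0$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)                          # made-up feature vector (bias included)
W = rng.normal(size=(3, 2))                     # softmax weights: columns w_0, w_1
z = x @ W
p_softmax = np.exp(z[1]) / np.sum(np.exp(z))    # P(class 1) under softmax
w_eff = W[:, 1] - W[:, 0]
p_sigmoid = 1.0 / (1.0 + np.exp(-(x @ w_eff)))  # sigmoid with the effective weights
print(p_softmax, p_sigmoid)                     # identical up to floating-point error
```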