HW1 - Linear Models for Handwritten Digits Classification

Overview:
In this assignment, you will implement a binary logistic regression model and a multi-class logistic regression model on a partial dataset from MNIST. In this classification task, the model takes a 16×16 image of a handwritten digit as input and assigns it to a class. For the binary case, the classes are 1 and 2, while for the multi-class case, the classes are 0, 1, and 2. The "data" folder contains the dataset, which has already been split into a training set and a testing set. All data examples are saved in dictionary-like objects in ".npz" files. For each data sample, the key 'x' holds the raw features, a 256-dimensional vector of grayscale pixel values in [-1, 1] for a 16×16 image. The key 'y' is the label of the example, which can be 0, 1, or 2. The "code" folder provides the starter code. You must implement the models using the starter code.

File structure:

code/
	DataReader.py
	LogisticRegression.py
	LRM.py
	main.py
	
data/
	test.npz
	training.npz

1. Data Preprocessing

Function: load_data(filename)

def load_data(filename):
    """Load a given txt file.

    Args:
        filename: A string.

    Returns:
        raw_data: An array of shape [n_samples, 256].
        labels : An array of shape [n_samples,].
        
    """
    data= np.load(filename)
    x= data['x']
    y= data['y']
    return x, y

What does it do?

It calls np.load on the given .npz file and returns the raw feature array x of shape [n_samples, 256] and the label array y of shape [n_samples,].

Function: train_valid_split(raw_data, labels, split_index)

def train_valid_split(raw_data, labels, split_index):
	"""Split the original training data into a new training dataset
	and a validation dataset.
	n_samples = n_train_samples + n_valid_samples

	Args:
		raw_data: An array of shape [n_samples, 256].
        labels : An array of shape [n_samples,].
		split_index: An integer.

	"""
	return raw_data[:split_index], raw_data[split_index:], labels[:split_index], labels[split_index:]

What does it do?

The slicing (this is the key part):

return raw_data[:split_index], raw_data[split_index:], \
       labels[:split_index], labels[split_index:]

Let’s break that apart.

Features:

raw_data[:split_index]      # training data
raw_data[split_index:]      # validation data

Labels:

labels[:split_index]        # training labels
labels[split_index:]        # validation labels

So the return order is:

x_train, x_valid, y_train, y_valid
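A toy sketch of the same slicing, using made-up 5-sample arrays and split_index = 3 (the real call in main() uses 2300):

    import numpy as np

    raw_data = np.arange(10).reshape(5, 2)   # pretend features, 5 samples x 2 values
    labels = np.array([0, 1, 2, 1, 2])

    x_train, x_valid = raw_data[:3], raw_data[3:]
    y_train, y_valid = labels[:3], labels[3:]
    print(x_train.shape, x_valid.shape)      # (3, 2) (2, 2)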

Theoretical Connections

Why do we need this function?

1. Training Error (Ein) is Biased and Optimistic

When a model minimizes the error on the training set, it adapts specifically to that data. Consequently, the in-sample error (Ein) is optimistically biased: it looks lower than the error the model would make on new, unseen data.

2. Model Selection (Tuning Hyperparameters)

You often need to make choices about the learning process that the training algorithm cannot determine on its own, for example the learning rate, the number of iterations, or which features to use.

If you choose these hyperparameters based on which setting gives the lowest training error, you will simply pick the most complex model possible, leading to overfitting.

3. Protecting the Test Set (Avoiding Data Snooping)

You might wonder, "Why not just use the Test set for this?" If you tune hyperparameters on the test set, it stops being an unbiased estimate of out-of-sample performance: you have effectively snooped the data, and the reported test accuracy will be optimistic.

Summary: You split the data so you can train on one part (Dtrain) and use the other part (Dval) to simulate how the model behaves in the real world, allowing you to tune the model without cheating by looking at the final test data.

Function: prepare_X(raw_X)

def prepare_X(raw_X):
    """Extract features from raw_X as required.

    Args:
        raw_X: An array of shape [n_samples, 256].

    Returns:
        X: An array of shape [n_samples, n_features].
    """
    raw_image = raw_X.reshape((-1, 16, 16))

Restore spatial structure of the images

raw_image = raw_X.reshape((-1, 16, 16))

Feature 1: Symmetry

	# Feature 1: Measure of Symmetry
	
	# Flip images horizontally
	flipped_image = np.flip(raw_image, axis=2)
		# Since shape == (N, 16, 16), width is axis 2
	
	# Formula for symmetry: - (Sum of |x - flip(x)|) / 256
	symmetry = -np.sum(np.abs(raw_image - flipped_image), axis=(1, 2)) / 256

Explanation:

	flipped_image = np.flip(raw_image, axis=2)

np.flip along axis 2 mirrors every image left-to-right, so each pixel is swapped with its horizontal counterpart.

	symmetry = -np.sum(np.abs(raw_image - flipped_image), axis=(1, 2)) / 256

The symmetry score is the negative mean absolute difference between an image and its mirror: a perfectly symmetric digit scores 0, and the more asymmetric the digit, the more negative the score.

Feature 2: Intensity

	# Feature 2: Measure of Intensity
	### YOUR CODE HERE
     
    # Formula for intensity: (Sum of x) / 256
    intensity = np.sum(raw_image, axis=(1,2)) / 256

Explanation:

	intensity = np.sum(raw_image, axis=(1,2)) / 256

The intensity score is the average pixel value over the 256 pixels of each image.
(extra) What does axis=(1,2) do?

To understand the axis parameter, you have to visualize how the raw_image data is structured after you reshape it.

The variable raw_image is a 3-Dimensional array with the shape (N, 16, 16).

When you specify axis=(1, 2): You are telling NumPy to "collapse" or "squash" dimensions 1 and 2 by summing them up, while keeping Axis 0 separate.

Visualizing the operation and the result: the array has shape (N, 16, 16); summing over axes 1 and 2 collapses each 16×16 image into a single scalar, so the output has shape (N,), one value per image.
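A tiny sketch with toy 2×2 "images" (instead of 16×16) showing how axis=(1, 2) collapses each image to one number per sample:

    import numpy as np

    # Two toy 2x2 images stacked along axis 0, shape (2, 2, 2)
    imgs = np.array([[[1, 2],
                      [3, 4]],
                     [[0, 1],
                      [0, 1]]])

    print(np.sum(imgs, axis=(1, 2)))        # [10  2]  -> one sum per image
    print(np.sum(imgs, axis=(1, 2)).shape)  # (2,)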

Feature 3: Bias Term. Always 1.

	# Feature 3: Bias Term. Always 1.
	# Vector of 1s with length == n_samples
	bias = np.ones(raw_image.shape[0])

What the code does

bias = np.ones(raw_image.shape[0])

raw_image.shape[0] is the number of samples N, so this creates a length-N vector of 1s that will become the constant feature x0 for every image.
Why do we need it? (The "Bias Trick")

In linear models, the decision boundary is determined by a weighted sum of the inputs plus a threshold (or offset).

What happens if you skip this?

If you do not include this bias column (the column of 1s), you force w0 to be zero.

(extra) The concept of a threshold/bias weight

Think of a linear model as a scorecard. In the credit card example, the model calculates a weighted score based on your inputs (salary, years in job, etc.).

Mathematically, this looks like: $$ \sum_{i=1}^d w_i x_i > \text{threshold} $$

The Mathematical Trick: Turning Threshold into Bias:

To make the math and the code cleaner, machine learning practitioners move the threshold to the other side of the inequality. $$ \sum_{i=1}^d w_i x_i - \text{threshold} > 0 $$
The term $-\text{threshold}$ is renamed the bias weight $w_0$. $$ \sum_{i=1}^d w_i x_i + w_0 > 0 $$
By doing this, the "threshold" isn't a special number we have to keep track of separately; it just becomes another weight that the machine learns automatically.

The "Augmented" Vector (x0=1)

This explains why our code has bias = np.ones(...).

We want to write the entire formula as a single matrix multiplication (dot product): $w^T x$. However, the bias $w_0$ is floating by itself; it isn't being multiplied by any input feature. Adding a constant feature $x_0 = 1$ fixes this.

This allows us to merge the bias into the weights and inputs perfectly: $$ \underbrace{w_0 \cdot 1}_{\text{bias}} + \sum_{i=1}^d w_i x_i = \sum_{i=0}^d w_i x_i = w^T x $$
This is referred to in the slides as the augmented vector.

Geometric Meaning

Without a bias term, a linear model (a line or hyperplane) is forced to pass through the origin (0,0).
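A small sketch of the bias trick with made-up numbers: prepending x0 = 1 lets the threshold ride along inside a single dot product:

    import numpy as np

    w = np.array([-0.5, 2.0, 1.0])      # [w0 (bias), w1, w2], toy weights
    x = np.array([0.3, -0.2])           # raw 2-D feature vector

    x_aug = np.concatenate(([1.0], x))  # augmented vector [1, x1, x2]
    signal = np.dot(w, x_aug)           # w0*1 + w1*x1 + w2*x2
    print(signal)                       # -0.5 + 0.6 - 0.2 = -0.1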

Stack features together

	# Stack features together in the following order.
	# [Feature 3, Feature 1, Feature 2]     
    # Final feature matrix with shape == (n_samples, 3)
    X = np.column_stack((bias, symmetry, intensity))
    
    return X
1. What the code does (The Mechanics)

The function np.column_stack takes a sequence of 1D arrays (vectors) and stacks them side-by-side as columns to create a single 2D matrix.

2. The Resulting Data Matrix (X)

This line constructs the Data Matrix (often called the Design Matrix) described in the "Linear Regression Algorithm" slide. Each row represents one specific image (data point), and each column represents a specific feature of that image.

Visually, the matrix X looks like this:

$$X = \begin{bmatrix} 1 & \text{symmetry}_1 & \text{intensity}_1 \\ 1 & \text{symmetry}_2 & \text{intensity}_2 \\ \vdots & \vdots & \vdots \\ 1 & \text{symmetry}_N & \text{intensity}_N \end{bmatrix}$$
3. Why do we stack them this way?

We stack them into this specific shape to utilize matrix multiplication for calculating predictions and errors efficiently.

As explained in the slides, the linear model calculates a "score" or "signal" for every data point using the formula: $$s = \mathbf{X}\mathbf{w}$$
This single line of linear algebra (Xw) performs the following calculation for every image at once: $$ \text{score} = (w_0 \cdot 1) + (w_1 \cdot \text{symmetry}) + (w_2 \cdot \text{intensity}) $$
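A minimal sketch of that product with made-up feature values and a hypothetical weight vector w:

    import numpy as np

    # Toy design matrix: [bias, symmetry, intensity] for 3 samples
    X = np.array([[1.0, -0.8, 0.2],
                  [1.0, -0.1, 0.5],
                  [1.0, -0.4, 0.3]])
    w = np.array([0.1, 1.5, -2.0])     # hypothetical weights [w0, w1, w2]

    scores = X @ w                     # one score per image, shape (3,)
    print(scores)                      # [-1.5  -1.05 -1.1 ]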

Function: prepare_y(raw_y)

def prepare_y(raw_y):
    """
    Args:
        raw_y: An array of shape [n_samples,].
        
    Returns:
        y: An array of shape [n_samples,].
        idx:return idx for data label 1 and 2.
    """
    y = raw_y
    idx = np.where((raw_y==1) | (raw_y==2))
    y[np.where(raw_y==0)] = 0
    y[np.where(raw_y==1)] = 1
    y[np.where(raw_y==2)] = 2

    return y, idx

1. The Purpose of the Function

This function is designed to filter the dataset for a specific binary classification task. Even though the provided dataset contains three digits (0, 1, and 2), this function creates an index (idx) that isolates only the samples corresponding to the digits "1" and "2".

This aligns with the strategy mentioned in the textbook of decomposing a multiclass problem (recognizing ten digits) into smaller binary problems (e.g., separating digit 1 from digit 2).

2. Line-by-Line Explanation

idx = np.where((raw_y==1) | (raw_y==2)) returns the positions of all samples whose label is 1 or 2. The three assignments that follow map each label to itself (0 stays 0, 1 stays 1, 2 stays 2), so y is returned unchanged; the useful output is idx.

3. Summary of Returns

  1. y: The array of labels (sanitized).
  2. idx: The critical piece of information. You will likely use this in your main.py or training loop to slice your data matrix X.
    • Instead of using the full X, you will use X[idx] to get only the images of 1s and 2s.
    • Instead of using the full y, you will use y[idx] to get only the labels 1 and 2.

This function identifies samples belonging to classes 1 and 2 for binary classification, and these indices should be used to filter the data and convert the labels to ±1 if required by the learning algorithm.
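A quick sketch of the np.where mechanics on a toy label array (not the real data):

    import numpy as np

    raw_y = np.array([0, 1, 2, 1, 0, 2])
    idx = np.where((raw_y == 1) | (raw_y == 2))
    print(idx)          # (array([1, 2, 3, 5]),)  positions of the 1s and 2s
    print(raw_y[idx])   # [1 2 1 2]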

Function main()

def main():
	# ------------Data Preprocessing------------
	# Read data for training.
    
    raw_data, labels = load_data(os.path.join(data_dir, train_filename))
    raw_train, raw_valid, label_train, label_valid = train_valid_split(raw_data, labels, 2300)

    ##### Preprocess raw data to extract features
    train_X_all = prepare_X(raw_train)
    valid_X_all = prepare_X(raw_valid)
    ##### Preprocess labels for all data to 0,1,2 and return the idx for data from '1' and '2' class.
    train_y_all, train_idx = prepare_y(label_train)
    valid_y_all, val_idx = prepare_y(label_valid)  

    ####### For binary case, only use data from '1' and '2'  
    train_X = train_X_all[train_idx]
    train_y = train_y_all[train_idx]
    ####### Only use the first 1350 data examples for binary training. 
    train_X = train_X[0:1350]
    train_y = train_y[0:1350]
    valid_X = valid_X_all[val_idx]
    valid_y = valid_y_all[val_idx]
    ####### set labels to 1 and -1. Here convert label '2' to '-1', which means we treat digit '1' as the positive class.
	### YOUR CODE HERE	
	
    # Convert label 2 to -1
    train_y[train_y == 2] = -1
    valid_y[valid_y == 2] = -1

    # Explicitly convert label 1 to 1 (not strictly needed)
    train_y[train_y == 1] = 1
    valid_y[valid_y == 1] = 1

	### END YOUR CODE
    data_shape = train_y.shape[0] 

#    # Visualize training data.
    visualize_features(train_X[:, 1:3], train_y)

1. Loading and Splitting Data

raw_data, labels = load_data(os.path.join(data_dir, train_filename))
raw_train, raw_valid, label_train, label_valid = train_valid_split(raw_data, labels, 2300)

2. Feature Extraction (prepare_X)

train_X_all = prepare_X(raw_train)
valid_X_all = prepare_X(raw_valid)

3. Filtering for Binary Classification (prepare_y)

train_y_all, train_idx = prepare_y(label_train)
valid_y_all, val_idx = prepare_y(label_valid)

4. Applying the Filter

train_X = train_X_all[train_idx]
train_y = train_y_all[train_idx]

5. Limiting Sample Size

train_X = train_X[0:1350]
train_y = train_y[0:1350]

6. Label Conversion (The Code You Wrote)

train_y[train_y == 2] = -1
valid_y[valid_y == 2] = -1
# ... (and implicitly 1 remains 1)

7. Visualization

visualize_features(train_X[:, 1:3], train_y)

Function: visualize_features(X, y)

def visualize_features(X, y):
	'''This function is used to plot a 2-D scatter plot of training features. 

	Args:
		X: An array of shape [n_samples, 2].
		y: An array of shape [n_samples,]. Only contains 1 or -1.

	Returns:
		No return. Save the plot to 'train_features.*' and include it
		in submission.
	'''
	### YOUR CODE HERE
	
	# Separate the data into 1 (digit 1) and -1 (digit 2) classes
	X_pos = X[y == 1]
	X_neg = X[y == -1]

	# Initialize plot
	plt.figure(figsize=(8, 6))
	
	# Plot positives (1) with 'o'
	plt.scatter(X_pos[:,0], X_pos[:, 1], color='blue', marker='o', label='Class +1')
	
	# Plot negatives (-1) with 'x'
	plt.scatter(X_neg[:,0], X_neg[:, 1], color='red', marker='x', label='Class -1')
	
	# Styling
	plt.xlabel('Symmetry')
	plt.ylabel('Intensity')
	plt.title('Feature Visualization')
	plt.legend()
	plt.grid(True)

	# Save plot
	plt.savefig('train_features.png')
	
	# Display
	plt.show()
	
	### END YOUR CODE

Just a simple training feature scatter plot, not very complex.

2. Cross-entropy loss

(a) Write the loss function E(w) for one training data sample (x,y)

1. The Setup

We have a single training sample (x, y) with label $y \in \{+1, -1\}$, a weight vector w (including the bias), and the linear signal $s = w^T x$.

2. Identify the Probability Function

In Logistic Regression with binary labels $y \in \{+1, -1\}$, the model estimates the probability P(y|x) using the logistic function $\theta(s)$. The text defines this probability as: $$P(y|x) = \theta(y w^T x)$$ where $\theta(s) = \frac{e^s}{1 + e^s}$.

The Property: $1 - \theta(s) = \theta(-s)$

The logistic function is defined as $\theta(s) = \frac{e^s}{1 + e^s}$. If you calculate $1 - \theta(s)$, you get: $$ 1 - \frac{e^s}{1 + e^s} = \frac{1 + e^s - e^s}{1 + e^s} = \frac{1}{1 + e^s} $$
Now, look at $\theta(-s)$: $$ \theta(-s) = \frac{e^{-s}}{1 + e^{-s}} $$ Multiply the top and bottom by $e^s$: $$ \frac{e^{-s} \cdot e^s}{(1 + e^{-s}) \cdot e^s} = \frac{1}{e^s + 1} $$
Since both equal $\frac{1}{1 + e^s}$, we have proven that $1 - \theta(s) = \theta(-s)$.

Applying it to the Piecewise Function

Now we look at the two cases for P(y|x):

If y = +1, the model assigns $P(y|x) = \theta(w^T x)$. If y = -1, it assigns $P(y|x) = 1 - \theta(w^T x) = \theta(-w^T x)$, using the property above. Both cases are captured by $\theta(y w^T x)$.

Summary:
By using the symmetry $\theta(-s) = 1 - \theta(s)$, we can cover both cases with a single expression: $$P(y | x) = \theta(y w^T x)$$
This simplifies the math significantly because we don't have to write separate equations for positive and negative examples.

3. Apply the Negative Log-Likelihood

To find the optimal weights, the algorithm seeks to maximize the likelihood of the data. Mathematically, maximizing the likelihood is equivalent to minimizing the negative natural logarithm (the negative log-likelihood) of that probability. Therefore, the loss E(w) for a single sample is: $$E(w) = -\ln(P(y|x)) = -\ln(\theta(y w^T x))$$

The Substitution Step-by-Step

Step A: Substitute the probability P(y|x) with the logistic function $\theta(s)$, where $s = y w^T x$. $$E(w) = -\ln(\theta(s))$$
Step B: Substitute the definition of $\theta(s)$. $$E(w) = -\ln\left( \frac{e^s}{1 + e^s} \right)$$
Step C: Apply the logarithm rule $-\ln(a/b) = \ln(b/a)$ to remove the negative sign. $$E(w) = \ln\left( \frac{1 + e^s}{e^s} \right)$$
Step D: Simplify the fraction inside the logarithm. $$E(w) = \ln\left( \frac{1}{e^s} + \frac{e^s}{e^s} \right) = \ln\left( e^{-s} + 1 \right)$$
Step E: Substitute $s = y w^T x$ back into the equation. $$E(w) = \ln\left( 1 + e^{-y w^T x} \right)$$

4. Simplify the Expression

By substituting the definition of the logistic function into the equation above, the text derives the standard cross-entropy loss formula for a single training sample (x, y):

$$E(w) = \ln\left(1 + e^{-y w^T x}\right)$$
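A quick numeric sketch of this loss on one made-up sample (w, x, and y are arbitrary toy values, not from the dataset):

    import numpy as np

    w = np.array([0.1, 1.5, -2.0])   # toy weights
    x = np.array([1.0, -0.4, 0.3])   # toy augmented sample [1, symmetry, intensity]
    y = 1                            # true label

    loss = np.log(1 + np.exp(-y * np.dot(w, x)))
    print(loss)   # w.x = -1.1, so loss = ln(1 + e^{1.1}) ~ 1.39 (misclassified -> large loss)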

(b) Compute the gradient $\nabla E(w)$. Please provide intermediate steps of derivation.

Based on the loss function derived in the previous step and the gradient descent discussions in the Learning From Data textbook, here is the step-by-step derivation of the gradient $\nabla E(w)$.

The Loss Function

Recalling the cross-entropy loss for a single sample (x, y) derived in the previous step: $$E(w) = \ln(1 + e^{-y w^T x})$$
To find the gradient $\nabla E(w)$ with respect to the weight vector w, we apply the chain rule of calculus.

Step-by-Step Derivation

Step 1: Identify the nested functions

Let us define a variable u to represent the exponent: $$u = -y w^T x$$
Now, the loss function can be written as: $$E(w) = \ln(1 + e^u)$$

Step 2: Apply the Chain Rule:

The Chain Rule states that if you have composite functions (functions inside of other functions), the derivative of the whole is the product of the derivatives of the parts.

To calculate the gradient, we view the error function $E(w) = \ln(1 + e^{-y w^T x})$ as a series of three nested layers, like an onion, with $z = 1 + e^u$ and $u = -y w^T x$:

  1. How E changes as z changes ($\partial E / \partial z$).
  2. How z changes as u changes ($\partial z / \partial u$).
  3. How u changes as w changes ($\nabla_w u$).

When we multiply these rates of change together, we get the total gradient: $$ \nabla E(w) = \underbrace{\frac{\partial E}{\partial z}}_{\text{Outer}} \cdot \underbrace{\frac{\partial z}{\partial u}}_{\text{Middle}} \cdot \underbrace{\nabla_w u}_{\text{Inner}} $$
If we substitute the definitions of z and u back into this Chain Rule expression, we get exactly the three factors computed in Step 3 below.

Step 3: Compute the individual derivatives
  1. Outer Logarithm: The derivative of $\ln(z)$ with respect to z is $\frac{1}{z}$. $$ \frac{\partial E}{\partial z} = \frac{1}{1 + e^u} $$
  2. Exponential Term: The derivative of $z = 1 + e^u$ with respect to u is $e^u$. $$ \frac{\partial z}{\partial u} = e^u $$
  3. Linear Term: The gradient of $u = -y w^T x$ with respect to the vector w is simply the coefficient vector $-y x$. $$ \nabla_w u = \nabla_w (-y w^T x) = -y x $$

Step 4: Combine and Substitute

Multiply the three parts together: $$ \nabla E(w) = \left(\frac{1}{1+e^u}\right) \cdot \left(e^u\right) \cdot (-y x) $$

Substitute $u = -y w^T x$ back into the equation: $$\nabla E(w) = \frac{e^{-y w^T x}}{1 + e^{-y w^T x}} \,(-y x)$$

Step 5: Simplify the Expression

To clean up the fraction, multiply the numerator and the denominator by $e^{y w^T x}$: $$ \frac{e^{-y w^T x} \cdot e^{y w^T x}}{(1 + e^{-y w^T x}) \cdot e^{y w^T x}} = \frac{1}{e^{y w^T x} + 1} $$
Now, combine this with the term $(-y x)$: $$\nabla E(w) = \frac{1}{1 + e^{y w^T x}} \,(-y x)$$

Final Answer

The gradient of the loss function for a single sample is:

$$\nabla E(w) = \frac{-y x}{1 + e^{y w^T x}}$$

This result matches the summation formula provided in Exercise 3.7 of the textbook, which gives the gradient contribution of a single point as $\frac{-y_n x_n}{1 + e^{y_n w^T x_n}}$.

Interpretation: The gradient is a scaled version of the input vector $-y x$. The scaling factor $\frac{1}{1 + e^{y w^T x}}$ determines how much this specific point affects the weight update. If $y w^T x$ is large and positive (the point is correctly classified with high confidence), the denominator is large and the gradient is close to zero (little update). If $y w^T x$ is negative (misclassified), the denominator is small and the gradient contribution is larger.
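One way to sanity-check this formula is a finite-difference comparison on toy values; this is just a verification sketch, not part of the assignment:

    import numpy as np

    def loss(w, x, y):
        return np.log(1 + np.exp(-y * np.dot(w, x)))

    def grad(w, x, y):
        return -y * x / (1 + np.exp(y * np.dot(w, x)))

    w = np.array([0.1, 1.5, -2.0])
    x = np.array([1.0, -0.4, 0.3])
    y = -1
    eps = 1e-6

    # Central differences along each coordinate direction
    numeric = np.array([(loss(w + eps * e, x, y) - loss(w - eps * e, x, y)) / (2 * eps)
                        for e in np.eye(len(w))])
    print(np.allclose(numeric, grad(w, x, y)))   # True (up to numerical error)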

(c) Once the optimal w is obtained,

it can be used to make predictions as follows:
$$
\text{Predicted class of } x = \begin{cases} 1 & \text{if } \theta\left(w^T x\right) \geq 0.5 \\ -1 & \text{if } \theta\left(w^T x\right) < 0.5 \end{cases}
$$
where the function $\theta(z) = \frac{1}{1+e^{-z}}$ looks like

2026-02-05_14-42-07.png|300

However, this is not the most efficient way since the decision boundary is linear. Why? Explain it. When will we need to use the sigmoid function in prediction?

1. Why is checking $\theta(w^T x) \geq 0.5$ not the most efficient way?

Explanation: Using the sigmoid function to make a binary prediction is inefficient because it involves unnecessary and computationally expensive mathematical operations.

Conclusion: Since the decision boundary is linear (defined by $w^T x = 0$), you can simply check the sign of the linear signal: if $w^T x \geq 0$, predict 1; otherwise, predict -1. Calculating the full probability via the sigmoid function is wasted effort if you only need the binary class label.

2. When will we need to use the sigmoid function in prediction?

You need the sigmoid function when the goal is not just a hard classification (Yes/No), but an estimation of uncertainty or risk.
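A small sketch with toy scores showing that thresholding the sigmoid at 0.5 and checking the sign of the signal give identical labels:

    import numpy as np

    signal = np.array([-2.3, -0.1, 0.0, 0.7, 4.2])   # toy values of w^T x

    via_sigmoid = np.where(1 / (1 + np.exp(-signal)) >= 0.5, 1, -1)
    via_sign = np.where(signal >= 0, 1, -1)

    print(np.array_equal(via_sigmoid, via_sign))     # True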

(d) Is the decision boundary still linear if the prediction rule is changed to the following?

$$
\text{Predicted label of } x = \begin{cases} 1 & \text{if } \theta(w^T x) \geq 0.9 \\ -1 & \text{if } \theta(w^T x) < 0.9 \end{cases}
$$

Based on the properties of the linear signal and the logistic function described in the "Linear Models" slides and the Learning From Data textbook, the answer is Yes, the decision boundary is still linear.

1. Identify the Decision Boundary Equation

The decision boundary is the "line" (or hyperplane) that separates the region where you predict +1 from the region where you predict -1. In your new rule, this boundary occurs exactly when the probability equals the threshold: $$ \theta(w^T x) = 0.9 $$

2. Substitute the Logistic Function

Recall the definition of the logistic function θ(s) provided in the slides and textbook: $$ \frac{1}{1 + e^{-w^T x}} = 0.9 $$

3. Solve for the Linear Signal ($w^T x$)

To see the shape of the boundary in the input space, isolate the linear signal: $$ 1 + e^{-w^T x} = \frac{1}{0.9} \implies e^{-w^T x} = \frac{1}{9} \implies w^T x = \ln(9) $$

4. Conclusion: It is a Linear Boundary

The final equation for the decision boundary is: $$ w^T x - \ln(9) = 0 $$
Since ln(9) is just a constant number (approximately 2.2), this equation effectively shifts the threshold of the linear signal.

Geometrically, the equation $w^T x = \text{constant}$ defines a hyperplane (a line in 2D space). Therefore, changing the probability threshold shifts the position of the boundary, but it remains linear; it does not curve.

(e) What is the essential property of logistic regression that results in the linear decision boundary?

The essential property that results in a linear decision boundary is that Logistic Regression models the probability as a monotonic function of a linear signal.

Here is the breakdown of the components you should mention in your answer:

  1. The Linear Signal (s): The core of the model is the "signal" s, which is a linear combination of the input features: $$s = w^T x = \sum w_i x_i$$ This part is linear in the input x.

  2. The Monotonic Transformation (θ): The probability P(y=1|x) is computed by passing this linear signal through the logistic (sigmoid) function: $$P(y=1|x) = \theta(s) = \frac{1}{1+e^{-s}}$$ Crucially, this function is strictly monotonic (it effectively just rescales the signal from the range $(-\infty, \infty)$ to $(0, 1)$ without changing the order of the values).

  3. The Resulting Boundary: Because the function is monotonic, applying a threshold to the probability (e.g., $P \geq 0.9$) is mathematically equivalent to applying a corresponding threshold to the linear signal (e.g., $s \geq \ln 9 \approx 2.2$).

    • Equation: $\theta(w^T x) = \text{threshold} \iff w^T x = \text{constant}$.
    • Since $w^T x = \text{constant}$ defines a hyperplane (a line in 2D), the decision boundary is linear.

Summary for your answer: You should state that the decision boundary is linear because the logistic regression model relies on a linear signal ($w^T x$) and the decision rule thresholds a monotonic function of that signal. Therefore, contours of constant probability correspond directly to contours of constant linear score (hyperplanes).

3. Sigmoid Logistic Regression

Function: _gradient

Based on (b) in the last problem, implement the function _gradient.

    def _gradient(self, _x, _y):
        """Compute the gradient of cross-entropy with respect to self.W
        for one training sample (_x, _y). This function is used in fit_*.

        Args:
            _x: An array of shape [n_features,].
            _y: An integer. 1 or -1.

        Returns:
            _g: An array of shape [n_features,]. The gradient of
                cross-entropy with respect to self.W.
        """
		### YOUR CODE HERE

		### END YOUR CODE

Based on the derivation we performed in the previous step (part b), we found that the gradient of the cross-entropy loss E(w) for a single sample is:

$$\nabla E(w) = \frac{-y x}{1 + e^{y w^T x}}$$

Here is the implementation for the _gradient function in LogisticRegression.py, followed by a detailed explanation.

Implementation

    def _gradient(self, _x, _y):
        """Compute the gradient of cross-entropy with respect to self.W
        for one training sample (_x, _y). This function is used in fit_*.

        Args:
            _x: An array of shape [n_features,].
            _y: An integer. 1 or -1.

        Returns:
            _g: An array of shape [n_features,]. The gradient of
                cross-entropy with respect to self.W.
        """
        ### YOUR CODE HERE
        # 1. Compute the exponent term: y * (w^T * x)
        # self.W is the weight vector (w)
        exponent = _y * np.dot(self.W, _x)

        # 2. Compute the scalar coefficient: -y / (1 + e^(exponent))
        # This represents how much "correction" is needed based on the prediction error.
        coefficient = -_y / (1 + np.exp(exponent))

        # 3. Compute the gradient vector: coefficient * x
        _g = coefficient * _x
        ### END YOUR CODE

        return _g

Explanation of the Steps

1. The Dot Product (np.dot(self.W, _x)): computes the linear signal $w^T x$ for this sample.

2. The Exponent (exponent = _y * ...): forms $y \, w^T x$, the quantity that appears in the denominator of the gradient formula.

3. The Denominator (1 + np.exp(exponent)): evaluates $1 + e^{y w^T x}$; a confidently correct prediction makes this large, which shrinks the update.

4. The Final Gradient (_g): multiplies the scalar coefficient $\frac{-y}{1 + e^{y w^T x}}$ by the input vector x, giving an array of shape [n_features,].

Function: fit_BGD

    def fit_BGD(self, X, y):
        """Train perceptron model on data (X,y) with Batch Gradient Descent.

        Args:
            X: An array of shape [n_samples, n_features].
            y: An array of shape [n_samples,]. Only contains 1 or -1.

        Returns:
            self: Returns an instance of self.
        """
        n_samples, n_features = X.shape

		### YOUR CODE HERE
        

		### END YOUR CODE
        return self

Implementation

    def fit_BGD(self, X, y):
        """Train perceptron model on data (X,y) with Batch Gradient Descent.

        Args:
            X: An array of shape [n_samples, n_features].
            y: An array of shape [n_samples,]. Only contains 1 or -1.

        Returns:
            self: Returns an instance of self.
        """
        n_samples, n_features = X.shape

        ### YOUR CODE HERE
        # 1. Initialize weights (usually to zeros or small random values)
        self.W = np.zeros(n_features)

        # 2. Loop over epochs (iterations)
        for _ in range(self.max_iter):
            # Initialize a gradient accumulator for the batch
            batch_grad = np.zeros(n_features)

            # 3. Compute the gradient for the entire dataset
            for i in range(n_samples):
                # Accumulate the gradient contribution from each sample
                batch_grad += self._gradient(X[i], y[i])

            # 4. Average the gradient
            # The in-sample error is the average over N samples, so we divide the sum by N
            batch_grad /= n_samples

            # 5. Update the weights
            # Move in the opposite direction of the gradient (Descent)
            self.W -= self.learning_rate * batch_grad

        ### END YOUR CODE
        return self

Explanation of the Steps

1. Initialization (self.W = np.zeros(n_features)): start from the zero weight vector; the logistic loss is convex, so the starting point does not change the minimum that gradient descent approaches.

2. The Epoch Loop (for _ in range(self.max_iter)): each iteration performs exactly one weight update, computed from the whole training set.

3. Accumulating the Gradient (batch_grad += ...): sums the single-sample gradients returned by _gradient over all n_samples points.

4. Averaging (batch_grad /= n_samples): the in-sample error is an average over N samples, so its gradient is the average of the per-sample gradients.

5. Weight Update (self.W -= self.learning_rate * batch_grad): take a step of size learning_rate in the direction opposite to the gradient (descent).

Function: fit_miniBGD

    def fit_miniBGD(self, X, y, batch_size):
        """Train perceptron model on data (X,y) with mini-Batch Gradient Descent.

        Args:
            X: An array of shape [n_samples, n_features].
            y: An array of shape [n_samples,]. Only contains 1 or -1.
            batch_size: An integer.

        Returns:
            self: Returns an instance of self.
        """
		### YOUR CODE HERE
        

		### END YOUR CODE
        return self

Based on the description provided in the prompt and the concepts of Stochastic Gradient Descent (SGD) discussed in the Learning From Data textbook, Mini-Batch Gradient Descent is a middle ground between Batch Gradient Descent (BGD) and SGD.

In BGD, you update weights once per epoch (using all data). In SGD, you update weights N times per epoch (using one sample at a time). In Mini-Batch, you update weights multiple times per epoch, using a small group of samples (the "batch") for each update. For example, with the 1350 binary training samples used here and batch_size = 10, each epoch performs 135 updates.

Implementation

    def fit_miniBGD(self, X, y, batch_size):
        """Train perceptron model on data (X,y) with mini-Batch Gradient Descent.

        Args:
            X: An array of shape [n_samples, n_features].
            y: An array of shape [n_samples,]. Only contains 1 or -1.
            batch_size: An integer.

        Returns:
            self: Returns an instance of self.
        """
        n_samples, n_features = X.shape

        ### YOUR CODE HERE
        # 1. Initialize weights
        self.W = np.zeros(n_features)

        # 2. Loop over epochs
        for _ in range(self.max_iter):
            # 3. Shuffle the data at the start of each epoch
            # This ensures the batches are different every time, preventing cycles.
            indices = np.arange(n_samples)
            np.random.shuffle(indices)
            X_shuffled = X[indices]
            y_shuffled = y[indices]

            # 4. Loop over the data in chunks of 'batch_size'
            for start_idx in range(0, n_samples, batch_size):
                # Handle the last batch, which might be smaller than batch_size
                end_idx = min(start_idx + batch_size, n_samples)

                # Slice the current mini-batch
                X_batch = X_shuffled[start_idx:end_idx]
                y_batch = y_shuffled[start_idx:end_idx]
                current_batch_size = end_idx - start_idx

                # 5. Compute the gradient for this specific batch
                batch_grad = np.zeros(n_features)
                for i in range(current_batch_size):
                    batch_grad += self._gradient(X_batch[i], y_batch[i])

                # Average the gradient over the batch size
                batch_grad /= current_batch_size

                # 6. Update the weights immediately
                self.W -= self.learning_rate * batch_grad

        ### END YOUR CODE
        return self

Step-by-Step Explanation

1. Initialization (self.W = np.zeros): start from the zero weight vector, as in fit_BGD.

2. The Epoch Loop (for _ in range(self.max_iter)): one epoch is one full pass through the shuffled training data.

3. Shuffling (np.random.shuffle): re-orders the samples each epoch so that every mini-batch is a different random subset, which prevents cycles.

4. The Batch Loop (range(0, n_samples, batch_size)): walks through the shuffled data in chunks; min(start_idx + batch_size, n_samples) handles the last, possibly smaller, chunk.

5. Accumulating Gradients (batch_grad += ...): sums the per-sample gradients within the current mini-batch and then averages them over the actual batch size.

6. Weight Update (self.W -= ...): updates the weights immediately after each mini-batch, so there are several updates per epoch.

Function: fit_SGD

    def fit_SGD(self, X, y):
        """Train perceptron model on data (X,y) with Stochastic Gradient Descent.

        Args:
            X: An array of shape [n_samples, n_features].
            y: An array of shape [n_samples,]. Only contains 1 or -1.

        Returns:
            self: Returns an instance of self.
        """
		### YOUR CODE HERE

		### END YOUR CODE
        return self

Based on the definition of Stochastic Gradient Descent (SGD) provided in the Learning From Data textbook, specifically the section on "Stochastic Gradient Descent," here is the implementation and detailed explanation.

SGD differs from Batch Gradient Descent in that it updates the weights using the gradient of a single training example $(x_n, y_n)$ at a time, rather than the average gradient of the entire dataset.

Implementation

    def fit_SGD(self, X, y):
        """Train perceptron model on data (X,y) with Stochastic Gradient Descent.

        Args:
            X: An array of shape [n_samples, n_features].
            y: An array of shape [n_samples,]. Only contains 1 or -1.

        Returns:
            self: Returns an instance of self.
        """
        n_samples, n_features = X.shape

        ### YOUR CODE HERE
        # 1. Initialize weights
        self.W = np.zeros(n_features)

        # 2. Loop over epochs
        # An epoch is one full pass through the data.
        for _ in range(self.max_iter):

            # 3. Shuffle the data
            # Essential in SGD to ensure the model sees samples in a random order
            indices = np.arange(n_samples)
            np.random.shuffle(indices)

            # 4. Iterate through every single sample
            for i in indices:
                # 5. Compute gradient for ONE sample (X[i], y[i])
                # This uses the gradient formula derived earlier: -y*x / (1 + e^(y*w^T*x))
                grad = self._gradient(X[i], y[i])

                # 6. Update weights immediately
                # w = w - learning_rate * gradient
                self.W -= self.learning_rate * grad

        ### END YOUR CODE
        return self

Detailed Explanation of Each Step

1. Initialization (self.W = np.zeros): start from the zero weight vector.

2. The Epoch Loop (for _ in range(self.max_iter)): one epoch is one full pass through all n_samples points.

3. Shuffling (np.random.shuffle): randomizes the order in which samples are visited each epoch.

4. The Inner Loop (for i in indices): visits every sample exactly once per epoch.

5. Gradient Calculation (grad = self._gradient(X[i], y[i])): evaluates the single-sample gradient $\frac{-y x}{1 + e^{y w^T x}}$.

6. Weight Update (self.W -= self.learning_rate * grad): updates the weights after every single sample, giving n_samples updates per epoch.

Function: predict_proba

    def predict_proba(self, X):
        """Predict class probabilities for samples in X.

        Args:
            X: An array of shape [n_samples, n_features].

        Returns:
            preds_proba: An array of shape [n_samples, 2].
                Only contains floats between [0,1].
        """
		### YOUR CODE HERE

		### END YOUR CODE

This function computes the probability P(y|x) using the logistic (sigmoid) function $\theta(s) = \frac{1}{1+e^{-s}}$. As established in the "Logistic Regression" section, the model estimates P(y=1|x) directly.

    def predict_proba(self, X):
        """Predict class probabilities for samples in X.

        Args:
            X: An array of shape [n_samples, n_features].

        Returns:
            preds_proba: An array of shape [n_samples, 2].
                Only contains floats between [0, 1].
        """
        ### YOUR CODE HERE
        # 1. Compute the linear signal s = w^T * x
        signal = np.dot(X, self.W)

        # 2. Compute P(y=1|x) using the sigmoid function
        prob_pos = 1 / (1 + np.exp(-signal))

        # 3. Compute P(y=-1|x)
        # Since probabilities sum to 1: P(y=-1) = 1 - P(y=1)
        prob_neg = 1 - prob_pos

        # 4. Stack them into a matrix of shape [n_samples, 2]
        # Column 0 corresponds to the negative class (-1)
        # Column 1 corresponds to the positive class (1)
        preds_proba = np.column_stack((prob_neg, prob_pos))

        return preds_proba
        ### END YOUR CODE

Explanation: np.dot(X, self.W) computes the signal $w^T x$ for every row of X at once; the sigmoid maps each signal to P(y=1|x), and stacking (1 - p, p) as two columns produces the required [n_samples, 2] output.

Function: predict

    def predict(self, X):
        """Predict class labels for samples in X.

        Args:
            X: An array of shape [n_samples, n_features].

        Returns:
            preds: An array of shape [n_samples,]. Only contains 1 or -1.
        """
		### YOUR CODE HERE

		### END YOUR CODE

This function returns the binary class labels. As discussed in the slides, the decision boundary is linear. We assign the class based on which probability is higher, which is equivalent to checking whether the linear signal is non-negative.

    def predict(self, X):
        """Predict class labels for samples in X.

        Args:
            X: An array of shape [n_samples, n_features].

        Returns:
            preds: An array of shape [n_samples,]. Only contains 1 or -1.
        """
        ### YOUR CODE HERE
        # 1. Compute the linear signal s = w^T * x
        signal = np.dot(X, self.W)

        # 2. Apply the threshold
        # If w^T * x >= 0, then P(y=1|x) >= 0.5, so we predict 1.
        # Otherwise, we predict -1.
        preds = np.where(signal >= 0, 1, -1)

        return preds
        ### END YOUR CODE

Explanation: because the sigmoid is monotonic, $\theta(w^T x) \geq 0.5$ exactly when $w^T x \geq 0$, so np.where(signal >= 0, 1, -1) reproduces the 0.5-threshold rule without computing any exponentials.

Function: score

    def score(self, X, y):
        """Returns the mean accuracy on the given test data and labels.

        Args:
            X: An array of shape [n_samples, n_features].
            y: An array of shape [n_samples,]. Only contains 1 or -1.

        Returns:
            score: An float. Mean accuracy of self.predict(X) wrt. y.
        """
		### YOUR CODE HERE

		### END YOUR CODE

This function calculates the mean accuracy of the model, quantifying how well the predictions match the true labels.

    def score(self, X, y):
        """Returns the mean accuracy on the given test data and labels.

        Args:
            X: An array of shape [n_samples, n_features].
            y: An array of shape [n_samples,]. Only contains 1 or -1.

        Returns:
            score: An float. Mean accuracy of self.predict(X) wrt. y.
        """
        ### YOUR CODE HERE
        # 1. Generate predictions using the trained model
        predictions = self.predict(X)

        # 2. Compare predictions to true labels
        # (predictions == y) creates a boolean array (True for correct, False for incorrect)
        # np.mean converts True to 1 and False to 0, then calculates the average
        score = np.mean(predictions == y)

        return score
        ### END YOUR CODE

Explanation: (predictions == y) is a boolean array with True for every correctly classified sample; np.mean treats True as 1 and False as 0, so the result is the fraction of correct predictions, i.e. the mean accuracy.

Function: visualize_result

def visualize_result(X, y, W):
	'''This function is used to plot the sigmoid model after training. 

	Args:
		X: An array of shape [n_samples, 2].
		y: An array of shape [n_samples,]. Only contains 1 or -1.
		W: An array of shape [n_features,].
	
	Returns:
		No return. Save the plot to 'train_result_sigmoid.*' and include it
		in submission.
	'''
	### YOUR CODE HERE

	### END YOUR CODE

This function visualizes the decision boundary learned by the Logistic Regression model. As established in the sources, the decision boundary for a linear model (including Logistic Regression) is defined where the "signal" $s = w^T x = 0$.

In a 2D feature space with an added bias term (x0=1), the equation for the boundary is: $$w_0 + w_1 x_1 + w_2 x_2 = 0$$
To plot this as a line, we solve for the vertical axis feature (x2, Intensity) in terms of the horizontal axis feature (x1, Symmetry): $$x_2 = -\frac{w_1 x_1 + w_0}{w_2}$$

Implementation

import numpy as np
import matplotlib.pyplot as plt

def visualize_result(X, y, W):
	'''This function is used to plot the sigmoid model after training.

	Args:
		X: An array of shape [n_samples, 2]. (Symmetry, Intensity)
		y: An array of shape [n_samples,]. Only contains 1 or -1.
		W: An array of shape [n_features,]. (Bias, w_symmetry, w_intensity)

	Returns:
		No return. Save the plot to 'train_result_sigmoid.*' and include it
		in submission.
	'''
	### YOUR CODE HERE
	# 1. Plot the data points (Same as visualize_features)
	plt.figure(figsize=(8, 6))

	# Separate positive (Digit 1) and negative (Digit 2) classes
	X_pos = X[y == 1]
	X_neg = X[y == -1]

	# Plot scatter points
	plt.scatter(X_pos[:, 0], X_pos[:, 1], color='blue', marker='o', label='Digit 1 (+1)')
	plt.scatter(X_neg[:, 0], X_neg[:, 1], color='red', marker='x', label='Digit 2 (-1)')

	# 2. Calculate the Decision Boundary
	# The boundary is where w^T * x = 0 => w0 + w1*x1 + w2*x2 = 0
	# We solve for x2 (Intensity): x2 = -(w1*x1 + w0) / w2

	# Define a range for x1 (Symmetry) based on the data
	x1_min, x1_max = np.min(X[:, 0]), np.max(X[:, 0])
	x1_values = np.linspace(x1_min, x1_max, 100)

	# Calculate corresponding x2 (Intensity) values
	# W[0] is the bias, W[1] is the Symmetry weight, W[2] is the Intensity weight
	x2_values = -(W[1] * x1_values + W[0]) / W[2]

	# 3. Plot the Decision Boundary
	plt.plot(x1_values, x2_values, color='green', linewidth=2, label='Decision Boundary')

	# 4. Add labels, legend, and save
	plt.xlabel('Symmetry')
	plt.ylabel('Intensity')
	plt.title('Logistic Regression: Training Data and Decision Boundary')
	plt.legend()
	plt.ylim(np.min(X[:, 1]) - 0.5, np.max(X[:, 1]) + 0.5) # restrict y-axis to data range

	plt.savefig('train_result_sigmoid.png')
	# plt.show()
	### END YOUR CODE

Step-by-Step Explanation:

  1. Scatter Plot: We plot the 1s and 2s just like in the previous visualization function to show the distribution of the training data.
  2. Deriving the Line:
    • W contains 3 weights: W[0] (Bias), W[1] (Symmetry), W[2] (Intensity).
    • We create a range of x-coordinates (x1_values) spanning the width of the data.
    • We use the derived formula $x_2 = -(w_1 x_1 + w_0)/w_2$ to generate the y-coordinates (x2_values).
  3. Plotting: We plot this line in green. Any point on one side of this line results in a probability >0.5 (prediction +1), and any point on the other side results in probability <0.5 (prediction -1).

Function: main()

Based on the principles of Validation and Testing discussed in Chapter 4 of the Learning From Data textbook (specifically Section 4.3 on Model Selection), here is the implementation for the final section of main().

This code performs hyperparameter tuning using the validation set, visualizes the best result, and finally estimates the out-of-sample error (Eout) using the test set.

Implementation

	# Explore different hyper-parameters.
	### YOUR CODE HERE
    # 1. Define a search space for hyperparameters (e.g., learning rate)
    learning_rates = [0.01, 0.05, 0.1, 0.5, 1.0]
    best_acc = 0
    best_logisticR = None

    print("--- Hyperparameter Tuning ---")
    # 2. Iterate through hyperparameters to find the best model
    for lr in learning_rates:
        # Initialize model
        model = logistic_regression(learning_rate=lr, max_iter=1000)

        # Train on Training Set (using BGD as a baseline)
        model.fit_BGD(train_X, train_y)

        # Evaluate on Validation Set (NOT Test Set - avoid data snooping)
        val_acc = model.score(valid_X, valid_y)
        print(f"Learning Rate: {lr}, Validation Accuracy: {val_acc}")

        # Track the best model
        if val_acc > best_acc:
            best_acc = val_acc
            best_logisticR = model

    print(f"Best Validation Accuracy: {best_acc} with LR: {best_logisticR.learning_rate}")
	### END YOUR CODE

	# Visualize the your 'best' model after training.
	### YOUR CODE HERE
    # Pass the features (columns 1 and 2) and the learned weights (W)
    # get_params() presumably returns self.W
    visualize_result(train_X[:, 1:3], train_y, best_logisticR.get_params())
	### END YOUR CODE

	# Use the 'best' model above to do testing. Note that the test data should be loaded and processed in the same way as the training data.
	### YOUR CODE HERE
    print("--- Testing ---")
    # 1. Load the raw test data
    # We assume 'test_filename' is defined in the imports/config section similar to 'train_filename'
    raw_test_data, test_labels = load_data(os.path.join(data_dir, test_filename))

    # 2. Extract Features (Symmetry and Intensity)
    # Must use the exact same prepare_X function as training
    test_X_all = prepare_X(raw_test_data)

    # 3. Filter for Class 1 and 2
    # prepare_y returns indices for these classes
    test_y_all, test_idx = prepare_y(test_labels)

    # 4. Apply filter to features and labels
    test_X = test_X_all[test_idx]
    test_y = test_y_all[test_idx]

    # 5. Convert labels to binary {-1, 1}
    # Map label '2' to '-1' (Digit 1 is positive class)
    test_y[test_y == 2] = -1

    # 6. Evaluate the best model on the Test Set
    test_acc = best_logisticR.score(test_X, test_y)
    print(f"Final Test Accuracy: {test_acc}")
	### END YOUR CODE

Explanation of Steps

  1. Hyperparameter Tuning (Validation):
    • Concept: As described in Chapter 4 (Validation), we cannot use the Test Set to choose parameters (like learning rate) because that constitutes data snooping.
    • Action: We loop through different learning rates, train on train_X, and evaluate on valid_X. The model that performs best on the validation set is selected as best_logisticR.
  2. Visualization:
    • Concept: We visualize the decision boundary to verify if the linear model effectively separates the two classes in the feature space (Symmetry vs. Intensity).
    • Action: We call visualize_result with the best model's weights. We calculate train_X[:, 1:3] because train_X column 0 is the bias (x0=1), but the scatter plot only needs the 2D features (x1,x2).
  3. Testing:
    • Concept: The Test Set is used only once at the very end to estimate Eout (out-of-sample error). This provides an unbiased estimate of how the model will perform on unseen data.
    • Action:
      • Consistency: It is crucial that test data undergoes the exact same preprocessing as training data. We call prepare_X to extract features and prepare_y to filter for digits 1 and 2.
      • Label Mapping: We must convert the raw labels (1 and 2) to the binary format (1 and -1) used by the logistic regression logic.
      • Scoring: We calculate the final accuracy using score().

4. Softmax Logistic Regression

Function: _gradient

    def _gradient(self, _x, _y):
        """Compute the gradient of cross-entropy with respect to self.W
        for one training sample (_x, _y). This function is used in fit_*.

        Args:
            _x: An array of shape [n_features,].
            _y: One_hot vector. 

        Returns:
            _g: An array of shape [n_features,]. The gradient of
                cross-entropy with respect to self.W.
        """
		### YOUR CODE HERE

		### END YOUR CODE

Although the provided docstring states the return shape is [n_features,], Softmax Regression typically requires a weight matrix of shape [n_features, n_classes] (one weight vector per class). Consequently, the gradient should also be a matrix of the same shape. I will implement it returning the matrix, as this is mathematically required for the one-hot vector input.
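For reference, the math the implementation below follows is the standard softmax cross-entropy gradient for one sample, with $\hat{y}$ the vector of predicted class probabilities and $y$ the one-hot target (stated here as background, since the starter code does not spell it out):

$$E(W) = -\sum_{k=1}^{K} y_k \ln \hat{y}_k, \qquad \hat{y} = \operatorname{softmax}(W^T x)$$

$$\nabla_W E(W) = x\,(\hat{y} - y)^T, \qquad \text{shape } [\text{n\_features}, K]$$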

Implementation

    def _gradient(self, _x, _y):
        """Compute the gradient of cross-entropy with respect to self.W
        for one training sample (_x, _y). This function is used in fit_*.

        Args:
            _x: An array of shape [n_features,].
            _y: One_hot vector. 

        Returns:
            _g: An array of shape [n_features,]. The gradient of
                cross-entropy with respect to self.W.
        """
		### YOUR CODE HERE

        # Assumes self.W has shape (n_features, n_classes)
        # Compute the Softmax probabilities
        probs = self.softmax(_x)

        # Compute the error term (prediction prob - target)
        error = probs - _y
            # probs is the vector of class probabilities
        
        # Compute the gradient (outer product of the input x and the error vector)
        _g = np.outer(_x, error)    # shape: (n_features, k)

        return _g

Detailed Step-by-Step Explanation

1. Compute the Linear Signal ($z = W^T x$): self.softmax(_x) internally multiplies the sample by the weight matrix, giving one score per class.

2. Apply the Softmax Function: the scores are converted into a probability vector $\hat{y}$ whose entries are positive and sum to 1.

3. Compute the Error ($\hat{y} - y$): subtracting the one-hot target from the predicted probabilities gives the per-class error; the entry for the true class is negative, the others are positive.

4. Compute the Gradient w.r.t. the Weights (W): np.outer(_x, error) multiplies each feature by each class error, producing a gradient matrix of shape [n_features, k].

Function: fit_miniBGD

    def fit_miniBGD(self, X, labels, batch_size):
        """Train perceptron model on data (X,y) with mini-Batch GD.

        Args:
            X: An array of shape [n_samples, n_features].
            labels: An array of shape [n_samples,].  Only contains 0,..,k-1.
            batch_size: An integer.

        Returns:
            self: Returns an instance of self.

        Hint: the labels should be converted to one-hot vectors, for example: 1----> [0,1,0]; 2---->[0,0,1].
        """

		### YOUR CODE HERE

		### END YOUR CODE

Implementation

    def fit_miniBGD(self, X, labels, batch_size):
        """Train perceptron model on data (X,y) with mini-Batch GD.

        Args:
            X: An array of shape [n_samples, n_features].
            labels: An array of shape [n_samples,].  Only contains 0,..,k-1.
            batch_size: An integer.

        Returns:
            self: Returns an instance of self.

        Hint: the labels should be converted to one-hot vectors, for example: 1----> [0,1,0]; 2----> [0,0,1].
        """
        ### YOUR CODE HERE
        n_samples, n_features = X.shape

        # 1. Initialize Weights
        # Unlike binary logistic regression (vector), Softmax uses a weight matrix.
        # Shape: (n_features, k) where k is the number of classes.
        self.W = np.zeros((n_features, self.k))

        # 2. One-Hot Encoding of Labels
        # We need to convert integer labels (e.g., 2) into one-hot vectors (e.g., [0, 0, 1]).
        # Create a matrix of zeros of shape (n_samples, k)
        y_one_hot = np.zeros((n_samples, self.k))
        # Set the index corresponding to the label to 1 for each sample
        y_one_hot[np.arange(n_samples), labels] = 1

        # 3. Loop over epochs
        for _ in range(self.max_iter):
            # 4. Shuffle data
            # Essential for Mini-Batch/SGD to ensure batches are random representatives
            perm_indices = np.random.permutation(n_samples)
            X_shuffled = X[perm_indices]
            y_shuffled = y_one_hot[perm_indices]

            # 5. Iterate over mini-batches
            for start_idx in range(0, n_samples, batch_size):
                # Handle the last batch (might be smaller than batch_size)
                end_idx = min(start_idx + batch_size, n_samples)

                # Slice the batch
                X_batch = X_shuffled[start_idx:end_idx]
                y_batch = y_shuffled[start_idx:end_idx]
                current_batch_size = end_idx - start_idx

                # 6. Accumulate Gradients for this batch
                # Initialize a zero gradient matrix of shape (n_features, k)
                batch_grad_sum = np.zeros((n_features, self.k))

                # Sum the gradients for every sample in the batch
                for i in range(current_batch_size):
                    # Use the _gradient helper implemented previously
                    # It returns the gradient contribution of a single sample
                    batch_grad_sum += self._gradient(X_batch[i], y_batch[i])

                # 7. Average the gradient
                # The gradient of the loss is the average over the batch samples
                avg_grad = batch_grad_sum / current_batch_size

                # 8. Update Weights
                # Move W in the opposite direction of the gradient
                self.W -= self.learning_rate * avg_grad

        ### END YOUR CODE
        return self

Detailed Explanation of Each Step

  1. Initialize Weights (self.W = np.zeros(...)):
    • Concept: In Softmax Regression, we have a separate weight vector for each class k. Therefore, self.W becomes a matrix of size [n_features, k]. Initializing to zeros is a standard starting point for convex problems like logistic regression.
  2. One-Hot Encoding:
    • Math: The mathematical derivation of the cross-entropy gradient relies on the target y being a probability vector. If the true class is k, the target vector should have a 1 at index k and 0 elsewhere.
    • Code: y_one_hot[np.arange(n_samples), labels] = 1 efficiently creates this matrix. If sample 0 has label 2, this sets y_one_hot[0, 2] = 1.
  3. Shuffle (np.random.permutation):
    • Reason: As discussed in Chapter 3 regarding SGD, randomness is crucial. If we process the data in the same order every epoch, the model might learn the sequence of the data rather than the features, or get stuck in cycles. Shuffling ensures that each mini-batch is a random sample of the true population.
  4. Batch Loop:
    • We iterate through the dataset in chunks of size batch_size. The logic min(start + batch_size, n_samples) ensures we don't crash when the dataset size isn't perfectly divisible by the batch size (handling the last "remainder" batch).
  5. Accumulate Gradients:
    • Math: The gradient of the error for a batch is the sum (or average) of the gradients of the individual samples in that batch.
    • Code: We loop through the current batch, call self._gradient (which you implemented in the previous step to calculate $\nabla E_n(W)$), and add it to batch_grad_sum.
  6. Average Gradient:
    • Math: We divide by current_batch_size. This normalizes the update so that the magnitude of the step doesn't depend on how many samples are in the batch. This effectively estimates the "true" gradient of the expected error.
  7. Weight Update:
    • Math: $W^{(t+1)} = W^{(t)} - \eta \nabla E_{\text{batch}}(W)$.
    • We subtract the scaled gradient to move down the error surface toward the minimum. Note that self.W and avg_grad are both matrices of shape [n_features, k], so this updates all class weights simultaneously.

Function: softmax

    def softmax(self, x):
        """Compute softmax values for each sets of scores in x."""
        ### You must implement softmax by youself, otherwise you will not get credits for this part.

		### YOUR CODE HERE

        # Compute the linear scores: s = W^T x (matrix multiplication).
        # Works for a single sample (n_features,) -> (k,) or a batch
        # (n_samples, n_features) @ (n_features, k) -> (n_samples, k).
        s = x @ self.W

        # Stabilize by subtracting the max score (prevents overflow in exp;
        # the softmax output is unchanged because softmax is shift-invariant)
        s = s - np.max(s, axis=-1, keepdims=True)

        # Softmax: exp(s) normalized so the class probabilities sum to 1
        exp_s = np.exp(s)
        soft_probs = exp_s / np.sum(exp_s, axis=-1, keepdims=True)

        return soft_probs

Given a K-dimensional vector $v = [v_1, v_2, \ldots, v_K]^T \in \mathbb{R}^K$,

$$\operatorname{softmax}(v) = \frac{1}{\sum_{k=1}^{K} e^{v_k}} \begin{bmatrix} e^{v_1} \\ e^{v_2} \\ \vdots \\ e^{v_K} \end{bmatrix}$$

The multi-class prediction function can be written as

$$h_w(x) = \begin{bmatrix} P(y=1 \mid x; w) \\ P(y=2 \mid x; w) \\ \vdots \\ P(y=K \mid x; w) \end{bmatrix} = \operatorname{softmax}\left( \begin{bmatrix} w_1^T x \\ w_2^T x \\ \vdots \\ w_K^T x \end{bmatrix} \right) = \frac{1}{\sum_{k=1}^{K} e^{w_k^T x}} \begin{bmatrix} e^{w_1^T x} \\ e^{w_2^T x} \\ \vdots \\ e^{w_K^T x} \end{bmatrix}$$
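A quick numeric sketch with made-up class scores, confirming the softmax output is a valid probability vector and that subtracting the maximum score does not change it:

    import numpy as np

    s = np.array([2.0, 1.0, 0.1])           # toy class scores w_k^T x
    s = s - np.max(s)                        # stabilization; softmax is shift-invariant
    p = np.exp(s) / np.sum(np.exp(s))

    print(p)          # approx [0.659 0.242 0.099]
    print(p.sum())    # 1.0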

Function: predict

    def predict(self, X):
        """Predict class labels for samples in X.

        Args:
            X: An array of shape [n_samples, n_features].

        Returns:
            preds: An array of shape [n_samples,]. Only contains 0,..,k-1.
        """
		### YOUR CODE HERE

		### END YOUR CODE

Implementation

    def predict(self, X):
        """Predict class labels for samples in X.

        Args:
            X: An array of shape [n_samples, n_features].

        Returns:
            preds: An array of shape [n_samples,]. Only contains 0,..,k-1.
        """
        ### YOUR CODE HERE
        # 1. Compute the linear scores (logits) for all classes
        # X shape: (n_samples, n_features)
        # self.W shape: (n_features, k)
        # Result scores shape: (n_samples, k)
        scores = np.dot(X, self.W)

        # 2. Determine the predicted class
        # The predicted class is the index of the highest score.
        # We assume the index corresponds to the label (0, 1, ..., k-1).
        # axis=1 means we find the max across columns (classes) for each row.
        preds = np.argmax(scores, axis=1)

        return preds
        ### END YOUR CODE

Function: score

    def score(self, X, labels):
        """Returns the mean accuracy on the given test data and labels.

        Args:
            X: An array of shape [n_samples, n_features].
            labels: An array of shape [n_samples,]. Only contains 0,..,k-1.

        Returns:
            score: An float. Mean accuracy of self.predict(X) wrt. labels.
        """
		### YOUR CODE HERE

		### END YOUR CODE

Implementation


    def score(self, X, labels):
        """Returns the mean accuracy on the given test data and labels.

        Args:
            X: An array of shape [n_samples, n_features].
            labels: An array of shape [n_samples,]. Only contains 0,..,k-1.

        Returns:
            score: An float. Mean accuracy of self.predict(X) wrt. labels.
        """
        ### YOUR CODE HERE
        # 1. Generate predictions using the function implemented above
        predictions = self.predict(X)

        # 2. Calculate accuracy
        # Compare predictions to ground truth labels.
        # np.mean converts boolean (True/False) to float (1.0/0.0) and averages them.
        score = np.mean(predictions == labels)

        return score
        ### END YOUR CODE

Function: visualize_result_multi

def visualize_result_multi(X, y, W):
	'''This function is used to plot the softmax model after training. 

	Args:
		X: An array of shape [n_samples, 2].
		y: An array of shape [n_samples,]. Only contains 0,1,2.
		W: An array of shape [n_features, 3].
	
	Returns:
		No return. Save the plot to 'train_result_softmax.*' and include it
		in submission.
	'''
	### YOUR CODE HERE

	### END YOUR CODE

Based on the principles of Softmax Regression (multiclass linear model) discussed in the course notes, the decision boundary is determined by comparing the "scores" of each class. A point x is assigned to the class k that yields the highest score $s_k = w_k^T x$.

Because there are 3 classes (0, 1, 2) and multiple linear boundaries interacting (Class 0 vs 1, 1 vs 2, 0 vs 2), the best way to visualize the result is to draw the decision regions. We do this by creating a fine grid of points over the plot area, predicting the class for every point on that grid, and coloring the background based on those predictions.

Here is the implementation and a detailed step-by-step explanation.

Implementation

import matplotlib.pyplot as plt

def visualize_result_multi(X, y, W):
    '''This function is used to plot the softmax model after training.

    Args:
        X: An array of shape [n_samples, 2]. (Symmetry, Intensity)
        y: An array of shape [n_samples,]. Only contains 0, 1, 2.
        W: An array of shape [n_features, 3]. (Bias + 2 features x 3 classes)

    Returns:
        No return. Save the plot to 'train_result_softmax.png' and include it
        in submission.
    '''
    ### YOUR CODE HERE
    plt.figure(figsize=(8, 6))

    # 1. Create a Meshgrid
    # Define the boundaries of the plot based on the data range, adding some padding.
    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5

    # Generate a grid of points with 0.02 distance between them
    h = 0.02
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

    # 2. Prepare the Grid for Prediction
    # Flatten the grid matrices into a single list of (x, y) coordinates
    grid_points = np.c_[xx.ravel(), yy.ravel()]

    # Add the bias term (a column of 1s) so the grid matches W (3 rows: bias + 2 features).
    # grid_points_bias has shape [n_grid_points, 3]
    grid_points_bias = np.c_[np.ones(grid_points.shape[0]), grid_points]

    # 3. Predict the Class for Every Point on the Grid
    # Compute scores: Z = X * W. Shape: [n_grid_points, 3]
    scores = np.dot(grid_points_bias, W)

    # Select class with highest score. Shape: [n_grid_points,]
    Z = np.argmax(scores, axis=1)

    # Reshape the predictions back to the shape of the meshgrid
    Z = Z.reshape(xx.shape)

    # 4. Plot the Decision Regions (Contours)
    # Use contourf to fill the regions with colors corresponding to classes
    plt.contourf(xx, yy, Z, alpha=0.4, cmap=plt.cm.Paired)

    # 5. Plot the Training Data Points
    # Scatter plot the actual data points, colored by their true label 'y'
    # We loop through classes to add a legend
    colors = ['blue', 'red', 'green']
    labels = ['Digit 0', 'Digit 1', 'Digit 2']

    for i in range(3):
        # Select points belonging to class i
        idx = np.where(y == i)
        plt.scatter(X[idx, 0], X[idx, 1], c=colors[i], label=labels[i], edgecolor='k', s=20)

    # 6. Formatting and Saving
    plt.xlabel('Symmetry')
    plt.ylabel('Intensity')
    plt.title('Softmax Regression Decision Boundaries')
    plt.legend()

    plt.savefig('train_result_softmax.png')
    # plt.show()
    ### END YOUR CODE

Explanation of Each Step

  1. Create a Meshgrid (np.meshgrid):
    • Context: Unlike the binary case, where we could solve a single equation w^T x = 0 to get a line, here we have three boundaries. The region for Class 0 is where s_0 > s_1 and s_0 > s_2.
    • Action: Instead of calculating lines mathematically, we define a "mesh" (a fine net) covering the entire area where the data points sit. We will ask the model to classify every intersection point on this net.
  2. Prepare Grid for Prediction (grid_points_bias):
    • Context: The model weights W have shape [3, 3]: one row each for the bias, Symmetry, and Intensity terms, and one column per class.
    • Action: The meshgrid gives us just the 2D coordinates (x_1, x_2). We flatten this into a long list of points and add a column of 1s (the bias term x_0) so that we can perform the matrix multiplication with W (see the shape-check sketch after this list).
  3. Predict Class (np.argmax):
    • Math: For every point on the grid, we calculate 3 scores: [s_0, s_1, s_2].
    • Logic: The predicted class is the index of the maximum score.
    • Result: Z contains the predicted class (0, 1, or 2) for every pixel in our plot background.
  4. Plot Contours (plt.contourf):
    • Action: This function draws filled regions. It looks at the predictions in Z and colors each area according to its predicted class (the exact colors depend on the chosen colormap). This visualizes the decision surface and the linear boundaries separating the classes.
  5. Plot Training Data:
    • Action: Finally, we overlay the actual data points X, colored by their true labels y. If the model is good, the points of each digit should mostly fall inside the background region the model assigns to that class.
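
To make the shape bookkeeping in steps 1-3 concrete, here is a tiny standalone sketch (the grid size and the weight matrix W_toy are made up for illustration; they are not the trained model):

import numpy as np

# A small 3x4 grid instead of the fine h=0.02 mesh used above.
xx, yy = np.meshgrid(np.linspace(0.0, 1.0, 4), np.linspace(0.0, 1.0, 3))
print(xx.shape)                                   # (3, 4)

# Flatten to a list of (x1, x2) points, then prepend the bias column of 1s.
grid = np.c_[xx.ravel(), yy.ravel()]              # shape (12, 2)
grid_bias = np.c_[np.ones(grid.shape[0]), grid]   # shape (12, 3)

# Toy weights: rows = (bias, Symmetry, Intensity), columns = classes 0, 1, 2.
W_toy = np.array([[ 0.1, -0.2,  0.0],
                  [ 1.0,  0.5, -1.0],
                  [-0.5,  1.0,  0.3]])

scores = grid_bias.dot(W_toy)                     # shape (12, 3): one score per class
Z = np.argmax(scores, axis=1)                     # shape (12,): predicted class per point
print(Z.reshape(xx.shape))                        # back to (3, 4), ready for plt.contourf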

Function: main()

	# ------------Logistic Regression Multiple-class case, let k= 3------------
	###### Use all data from '0' '1' '2' for training
	train_X = train_X_all
	train_y = train_y_all
	valid_X = valid_X_all
	valid_y = valid_y_all

	#########  miniBGD for multiclass Logistic Regression
	logisticR_classifier_multiclass = logistic_regression_multiclass(learning_rate=0.5, max_iter=100,  k= 3)
	logisticR_classifier_multiclass.fit_miniBGD(train_X, train_y, 10)
	print(logisticR_classifier_multiclass.get_params())
	print(logisticR_classifier_multiclass.score(train_X, train_y))

	# Explore different hyper-parameters.
	### YOUR CODE HERE

	

	### END YOUR CODE

	# Visualize your 'best' model after training.
	# visualize_result_multi(train_X[:, 1:3], train_y, best_logistic_multi_R.get_params())


	# Use the 'best' model above to do testing.
	### YOUR CODE HERE

	### END YOUR CODE

Implementation

    # ------------Logistic Regression Multiple-class case, let k= 3------------
    ###### Use all data from '0' '1' '2' for training
    # Unlike the binary case, we do not filter for specific digits here.
    # We use the full dataset processed earlier.
    train_X = train_X_all
    train_y = train_y_all
    valid_X = valid_X_all
    valid_y = valid_y_all

    #########  miniBGD for multiclass Logistic Regression
    # Initial test run to verify the code works
    logisticR_classifier_multiclass = logistic_regression_multiclass(learning_rate=0.5, max_iter=100,  k= 3)
    logisticR_classifier_multiclass.fit_miniBGD(train_X, train_y, 10)
    print("Initial weights:", logisticR_classifier_multiclass.get_params())
    print("Initial Train Score:", logisticR_classifier_multiclass.score(train_X, train_y))

    # Explore different hyper-parameters.
    ### YOUR CODE HERE
    print("\n--- Multiclass Hyperparameter Tuning ---")
    learning_rates = [0.01, 0.05, 0.1, 0.5, 1.0]
    best_multi_acc = -1
    best_logistic_multi_R = None

    # 1. Grid Search for the best Hyperparameters
    for lr in learning_rates:
        # Initialize the multiclass model
        # We assume k=3 for digits 0, 1, 2
        model = logistic_regression_multiclass(learning_rate=lr, max_iter=1000, k=3)

        # Train using Mini-Batch Gradient Descent
        # Batch size is set to 10, but could also be tuned
        model.fit_miniBGD(train_X, train_y, batch_size=10)

        # Evaluate on the VALIDATION set
        val_acc = model.score(valid_X, valid_y)
        print(f"LR: {lr}, Validation Accuracy: {val_acc}")

        # Select the best model
        if val_acc > best_multi_acc:
            best_multi_acc = val_acc
            best_logistic_multi_R = model

    print(f"Best Multiclass Model LR: {best_logistic_multi_R.learning_rate} with Val Acc: {best_multi_acc}")
    ### END YOUR CODE

    # Visualize your 'best' model after training.
    # We pass the features (cols 1 and 2) and the learned weight matrix W.
    visualize_result_multi(train_X[:, 1:3], train_y, best_logistic_multi_R.get_params())

    # Use the 'best' model above to do testing.
    ### YOUR CODE HERE
    print("\n--- Multiclass Testing ---")
    # 1. Load the raw test data
    raw_test_data, test_labels = load_data(os.path.join(data_dir, test_filename))

    # 2. Extract Features
    # Use the same function as training to get Bias, Symmetry, Intensity
    test_X_all = prepare_X(raw_test_data)

    # 3. Prepare Labels
    # prepare_y usually returns the mapped labels and indices.
    # For multiclass (0, 1, 2), we typically use all data returned.
    test_y_all, _ = prepare_y(test_labels)

    # 4. Score the best model on the Test Set
    # This provides the unbiased estimate of E_out (Out-of-sample error)
    test_acc = best_logistic_multi_R.score(test_X_all, test_y_all)
    print(f"Final Multiclass Test Accuracy: {test_acc}")
    ### END YOUR CODE

Detailed Explanation of Each Step

  1. Data Setup (train_X = train_X_all):

    • Context: In the binary section, you likely filtered train_X_all to keep only digits 1 and 2.
    • Action: For the multiclass section, we revert to using train_X_all, which contains features for all available classes (digits 0, 1, and 2) as prepared by your prepare_X function.
  2. Hyperparameter Tuning (The Loop):

    • Concept: As discussed in Chapter 4 regarding Model Selection, we use the Validation Set (valid_X, valid_y) to select the best learning rate.
    • Action: We train multiple instances of logistic_regression_multiclass with different learning rates using fit_miniBGD. We select the one that achieves the highest accuracy on the validation set; a sketch that extends the same search to the batch size follows this list.
  3. Visualization (visualize_result_multi):

    • Concept: Visualizing the decision boundaries helps verify that the model has learned meaningful geometric separations between the classes.
    • Action: We call the function you implemented earlier.
      • train_X[:, 1:3]: The visualization function expects a 2D feature array (Symmetry vs. Intensity). Since train_X has a bias column of 1s at index 0, we slice it to pass only columns 1 and 2.
      • best_...get_params(): This passes the learned weight matrix W (shape 3×3) to the visualizer so it can compute scores and draw the regions.
  4. Testing:

    • Concept: This step estimates Eout (out-of-sample error). Per the "Three Learning Principles" (Chapter 5) and "Training versus Testing" (Chapter 2), the test set must be used only once at the very end.
    • Action:
      • Loading: We load the test data from the file.
      • Preprocessing: We must apply the exact same prepare_X and prepare_y transformations used on the training data. This ensures the test feature matrix has the same dimensions (bias + 2 features) and format as the model expects.
      • Scoring: We calculate the accuracy of our best_logistic_multi_R model on this fresh data. This number is your final performance metric.
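
As noted in the code comment above ("Batch size is set to 10, but could also be tuned"), the batch size is another hyperparameter worth exploring. Here is a minimal sketch of how the same validation-based selection could be extended to a joint search; the candidate values are illustrative, and it assumes train_X, train_y, valid_X, valid_y, and logistic_regression_multiclass are already defined as above:

# A minimal sketch: joint search over learning rate and batch size,
# still selecting by validation accuracy (candidate values are illustrative).
best_acc, best_model = -1.0, None
for lr in [0.05, 0.1, 0.5]:
    for bs in [5, 10, 50]:
        model = logistic_regression_multiclass(learning_rate=lr, max_iter=1000, k=3)
        model.fit_miniBGD(train_X, train_y, batch_size=bs)
        acc = model.score(valid_X, valid_y)
        if acc > best_acc:
            best_acc, best_model = acc, model
print("Best validation accuracy:", best_acc)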

Here is the implementation for the final part of your assignment.

This section compares Softmax Regression (configured for k=2) against Sigmoid Logistic Regression (binary). As discussed in the "Logistic Regression" sections of the Learning From Data slides and textbook, the two methods should theoretically perform very similarly for binary classification, as Softmax is a generalization of the Sigmoid function.

Implementation

    # ------------Connection between sigmoid and softmax------------
    ############ Now set k=2, only use data from '1' and '2'

    #####  set labels to 0, 1 for the softmax classifier
    train_X = train_X_all[train_idx]
    train_y = train_y_all[train_idx]
    train_X = train_X[0:1350]
    train_y = train_y[0:1350]
    valid_X = valid_X_all[val_idx]
    valid_y = valid_y_all[val_idx]

    # Softmax requires labels 0 and 1.
    # Currently we have 1 and 2. Map 2 -> 0, and 1 -> 1 (implicit).
    train_y[np.where(train_y == 2)] = 0
    valid_y[np.where(valid_y == 2)] = 0

    ###### First, fit the softmax classifier until convergence, and evaluate
    ##### Hint: we suggest setting the convergence condition as "np.linalg.norm(gradients*1./batch_size) < 0.0005" or max_iter=10000:
    ### YOUR CODE HERE
    # Initialize the Softmax model with k=2.
    # We use a large max_iter=10000 to ensure convergence, as suggested in the hint.
    softmax_model = logistic_regression_multiclass(learning_rate=0.05, max_iter=10000, k=2)

    # Train using Mini-Batch Gradient Descent
    softmax_model.fit_miniBGD(train_X, train_y, batch_size=10)

    # Evaluate
    train_acc_soft = softmax_model.score(train_X, train_y)
    valid_acc_soft = softmax_model.score(valid_X, valid_y)
    print(f"Softmax (k=2) - Train Acc: {train_acc_soft:.4f}, Valid Acc: {valid_acc_soft:.4f}")
    ### END YOUR CODE


    train_X = train_X_all[train_idx]
    train_y = train_y_all[train_idx]
    train_X = train_X[0:1350]
    train_y = train_y[0:1350]
    valid_X = valid_X_all[val_idx]
    valid_y = valid_y_all[val_idx]
    #####  set labels to -1 and 1 for the sigmoid classifier
    ### YOUR CODE HERE
    # Sigmoid (Binary) requires labels -1 and 1.
    # Currently we have 1 and 2. Map 2 -> -1, and keep 1 -> 1.
    train_y[train_y == 2] = -1
    valid_y[valid_y == 2] = -1
    ### END YOUR CODE

    ###### Next, fit the sigmoid classifier until convergence, and evaluate
    ##### Hint: we suggest setting the convergence condition as "np.linalg.norm(gradients*1./batch_size) < 0.0005" or max_iter=10000:
    ### YOUR CODE HERE
    # Initialize Binary Logistic Regression model
    sigmoid_model = logistic_regression(learning_rate=0.05, max_iter=10000)

    # Train using Mini-Batch Gradient Descent (to match the Softmax comparison)
    sigmoid_model.fit_miniBGD(train_X, train_y, batch_size=10)

    # Evaluate
    train_acc_sig = sigmoid_model.score(train_X, train_y)
    valid_acc_sig = sigmoid_model.score(valid_X, valid_y)
    print(f"Sigmoid (Binary) - Train Acc: {train_acc_sig:.4f}, Valid Acc: {valid_acc_sig:.4f}")
    ### END YOUR CODE


    ################ Compare and report the observations / prediction accuracy
    print("\n--- Observations ---")
    print(f"Difference in Validation Accuracy: {abs(valid_acc_soft - valid_acc_sig):.6f}")

    # Optional: Compare weights
    # Softmax has 2 weight vectors (one for class 0, one for class 1).
    # Sigmoid has 1 weight vector (distinguishing class 1 from -1).
    # The decision boundary for Softmax is (w_1 - w_0)^T x = 0.
    # The decision boundary for Sigmoid is w^T x = 0.
    # Therefore, (w_1 - w_0) from Softmax should be proportional to w from Sigmoid.

    softmax_W = softmax_model.get_params() # Shape (d, 2)
    sigmoid_W = sigmoid_model.get_params() # Shape (d,)

    # Calculate effective weight for softmax (w_class1 - w_class0)
    # Note: In our mapping, label 1 is column 1, label 0 (old 2) is column 0.
    effective_softmax_W = softmax_W[:, 1] - softmax_W[:, 0]

    print("Shape of Softmax Weights:", softmax_W.shape)
    print("Shape of Sigmoid Weights:", sigmoid_W.shape)
    # Cosine similarity to check alignment
    cosine_sim = np.dot(effective_softmax_W, sigmoid_W) / (np.linalg.norm(effective_softmax_W) * np.linalg.norm(sigmoid_W))
    print(f"Alignment (Cosine Similarity) between models: {cosine_sim:.4f}")

    # ------------End------------

Explanation of Observations

  1. Prediction Accuracy: You should observe that the validation accuracy for both models is identical (or extremely close, allowing for small numerical differences in optimization).
  2. Theoretical Connection:
    • Sigmoid: Predicts P(y=1|x) = 1 / (1 + e^{-w^T x}). The decision boundary is w^T x = 0.
    • Softmax (k=2): Predicts P(y=1|x) = e^{w_1^T x} / (e^{w_0^T x} + e^{w_1^T x}) = 1 / (1 + e^{-(w_1 - w_0)^T x}). The decision boundary is (w_1 - w_0)^T x = 0.
    • Conclusion: Softmax with k=2 is mathematically equivalent to Sigmoid Logistic Regression. The effective weight vector in Softmax, (w_1 - w_0), plays the same role as the single weight vector w in Sigmoid. A small numerical check of this identity is sketched below.
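
To see this equivalence numerically, here is a small self-contained check (toy scores, independent of the trained models) showing that the two-class softmax probability of class 1 equals the sigmoid of the score difference:

import numpy as np

# Toy two-class scores s_0 = w_0^T x and s_1 = w_1^T x for a few points.
s0 = np.array([0.3, -1.2, 2.0])
s1 = np.array([1.1,  0.4, -0.5])

# Two-class softmax probability of class 1.
p_softmax = np.exp(s1) / (np.exp(s0) + np.exp(s1))

# Sigmoid applied to the score difference (w_1 - w_0)^T x = s_1 - s_0.
p_sigmoid = 1.0 / (1.0 + np.exp(-(s1 - s0)))

print(np.allclose(p_softmax, p_sigmoid))  # True: identical predictions

This is also why the cosine similarity computed in the code above should be close to 1: the effective softmax direction (w_1 - w_0) and the sigmoid weight vector w define the same decision boundary.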