02 - Logistic Regression For Binary Classification

Class: CSCE-421


Notes:

Digits Data

Pasted image 20260120093107.png|350

Each digit is a 16 × 16 image (256 pixels).

[-1 -1 -1 -1 -1 -1 -1 -0.63 0.86 -0.17 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -0.99 0.3 1 0.31 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ... -0.94 -1 -1 -1 -1 -1 -1 -1 -0.97 -0.42 0.30 0.82 1 0.48 -0.47 -0.99 -1 -1 -1 -1]
$$x = (1, x_1, \ldots, x_{256})\ \text{(input)}, \qquad w = (w_0, w_1, \ldots, w_{256})\ \text{(linear model)}, \qquad \dim = 257$$
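A minimal sketch of building this input vector, assuming NumPy and a hypothetical 16 × 16 array `digit` of pixel intensities in [-1, 1]:

```python
import numpy as np

# Hypothetical 16x16 digit image with pixel intensities in [-1, 1]
digit = np.random.uniform(-1.0, 1.0, size=(16, 16))

# Flatten to 256 pixels and prepend a constant 1 for the bias weight w0
x = np.concatenate(([1.0], digit.ravel()))  # shape (257,)
w = np.zeros(257)                           # linear model (w0, w1, ..., w256)
signal = w @ x                              # the linear signal w^T x
```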

Image Representations

Pasted image 20260120093331.png|600

Transformations on Images

Pasted image 20260120093804.png|600

Learning Invariant Representations

Pasted image 20260120094522.png|400

Intensity and Symmetry Features

Feature: an important property of the input that you think is useful for classification.

Pasted image 20260120094850.png|400

$$x = (1, x_1, x_2)\ \text{(input)}, \qquad w = (w_0, w_1, w_2)\ \text{(linear model)}, \qquad \dim = 3$$
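A minimal sketch of the two features, assuming the same hypothetical 16 × 16 `digit` array; intensity is taken here as the average pixel value and symmetry as the negative mean absolute difference from the left-right flip (the exact definitions used in class may differ):

```python
import numpy as np

def intensity(digit: np.ndarray) -> float:
    # Average pixel value over the whole image
    return float(digit.mean())

def symmetry(digit: np.ndarray) -> float:
    # Negative mean absolute difference between the image and its mirror image:
    # symmetric digits score near 0, asymmetric digits score more negative
    return float(-np.abs(digit - np.fliplr(digit)).mean())

digit = np.random.uniform(-1.0, 1.0, size=(16, 16))
x = np.array([1.0, intensity(digit), symmetry(digit)])  # (1, x1, x2), dim = 3
```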

...

Logistic Regression: Predict Probabilities

Will someone have a heart attack over the next year?

| Feature | Value |
| --- | --- |
| Age | 62 years |
| Blood sugar | 120 mg/dL |
| HDL | 50 |
| LDL | 120 |
| Mass | 190 lbs |
| Height | 5′10″ |
| ... | ... |

Logistic Regression: predict the probability of a heart attack, $y \in [0, 1]$.

$$h(x) = \theta\!\left(\sum_{i=0}^{d} w_i x_i\right) = \theta(w^T x), \qquad \theta(s) = \frac{1}{1 + e^{-s}}$$

Pasted image 20260120095341.png|250
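A minimal sketch of this hypothesis, assuming NumPy:

```python
import numpy as np

def theta(s):
    # Logistic (sigmoid) function: theta(s) = 1 / (1 + e^{-s})
    return 1.0 / (1.0 + np.exp(-s))

def h(w, x):
    # h(x) = theta(w^T x): a probability in (0, 1)
    return theta(w @ x)
```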

The Data is Still Binary, ±1

$$D = (x_1, y_1 = \pm 1), \ldots, (x_N, y_N = \pm 1)$$

where $x_n$ is a person's health information and $y_n = \pm 1$ records whether or not they had a heart attack.

Setting

We are trying to learn the target function

$$f(x) = P[y = +1 \mid x]$$

The data does not give us the value of f explicitly. Rather, it gives us samples generated by this probability:

$$P(y = +1 \mid x) = f(x), \qquad P(y = -1 \mid x) = 1 - f(x)$$

To learn from such data, we need to define a proper error measure that gauges how close a given hypothesis $h$ is to $f$ in terms of these examples.

What Makes an h Good?

The Math

'Fitting' the data means finding a good $h$.

$h$ is good if:

$$\begin{cases} h(x_n) \approx 1 & \text{whenever } y_n = +1; \\ h(x_n) \approx 0 & \text{whenever } y_n = -1. \end{cases}$$

A simple error measure that captures this:

$$E_{\text{in}}(h) = \frac{1}{N} \sum_{n=1}^{N} \left( h(x_n) - \tfrac{1}{2}(1 + y_n) \right)^2$$
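As a sketch (NumPy assumed, with hypothetical arrays `h_vals` holding $h(x_n)$ and `y` holding the ±1 labels):

```python
import numpy as np

def simple_error(h_vals, y):
    # E_in(h) = (1/N) * sum_n (h(x_n) - (1 + y_n)/2)^2
    targets = (1.0 + y) / 2.0  # maps y = +1 -> 1 and y = -1 -> 0
    return float(np.mean((h_vals - targets) ** 2))
```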

The Cross Entropy Error Measure

The Math
$$E_{\text{in}}(w) = \frac{1}{N} \sum_{n=1}^{N} \ln\!\left(1 + e^{-y_n w^T x_n}\right)$$

It looks complicated and ugly (ln, e^(·), ...),

But,

Verify:

The Probabilistic Interpretation

The Math

Suppose that $h(x) = \theta(w^T x)$ closely captures $P[y = +1 \mid x]$:

$$P(y \mid x) = \begin{cases} \theta(w^T x) & \text{for } y = +1 \\ 1 - \theta(w^T x) & \text{for } y = -1 \end{cases}$$

Given:

$$\theta(s) = \frac{1}{1 + e^{-s}}$$

we have

$$\theta(-s) = 1 - \theta(s)$$
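One line of algebra verifies this identity:

$$\theta(-s) = \frac{1}{1 + e^{s}} = \frac{e^{-s}}{e^{-s} + 1} = 1 - \frac{1}{1 + e^{-s}} = 1 - \theta(s)$$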

So, more compactly,

$$P(y \mid x) = \theta(y\, w^T x)$$

The Likelihood

The Math
$$P(y \mid x) = \theta(y\, w^T x)$$

Likelihood:

- *"The product of each individual probability"* - **Why do we use product instead of summation?** - We want each of these individual probabilities to be large, so we need to maximize using the product - If you use summation it won't work, it turns out you need to use product because multiplication will affect and provide the necessary weight (it will achieve some kind of balance between each data point) - Essentially we want to adjust $w$ so that the product is as larger as possible - Maximizing the likelihood is the same as maximizing the product of individual probabilities - The likelihood measures the probability that the data were generated if $f$ were $h$. ### Maximizing The Likelihood

$$\begin{aligned}
& \max \prod_{n=1}^{N} P\left(y_n \mid x_n\right) \\
\Leftrightarrow\; & \max \ln\!\left(\prod_{n=1}^{N} P\left(y_n \mid x_n\right)\right)
\end{aligned}$$

Note: $\max f(x)$ is equivalent (not equal) to $\max \ln f(x)$ because $\log$ is monotonically increasing.

$$\equiv\; \max \sum_{n=1}^{N} \ln P\left(y_n \mid x_n\right)$$

Notes:
- The $\log$ of a product equals the sum of $\log$s (the log moves inside), because $\ln(ab) = \ln a + \ln b$ (log properties).
- $\log$ is also useful because a change in the input does not produce a great change in the output (for larger inputs more than smaller ones); this is sometimes called the saturating property of logs.
- This is equivalent to the first expression because the first is a product and this one is a summation.

$$\begin{aligned}
\equiv\; & \min -\frac{1}{N} \sum_{n=1}^{N} \ln P\left(y_n \mid x_n\right) \\
\equiv\; & \min \frac{1}{N} \sum_{n=1}^{N} \ln \frac{1}{P\left(y_n \mid x_n\right)} \\
\equiv\; & \min \frac{1}{N} \sum_{n=1}^{N} \ln \frac{1}{\theta\left(y_n w^T x_n\right)} \\
\equiv\; & \min \frac{1}{N} \sum_{n=1}^{N} \ln\!\left(1 + e^{-y_n w^T x_n}\right)
\end{aligned}$$
$$E_{\text{in}}(w) = \frac{1}{N} \sum_{n=1}^{N} \ln\!\left(1 + e^{-y_n w^T x_n}\right)$$
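A minimal sketch of this error measure, assuming NumPy, with hypothetical `X` an N × d matrix of inputs (each row already including the leading 1) and `y` a vector of ±1 labels:

```python
import numpy as np

def cross_entropy_error(w, X, y):
    # E_in(w) = (1/N) * sum_n ln(1 + exp(-y_n * w^T x_n))
    # np.logaddexp(0, z) computes ln(1 + e^z) in a numerically stable way
    margins = y * (X @ w)
    return float(np.mean(np.logaddexp(0.0, -margins)))
```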

View

Cross Entropy

$$\mathrm{CE}(g, q) = -\sum_i g_i \log q_i$$
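For context (not spelled out in the notes): each term of $E_{\text{in}}$ can be read as this cross entropy (with natural log) between the one-hot label distribution $g$ and the model's predicted distribution $q = \big(\theta(w^T x_n),\, 1 - \theta(w^T x_n)\big)$, since

$$\ln\!\left(1 + e^{-y_n w^T x_n}\right) = -\ln \theta(y_n w^T x_n) = \begin{cases} -\ln \theta(w^T x_n) & y_n = +1 \\ -\ln\big(1 - \theta(w^T x_n)\big) & y_n = -1 \end{cases}$$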

A Neural Network View

Pasted image 20260122100430.png|600

How To Minimize $E_{\text{in}}(w)$

Recall:

$$E_{\text{in}}(w) = \frac{1}{N} \sum_{n=1}^{N} \ln\!\left(1 + e^{-y_n w^T x_n}\right)$$

Regression - pseudoinverse (analytic), from solving $\nabla E_{\text{in}}(w) = 0$.

Logistic Regression - an analytic solution won't work.

Numerically/iteratively set $\nabla E_{\text{in}}(w) \approx 0$.

Finding The Best Weights - Hill Descent

A ball on a complicated hilly terrain rolls down to a local valley (a local minimum).

Pasted image 20260122103143.png|200

Questions:

Our $E_{\text{in}}$ Has Only One Valley

Pasted image 20260122103509.png|400

How to "Roll Down"?

The Math

Assume you are at weights $w(t)$ and you take a step of size $\eta$ in the direction $\hat{v}$.

$$w(t+1) = w(t) + \eta\, \hat{v}$$
  1. $\eta$: a scalar step size
  2. $\hat{v}$: a unit vector giving the direction

We get to pick $\hat{v}$ ← what's the best direction to take the step?

Pick $\hat{v}$ to make $E_{\text{in}}(w(t+1))$ as small as possible.

Pasted image 20260127094259.png|200


The Meaning

The Gradient is the Fastest Way to Roll Down

The Taylor series expansion of a multivariate function f(x) around a point a up to the second order is

$$f(x) \approx f(a) + \nabla f(a)^T (x - a) + \frac{1}{2} (x - a)^T H_f(a) (x - a)$$

Expanding $E_{\text{in}}(w(t+1))$ around $w(t)$ gives

$$\begin{aligned}
\Delta E_{\text{in}} &= E_{\text{in}}(w(t+1)) - E_{\text{in}}(w(t)) \\
&= E_{\text{in}}(w(t) + \eta \hat{v}) - E_{\text{in}}(w(t)) \\
&= \eta\, \nabla E_{\text{in}}(w(t))^T \hat{v} + O(\eta^2) \qquad (\text{Taylor's approximation}) \\
&\quad \text{minimized at } \hat{v} = -\frac{\nabla E_{\text{in}}(w(t))}{\lVert \nabla E_{\text{in}}(w(t)) \rVert}
\end{aligned}$$

The best (steepest) direction to move is the negative gradient:

$$\hat{v} = -\frac{\nabla E_{\text{in}}(w(t))}{\lVert \nabla E_{\text{in}}(w(t)) \rVert}$$
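Why this is the best choice: for any unit vector $\hat{v}$, the Cauchy–Schwarz inequality bounds the first-order change, with equality exactly at the normalized negative gradient:

$$\nabla E_{\text{in}}(w(t))^T \hat{v} \;\ge\; -\lVert \nabla E_{\text{in}}(w(t)) \rVert, \qquad \text{with equality iff } \hat{v} = -\frac{\nabla E_{\text{in}}(w(t))}{\lVert \nabla E_{\text{in}}(w(t)) \rVert}$$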

"Rolling Down ≡ Iterating the Negative Gradient"

Pasted image 20260127094740.png|450

The 'Goldilocks' Step Size

Pasted image 20260122104847.png|450

Fixed Learning Rate Gradient Descent

$$\eta_t = \eta\, \lVert \nabla E_{\text{in}}(w(t)) \rVert \to 0 \ \text{ when closer to the minimum.}$$

$$\eta_t\, \hat{v} = -\eta_t\, \frac{\nabla E_{\text{in}}(w(t))}{\lVert \nabla E_{\text{in}}(w(t)) \rVert} = -\eta\, \lVert \nabla E_{\text{in}}(w(t)) \rVert\, \frac{\nabla E_{\text{in}}(w(t))}{\lVert \nabla E_{\text{in}}(w(t)) \rVert} = -\eta\, \nabla E_{\text{in}}(w(t))$$

so the update becomes $w(t+1) = w(t) - \eta\, \nabla E_{\text{in}}(w(t))$.

Pasted image 20260127093606.png|400

Gradient descent can be used to minimize any smooth function (reaching a local minimum), for example

$$E_{\text{in}}(w) = \frac{1}{N} \sum_{n=1}^{N} \ln\!\left(1 + e^{-y_n w^T x_n}\right)$$

(logistic regression)
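A minimal sketch of fixed-learning-rate gradient descent for this $E_{\text{in}}$ (NumPy assumed; `X`, `y` as before, and the values of `eta` and `iters` are illustrative, not prescribed by the notes):

```python
import numpy as np

def gradient(w, X, y):
    # grad E_in(w) = -(1/N) * sum_n y_n * x_n / (1 + exp(y_n * w^T x_n))
    margins = y * (X @ w)
    return -(X * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)

def gradient_descent(X, y, eta=0.1, iters=1000):
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w -= eta * gradient(w, X, y)  # w(t+1) = w(t) - eta * grad E_in(w(t))
    return w
```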

The Meaning

Stochastic Gradient Descent (SGD)

A variation of GD that considers only the error on one data point.

$$E_{\text{in}}(w) = \frac{1}{N} \sum_{n=1}^{N} \ln\!\left(1 + e^{-y_n w^T x_n}\right) = \frac{1}{N} \sum_{n=1}^{N} e(w, x_n, y_n)$$

$$w(t+1) \leftarrow w(t) - \eta\, \nabla_w\, e(w, x, y)$$

Logistic Regression:

$$w(t+1) \leftarrow w(t) + y\, x \left( \frac{\eta}{1 + e^{\,y\, w^T x}} \right)$$
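A minimal sketch of one SGD pass using this update rule, under the same assumptions (`eta` illustrative):

```python
import numpy as np

def sgd_epoch(w, X, y, eta=0.1):
    # One pass over the data, updating on a single (x_n, y_n) at a time
    for n in np.random.permutation(len(y)):
        x_n, y_n = X[n], y[n]
        # w <- w + y_n * x_n * eta / (1 + exp(y_n * w^T x_n))
        w = w + eta * y_n * x_n / (1.0 + np.exp(y_n * (w @ x_n)))
    return w
```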

Advantages:

  1. The 'average' move is the same as in GD;
  2. Computation: each step costs a fraction 1/N of a full-gradient step;
  3. Stochastic: the noise helps escape local minima;
  4. Simple.

Pasted image 20260127100026.png|400

Comparison

| Method | Updates Per Iteration | Convergence Speed | Stability | Computation Cost |
| --- | --- | --- | --- | --- |
| Batch GD | After full dataset | Slow | Stable | High |
| Stochastic GD | After each sample | Fast | Noisy (unstable) | Low |
| Mini-Batch GD | After each mini-batch | Moderate | Balanced | Moderate |
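For completeness, a minimal mini-batch variant under the same assumptions (`batch_size` illustrative):

```python
import numpy as np

def minibatch_gd_epoch(w, X, y, eta=0.1, batch_size=32):
    # One pass over the data, stepping on the gradient of each mini-batch
    order = np.random.permutation(len(y))
    for start in range(0, len(y), batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        margins = yb * (Xb @ w)
        grad = -(Xb * (yb / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
        w = w - eta * grad
    return w
```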