
My notes on Logistic Regression

posted Jun 9, 2014, 10:32 PM by Teng-Yok Lee   [ updated Jun 10, 2014, 8:59 AM ]
REF: http://www.holehouse.org/mlclass/06_Logistic_Regression.html

Because it took me a while to finally derive it, I decided to put the details here. Since I cannot typeset the equations nicely, I simplify the notation (e.g., t stands for theta).

Note 1: Gradient of the logistic cost function

Here the cost function is denoted as $F(t)$, where $t$ stands for theta, $m$ is the number of samples, and $(x^{(i)}, y^{(i)})$, $i = 1, \dots, m$, is the training set. $h_t(x) = 1/(1 + e^{-t^T x})$ is the logistic function with parameter $t$.

$F(t) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_t(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_t(x^{(i)})\right) \right]$
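As a quick sanity check of this definition, here is a minimal NumPy sketch of the cost $F(t)$. The function and variable names (sigmoid, logistic_cost, theta, X, y) are my own choices for this illustration, not part of the original notes.

import numpy as np

def sigmoid(z):
    # h_t(x) = 1 / (1 + exp(-t^T x)), applied element-wise
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y):
    # F(t) = -(1/m) sum_i [ y^(i) log h_t(x^(i)) + (1 - y^(i)) log(1 - h_t(x^(i))) ]
    h = sigmoid(X @ theta)            # h_t(x^(i)) for every row of X
    return -np.mean(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))

Here X is an m-by-n matrix whose rows are the samples x^(i), and y is the length-m vector of labels in {0, 1}.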


The partial derivative with respect to $t_j$ is denoted as $\partial/\partial t_j = \partial_j$. Then the partial derivative $\partial_j F(t)$ is derived as follows:

$\partial_j F(t)$
$= -\frac{1}{m} \sum_{i=1}^{m} \left[ \frac{y^{(i)}}{h_t(x^{(i)})} \partial_j h_t(x^{(i)}) + \frac{1 - y^{(i)}}{1 - h_t(x^{(i)})} \partial_j \left(1 - h_t(x^{(i)})\right) \right]$
$= -\frac{1}{m} \sum_{i=1}^{m} \left[ \frac{y^{(i)}}{h_t(x^{(i)})} \partial_j h_t(x^{(i)}) - \frac{1 - y^{(i)}}{1 - h_t(x^{(i)})} \partial_j h_t(x^{(i)}) \right]$  <-- The constant 1 drops out: $\partial_j \left(1 - h_t(x^{(i)})\right) = -\partial_j h_t(x^{(i)})$.
$= -\frac{1}{m} \sum_{i=1}^{m} \partial_j h_t(x^{(i)}) \left[ \frac{y^{(i)}}{h_t(x^{(i)})} - \frac{1 - y^{(i)}}{1 - h_t(x^{(i)})} \right]$  <-- Factor $\partial_j h_t(x^{(i)})$ out of both terms.
$= -\frac{1}{m} \sum_{i=1}^{m} \partial_j h_t(x^{(i)}) \frac{y^{(i)} \left(1 - h_t(x^{(i)})\right) - h_t(x^{(i)}) \left(1 - y^{(i)}\right)}{h_t(x^{(i)}) \left(1 - h_t(x^{(i)})\right)}$  <-- Put both fractions over a common denominator.
$= -\frac{1}{m} \sum_{i=1}^{m} \partial_j h_t(x^{(i)}) \frac{y^{(i)} - h_t(x^{(i)})}{h_t(x^{(i)}) \left(1 - h_t(x^{(i)})\right)}$


where
$\partial_j h_t(x^{(i)})$
$= \partial_j \left(1 + e^{-t^T x^{(i)}}\right)^{-1}$
$= -\left(1 + e^{-t^T x^{(i)}}\right)^{-2} \cdot e^{-t^T x^{(i)}} \cdot \left(-x_j^{(i)}\right)$  <-- The chain rule contributes the inner factor $\partial_j \left(-t^T x^{(i)}\right) = -x_j^{(i)}$.
$= \frac{1}{1 + e^{-t^T x^{(i)}}} \cdot \frac{e^{-t^T x^{(i)}}}{1 + e^{-t^T x^{(i)}}} \cdot x_j^{(i)}$
$= h_t(x^{(i)}) \left(1 - h_t(x^{(i)})\right) x_j^{(i)}$
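To convince myself of this identity, the short sketch below compares it against a centered finite difference at one point; the values of theta, x, j, and eps are made up purely for the illustration.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([0.5, -1.2, 0.3])   # arbitrary parameter vector t
x = np.array([1.0, 2.0, -0.5])       # arbitrary sample x
j, eps = 1, 1e-6                     # component to differentiate, step size

h = sigmoid(theta @ x)
analytic = h * (1.0 - h) * x[j]      # h_t(x) (1 - h_t(x)) x_j

theta_plus = theta.copy();  theta_plus[j] += eps
theta_minus = theta.copy(); theta_minus[j] -= eps
numeric = (sigmoid(theta_plus @ x) - sigmoid(theta_minus @ x)) / (2.0 * eps)

print(analytic, numeric)             # the two values agree to roughly 1e-10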


Thus

$\partial_j F(t)$
$= -\frac{1}{m} \sum_{i=1}^{m} \partial_j h_t(x^{(i)}) \frac{y^{(i)} - h_t(x^{(i)})}{h_t(x^{(i)}) \left(1 - h_t(x^{(i)})\right)}$
$= -\frac{1}{m} \sum_{i=1}^{m} \left(y^{(i)} - h_t(x^{(i)})\right) \frac{h_t(x^{(i)}) \left(1 - h_t(x^{(i)})\right) x_j^{(i)}}{h_t(x^{(i)}) \left(1 - h_t(x^{(i)})\right)}$
$= -\frac{1}{m} \sum_{i=1}^{m} \left(y^{(i)} - h_t(x^{(i)})\right) x_j^{(i)}$
$= \frac{1}{m} \sum_{i=1}^{m} \left(h_t(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$
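In matrix form this gradient is X^T (h - y) / m. The sketch below checks the formula against finite differences of the cost; the toy data, random seed, and function names are invented for the example and are not from the original notes.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y):
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))

def logistic_gradient(theta, X, y):
    # dF/dt_j = (1/m) sum_i (h_t(x^(i)) - y^(i)) x_j^(i), for all j at once
    h = sigmoid(X @ theta)
    return X.T @ (h - y) / y.size

# tiny made-up data set, only to compare the formula with finite differences
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = (rng.random(20) < 0.5).astype(float)
theta = rng.normal(size=3)

analytic = logistic_gradient(theta, X, y)
eps = 1e-6
numeric = np.array([(logistic_cost(theta + eps * e, X, y)
                     - logistic_cost(theta - eps * e, X, y)) / (2.0 * eps)
                    for e in np.eye(3)])
print(np.max(np.abs(analytic - numeric)))   # on the order of 1e-9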

Note 2: How is logistic regression related to MLE?

Actually, the logistic regression cost can be derived from the likelihood of a Bernoulli random variable. Here $P[x; t] = h_t(x)$ is the probability that $x$ belongs to class 1, i.e., that $y = 1$. Then the probability mass function of the label $y$ given $x$ is
$f(y \mid x; t) = P[x; t]^{y} \left(1 - P[x; t]\right)^{1 - y}$

and the log-likelihood of $t$ for a single sample is:
$L(t) = \log f(y \mid x; t) = y \log P[x; t] + (1 - y) \log\left(1 - P[x; t]\right) = y \log h_t(x) + (1 - y) \log\left(1 - h_t(x)\right)$
Summing this over all $m$ samples gives exactly $-m \, F(t)$. That's why minimizing the cost function $F$ is equivalent to maximizing the likelihood, i.e., finding the Maximum Likelihood Estimator.
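As a small numerical illustration of this equivalence, the sketch below compares the average Bernoulli log-likelihood (via scipy.stats.bernoulli) with $-F(t)$ on made-up data; the data and names are my own, not from the notes.

import numpy as np
from scipy.stats import bernoulli

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y):
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))

# made-up data: the average Bernoulli log-likelihood should equal -F(t)
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))
y = (rng.random(10) < 0.5).astype(float)
theta = np.array([0.7, -0.4])

p = sigmoid(X @ theta)                                  # P[y = 1 | x; t] = h_t(x)
avg_loglik = np.mean(bernoulli.logpmf(y.astype(int), p))  # (1/m) sum_i log f(y^(i) | x^(i); t)
print(avg_loglik, -logistic_cost(theta, X, y))          # the two numbers coincide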