REF: http://www.holehouse.org/mlclass/06_Logistic_Regression.html
Because it took me a while to finally derive this, I decided to write the details down here. Since I cannot typeset the equations nicely here, I simplify the notation.
Note 1: Gradient of the logistic cost function
Here the cost function is denoted $F(t)$, where $t$ stands for $\theta$ and $m$ is the number of samples. $(x^{(i)}, y^{(i)})$, $i = 1, \dots, m$, is the training set, and $h_t(x) = 1/(1 + \exp(-t^T x))$ is the logistic function with parameter $t$.
$$F(t) = -\frac{1}{m} \sum_{i=1}^{m} \Big[ y^{(i)} \log h_t(x^{(i)}) + \big(1 - y^{(i)}\big) \log\big(1 - h_t(x^{(i)})\big) \Big]$$
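To make the definitions concrete, here is a minimal NumPy sketch (my own illustration, not from the course notes); the rows of X are the samples $x^{(i)}$ and y is the 0/1 label vector:

```python
import numpy as np

def h(t, X):
    # Logistic function h_t(x) = 1 / (1 + exp(-t^T x)), applied to every row of X.
    return 1.0 / (1.0 + np.exp(-X @ t))

def cost(t, X, y):
    # F(t) = -(1/m) * sum_i [ y_i * log h_t(x_i) + (1 - y_i) * log(1 - h_t(x_i)) ]
    m = len(y)
    p = h(t, X)
    return -(y @ np.log(p) + (1 - y) @ np.log(1 - p)) / m
```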
The partial derivative w.r.t. $t_j$ is denoted $\partial/\partial t_j = \partial_j$. Then $\partial_j F(t)$ is derived as follows:

$$\partial_j F(t) = -\frac{1}{m} \sum_{i=1}^{m} \left[ \frac{y^{(i)}}{h_t(x^{(i)})}\,\partial_j h_t(x^{(i)}) + \frac{1 - y^{(i)}}{1 - h_t(x^{(i)})}\,\partial_j \big(1 - h_t(x^{(i)})\big) \right]$$

$$= -\frac{1}{m} \sum_{i=1}^{m} \left[ \frac{y^{(i)}}{h_t(x^{(i)})}\,\partial_j h_t(x^{(i)}) - \frac{1 - y^{(i)}}{1 - h_t(x^{(i)})}\,\partial_j h_t(x^{(i)}) \right]$$

(the constant 1 contributes nothing to the derivative, so $\partial_j(1 - h_t) = -\partial_j h_t$)

$$= -\frac{1}{m} \sum_{i=1}^{m} \partial_j h_t(x^{(i)}) \left[ \frac{y^{(i)}}{h_t(x^{(i)})} - \frac{1 - y^{(i)}}{1 - h_t(x^{(i)})} \right]$$

(factor $\partial_j h_t(x^{(i)})$ out of both terms)

$$= -\frac{1}{m} \sum_{i=1}^{m} \partial_j h_t(x^{(i)})\,\frac{y^{(i)}\big(1 - h_t(x^{(i)})\big) - h_t(x^{(i)})\big(1 - y^{(i)}\big)}{h_t(x^{(i)})\big(1 - h_t(x^{(i)})\big)}$$

$$= -\frac{1}{m} \sum_{i=1}^{m} \partial_j h_t(x^{(i)})\,\frac{y^{(i)} - h_t(x^{(i)})}{h_t(x^{(i)})\big(1 - h_t(x^{(i)})\big)}$$
where
$$\partial_j h_t(x^{(i)}) = \partial_j \big(1 + \exp(-t^T x^{(i)})\big)^{-1}$$

$$= \big(1 + \exp(-t^T x^{(i)})\big)^{-2} \exp(-t^T x^{(i)})\,x^{(i)}_j$$

(the two minus signs produced by the chain rule cancel)

$$= \frac{1}{1 + \exp(-t^T x^{(i)})} \cdot \frac{\exp(-t^T x^{(i)})}{1 + \exp(-t^T x^{(i)})}\,x^{(i)}_j = h_t(x^{(i)})\big(1 - h_t(x^{(i)})\big)\,x^{(i)}_j$$
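This identity is easy to check numerically; a one-off sketch reusing NumPy from the snippet above, with a made-up point $z = 0.7$:

```python
# Check sigma'(z) = sigma(z) * (1 - sigma(z)) for the scalar logistic function.
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
z, eps = 0.7, 1e-6
numeric_dz = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
print(np.allclose(numeric_dz, sigmoid(z) * (1 - sigmoid(z))))  # expect True
```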
Thus

$$\partial_j F(t) = -\frac{1}{m} \sum_{i=1}^{m} \partial_j h_t(x^{(i)})\,\frac{y^{(i)} - h_t(x^{(i)})}{h_t(x^{(i)})\big(1 - h_t(x^{(i)})\big)}$$

$$= -\frac{1}{m} \sum_{i=1}^{m} \big(y^{(i)} - h_t(x^{(i)})\big)\,\frac{h_t(x^{(i)})\big(1 - h_t(x^{(i)})\big)\,x^{(i)}_j}{h_t(x^{(i)})\big(1 - h_t(x^{(i)})\big)}$$

$$= \frac{1}{m} \sum_{i=1}^{m} \big(h_t(x^{(i)}) - y^{(i)}\big)\,x^{(i)}_j$$
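As a sanity check on this result, the following sketch (continuing the one above, with made-up data) compares the analytic gradient against a central finite difference of the cost:

```python
def grad(t, X, y):
    # d_j F(t) = (1/m) * sum_i (h_t(x_i) - y_i) * x_ij, computed for all j at once.
    return X.T @ (h(t, X) - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = rng.integers(0, 2, size=5).astype(float)
t = rng.normal(size=3)

eps = 1e-6
numeric = np.array([(cost(t + eps * e, X, y) - cost(t - eps * e, X, y)) / (2 * eps)
                    for e in np.eye(3)])
print(np.allclose(numeric, grad(t, X, y)))  # expect True
```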
Note 2: How is logistic regression related to MLE?

The logistic regression cost can be interpreted through the likelihood of a Bernoulli random variable. Here $P[x; t] = h_t(x)$ is the probability that $x$ belongs to class 1, so the probability of observing label $y$ for a sample $x$ is

$$f(x) = P[x; t]^{\,y}\,\big(1 - P[x; t]\big)^{1 - y}$$
and the log-likelihood of $t$ is

$$L(t) = \log f(x) = y \log P[x; t] + (1 - y) \log\big(1 - P[x; t]\big) = y \log h_t(x) + (1 - y) \log\big(1 - h_t(x)\big)$$
Since $F(t)$ is exactly the negative average of $L(t)$ over the training set, minimizing the cost function $F$ is equivalent to finding the maximum likelihood estimator.
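Continuing the numerical sketch above, one can check that $F(t)$ is exactly the negative mean of the per-sample Bernoulli log-likelihoods:

```python
# Per-sample Bernoulli log-likelihood log f(x_i) with p_i = h_t(x_i).
p = h(t, X)
log_lik = y * np.log(p) + (1 - y) * np.log(1 - p)
print(np.allclose(cost(t, X, y), -log_lik.mean()))  # expect True
```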
