REF: http://www.holehouse.org/mlclass/06_Logistic_Regression.html

Because it took me a while to derive this, I decided to put the details here. To keep the equations short, I simplify the notation.

## Note 1: Gradients of logistic cost functions

Here the cost function is denoted as $F(t)$, where $t$ stands for theta. $m$ is the number of samples. $(x^{(i)}, y^{(i)})$, $i = 1 \dots m$, is the training set. $h_t(x) = 1/(1 + \exp(-t^T x))$ is the logistic function with parameter theta ($t$).

$$
F(t) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_t(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_t(x^{(i)})\right) \right]
$$

The leading minus sign makes $F$ a cost to be minimized.

The partial derivative with respect to $t_j$ is denoted $\partial_j = \partial / \partial t_j$. Then the partial derivative of the cost, $\partial_j F(t)$, is derived as follows:

$$
\begin{aligned}
\partial_j F(t)
&= -\frac{1}{m} \sum_{i=1}^{m} \left[ \frac{y^{(i)}}{h_t(x^{(i)})} \, \partial_j h_t(x^{(i)}) + \frac{1 - y^{(i)}}{1 - h_t(x^{(i)})} \, \partial_j \left(1 - h_t(x^{(i)})\right) \right] \\
&= -\frac{1}{m} \sum_{i=1}^{m} \left[ \frac{y^{(i)}}{h_t(x^{(i)})} \, \partial_j h_t(x^{(i)}) - \frac{1 - y^{(i)}}{1 - h_t(x^{(i)})} \, \partial_j h_t(x^{(i)}) \right] \qquad \text{(the derivative of the constant 1 vanishes)} \\
&= -\frac{1}{m} \sum_{i=1}^{m} \partial_j h_t(x^{(i)}) \left[ \frac{y^{(i)}}{h_t(x^{(i)})} - \frac{1 - y^{(i)}}{1 - h_t(x^{(i)})} \right] \qquad \text{(factor } \partial_j h_t(x^{(i)}) \text{ out of both terms)} \\
&= -\frac{1}{m} \sum_{i=1}^{m} \partial_j h_t(x^{(i)}) \, \frac{y^{(i)} \left(1 - h_t(x^{(i)})\right) - h_t(x^{(i)}) \left(1 - y^{(i)}\right)}{h_t(x^{(i)}) \left(1 - h_t(x^{(i)})\right)} \\
&= -\frac{1}{m} \sum_{i=1}^{m} \partial_j h_t(x^{(i)}) \, \frac{y^{(i)} - h_t(x^{(i)})}{h_t(x^{(i)}) \left(1 - h_t(x^{(i)})\right)}
\end{aligned}
$$

where, by the chain rule,

$$
\begin{aligned}
\partial_j h_t(x^{(i)})
&= \partial_j \left(1 + \exp(-t^T x^{(i)})\right)^{-1} \\
&= -\left(1 + \exp(-t^T x^{(i)})\right)^{-2} \exp(-t^T x^{(i)}) \left(-x^{(i)}_j\right) \\
&= \frac{1}{1 + \exp(-t^T x^{(i)})} \cdot \frac{\exp(-t^T x^{(i)})}{1 + \exp(-t^T x^{(i)})} \, x^{(i)}_j \\
&= h_t(x^{(i)}) \left(1 - h_t(x^{(i)})\right) x^{(i)}_j
\end{aligned}
$$

Thus

$$
\begin{aligned}
\partial_j F(t)
&= -\frac{1}{m} \sum_{i=1}^{m} \partial_j h_t(x^{(i)}) \, \frac{y^{(i)} - h_t(x^{(i)})}{h_t(x^{(i)}) \left(1 - h_t(x^{(i)})\right)} \\
&= -\frac{1}{m} \sum_{i=1}^{m} \left(y^{(i)} - h_t(x^{(i)})\right) \frac{h_t(x^{(i)}) \left(1 - h_t(x^{(i)})\right) x^{(i)}_j}{h_t(x^{(i)}) \left(1 - h_t(x^{(i)})\right)} \\
&= \frac{1}{m} \sum_{i=1}^{m} \left(h_t(x^{(i)}) - y^{(i)}\right) x^{(i)}_j
\end{aligned}
$$
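A quick way to gain confidence in the final formula is to compare it against a finite-difference approximation. The sketch below is only an illustration, not part of the referenced course material: it assumes the cost is the negative average log-likelihood (with the leading minus sign), and the toy data, the parameter values, and the helper names `cost`, `gradient`, and `numeric_gradient` are made up for this check.

```python
import math


def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def h(t, x):
    """h_t(x) = sigmoid(t^T x)."""
    return sigmoid(sum(tk * xk for tk, xk in zip(t, x)))


def cost(t, X, Y):
    """Negative average log-likelihood (assumed form of F)."""
    total = 0.0
    for x, y in zip(X, Y):
        p = h(t, x)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(X)


def gradient(t, X, Y):
    """Analytic gradient: d_j F = 1/m * sum_i (h_t(x_i) - y_i) * x_ij."""
    grad = [0.0] * len(t)
    for x, y in zip(X, Y):
        err = h(t, x) - y
        for j, xj in enumerate(x):
            grad[j] += err * xj
    return [g / len(X) for g in grad]


def numeric_gradient(t, X, Y, eps=1e-6):
    """Central finite differences of the cost, one coordinate at a time."""
    grad = []
    for j in range(len(t)):
        up = list(t)
        down = list(t)
        up[j] += eps
        down[j] -= eps
        grad.append((cost(up, X, Y) - cost(down, X, Y)) / (2 * eps))
    return grad


# Hypothetical toy data; the first feature is a constant 1 (bias term).
X = [[1.0, 0.5, -1.2], [1.0, -0.3, 0.8], [1.0, 1.5, 0.1], [1.0, -2.0, -0.4]]
Y = [1, 0, 1, 0]
t = [0.1, -0.2, 0.3]

analytic = gradient(t, X, Y)
numeric = numeric_gradient(t, X, Y)
max_diff = max(abs(a - n) for a, n in zip(analytic, numeric))
print("largest difference between analytic and numeric gradient:", max_diff)
```

If the derivation is right, the printed difference should be tiny compared with the gradient entries themselves.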

## Note 2: How is logistic regression related to MLE?

The logistic regression cost can be derived from the likelihood of a Bernoulli random variable. Here $P[x; t] = h_t(x)$ is the probability that $x$ belongs to class 1, i.e. that $y = 1$. Then the probability mass function of the label $y$ given $x$ is

$$
f(y \mid x) = P[x; t]^{\,y} \left(1 - P[x; t]\right)^{1 - y}
$$

and the log-likelihood of theta ($t$) is:

$$
L(t) = \log f(y \mid x) = y \log P[x; t] + (1 - y) \log\left(1 - P[x; t]\right) = y \log h_t(x) + (1 - y) \log\left(1 - h_t(x)\right)
$$

Summing $L(t)$ over the $m$ training samples (assumed independent) gives, up to a constant factor, the sum in $F(t)$. That's why optimizing the cost function $F$ is equivalent to finding the maximum likelihood estimator.
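The link to maximum likelihood can also be checked numerically. The following sketch assumes the cost is the negative average log-likelihood and that the samples are independent, so the likelihood of the training set is a product of Bernoulli probabilities. The data, the step size `alpha`, the number of steps, and all function names are hypothetical choices for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(t, x):
    """h_t(x), read as P[y = 1 | x; t]."""
    return sigmoid(sum(tk * xk for tk, xk in zip(t, x)))


def likelihood(t, X, Y):
    """Product over samples of P^y * (1 - P)^(1 - y)."""
    prod = 1.0
    for x, y in zip(X, Y):
        p = predict(t, x)
        prod *= p ** y * (1 - p) ** (1 - y)
    return prod


def cost(t, X, Y):
    """Negative average log-likelihood (assumed form of F)."""
    total = 0.0
    for x, y in zip(X, Y):
        p = predict(t, x)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(X)


def gradient(t, X, Y):
    """1/m * sum_i (h_t(x_i) - y_i) * x_ij for every j."""
    grad = [0.0] * len(t)
    for x, y in zip(X, Y):
        err = predict(t, x) - y
        for j, xj in enumerate(x):
            grad[j] += err * xj
    return [g / len(X) for g in grad]


# Hypothetical, non-separable toy data: [bias, feature] per row.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
Y = [0, 0, 1, 0, 1, 1]

t_start = [0.0, 0.0]
cost_from_sum = cost(t_start, X, Y)
cost_from_product = -math.log(likelihood(t_start, X, Y)) / len(X)

# Plain gradient descent on the cost with a small, fixed step size.
alpha = 0.1
t_end = list(t_start)
for _ in range(100):
    g = gradient(t_end, X, Y)
    t_end = [tk - alpha * gk for tk, gk in zip(t_end, g)]

print("cost via sum of logs:      ", cost_from_sum)
print("cost via log of product:   ", cost_from_product)
print("likelihood before descent: ", likelihood(t_start, X, Y))
print("likelihood after descent:  ", likelihood(t_end, X, Y))
```

The two cost values should agree up to rounding, and lowering the cost by gradient descent should raise the likelihood of the training set, which is the sense in which the two problems coincide.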