At my heart, I'm always gonna be a functional analysis/measure theory girlie...
And using measure theory notation is not only more aesthetic, it actually lends itself to a sort of conceptual unification.
For instance, you might see people defining Cross Entropy, $H(P,Q)$, as

$$H(P,Q) = -\sum_{x} p(x) \log(q(x))$$

But this representation hides the purpose of the computation (to me, at least). Using an alternative notation:
$$H(P,Q) = -\int_{x} \log(q(x))\, p(x)\, \mathrm{d}x = -\int_{x} \log(q(x))\, \mathrm{d}P(x) = -\mathbb{E}_{P}[\log(Q)]$$

Cross-entropy is (up to a sign) the expected value of $\log(Q)$ under the probability distribution $P$.
Or, for the Information Theory nerds out there: the average number of bits you'd use to encode a "word" drawn from P when your code is optimized for Q (taking logs base 2).
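To make the "expected value" reading concrete, here is a minimal Python sketch. The function name `cross_entropy_bits` and the two three-symbol distributions are made up for illustration, and logs are taken base 2 so the result is in bits.

```python
import math

def cross_entropy_bits(p, q):
    """H(P, Q) = -E_P[log2 Q] for discrete distributions given as lists."""
    # Outcomes with p(x) = 0 contribute nothing to the expectation, so skip them.
    return -sum(px * math.log2(qx) for px, qx in zip(p, q) if px > 0)

# Hypothetical distributions over a three-symbol alphabet.
p = [0.5, 0.25, 0.25]  # P: the distribution that actually generates the symbols
q = [0.25, 0.25, 0.5]  # Q: the distribution the code was designed for

h_pq = cross_entropy_bits(p, q)  # average bits per symbol when coding with Q
h_pp = cross_entropy_bits(p, p)  # entropy of P: the cost when Q matches P
print(h_pq, h_pp)
```

With these particular numbers the cross entropy is 1.75 bits against an entropy of 1.5 bits, so coding with the wrong distribution costs an extra quarter of a bit per symbol on average.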
The reason I'm currently thinking about $H(P,Q)$, however, is its pervasiveness across machine learning tutorials. If you look at the standard "deep-learning" tutorials, you'll see that most neural networks are trained with the Cross Entropy Loss. A natural question would be, "why?" In particular, why Cross Entropy and not something else? Well, it turns out:
For classification problems, minimizing the Cross Entropy is equivalent to maximizing the log-likelihood (that is, minimizing the negative log-likelihood).
We'll start with the likelihood of our estimates and work our way back to H.
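- Let $\hat{y} = (\hat{y}_1, \ldots, \hat{y}_n)$ be our model's estimates for our batch of data $D = (x, y)$, such that

$$\hat{y}_{i,j} = \hat{P}(y_i = j) \quad \text{and} \quad y_{i,j} = \begin{cases} 1 & \text{if } y_i = j \\ 0 & \text{if } y_i \neq j \end{cases}$$

then

$$L(\hat{y}; D) = \prod_{i=1}^{n} \prod_{j=1}^{k} \hat{y}_{i,j}^{\,y_{i,j}}$$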
- logs convert products to sums, hence:

$$\ell(\hat{y}; D) = \log(L(\hat{y}; D)) = \sum_{i=1}^{n} \sum_{j=1}^{k} \log\left(\hat{y}_{i,j}^{\,y_{i,j}}\right) = \sum_{i=1}^{n} \sum_{j=1}^{k} y_{i,j} \log(\hat{y}_{i,j})$$
- But $y_{i,j}$ is just an indicator function, and more importantly:

$$P(y_i = j) = \begin{cases} 1 & \text{if } y_i = j \\ 0 & \text{if } y_i \neq j \end{cases} = y_{i,j}$$
- So, if you'd excuse the abuse of notation, we can collapse the inner sum:

$$\ell(\hat{y}; D) = \sum_{i=1}^{n} \log(\hat{y}_{i,y_i}) = -H(y, \hat{y})$$
- And since $\hat{y}_{i,j} \in (0,1)$, $-\log(\hat{y}_{i,j}) > 0$; thus, maximizing the log-likelihood (that is, minimizing the negative log-likelihood) is equivalent to minimizing the Cross Entropy. ∎
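If you'd rather see the derivation checked numerically, the sketch below walks through the same steps on a tiny batch. The predictions in `y_hat` and the `labels` are invented for illustration (3 samples, 3 classes), and the cross entropy here is summed over the batch, matching the abuse of notation above.

```python
import math

# Hypothetical batch: n = 3 samples, k = 3 classes.
# y_hat[i][j] is the model's estimated probability that sample i has class j.
y_hat = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
]
labels = [0, 1, 2]  # true class y_i for each sample

n, k = len(y_hat), len(y_hat[0])

# One-hot targets: y[i][j] = 1 if y_i == j, else 0.
y = [[1 if labels[i] == j else 0 for j in range(k)] for i in range(n)]

# Likelihood as the double product of y_hat[i][j] ** y[i][j].
likelihood = math.prod(y_hat[i][j] ** y[i][j] for i in range(n) for j in range(k))
log_likelihood = math.log(likelihood)

# Collapsed form: only the true-class probability survives the inner sum.
collapsed = sum(math.log(y_hat[i][labels[i]]) for i in range(n))

# Cross entropy between one-hot targets and predictions, summed over the batch.
cross_entropy = -sum(
    y[i][j] * math.log(y_hat[i][j]) for i in range(n) for j in range(k)
)

# All three printed values should agree up to floating-point error.
print(log_likelihood, collapsed, -cross_entropy)
```

The three printed values agree up to floating-point error, and the cross entropy is positive. So anything that pushes the cross entropy down pushes the log-likelihood up by the same amount.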