Cross Entropy Loss

Some quick notes about Cross Entropy as a training loss function in classification...

At my heart, I'm always gonna be a functional analysis/measure theory girlie...

And, using measure theory notation is not only more aesthetic, it actually lends itself to a sort of conceptual unification.

For instance, you might see people defining Cross Entropy, $H(P,Q)$, as

$$H(P,Q) = -\sum_{x} p(x) \log(q(x))$$

But, this representation hides the purpose of the computation (to me, at least). Using an alternative notation:

$$H(P,Q) = - \int_{x} \log(q(x)) \, p(x) \, dx = -\int_{x} \log(q(x)) \, dp(x) = -\mathbb{E}_{P}[\log(Q)]$$

Cross-entropy is (up to a sign) the expected value of $\log(Q)$ under the probability distribution $P$.
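
To make the "expected value" reading concrete, here's a minimal sketch that computes the sum directly and then estimates the same quantity by sampling from $P$ and averaging $-\log(q(x))$. The distributions `p` and `q`, the function names, and the sample count are all made up for illustration; logs are natural, so the result is in nats.

```python
import math
import random

def cross_entropy(p, q):
    """H(P, Q) = -sum_x p(x) * log(q(x)), in nats; terms with p(x) == 0 are skipped."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

def cross_entropy_monte_carlo(p, q, num_samples=100_000, seed=0):
    """Estimate -E_P[log q(X)] by drawing X ~ P and averaging -log q(X)."""
    rng = random.Random(seed)
    samples = rng.choices(range(len(p)), weights=p, k=num_samples)
    return -sum(math.log(q[x]) for x in samples) / num_samples

# Hypothetical distributions over three outcomes.
p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]

exact = cross_entropy(p, q)
estimate = cross_entropy_monte_carlo(p, q)
print(exact, estimate)
```

The two numbers should agree up to sampling noise, which is the whole point: the sum over $x$ is just an expectation under $P$ in disguise.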

Or, for the Information Theory nerds out there: the average number of bits you'd use to encode a "word" with a code built for $Q$, assuming that $P$ is the generating distribution (and that the log is base 2).[1]
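
Here's a small sketch of that bits interpretation, assuming base-2 logs and an idealized code that spends $-\log_2(q(x))$ bits on symbol $x$. The three-symbol alphabet and both distributions are hypothetical, picked so the code lengths come out as whole numbers.

```python
import math

def cross_entropy_bits(p, q):
    """H(P, Q) in bits: the average code length when symbols are drawn from P
    but each symbol x is given -log2(q(x)) bits, as if Q were the source."""
    return -sum(px * math.log2(qx) for px, qx in zip(p, q) if px > 0)

# Hypothetical three-symbol alphabet.
p = [0.5, 0.25, 0.25]  # distribution that actually generates the symbols
q = [0.25, 0.25, 0.5]  # distribution the code was designed for

code_lengths = [-math.log2(qx) for qx in q]  # 2, 2 and 1 bits
matched = cross_entropy_bits(p, p)     # code designed for P itself (the entropy of P)
mismatched = cross_entropy_bits(p, q)  # code designed for Q, symbols drawn from P

print(code_lengths, matched, mismatched)
```

With these numbers the matched code averages 1.5 bits per symbol and the mismatched one 1.75, so coding for the wrong distribution costs an extra quarter bit per symbol.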

The reason I'm currently thinking about $H(P,Q)$, however, is its pervasiveness across machine learning tutorials. If you look at the standard "deep-learning" tutorials, you'll see that most neural networks are trained with the Cross Entropy Loss. A natural question is "why?" In particular, why Cross Entropy and not something else? Well, it turns out:

For classification problems, minimizing the Cross Entropy is equivalent to maximizing the log-likelihood (that is, minimizing the negative log-likelihood).

We'll start with the likelihood of our estimates and work our way back to $H$.

  1. Let $\hat{\mathbf{y}} = (\hat{y}_1, \ldots, \hat{y}_n)$ be our model's estimates for our batch of data $D = (\mathbf{x}, \mathbf{y})$, with $n$ samples and $k$ classes, such that
    $$\hat{y}_{i,j} = \hat{P}(y_i = j) \; \text{ and } \; y_{i,j} = \begin{cases} 1 &\text{if } y_i = j \\ 0 &\text{if } y_i \neq j \end{cases}$$
    then
$$\mathcal{L}(\hat{\mathbf{y}}; D) = \prod_{i=1}^{n} \prod_{j=1}^{k} \hat{y}_{i,j}^{y_{i,j}}$$
  1. $\log$s convert products to sums, hence:
$$\begin{align*} \ell(\hat{\mathbf{y}}; D) &= \log(\mathcal{L}(\hat{\mathbf{y}}; D)) \\ &= \sum_{i=1}^{n} \sum_{j=1}^{k} \log(\hat{y}_{i,j}^{y_{i,j}}) \\ &= \sum_{i=1}^{n} \sum_{j=1}^{k} y_{i,j} \log(\hat{y}_{i,j}) \end{align*}$$
  1. But $y_{i,j}$ is just an indicator function, and more importantly:
$$\begin{align*} P(y_i = j) &= \begin{cases} 1 &\text{if } y_i = j \\ 0 &\text{if } y_i \neq j \end{cases} \\ &= y_{i,j} \end{align*}$$
  1. So, if you'll excuse the abuse of notation (writing $H(\mathbf{y},\hat{\mathbf{y}})$ for the sum of the per-sample cross entropies over the batch), we can collapse the inner sum:
$$\ell(\hat{\mathbf{y}}; D) = \sum_{i=1}^{n} \log(\hat{y}_{i,y_{i}}) = -H(\mathbf{y},\hat{\mathbf{y}})$$
  1. And since $\hat{y}_{i,j} \in (0,1)$, $-\log(\hat{y}_{i,j}) > 0$; thus, maximizing the log-likelihood (i.e., minimizing the negative log-likelihood) is equivalent to minimizing the Cross Entropy. $\blacksquare$
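
The steps above can be checked numerically. This is a minimal sketch on a made-up batch ($n = 3$ samples, $k = 3$ classes); the predictions `y_hat`, the `labels`, and the helper name `cross_entropy` are all assumptions for illustration, not anything from a particular library.

```python
import math

def cross_entropy(p, q):
    """H(P, Q) = -sum_j p_j * log(q_j), in nats; terms with p_j == 0 are skipped."""
    return -sum(pj * math.log(qj) for pj, qj in zip(p, q) if pj > 0)

# Hypothetical model outputs: row i holds the predicted class probabilities for sample i.
y_hat = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
]
labels = [0, 1, 2]  # true class y_i of each sample

n, k = len(y_hat), len(y_hat[0])

# One-hot indicators: y[i][j] = 1 if y_i == j else 0.
y = [[1 if labels[i] == j else 0 for j in range(k)] for i in range(n)]

# Step 1: the likelihood as a double product.
likelihood = math.prod(y_hat[i][j] ** y[i][j] for i in range(n) for j in range(k))

# Step 2: the log-likelihood as a double sum.
log_likelihood = sum(y[i][j] * math.log(y_hat[i][j]) for i in range(n) for j in range(k))

# Step 4: the inner sum collapses to the log-probability of the true class.
collapsed = sum(math.log(y_hat[i][labels[i]]) for i in range(n))

# Cross entropy between each one-hot row and its prediction, summed over the batch.
total_cross_entropy = sum(cross_entropy(y[i], y_hat[i]) for i in range(n))

print(math.log(likelihood), log_likelihood, collapsed, -total_cross_entropy)
```

All four printed values should match up to floating-point error, which is the derivation in miniature: the log-likelihood is the negative of the (batch-summed) cross entropy.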

  1. I'm not really well versed in Information Theory, so I could be wrong here.