Cross Entropy Loss

25 Mar 2026

Some quick notes about Cross Entropy as a training loss function in classification...

At my heart, I'm always gonna be a functional analysis/measure theory girlie...

And using measure-theoretic notation isn't only more aesthetic; it actually lends itself to a sort of conceptual unification.

For instance, you might see people defining Cross Entropy, $H(P,Q)$, as

H(P,Q) = -\sum_{x} p(x) \log(q(x))

But, this representation hides the purpose of the computation (to me, at least). Using an alternative notation:

H(P,Q) = -\int_{x} \log(q(x)) \, p(x) \, dx = -\int_{x} \log(q(x)) \, dP(x) = -\mathbb{E}_{P}[\log(Q)]

Cross-entropy is, up to a sign, the expected value of $\log(Q)$ under the probability distribution $P$.

Or, for the Information Theory nerds out there: the average number of bits you'd need to encode a "word" drawn from $P$ when using a code optimized for $Q$ (taking $\log$ base 2).^[1]
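To make the "expected value" reading concrete, here's a minimal sketch in plain Python. The function name `cross_entropy` and the two toy distributions `p` and `q` are made up for illustration; nothing here is a library API.

```python
import math

def cross_entropy(p, q, base=math.e):
    """H(P, Q) = -E_P[log q(X)] for discrete distributions given as lists."""
    # Outcomes with p(x) = 0 contribute nothing to the expectation, so skip them.
    return -sum(px * math.log(qx, base) for px, qx in zip(p, q) if px > 0)

# Hypothetical distributions over three outcomes (assumed for illustration).
p = [0.5, 0.25, 0.25]   # the generating distribution P
q = [0.25, 0.25, 0.5]   # the distribution Q we evaluate log(q) on

h_pq = cross_entropy(p, q, base=2)  # average bits when the code is built for Q
h_pp = cross_entropy(p, p, base=2)  # entropy of P: the code is built for P itself

print(h_pq, h_pp)
```

With these numbers, $H(P,Q)$ comes out to about 1.75 bits and $H(P,P)$ to about 1.5 bits, so encoding with the "wrong" distribution costs roughly a quarter of a bit per word.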

The reason I'm currently thinking about $H(P,Q)$, however, is its pervasiveness across machine learning tutorials. If you look at the standard "deep-learning" tutorials, you'll see that most neural networks are trained with the Cross Entropy Loss. A natural question would be, "why?" In particular, why Cross Entropy and not something else? Well, it turns out:

For classification problems, minimizing the Cross Entropy is equivalent to maximizing the log-likelihood (that is, minimizing the negative log-likelihood).

We'll start with the likelihood of our estimates and work our way back to $H$.

Let $\hat{\mathbf{y}} = (\hat{y}_1, \ldots, \hat{y}_n)$ be our model's estimates for our batch of data $D = (\mathbf{x}, \mathbf{y})$, with $n$ samples and $k$ classes, such that
$\hat{y}_{i,j} = \hat{P}(y_i = j) \; \text{ and } \; y_{i,j} = \begin{cases} 1 &\text{if } y_i = j \\ 0 &\text{if } y_i \neq j \end{cases}$
Then the likelihood is

\mathcal{L}(\hat{\mathbf{y}}; D) = \prod_{i=1}^{n} \prod_{j=1}^{k} \hat{y}_{i,j}^{y_{i,j}}

$\log$s convert products to sums, hence:

\begin{align*} \ell(\hat{\mathbf{y}}; D) &= \log(\mathcal{L}(\hat{\mathbf{y}}; D)) \\ &= \sum_{i=1}^{n} \sum_{j=1}^{k} \log(\hat{y}_{i,j}^{y_{i,j}}) \\ &= \sum_{i=1}^{n} \sum_{j=1}^{k} y_{i,j} \log(\hat{y}_{i,j}) \end{align*}
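Here's a small sketch of that step, checking numerically that the log of the product form matches the double sum. The batch below (`y_hat`, `labels`) and the helper names are invented for illustration, with $n = 3$ samples and $k = 3$ classes.

```python
import math

# Hypothetical batch (assumed for illustration): row i holds the model's
# estimated class probabilities y_hat[i][j] for sample i.
y_hat = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
]
labels = [0, 1, 2]  # true class y_i for each sample

def one_hot(label, k):
    """Indicator vector: 1 at the true class, 0 elsewhere."""
    return [1 if j == label else 0 for j in range(k)]

y = [one_hot(label, 3) for label in labels]

def likelihood(y_hat, y):
    """Product over samples i and classes j of y_hat[i][j] ** y[i][j]."""
    result = 1.0
    for probs, targets in zip(y_hat, y):
        for p, t in zip(probs, targets):
            result *= p ** t
    return result

def log_likelihood(y_hat, y):
    """Double sum over i and j of y[i][j] * log(y_hat[i][j])."""
    return sum(
        t * math.log(p)
        for probs, targets in zip(y_hat, y)
        for p, t in zip(probs, targets)
    )

lik = likelihood(y_hat, y)
ell = log_likelihood(y_hat, y)
print(math.log(lik), ell)  # the two values agree up to floating-point error
```

In practice you'd want the sum form anyway: a product of many probabilities underflows quickly, while a sum of logs stays well behaved.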

But $y_{i,j}$ is just an indicator function and, more importantly, it is also the (degenerate) true distribution of the label $y_i$:

\begin{align*} P(y_i = j) &= \begin{cases} 1 &\text{if } y_i = j \\ 0 &\text{if } y_i \neq j \end{cases} \\ &= y_{i,j} \end{align*}

So, if you'll excuse the abuse of notation, only the $j = y_i$ term of each inner sum survives, and we can collapse it:

\ell(\hat{\mathbf{y}}; D) = \sum_{i=1}^{n} \log(\hat{y}_{i,y_{i}}) = -H(\mathbf{y},\hat{\mathbf{y}})

And since $\hat{y}_{i,j} \in (0,1)$, $-\log(\hat{y}_{i,j}) > 0$; thus, maximizing the log-likelihood (that is, minimizing the negative log-likelihood) is equivalent to minimizing the Cross Entropy. $\blacksquare$
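As a numeric sanity check of the identity $\ell = -H$, here's a sketch comparing the collapsed log-likelihood with the one-hot cross entropy on two made-up sets of predictions. The names `confident` and `hesitant` and all the numbers are assumptions for illustration only.

```python
import math

def log_likelihood(y_hat, labels):
    """Collapsed form: sum over samples of log(y_hat[i][y_i])."""
    return sum(math.log(probs[label]) for probs, label in zip(y_hat, labels))

def cross_entropy(y_hat, labels):
    """One-hot form: -sum_i sum_j y[i][j] * log(y_hat[i][j])."""
    total = 0.0
    for probs, label in zip(y_hat, labels):
        for j, p in enumerate(probs):
            y_ij = 1.0 if j == label else 0.0  # indicator of the true class
            total -= y_ij * math.log(p)
    return total

labels = [0, 1, 2]

# Two hypothetical models: one puts most mass on the true class, one barely does.
confident = [[0.9, 0.05, 0.05], [0.05, 0.9, 0.05], [0.05, 0.05, 0.9]]
hesitant = [[0.4, 0.3, 0.3], [0.3, 0.4, 0.3], [0.3, 0.3, 0.4]]

for name, preds in [("confident", confident), ("hesitant", hesitant)]:
    print(name, log_likelihood(preds, labels), cross_entropy(preds, labels))
```

For each model the two printed numbers are negatives of each other, and the model with the higher log-likelihood is exactly the one with the lower cross entropy.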

[1] I'm not really well versed in Information Theory, so I could be wrong here.