Tensor Calculus is Weird

24 Apr 2026

Even in R^n calculus with tensors gets weird... fast.

As I work my way through the ARENA prereqs, I find myself constantly falling down mathematical rabbit holes. Linear algebra and calculus are consistently snatching my attention, and what could be more powerful than the combination of the two. In particular, derivatives of tensors!

In the beginning, there were scalars, and it was good.

If $x \in \R$ and $f : \R \to \R$ , everything is pretty simple: (assuming it exists)

\frac{df}{dx} : \R \to \R

If we step things up just a notch and let $x \in \R^n$ but still $f$ is scalar-valued: (assuming they exist)

\nabla f : \R^n \to \R^n \; \text{s.t.} \; \nabla f_i = \frac{\partial f}{\partial x_i} : \R^n \to \R

Then there were matrices/tensors in $\R$ , and it was... okay

If $x \in \R^n$ and $f: \R^n \to \R^m$ , we start to get tedious:

D_{x}f = \frac{df}{dx} \in \R^{m \times n} \; \text{s.t.} \; D_{x}f[i,j] = \frac{\partial f_i}{\partial x_j}

or you can think of each row of $D_{x}f$ as $\nabla f_i$ .

But what happens when $x \in \R^{n\times m}$ and $f : \R^{n\times m} \to \R^{o}$ ?

D_{x}f = \frac{df}{dx} \in \R^{o \times m \times n} \; \text{s.t.} \; D_{x}f[i,j,k] = \frac{\partial f_i}{\partial x_{j,k}}

You can probably see the pattern, here, if $f : \R^{n\times m} \to \R^{o \times p}$ :

\R^{o\times p \times n \times m} \ni D_x f [i,j,k,l] = \frac{\partial f_{i,j}}{\partial x_{k,l}}

Note -- and I'm not an expert here -- I'm informed that this simplicity works because $\R^{n}$ is self-dual and the general principal at play here involves the tensor product $\otimes$ . That is, if $x \in A$ and $y \in B$ , then

\frac{dx}{dy} \in A \otimes B^*

where $B^*$ is the dual-space of B (i.e. all $f : B \to \R$ ). But, with real numbers $\R^{n}$ is isomorphic to its dual, and thus when $A = \R^n$ and $B = \R^m$

A \otimes B^* = \R^n \otimes (\R^m)^* \cong \R^n \otimes \R^m \cong \R^{n\times m}

Finally, there was broadcasting and it got weird.

In machine learning, in particular deep learning, we use gradient methods to minimize a loss function, $L$ , subject to an input $x$ . $L$ is almost always scalar-valued (yay!), but $x$ is almost always matrix-valued (d'oh!). In particular, there are very natural situations where the shape of $x$ needs to be manipulated before it can be fed forward to a proceeding layer.

For compatibility, matrix libraries like numpy and pytorch will pad or expand a matrix/tensor to enable an operation to occur. For instance, if you wanted to add a vector of ones with a matrix, you'd need to "broadcast" the ones

x = np.ones(2) # a vector in R^2 [1,1]
y = np.zeros((2,2)) # a matrix in  R^(2x2) [[0,0], [0,0]]
z = x + y # [[1,1], [1,1]]

What if you wanted to take the derivative of z with respect to x?

Your regular calculus intuition would say

\frac{dz}{dx} = \frac{d}{dx}(x + y) = 1

But that would be wrong, and you can sanity check yourself just considering the shapes of these objects:

$x \in \R^2$ and $z \in \R^{2\times 2}$ ,
hence, the above says that $dz/dx \in \R^{2 \times 2 \times 2}$
which means that $dz/dx = 1 \in \R$ doesn't work

So we need to consider an intermediate step: x_b, the (2x2) broadcasted version of x. Really,

z = x_b + y

which means that by the chain-rule:

\frac{dz}{dx} = \frac{dz}{dx_b} \frac{dx_b}{dx}

But then, what are the shapes of these objects? And do these also make sense?

$dz/dx_b \in \R^{2\times 2\times 2\times 2}$ , and $dx_b/dx \in \R^{2 \times 2 \times 2}$

Which means that if you think of this from just a matrix multiplication (tensor contraction) perspective, the product should look like^[1]

\frac{dz}{dx}[i,j,l] = \sum_{p,q} \frac{dz}{dx_b}[i,j,p,q] \frac{dx_b}{dx}[p,q,l]

but $x_b[p,q] = x[q]$ (in this case because we're just copying our $x$ along the first dimension), so

\frac{dx_b}{dx}[p,q,l] = \frac{\partial}{\partial x_l}x_b[p,q] = \frac{\partial}{\partial x_l}x[q] = \begin{cases} 1 \; \text{if} \; q = l \\ 0 \; \text{o.w.} \end{cases}

(You can think of the derivative of the broadcast as a bunch of copies of the identity matrix across the dimension that was copied) Thus,

\frac{dz}{dx}[i,j,l] = \sum_{p} \sum_{q=l} \frac{dz}{dx_b}[i,j,p,q] \frac{dx_b}{dx}[p,q,l] = \sum_{p} \frac{dz}{dx_b}[i,j,p,l]

Which should at least feel "okay": this says that $dz/dx \in \R^{2 \times 2\times 2}$ and (perhaps less intuitively) that $dz/dx$ is a sum of the derivatives across the broadcasted dimension.

Looking at the magic numbers in our example, you'll see we could've replaced $x \in \R^2$ and $y \in \R^{2 \times 2}$ with any random set of broadcastable tensors (e.g. $x \in \R^{20 \times 1}$ and $y \in \R^{10 \times 20 \times 30}$ ) and you'd still find that broadcasted $x_b$ has Identity matrices sprinkled throughout its Jacobian.

While the result is still "sum across the broadcasted dimensions", it still feels weird.

if you're cool with Einstein notation, we could save a touch of ink and write $\frac{dz}{dx}_{i,j,l} = \frac{dz}{dx_b}_{i,j,p,q} \frac{dx_b}{dx}_{p,q,l}$ but the reality is that with matrix multiplication, when the matrices have shared trailing/leading indices, multiplication contracts on those indices. Hence why $A \in \R^{m \times n}$ and $B \in \R^{n \times p}$ yields $AB = C \in \R^{m \times p}$ . ↩︎

In the beginning, there were scalars, and it was good.

Then there were matrices/tensors in R\RR, and it was... okay

Finally, there was broadcasting and it got weird.

Then there were matrices/tensors in $\R$ , and it was... okay