Tensor Calculus is Weird

Even in R^n calculus with tensors gets weird... fast.

As I work my way through the ARENA prereqs, I find myself constantly falling down mathematical rabbit holes. Linear algebra and calculus are consistently snatching my attention, and what could be more powerful than the combination of the two. In particular, derivatives of tensors!

In the beginning, there were scalars, and it was good.

If x∈Rx \in \R and f:R→Rf : \R \to \R, everything is pretty simple: (assuming it exists)

dfdx:R→R\frac{df}{dx} : \R \to \R

If we step things up just a notch and let x∈Rnx \in \R^n but still ff is scalar-valued: (assuming they exist)

∇f:Rn→Rn  s.t.  ∇fi=∂f∂xi:Rn→R \nabla f : \R^n \to \R^n \; \text{s.t.} \; \nabla f_i = \frac{\partial f}{\partial x_i} : \R^n \to \R

Then there were matrices/tensors in R\R, and it was... okay

If x∈Rnx \in \R^n and f:Rn→Rmf: \R^n \to \R^m, we start to get tedious:

Dxf=dfdx∈Rm×n  s.t.  Dxf[i,j]=∂fi∂xj D_{x}f = \frac{df}{dx} \in \R^{m \times n} \; \text{s.t.} \; D_{x}f[i,j] = \frac{\partial f_i}{\partial x_j}

or you can think of each row of DxfD_{x}f as ∇fi\nabla f_i.

But what happens when x∈Rn×mx \in \R^{n\times m} and f:Rn×m→Rof : \R^{n\times m} \to \R^{o}?

Dxf=dfdx∈Ro×m×n  s.t.  Dxf[i,j,k]=∂fi∂xj,k D_{x}f = \frac{df}{dx} \in \R^{o \times m \times n} \; \text{s.t.} \; D_{x}f[i,j,k] = \frac{\partial f_i}{\partial x_{j,k}}

You can probably see the pattern, here, if f:Rn×m→Ro×pf : \R^{n\times m} \to \R^{o \times p}:

Ro×p×n×m∋Dxf[i,j,k,l]=∂fi,j∂xk,l\R^{o\times p \times n \times m} \ni D_x f [i,j,k,l] = \frac{\partial f_{i,j}}{\partial x_{k,l}}

Note -- and I'm not an expert here -- I'm informed that this simplicity works because Rn\R^{n} is self-dual and the general principal at play here involves the tensor product ⊗\otimes. That is, if x∈Ax \in A and y∈By \in B, then

dxdy∈A⊗B∗ \frac{dx}{dy} \in A \otimes B^*

where B∗B^* is the dual-space of B (i.e. all f:B→Rf : B \to \R). But, with real numbers Rn\R^{n} is isomorphic to its dual, and thus when A=RnA = \R^n and B=RmB = \R^m

A⊗B∗=Rn⊗(Rm)∗≅Rn⊗Rm≅Rn×m A \otimes B^* = \R^n \otimes (\R^m)^* \cong \R^n \otimes \R^m \cong \R^{n\times m}

Finally, there was broadcasting and it got weird.

In machine learning, in particular deep learning, we use gradient methods to minimize a loss function, LL, subject to an input xx. LL is almost always scalar-valued (yay!), but xx is almost always matrix-valued (d'oh!). In particular, there are very natural situations where the shape of xx needs to be manipulated before it can be fed forward to a proceeding layer.

For compatibility, matrix libraries like numpy and pytorch will pad or expand a matrix/tensor to enable an operation to occur. For instance, if you wanted to add a vector of ones with a matrix, you'd need to "broadcast" the ones

x = np.ones(2) # a vector in R^2 [1,1]
y = np.zeros((2,2)) # a matrix in  R^(2x2) [[0,0], [0,0]]
z = x + y # [[1,1], [1,1]]

What if you wanted to take the derivative of z with respect to x?

Your regular calculus intuition would say

dzdx=ddx(x+y)=1\frac{dz}{dx} = \frac{d}{dx}(x + y) = 1

But that would be wrong, and you can sanity check yourself just considering the shapes of these objects:

  • x∈R2x \in \R^2 and z∈R2×2z \in \R^{2\times 2},
  • hence, the above says that dz/dx∈R2×2×2dz/dx \in \R^{2 \times 2 \times 2}
  • which means that dz/dx=1∈Rdz/dx = 1 \in \R doesn't work

So we need to consider an intermediate step: x_b, the (2x2) broadcasted version of x. Really,

z = x_b + y

which means that by the chain-rule:

dzdx=dzdxbdxbdx \frac{dz}{dx} = \frac{dz}{dx_b} \frac{dx_b}{dx}

But then, what are the shapes of these objects? And do these also make sense?

  • dz/dxb∈R2×2×2×2dz/dx_b \in \R^{2\times 2\times 2\times 2}, and dxb/dx∈R2×2×2dx_b/dx \in \R^{2 \times 2 \times 2}

Which means that if you think of this from just a matrix multiplication (tensor contraction) perspective, the product should look like[1]

dzdx[i,j,l]=∑p,qdzdxb[i,j,p,q]dxbdx[p,q,l] \frac{dz}{dx}[i,j,l] = \sum_{p,q} \frac{dz}{dx_b}[i,j,p,q] \frac{dx_b}{dx}[p,q,l]

but xb[p,q]=x[q]x_b[p,q] = x[q] (in this case because we're just copying our xx along the first dimension), so

dxbdx[p,q,l]=∂∂xlxb[p,q]=∂∂xlx[q]={1  if  q=l0  o.w. \frac{dx_b}{dx}[p,q,l] = \frac{\partial}{\partial x_l}x_b[p,q] = \frac{\partial}{\partial x_l}x[q] = \begin{cases} 1 \; \text{if} \; q = l \\ 0 \; \text{o.w.} \end{cases}

(You can think of the derivative of the broadcast as a bunch of copies of the identity matrix across the dimension that was copied) Thus,

dzdx[i,j,l]=∑p∑q=ldzdxb[i,j,p,q]dxbdx[p,q,l]=∑pdzdxb[i,j,p,l] \frac{dz}{dx}[i,j,l] = \sum_{p} \sum_{q=l} \frac{dz}{dx_b}[i,j,p,q] \frac{dx_b}{dx}[p,q,l] = \sum_{p} \frac{dz}{dx_b}[i,j,p,l]

Which should at least feel "okay": this says that dz/dx∈R2×2×2dz/dx \in \R^{2 \times 2\times 2} and (perhaps less intuitively) that dz/dxdz/dx is a sum of the derivatives across the broadcasted dimension.

Looking at the magic numbers in our example, you'll see we could've replaced x∈R2x \in \R^2 and y∈R2×2y \in \R^{2 \times 2} with any random set of broadcastable tensors (e.g. x∈R20×1x \in \R^{20 \times 1} and y∈R10×20×30y \in \R^{10 \times 20 \times 30}) and you'd still find that broadcasted xbx_b has Identity matrices sprinkled throughout its Jacobian.

While the result is still "sum across the broadcasted dimensions", it still feels weird.


  1. if you're cool with Einstein notation, we could save a touch of ink and write dzdxi,j,l=dzdxbi,j,p,qdxbdxp,q,l\frac{dz}{dx}_{i,j,l} = \frac{dz}{dx_b}_{i,j,p,q} \frac{dx_b}{dx}_{p,q,l} but the reality is that with matrix multiplication, when the matrices have shared trailing/leading indices, multiplication contracts on those indices. Hence why A∈Rm×nA \in \R^{m \times n} and B∈Rn×pB \in \R^{n \times p} yields AB=C∈Rm×pAB = C \in \R^{m \times p}. â†Šī¸Ž