As I work my way through the ARENA prereqs, I find myself constantly falling down mathematical rabbit holes. Linear algebra and calculus keep snatching my attention, and what could be more powerful than the combination of the two? In particular, derivatives of tensors!
In the beginning, there were scalars, and it was good.
If $x \in \mathbb{R}$ and $f: \mathbb{R} \to \mathbb{R}$, everything is pretty simple (assuming it exists):
$$\frac{df}{dx}: \mathbb{R} \to \mathbb{R}$$
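If we step things up just a notch and let $x \in \mathbb{R}^n$ while keeping $f$ scalar-valued (assuming the partial derivatives exist):
$$\nabla f: \mathbb{R}^n \to \mathbb{R}^n \quad \text{s.t.} \quad (\nabla f)_i = \frac{\partial f}{\partial x_i}: \mathbb{R}^n \to \mathbb{R}$$
Then there were matrices/tensors in R, and it was... okay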
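If $x \in \mathbb{R}^n$ and $f: \mathbb{R}^n \to \mathbb{R}^m$, things start to get tedious:
$$D_x f = \frac{df}{dx} \in \mathbb{R}^{m \times n} \quad \text{s.t.} \quad D_x f[i,j] = \frac{\partial f_i}{\partial x_j}$$
or you can think of each row of $D_x f$ as $\nabla f_i$.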
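But what happens when $x \in \mathbb{R}^{n \times m}$ and $f: \mathbb{R}^{n \times m} \to \mathbb{R}^o$?
$$D_x f = \frac{df}{dx} \in \mathbb{R}^{o \times n \times m} \quad \text{s.t.} \quad D_x f[i,j,k] = \frac{\partial f_i}{\partial x_{j,k}}$$
You can probably see the pattern here. If $f: \mathbb{R}^{n \times m} \to \mathbb{R}^{o \times p}$:
$$D_x f \in \mathbb{R}^{o \times p \times n \times m} \quad \text{s.t.} \quad D_x f[i,j,k,l] = \frac{\partial f_{i,j}}{\partial x_{k,l}}$$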
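The shape pattern above can be checked numerically. The following is a rough sketch using central finite differences; `numerical_jacobian` is a hypothetical helper written for this post (not a NumPy function), and the test functions are arbitrary choices with $n=3$, $m=4$, $o=2$, $p=4$.

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Central-difference Jacobian with shape f(x).shape + x.shape."""
    x = np.asarray(x, dtype=float)
    out_shape = np.asarray(f(x)).shape
    jac = np.zeros(out_shape + x.shape)
    for idx in np.ndindex(*x.shape):
        step = np.zeros_like(x)
        step[idx] = eps
        # Slice of the Jacobian holding d f / d x[idx] for every output entry
        jac[(...,) + idx] = (np.asarray(f(x + step)) - np.asarray(f(x - step))) / (2 * eps)
    return jac

A = np.arange(6.0).reshape(2, 3)

# f: R^3 -> R^2, so the Jacobian lives in R^(2x3) (and equals A here)
J_vec = numerical_jacobian(lambda v: A @ v, np.ones(3))

# f: R^(3x4) -> R^2, so the Jacobian lives in R^(2x3x4)
J_mat = numerical_jacobian(lambda X: np.array([X.sum(), (X ** 2).sum()]), np.ones((3, 4)))

# f: R^(3x4) -> R^(2x4), so the Jacobian lives in R^(2x4x3x4)
J_mm = numerical_jacobian(lambda X: A @ X, np.ones((3, 4)))

print(J_vec.shape, J_mat.shape, J_mm.shape)  # (2, 3) (2, 3, 4) (2, 4, 3, 4)
```

In every case the output indices come first and the input indices last, which is the convention used in the formulas above.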
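Note -- and I'm not an expert here -- I'm informed that this simplicity works because $\mathbb{R}^n$ is self-dual, and the general principle at play involves the tensor product $\otimes$. That is, if $x \in A$ and $y \in B$, then
$$\frac{dx}{dy} \in A \otimes B^*$$
where $B^*$ is the dual space of $B$ (i.e. all linear $f: B \to \mathbb{R}$). But over the real numbers, $\mathbb{R}^n$ is isomorphic to its dual, and thus when $A = \mathbb{R}^n$ and $B = \mathbb{R}^m$
$$A \otimes B^* = \mathbb{R}^n \otimes (\mathbb{R}^m)^* \cong \mathbb{R}^n \otimes \mathbb{R}^m \cong \mathbb{R}^{n \times m}$$
Finally, there was broadcasting and it got weird.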
In machine learning, and deep learning in particular, we use gradient methods to minimize a loss function, L, with respect to an input x. L is almost always scalar-valued (yay!), but x is almost always matrix-valued (d'oh!). In particular, there are very natural situations where the shape of x needs to be manipulated before it can be fed forward to a subsequent layer.
For compatibility, array libraries like NumPy and PyTorch will pad or expand a matrix/tensor to enable an operation to occur. For instance, if you wanted to add a vector of ones to a matrix, you'd need to "broadcast" the ones:
```python
import numpy as np

x = np.ones(2)        # a vector in R^2: [1, 1]
y = np.zeros((2, 2))  # a matrix in R^(2x2): [[0, 0], [0, 0]]
z = x + y             # x is broadcast across the rows of y: [[1, 1], [1, 1]]
```
What if you wanted to take the derivative of z with respect to x?
Your regular calculus intuition would say
$$\frac{dz}{dx} = \frac{d}{dx}(x + y) = 1$$
But that would be wrong, and you can sanity check yourself just by considering the shapes of these objects:
- $x \in \mathbb{R}^2$ and $z \in \mathbb{R}^{2 \times 2}$,
- hence, the pattern above says that $dz/dx \in \mathbb{R}^{2 \times 2 \times 2}$
- which means that $dz/dx = 1 \in \mathbb{R}$ doesn't work
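The shape argument can be checked directly. This is a small sketch for the 2-by-2 example only: because $z$ is linear in $x$, bumping one entry of $x$ by 1 gives the exact partial derivative, so no small step size is needed. The names `f`, `bump`, and `jac` are just illustrative.

```python
import numpy as np

x = np.ones(2)
y = np.zeros((2, 2))

def f(x):
    return x + y  # x is broadcast along the first axis of y

# z is linear in x, so a unit bump in x[l] yields dz/dx[:, :, l] exactly.
jac = np.zeros((2, 2, 2))
for l in range(2):
    bump = np.zeros(2)
    bump[l] = 1.0
    jac[:, :, l] = f(x + bump) - f(x)

print(jac.shape)  # (2, 2, 2)
print(jac[0])     # the 2x2 identity
print(jac[1])     # the 2x2 identity again
```

So the derivative is a three-index object, not the scalar 1.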
So we need to consider an intermediate step: x_b, the (2x2) broadcasted version of x. Really,
z = x_b + y
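which means that by the chain rule:
$$\frac{dz}{dx} = \frac{dz}{dx_b} \frac{dx_b}{dx}$$
But then, what are the shapes of these objects? And do these also make sense?
- $dz/dx_b \in \mathbb{R}^{2 \times 2 \times 2 \times 2}$, and $dx_b/dx \in \mathbb{R}^{2 \times 2 \times 2}$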
Which means that if you think of this from just a matrix multiplication (tensor contraction) perspective, the product should look like
$$\frac{dz}{dx}[i,j,l] = \sum_{p,q} \frac{dz}{dx_b}[i,j,p,q] \, \frac{dx_b}{dx}[p,q,l]$$
but $x_b[p,q] = x[q]$ (in this case because we're just copying our $x$ along the first dimension), so
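$$\frac{dx_b}{dx}[p,q,l] = \frac{\partial x_b[p,q]}{\partial x_l} = \frac{\partial x[q]}{\partial x_l} = \begin{cases} 1 & \text{if } q = l \\ 0 & \text{o.w.} \end{cases}$$
(You can think of the derivative of the broadcast as a bunch of copies of the identity matrix across the dimension that was copied.) Thus,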
$$\frac{dz}{dx}[i,j,l] = \sum_p \sum_{q=l} \frac{dz}{dx_b}[i,j,p,q] \, \frac{dx_b}{dx}[p,q,l] = \sum_p \frac{dz}{dx_b}[i,j,p,l]$$
Which should at least feel "okay": this says that $dz/dx \in \mathbb{R}^{2 \times 2 \times 2}$ and (perhaps less intuitively) that $dz/dx$ is a sum of the derivatives across the broadcasted dimension.
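The contraction above can be written out with `np.einsum` to confirm that the full chain rule and the "sum over the broadcast index" shortcut agree. This is a sketch for the 2-by-2 example only; the two Jacobians are built by hand from the formulas in the text, and the variable names are my own.

```python
import numpy as np

n_rows, n_cols = 2, 2

# dz/dx_b[i, j, p, q] = 1 if (i, j) == (p, q) else 0, since z = x_b + y elementwise
dz_dxb = np.einsum("ip,jq->ijpq", np.eye(n_rows), np.eye(n_cols))

# dx_b/dx[p, q, l] = 1 if q == l else 0, since x_b[p, q] = x[q]
dxb_dx = np.broadcast_to(np.eye(n_cols), (n_rows, n_cols, n_cols))

# Full chain rule: contract over both indices (p, q) of x_b
dz_dx = np.einsum("ijpq,pql->ijl", dz_dxb, dxb_dx)

# Shortcut: sum dz/dx_b over the broadcast index p (axis 2)
dz_dx_shortcut = dz_dxb.sum(axis=2)

print(dz_dx.shape)                             # (2, 2, 2)
print(np.array_equal(dz_dx, dz_dx_shortcut))   # True
```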
Looking at the magic numbers in our example, you'll see we could've replaced $x \in \mathbb{R}^2$ and $y \in \mathbb{R}^{2 \times 2}$ with any pair of broadcastable tensors (e.g. $x \in \mathbb{R}^{20 \times 1}$ and $y \in \mathbb{R}^{10 \times 20 \times 30}$) and you'd still find that the broadcasted $x_b$ has identity matrices sprinkled throughout its Jacobian.
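Here is a sketch of that larger case, under one extra assumption that is not in the example above: a scalar loss of the form `sum(w * z)`, where `w` is a made-up stand-in for an upstream gradient $dL/dz$. With that assumption, the gradient with respect to `x` should be `w` summed over every axis along which `x` was broadcast.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(20, 1))
y = rng.normal(size=(10, 20, 30))
w = rng.normal(size=(10, 20, 30))  # hypothetical upstream gradient dL/dz

def loss(x):
    return np.sum(w * (x + y))  # x is broadcast to shape (10, 20, 30)

# Claimed gradient: sum dL/dz over the broadcast axes (0 and 2), back to x's shape
grad_claimed = w.sum(axis=(0, 2)).reshape(x.shape)

# Check: the loss is linear in x, so a unit bump in each entry gives the
# partial derivative directly (up to floating-point rounding).
grad_check = np.zeros_like(x)
base = loss(x)
for idx in np.ndindex(*x.shape):
    bumped = x.copy()
    bumped[idx] += 1.0
    grad_check[idx] = loss(bumped) - base

print(grad_claimed.shape)                     # (20, 1)
print(np.allclose(grad_claimed, grad_check))  # True
```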
While the result is still "sum across the broadcasted dimensions", it still feels weird.