Another day, another calculus rabbit hole. This time, though, we get to look at my two loves: derivative calculus and metric spaces.
The derivative we learn in high school (or first-year undergraduate math) is something like:
$$\lim_{h \to 0} \frac{f(x+h) - f(x)}{h} = f'(x)$$

And this works well because by the time you see this definition, you should feel comfortable with geometry and distance in $\mathbb{R}$.
What the Fréchet derivative does is take the spirit of this, and generalise it to more abstract spaces. Namely, you let $V, W$ be normed vector spaces (with norms $\|\cdot\|_V$ and $\|\cdot\|_W$, respectively) with $U \subseteq V$ an open subset and $f: U \to W$. Then $f$ is Fréchet differentiable at $x \in U$ if there exists $A: V \to W$, a bounded linear operator, such that
$$\lim_{\|h\|_V \to 0} \frac{\|f(x+h) - f(x) - A(h)\|_W}{\|h\|_V} = 0$$

This might feel awkward, but if you give it a moment it actually is the most "straightforward" generalisation of our original definition. In ordinary single-variable calculus, the derivative at $x$ gives us the best linear approximation to the change in $f$ near $x$:
$$f(x+h) - f(x) \approx f'(x)h$$

where the map $h \mapsto f'(x)h$ is linear in the perturbation $h$. So, all we're saying is:
take the original, real-valued concept of the derivative as a best linear approximation, and extend it to abstract spaces that have a concept of distance
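To make the "best linear approximation" idea concrete, here is a minimal sketch of the scalar case. The choice of $f = \sin$ at $x = 1$ and the helper name `frechet_ratio` are assumptions for illustration only. It evaluates the quotient from the Fréchet definition for shrinking $h$, using $A(h) = f'(x)h$:

```python
import math

def frechet_ratio(f, df, x, h):
    """|f(x+h) - f(x) - f'(x) h| / |h|: the quantity whose limit must be 0."""
    return abs(f(x + h) - f(x) - df(x) * h) / abs(h)

# Illustrative choice: f = sin, f' = cos, evaluated at x = 1.
x = 1.0
step_sizes = [10.0 ** -k for k in range(1, 6)]
ratios = [frechet_ratio(math.sin, math.cos, x, h) for h in step_sizes]

for h, r in zip(step_sizes, ratios):
    print(f"h = {h:.0e}: ratio = {r:.3e}")
```

The ratio should shrink roughly in proportion to $h$, which is what "the linear map $h \mapsto f'(x)h$ captures everything up to higher-order terms" looks like numerically.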
So why am I banging on about this? Because I want to explore what the derivative of matrix multiplication is. Simplest case: $T(X) = CX$ where $X \in M(b,c) = \mathbb{R}^{b \times c}$ and $C \in M(a,b)$; hence, $T: M(b,c) \to M(a,c)$. What is the derivative of $T$ with respect to $X$ (or in more popular notation, $DT_X$)?
There are many ways to skin this cat. My favourite is blind symbol-pushing:
$$DT_X = \frac{d}{dX}T(X) = \frac{d}{dX}CX = C\frac{dX}{dX} = C$$

But this is pretty unsatisfying because we don't actually know if the derivative operator here behaves like that. We're literally using procedural fluency to manipulate the expression. Let's try first principles, and just line stuff up.
Our function $T$ maps between the finite-dimensional spaces $V = \mathbb{R}^{b \times c}$ and $W = \mathbb{R}^{a \times c}$. Both are spaces of matrices, and in finite dimensions all norms are equivalent, so choose your favourite matrix norm and let's just call it $\|\cdot\|$:

$$\lim_{\|h\| \to 0} \frac{\|T(X+h) - T(X) - A(h)\|}{\|h\|} = \lim_{\|h\| \to 0} \frac{\|C(X+h) - CX - A(h)\|}{\|h\|} = \lim_{\|h\| \to 0} \frac{\|Ch - A(h)\|}{\|h\|}$$

Now $C$ is a matrix from $M(a,b)$, so left multiplication by $C$ is necessarily linear (and bounded). If we take $A(h) = Ch$, our numerator is always 0, regardless of how close to 0 our $\|h\|$ is. So we've got a winner:
$$DT_X[h] = Ch$$

That is, the derivative is not literally the matrix $C$ as an output value. It is the linear operator that sends a perturbation $h$ to $Ch$. In the vector case these feel like the same thing, but for matrices that distinction matters.
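As a sanity check, the first-principles argument can be sketched numerically with NumPy. The dimensions, the random seed, and the helper names `T`, `A` and `residual_ratio` below are arbitrary choices for illustration; the norm is the Frobenius norm, which is what `np.linalg.norm` uses by default for 2D arrays.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, c = 2, 3, 4                      # arbitrary illustrative sizes
C = rng.standard_normal((a, b))        # C in M(a, b)
X = rng.standard_normal((b, c))        # X in M(b, c)
h = rng.standard_normal((b, c))        # perturbation, same shape as X

def T(X):
    """Left multiplication by C."""
    return C @ X

def A(h):
    """Candidate derivative: the linear operator h -> C h."""
    return C @ h

def residual_ratio(h):
    """||T(X+h) - T(X) - A(h)|| / ||h|| in the Frobenius norm."""
    return np.linalg.norm(T(X + h) - T(X) - A(h)) / np.linalg.norm(h)

# The ratio is zero up to floating-point roundoff at every scale of h.
ratios = [residual_ratio(t * h) for t in (1.0, 1e-2, 1e-4)]
print(ratios)
print(A(h).shape)  # the operator maps M(b, c) into M(a, c)
```

Note that the ratio is tiny for large perturbations too, not only in the limit: that is a feature of $T$ being linear, not of derivatives in general.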
But what if we spice things up and go $T(X) = XC$ (right-multiplying by $C$ instead of left-multiplying)? Almost everything would run the same except for one small hiccup: the order of $h$ and $C$ matters:
$$T(X+h) - T(X) - A(h) = (X+h)C - XC - A(h) = hC - A(h)$$

Hence, to get the numerator to be zero, we need $A(h) = hC$. Since we switched the composition order, we obviously don't have the same operator as before. If $X$ is still a matrix in $M(b,c)$, then $C \in M(c,a)$ and $T: M(b,c) \to M(b,a)$. Hence, if we're sloppy and write $DT_X = C$, one might be inclined to presume
$$DT_X(h) = Ch$$

which would be wrong: our derivative is supposed to be a map $M(b,c) \to M(b,a)$, and for a perturbation $h \in M(b,c)$ the product $Ch$ (a $c \times a$ matrix times a $b \times c$ matrix) is not even defined unless $a = b$. The correct statement is:
$$DT_X[h] = hC$$

So what this tells us symbol-pushers is that with matrix derivatives we need to be careful:
$$D(CX)[h] = Ch \quad \text{but} \quad D(XC)[h] = hC$$

The astute reader, though, might notice that if $T$ is linear:
$$T(X+h) - T(X) = T(X + h - X) = T(h)$$

which means that if $A(h) = T(h)$, the numerator is exactly zero, so the limit is zero regardless of how far $\|h\|$ is from 0. This is particularly noteworthy, so let's highlight it:
if $f: V \to W$ is linear (and bounded, which is automatic in finite dimensions), then its Fréchet derivative at every point is the same linear map: $Df_x[h] = f(h)$.
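The right-multiplication case, and the reason the order of $h$ and $C$ cannot be swapped, can be sketched the same way. The sizes and variable names are again arbitrary illustrative choices, deliberately picked with $a \neq b$ so that the "sloppy" product $Ch$ has incompatible shapes.

```python
import numpy as np

rng = np.random.default_rng(1)
b, c, a = 2, 3, 4                      # arbitrary sizes with a != b
X = rng.standard_normal((b, c))        # X in M(b, c)
C = rng.standard_normal((c, a))        # C in M(c, a)
h = rng.standard_normal((b, c))        # perturbation, same shape as X

def T(X):
    """Right multiplication by C."""
    return X @ C

# Linearity: the increment T(X+h) - T(X) is T(h) = h C, up to roundoff.
increment = T(X + h) - T(X)
matches_hC = np.allclose(increment, h @ C)

# The sloppy guess C h is not even a valid product for these shapes.
try:
    C @ h
    left_multiply_defined = True
except ValueError:
    left_multiply_defined = False

print(matches_hC, left_multiply_defined, increment.shape)
```

If $a = b$ the product $Ch$ would be defined, but it would have shape $c \times c$ rather than $b \times a$, so it still would not be the derivative.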
Which brings us to why I'm here in the first place: understanding derivatives in the context of back propagation during network training.
A common operation you might consider when transforming data is a row- or column-sum. This is easily expressed as a matrix operation. That is, if $X \in M(n,k)$, and $r \in \mathbb{R}^n$ is the row-sum and $c \in \mathbb{R}^k$ is the column-sum:
$$r = X\mathbf{1}_k \quad \text{and} \quad c = \mathbf{1}_n^T X$$

where $\mathbf{1}_k$ and $\mathbf{1}_n$ are vectors of ones of the appropriate lengths. Since these are linear maps, their derivatives apply the same sums to the perturbation $h$:
$$Dr_X[h] = h\mathbf{1}_k \quad \text{and} \quad Dc_X[h] = \mathbf{1}_n^T h$$

So if our observed data is $X$, then applying the derivative to that same matrix gives back the corresponding sum:
$$Dr_X[X] = X\mathbf{1}_k = r$$

This is the sense in which the derivative's output can still be the row-sum. The derivative at $X$ is the operator $h \mapsto h\mathbf{1}_k$; when the direction you feed it is $X$ itself, its output is the row-sum of $X$.
And since this easily generalises to rank-$n$ tensors (e.g. $X \in M(i_1, \dots, i_n)$), where you can consider the total sum as a composition of individual summations, we conclude that the derivative of an arbitrary axis-sum is the same axis-sum operator applied to the perturbation. Pretty neat!
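Here is a small NumPy sketch of the sum examples, under the assumption that "axis-sum" means what `ndarray.sum(axis=...)` computes. The shapes and the helper name `axis_sum` are made up for illustration. It checks that the matrix formulas agree with NumPy's axis sums, and that for a rank-3 tensor the increment of an axis-sum is the same axis-sum applied to the perturbation.

```python
import numpy as np

rng = np.random.default_rng(2)

# Matrix case: r = X 1_k (row sums) and c = 1_n^T X (column sums).
n, k = 3, 5
X = rng.standard_normal((n, k))
r = X @ np.ones(k)
c = np.ones(n) @ X
rows_agree = np.allclose(r, X.sum(axis=1))
cols_agree = np.allclose(c, X.sum(axis=0))

# Rank-3 case: the axis-sum is linear, so its derivative at any point,
# applied to a perturbation H, is the same axis-sum of H.
def axis_sum(Z, axis=1):
    return Z.sum(axis=axis)

Y = rng.standard_normal((2, 3, 4))
H = rng.standard_normal((2, 3, 4))
increment = axis_sum(Y + H) - axis_sum(Y)
derivative_matches = np.allclose(increment, axis_sum(H))

# The total sum as a composition of single-axis sums.
total_agrees = np.isclose(Y.sum(), Y.sum(axis=0).sum(axis=0).sum(axis=0))

print(rows_agree, cols_agree, derivative_matches, total_agrees)
```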