WTF??

What's the FrΓ©chet derivative??

Another day, another calculus rabbit hole. This time, though, we get to look at my two loves: derivative calculus and metric spaces.

The derivative we learn in highschool (or first-year undergraduate math) is something like:

lim⁑hβ†’0f(x+h)βˆ’f(x)h=fβ€²(x) \lim_{h \to 0} \frac{f(x+h) - f(x)}{h} = f'(x)

And this works well because by the time you see this definition, you should feel comfortable with geometry and distance in R\R.

What the FrΓ©chet derivative does is take the spirit of this, and generalise it to more abstract spaces. Namely, you let V,WV, W be normed vector spaces (βˆ₯β‹…βˆ₯V\Vert\cdot\Vert_{V} and βˆ₯β‹…βˆ₯W\Vert\cdot\Vert_{W}, respectively) with UβŠ†VU \subseteq V an open subset[1] and f:Uβ†’Wf: U \to W. Then ff is FrΓ©chet differentiable at x∈Ux \in U if there exists A:Vβ†’WA : V \to W, a bounded linear operator,[2] such that

lim⁑βˆ₯hβˆ₯Vβ†’0βˆ₯f(x+h)βˆ’f(x)βˆ’A(h)βˆ₯Wβˆ₯hβˆ₯V=0 \lim_{\Vert h \Vert_{V} \to 0} \frac{\Vert f(x+h) - f(x) - A(h)\Vert_{W}}{\Vert h \Vert_{V}} = 0

This might feel awkward, but if you give it a moment it actually is the most "straight-forward" generalisation of our original definition. In ordinary single-variable calculus, the derivative at xx gives us the best linear approximation to the change in ff near xx:

f(x+h)βˆ’f(x)β‰ˆfβ€²(x)h f(x + h) - f(x) \approx f'(x)h

where the map h↦fβ€²(x)hh \mapsto f'(x)h is linear in the perturbation hh. So, all we're saying is:

take the original, real-valued concept of the derivative as a best linear approximation, and extend it to abstract spaces that have a concept of distance

So why am I banging on about this? Because I want to explore what the derivative of matrix multiplication is. Simplest case: T(X)=CXT(X) = CX where X∈M(b,c)=RbΓ—cX \in M(b,c) = \R^{b \times c} and C∈M(a,b)C \in M(a,b); hence, T:M(b,c)β†’M(a,c)T : M(b,c) \to M(a,c). What is the derivative of TT with respect to XX (or in more popular notation DTXDT_X)?

There are many ways to skin this cat. My favourite is blind symbol-pushing:

DTX=ddXT(X)=ddXCX=CdXdX=C \begin{equation} DT_X = \frac{d}{dX}T(X) = \frac{d}{dX} CX = C \frac{dX}{dX} = C \end{equation}

But this is pretty unsatisfying because we don't actually know if the derivative operator here behaves like that. We're literally using procedural fluency to manipulate the expression. Let's try first principles, and just line stuff up.

Our function TT maps between the finite dimensional V=RbΓ—cV = \R^{b \times c} and W=RaΓ—cW = \R^{a \times c}, since both of these are 2D, choose your favorite matrix norm[3] and let's just call that βˆ₯β‹…βˆ₯\Vert \cdot \Vert:

lim⁑βˆ₯hβˆ₯β†’0βˆ₯T(X+h)βˆ’T(X)βˆ’A(h)βˆ₯βˆ₯hβˆ₯=lim⁑βˆ₯hβˆ₯β†’0βˆ₯C(X+h)βˆ’CXβˆ’A(h)βˆ₯βˆ₯hβˆ₯=lim⁑βˆ₯hβˆ₯β†’0βˆ₯Chβˆ’A(h)βˆ₯βˆ₯hβˆ₯ \begin{align*} \lim_{\Vert h \Vert \to 0} \frac{\Vert T(X + h) - T(X) - A(h)\Vert}{\Vert h \Vert} &= \lim_{\Vert h \Vert \to 0} \frac{\Vert C(X + h) - CX - A(h)\Vert}{\Vert h \Vert} \\ &= \lim_{\Vert h \Vert \to 0} \frac{\Vert Ch - A(h)\Vert}{\Vert h \Vert} \end{align*}

Now CC is a matrix from M(a,b)M(a,b), so left multiplication by CC is necessarily linear (and bounded). If we take A(h)=ChA(h) = Ch, our numerator is always 0, regardless of how close to 0 our βˆ₯hβˆ₯\Vert h \Vert is. So we've got a winner:

DTX[h]=Ch DT_X[h] = Ch

That is, the derivative is not literally the matrix CC as an output value. It is the linear operator that sends a perturbation hh to ChCh. In the vector case these feel like the same thing, but for matrices that distinction matters.

But what if we spice things up and go T(X)=XCT(X) = XC (right multiply by CC instead of left multiplying by CC)? Almost everything would run the same except for one small hiccup: the order of hh and CC matters:

T(X+h)βˆ’T(X)βˆ’A(h)=(X+h)Cβˆ’XCβˆ’A(h)=hCβˆ’A(h) T(X + h) - T(X) - A(h) = (X+h)C - XC - A(h) = hC - A(h)

hence to get the numerator to be zero, A(h)=hCA(h) = hC. Since we switched the composition order, we obviously don't have the same operator as before. If XX is still a matrix in M(b,c)M(b,c), then C∈M(c,a)C \in M(c,a) and T:M(b,c)β†’M(b,a)T : M(b,c) \to M(b,a). Hence, if we're sloppy and write DTX=CDT_X = C, one might be inclined to presume

DTX(h)=Ch DT_X(h) = Ch

which would be wrong since our derivative is supposed to be a map from M(b,c)β†’M(b,a)M(b,c) \to M(b,a), and if our perturbations, h∈M(b,c)h \in M(b,c), then ChC h would be incompatible. The correct statement is:

DTX[h]=hC DT_X[h] = h C

So what this tells us symbol-pushers, is that with matrix derivatives, we need to be careful:

D(CX)[h]=ChbutD(XC)[h]=hC D(CX)[h] = Ch \quad \text{but} \quad D(XC)[h] = hC

For the astute reader, though, you might notice that if TT is linear:

T(X+h)βˆ’T(X)=T(X+hβˆ’X)=T(h) T(X+h) - T(X) = T(X + h - X) = T(h)

which means that if A(h)=T(h)A(h) = T(h), our limit is zero, regardless of how far βˆ₯hβˆ₯\Vert h \Vert is from 0. This is particularly noteworthy, so let's highlight it:

if f:V→Wf: V \to W is linear, then its Fréchet derivative at every point is the same linear map: Dfx[h]=f(h)Df_x[h] = f(h).

Which brings us to why I'm here in the first place: understanding derivatives in the context of back propagation during network training.

A common operation you might consider when transforming data is a row- or column-sum. This is easily expressed as a matrix operation. That is, if X∈M(n,k)X \in M(n, k), and r∈Rnr \in \R^n is the row-sum and c∈Rkc \in \R^k is the column-sum:

r=X1kβ€…β€ŠΒ andΒ β€…β€Šc=1nTX r = X \mathbf{1}_k \; \text{ and } \; c = \mathbf{1}_n^T X

where 1k\mathbf{1}_k and 1n\mathbf{1}_n are vectors of ones of the appropriate lengths. Since these are linear maps, their derivatives apply the same sums to the perturbation hh:

DrX[h]=h1kβ€…β€ŠΒ andΒ β€…β€ŠDcX[h]=1nTh Dr_X[h] = h\mathbf{1}_k \; \text{ and } \; Dc_X[h] = \mathbf{1}_n^T h

So if our observed data is XX, then applying the derivative to that same matrix gives back the corresponding sum:

DrX[X]=X1k=r Dr_X[X] = X\mathbf{1}_k = r

This is the sense in which the derivative's output can still be the row-sum. The derivative at XX is the operator h↦h1kh \mapsto h\mathbf{1}_k; when the direction you feed it is XX itself, its output is the row-sum of XX.

And since this easily generalises to rank-nn Tensors (e.g. X∈M(i1,…,in)X \in M(i_1, \ldots, i_n)) -- you can consider the total sum as a composition of individual summations, we conclude that the derivative of an arbitrary axis-sum is the same axis-sum operator applied to the perturbation. Pretty neat!


  1. this is one of those conditions we don't usually think about because R=(βˆ’βˆž,+∞)\R = (-\infty, +\infty) is an open set, and we're taught about derivatives for functions defined on R\R or open-intervals (e.g. (0,1)(0,1)) on R\R. β†©οΈŽ

  2. basically, does there exist M>0M > 0 such that for all v∈Vv \in V, βˆ₯A(v)βˆ₯W≀Mβˆ₯vβˆ₯V\Vert A(v) \Vert_{W} \leq M \Vert v \Vert_{V}. This is making sure the image of the derivative doesn't "explode". We also need A(x+y)=A(x)+A(y)A(x + y) = A(x) + A(y) and A(cx)=cA(x)A(cx) = cA(x), which is linearity. β†©οΈŽ

  3. they're all equivalent! β†©οΈŽ