WTF??

28 Apr 2026

What's the Fréchet derivative??

Another day, another calculus rabbit hole. This time, though, we get to look at my two loves: derivative calculus and metric spaces.

The derivative we learn in high school (or first-year undergraduate math) is something like:

\lim_{h \to 0} \frac{f(x+h) - f(x)}{h} = f'(x)

And this works well because by the time you see this definition, you should feel comfortable with geometry and distance in $\R$.

What the Fréchet derivative does is take the spirit of this and generalise it to more abstract spaces. Namely, let $V, W$ be normed vector spaces (with norms $\Vert\cdot\Vert_{V}$ and $\Vert\cdot\Vert_{W}$, respectively), with $U \subseteq V$ an open subset^[1] and $f: U \to W$. Then $f$ is Fréchet differentiable at $x \in U$ if there exists a bounded linear operator^[2] $A : V \to W$ such that

\lim_{\Vert h \Vert_{V} \to 0} \frac{\Vert f(x+h) - f(x) - A(h)\Vert_{W}}{\Vert h \Vert_{V}} = 0

This might feel awkward, but if you give it a moment it really is the most straightforward generalisation of our original definition. In ordinary single-variable calculus, the derivative at $x$ gives us the best linear approximation to the change in $f$ near $x$:

f(x + h) - f(x) \approx f'(x)h

where the map $h \mapsto f'(x)h$ is linear in the perturbation $h$. So, all we're saying is:

take the original, real-valued concept of the derivative as a best linear approximation, and extend it to abstract spaces that have a concept of distance
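
If you'd rather watch the limit than trust it, here is a minimal Python sketch of the scalar case. The function $f(x) = x^2$, the point $x = 3$ and the helper name `frechet_ratio` are illustrative assumptions, not anything canonical. With $A(h) = 2xh$ the ratio works out to $|h|$ on paper, so it should shrink along with $h$ (up to floating-point noise):

```python
# Scalar sanity check of the Fréchet condition for f(x) = x**2 at x = 3.
# The candidate linear map is A(h) = f'(x) * h = 2 * x * h.

def f(x):
    return x * x

def frechet_ratio(x, h):
    """|f(x + h) - f(x) - A(h)| / |h| with A(h) = 2 * x * h."""
    return abs(f(x + h) - f(x) - 2 * x * h) / abs(h)

x = 3.0
# Shrink h by a factor of ten each step and record the ratio.
ratios = [frechet_ratio(x, 10.0 ** -k) for k in range(1, 6)]

for k, r in enumerate(ratios, start=1):
    print(f"h = 1e-{k}: ratio = {r:.2e}")
```

Swapping in any other slope for `2 * x` should leave a ratio that stalls at a nonzero value instead of shrinking, which is the sense in which the derivative is the *best* linear approximation.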

So why am I banging on about this? Because I want to explore what the derivative of matrix multiplication is. Simplest case: $T(X) = CX$ where $X \in M(b,c) = \R^{b \times c}$ and $C \in M(a,b)$; hence, $T : M(b,c) \to M(a,c)$. What is the derivative of $T$ with respect to $X$ (or, in more popular notation, $DT_X$)?

There are many ways to skin this cat. My favourite is blind symbol-pushing:

\begin{equation} DT_X = \frac{d}{dX}T(X) = \frac{d}{dX} CX = C \frac{dX}{dX} = C \end{equation}

But this is pretty unsatisfying because we don't actually know if the derivative operator here behaves like that. We're literally using procedural fluency to manipulate the expression. Let's try first principles, and just line stuff up.

Our function $T$ maps between the finite-dimensional spaces $V = \R^{b \times c}$ and $W = \R^{a \times c}$. Since both are finite-dimensional, choose your favourite matrix norm^[3] on each, and let's just call them both $\Vert \cdot \Vert$:

\begin{align*} \lim_{\Vert h \Vert \to 0} \frac{\Vert T(X + h) - T(X) - A(h)\Vert}{\Vert h \Vert} &= \lim_{\Vert h \Vert \to 0} \frac{\Vert C(X + h) - CX - A(h)\Vert}{\Vert h \Vert} \\ &= \lim_{\Vert h \Vert \to 0} \frac{\Vert Ch - A(h)\Vert}{\Vert h \Vert} \end{align*}

Now $C$ is a matrix from $M(a,b)$, so left multiplication by $C$ is necessarily linear (and bounded). If we take $A(h) = Ch$, our numerator is always 0, regardless of how close to 0 our $\Vert h \Vert$ is. So we've got a winner:

DT_X[h] = Ch

That is, the derivative is not literally the matrix $C$ as an output value. It is the linear operator that sends a perturbation $h$ to $Ch$. In the vector case these feel like the same thing, but for matrices that distinction matters.
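
To make this concrete, here is a small Python sketch that checks $T(X+h) - T(X) = Ch$ on one example. It uses plain lists of rows to stay dependency-free, and the helper names (`matmul`, `add`, `sub`) and all the matrix entries are made up for illustration. Integer entries keep the arithmetic exact:

```python
# Check D(CX)[h] = Ch on a small integer example (exact arithmetic, no rounding).
# Matrices are plain lists of rows; every value below is made up for illustration.

def matmul(A, B):
    """Matrix product of A (m x n) and B (n x p)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def add(A, B):
    """Entrywise sum of two same-shaped matrices."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def sub(A, B):
    """Entrywise difference of two same-shaped matrices."""
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

C = [[1, 2, 0], [0, -1, 3]]      # a x b = 2 x 3
X = [[1, 0], [2, 1], [0, 4]]     # b x c = 3 x 2
h = [[1, -1], [0, 2], [3, 1]]    # perturbation, same shape as X

def T(X):
    return matmul(C, X)

change = sub(T(add(X, h)), T(X))   # T(X + h) - T(X)
candidate = matmul(C, h)           # A(h) = Ch
residual = sub(change, candidate)  # what sits inside the norm in the numerator

print("T(X+h) - T(X) =", change)
print("Ch            =", candidate)
print("residual      =", residual)
```

Note that `h` here is not small at all: because $T$ is linear, the residual vanishes for every perturbation, not just in the limit.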

But what if we spice things up and go $T(X) = XC$ (right multiply by $C$ instead of left multiplying by $C$)? Almost everything would run the same except for one small hiccup: the order of $h$ and $C$ matters:

T(X + h) - T(X) - A(h) = (X+h)C - XC - A(h) = hC - A(h)

hence, to get the numerator to be zero, we need $A(h) = hC$. Since we switched the multiplication order, we obviously don't have the same operator as before. If $X$ is still a matrix in $M(b,c)$, then $C \in M(c,a)$ and $T : M(b,c) \to M(b,a)$. So if we're sloppy and write $DT_X = C$, one might be inclined to presume

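DT_X[h] = Ch

which would be wrong: our derivative is supposed to be a map $M(b,c) \to M(b,a)$, and for a perturbation $h \in M(b,c)$ the product $Ch$ of a $c \times a$ matrix with a $b \times c$ matrix is not even defined unless $a = b$ (and even then it has shape $c \times c$, which is not $b \times a$ in general). The correct statement is: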

DT_X[h] = h C

So what this tells us symbol-pushers is that with matrix derivatives we need to be careful:

D(CX)[h] = Ch \quad \text{but} \quad D(XC)[h] = hC
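
The shape argument is easy to check mechanically. Below is a hedged Python sketch with $b = 2$, $c = 3$, $a = 4$ (sizes and entries chosen arbitrarily for illustration). The helper `matmul` is a hypothetical list-of-rows implementation that raises `ValueError` when the inner dimensions disagree, so the sloppy guess $Ch$ fails loudly instead of silently:

```python
# Right multiplication: T(X) = XC with X in M(b, c) and C in M(c, a).
# All sizes and entries are illustrative; matrices are plain lists of rows.

def shape(A):
    return (len(A), len(A[0]))

def matmul(A, B):
    """Matrix product that refuses incompatible shapes."""
    if shape(A)[1] != shape(B)[0]:
        raise ValueError(f"cannot multiply {shape(A)} by {shape(B)}")
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

X = [[1, 2, 0], [0, 1, -1]]                       # b x c = 2 x 3
C = [[1, 0, 2, 1], [0, 1, 1, 0], [3, 0, 0, 1]]    # c x a = 3 x 4
h = [[1, 0, 1], [2, -1, 0]]                       # perturbation, same shape as X

def T(X):
    return matmul(X, C)

change = sub(T(add(X, h)), T(X))   # T(X + h) - T(X)
right = matmul(h, C)               # the correct derivative output, hC

try:
    matmul(C, h)                   # the sloppy guess, Ch
    left_defined = True
except ValueError as err:
    left_defined = False
    print("Ch is not defined:", err)

print("change == hC:", change == right, "with shape", shape(right))
```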

The astute reader might notice, though, that if $T$ is linear:

T(X+h) - T(X) = T(X + h - X) = T(h)

which means that if $A(h) = T(h)$, the numerator is identically zero, so our limit is zero regardless of how far $\Vert h \Vert$ is from 0. This is particularly noteworthy, so let's highlight it:

If $f: V \to W$ is linear (and bounded, which is automatic when $V$ is finite-dimensional), then its Fréchet derivative at every point is the same linear map: $Df_x[h] = f(h)$.
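
As a worked instance of this fact, consider the two-sided map $T(X) = BXC$. Here $B$ is a fixed matrix introduced purely for illustration (it is not part of the setup above), and the shapes are assumed to be compatible. The same cancellation as before gives:

```latex
\begin{align*}
T(X + h) - T(X) &= B(X + h)C - BXC \\
                &= BXC + BhC - BXC \\
                &= BhC \\
                &= T(h)
\end{align*}
```

so $DT_X[h] = BhC$. Taking $C$ to be the identity recovers the left-multiplication rule (with $B$ playing the role of the fixed matrix), and taking $B$ to be the identity recovers $D(XC)[h] = hC$.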

Which brings us to why I'm here in the first place: understanding derivatives in the context of backpropagation during network training.

A common operation you might consider when transforming data is a row- or column-sum. This is easily expressed as a matrix operation. That is, if $X \in M(n, k)$, then the row-sum $r \in \R^n$ and the column-sum $c \in \R^k$ (written here as a row vector) are:

r = X \mathbf{1}_k \; \text{ and } \; c = \mathbf{1}_n^T X

where $\mathbf{1}_k$ and $\mathbf{1}_n$ are vectors of ones of the appropriate lengths. Since these are linear maps, their derivatives apply the same sums to the perturbation $h$:

Dr_X[h] = h\mathbf{1}_k \; \text{ and } \; Dc_X[h] = \mathbf{1}_n^T h

So if our observed data is $X$, then applying the derivative to that same matrix gives back the corresponding sum:

Dr_X[X] = X\mathbf{1}_k = r

This is the sense in which the derivative's output can still be the row-sum. The derivative at $X$ is the operator $h \mapsto h\mathbf{1}_k$; when the direction you feed it is $X$ itself, its output is the row-sum of $X$.
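
Here is a quick Python sketch of the row- and column-sum case, again with plain lists and made-up integer data. The helper names `row_sum` and `col_sum` are assumptions for illustration; they compute $X\mathbf{1}_k$ and $\mathbf{1}_n^T X$ without building the vectors of ones explicitly:

```python
# Row- and column-sums are linear, so the change in the output equals
# the same sum applied to the perturbation. Data below is illustrative.

def row_sum(M):
    """X 1_k: one entry per row."""
    return [sum(row) for row in M]

def col_sum(M):
    """1_n^T X: one entry per column."""
    return [sum(col) for col in zip(*M)]

def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

X = [[1, 2, 3], [4, 5, 6]]      # n x k = 2 x 3
h = [[1, 0, -1], [2, 2, 2]]     # perturbation, same shape as X

# r(X + h) - r(X) and c(X + h) - c(X)
dr = [p - q for p, q in zip(row_sum(add(X, h)), row_sum(X))]
dc = [p - q for p, q in zip(col_sum(add(X, h)), col_sum(X))]

print("change in row-sum:", dr, "vs row_sum(h):", row_sum(h))
print("change in col-sum:", dc, "vs col_sum(h):", col_sum(h))
```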

This generalises easily to rank-$n$ tensors (e.g. $X \in M(i_1, \ldots, i_n)$): a sum over several axes is a composition of single-axis sums, each of which is linear. So we conclude that the derivative of an arbitrary axis-sum is the same axis-sum operator applied to the perturbation. Pretty neat!
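
The tensor version can be sketched the same way. The code below is a minimal illustration that treats a rank-3 tensor as nested Python lists. The helper `sum_axis` is hypothetical and only loosely mimics what an array library's axis-sum would do, and the data is made up:

```python
from functools import reduce

# Axis-sums on a rank-3 tensor stored as nested lists (illustrative data).

def add(a, b):
    """Entrywise sum of two same-shaped nested lists (or two numbers)."""
    if isinstance(a, list):
        return [add(x, y) for x, y in zip(a, b)]
    return a + b

def sub(a, b):
    """Entrywise difference of two same-shaped nested lists (or two numbers)."""
    if isinstance(a, list):
        return [sub(x, y) for x, y in zip(a, b)]
    return a - b

def sum_axis(t, axis):
    """Sum a nested-list tensor over one axis (0 is the outermost)."""
    if axis == 0:
        return reduce(add, t)
    return [sum_axis(s, axis - 1) for s in t]

X = [[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]]   # shape 2 x 2 x 3
h = [[[1, 0, 0], [0, 1, 0]], [[0, 0, 1], [1, 1, 1]]]      # same shape as X

checks = []
for axis in range(3):
    change = sub(sum_axis(add(X, h), axis), sum_axis(X, axis))
    checks.append(change == sum_axis(h, axis))
    print(f"axis {axis}: change equals sum_axis(h, axis)? {checks[-1]}")

# The total sum is three single-axis sums composed; each one removes axis 0.
total = sum_axis(sum_axis(sum_axis(X, 0), 0), 0)
print("total sum of X:", total)
```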

^[1] This is one of those conditions we don't usually think about because $\R = (-\infty, +\infty)$ is an open set, and we're taught about derivatives for functions defined on $\R$ or on open intervals (e.g. $(0,1)$) in $\R$.
^[2] Bounded means there exists $M > 0$ such that for all $v \in V$, $\Vert A(v) \Vert_{W} \leq M \Vert v \Vert_{V}$. This makes sure the output of the derivative doesn't "explode". Linearity means $A(x + y) = A(x) + A(y)$ and $A(cx) = cA(x)$.
^[3] All norms on a finite-dimensional space are equivalent, so the choice doesn't affect the limit!