Michael Nielsen's book “Neural Networks and Deep Learning” Cauchy-Schwarz Inequality Proof

Question

In the online free book the following is stated:

If $C$ is a cost function which depends on $v_1, v_2, \ldots, v_n$, he states that we move in the direction $\Delta v$ that decreases $C$ as much as possible, which is equivalent to minimizing $\Delta C \approx \nabla C \cdot \Delta v$. So if $\|\Delta v\| = \epsilon$ for a small fixed $\epsilon > 0$, it can be proved that the choice of $\Delta v$ that minimizes $\nabla C \cdot \Delta v$ is $\Delta v = -\eta \nabla C$, where $\eta = \epsilon / \|\nabla C\|$. It is suggested to use the Cauchy-Schwarz inequality.

I don't have a background in mathematics. I have done a lot of reading, but I am struggling to know where to start, and I still have no conceptual understanding of why the Cauchy-Schwarz inequality is relevant here. Perhaps somebody can help me?
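
The claim can also be checked numerically before attempting a proof. The following is a minimal sketch, not anything taken from the book: the gradient vector, the value of $\epsilon$, and the helper names (`steepest_step`, `random_step`) are all hypothetical choices made for illustration. It compares $\nabla C \cdot \Delta v$ for the proposed step $\Delta v = -\eta \nabla C$ against many random steps of the same length $\epsilon$.

```python
import math
import random

def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Euclidean norm."""
    return math.sqrt(dot(a, a))

def steepest_step(grad, eps):
    """Return -eta * grad with eta = eps / ||grad||, so the step has length eps."""
    eta = eps / norm(grad)
    return [-eta * g for g in grad]

def random_step(n, eps, rng):
    """Return a random n-dimensional vector rescaled to length eps."""
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    scale = eps / norm(v)
    return [scale * x for x in v]

rng = random.Random(0)
grad = [3.0, -4.0, 12.0]   # hypothetical gradient; its norm is 13
eps = 0.01                 # hypothetical step length

best = steepest_step(grad, eps)
best_change = dot(grad, best)   # first-order change in C for the proposed step

# First-order change in C for many random steps of the same length.
random_changes = [
    dot(grad, random_step(len(grad), eps, rng)) for _ in range(10000)
]

print("proposed step:", best_change)
print("best random step:", min(random_changes))
print("lower bound -eps*||grad||:", -eps * norm(grad))
```

No random step should give a smaller value than the proposed one, which sits (up to rounding) at the bound $-\epsilon \|\nabla C\|$ that Cauchy-Schwarz provides.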

Formatting tips here: http://meta.math.stackexchange.com/q/5020/321264 — StubbornAtom, Aug 28 '16 at 18:21
I'm not sure you will really get this topic without background in mathematics — Yuriy S, Aug 28 '16 at 18:25
Note that if you move in the direction $v$, the change in cost will be approximately $v \cdot \nabla C$. Cauchy-Schwarz says that $|v \cdot \nabla C| \le \|v\| \, \|\nabla C\|$, with equality exactly when $v = a \nabla C$ for some scalar $a$. This means that to get the maximum reduction, I should be directed either along or directly opposite to $\nabla C$. It's relatively easy to see that I should indeed be opposed to $\nabla C$ in order to get a reduction at all, and that finishes the story. — stochasticboy321, Aug 28 '16 at 18:29
That said, I'd second Yuriy in that understanding ML would require a fairly firm grounding in vector calc and linear algebra, and you'd likely be better off investing time in these first instead of jumping right in, especially if a proof like this is currently out of reach. — stochasticboy321, Aug 28 '16 at 18:32
@stochasticboy321 I understand the notation and of course I understand linear algebra. I guess the concept here is that the two sides of C-S are equal, and if that is the case it is a simple rearrangement to show that $\eta = \epsilon / \|\nabla C\|$. But I don't understand why the two sides of C-S must be equal? — par, Aug 28 '16 at 18:33
C-S gives you an upper bound on the absolute value of the reduction possible. If you reach this maximum (which you'd like to), you must then (tautologically) satisfy C-S with equality. — stochasticboy321, Aug 28 '16 at 18:36
Perfect, it just clicked thanks. I guess that should have been obvious :) — par, Aug 28 '16 at 18:38
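
For reference, the argument from the comments can be written out as a short derivation. This is a sketch under two assumptions: the first-order approximation $\Delta C \approx \nabla C \cdot \Delta v$ is taken as given, and $\nabla C \neq 0$.

By Cauchy-Schwarz, for any step with $\|\Delta v\| = \epsilon$,

$$
|\nabla C \cdot \Delta v| \le \|\nabla C\| \, \|\Delta v\| = \epsilon \|\nabla C\|
\quad\Longrightarrow\quad
\nabla C \cdot \Delta v \ge -\epsilon \|\nabla C\|.
$$

So $-\epsilon \|\nabla C\|$ is a lower bound on the first-order change in $C$. Equality in Cauchy-Schwarz holds exactly when $\Delta v = a \nabla C$ for some scalar $a$, and the constraint $\|\Delta v\| = \epsilon$ then forces $|a| = \epsilon / \|\nabla C\|$. For such a step,

$$
\nabla C \cdot \Delta v = a \|\nabla C\|^2,
$$

which equals $+\epsilon \|\nabla C\|$ for $a > 0$ and $-\epsilon \|\nabla C\|$ for $a < 0$. The lower bound is therefore attained only by $a = -\epsilon / \|\nabla C\|$, that is, $\Delta v = -\eta \nabla C$ with $\eta = \epsilon / \|\nabla C\|$.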
