I am reading this paper on implicit regularisation in gradient descent and I am having difficulty with the provided definition of effective rank. In the paper it is given as
$r(W) = \frac{||W||_*}{||W||}$
where $||W||_*$ is the nuclear norm, $||W||$ is the operator norm, and $W \in \mathbb{R}^{n \times n}$ is symmetric. Assuming $W$ has eigenvalues $\lambda_i$, $i \in \{1,...,n\}$, ordered so that $|\lambda_1| \geq \dots \geq |\lambda_n|$, is the formula below for the effective rank correct?
$\sum_{\ell=1}^{n}\frac{\left|\lambda_{\ell}\right|}{\left|\lambda_{1}\right|}$
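As a sanity check on the formula above, here is a minimal sketch (my own, not taken from the paper) that compares the two expressions on a $2 \times 2$ symmetric matrix, where everything has a closed form. The function names and the test matrix are hypothetical. The singular values are computed as square roots of the eigenvalues of $W^\top W$, and the denominator uses the largest $|\lambda_\ell|$, so no ordering of the eigenvalues is assumed.

```python
import math

def sym2x2_eigenvalues(a, b, d):
    """Eigenvalues of the symmetric matrix [[a, b], [b, d]] in closed form."""
    mean = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    return (mean + radius, mean - radius)

def effective_rank_from_norms(a, b, d):
    """Nuclear norm over operator norm, computed from the singular values."""
    # For symmetric W, W^T W = W^2 = [[a^2 + b^2, b(a + d)], [b(a + d), b^2 + d^2]].
    gram_eigs = sym2x2_eigenvalues(a * a + b * b, b * (a + d), b * b + d * d)
    # Singular values are square roots of the eigenvalues of W^T W;
    # clamp at zero to guard against tiny negative round-off.
    singular_values = [math.sqrt(max(e, 0.0)) for e in gram_eigs]
    return sum(singular_values) / max(singular_values)

def effective_rank_from_eigenvalues(a, b, d):
    """Sum of |eigenvalues| divided by the largest |eigenvalue|."""
    abs_eigs = [abs(lam) for lam in sym2x2_eigenvalues(a, b, d)]
    return sum(abs_eigs) / max(abs_eigs)

# Hypothetical indefinite test matrix W = [[2, 1], [1, -3]].
r_norms = effective_rank_from_norms(2.0, 1.0, -3.0)
r_eigs = effective_rank_from_eigenvalues(2.0, 1.0, -3.0)
print(r_norms, r_eigs)
```

Both values agree on this example, which is what I would expect if the singular values of a symmetric matrix are the absolute values of its eigenvalues. The sketch does not handle $W = 0$, where the ratio is undefined.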
Additionally, is anyone able to provide an intuition as to why one might use the effective rank rather than the true rank? I see there is already a question on here regarding effective rank, but the definition given there appears slightly different from the one I am working with. (Unless this is indeed the entropy of the notional distribution obtained by normalising the singular values, as answered on that question.)
However, the definition given on page 21 of the linked paper, which is then used in the inequality at the bottom of page 23 (where $r(\widehat{W}_L)$ is computed), seems to match the definition I provided in my question.
– JamesLevine Jan 24 '24 at 09:07