0

I am reading this paper on implicit regularisation in gradient descent and I am having difficulty with the provided definition of effective rank. In the paper it is given as

$r(W) = \frac{||W||_*}{||W||}$

Where $||W||_*$ is the nuclear norm, $||W||$ is the operator norm and $W \in \mathbb{R}^{n \times n}$ is symmetric. Assuming $W$ has eigenvalues $\lambda_i$, $i \in \{1,...,n\}$, is the below formula for the effective rank correct?

$\sum_{\ell=1}^{n}\frac{\left|\lambda_{\ell}\right|}{\left|\lambda_{1}\right|}$

Additionally, is anyone able to provide an intuition as to why one may use the effective rank rather than the true rank? I see there is already a question on here regarding effective rank, but the definition given appears slightly different as to the one I am working with. (Unless this indeed the entropy of the notional distribution obtained by normalising the singular values as answered on that question).

  • 2
    Nuclear / Operator norm are given in terms of singular values, not eigenvalues. – Hyperplane Jan 23 '24 at 15:44
  • I thought so too, but then is it not the case that the square of singular values are eigenvalues? But then given this, the above definition I've provided wouldn't be correct.

    However, the definition given on page 21 of the linked paper which is then used at the bottom of page 23 in the inequality (where $r(\widehat{W}_L)$ is computed) seems to use the definition I provided in my question.

    – JamesLevine Jan 24 '24 at 09:07
  • Singular values are roots of the eigenvalues of $WᵀW$. If $W$ is non-square, then it doesn't even have eigenvalues. – Hyperplane Jan 24 '24 at 09:42
  • Ah, my apologies, in the paper W is assumed to be square (and symmetric). I will edit the question. – JamesLevine Jan 24 '24 at 10:38
  • Actual rank can 'jump' upwards, i.e. it is lower semi-continuous. "Effective rank" should be continuous as long as we exclude the zero matrix. It is more appropriate to write this using singular values-- for symmetric matrices the eigenvalues almost always have a strictly enforced ordering convention where $\lambda_1$ would always be the largest eigenvalue (or for a different author: always the smallest eigenvalue) but your $\lambda_1$ is neither-- it is the largest modulus eigenvalue. All of these problems ago away when one uses singular values. – user8675309 Jan 24 '24 at 18:45

1 Answers1

0

If $W$ is square and symmetric, then $W^TW = W^2$. The eigenvalues of $W^2$ are precisely the squares of the eigenvalues of $W$, since $Wx=λx⟹W^2x = λ^2 x$. Thus, the singular are $σ_k = \sqrt{λ_k(W^TW)}= \sqrt{λ_k^2} = |\lambda_k|$, assuming $λ_k$ are ordered by magnitude.

So in this special case, $r(W) = \frac{||W||_*}{||W||} = ∑_{k=1}^n \frac{σ_k}{σ_1} = ∑_{k=1}^n \frac{|λ_k|}{|λ_1|}$.

Remark: In the literature, effective rank has been defined differently than in this publication. The effective rank: A measure of effective dimensionality (IEEE 2007) define it as $e^{H(p(A))}$, where $H$ is the Shannon entropy and $p(A)$ is a categorical distribution with probabilities according to the singular values.

Hyperplane
  • 11,659