Finding matrix inverses in parallell by C-G column-by-column?

Question

I did a small experiment the other week and managed to invert some small matrices by the following procedure:

Left hand side = matrix we want to invert (say $A$). If non-symmetric, instead $A^TA$
First standard basis vector $v_1 = [1,0,\cdots,0]^T$ becomes RHS if non-symmetric LHS, if symmetric then $A^Tv_1$ instead.
Solve with conjugate-gradient.
Rotate the $1$ in rhs one spot $v_2=[0,1,0,\cdots,0]^T$ becomes RHS if non-symmetric LHS, if symmetric then $A^Tv_2$
Repeat for all columns of standard basis vectors: Solve with Conjugate-Gradient.
Collect the solution vectors as columns in the new matrix.
Done!

Everything will be extremely parallellizable. Each column could be assigned a set of processing units because each column is independent of the solvin of the other columns. Everything to write and accumulate will be local in memory for the (groups of) cores. The matrix to invert can be stored in common "global memory".

(Many modern hardwares, for example GPU architectures has this setup)

Do there already exist software packages which do this? If not, why not? Is there some obvious drawback that I am not seeing?

EDIT:

Example where it is interesting to solve for a subset of rows or columns of the inverse: We discretize a differential equation. The solution to the linear equation system for this discretized differential equation is the same as to multiply the inverse matrix with a vector (right-hand-side). Then one row in the inverse is the linear combination of "initial/boundary" conditions for the solution at that particular point in space and/or time. For different rows we can see how this varies throughout the definition space in a very illustrative way.

Here is a motivational gif for why it is useful. The equation system it illustrates the solution for is this Nyquist sampling question. We can see the kernel shifting to become the linear interpolation weight as the output position varies over the signal.

Well.. Each CG costs $O(n^3)$, so your overall cost is $O(n^3)$ parallel, but the common algorithms for the inverse is $O(n^3)$ sequential, so you're taking a lot of reasources for doing something relatively simple... — Exodd, Mar 11 '18 at 01:54
Yep assuming $A$ is dense then you are right complexity wise. But we still have things like pivoting which could cause branching in sequential algorithms seriously reducing the peak performance potential of a pipelined CPU. — mathreadler, Mar 11 '18 at 10:04
A quick research indicates they got down to $O(\log^2(n))$ in parallel time https://epubs.siam.org/doi/abs/10.1137/0205040 — Exodd, Mar 11 '18 at 11:28
also... wut?
"Submitted: 10 June 1975 Published online: 13 July 2006" — mathreadler, Mar 11 '18 at 11:39
On the topic, do you have any particular application of interest in mind? If $A$ is sparse and large (when CG is normally used), computing its inverse would usually be a waste of time (it is usually not needed to have an explicit inverse), resources (inverse of a sparse matrix is usually full), and accuracy (inverse times rhs is often less accurate than the linsolve). Not even mentioning that you'd probably need a good preconditioner to make CG to converge fast. — Algebraic Pavel, Mar 12 '18 at 10:46
@AlgebraicPavel : Could be any application. Solving for a specific column or row in the inverse can be valuable in trying to figure out if we are solving the right problem in the first place. When designing algorithms one of the largest difficulties is finding the right cost functions. Rows in the inverted matrix can help us see conceptually what terms may be missing in our left hand side. — mathreadler, Mar 12 '18 at 12:41
I was just surprised that a paper containing recent results of the latest years algorithms for parallell hardware would have been submitted in the 70s. — mathreadler, Mar 12 '18 at 12:42

Finding matrix inverses in parallell by C-G column-by-column?

0 Answers0