PCA for sparse matrices, tolerance of eigenvalues

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

PCA for sparse matrices, tolerance of eigenvalues

Jaidev Deshpande
Dear all,

I tried using the 'scipy.sparse.eigs' tool for performing principal component analysis on a matrix which is roughly 80% sparse.

First of all, is that a good way to go about it?

Second, the operation failed when the function failed to converge on accurate eigenvalues. I noticed the 'tol' attribute in the function, but how does one define a reasonable tolerance and calculate it?

Thanks

_______________________________________________
SciPy-User mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/scipy-user
Reply | Threaded
Open this post in threaded view
|

Re: PCA for sparse matrices, tolerance of eigenvalues

Pauli Virtanen-3
Thu, 24 Feb 2011 09:58:00 +0530, Jaidev Deshpande wrote:
> I tried using the 'scipy.sparse.eigs' tool for performing principal
> component analysis on a matrix which is roughly 80% sparse.
>
> First of all, is that a good way to go about it?

If it doesn't work as a dense matrix, then you don't have much choice
than to rely on an iterative method. 'eigs' uses ARPACK.

> Second, the operation failed when the function failed to converge on
> accurate eigenvalues. I noticed the 'tol' attribute in the function, but
> how does one define a reasonable tolerance and calculate it?

As the docstring states, `tol` is the desired relative tolerance for the
eigenvalue. One should expect that the error is approximately

        abs(exact_value - estimate) < tol*abs(exact_value)

You can also try increasing `maxiter` or `ncv` to make it try harder to
reach convergence. What are reasonable values for these paremeters
depends on the problem in question.

You can consult the ARPACK user guide for more:

        http://www.sc.fsu.edu/~burkardt/pdf/arpack.pdf

Check e.g. the section on "Stopping criterion".

_______________________________________________
SciPy-User mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/scipy-user
Reply | Threaded
Open this post in threaded view
|

Re: PCA for sparse matrices, tolerance of eigenvalues

Gael Varoquaux
On Thu, Feb 24, 2011 at 11:37:18AM +0000, Pauli Virtanen wrote:
> > First of all, is that a good way to go about it?

> If it doesn't work as a dense matrix, then you don't have much choice
> than to rely on an iterative method. 'eigs' uses ARPACK.

For the application of truncated PCA, for which error on small
eigenvalues are not important, and very large data (much much larger than
the cache size) randomized projection methods can work increadibly well
(partly because they render the problem local in memory, and with large
data memory bandwidth is a killer). We have a fast truncated SVD in the
scikit learn that is fairly standalone, and can be extracted:
https://github.com/scikit-learn/scikit-learn/blob/master/scikits/learn/utils/extmath.py

Don't use this for other applications than PCA!

G
_______________________________________________
SciPy-User mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/scipy-user
Reply | Threaded
Open this post in threaded view
|

Re: PCA for sparse matrices, tolerance of eigenvalues

David Cournapeau
In reply to this post by Jaidev Deshpande
On Thu, Feb 24, 2011 at 1:28 PM, Jaidev Deshpande
<[hidden email]> wrote:
> Dear all,
> I tried using the 'scipy.sparse.eigs' tool for performing principal
> component analysis on a matrix which is roughly 80% sparse.
> First of all, is that a good way to go about it?

It generally is, but keep in mind that 80 % sparse is not that sparse.
Indeed, for efficient sparse matrices representations, each item needs
besides its value two integer values for indexing. If you use single
precision and 64 bits indexing, you are effectively using as much
memory as a dense representation.

Also, dense methods are often much faster than sparse ones for a same
matrix size, especially since in scipy sparse matrices are not as well
optimized as numpy arrays. You may want to compare both methods if you
can.

cheers,

David
_______________________________________________
SciPy-User mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/scipy-user