How to efficiently do dot(dot( A.T, diag(d) ), A ) ?


How to efficiently do dot(dot( A.T, diag(d) ), A ) ?

Hugh Perkins
How to efficiently do dot(dot( A.T, diag(d) ), A ) ?

... where d is of length n, A is n * k, and n >> k

(obviously the i,j element of this is just sum_r A_{r,i} * A_{r,j} *
d_r, which is in theory relatively fast; at least, a lot faster than
two brute-force matrix multiplies!)
_______________________________________________
SciPy-User mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/scipy-user

Re: How to efficiently do dot(dot( A.T, diag(d) ), A ) ?

josef.pktd
On Sun, Sep 9, 2012 at 9:19 PM, Hugh Perkins <[hidden email]> wrote:
> How to efficiently do dot(dot( A.T, diag(d) ), A ) ?

dot( A.T * d , A )

IIRC

Josef

>
> ... where d is of length n, A is n * k, and n >> k
>
> (obviously the i,j element of this is just sum_r A_{r,i} * A_{r,j} *
> d_r, which is in theory relatively fast; at least, a lot faster than
> two brute-force matrix multiplies!)
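A minimal sketch confirming that the broadcast form matches the explicit diag() form, with the shapes from the question (d of length n, A of shape n x k):

```python
import numpy as np

n, k = 1000, 5
rng = np.random.RandomState(0)
A = rng.rand(n, k)
d = rng.rand(n)

# Explicit form: materializes an n x n diagonal matrix -- O(n^2) memory.
slow = np.dot(np.dot(A.T, np.diag(d)), A)

# Broadcast form: A.T * d scales column r of A.T by d[r] -- O(n*k) memory.
fast = np.dot(A.T * d, A)

assert np.allclose(slow, fast)
```

The broadcast form never builds the n x n diagonal matrix, which dominates both time and memory when n >> k.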

Re: How to efficiently do dot(dot( A.T, diag(d) ), A ) ?

Hugh Perkins
In reply to this post by Hugh Perkins
> > How to efficiently do dot(dot( A.T, diag(d) ), A ) ?
>
> dot( A.T * d , A )

This is very good!

Still, the second multiplication looks like it is doing a full
brute-force matrix multiplication:

>>> tic(); d = c.T * a; toc()
Elapsed time: 0.00560903549194
>>> tic(); e = dot( d, c ); toc()
Elapsed time: 0.110434055328

Re: How to efficiently do dot(dot( A.T, diag(d) ), A ) ?

josef.pktd
On Mon, Sep 10, 2012 at 12:34 PM, Hugh Perkins <[hidden email]> wrote:
>> > How to efficiently do dot(dot( A.T, diag(d) ), A ) ?
>>
>> dot( A.T * d , A )
>
> This is very good!
>
> Still, the second multiplication looks like it is doing a full
> brute-force matrix multiplication:

except that the result is symmetric, I don't see any way around that.
elementwise multiplication and sum will usually be slower.
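One partial exception, as a hedged sketch: the BLAS `syrk` routine does exploit symmetry, computing only one triangle of B.T @ B. When d is nonnegative, the product can be written in that form with B = A * sqrt(d). Newer SciPy exposes the routine as `scipy.linalg.blas.dsyrk` (assumed available here; for general d with negative entries this trick does not apply):

```python
import numpy as np
from scipy.linalg.blas import dsyrk

n, k = 1000, 5
rng = np.random.RandomState(0)
A = rng.rand(n, k)
d = rng.rand(n)                      # nonnegative, so sqrt is real

B = A * np.sqrt(d)[:, None]          # scale row r of A by sqrt(d_r)
C = dsyrk(1.0, B, trans=1)           # upper triangle of B.T @ B only
C = np.triu(C) + np.triu(C, 1).T     # symmetrize for comparison

assert np.allclose(C, np.dot(A.T * d, A))
```

Whether this beats a plain `dot` in wall-clock time depends on the BLAS; it does, at least, skip half of the multiply-adds.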

Josef


>
>>>> tic(); d = c.T * a; toc()
> Elapsed time: 0.00560903549194
>>>> tic(); e = dot( d, c ); toc()
> Elapsed time: 0.110434055328

Re: How to efficiently do dot(dot( A.T, diag(d) ), A ) ?

eat-3
In reply to this post by Hugh Perkins
Hi,

On Mon, Sep 10, 2012 at 7:34 PM, Hugh Perkins <[hidden email]> wrote:
> > How to efficiently do dot(dot( A.T, diag(d) ), A ) ?
>
> dot( A.T * d , A )

This is very good!

Still, the second multiplication looks like it is doing a full
brute-force matrix multiplication:

>>> tic(); d = c.T * a; toc()
Elapsed time: 0.00560903549194
>>> tic(); e = dot( d, c ); toc()
Elapsed time: 0.110434055328

It's not so clear what kind of improvements you are looking for. Do you perhaps expect that there exists some magic to skip half of the computations in the dot product when the result is symmetric? Anyway, here is a simple demonstration showing that the dot product is fairly efficient already:
In []: m, n= 500000, 5
In []: A, d= randn(m, n), randn(m)
In []: %timeit A.T* d
10 loops, best of 3: 34.4 ms per loop
In []: mps= m* n/ .0344 # multiplications per second
In []: c= A.T* d
In []: %timeit dot(c, A)
10 loops, best of 3: 68 ms per loop
In []: masps= m* n** 2/ .068 # multiplications and summations per second
In []: masps/ mps
Out[]: 2.529411764705882

In []: allclose(dot(c, A), dot(A.T* d, A))
Out[]: True
In []: sys.version
Out[]: '2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)]'
In []: np.version.version
Out[]: '1.6.0'

Thus multiplication and summation with dot product is some 2.5 times faster than simple multiplication.


My 2 cents,
-eat

Re: How to efficiently do dot(dot( A.T, diag(d) ), A ) ?

Hugh Perkins
In reply to this post by Hugh Perkins
> It's not so clear what kind of improvements you are looking for. Do you
> perhaps expect that there exist some magic to ignore half of the
> computations with dot product, when the result is symmetric?

Here's the same test in matlab:

>> n = 10000;
>> k = 100;
>> a = spdiags(rand(n,1),0,n,n);
>> c = rand(k,n);
>> tic, d = c*a; toc
Elapsed time is 0.007769 seconds.
>> tic, d = d*c'; toc
Elapsed time is 0.007782 seconds.

(vs, in scipy:

>>> tic(); d = c.T * a; toc()
Elapsed time: 0.00560903549194
>>> tic(); e = dot( d, c ); toc()
Elapsed time: 0.110434055328
)

Re: How to efficiently do dot(dot( A.T, diag(d) ), A ) ?

Tony Stillfjord
I'm not sure what's going on here, but when I run your MATLAB code on my system
(copy-pasted) I get results more along the lines of

Elapsed time is 0.007203 seconds.
Elapsed time is 0.031520 seconds.

i.e. much worse than you report but still better than the Python ones. With scipy I get
4.1 ms vs. 60 ms.

'Sparsifying' the code improves things on my end:

In [73]: n = 10000
In [74]: k = 100
In [75]: r = rand(n)
In [76]: a = scipy.sparse.spdiags(r, 0, n, n)
In [77]: c = scipy.sparse.rand(k,n)
In [78]: d = c.dot(a)
In [79]: e = d.dot(c.T)
In [80]: %timeit d = c.dot(a)
1000 loops, best of 3: 1.52 ms per loop
In [81]: %timeit e = d.dot(c.T)
1000 loops, best of 3: 1.12 ms per loop


Tony



On Tue, Sep 11, 2012 at 4:43 AM, Hugh Perkins <[hidden email]> wrote:
> It's not so clear what kind of improvements you are looking for. Do you
> perhaps expect that there exist some magic to ignore half of the
> computations with dot product, when the result is symmetric?

Here's the same test in matlab:

>> n = 10000;
>> k = 100;
>> a = spdiags(rand(n,1),0,n,n);
>> c = rand(k,n);
>> tic, d = c*a; toc
Elapsed time is 0.007769 seconds.
>> tic, d = d*c'; toc
Elapsed time is 0.007782 seconds.

(vs, in scipy:

>>> tic(); d = c.T * a; toc()
Elapsed time: 0.00560903549194
>>> tic(); e = dot( d, c ); toc()
Elapsed time: 0.110434055328
)

Re: How to efficiently do dot(dot( A.T, diag(d) ), A ) ?

Hugh Perkins
In reply to this post by Hugh Perkins
> 'Sparsifying' the code improves things on my end:

It did for me too :-)

But then I noticed that sparse.rand produces a matrix with no elements:

>>> from scipy import sparse
>>> sparse.rand(2,2)
<2x2 sparse matrix of type '<type 'numpy.float64'>'
        with 0 stored elements in COOrdinate format>
>>> sparse.rand(2,2).todense()
matrix([[ 0.,  0.],
        [ 0.,  0.]])
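That is because scipy.sparse.rand defaults to density=0.01, so a 2x2 matrix rounds to zero stored elements. Passing the density explicitly gives a non-empty matrix:

```python
import scipy.sparse

# density defaults to 0.01; pass it explicitly to get stored elements
m = scipy.sparse.rand(2, 2, density=0.5)
print(m.nnz)  # -> 2 stored elements out of 4
```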

You can make a sparse rand matrix from a dense one:

c = sparse.coo_matrix(scipy.rand(n,k))

This gives worse timings than a non-sparse c for me:

>>> c = rand(n,k);
>>> tic(); d = c.T * a; toc()
Elapsed time: 0.0610790252686
>>> tic(); e = dot( d, c ); toc()
Elapsed time: 0.239707946777

>>> import scipy.sparse as sparse
>>> c = sparse.coo_matrix(scipy.rand(n,k))
>>> tic(); d = c.T * a; toc()
Elapsed time: 0.137360095978
>>> tic(); e = dot( d, c ); toc()
Elapsed time: 1.74925804138
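As a hedged aside: COO matrices have no fast arithmetic of their own, and scipy's sparse products generally run in CSR/CSC, so building c as CSR at least avoids format conversions. (Since the k x k result is dense anyway, a dense c is still likely fastest, as the timings above suggest.)

```python
import numpy as np
import scipy.sparse as sparse

n, k = 1000, 10
d = np.random.rand(n)
a = sparse.spdiags(d, 0, n, n)
c = sparse.csr_matrix(np.random.rand(n, k))  # CSR instead of COO

# sparse-sparse products, no COO round-trips; dense k x k result at the end
e = (c.T * a * c).toarray()
assert e.shape == (k, k)
```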

Re: How to efficiently do dot(dot( A.T, diag(d) ), A ) ?

Tony Stillfjord
Oh. Of course. Sorry about that, I forgot about the density argument to sparse.rand
and didn't check my results.

That certainly explains why it was so fast...and since d can't be expected to be sparse
the sparse approach is kind of dumb.

Tony

On Tue, Sep 11, 2012 at 11:10 AM, Hugh Perkins <[hidden email]> wrote:
> 'Sparsifying' the code improves things on my end:

It did for me too :-)

But then I noticed that sparse.rand produces a matrix with no elements:

>>> from scipy import sparse
>>> sparse.rand(2,2)
<2x2 sparse matrix of type '<type 'numpy.float64'>'
        with 0 stored elements in COOrdinate format>
>>> sparse.rand(2,2).todense()
matrix([[ 0.,  0.],
        [ 0.,  0.]])

You can make a sparse rand matrix from a dense one:

c = sparse.coo_matrix(scipy.rand(n,k))

This gives worse timings than a non-sparse c for me:

>>> c = rand(n,k);
>>> tic(); d = c.T * a; toc()
Elapsed time: 0.0610790252686
>>> tic(); e = dot( d, c ); toc()
Elapsed time: 0.239707946777

>>> import scipy.sparse as sparse
>>> c = sparse.coo_matrix(scipy.rand(n,k))
>>> tic(); d = c.T * a; toc()
Elapsed time: 0.137360095978
>>> tic(); e = dot( d, c ); toc()
Elapsed time: 1.74925804138

Re: How to efficiently do dot(dot( A.T, diag(d) ), A ) ?

Pauli Virtanen-3
In reply to this post by Hugh Perkins
Hi,

First a quick note: Please, when posting any benchmark result, always
include the full test code for each case.

11.09.2012 05:43, Hugh Perkins wrote:
>> It's not so clear what kind of improvements you are looking for. Do you
>> perhaps expect that there exist some magic to ignore half of the
>> computations with dot product, when the result is symmetric?
>
> Here's the same test in matlab:
[clip]

You are benchmarking the underlying BLAS libraries here. Matlab comes
with Intel MKL, whereas your Numpy/Scipy is likely linked with ATLAS.

MKL can be faster than ATLAS, but on the other hand you can also link
the Numpy/Scipy combination against MKL (provided you buy a license from
Intel).

In your Matlab example, the first matrix product is a sparse-dense one,
whereas the second is dense-dense (MKL-fast). That the speeds of the two
operations happen to coincide is likely a coincidence.

In the Numpy case without sparse matrices, the first product is
broadcast-multiplication (faster than a sparse-dense matrix product),
whereas the second product is a dense-dense matrix multiplication.


Here are results on one (slowish) machine:
----
import numpy as np
n = 10000
k = 100
a = np.random.rand(n)
c = np.random.rand(k,n)
d = c*a
e = np.dot(d, c.T)
%timeit d = c*a
# -> 100 loops, best of 3: 11 ms per loop
%timeit e = np.dot(d, c.T)
# -> 10 loops, best of 3: 538 ms per loop
----
n = 10000;
k = 100;
a = spdiags(rand(n,1),0,n,n);
c = rand(k,n);
tic, d = c*a; toc
% -> Elapsed time is 0.018380 seconds.
tic, e = d*c'; toc
% -> Elapsed time is 0.138673 seconds.

--
Pauli Virtanen


Re: How to efficiently do dot(dot( A.T, diag(d) ), A ) ?

Hugh Perkins
On Wed, Sep 12, 2012 at 1:15 AM, Pauli Virtanen <[hidden email]> wrote:
> You are here bencmarking the underlying BLAS libraries. Matlab comes
> with Intel MKL, whereas your Numpy/Scipy is likely linked with ATLAS.

Hi Pauli,

Ok, that's good information.  I think that makes a lot of sense, and
sounds plausible to me.

> MKL can be faster than ATLAS, but on the other hand you can also link
> the Numpy/Scipy combination against MKL (provided you buy a license from
> Intel).

Ok, that sounds reasonable to me.

It makes me wonder though.  There is an open-source project called
'Eigen', for C++.
It seems to provide good performance for matrix-matrix multiplication,
comparable to Intel MKL, and significantly better than uBLAS:
http://eigen.tuxfamily.org/index.php?title=Benchmark  I'm not sure
what the relationship is between uBLAS and BLAS?

I wonder if it might be worth scipy providing an option to link with
eigen3?  I confess I don't personally have time to do that though.

Re: How to efficiently do dot(dot( A.T, diag(d) ), A ) ?

Hugh Perkins
(Apparently Eigen can use multithreading too.  This is one thing which
matlab does automatically by the way: automatically parallelize out
certain matrix operations amongst multiple cores.

http://eigen.tuxfamily.org/dox/TopicMultiThreading.html

)

Re: How to efficiently do dot(dot( A.T, diag(d) ), A ) ?

Pauli Virtanen-3
In reply to this post by Hugh Perkins
11.09.2012 20:28, Hugh Perkins wrote:
[clip]
> It makes me wonder though.  There is an opensource project called
> 'Eigen', for C++.
> It seems to provide good performance for matrix-matrix multiplication,
> comparable to Intel MKL, and significantly better than ublas
> http://eigen.tuxfamily.org/index.php?title=Benchmark  I'm not sure
> what the relationship is between ublas and BLAS?

Eigen doesn't provide a BLAS interface, so it would be quite a lot of
work to use it.

Moreover, it probably derives some of its speed for small matrices from
compile-time specialization, which is not available via a BLAS interface.

However, OpenBLAS/GotoBLAS could be better than ATLAS; it also seems
to be doing well in the benchmarks you linked to:

    https://github.com/xianyi/OpenBLAS

If you are on Linux, you can easily swap the BLAS libraries used, like so:

*** OpenBLAS:

LD_PRELOAD=/usr/lib/openblas-base/libopenblas.so.0 ipython
...
In [11]: %timeit e = np.dot(d, c.T)
100 loops, best of 3: 14.8 ms per loop

*** ATLAS:

LD_PRELOAD=/usr/lib/atlas-base/atlas/libblas.so.3gf ipython
In [12]: %timeit e = np.dot(d, c.T)
10 loops, best of 3: 20.8 ms per loop

*** Reference BLAS:

LD_PRELOAD=/usr/lib/libblas/libblas.so.3gf:/usr/lib/libatlas.so ipython
...
In [11]: %timeit e = np.dot(d, c.T)
10 loops, best of 3: 89.3 ms per loop


Yet another thing to watch out for is the possible use of multiple
processors at once (although I'm not sure how much that will matter in
this particular case).

--
Pauli Virtanen


Re: How to efficiently do dot(dot( A.T, diag(d) ), A ) ?

Pauli Virtanen-3
In reply to this post by Pauli Virtanen-3
11.09.2012 20:15, Pauli Virtanen wrote:
> Here are results on one (slowish) machine:

Ok, just for completeness: the bad performance for `np.dot` on this
machine comes from the fact that Numpy is not linked with any BLAS
library, so it falls back to a slow and naive algorithm. (You can check
this as follows: if `numpy.dot.__module__` is 'numpy.core._dotblas', you
have some BLAS.)
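(On newer NumPy releases the `numpy.core._dotblas` module no longer exists; there, `np.show_config()` reports which BLAS/LAPACK NumPy was built against instead:

```python
import numpy as np

# Prints the BLAS/LAPACK build configuration (ATLAS, OpenBLAS, MKL, ...)
np.show_config()
```
)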

> ----
> import numpy as np
> n = 10000
> k = 100
> a = np.random.rand(n)
> c = np.random.rand(k,n)
> d = c*a
> e = np.dot(d, c.T)
> %timeit d = c*a
> # -> 100 loops, best of 3: 11 ms per loop
> %timeit e = np.dot(d, c.T)
> # -> 10 loops, best of 3: 538 ms per loop
> ----
> n = 10000;
> k = 100;
> a = spdiags(rand(n,1),0,n,n);
> c = rand(k,n);
> tic, d = c*a; toc
> % -> Elapsed time is 0.018380 seconds.
> tic, e = d*c'; toc
> % -> Elapsed time is 0.138673 seconds.
>



Re: How to efficiently do dot(dot( A.T, diag(d) ), A ) ?

Hugh Perkins
In reply to this post by Pauli Virtanen-3
On Wed, Sep 12, 2012 at 2:21 AM, Pauli Virtanen <[hidden email]> wrote:
> However, OpenBLAS/GotoBLAS could be better than ATLAS, it seems to be
> also doing well in the benchmarks you linked to:
>
> If you are on Linux, you can easily swap the BLAS libraries used, like so:

Ah great!  This is 4 times faster for me:

Using atlas:
$ python testmult.py
Elapsed time: 0.0221469402313
Elapsed time: 0.21438908577

Using goto/openblas:
$ python testmult.py
Elapsed time: 0.0214130878448
Elapsed time: 0.051687002182

Note that I had to do 'sudo update-alternatives --config
liblapack.so.3gf', and pick liblapack.

On Ubuntu, I didn't need to modify LD_LIBRARY_PATH: installing
openblas-base is enough to half-select openblas, and actually, to
break scipy, until the update-alternatives step above is done.
http://stackoverflow.com/questions/12249089/how-to-use-numpy-with-openblas-instead-of-atlas-in-ubuntu

Apparently openblas supports multicore too, though I haven't
confirmed that it works in conjunction with scipy yet.

Re: How to efficiently do dot(dot( A.T, diag(d) ), A ) ?

Hugh Perkins
OpenBLAS does work tons better than atlas and reference.

I'm fairly sure that multicore is not really supported, though, or at
least not supported very well.  On a 12-core machine, using openblas:

$ python testmult.py
Elapsed time: 0.00923895835876
Elapsed time: 0.0242748260498
$ export OMP_NUM_THREADS=12
$ python testmult.py
Elapsed time: 0.00923895835876
Elapsed time: 0.0255119800568
$ export OMP_NUM_THREADS=1
$ python testmult.py
Elapsed time: 0.00781798362732
Elapsed time: 0.0324649810791

On the same machine, using matlab:
>> n = 10000;
>> k = 100;
>> a = spdiags(rand(n,1),0,n,n);
>> c = rand(k,n);
>> tic, d = c*a; toc
Elapsed time is 0.005955 seconds.
>> tic, d = d*c'; toc
Elapsed time is 0.006201 seconds.

(code for testmult.py:
from __future__ import division
from scipy import *
import scipy
import scipy.sparse as sparse
from tictoc import tic,toc

n = 10000;
k = 100;
a = sparse.spdiags( rand(n), 0, n, n )
c = rand(n,k);

tic(); d = c.T * a; toc()
tic(); e = dot( d, c ); toc()

tictoc.py:
import time

start = 0

def tic():
   global start
   start = time.time()

def toc():
   global start
   print "Elapsed time: " + str( time.time() - start )
   start = time.time()
)


Re: How to efficiently do dot(dot( A.T, diag(d) ), A ) ?

Hugh Perkins
(pressed send by accident too soon, sorry :-(

The good news is: if I use larger matrices, which is actually closer
to my target scenario, then it seems that scipy with openblas is
*very* good on a dual-core machine:

matlab:

>> n = 200000;
k = 100;
a = spdiags(rand(n,1),0,n,n);
c = rand(k,n);
tic, d = c*a*c'; toc
tic, d = c*a*c'; toc

Elapsed time is 1.322931 seconds.
Elapsed time is 1.308237 seconds.

python:

python testmult.py
Elapsed time: 0.408543109894
Elapsed time: 1.15053105354

However, on a 12-core machine, it seems that scipy/openblas is rather
less competitive:

matlab:

>> n = 200000;
>> k = 100;
>> a = spdiags(rand(n,1),0,n,n);
>> c = rand(k,n);
>> tic, d = c*a; toc
Elapsed time is 0.106401 seconds.
>> tic, d = d*c'; toc
Elapsed time is 0.098882 seconds.

python:

$ python testmult2.py
Elapsed time: 0.168606996536
Elapsed time: 0.584333896637

python code:

from __future__ import division
from scipy import *
import scipy.sparse as sparse
import time

start = 0
def tic():
   global start
   start = time.time()

def toc():
   global start
   print "Elapsed time: " + str( time.time() - start )
   start = time.time()

n = 200000;
k = 100;
a = sparse.spdiags( rand(n), 0, n, n )
c = rand(n,k);

tic(); d = c.T * a; toc()
tic(); e = dot( d, c ); toc()

Note that openblas seems to use multithreading by default:

$ export OMP_NUM_THREADS=1
$ python testmult2.py
Elapsed time: 0.145273923874
Elapsed time: 0.750689983368

$ unset OMP_NUM_THREADS
$ python testmult2.py
Elapsed time: 0.168558120728
Elapsed time: 0.578655004501

$ export OMP_NUM_THREADS=12
$ python testmult2.py
Elapsed time: 0.168849945068
Elapsed time: 0.591191053391

... but this is apparently not enough to get quite as good as matlab :-(

Looking at the ratio of the times for 12 threads and 1 thread, it
looks like the openblas performance doesn't change much with the
number of threads.

... but perhaps this is now out of the scope of scipy, and is
something I should check with openblas?

Re: How to efficiently do dot(dot( A.T, diag(d) ), A ) ?

Tiziano Zito-2
In reply to this post by Hugh Perkins
> > However, OpenBLAS/GotoBLAS could be better than ATLAS, it seems to be
> > also doing well in the benchmarks you linked to:
> >
> > If you are on Linux, you can easily swap the BLAS libraries used, like so:
>
> Ah great!  This is 4 times faster for me:
>
> Using atlas:
> $ python testmult.py
> Elapsed time: 0.0221469402313
> Elapsed time: 0.21438908577
>
> Using goto/openblas:
> $ python testmult.py
> Elapsed time: 0.0214130878448
> Elapsed time: 0.051687002182
>

What ATLAS package are you using? If you are on Debian/Ubuntu, the
default libatlas3-base is *not* optimized for your CPU. With ATLAS
most optimizations happen at build time, so to take full advantage
of ATLAS you *need* to compile it on the machine you are going to
use it. The binary package you get from Ubuntu/Debian has been
compiled on the Debian developer machine and is not going to be good
for yours.
You need to follow the (very simple) instructions found in
/usr/share/doc/libatlas3-base/README.Debian to compile ATLAS on your
CPU, so that ATLAS actually has a chance to optimize. In my
experience this can make ATLAS a lot (up to 10x) faster.

For the lazy, here are the instructions:

# cd /tmp
# apt-get source atlas
# apt-get build-dep atlas
# cd atlas-3.8.4
# fakeroot debian/rules custom
# cd ..

this will produce a series of deb packages that you can install with

# dpkg -i *.deb


HTH,
Tiziano
