# How to efficiently do dot(dot( A.T, diag(d) ), A ) ?

## How to efficiently do dot(dot( A.T, diag(d) ), A ) ?

How do I efficiently compute dot(dot( A.T, diag(d) ), A )?

... where d is of length n, A is n * k, and n >> k.

(Obviously the i,j element of this is just sum_r A_{r,i} * A_{r,j} * d_r, which is in theory relatively fast; at least, a lot faster than two brute-force matrix multiplies!)

_______________________________________________
SciPy-User mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/scipy-user
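To make the definition concrete, here is a small sketch (mine, not from the original post; the sizes and random seed are illustrative) checking the element-wise formula against the naive triple product:

```python
import numpy as np

rng = np.random.RandomState(0)
n, k = 1000, 5            # illustrative sizes; the post assumes n >> k
A = rng.rand(n, k)
d = rng.rand(n)

# Naive version: materializes an n x n diagonal matrix, so it costs
# O(n^2) memory and performs two full matrix multiplies.
M_naive = np.dot(np.dot(A.T, np.diag(d)), A)

# Element-wise definition from the post:
#   M[i, j] = sum_r A[r, i] * A[r, j] * d[r]
M_loop = np.empty((k, k))
for i in range(k):
    for j in range(k):
        M_loop[i, j] = np.sum(A[:, i] * A[:, j] * d)

assert np.allclose(M_naive, M_loop)
```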

## Re: How to efficiently do dot(dot( A.T, diag(d) ), A ) ?

On Sun, Sep 9, 2012 at 9:19 PM, Hugh Perkins <[hidden email]> wrote:
> How do I efficiently compute dot(dot( A.T, diag(d) ), A )?

dot( A.T * d , A )

IIRC

Josef

> ... where d is of length n, A is n * k, and n >> k
>
> (obviously the i,j element of this is just sum_r A_{r,i} * A_{r,j} *
> d_r, which is in theory relatively fast; at least, a lot faster than
> two brute-force matrix multiplies!)
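Josef's one-liner works because multiplying the (k, n) array A.T by the length-n vector d broadcasts d across each row, scaling column r of A.T by d[r], which is exactly what right-multiplying by diag(d) does. A minimal check (my own sketch, not from the thread):

```python
import numpy as np

rng = np.random.RandomState(42)
n, k = 2000, 10
A = rng.rand(n, k)
d = rng.rand(n)

# Broadcasting: (k, n) * (n,) scales each column of A.T by d[r],
# so no n x n diagonal matrix is ever formed.
scaled = A.T * d              # shape (k, n)
M = np.dot(scaled, A)         # shape (k, k)

M_ref = np.dot(np.dot(A.T, np.diag(d)), A)
assert np.allclose(M, M_ref)
```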

## Re: How to efficiently do dot(dot( A.T, diag(d) ), A ) ?

In reply to this post by Hugh Perkins

> > How do I efficiently compute dot(dot( A.T, diag(d) ), A )?
>
> dot( A.T * d , A )

This is very good!

Still, the second multiplication looks like it is doing a full
brute-force matrix multiplication:

>>> tic(); d = c.T * a; toc()
Elapsed time: 0.00560903549194
>>> tic(); e = dot( d, c ); toc()
Elapsed time: 0.110434055328

## Re: How to efficiently do dot(dot( A.T, diag(d) ), A ) ?

On Mon, Sep 10, 2012 at 12:34 PM, Hugh Perkins <[hidden email]> wrote:
>> > How do I efficiently compute dot(dot( A.T, diag(d) ), A )?
>>
>> dot( A.T * d , A )
>
> This is very good!
>
> Still, the second multiplication looks like it is doing a full
> brute-force matrix multiplication:

Except that the result is symmetric, I don't see any way around that.
Elementwise multiplication and sum will usually be slower.

Josef

> >>> tic(); d = c.T * a; toc()
> Elapsed time: 0.00560903549194
> >>> tic(); e = dot( d, c ); toc()
> Elapsed time: 0.110434055328
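There is in fact a BLAS routine that skips half of a symmetric product: the rank-k update (xSYRK) computes only one triangle of B.T @ B. When d is non-negative, scaling the rows of A by sqrt(d) reduces the problem to exactly that form. This sketch is mine, not from the thread, and assumes a SciPy recent enough to expose `scipy.linalg.blas.dsyrk` (the SciPy of 2012 did not):

```python
import numpy as np
from scipy.linalg.blas import dsyrk

rng = np.random.RandomState(0)
n, k = 5000, 8
A = rng.rand(n, k)
d = rng.rand(n)                      # non-negative, so sqrt(d) is real

B = A * np.sqrt(d)[:, np.newaxis]    # scale row r of A by sqrt(d[r])
# trans=1 computes 1.0 * B.T @ B, filling only the upper triangle.
U = dsyrk(1.0, B, trans=1)
M = np.triu(U) + np.triu(U, 1).T     # mirror to recover the full matrix

assert np.allclose(M, np.dot(A.T * d, A))
```

Whether this beats a plain `dot` in practice depends on the BLAS implementation; it halves the arithmetic but not the memory traffic over A.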

## Re: How to efficiently do dot(dot( A.T, diag(d) ), A ) ?

In reply to this post by Hugh Perkins

Hi,

On Mon, Sep 10, 2012 at 7:34 PM, Hugh Perkins wrote:
> > How do I efficiently compute dot(dot( A.T, diag(d) ), A )?
> >
> > dot( A.T * d , A )
>
> This is very good!
>
> Still, the second multiplication looks like it is doing a full
> brute-force matrix multiplication:
>
> >>> tic(); d = c.T * a; toc()
> Elapsed time: 0.00560903549194
> >>> tic(); e = dot( d, c ); toc()
> Elapsed time: 0.110434055328

It's not so clear what kind of improvements you are looking for. Do you
perhaps expect that there exists some magic to ignore half of the
computations with the dot product, when the result is symmetric?

Anyway, here is a simple demonstration to show that the dot product is
fairly efficient already:

In []: m, n = 500000, 5
In []: A, d = randn(m, n), randn(m)
In []: %timeit A.T * d
10 loops, best of 3: 34.4 ms per loop
In []: mps = m * n / .0344  # multiplications per second
In []: c = A.T * d
In []: %timeit dot(c, A)
10 loops, best of 3: 68 ms per loop
In []: masps = m * n**2 / .068  # multiplications and summations per second
In []: masps / mps
Out[]: 2.529411764705882
In []: allclose(dot(c, A), dot(A.T * d, A))
Out[]: True
In []: sys.version
Out[]: '2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)]'
In []: np.version.version
Out[]: '1.6.0'

Thus multiplication and summation with the dot product is some 2.5 times
faster than simple multiplication.

My 2 cents,
-eat

## Re: How to efficiently do dot(dot( A.T, diag(d) ), A ) ?

In reply to this post by Hugh Perkins

> It's not so clear what kind of improvements you are looking for. Do you
> perhaps expect that there exists some magic to ignore half of the
> computations with the dot product, when the result is symmetric?

Here's the same test in matlab:

>> n = 10000;
>> k = 100;
>> a = spdiags(rand(n,1),0,n,n);
>> c = rand(k,n);
>> tic, d = c*a; toc
Elapsed time is 0.007769 seconds.
>> tic, d = d*c'; toc
Elapsed time is 0.007782 seconds.

(vs, in scipy:

>>> tic(); d = c.T * a; toc()
Elapsed time: 0.00560903549194
>>> tic(); e = dot( d, c ); toc()
Elapsed time: 0.110434055328
)

## Re: How to efficiently do dot(dot( A.T, diag(d) ), A ) ?

I'm not sure what's going on here, but when I run your MATLAB code on my
system (copy-pasted), I get results more along the lines of

Elapsed time is 0.007203 seconds.
Elapsed time is 0.031520 seconds.

i.e. much worse than you report, but still better than the Python ones.
With scipy I get 4.1 ms vs. 60 ms.

'Sparsifying' the code improves things on my end:

In : n = 10000
In : k = 100
In : r = rand(n)
In : a = scipy.sparse.spdiags(r, 0, n, n)
In : c = scipy.sparse.rand(k, n)
In : d = c.dot(a)
In : e = d.dot(c.T)
In : %timeit d = c.dot(a)
1000 loops, best of 3: 1.52 ms per loop
In : %timeit e = d.dot(c.T)
1000 loops, best of 3: 1.12 ms per loop

Tony

On Tue, Sep 11, 2012 at 4:43 AM, Hugh Perkins wrote:
> Here's the same test in matlab:
>
> >> n = 10000;
> >> k = 100;
> >> a = spdiags(rand(n,1),0,n,n);
> >> c = rand(k,n);
> >> tic, d = c*a; toc
> Elapsed time is 0.007769 seconds.
> >> tic, d = d*c'; toc
> Elapsed time is 0.007782 seconds.

## Re: How to efficiently do dot(dot( A.T, diag(d) ), A ) ?

In reply to this post by Hugh Perkins

> 'Sparsifying' the code improves things on my end:

It did for me too :-)  But then I noticed that sparse.rand produces a
matrix with no elements:

>>> from scipy import sparse
>>> sparse.rand(2,2)
<2x2 sparse matrix of type ''
        with 0 stored elements in COOrdinate format>
>>> sparse.rand(2,2).todense()
matrix([[ 0.,  0.],
        [ 0.,  0.]])

You can make a sparse rand matrix from a dense one:

c = sparse.coo_matrix(scipy.rand(n,k))

This gives worse timings than a non-sparse c for me:

>>> c = rand(n,k);
>>> tic(); d = c.T * a; toc()
Elapsed time: 0.0610790252686
>>> tic(); e = dot( d, c ); toc()
Elapsed time: 0.239707946777
>>> import scipy.sparse as sparse
>>> c = sparse.coo_matrix(scipy.rand(n,k))
>>> tic(); d = c.T * a; toc()
Elapsed time: 0.137360095978
>>> tic(); e = dot( d, c ); toc()
Elapsed time: 1.74925804138
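For reference, `scipy.sparse.rand` takes a `density` argument that defaults to 0.01, which is why a 2x2 sample usually has zero stored elements; passing it explicitly gives a genuinely populated sparse matrix. A quick sketch of mine, not from the thread:

```python
import scipy.sparse as sparse

# density defaults to 0.01: round(0.01 * 2 * 2) = 0 stored elements.
empty = sparse.rand(2, 2)
print(empty.nnz)                     # 0

# Ask for half of the entries to be filled instead.
half = sparse.rand(2, 2, density=0.5)
print(half.nnz)                      # 2

# density=1.0 stores every entry, i.e. a dense matrix in sparse clothing.
full = sparse.rand(100, 100, density=1.0)
print(full.nnz)                      # 10000
```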

## Re: How to efficiently do dot(dot( A.T, diag(d) ), A ) ?

Oh. Of course. Sorry about that; I forgot about the density argument to
sparse.rand and didn't check my results. That certainly explains why it
was so fast... and since d can't be expected to be sparse, the sparse
approach is kind of dumb.

Tony

On Tue, Sep 11, 2012 at 11:10 AM, Hugh Perkins wrote:
> > 'Sparsifying' the code improves things on my end:
>
> It did for me too :-)  But then I noticed that sparse.rand produces a
> matrix with no elements:
> [clip]

## Re: How to efficiently do dot(dot( A.T, diag(d) ), A ) ?

In reply to this post by Hugh Perkins

Hi,

First a quick note: Please, when posting any benchmark results, always
include the full test code for each case.

11.09.2012 05:43, Hugh Perkins kirjoitti:
>> It's not so clear what kind of improvements you are looking for. Do you
>> perhaps expect that there exists some magic to ignore half of the
>> computations with the dot product, when the result is symmetric?
>
> Here's the same test in matlab:
[clip]

You are here benchmarking the underlying BLAS libraries. Matlab comes
with Intel MKL, whereas your Numpy/Scipy is likely linked with ATLAS.
MKL can be faster than ATLAS, but on the other hand you can also link
the Numpy/Scipy combination against MKL (provided you buy a license from
Intel).

In your Matlab example, the first matrix product is a sparse-dense one,
whereas the second is dense-dense (MKL-fast). That the speeds of the two
operations happen to coincide is likely a coincidence.

In the Numpy case without sparse matrices, the first product is a
broadcast multiplication (faster than a sparse-dense matrix product),
whereas the second product is a dense-dense matrix multiplication.

Here are results on one (slowish) machine:

----
import numpy as np
n = 10000
k = 100
a = np.random.rand(n)
c = np.random.rand(k,n)
d = c*a
e = np.dot(d, c.T)
%timeit d = c*a
# -> 100 loops, best of 3: 11 ms per loop
%timeit e = np.dot(d, c.T)
# -> 10 loops, best of 3: 538 ms per loop
----
n = 10000;
k = 100;
a = spdiags(rand(n,1),0,n,n);
c = rand(k,n);
tic, d = c*a; toc
% -> Elapsed time is 0.018380 seconds.
tic, e = d*c'; toc
% -> Elapsed time is 0.138673 seconds.
----

--
Pauli Virtanen

## Re: How to efficiently do dot(dot( A.T, diag(d) ), A ) ?

On Wed, Sep 12, 2012 at 1:15 AM, Pauli Virtanen <[hidden email]> wrote:
> You are here benchmarking the underlying BLAS libraries. Matlab comes
> with Intel MKL, whereas your Numpy/Scipy is likely linked with ATLAS.

Hi Pauli,

Ok, that's good information. I think that makes a lot of sense, and
sounds plausible to me.

> MKL can be faster than ATLAS, but on the other hand you can also link
> the Numpy/Scipy combination against MKL (provided you buy a license from
> Intel).

Ok, that sounds reasonable to me.

It makes me wonder though. There is an open-source project called
'Eigen', for C++. It seems to provide good performance for matrix-matrix
multiplication, comparable to Intel MKL, and significantly better than
uBLAS: http://eigen.tuxfamily.org/index.php?title=Benchmark
I'm not sure what the relationship is between uBLAS and BLAS?

I wonder if it might be worth scipy providing an option to link with
eigen3? I confess I don't personally have time to do that though.

## Re: How to efficiently do dot(dot( A.T, diag(d) ), A ) ?

(Apparently Eigen can use multithreading too. This is one thing which
matlab does automatically, by the way: automatically parallelizing
certain matrix operations across multiple cores.
http://eigen.tuxfamily.org/dox/TopicMultiThreading.html)

## Re: How to efficiently do dot(dot( A.T, diag(d) ), A ) ?

In reply to this post by Hugh Perkins

11.09.2012 20:28, Hugh Perkins kirjoitti:
[clip]
> It makes me wonder though. There is an open-source project called
> 'Eigen', for C++. It seems to provide good performance for matrix-matrix
> multiplication, comparable to Intel MKL, and significantly better than
> uBLAS: http://eigen.tuxfamily.org/index.php?title=Benchmark
> I'm not sure what the relationship is between uBLAS and BLAS?

Eigen doesn't provide a BLAS interface, so it would be quite a lot of
work to use it. Moreover, it probably derives some of its speed for
small matrices from compile-time specialization, which is not available
via a BLAS interface.

However, OpenBLAS/GotoBLAS could be better than ATLAS; it seems to be
doing well in the benchmarks you linked to:

    https://github.com/xianyi/OpenBLAS

If you are on Linux, you can easily swap the BLAS libraries used, like so:

*** OpenBLAS:

LD_PRELOAD=/usr/lib/openblas-base/libopenblas.so.0 ipython
...
In : %timeit e = np.dot(d, c.T)
100 loops, best of 3: 14.8 ms per loop

*** ATLAS:

LD_PRELOAD=/usr/lib/atlas-base/atlas/libblas.so.3gf ipython
...
In : %timeit e = np.dot(d, c.T)
10 loops, best of 3: 20.8 ms per loop

*** Reference BLAS:

LD_PRELOAD=/usr/lib/libblas/libblas.so.3gf:/usr/lib/libatlas.so ipython
...
In : %timeit e = np.dot(d, c.T)
10 loops, best of 3: 89.3 ms per loop

Yet another thing to watch out for is the possible use of multiple
processors at once (although I'm not sure how much that will matter in
this particular case).

--
Pauli Virtanen

## Re: How to efficiently do dot(dot( A.T, diag(d) ), A ) ?

In reply to this post by Pauli Virtanen-3

11.09.2012 20:15, Pauli Virtanen kirjoitti:
> Here are results on one (slowish) machine:
> [clip]

Ok, just for completeness: the bad performance for `np.dot` on this
machine comes from the fact that Numpy is not linked with any BLAS
library, so it falls back to a slow and naive algorithm. (You can check
this as follows: if `numpy.dot.__module__` is 'numpy.core._dotblas',
you have some BLAS.)
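The `_dotblas` check above is specific to the NumPy of that era; the module was removed in NumPy 1.10, after which `dot` always goes through whatever BLAS NumPy was built against. A version-tolerant sketch (mine, not from the thread) that falls back to inspecting the build configuration:

```python
import numpy as np

mod = getattr(np.dot, "__module__", None)
if mod == "numpy.core._dotblas":
    # Pre-1.10 NumPy: the presence of this wrapper module means an
    # accelerated BLAS was found at build time.
    print("np.dot is BLAS-accelerated")
else:
    # Newer NumPy: print the build configuration, which names the
    # BLAS/LAPACK libraries that were linked in.
    np.show_config()
```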

## Re: How to efficiently do dot(dot( A.T, diag(d) ), A ) ?

In reply to this post by Pauli Virtanen-3

On Wed, Sep 12, 2012 at 2:21 AM, Pauli Virtanen <[hidden email]> wrote:
> However, OpenBLAS/GotoBLAS could be better than ATLAS; it seems to be
> doing well in the benchmarks you linked to:
>
> If you are on Linux, you can easily swap the BLAS libraries used, like so:

Ah great! This is 4 times faster for me.

Using atlas:

$ python testmult.py
Elapsed time: 0.0221469402313
Elapsed time: 0.21438908577

Using goto/openblas:

$ python testmult.py
Elapsed time: 0.0214130878448
Elapsed time: 0.051687002182

Note that I had to do 'sudo update-alternatives --config liblapack.so.3gf'
and pick liblapack. On Ubuntu, I didn't need to modify LD_LIBRARY_PATH:
installing openblas-base is enough to half-select openblas, and actually
to break scipy, until the update-alternatives step above is done.
http://stackoverflow.com/questions/12249089/how-to-use-numpy-with-openblas-instead-of-atlas-in-ubuntu

Apparently openblas seems to support multicore too, though I haven't
confirmed that it works in conjunction with scipy yet.

## Re: How to efficiently do dot(dot( A.T, diag(d) ), A ) ?

Openblas works tons better than atlas and the reference BLAS. I'm fairly
sure that multicore is not really supported, though, or not supported
very well. On a 12-core machine, using openblas:

$ python testmult.py
Elapsed time: 0.00923895835876
Elapsed time: 0.0242748260498
$ export OMP_NUM_THREADS=12
$ python testmult.py
Elapsed time: 0.00923895835876
Elapsed time: 0.0255119800568
$ export OMP_NUM_THREADS=1
$ python testmult.py
Elapsed time: 0.00781798362732
Elapsed time: 0.0324649810791

On the same machine, using matlab:

>> n = 10000;
>> k = 100;
>> a = spdiags(rand(n,1),0,n,n);
>> c = rand(k,n);
>> tic, d = c*a; toc
Elapsed time is 0.005955 seconds.
>> tic, d = d*c'; toc
Elapsed time is 0.006201 seconds.

(code for testmult.py:

from __future__ import division
from scipy import *
import scipy
import scipy.sparse as sparse
from tictoc import tic, toc

n = 10000
k = 100
a = sparse.spdiags( rand(n), 0, n, n )
c = rand(n,k)
tic(); d = c.T * a; toc()
tic(); e = dot( d, c ); toc()

tictoc.py:

import time

start = 0

def tic():
    global start
    start = time.time()

def toc():
    global start
    print "Elapsed time: " + str( time.time() - start )
    start = time.time()
)

## Re: How to efficiently do dot(dot( A.T, diag(d) ), A ) ?

The good news is: it seems that scipy with openblas is *very* good on a
dual-core machine.

matlab:

>> n = 200000; k = 100; a = spdiags(rand(n,1),0,n,n); c = rand(k,n); tic, d = c*a*c'; toc
Elapsed time is 1.322931 seconds.
>> tic, d = c*a*c'; toc
Elapsed time is 1.308237 seconds.

python: