How to efficiently do dot(dot( A.T, diag(d) ), A )?
... where d is of length n, A is n * k, and n >> k.

(Obviously the i,j element of this is just sum_r A_{r,i} * A_{r,j} * d_r,
which is in theory relatively fast; at least, a lot faster than two
brute-force matrix multiplies!)
On Sun, Sep 9, 2012 at 9:19 PM, Hugh Perkins <[hidden email]> wrote:
> How to efficiently do dot(dot( A.T, diag(d) ), A )?
>
> ... where d is of length n, A is n * k, and n >> k
>
> (obviously the i,j element of this is just sum_r A_{r,i} * A_{r,j} * d_r,
> which is in theory relatively fast; at least, a lot faster than two
> brute-force matrix multiplies!)

dot( A.T * d , A )

IIRC

Josef
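For reference, a minimal sketch (sizes are illustrative) of why the
broadcast form is equivalent to the product with diag(d), without ever
forming the n-by-n diagonal matrix:

import numpy as np

n, k = 1000, 10
A = np.random.rand(n, k)
d = np.random.rand(n)

# diag(d) scales row r of A by d[r]; broadcasting d across the columns
# of A.T applies the same scaling without the n-by-n intermediate.
slow = np.dot(np.dot(A.T, np.diag(d)), A)   # builds an n x n matrix
fast = np.dot(A.T * d, A)                   # only an extra k x n array

assert np.allclose(slow, fast)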
In reply to this post by Hugh Perkins
> > How to efficiently do dot(dot( A.T, diag(d) ), A )?
>
> dot( A.T * d , A )

This is very good!

Still, the second multiplication looks like it is doing a full
brute-force matrix multiplication:

>>> tic(); d = c.T * a; toc()
Elapsed time: 0.00560903549194
>>> tic(); e = dot( d, c ); toc()
Elapsed time: 0.110434055328
On Mon, Sep 10, 2012 at 12:34 PM, Hugh Perkins <[hidden email]> wrote:
>> > How to efficiently do dot(dot( A.T, diag(d) ), A )?
>>
>> dot( A.T * d , A )
>
> This is very good!
>
> Still, the second multiplication looks like it is doing a full
> brute-force matrix multiplication:

Except that the result is symmetric, I don't see any way around that.
Elementwise multiplication and sum will usually be slower.

Josef

>>>> tic(); d = c.T * a; toc()
> Elapsed time: 0.00560903549194
>>>> tic(); e = dot( d, c ); toc()
> Elapsed time: 0.110434055328
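For what it's worth, here is a sketch of one way to exploit that
symmetry via the BLAS syrk routine. It assumes d is nonnegative (so
sqrt(d) is real) and a SciPy recent enough to expose
scipy.linalg.blas.dsyrk; whether it actually beats a plain dot in
practice depends on the BLAS:

import numpy as np
from scipy.linalg import blas

n, k = 100000, 50
A = np.random.rand(n, k)
d = np.random.rand(n)            # assumes d >= 0 so sqrt(d) is real

# Straightforward version: broadcast, then one full dense product.
M1 = np.dot(A.T * d, A)

# syrk version: fold sqrt(d) into the rows of A, then compute B.T @ B;
# syrk only evaluates one triangle, roughly halving the flops.
B = A * np.sqrt(d)[:, None]
Mu = blas.dsyrk(1.0, B, trans=1)          # upper triangle of B.T @ B
M2 = np.triu(Mu) + np.triu(Mu, 1).T       # mirror to form the full matrix

assert np.allclose(M1, M2)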
In reply to this post by Hugh Perkins
Hi,
On Mon, Sep 10, 2012 at 7:34 PM, Hugh Perkins <[hidden email]> wrote:
It's not so clear what kind of improvements you are looking for. Do you
perhaps expect that there exists some magic to ignore half of the
computations with dot product, when the result is symmetric? Anyway,
here is a simple demonstration to show that dot product is fairly
efficient already:
In []: m, n= 500000, 5
In []: A, d= randn(m, n), randn(m)
In []: %timeit A.T* d
10 loops, best of 3: 34.4 ms per loop
In []: mps= m* n/ .0344  # multiplications per second
In []: c= A.T* d
In []: %timeit dot(c, A)
10 loops, best of 3: 68 ms per loop
In []: masps= m* n** 2/ .068  # multiplications and summations per second
In []: masps/ mps
Out[]: 2.529411764705882
In []: allclose(dot(c, A), dot(A.T* d, A))
Out[]: True
In []: sys.version
Out[]: '2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)]'
In []: np.version.version
Out[]: '1.6.0'

Thus multiplication and summation with dot product is some 2.5 times
faster than simple multiplication.
My 2 cents,
-eat
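To spell out the arithmetic behind eat's 2.5x figure (numbers taken
from the session above):

m, n = 500000, 5
t_mul, t_dot = 0.0344, 0.068     # seconds per loop, from the session above

mps = m * n / t_mul              # elementwise multiplications per second
masps = m * n**2 / t_dot         # dot multiply-and-adds per second
print(masps / mps)               # -> ~2.53

# dot is slower in wall time (68 ms vs 34.4 ms) but performs n = 5 times
# as many operations, so per operation it is roughly 2.5x faster.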
In reply to this post by Hugh Perkins
> It's not so clear what kind of improvements you are looking for. Do you
> perhaps expect that there exists some magic to ignore half of the
> computations with dot product, when the result is symmetric?

Here's the same test in matlab:

>> n = 10000;
>> k = 100;
>> a = spdiags(rand(n,1),0,n,n);
>> c = rand(k,n);
>> tic, d = c*a; toc
Elapsed time is 0.007769 seconds.
>> tic, d = d*c'; toc
Elapsed time is 0.007782 seconds.

(vs, in scipy:

>>> tic(); d = c.T * a; toc()
Elapsed time: 0.00560903549194
>>> tic(); e = dot( d, c ); toc()
Elapsed time: 0.110434055328
)
I'm not sure what's going on here, but when I run your MATLAB code on my
system (copy-pasted) I get results more along the lines of

Elapsed time is 0.007203 seconds.
Elapsed time is 0.031520 seconds.

i.e. much worse than you report, but still better than the Python ones.
With scipy I get 4.1 ms vs. 60 ms.

'Sparsifying' the code improves things on my end:

In [73]: n = 10000
In [74]: k = 100
In [75]: r = rand(n)
In [76]: a = scipy.sparse.spdiags(r, 0, n, n)
In [77]: c = scipy.sparse.rand(k,n)
In [78]: d = c.dot(a)
In [79]: e = d.dot(c.T)
In [80]: %timeit d = c.dot(a)
1000 loops, best of 3: 1.52 ms per loop
In [81]: %timeit e = d.dot(c.T)
1000 loops, best of 3: 1.12 ms per loop

Tony

On Tue, Sep 11, 2012 at 4:43 AM, Hugh Perkins <[hidden email]> wrote:
In reply to this post by Hugh Perkins
> 'Sparsifying' the code improves things on my end:

It did for me too :-)

But then I noticed that sparse.rand produces a matrix with no elements:

>>> from scipy import sparse
>>> sparse.rand(2,2)
<2x2 sparse matrix of type '<type 'numpy.float64'>'
        with 0 stored elements in COOrdinate format>
>>> sparse.rand(2,2).todense()
matrix([[ 0.,  0.],
        [ 0.,  0.]])

You can make a sparse rand matrix from a dense one:

c = sparse.coo_matrix(scipy.rand(n,k))

This gives worse timings than a non-sparse c for me:

>>> c = rand(n,k)
>>> tic(); d = c.T * a; toc()
Elapsed time: 0.0610790252686
>>> tic(); e = dot( d, c ); toc()
Elapsed time: 0.239707946777
>>> import scipy.sparse as sparse
>>> c = sparse.coo_matrix(scipy.rand(n,k))
>>> tic(); d = c.T * a; toc()
Elapsed time: 0.137360095978
>>> tic(); e = dot( d, c ); toc()
Elapsed time: 1.74925804138
Oh. Of course. Sorry about that, I forgot about the density argument to
sparse.rand and didn't check my results. That certainly explains why it
was so fast... and since d can't be expected to be sparse, the sparse
approach is kind of dumb.

Tony

On Tue, Sep 11, 2012 at 11:10 AM, Hugh Perkins <[hidden email]> wrote:
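For reference, a small sketch of the density argument Tony mentions
(scipy.sparse.rand defaults to density=0.01, which rounds to zero
stored elements for a 2x2 matrix):

from scipy import sparse

# With the default density of 0.01, sparse.rand(2, 2) rounds to zero
# stored elements; passing density explicitly gives a non-empty matrix.
m = sparse.rand(100, 100, density=0.1, format='csr')
print(m.nnz)   # -> 1000 stored elements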
In reply to this post by Hugh Perkins
Hi,
First a quick note: please, when posting any benchmark result, always
include the full test code for each case.

11.09.2012 05:43, Hugh Perkins wrote:
>> It's not so clear what kind of improvements you are looking for. Do you
>> perhaps expect that there exists some magic to ignore half of the
>> computations with dot product, when the result is symmetric?
>
> Here's the same test in matlab:
[clip]

You are here benchmarking the underlying BLAS libraries. Matlab comes
with Intel MKL, whereas your Numpy/Scipy is likely linked with ATLAS.
MKL can be faster than ATLAS, but on the other hand you can also link
the Numpy/Scipy combination against MKL (provided you buy a license
from Intel).

In your Matlab example, the first matrix product is a sparse-dense one,
whereas the second is dense-dense (MKL-fast). That the speeds of the
two operations happen to coincide is likely a coincidence.

In the Numpy case without sparse matrices, the first product is a
broadcast multiplication (faster than a sparse-dense matrix product),
whereas the second product is a dense-dense matrix multiplication.

Here are results on one (slowish) machine:

----
import numpy as np
n = 10000
k = 100
a = np.random.rand(n)
c = np.random.rand(k,n)
d = c*a
e = np.dot(d, c.T)
%timeit d = c*a
# -> 100 loops, best of 3: 11 ms per loop
%timeit e = np.dot(d, c.T)
# -> 10 loops, best of 3: 538 ms per loop
----
n = 10000;
k = 100;
a = spdiags(rand(n,1),0,n,n);
c = rand(k,n);
tic, d = c*a; toc
% -> Elapsed time is 0.018380 seconds.
tic, e = d*c'; toc
% -> Elapsed time is 0.138673 seconds.
----

--
Pauli Virtanen
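A small sketch of the equivalence Pauli is pointing at: the broadcast
multiplication and the dense-sparse diagonal product compute the same
values, just through different code paths (names follow his snippet):

import numpy as np
import scipy.sparse as sparse

n, k = 10000, 100
a_vec = np.random.rand(n)
a_sp = sparse.spdiags(a_vec, 0, n, n)
c = np.random.rand(k, n)

d1 = c * a_vec                 # broadcast: scales column j of c by a_vec[j]
d2 = np.asarray(c * a_sp)      # dense-sparse matrix product, same values

assert np.allclose(d1, d2)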
On Wed, Sep 12, 2012 at 1:15 AM, Pauli Virtanen <[hidden email]> wrote:
> You are here benchmarking the underlying BLAS libraries. Matlab comes
> with Intel MKL, whereas your Numpy/Scipy is likely linked with ATLAS.

Hi Pauli,

Ok, that's good information. I think that makes a lot of sense, and
sounds plausible to me.

> MKL can be faster than ATLAS, but on the other hand you can also link
> the Numpy/Scipy combination against MKL (provided you buy a license
> from Intel).

Ok, that sounds reasonable to me.

It makes me wonder though. There is an open-source project called
'Eigen', for C++. It seems to provide good performance for matrix-matrix
multiplication, comparable to Intel MKL, and significantly better than
uBLAS:

http://eigen.tuxfamily.org/index.php?title=Benchmark

I'm not sure what the relationship is between uBLAS and BLAS? I wonder
if it might be worth scipy providing an option to link with eigen3? I
confess I don't personally have time to do that though.
(Apparently Eigen can use multithreading too. This is one thing which
matlab does automatically, by the way: automatically parallelizing
certain matrix operations across multiple cores.

http://eigen.tuxfamily.org/dox/TopicMultiThreading.html )
In reply to this post by Hugh Perkins
11.09.2012 20:28, Hugh Perkins wrote:
[clip]
> It makes me wonder though. There is an open-source project called
> 'Eigen', for C++. It seems to provide good performance for
> matrix-matrix multiplication, comparable to Intel MKL, and
> significantly better than uBLAS:
> http://eigen.tuxfamily.org/index.php?title=Benchmark
> I'm not sure what the relationship is between uBLAS and BLAS?

Eigen doesn't provide a BLAS interface, so it would be quite a lot of
work to use it. Moreover, it probably derives some of its speed for
small matrices from compile-time specialization, which is not available
via a BLAS interface.

However, OpenBLAS/GotoBLAS could be better than ATLAS; it also seems to
be doing well in the benchmarks you linked to:

https://github.com/xianyi/OpenBLAS

If you are on Linux, you can easily swap the BLAS libraries used, like so:

*** OpenBLAS:

LD_PRELOAD=/usr/lib/openblas-base/libopenblas.so.0 ipython
...
In [11]: %timeit e = np.dot(d, c.T)
100 loops, best of 3: 14.8 ms per loop

*** ATLAS:

LD_PRELOAD=/usr/lib/atlas-base/atlas/libblas.so.3gf ipython
...
In [12]: %timeit e = np.dot(d, c.T)
10 loops, best of 3: 20.8 ms per loop

*** Reference BLAS:

LD_PRELOAD=/usr/lib/libblas/libblas.so.3gf:/usr/lib/libatlas.so ipython
...
In [11]: %timeit e = np.dot(d, c.T)
10 loops, best of 3: 89.3 ms per loop

Yet another thing to watch out for is the possible use of multiple
processors at once (although I'm not sure how much that will matter in
this particular case).

--
Pauli Virtanen
In reply to this post by Pauli Virtanen-3
11.09.2012 20:15, Pauli Virtanen wrote:
> Here are results on one (slowish) machine:

Ok, just for completeness: the bad performance for `np.dot` on this
machine comes from the fact that Numpy is not linked with any BLAS
library, so it falls back to a slow and naive algorithm.

(You can check this as follows: if `numpy.dot.__module__` is
'numpy.core._dotblas', you have some BLAS.)

> ----
> import numpy as np
> n = 10000
> k = 100
> a = np.random.rand(n)
> c = np.random.rand(k,n)
> d = c*a
> e = np.dot(d, c.T)
> %timeit d = c*a
> # -> 100 loops, best of 3: 11 ms per loop
> %timeit e = np.dot(d, c.T)
> # -> 10 loops, best of 3: 538 ms per loop
> ----
> n = 10000;
> k = 100;
> a = spdiags(rand(n,1),0,n,n);
> c = rand(k,n);
> tic, d = c*a; toc
> % -> Elapsed time is 0.018380 seconds.
> tic, e = d*c'; toc
> % -> Elapsed time is 0.138673 seconds.
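A sketch of that check, with the caveat that the _dotblas module only
exists in the NumPy of this era; on later NumPy versions
np.show_config() is the usual way to see the linked BLAS:

import numpy as np

# On NumPy of this era, np.dot lives in numpy.core._dotblas when a BLAS
# is linked in; on later versions that module is gone.
print(np.dot.__module__ == 'numpy.core._dotblas')

# Lists the BLAS/LAPACK libraries NumPy was built against.
np.show_config()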
In reply to this post by Pauli Virtanen-3
On Wed, Sep 12, 2012 at 2:21 AM, Pauli Virtanen <[hidden email]> wrote:
> However, OpenBLAS/GotoBLAS could be better than ATLAS; it also seems
> to be doing well in the benchmarks you linked to:
>
> If you are on Linux, you can easily swap the BLAS libraries used, like so:

Ah great! This is 4 times faster for me:

Using atlas:
$ python testmult.py
Elapsed time: 0.0221469402313
Elapsed time: 0.21438908577

Using goto/openblas:
$ python testmult.py
Elapsed time: 0.0214130878448
Elapsed time: 0.051687002182

Note that I had to do 'sudo update-alternatives --config liblapack.so.3gf'
and pick liblapack. On Ubuntu, I didn't need to modify LD_LIBRARY_PATH:
installing openblas-base is enough to half-select openblas, and actually
to break scipy, until the update-alternatives step above is done.

http://stackoverflow.com/questions/12249089/how-to-use-numpy-with-openblas-instead-of-atlas-in-ubuntu

Apparently openblas seems to support multicore too, though I haven't
confirmed that it works in conjunction with scipy yet.
Openblas does work tons better than atlas and the reference BLAS.

I'm fairly sure that multicore is not really supported though? Or
supported not very well. On a 12-core machine, using openblas:

$ python testmult.py
Elapsed time: 0.00923895835876
Elapsed time: 0.0242748260498
$ export OMP_NUM_THREADS=12
$ python testmult.py
Elapsed time: 0.00923895835876
Elapsed time: 0.0255119800568
$ export OMP_NUM_THREADS=1
$ python testmult.py
Elapsed time: 0.00781798362732
Elapsed time: 0.0324649810791

On the same machine, using matlab:

>> n = 10000;
>> k = 100;
>> a = spdiags(rand(n,1),0,n,n);
>> c = rand(k,n);
>> tic, d = c*a; toc
Elapsed time is 0.005955 seconds.
>> tic, d = d*c'; toc
Elapsed time is 0.006201 seconds.

(code for testmult.py:

from __future__ import division
from scipy import *
import scipy
import scipy.sparse as sparse
from tictoc import tic, toc

n = 10000
k = 100
a = sparse.spdiags( rand(n), 0, n, n )
c = rand(n,k)
tic(); d = c.T * a; toc()
tic(); e = dot( d, c ); toc()

tictoc.py:

import time

start = 0

def tic():
    global start
    start = time.time()

def toc():
    global start
    print "Elapsed time: " + str( time.time() - start )
    start = time.time()
)
The good news is: it seems that scipy with openblas is *very* good on a
dual-core machine:

matlab:

>> n = 200000; k = 100; a = spdiags(rand(n,1),0,n,n); c = rand(k,n);
>> tic, d = c*a*c'; toc
Elapsed time is 1.322931 seconds.
>> tic, d = c*a*c'; toc
Elapsed time is 1.308237 seconds.

python:
(pressed send by accident too soon, sorry :-( )

The good news is: if I use larger matrices, which is actually closer to
my target scenario, then it seems that scipy with openblas is *very*
good on a dual-core machine:

matlab:

>> n = 200000; k = 100; a = spdiags(rand(n,1),0,n,n); c = rand(k,n);
>> tic, d = c*a*c'; toc
>> tic, d = c*a*c'; toc
Elapsed time is 1.322931 seconds.
Elapsed time is 1.308237 seconds.

python:

$ python testmult.py
Elapsed time: 0.408543109894
Elapsed time: 1.15053105354

However, on a 12-core machine, it seems that scipy/openblas is rather
less competitive:

matlab:

>> n = 200000;
>> k = 100;
>> a = spdiags(rand(n,1),0,n,n);
>> c = rand(k,n);
>> tic, d = c*a; toc
Elapsed time is 0.106401 seconds.
>> tic, d = d*c'; toc
Elapsed time is 0.098882 seconds.

python:

$ python testmult2.py
Elapsed time: 0.168606996536
Elapsed time: 0.584333896637

python code:

from __future__ import division
from scipy import *
import scipy.sparse as sparse
import time

start = 0

def tic():
    global start
    start = time.time()

def toc():
    global start
    print "Elapsed time: " + str( time.time() - start )
    start = time.time()

n = 200000
k = 100
a = sparse.spdiags( rand(n), 0, n, n )
c = rand(n,k)
tic(); d = c.T * a; toc()
tic(); e = dot( d, c ); toc()

Note that openblas seems to use multithreading by default:

$ export OMP_NUM_THREADS=1
$ python testmult2.py
Elapsed time: 0.145273923874
Elapsed time: 0.750689983368
$ unset OMP_NUM_THREADS
$ python testmult2.py
Elapsed time: 0.168558120728
Elapsed time: 0.578655004501
$ export OMP_NUM_THREADS=12
$ python testmult2.py
Elapsed time: 0.168849945068
Elapsed time: 0.591191053391

... but this is apparently not enough to get quite as good as matlab :-(
Looking at the ratio of the times for 12 threads and 1 thread, it looks
like the openblas performance doesn't change much with the number of
threads.

... but perhaps this is now out of the scope of scipy, and is something
I should check with openblas?
In reply to this post by Hugh Perkins
> > However, OpenBLAS/GotoBLAS could be better than ATLAS; it also seems
> > to be doing well in the benchmarks you linked to:
> >
> > If you are on Linux, you can easily swap the BLAS libraries used, like so:
>
> Ah great! This is 4 times faster for me:
>
> Using atlas:
> $ python testmult.py
> Elapsed time: 0.0221469402313
> Elapsed time: 0.21438908577
>
> Using goto/openblas:
> $ python testmult.py
> Elapsed time: 0.0214130878448
> Elapsed time: 0.051687002182

What ATLAS package are you using? If you are on Debian/Ubuntu, the
default libatlas3-base is *not* optimized for your CPU.

With ATLAS, most optimizations happen at build time, so to take full
advantage of ATLAS you *need* to compile it on the machine you are going
to use it on. The binary package you get from Ubuntu/Debian was compiled
on the Debian developer's machine and is not going to be good for yours.
You need to follow the (very simple) instructions found in
/usr/share/doc/libatlas3-base/README.Debian to compile ATLAS on your
CPU, so that ATLAS actually has a chance to optimize. In my experience
this can make ATLAS a lot (up to 10x) faster.

For the lazy, here are the instructions:

# cd /tmp
# apt-get source atlas
# apt-get build-dep atlas
# cd atlas-3.8.4
# fakeroot debian/rules custom
# cd ..

This will produce a series of deb packages that you can install with

# dpkg -i *.deb

HTH,
Tiziano