[SciPy-User] Why is hamming distance so slow?

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
Report Content as Inappropriate

[SciPy-User] Why is hamming distance so slow?

Joshua Auerbach

I am trying to calculate the pairwise hamming distances among a large set of boolean arrays.  I would think that hamming distance would be one of the fastest distance metrics, but it seems to be much slower than alternatives.

For instance I generate a 1000x1000 test boolean matrix

test = rng.rand(1000,1000) > 0.5

Then I try to calculate the pairwise distances, e.g.

pdist(test, 'hamming')

I can run this in an average of 3.08s on my laptop, Whereas if I change hamming to sqeuclidean this goes down to 0.56s.  These will effectively give me the same result (with the normalization factor), but I am quite surprised to see hamming be so slow. In fact, computing sqeuclidean distance on a real valued matrix of the same shape also runs in less than 0.6s

Why should hamming, which is just doing bitwise xor take so much longer than computing sqeuclidean distances on real valued arrays?  And, is there a recommended way to optimize this?

SciPy-User mailing list
[hidden email]