Hi,

I am trying to calculate the pairwise hamming distances among a
large set of boolean arrays. I would think that hamming distance
would be one of the fastest distance metrics, but it seems to be
much slower than alternatives.

For instance, I generate a 1000x1000 boolean test matrix:

import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.RandomState(0)
test = rng.rand(1000, 1000) > 0.5

Then I try to calculate the pairwise distances, e.g.

pdist(test, 'hamming')
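(Timed roughly like this; the absolute numbers below are from my laptop and will of course vary by machine:)

```python
import timeit

import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.RandomState(0)
test = rng.rand(1000, 1000) > 0.5

# One call per metric is enough to see the gap; absolute
# timings depend on the machine.
t_ham = timeit.timeit(lambda: pdist(test, 'hamming'), number=1)
t_sq = timeit.timeit(lambda: pdist(test, 'sqeuclidean'), number=1)
print(f"hamming:     {t_ham:.2f}s")
print(f"sqeuclidean: {t_sq:.2f}s")
```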

I can run this in an average of 3.08s on my laptop, whereas if I change *hamming* to *sqeuclidean* it goes down to 0.56s. These effectively give me the same result (up to the normalization factor), but I am quite surprised to see *hamming* be so slow. In fact, computing *sqeuclidean* distance on a real-valued matrix of the same shape also runs in less than 0.6s.
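To confirm that the two metrics really agree on boolean input, here is a quick sanity check on a smaller matrix (the sizes are arbitrary):

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.RandomState(0)
x = rng.rand(50, 200) > 0.5

# For 0/1 data, (a - b)**2 equals (a XOR b) elementwise, so the
# squared euclidean distance is the raw mismatch count; dividing
# by the number of columns gives scipy's normalized hamming value.
ham = pdist(x, 'hamming')
sq = pdist(x.astype(np.float64), 'sqeuclidean')
assert np.allclose(ham, sq / x.shape[1])
```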

Why should *hamming*, which is just doing a bitwise *xor*, take so much longer than computing *sqeuclidean* distances on real-valued arrays? And is there a recommended way to optimize this?
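For what it's worth, one possible workaround is to get the mismatch counts from dot products on a float copy of the matrix; this is just a sketch (sizes arbitrary), but it seems to reproduce pdist's normalized result:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.RandomState(0)
x = rng.rand(200, 300) > 0.5

# Matches between two rows = positions where both are 1 plus
# positions where both are 0; mismatches = d - matches.
# Normalizing by d gives the same value as pdist's 'hamming'.
xf = x.astype(np.float64)
d = x.shape[1]
matches = xf @ xf.T + (1 - xf) @ (1 - xf).T
ham_fast = (d - matches) / d

ham_ref = squareform(pdist(x, 'hamming'))
assert np.allclose(ham_fast, ham_ref)
```

The two matrix products let BLAS do the heavy lifting, at the cost of materializing the full square distance matrix instead of pdist's condensed form.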

Cheers,

Josh

_______________________________________________

SciPy-User mailing list

[hidden email]
https://mail.scipy.org/mailman/listinfo/scipy-user