Hi,

I am trying to calculate the pairwise hamming distances among a
large set of boolean arrays. I would think that hamming distance
would be one of the fastest distance metrics, but it seems to be
much slower than alternatives.

For instance, I generate a 1000x1000 boolean test matrix:

import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.RandomState(0)
test = rng.rand(1000, 1000) > 0.5

Then I try to calculate the pairwise distances, e.g.

pdist(test, 'hamming')
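(Timed roughly like this; the absolute numbers below are from my laptop and will of course vary by machine:)

```python
import timeit

import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.RandomState(0)
test = rng.rand(1000, 1000) > 0.5

# One call per metric is enough to see the gap; absolute
# timings depend on the machine.
t_ham = timeit.timeit(lambda: pdist(test, 'hamming'), number=1)
t_sq = timeit.timeit(lambda: pdist(test, 'sqeuclidean'), number=1)
print(f"hamming:     {t_ham:.2f}s")
print(f"sqeuclidean: {t_sq:.2f}s")
```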

I can run this in an average of 3.08s on my laptop, whereas if I change *hamming* to *sqeuclidean* it goes down to 0.56s. These effectively give me the same result (up to the normalization factor), but I am quite surprised to see *hamming* be so slow. In fact, computing *sqeuclidean* distance on a real-valued matrix of the same shape also runs in less than 0.6s.
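To confirm that the two metrics really agree on boolean input, here is a quick sanity check on a smaller matrix (the sizes are arbitrary):

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.RandomState(0)
x = rng.rand(50, 200) > 0.5

# For 0/1 data, (a - b)**2 equals (a XOR b) elementwise, so the
# squared euclidean distance is the raw mismatch count; dividing
# by the number of columns gives scipy's normalized hamming value.
ham = pdist(x, 'hamming')
sq = pdist(x.astype(np.float64), 'sqeuclidean')
assert np.allclose(ham, sq / x.shape[1])
```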

Why should *hamming*, which is just doing a bitwise *xor*, take so much longer than computing *sqeuclidean* distances on real-valued arrays? And is there a recommended way to optimize this?
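For what it's worth, one possible workaround is to get the mismatch counts from dot products on a float copy of the matrix; this is just a sketch (sizes arbitrary), but it seems to reproduce pdist's normalized result:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.RandomState(0)
x = rng.rand(200, 300) > 0.5

# Matches between two rows = positions where both are 1 plus
# positions where both are 0; mismatches = d - matches.
# Normalizing by d gives the same value as pdist's 'hamming'.
xf = x.astype(np.float64)
d = x.shape[1]
matches = xf @ xf.T + (1 - xf) @ (1 - xf).T
ham_fast = (d - matches) / d

ham_ref = squareform(pdist(x, 'hamming'))
assert np.allclose(ham_fast, ham_ref)
```

The two matrix products let BLAS do the heavy lifting, at the cost of materializing the full square distance matrix instead of pdist's condensed form.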

Cheers,

Josh

_______________________________________________

SciPy-User mailing list

[hidden email]
https://mail.scipy.org/mailman/listinfo/scipy-user