I am trying to calculate the pairwise Hamming distances among a
large set of boolean arrays. I would have thought that Hamming
distance would be one of the fastest distance metrics to compute,
but it seems to be much slower than the alternatives.
For instance, I generate a 1000x1000 boolean test matrix:
import numpy as np
rng = np.random.RandomState()
test = rng.rand(1000, 1000) > 0.5
Then I try to calculate the pairwise distances, e.g.
from scipy.spatial.distance import pdist
d = pdist(test, metric='hamming')
This runs in an average of 3.08 s on my laptop, whereas if I
change the metric from hamming to sqeuclidean, this goes down to
0.56 s. These will effectively give me the same result (up to a
normalization factor), but I am quite surprised to see hamming
be so slow. In fact, computing the sqeuclidean distance on a
real-valued matrix of the same shape also runs in under 0.6 s.
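To make the "same result up to a normalization factor" claim concrete, here is a minimal sketch on a single pair of vectors (the names x and y are my own): scipy's hamming returns the fraction of differing positions, while sqeuclidean on the 0/1 float cast returns the count of differing positions, so they differ exactly by the vector length.

```python
import numpy as np
from scipy.spatial.distance import hamming, sqeuclidean

x = np.array([True, False, True, True])
y = np.array([True, True, False, True])

# hamming: fraction of positions where x and y differ
h = hamming(x, y)  # 2 of 4 positions differ -> 0.5

# sqeuclidean on the 0/1 floats: count of differing positions,
# since (x_i - y_i)**2 is 1 exactly where the bits differ
s = sqeuclidean(x.astype(float), y.astype(float))  # -> 2.0

# They agree up to the normalization by vector length
assert s == h * x.size
```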
Why should hamming, which is just doing a bitwise XOR,
take so much longer than computing sqeuclidean distances on
real-valued arrays? And is there a recommended way to optimize
this?
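For completeness, here is a sketch of the workaround I describe above: casting the boolean matrix to float, running pdist with sqeuclidean, and dividing by the number of columns to recover the hamming normalization. (This is just what I'm currently doing; I don't know whether it is the recommended approach.)

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.RandomState()
test = rng.rand(1000, 1000) > 0.5

# Slow path: hamming directly on the boolean matrix
d_hamming = pdist(test, metric='hamming')

# Faster workaround: sqeuclidean on the float cast counts the
# differing positions; dividing by the number of features gives
# the same normalized distances as hamming
d_fast = pdist(test.astype(np.float64), metric='sqeuclidean') / test.shape[1]

# The two results match to floating-point precision
print(np.allclose(d_hamming, d_fast))
```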