Tuesday, December 05, 2006

Distance metrics

Will Dwinnell (a fellow poster here, who also has a Matlab-specific blog at http://matlabdatamining.blogspot.com) recently posted on Mahalanobis distance as an alternative to Euclidean distance. We are kindred spirits on this one, as I have long advocated the Mahalanobis distance, particularly for data that is close to normally distributed (there are transformations to make numeric data more normally distributed, of course, but that's perhaps for another post).

The reasons he gives are right on point, but I'd like to expand on the application side. I was first introduced to Mahalanobis distance in the context of Nearest Mean classifiers. In case anyone is not familiar with the M. distance, it is the Euclidean distance weighted by the inverse of the covariance matrix (think of Euclidean distance as weighting by the Identity matrix). But it is useful well beyond Nearest Mean classifiers and Radial Basis Function networks; in fact, any time you compute a distance in an algorithm, the M. distance is my preferred metric, including in k-nearest neighbor and in clustering (like Isodata or K-Means).
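
To make the definition concrete, here is a minimal sketch in Matlab; the data matrix and the test point are made up purely for illustration:

    % Mahalanobis distance of a point x from the mean of a sample X:
    % the squared differences are weighted by the inverse covariance matrix.
    X  = randn(100, 2);                    % illustrative sample: 100 cases, 2 variables
    S  = cov(X);                           % sample covariance matrix
    mu = mean(X);                          % sample mean vector
    x  = [1.5 -0.5];                       % an illustrative point to measure
    d  = sqrt((x - mu) / S * (x - mu)');   % Mahalanobis distance
    % Replacing S with eye(2) above gives the ordinary Euclidean distance.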

The problem is that very few data mining software packages include it. I was introduced to it in an obsolete tool called OLPARS (for which I also wrote code, including Perceptrons, RBFs, and some neural networks). But Matlab does it quite easily.
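
For instance, assuming the Statistics Toolbox is available, the mahal function does the work directly (again, the data here is made up for illustration):

    % Squared Mahalanobis distances via the Statistics Toolbox.
    X  = randn(100, 2);              % reference sample
    Y  = [1.5 -0.5; 0 0];            % points to measure against X
    d2 = mahal(Y, X);                % squared distances to the mean of X
    d  = sqrt(d2);                   % square root if a true distance is needed
    % pdist(X, 'mahalanobis') computes all pairwise Mahalanobis distances in X.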
