Biometrika Advance Access originally published online on April 1, 2009
Biometrika 2009 96(2):469-478; doi:10.1093/biomet/asp007
Scale adjustments for classifiers in high-dimensional, low sample size settings
Department of Mathematics and Statistics, The University of Melbourne, Parkville, Victoria 3010, Australia y.chan{at}ms.unimelb.edu.au P.Hall{at}ms.unimelb.edu.au
Received for publication 1 October 2007. Revision received 1 September 2008.
Distance-based classifiers are generally considered to be effective at discriminating between populations that differ in location. Indeed, nearest-neighbour methods and the support vector machine are frequently used in very high-dimensional problems involving gene expression data, where it is believed that elevated levels of expression convey much of the information for classification. However, one problem inherent to distance-based classifiers is that scale differences can mask location differences. In consequence, such classifiers can have poor performance if the information for classification accumulates through a large number of relatively small location differences in data components, rather than via large differences. In this paper, we show that a simple adjustment for scale, applicable to a variety of distance-based classifiers, can remedy the problem. For some classifiers, such as those based on the support vector machine or the centroid method, scale corrections are important primarily in the case of small training-sample sizes. However, for other classifiers, including those based on nearest-neighbour and average-distance methods, scale adjustments are helpful more generally.
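The masking effect described above can be illustrated with a small simulation. The sketch below is not taken from the paper: it assumes one simple form of scale correction for an average-distance classifier, namely subtracting from each average squared distance an estimate of that training sample's total variance (half its mean pairwise squared distance). The function names, the Gaussian populations, and the values of the dimension, sample sizes, shift and scale are all illustrative assumptions.

```python
import numpy as np


def mean_sq_dist_to_sample(z, sample):
    """Average squared Euclidean distance from point z to the rows of sample."""
    return np.mean(np.sum((sample - z) ** 2, axis=1))


def half_mean_pairwise_sq_dist(sample):
    """Half the mean squared distance between distinct rows of sample.

    Algebraically this equals the trace of the unbiased sample covariance,
    so it estimates the total variance of the population.
    """
    m = sample.shape[0]
    centred = sample - sample.mean(axis=0)
    return np.sum(centred ** 2) / (m - 1)


def classify(z, x_train, y_train, adjust):
    """Return 0 if z is assigned to the X population, 1 if to the Y population."""
    dx = mean_sq_dist_to_sample(z, x_train)
    dy = mean_sq_dist_to_sample(z, y_train)
    if adjust:
        # Assumed scale correction: remove each sample's own spread.
        dx -= half_mean_pairwise_sq_dist(x_train)
        dy -= half_mean_pairwise_sq_dist(y_train)
    return 0 if dx <= dy else 1


def error_rate(x_test, y_test, x_train, y_train, adjust):
    """Overall misclassification rate on labelled test points."""
    wrong = sum(classify(z, x_train, y_train, adjust) != 0 for z in x_test)
    wrong += sum(classify(z, x_train, y_train, adjust) != 1 for z in y_test)
    return wrong / (len(x_test) + len(y_test))


def simulate(seed=0, d=1000, m=10, n=10, n_test=200, shift=0.5, scale=2 ** 0.5):
    """High-dimensional, low sample size toy problem (illustrative values).

    X has independent N(0, 1) components; Y has independent
    N(shift, scale^2) components, so each component differs only slightly
    in location while the scale difference is substantial.
    """
    rng = np.random.default_rng(seed)
    x_train = rng.normal(0.0, 1.0, size=(m, d))
    y_train = rng.normal(shift, scale, size=(n, d))
    x_test = rng.normal(0.0, 1.0, size=(n_test, d))
    y_test = rng.normal(shift, scale, size=(n_test, d))
    unadjusted = error_rate(x_test, y_test, x_train, y_train, adjust=False)
    adjusted = error_rate(x_test, y_test, x_train, y_train, adjust=True)
    return unadjusted, adjusted


unadjusted_error, adjusted_error = simulate()
print("unadjusted error:", unadjusted_error)
print("adjusted error:  ", adjusted_error)
```

In this toy setting the unadjusted rule tends to assign nearly every point to the less variable population, because the larger within-sample spread of the second sample inflates its average distances. Subtracting the spread estimate leaves, approximately, the squared distance to each population mean, so the accumulated small location differences become visible again.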
Key Words: Average-distance classifier; Centroid method; Distance-based classifier; Location difference; Nearest-neighbour method; Support vector machine