Biometrika Advance Access originally published online on August 5, 2007
Biometrika 2007 94(3):760-766; doi:10.1093/biomet/asm050
Copyright © 2007 Biometrika Trust
Miscellanea
The high-dimension, low-sample-size geometric representation holds under mild conditions
Department of Statistics, University of Georgia, Athens, Georgia 30602, U.S.A.
Department of Statistics and Operations Research, University of North Carolina, Chapel Hill, North Carolina 27599, U.S.A.
Department of Epidemiology and Health Policy Research, University of Florida, Gainesville, Florida 32610, U.S.A.
Department of Biostatistics, University of Washington, Seattle, Washington 98195, U.S.A.
jyahn{at}stat.uga.edu
marron{at}email.unc.edu
Keith.Muller{at}biostat.ufl.edu
yychi{at}u.washington.edu
Received for publication 1 July 2005. Revision received 1 February 2007.
High-dimension, low-sample-size datasets have different geometrical properties from those of traditional low-dimensional data. In their asymptotic study of increasing dimensionality with a fixed sample size, Hall et al. (2005) showed that the data vectors are approximately located at the vertices of a regular simplex in a high-dimensional space. A perhaps unappealing aspect of their result is the underlying assumption, which requires the variables, viewed as a time series, to be almost independent. We establish an equivalent geometric representation under much milder conditions using asymptotic properties of sample covariance matrices. We discuss implications of the results, such as the use of principal component analysis in a high-dimensional space, extension to the case of nonindependent samples and also the binary classification problem.
Key Words: High-dimension, low-sample-size; Large p small n; Linear discrimination; Sample covariance matrix
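The simplex geometry described in the abstract can be illustrated numerically. The sketch below covers only the simplest case of independent standard Gaussian coordinates, not the milder conditions established in the paper. The function name `scaled_pairwise_distances` and the choices n = 5, d = 2000 are assumptions made for illustration. In this case all pairwise distances, divided by the square root of the dimension, should lie close to the square root of 2, so the points are nearly equidistant, like the vertices of a regular simplex.

```python
import itertools
import math
import random


def scaled_pairwise_distances(n, d, seed=0):
    """Draw n independent standard Gaussian vectors in R^d and return all
    pairwise Euclidean distances divided by sqrt(d).

    Illustrative assumption: i.i.d. N(0, 1) coordinates, the simplest
    setting in which the simplex representation is expected to appear.
    """
    rng = random.Random(seed)
    data = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    distances = []
    for x, y in itertools.combinations(data, 2):
        squared = sum((a - b) ** 2 for a, b in zip(x, y))
        distances.append(math.sqrt(squared / d))
    return distances


# Fixed small sample size, large dimension.
distances = scaled_pairwise_distances(n=5, d=2000)
spread = max(distances) - min(distances)

print("target value sqrt(2):", round(math.sqrt(2.0), 3))
print("smallest scaled distance:", round(min(distances), 3))
print("largest scaled distance:", round(max(distances), 3))
```

Increasing `d` with `n` held fixed should shrink `spread`, which is the sense in which the representation is asymptotic in the dimension.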
References
Bai Z. D., Silverstein J. W. No eigenvalues outside the support of the limiting spectral distribution of large-dimensional sample covariance matrices. Ann. Prob. (1998) 26:316–45.
Baik J., Ben Arous G., Péché S. Phase transition of the largest eigenvalue for non-null complex covariance matrices. Ann. Prob. (2005) 33:1643–97.
Baik J., Silverstein J. W. Eigenvalues of large sample covariance matrices of spiked population models. J. Mult. Anal. (2006) 97:1382–408.
Benito M., Parker J., Du Q., Wu J., Xiang D., Perou C. M., Marron J. S. Adjustment of systematic microarray data biases. Bioinformatics (2004) 20:105–44.
Bickel P., Levina E. Some theory for Fisher's linear discriminant function, "naive Bayes", and some alternatives when there are many more variables than observations. Bernoulli (2004) 10:989–1010.
Cristianini N., Shawe-Taylor J. An Introduction to Support Vector Machines and other Kernel-Based Learning Methods (2000) Cambridge: Cambridge University Press.
Donoho D. L., Tanner J. Neighborliness of randomly-projected simplices in high dimensions. Proc. Nat. Acad. Sci. (2005) 102:9452–7.
Furey T. S., Cristianini N., Duffy N., Bednarski D. W., Schummer M., Haussler D. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics (2000) 16:906–14.
Golub T. R., Slonim D. K., Tamayo P., Huard C., Gaasenbeek M., Mesirov J. P., Coller H., Loh M., Downing J. R., Caligiuri M. A., Bloomfield C. D., Lander E. S. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science (1999) 286:531–7.
Hall P., Marron J. S., Neeman A. Geometric representation of high dimension, low sample size data. J. R. Statist. Soc. (2005) B 67:427–44.
John S. The distribution of a statistic used for testing sphericity of normal distributions. Biometrika (1972) 59:169–73.
Johnstone I. M. On the distribution of the largest eigenvalue in principal components analysis. Ann. Statist. (2001) 29:295–327.