Biometrika Advance Access published online on August 5, 2007

Biometrika, doi:10.1093/biomet/asm050
Copyright © 2007 Biometrika Trust

Article

The high-dimension, low-sample-size geometric representation holds under mild conditions

Jeongyoun Ahn

Department of Statistics, University of Georgia, Athens, Georgia 30602, U.S.A.

J. S. Marron

Department of Statistics and Operations Research, University of North Carolina, Chapel Hill, North Carolina 27599, U.S.A.

Keith M. Muller

Department of Epidemiology and Health Policy Research, University of Florida, Gainesville, Florida 32610, U.S.A.

Yueh-Yun Chi

Department of Biostatistics, University of Washington, Seattle, Washington 98195, U.S.A.

jyahn{at}stat.uga.edu

marron{at}email.unc.edu

Keith.Muller{at}biostat.ufl.edu

yychi{at}u.washington.edu

Received for publication 1 July 2005. Revision received 1 February 2007.

High-dimension, low-sample-size datasets have different geometrical properties from those of traditional low-dimensional data. In their asymptotic study of increasing dimensionality with a fixed sample size, Hall et al. (2005) showed that the data vectors are approximately located at the vertices of a regular simplex in a high-dimensional space. A perhaps unappealing aspect of their result is the underlying assumption, which requires the variables, viewed as a time series, to be almost independent. We establish an equivalent geometric representation under much milder conditions using asymptotic properties of sample covariance matrices. We discuss implications of the results, such as the use of principal component analysis in a high-dimensional space, extension to the case of nonindependent samples and also the binary classification problem.
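The simplex geometry described in the abstract can be illustrated numerically. The following is a minimal sketch, not the authors' method: it assumes the simplest possible setting, i.i.d. standard normal coordinates, which is far stronger than the mild conditions the article establishes. The function names `scaled_pairwise_distances` and `relative_spread` are hypothetical and chosen only for this illustration. With the sample size n held fixed, every pairwise distance divided by the square root of the dimension d should approach the same constant (the square root of 2 in this setting), so the n points look increasingly like the vertices of a regular simplex.

```python
import numpy as np


def scaled_pairwise_distances(n, d, seed=0):
    """Draw n i.i.d. standard normal vectors in R^d (an assumed toy model)
    and return the n*(n-1)/2 pairwise distances divided by sqrt(d)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, d))
    # Squared distances via the Gram matrix: |a - b|^2 = |a|^2 + |b|^2 - 2 a.b
    gram = x @ x.T
    sq_norms = np.diag(gram)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * gram
    upper = np.triu_indices(n, k=1)  # each unordered pair once
    # Clip tiny negative values caused by floating-point rounding.
    return np.sqrt(np.maximum(sq_dists[upper], 0.0) / d)


def relative_spread(n, d, seed=0):
    """(max - min) / mean of the scaled pairwise distances.
    A value near zero means the points are nearly equidistant."""
    dists = scaled_pairwise_distances(n, d, seed)
    return (dists.max() - dists.min()) / dists.mean()


if __name__ == "__main__":
    n = 5  # fixed, small sample size
    for d in (10, 1_000, 100_000):
        dists = scaled_pairwise_distances(n, d)
        print(f"d={d:>7}: mean scaled distance={dists.mean():.4f}, "
              f"relative spread={relative_spread(n, d):.4f}")
```

Running the script with growing d shows the relative spread shrinking, which is the equidistance property behind the simplex picture in this toy case.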

Key Words: High-dimension, low-sample-size • Large p small n • Linear discrimination • Sample covariance matrix



References

    Bai Z. D., Silverstein J. W. No eigenvalues outside the support of the limiting spectral distribution of large-dimensional sample covariance matrices. Ann. Prob. (1998) 26:316–45.

    Baik J., Ben Arous G., Péché S. Phase transition of the largest eigenvalue for non-null complex covariance matrices. Ann. Prob. (2005) 33:1643–97.

    Baik J., Silverstein J. W. Eigenvalues of large sample covariance matrices of spiked population models. J. Mult. Anal. (2006) 97:1382–408.

    Benito M., Parker J., Du Q., Wu J., Xiang D., Perou C. M., Marron J. S. Adjustment of systematic microarray data biases. Bioinformatics (2004) 20:105–44.

    Bickel P., Levina E. Some theory for Fisher's linear discriminant function, "naive Bayes", and some alternatives when there are many more variables than observations. Bernoulli (2004) 10:989–1010.

    Cristianini N., Shawe-Taylor J. An Introduction to Support Vector Machines and other Kernel-Based Learning Methods (2000) Cambridge: Cambridge University Press.

    Donoho D. L., Tanner J. Neighborliness of randomly-projected simplices in high dimensions. Proc. Nat. Acad. Sci. (2005) 102:9452–7.

    Furey T. S., Cristianini N., Duffy N., Bednarski D. W., Schummer M., Haussler D. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics (2000) 16:906–14.

    Golub T. R., Slonim D. K., Tamayo P., Huard C., Gaasenbeek M., Mesirov J. P., Coller H., Loh M., Downing J. R., Caligiuri M. A., Bloomfield C. D., Lander E. S. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science (1999) 286:531–7.

    Hall P., Marron J. S., Neeman A. Geometric representation of high dimension, low sample size data. J. R. Statist. Soc. B (2005) 67:427–44.

    John S. The distribution of a statistic used for testing sphericity of normal distributions. Biometrika (1972) 59:169–73.

    Johnstone I. M. On the distribution of the largest eigenvalue in principal components analysis. Ann. Statist. (2001) 29:295–327.




This article has been cited by other articles:


M. C. Wu, L. Zhang, Z. Wang, D. C. Christiani, and X. Lin
Sparse linear discriminant analysis for simultaneous testing for the significance of a gene set/pathway and gene selection
Bioinformatics, May 1, 2009; 25(9): 1145 - 1151.

