A High-performance Computing Toolset for Relatedness and Principal Component Analysis of SNP Data.

Publication Type:

Journal Article


Bioinformatics (Oxford, England), Volume 28, Issue 24, p.3326-3328 (2012)


2012, Center-Authored Paper, October 2012, Public Health Sciences Division


SUMMARY: Genome-wide association studies (GWAS) are widely used to investigate the genetic basis of diseases and traits, but they pose many computational challenges. We developed gdsfmt and SNPRelate (R packages for multi-core symmetric multiprocessing computer architectures) to accelerate two key computations on SNP data: principal component analysis (PCA) and relatedness analysis using identity-by-descent (IBD) measures. The kernels of our algorithms are written in C/C++ and highly optimized. Benchmarks show the uniprocessor implementations of PCA and IBD are ~8 to 50 times faster than the implementations provided in the popular EIGENSTRAT (v3.0) and PLINK (v1.07) programs respectively, and can be sped up to 30~300 fold by utilizing eight cores. SNPRelate can analyze tens of thousands of samples with millions of SNPs. For example, our package was used to perform PCA on 55,324 subjects from the "Gene-Environment Association Studies" (GENEVA) consortium studies.

AVAILABILITY: gdsfmt and SNPRelate are available from R CRAN (http://cran.r-project.org), including a vignette. A tutorial can be found at https://www.genevastudy.org/Accomplishments/software

CONTACT: Xiuwen Zheng (zhengx@u.washington.edu).
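For readers unfamiliar with the PCA computation named in the summary, the following is a minimal illustrative sketch in plain Python. It is not the SNPRelate implementation and does not reproduce its optimized C/C++ kernels or multi-core design. It assumes genotypes coded as 0/1/2 allele counts and the allele-frequency standardization commonly used in EIGENSTRAT-style PCA; the function names (`standardize`, `genetic_covariance`, `top_component`) and the toy data are hypothetical.

```python
import math


def standardize(genotypes):
    """Return one standardized column per polymorphic SNP.

    genotypes: list of samples, each a list of 0/1/2 allele counts.
    Assumed normalization: (g - 2p) / sqrt(p * (1 - p)), where p is the
    sample allele frequency of the SNP.
    """
    n = len(genotypes)
    columns = []
    for j in range(len(genotypes[0])):
        col = [genotypes[i][j] for i in range(n)]
        p = sum(col) / (2.0 * n)
        if p <= 0.0 or p >= 1.0:
            continue  # skip monomorphic SNPs (zero variance)
        scale = math.sqrt(p * (1.0 - p))
        columns.append([(g - 2.0 * p) / scale for g in col])
    return columns


def genetic_covariance(genotypes):
    """Sample-by-sample covariance matrix averaged over SNPs."""
    cols = standardize(genotypes)
    n = len(genotypes)
    cov = [[0.0] * n for _ in range(n)]
    for col in cols:
        for a in range(n):
            for b in range(a, n):
                cov[a][b] += col[a] * col[b]
    for a in range(n):
        for b in range(a, n):
            cov[a][b] /= len(cols)
            cov[b][a] = cov[a][b]  # fill the lower triangle by symmetry
    return cov


def top_component(cov, iterations=200):
    """Leading eigenvector of cov by power iteration (illustrative only)."""
    n = len(cov)
    v = [float(i + 1) for i in range(n)]  # deterministic starting vector
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(n)) for a in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


# Toy data (hypothetical): samples 0-1 and samples 2-3 form two groups.
genotypes = [
    [0, 0, 2, 2],
    [0, 1, 2, 1],
    [2, 2, 0, 0],
    [2, 1, 0, 1],
]
cov = genetic_covariance(genotypes)
pc1 = top_component(cov)
```

In this toy example the first component gives samples 0 and 1 one sign and samples 2 and 3 the opposite sign, which is the kind of population separation PCA is used to detect in GWAS data. The triple loop over samples and SNPs is the part that a production tool would vectorize and parallelize.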