Cell Rep. 2020 May 5;31(5):107576.
doi: 10.1016/j.celrep.2020.107576.

A Quantitative Framework for Evaluating Single-Cell Data Structure Preservation by Dimensionality Reduction Techniques

Cody N Heiser et al. Cell Rep.

Abstract

High-dimensional data, such as those generated by single-cell RNA sequencing (scRNA-seq), present challenges in interpretation and visualization. Numerical and computational methods for dimensionality reduction allow for low-dimensional representation of genome-scale expression data for downstream clustering, trajectory reconstruction, and biological interpretation. However, a comprehensive and quantitative evaluation of the performance of these techniques has not been established. We present an unbiased framework that defines metrics of global and local structure preservation in dimensionality reduction transformations. Using discrete and continuous real-world and synthetic scRNA-seq datasets, we show that input cell distribution and method parameters largely determine the global, local, and organizational data structure preservation achieved by 11 common dimensionality reduction methods.

Keywords: data analysis; dimensionality reduction; single-cell analysis; single-cell transcriptomics; unsupervised learning; visualization.

Conflict of interest statement

Declaration of Interests: The authors declare no competing interests.

Figures

Figure 1. Cell Distance Distributions Describe Global Structure of High-Dimensional Data
(A) Representation of scRNA-seq counts matrix. (B) Cell-cell distances in native gene space are calculated to generate an m × m matrix, where m is the total number of cells. The K nearest-neighbor (Knn) graph is constructed from these distances as a binary m × m matrix. (C) Upon transformation to low-dimensional space, a distance matrix and Knn graph can be calculated as in (B). (D) Distance matrices from native (B) and latent (C) spaces are used to build cumulative probability density distributions, which can be compared to one another by Earth-Mover’s distance (EMD; left). Unique cell-cell distances are correlated (right), and Knn preservation represents element-wise comparison of nearest-neighbor graph matrices in each space. See also Figure S1.
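
The three comparisons described in panels (B) through (D) can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function names `knn_graph` and `structure_preservation`, the min-max scaling of distances before the EMD, the default `k=30`, and scoring Knn preservation as the percentage of matching matrix elements are all assumptions made for the example.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import pearsonr, wasserstein_distance


def knn_graph(dist_matrix, k):
    """Binary m x m matrix: entry (i, j) is 1 if cell j is one of the
    k nearest neighbors of cell i (hypothetical helper)."""
    m = dist_matrix.shape[0]
    d = dist_matrix.astype(float)  # astype returns a copy, input is untouched
    np.fill_diagonal(d, np.inf)    # a cell is never its own neighbor
    nearest = np.argsort(d, axis=1)[:, :k]
    graph = np.zeros((m, m), dtype=int)
    graph[np.repeat(np.arange(m), k), nearest.ravel()] = 1
    return graph


def min_max(x):
    """Scale values to [0, 1] so distances from both spaces are comparable
    (an assumed normalization, chosen for this sketch)."""
    return (x - x.min()) / (x.max() - x.min())


def structure_preservation(native, latent, k=30):
    """Compare a native (cells x genes) matrix with its latent
    (cells x components) embedding."""
    # Unique cell-cell distances, i.e. the upper triangle of the m x m matrix.
    d_native = pdist(native)
    d_latent = pdist(latent)

    # Global structure: correlation of unique distances, and Earth-Mover's
    # distance between the two (scaled) distance distributions.
    corr, _ = pearsonr(d_native, d_latent)
    emd = wasserstein_distance(min_max(d_native), min_max(d_latent))

    # Local structure: element-wise agreement of the binary Knn matrices.
    knn_native = knn_graph(squareform(d_native), k)
    knn_latent = knn_graph(squareform(d_latent), k)
    knn_pres = 100.0 * np.mean(knn_native == knn_latent)

    return {"correlation": corr, "emd": emd, "knn_preservation": knn_pres}
```

Under these assumptions, an embedding that reproduces the native distances exactly gives a correlation of 1, an EMD of 0, and a Knn preservation of 100; distortion lowers the correlation and Knn preservation and raises the EMD.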
Figure 2. Discrete and Continuous Cell Distributions Exemplify Common Biological Patterns
(A) Relative expression of top genes in each cluster for mouse retina dataset. (B) t-SNE embedding primed with 100 principal components of retina dataset with overlay of consensus clusters. (C) t-SNE projection from (B) with overlay of marker genes used to identify cell types in (A). (D) Relative expression of top genes in each cluster for mouse colonic epithelium dataset. (E) t-SNE embedding primed with 100 principal components of colon dataset with overlay of consensus clusters. (F) t-SNE projection from (E) with overlay of marker genes used to identify cell types in (D). See also Figure S2.
Figure 3. Global and Local Structure Preservation Analysis of Dimension Reduction Methods on Discrete and Continuous scRNA-seq Datasets
(A) Example 2D projection of mouse retina data using SIMLR with cluster overlay (top). Cumulative distance distributions for native and latent spaces (bottom left) and 2D histogram representing correlation between unique distances (bottom right). (B) Cumulative distance distributions of evaluated projections of retina data. (C) Summary of structure preservation metrics for retina data. (D) 2D histograms of cell distance correlations for retina data. (E) Example 2D projection of retina data using zero-inflated negative binomial-based wanted variation extraction (ZINB-WaVE) and overlay of cone cells (left) and 2D histogram representing correlation between the two sets of unique distances (right). (F) Same as in (E) for distances between bipolar, amacrine, and rod cell clusters, using scvis projection. (G) Example graph representation of cluster topology for retina dataset, using t-SNE projection primed with single-cell variational inference (scVI) latent space. Red edges represent those not present in the minimum spanning tree of the native graph. (H) Same as in (A), with UMAP projection of mouse colon data. (J) Cumulative distance distributions of evaluated projections for colon data. (K) Summary of structure preservation metrics for colon data. (L) 2D histograms of cell distance correlations for colon data. (M) Same as in (E), with deep count autoencoder (DCA) projection of mature colonocytes. (N) Same as in (F) for distances between immature, developing, and mature goblet cell clusters, using zero-inflated factor analysis (ZIFA) projection. (P) Same as in (G), but for the colon dataset, using generalized principal component analysis (GLM-PCA) projection. See also Figure S3 and Table S1.
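
The cluster-topology comparison in panels (G) and (P) can be illustrated with a short sketch. It assumes, for the purpose of the example only, that each cluster is summarized by its centroid and that the graph is the minimum spanning tree over centroid distances; the function names `cluster_mst_edges` and `altered_edges` are hypothetical and do not come from the paper.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform


def cluster_mst_edges(data, labels):
    """Edges of the minimum spanning tree built over cluster centroids.

    Returns a set of (cluster_a, cluster_b) tuples with cluster_a < cluster_b.
    Summarizing each cluster by its centroid is an assumption of this sketch.
    """
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    centroids = np.vstack([data[labels == c].mean(axis=0) for c in clusters])
    dist = squareform(pdist(centroids))
    # scipy treats zero entries of a dense input as missing edges, so the
    # zero diagonal is ignored automatically.
    mst = minimum_spanning_tree(dist).toarray()
    rows, cols = np.nonzero(mst)
    names = clusters.tolist()  # plain Python values for clean set comparison
    return {tuple(sorted((names[i], names[j]))) for i, j in zip(rows, cols)}


def altered_edges(native, latent, labels):
    """Edges of the latent-space tree that are absent from the native tree
    (the edges that would be drawn in red)."""
    return cluster_mst_edges(latent, labels) - cluster_mst_edges(native, labels)
```

An empty result means the embedding kept the native cluster arrangement under this definition; each returned pair marks two clusters that became neighbors only after dimensionality reduction.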
Figure 4. Simulated Datasets with Defined Topology Validate Observations
(A) Diagram of discrete synthetic data with ground-truth topology defined by three equally spaced developmental paths along directional pseudotime (PT) from a common source state (removed to discretize paths). (B) 2D embeddings by 11 dimensionality reduction tools showing unique paths defined in discrete simulation. (C) Same as in (B), with overlay of PT values for each cell as defined in simulation. (D) 2D histograms showing correlation of pairwise distances between cells in each of the three developmental paths with the sum of PT values between each pair of cells as ground-truth topology. (E) Summary of correlation and EMD values between cells in each path for all dimensionality reduction methods. (F) Diagram of continuous synthetic data with ground-truth topology defined by three developmental paths along directional PT from a common source state. (G) 2D embeddings by 11 dimensionality reduction tools showing unique paths defined in continuous simulation. (H) Same as in (G), with overlay of PT values for each cell. (J) 2D histograms showing correlation of pairwise distances between cells in each of the three developmental paths with the sum of PT values between each pair of cells as ground-truth topology. (K) Summary of correlation and EMD values between cells in each path of continuous simulation for all dimensionality reduction methods. See also Figure S4 and Table S2.
