Supplementary Materials

1 downloads 9 Views 3MB Size Report
web service to help them run TSCAN directly online without the need to install ... Supplementary Table 3 (xlsx format): Cell orderings produced by different.

Supplementary Materials

Installation of TSCAN software TSCAN software package is distributed through GitHub [1] and Bioconductor [2]. The best way to use GSCA is to install and run it on users’ own computers. To do so, users have to first download and install R from [3]. Next, users can install the latest TSCAN from GitHub by typing the following commands in R: > if (!require("devtools")) + install.packages("devtools") > devtools::install_github("TSCAN","zji90") Alternatively, one can also install TSCAN from Bioconductor following the instructions in [4]. However, since the Bioconductor only updates its packages twice per year, TSCAN installed from Bioconductor may not be the most upto-date version. After installation, one can start the GUI by typing the following commands in R: > library(TSCAN) > TSCANui() Users are referred to a demonstration video at [5] to learn how to use TSCAN. Users who are familiar with R programming can also run TSCAN using command-line mode. The instructions for using TSCAN as R commands are included in the TSCAN package documentations. For users who only want to do a one-time analysis, we also created an online web service to help them run TSCAN directly online without the need to install R or TSCAN on their own computers. The link to the web service is provided on the TSCAN homepage at GitHub [1]. The online version, however, can be slow depending on the job load of the web server. The datasets used in this study can be downloaded from Github [6].

K-means TSCAN K-means TSCAN is a modified version of TSCAN where mclust is replaced by k-means clustering for clustering cells. All the other procedures in TSCAN

1

remain the same in k-means TSCAN. In order to determine the cluster number of k-means clustering, we used an approach similar to Figure 1E, with its x-axis changed to the cluster number and its y-axis changed to the proportion of total data variance unexplained by the clusters. More precisely, let ˜ = (E ˜ 1, . . . , E ˜ N ) be a matrix containing data for N cells. Each column in E the matrix corresponds to a cell. Suppose the N columns are clustered into ¯ (k) denote the mean of the k th cluster, and let E ¯ be the C clusters. Let E mean of all columns. Let C(i) denote the cluster membership of the ith cell.

2 PN

˜ ¯ The total data variance is defined as SST =

Ei − E

where k.k repi=1

resents l2 norm. The variance

unexplained

2 by the cluster structure is defined PC P

˜ (k) ¯ as SSW = k=1 i:C(i)=k Ei − E

. The proportion of total data variance unexplained by the cluster structure is SSW/SST . The codes for k-means TSCAN are provided at [7].

Waterfall, SCUBA and Wanderlust for HSMM, LPS and qNSC data The codes of Waterfall, SCUBA and Wanderlust for analyzing HSMM, LPS and qNSC data are provided at [7].

References [1] TSCAN home page on Github. https://github.com/zji90/TSCAN [2] Bioconductor. http://www.bioconductor.org/install/ [3] R-project. http://www.r-project.org/ [4] TSCAN Bioconductor page. http://www.bioconductor.org/packages/release/bioc/html/TSCAN.html [5] TSCAN Demo Video. https://www.youtube.com/watch?v=zdcBAVe1GBE [6] TSCAN datasets on Github. https://github.com/zji90/TSCANdata [7] K-means TSCAN and Waterfall codes for HSMM, LPS and qNSC data. https://github.com/zji90/TSCANdata/archive/master.zip

2

List of Supplementary Tables • Supplementary Table 1 (xlsx format): The correspondence between sample identifiers and data collection time in the HSMM and LPS data. The table has multiple sheets, one for each dataset. • Supplementary Table 2 (xlsx format): Comparison of functionalities of different single-cell analysis methods. • Supplementary Table 3 (xlsx format): Cell orderings produced by different methods for each dataset. The table has multiple sheets, one for each dataset and analysis. • Supplementary Table 4 (xlsx format): Gold standard differential genes for each dataset. The table has multiple sheets, one for each dataset.

List of Supplementary Figures Supplementary Figures 1-5 are provided below.

3

MEF2C

4 3 2

Expression

1 0 0

24

48

72

MYH2

1.0

0.5

0.0 0

24

48

72

Time of Collection (Hour)

Supplementary Figure 1: Averaged bulk gene expression level for MEF2C and MYH2 in HSMM data.

4

Gene

9 6 3 0



MEF2C

TSCAN

● ●● ● ● ●● ● ● ● ● ●●● ● ● ●●● ● ● ●● ● ● ● ● ● ●● ● ●● ●●● ● ● ● ●● ●● ● ●● ●● ● ● ● ●● ●●●● ● ●● ● ● ●● ●● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ●● ● ●●● ●● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ●● ●●● ● ● ● ● ● ●● ● ● ● ●● ● ●● ● ●● ●●●● ● ● ● ● ● ● ●● ● ● ●●● ●● ● ●● ● ●

● ●● ● ●● ● ● ● ● ●●●● ● ●●● ● ●●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ●● ● ●●●● ●● ● ● ● ●● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ●● ● ● ●● ●● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ●●●●● ● ● ● ● ●● ● ● ●●● ● ● ● ●● ● ●● ●

50

100

150

Expression

3 0



9 6 3 0

200

0

3 0

● ● ● ●● ● ●● ● ● ● ● ●●●● ● ● ● ● ●● ● ●●●● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ●●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ●● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ●●●● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ●● ● ● ● ● ●● ●● ●●●● ●● ● ● ●●● ●●

30

60

90



9 6 3 0

120

0

0

100

150

● ●● ● ● ● ● ● ●● ● ● ●● ● ●● ● ● ●● ●● ● ● ● ● ● ● ● ●● ●● ● ● ● ●● ● ●●●●●● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ●●● ●●● ● ● ●● ● ●● ●●● ● ●● ●● ● ● ●● ● ● ● ● ● ● ●●●●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ● ●● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ●● ●●●●● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ●● ● ●● ● ●● ● ●●● ● ● ● ● ●● ●● ●● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ●● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ● ●●● ● ● ●● ● ● ●● ● ● ●● ●● ● ● ●●● ● ● ●●● ●● ● ● ●● ● ●

● ●● ● ● ●●●● ●● ● ●●● ●● ● ●● ● ●● ● ● ● ● ● ●● ●● ●●●●● ● ●● ● ●● ●● ● ● ●● ● ●● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ●● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ●● ● ● ●● ●● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ●● ●● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ●● ● ●● ● ● ● ● ●●● ●● ●●● ● ● ● ● ● ● ●● ●● ● ● ● ●● ● ● ● ● ●● ● ● ●

100



9 6 3 0

200

0

● ● ●● ●● ●● ● ●● ●● ●● ● ●●●● ●● ● ● ● ● ● ● ●● ●●● ●● ● ●● ● ●● ● ● ●● ● ● ● ● ●● ● ●● ● ● ● ●● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ●●● ●● ●●● ● ●● ●●● ● ●● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ●● ●● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●●●● ● ● ●● ●● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ●● ● ●●●● ●● ● ● ● ●● ● ● ● ●● ● ● ● ● ● ●● ● ● ●● ●● ● ● ●● ● ● ● ● ● ●● ● ● ●● ● ●● ●●● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ●●● ●● ● ●● ● ● ● ● ●●● ●● ●●●

0

100

50

100

150

200

Wanderlust



3

50



Waterfall

Scuba

6





Kmeans

0

9

150

● ● ● ●●●●●● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ●●● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ● ● ●● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ●● ● ● ●● ● ●



6

100 Marker

0

9

50

nocluTSCAN ●

6

ENO3

Monocle

0

9

MYH2

200

● ● ● ●● ●● ● ● ● ● ●● ●●●● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ●●● ●●●● ● ● ● ● ●●● ● ● ● ● ●● ● ●● ● ● ●● ● ● ●●● ● ● ●● ● ● ●● ● ●● ● ● ●● ● ●● ● ● ● ● ● ● ● ●● ● ● ●●● ●● ● ● ● ●●● ● ●● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ●● ●● ● ● ● ● ● ●● ●● ● ● ● ●● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ●● ●● ● ●● ●● ● ● ●● ●● ●● ● ●● ● ●●●● ● ●●● ●●●●● ● ●●● ● ● ●● ● ● ●● ● ● ● ●● ●● ● ● ●● ● ●●● ●● ●● ●●● ●● ●● ● ● ●●● ● ● ●● ● ●●● ● ● ● ● ●● ● ● ● ● ●



9 6 3 0

0

100

200

Pseudotime

Supplementary Figure 2: MEF2C and MYH2 expression patterns in HSMM dataset where pseudo-time was constructed based on 518 a priori chosen genes. MEF2C and MYH2 expression in each cell is plotted as a function of cell order on the analyzed pseudo-time axis. The curves are the fitted GAM function. The dashed curve is the GAM fit for ENO3, the marker used to determine the path direction.

5

Supplementary Figure 3: Comparing the cell ordering constructed using 518 prior genes and the cell ordering obtained without using these genes in the HSMM dataset. (A) Similarity score between the two orderings for each method. (B) The number of common genes among the top R differentially expressed genes detected by the two cell orderings is plotted as a function of R. (C) The number of common genes with consistent change directions among the top R differentially expressed genes detected by the two cell orderings is plotted as a function of R. In order to determine if a gene has consistent change direction in the two cell orderings, the fitted GAM functions of the gene from the two cell orderings are compared as follows. First, the pseudo-time axes for both cell orderings are linearly scaled to interval [0,1], and the GAM functions are scaled accordingly. Next, values of the GAM functions are extracted at 100 evenly spaced pseudotime points (i.e., 0.01, 0.02, ..., 1), and then the Pearson’s correlation between the two extracted vectors (representing the two GAM functions) is computed. Genes with negative correlation are viewed as inconsistent between the two cell orderings.

6

Monocle

12

● ● ●● ● ● ● ● ● ● ●●● ● ● ● ● ● ●●● ●● ●●● ● ● ●● ● ● ● ●● ● ● ●● ● ●● ●● ● ● ● ●● ●● ●●● ● ● ●● ●● ● ● ● ● ●● ●● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ●●● ● ●● ● ● ● ● ●● ● ● ● ● ●● ● ● ●●● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●



9



6 3

TSCAN

12



● ●



●● ● ● ● ● ● ● ● ●●● ● ●● ● ● ●● ● ●



3

● ● ● ● ●● ● ● ●●● ●●

100





0

50

100

nocluTSCAN

12 9 6

Expression

12

● ●●● ● ●● ●● ●●● ●



● ●

0

9 ●

● ● ●● ● ● ● ●● ● ●● ●● ● ●● ● ●●

● ●

50

6

● ● ● ● ● ●● ● ●● ● ●● ●●●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ● ●● ●● ● ●●● ● ● ●●● ● ● ●● ● ● ● ● ● ●●● ●●● ● ● ● ●●● ● ● ●● ●●● ●● ● ●● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●

3

● ●

●● ● ●● ● ● ●

● ●

● ● ●



9 6

100

150

3

●● ●





12

50

6 3

● ●●

100

9 6

100

150

●●

0



● ●



● ●●

● ● ● ● ●●

12

50

100

● ● ● ● ● ● ●● ●● ●●● ● ● ● ● ● ●● ● ● ● ● ●●●●● ● ● ●● ●●●● ●● ● ● ●●● ● ● ●● ●● ●●● ●●● ● ●● ● ●● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ●

9 6

●● ● ●● ● ●●●● ●● ●● ● ● ● ●● ●

3

● ●● ● ●● ● ● ● ● ● ●





●●

● ● ● ●

●● ●● ●

● ● ● ●

● ●

● ●



● ● ●



0



0

● ●

Wanderlust







● ●





0

● ● ● ● ●● ● ●●● ●● ● ●● ●●●● ● ● ●● ● ● ● ●● ●●●● ● ● ●● ●● ● ●● ● ● ● ●●● ● ●●● ● ● ● ● ● ● ● ● ● ●● ●● ● ●● ● ● ●● ●● ● ● ● ●●● ● ●●● ●●●●● ●● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ●





● ● ● ● ● ● ● ●

0



3

150



●● ● ● ●●

Scuba

12



● ● ● ●● ● ●●● ● ● ●● ●●● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ●● ● ●● ●● ● ●● ●● ●● ● ● ● ● ● ● ● ●● ● ● ●● ●●●● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ●● ●

9



● ●● ●● ● ● ● ● ● ●●● ● ●● ●● ● ●● ●● ● ●● ●

50





Waterfall

0 0





0

● ● ● ● ● ● ● ● ● ● ● ● ● ●●●●● ● ● ● ●●●● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ●●● ● ●● ●●● ● ● ●● ● ● ● ●●●●● ● ●●●● ● ●● ● ●● ●● ●● ● ● ● ● ● ● ●● ● ●●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●



● ● ●● ●● ● ● ●● ● ●●

Kmeans

12



● ●

0



0

150

Marker

● ●● ● ●● ●● ● ●● ● ● ● ●●● ● ●● ● ●● ● ●●●● ●● ● ● ●● ● ●● ●● ●● ●● ● ●● ●● ●● ●● ● ● ● ● ● ● ● ●●● ●●● ● ●● ●●● ● ●●● ● ● ●● ● ● ● ●● ●● ● ● ●● ● ●● ● ● ●●● ● ● ● ● ● ●● ● ●● ● ● ●● ● ●● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●

3





0

150

● ● ● ●● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ●● ● ● ●● ●

●● ● ● ● ● ● ●● ● ●● ● ● ●● ●● ● ● ●● ● ●●● ●● ● ●● ● ● ●





● ●

50

6



0 0

9

●●

● ● ●● ● ● ● ● ●● ● ●● ● ● ●● ●● ● ● ●●●● ● ●● ●● ● ● ●● ● ●●● ● ●●●● ● ●●●●●● ●● ● ●● ● ●● ●● ● ●● ●● ● ●●● ● ●● ●● ● ● ● ●● ●● ●● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●

50

100

150



0

50

100

150

Pseudotime

Supplementary Figure 4: SOX9 expression patterns in qNSC dataset. SOX9 expression in each cell is plotted as a function of cell order on the pseudo-time axis. The orange curve is the fitted GAM function.

7

00

00

SG

EN

45

01

5

6.

38

00

SG

EN

57

01

00

3

6.

45

T1_40 T1_2 T1_26 T1_20 T1_13 T1_14 T1_50 T1_55 T1_5 T1_3 T1_47 T1_23 T1_39 T1_41 T1_15 T1_54 T1_10 T1_33 T1_59 T1_58 T1_34 T4_44 T1_30 T1_45 T1_24 T1_60 T1_1 T1_48 T1_38 T1_28 T1_43 T1_46 T3_7 T1_4 T1_31 T1_12 T1_7 T4_45 T1_57 T1_29 T1_65 T4_12 T1_35 T1_67 T1_64 T1_53 T3_53 T1_66 T1_27 T1_69 T1_19 T1_49 T1_51 T1_25 T1_44 T1_56 T1_36 T1_32 T1_52 T1_62 T1_68 T1_9 T4_39 T1_18 T1_8 T2_26 T2_20 T2_34 T2_50 T3_73 T3_23 T4_22 T2_32 T3_76 T4_13 T3_15 T3_25 T3_6 T2_14 T2_52 T1_63 T2_4 T3_56 T1_37 T2_9 T2_17 T2_64 T2_53 T4_46 T2_44 T2_66 T1_11 T1_16 T2_68 T2_1 T3_69 T3_40 T2_7 T3_60 T2_45 T2_31 T4_28 T3_33 T4_41 T3_38 T2_72 T2_42 T3_78 T4_8 T2_67 T4_47 T2_73 T2_33 T2_49 T3_51 T2_60 T2_71 T3_12 T3_46 T3_34 T3_62 T2_10 T3_48 T2_74 T3_28 T2_8 T3_58 T3_44 T3_55 T1_61 T2_18 T4_14 T3_9 T4_17 T2_56 T2_69 T3_21 T4_37 T2_40 T2_6 T1_6 T4_48 T3_68 T4_36 T4_19 T3_42 T2_16 T3_3 T2_51 T2_59 T2_25 T4_25 T3_18 T4_3 T4_33 T3_49 T4_9 T3_30 T4_27 T2_63 T4_24 T2_39 T4_7 T3_52 T3_16 T3_61 T4_11

0

C 100

Count

A B

PDGFRA

SPHK1

Color Key and Histogram

Row Z−Score −1 1 2 3

CCNA2

8 CCNB2

Supplementary Figure 5: Further demonstration of TSCAN GUI. (A) Users can set up trimming criteria by choosing gene names and specifying expression cutoffs. (B) TSCAN excludes cells that meet all trimming criteria. (C) Users can also visualize the expression of specified genes along pseudo-time as heatmaps.