Widespread intronic polyadenylation diversifies

2 downloads 0 Views 6MB Size Report
t(11;14), del(1q), inv(9)(p12q13). Y Y. MM12 3. II IgM KLC t(11;14), del(1q). Y Y. MM13 2. II IgG Lambda. Normal. N Y. MM14 1. I IgG kappa t(14;16) --> IGH-MAF.
Supplementary File

Widespread intronic polyadenylation diversifies immune cell transcriptomes Singh et al

Supplementary Figure 1 a

IpA events

Random events 3 log2 (Upstream/Downstream)

log2 (Upstream/Downstream)

6 4 2 0 −2 −4

2 1 0 −1 −2

Total peaks(4802)

−6

Validated by RNA−seq, GLM (612)

0

5

−3 −2

10

0

4

6

8

10

Mean of normalized counts

Mean of normalized counts

b

2

c Annotated

RNA−seq, GLM 81

214

186 682

1264

1.00

Annotated RNA−seq, GLM Other protocols RNA−seq, polyAreads 186 Highly expressed Not validated

0.75

0.50 IpA atlas

721 0.25 1332

Other protocols

0.00

Supplementary Figure 1: Validation of IpA events. a) Differential read coverage of RNA-seq data up- and downstream of IpA events was tested using a GLM. If significantly higher coverage was found upstream versus downstream of the IpA event, the IpA isoform was considered validated and highlighted in red (FDR-adjusted P < 0.1). The same test was performed on random events obtained up- and downstream of IpA sites without read evidence and did not yield significant events (right panel). b) Venn diagram showing the number of events that were validated by RNA-seq, other 3' end sequencing protocols, and database annotation. c) Atlas of high confidence IpA isoforms from 3'-seq. To validate IpA isoforms additional evidence from different sources was used sequentially. In the final IpA isoform atlas, 39.51% (n = 2,241) coincide with annotated isoforms in RefSeq, UCSC genes or Ensembl databases; 16.0% (n = 907) lack annotation but are validated by RNA-seq by the GLM-based test; 23.50% (n = 1,332) lack RNA-seq or annotation support but are reported in data sets generated with other 3' end sequencing protocols; 2.19% (n = 124) lack the previous sources of evidence but are corroborated by untemplated polyA reads using RNA-seq data; and the remaining 5.7% (n = 323) are highly expressed in at least one sample (> 10 TPM). 13.1% (n = 743) of IpA events could not be validated by any of these methods and were removed from further analyses.

Supplementary Figure 2

3'UTR start

5 10 3ss score

1.5

1.0

Low AT no IpA genes High AT no IpA genes Low AT IpA genes High AT IpA genes

80 60 40 20 0

TSS

o

Ip A

n

n

le

Skipped terminal exon upstream 3ss

15

100

0.0

xo

xo le

in a

te rm

Skipped terminal exon downstream 3ss

f

0.5

N

Trancription unit width

in a te rm

0.25

3'UTR start

ne s

TSS

pp

No IpA introns (skipped terminal exon genes)

ge

0.5

ed

te si om po C

0.50

A

1.0

3

No IpA introns (composite terminal exon genes)

Ip

1.5

10

0.75

o

Low AT no IpA genes High AT no IpA genes Low AT IpA genes High AT IpA genes

Frequency of U1 snrnp signal per kb per gene

e

4

Composite terminal exon

0

10

10

Sk i

n

le

1.00

AT−content percentage

5

5

1.00

0.00 5ss score

10

Ip A

0

le

0.50 0.75 Retained CDR

No IpA introns (skipped terminal exon genes) 0

Frequency of polyA signal per kb per gene

10

in a

0.25

0.00

0.0

1

xo

0.00

Skipped terminal exon

0.25

10

te rm te si

0

1.00

om po

50

No IpA introns (composite terminal exon genes)

0.50

2.0

Intron length

100

Composite terminal exon

0.75

2

ge ne s

0.50 0.75 Retained CDR

1.00

10

N

0.25

3

Ip A

0.00

10

C

20

150

Probability

Probability

Skipped terminal exon

Number of IpA isoforms

Number of IpA isoforms

40

0

d

Composite terminal exon

60

4

Skipped terminal exon

Composite terminal exon

b

10

in a

AAA

CDS

AAA

n

AAA

o

CDS

3'-most exon

5

xo

3'UTR

CDS CDS

10

N

Alternative exon

te rm

Intron

c

pA

Downstream 3′SS

ed

Exon

pA

Upstream 3′SS

pp

pA

5′SS

Sk i

a

Supplementary Figure 2: Composite vs skipped terminal IpA. a) Composite terminal exon events result from loss of recognition of 5'ss while skipped terminal exon results in inclusion of a new exon. b) Composite terminal exon recognizes IpA sites uniformly across the transcription unit while skipped terminal exon are predominant near the TSS. c) Skipped terminal exon occurs in longer introns compared to composite terminal exon or introns of genes with no IpA (one-sided Wilcoxon rank-sum test, P < 10-20). They also occur in longer transcription units (one-sided Wilcoxon rank-sum test, P < 10-20). d) 5ss of composite terminal exon is weaker than the 5ss of the other introns of the same genes (one-sided KS test, P < 10-09). 5ss of skipped terminal exon is stronger than the 5ss of the other introns of the same genes (one-sided KS test, P < 10-06). Upstream 3ss of skipped terminal exon is weaker than the other introns of the same genes (one-sided KS test, P < 10-15). Downstream 3ss of skipped terminal exon is weaker than the other introns of the same genes (one-sided KS test, P < 10-09). e) There is significant enrichment of pA signals in IpA genes vs. non-IpA genes in both high and low AT regions (one-sided Wilcoxon signed-rank test, IpA high-AT vs nonIpA_high-AT, P < 10−17; IpA low-AT vs non-IpA_low-AT, P < 10−16) as well as depletion of U1 signals (one-sided Wilcoxon signed-rank test, IpA high-AT vs non-IpA_high-AT, P < 10−15; IpA low-AT vs non-IpA_low-AT, P < 10−10) f) IpA genes have higher AT-content compared to non-IpA genes (Wilcoxon rank-sum test, P < 10-20).

Supplementary Figure 3 Presence of domains in IpA vs no IpA genes

A

IpA genes No IpA genes

DNA binding present

284

810

DNA binding absent

3147

11400

RNA binding present

117

239

RNA binding absent

3314

11971

PPI present

794

2054

PPI absent

2637

10156

TMD present

673

2904

TMD absent

2758

9306

Active site present

435

1325

Active site absent

2996

10885

Repeats present

994

2826

Repeats absent

2437

9384

B

Fisher-exact p 0.001

1.42 x 10-6

1.2E-16 2.23 x 10-7

0.003 4.78 x 10-12

Preferential loss of domains within IpA genes(n = 1,410) Fisher-exact p DNA binding Other domains

Lost

303

4859

Retained

238

5194

0.0005

RNA binding Other domains Lost

46

5116

Retained

39

5393

PPI

Other domains

Lost

931

4231

Retained

783

4649

TMD

Other domains

Lost

751

4411

Retained

852

4580

Active site

Other domains

Lost

133

5029

Retained

175

5257

**

*

*

NS

4.5 x 10-7

0.10

0.04

NS

0.75 retained

0.50

lost

0.25

D TM

ing ind

site

Ab RN

tive

ing

Ac

I

Ab ind

PP

rou nd

0.00 Ba ckg

Fraction of domains

1.00

DN

C

0.32

Supplementary Figure 3: Statistics for domain enrichment in IpA genes. a) Domain information was obtained from UniProt and categorized (see Methods). IpA genes encode proteins enriched in DNA-binding, RNA-binding, PPI domains and active sites but are depleted for TMDs relative to genes that do not express IpA isoforms. Fisher’s exact test was performed for enrichment p values. b) Expression of IpA isoforms leads to preferential loss of DNA-binding and PPI domains but retains active sites of enzymes compared to the other domains present in IpA genes. Fisher’s exact test was performed on IpA isoforms that retain at least one domain. c) Fraction of domains lost and retained determined from (b)

Supplementary Figure 4

Fraction of transcripts

Retained CDR < 25% 25% < Retained CDR < 50% 50% < Retained CDR < 75% 75% < Retained CDR < 100%

0.4

0.2

0.0

0

Coding probability

1

Supplementary Figure 4: Predicted coding probability for IpA isoforms. Coding probability was assessed according to CPAT and shown for IpA isoforms with different fractions of retained CDR. Most of the IpA isoforms that retain less than 25% of the CDR are predicted to be non-coding.

Supplementary Figure 5 b

a

MM group 1 IpA site usage 1 0.8 0.6 0.4 0.2 0 MM group 1 MM group 2 MM group 3

MM6

MM11

MM8

MM14

MM4

MM3

MM9

MM5

MM7

MM10

PC1

MM13

PC2

MM15

MM2

MM12

MM1

PC

5 336

139 MM group 2

Supplementary Figure 5: Hierarchical clustering defining MM patient groups. a) Hierarchical clustering of PCs and MM patient samples using 50% of the most variable IpA isoforms by median absolute deviation (n = 554) identifies three groups of MM samples. b) Overlap of significantly lower used IpA sites in MM group 1 and group 2. MM group 3 is not shown as IpA site usage was very similar to PCs and only one IpA isoform was differentially used.

Supplementary Figure 6 a

b

Number of IpA isoforms

8

6

4

2

0 1.0

5 0.7

0 0.5

5 0.2

0 0.0

−0

.25

0

Correlation values between 3'-seq and RNA−seq IpA usage

Group A

Group B

Group C

Group D

0

0.2

0.4

0.6

0.8

1

RNA-seq IpA site usage 1.0

−2.5

Group A-filtered Group B-filtered 2.5

0.6 0.4

P < 0.05 −5.0

0.0 −5

0 PC1

CNOT1

NT5C2

STAG1

5

PMS1

0

500

ETNK1

OXR1

GALNT2

CCT4

NMRK1

CUL3

1000 Days

1500

B4GALT4 SLC30A5

−5

2000

TBP

0 PC1

5

ANKRD17

ELMO1

SFT2D1

WIPF1

DNAJB6

TIPRL

ACSL3

DPY19L1

CUL1

BLOC1S6

RNA−seq IpA site usage

RAB10

0.0

−2.5 0.2

−5.0

1

e

0.8

0.0

f

Group A Group B

PC2

PC2

2.5

d

Group A Group B Group C Group D

Progression free survival

c

0

RNA−seq IpA site usage

1

0

MVD

TNRC6B

ANKLE2

PRRC2C

Supplementary Figure 6: MM patients with high IpA usage have significantly improved progression free survival a) Distribution of correlation values of IpA usage determined from 3'-seq and RNA-seq for genes that do not recognize IpA sites in MM when compared to PCs (n = 114, FDRadjusted P < 0.05, usage difference > 0.25) b) Hierarchical clustering of RNA-seq IpA site usage of genes with high correlation between 3'-seq and RNA-seq usage (Pearson r > 0.75, n = 28) determined for an independent cohort of patients (n = 319) with RNA-seq expression. This clustering show separation of patients largely into two groups, group A – low IpA usage patients (n = 139) and group B - high IpA usage patients (n = 176). c) Visualization of separation of patient groups by principal component analysis. d) Group B patients with progression-free survival information (n = 160) have significantly improved (P < 0.05) progression free survival than group A patients (n = 126). e) As (c) after removing heterogeneous samples from both groups, group A-filtered (n = 73) and group B-filtered (n = 115). f) RNA-seq IpA site usage of the gene signature (n = 28) for samples of group A-filtered and group B-filtered from (e). Genes are ordered from high correlation between RNAseq and 3'-seq IpA site usage.

Supplementary Figure 7 a

ZnF 1, 2, 3, 4, 5, 6

IKZF1 IKZF1 IpA

IKZF1 position: 139

b

DDB1

519 aa 77 aa

ROC1

168

HTGERPFQCN QCGASFTQKG NLLRHIKLHS Core lenalidomide degron

NTD

CUL4A

DDB1

CUL4A IpA NTD

CTD

Supplementary Figure 7: Potential role of IKZF1-IpA and CUL4A-IpA expression in resistance to lenalidomide b) Loss of core lenalidomide degron sequence from IKZF1 IpA. a) DDB1-CUL4A-ROC1 complex (PDB id: 2HYE) and DDB1-CUL4A IpA complex. Fulllength CUL4A has 41-759 aa (top) while CUL4A IpA has 41-171 aa (bottom).

IR

e

0.1 0.2

Sk m us cl

Breast

IR

r: 0.50

No_IR

Skmuscle Brain Ovary

No_IR

T

Testis

ES

MM

IR

d

oo

Bl

0.15

No_IR

y

NISCH

IR

va r

O

0.0

IR

ES

No_IR

0.00

n

NISCH

Br ai

10 kb 0.05

IR

NISCH

No_IR

0

Fraction of genes with IpA isoforms

150

is

ea st

0

Te st

Br

N B

3'-seq 100

No_IR

d

oo

Bl

RNA-seq

a

IR

No_IR

IR

No_IR

B

B

5+

D

N

B

em

M C

B

C

G

NISCH

IR

No_IR

IR

No_IR

IR

No_IR

IR

No_IR

d PC

250

IR

Sk Bra mu in scl e M Ov M ary E Te S stis B PC GC rea B ( st ex t) NB GCB (ex t) MeNB m CD B Blo 5+B o NB d T (ex t) NB

Number of introns with IR and IpA

c

No_IR

IpA site usage

Supplementary Figure 8 b Blood NB (external)

Blood T Blood NB

0.10 PC GCB (external) GCB CD5+B

NB

MemB NB (external)

0.3 0.4

Fraction of genes with IR

Expected Observed

200

150

100

50

0

1.00

0.75

0.50

0.25

0.00

Supplementary Figure 8: Co-occurrence of IpA events with intron retention. a) Co-occurrence of IpA with intron retention (IR). Shown as in Fig. 1a. b) Correlation between IR and IpA isoform expression. Tissues with a higher number of genes with IR also have more genes that express IpA isoforms (Pearson correlation coefficient, r = 0.5). b) Higher incidence of IpA events in introns with IR. The number of introns where IpA and IR are observed simultaneously is much higher than expected by chance. c) IpA site usage differs between introns with or without IR. The distribution of IpA site usage was calculated for the two groups and plotted for each cell type. The median usage of the IpA isoforms that co-occurred with IR is lower compared to those occurring in introns that are not retained.

Supplementary Table 1. Characterization of normal human immune cells investigated by 3'-seq Sample

Derived from CD5+B Tonsil NB Tonsil NB Blood MemB Tonsil GCB Tonsil PC BM T Blood BM, bone marrow

Sample name CD5+B3-CD5+B6 NB3-NB4 NB1-NB2 M1-M2 GC1-GC2 PC1-PC2 T2-T3

Markers for sorting CD5+, CD19+ CD19+, CD27CD19+, CD27CD19+, CD27+ CD19+, CD38+ CD138+ CD3+

No of samples 4 2 2 2 2 2 2

Accession number GSE111310 GSE111310 SRP029953 GSE111310 GSE111310 GSE111310 GSE111310

MM3

1

I

MM4 MM5 MM6 MM7 MM8 MM9 MM10 MM11 MM12 MM13 MM14 MM15

1 2 1 2 1 1 2 1 3 2 1 3

III II II I I I II II II II I III

N, No; Y, Yes

IgA IgA IgG LLC IgG KLC IgA KLC IgA KLC Kappa LC Kappa LC IgG IgA KLC IgM KLC IgM KLC IgG Lambda IgG kappa Non-secretory PC leukemia

3'-seq

I I

RNA-seq

3 3

Cytogenetics

ISS stage at sample collection

MM1 MM2

Type

MM group

Supplementary Table 2. MM patient characteristics

t(11;14) --> IGH-CCND1

Y Y

Y Y

Y

Y

Y Y Y Y Y Y Y Y Y N N N

Y Y Y Y Y Y Y Y Y Y Y Y

t(11;14) --> IGH-CCND1 t(4;14), loss of chr 13 and 17 and other complex cytogenetic changes Hyperdiploid t(11;14), del(13q), deletion P53 t(11;14) --> IGH-CCND1 t(11;14) --> IGH-CCND1 t(11;14) --> IGH-CCND1 Hyperdiploid t(11;14), del(13q), deletion P53 t(11;14), del(1q), inv(9)(p12q13) t(11;14), del(1q) Normal t(14;16) --> IGH-MAF Complex

Supplementary Table 3. Characterization of normal human immune cells investigated by RNA-seq Sample CD5+B CD5+B NB NB MemB MemB GCB PC NB GCB NB

Derived from Tonsil Blood Tonsil Blood Tonsil Blood Tonsil BM Tonsil Tonsil Blood

T

Blood

Sample name

Markers for sorting

CD5+B3-CD5+B4 CD5+B2 NB3-NB5 NB1-NB2, NB6 M2, M6 M3-M5 GC1-4 PC4-PC21

CD5+, CD19+ CD5+, CD19+ CD19+, CD27CD19+, CD27CD19+, CD27+ CD19+, CD27+ CD19+, CD38+ CD138+ IgD+ CD77+ CD19+ CD5- CD27-

No of samples 2 1 3 3 2 3 4 18 4 4 2

CD3+

Other 3'-seq profiles were from Lianoglou et al. 2013 (SRP029953).

Accession number GSE111310 GSE111310 GSE111310 GSE111310 GSE111310 GSE111310 GSE111310 GSE111310 GSE45982 GSE45982 ERX397853 ERX397892 GSM1576415