t(11;14), del(1q), inv(9)(p12q13). Y Y. MM12 3. II IgM KLC t(11;14), del(1q). Y Y. MM13 2. II IgG Lambda. Normal. N Y. MM14 1. I IgG kappa t(14;16) --> IGH-MAF.
Supplementary File
Widespread intronic polyadenylation diversifies immune cell transcriptomes Singh et al
Supplementary Figure 1 a
IpA events
Random events 3 log2 (Upstream/Downstream)
log2 (Upstream/Downstream)
6 4 2 0 −2 −4
2 1 0 −1 −2
Total peaks(4802)
−6
Validated by RNA−seq, GLM (612)
0
5
−3 −2
10
0
4
6
8
10
Mean of normalized counts
Mean of normalized counts
b
2
c Annotated
RNA−seq, GLM 81
214
186 682
1264
1.00
Annotated RNA−seq, GLM Other protocols RNA−seq, polyAreads 186 Highly expressed Not validated
0.75
0.50 IpA atlas
721 0.25 1332
Other protocols
0.00
Supplementary Figure 1: Validation of IpA events. a) Differential read coverage of RNA-seq data up- and downstream of IpA events was tested using a GLM. If significantly higher coverage was found upstream versus downstream of the IpA event, the IpA isoform was considered validated and highlighted in red (FDR-adjusted P < 0.1). The same test was performed on random events obtained up- and downstream of IpA sites without read evidence and did not yield significant events (right panel). b) Venn diagram showing the number of events that were validated by RNA-seq, other 3' end sequencing protocols, and database annotation. c) Atlas of high confidence IpA isoforms from 3'-seq. To validate IpA isoforms additional evidence from different sources was used sequentially. In the final IpA isoform atlas, 39.51% (n = 2,241) coincide with annotated isoforms in RefSeq, UCSC genes or Ensembl databases; 16.0% (n = 907) lack annotation but are validated by RNA-seq by the GLM-based test; 23.50% (n = 1,332) lack RNA-seq or annotation support but are reported in data sets generated with other 3' end sequencing protocols; 2.19% (n = 124) lack the previous sources of evidence but are corroborated by untemplated polyA reads using RNA-seq data; and the remaining 5.7% (n = 323) are highly expressed in at least one sample (> 10 TPM). 13.1% (n = 743) of IpA events could not be validated by any of these methods and were removed from further analyses.
Supplementary Figure 2
3'UTR start
5 10 3ss score
1.5
1.0
Low AT no IpA genes High AT no IpA genes Low AT IpA genes High AT IpA genes
80 60 40 20 0
TSS
o
Ip A
n
n
le
Skipped terminal exon upstream 3ss
15
100
0.0
xo
xo le
in a
te rm
Skipped terminal exon downstream 3ss
f
0.5
N
Trancription unit width
in a te rm
0.25
3'UTR start
ne s
TSS
pp
No IpA introns (skipped terminal exon genes)
ge
0.5
ed
te si om po C
0.50
A
1.0
3
No IpA introns (composite terminal exon genes)
Ip
1.5
10
0.75
o
Low AT no IpA genes High AT no IpA genes Low AT IpA genes High AT IpA genes
Frequency of U1 snrnp signal per kb per gene
e
4
Composite terminal exon
0
10
10
Sk i
n
le
1.00
AT−content percentage
5
5
1.00
0.00 5ss score
10
Ip A
0
le
0.50 0.75 Retained CDR
No IpA introns (skipped terminal exon genes) 0
Frequency of polyA signal per kb per gene
10
in a
0.25
0.00
0.0
1
xo
0.00
Skipped terminal exon
0.25
10
te rm te si
0
1.00
om po
50
No IpA introns (composite terminal exon genes)
0.50
2.0
Intron length
100
Composite terminal exon
0.75
2
ge ne s
0.50 0.75 Retained CDR
1.00
10
N
0.25
3
Ip A
0.00
10
C
20
150
Probability
Probability
Skipped terminal exon
Number of IpA isoforms
Number of IpA isoforms
40
0
d
Composite terminal exon
60
4
Skipped terminal exon
Composite terminal exon
b
10
in a
AAA
CDS
AAA
n
AAA
o
CDS
3'-most exon
5
xo
3'UTR
CDS CDS
10
N
Alternative exon
te rm
Intron
c
pA
Downstream 3′SS
ed
Exon
pA
Upstream 3′SS
pp
pA
5′SS
Sk i
a
Supplementary Figure 2: Composite vs skipped terminal IpA. a) Composite terminal exon events result from loss of recognition of 5'ss while skipped terminal exon results in inclusion of a new exon. b) Composite terminal exon recognizes IpA sites uniformly across the transcription unit while skipped terminal exon are predominant near the TSS. c) Skipped terminal exon occurs in longer introns compared to composite terminal exon or introns of genes with no IpA (one-sided Wilcoxon rank-sum test, P < 10-20). They also occur in longer transcription units (one-sided Wilcoxon rank-sum test, P < 10-20). d) 5ss of composite terminal exon is weaker than the 5ss of the other introns of the same genes (one-sided KS test, P < 10-09). 5ss of skipped terminal exon is stronger than the 5ss of the other introns of the same genes (one-sided KS test, P < 10-06). Upstream 3ss of skipped terminal exon is weaker than the other introns of the same genes (one-sided KS test, P < 10-15). Downstream 3ss of skipped terminal exon is weaker than the other introns of the same genes (one-sided KS test, P < 10-09). e) There is significant enrichment of pA signals in IpA genes vs. non-IpA genes in both high and low AT regions (one-sided Wilcoxon signed-rank test, IpA high-AT vs nonIpA_high-AT, P < 10−17; IpA low-AT vs non-IpA_low-AT, P < 10−16) as well as depletion of U1 signals (one-sided Wilcoxon signed-rank test, IpA high-AT vs non-IpA_high-AT, P < 10−15; IpA low-AT vs non-IpA_low-AT, P < 10−10) f) IpA genes have higher AT-content compared to non-IpA genes (Wilcoxon rank-sum test, P < 10-20).
Supplementary Figure 3 Presence of domains in IpA vs no IpA genes
A
IpA genes No IpA genes
DNA binding present
284
810
DNA binding absent
3147
11400
RNA binding present
117
239
RNA binding absent
3314
11971
PPI present
794
2054
PPI absent
2637
10156
TMD present
673
2904
TMD absent
2758
9306
Active site present
435
1325
Active site absent
2996
10885
Repeats present
994
2826
Repeats absent
2437
9384
B
Fisher-exact p 0.001
1.42 x 10-6
1.2E-16 2.23 x 10-7
0.003 4.78 x 10-12
Preferential loss of domains within IpA genes(n = 1,410) Fisher-exact p DNA binding Other domains
Lost
303
4859
Retained
238
5194
0.0005
RNA binding Other domains Lost
46
5116
Retained
39
5393
PPI
Other domains
Lost
931
4231
Retained
783
4649
TMD
Other domains
Lost
751
4411
Retained
852
4580
Active site
Other domains
Lost
133
5029
Retained
175
5257
**
*
*
NS
4.5 x 10-7
0.10
0.04
NS
0.75 retained
0.50
lost
0.25
D TM
ing ind
site
Ab RN
tive
ing
Ac
I
Ab ind
PP
rou nd
0.00 Ba ckg
Fraction of domains
1.00
DN
C
0.32
Supplementary Figure 3: Statistics for domain enrichment in IpA genes. a) Domain information was obtained from UniProt and categorized (see Methods). IpA genes encode proteins enriched in DNA-binding, RNA-binding, PPI domains and active sites but are depleted for TMDs relative to genes that do not express IpA isoforms. Fisher’s exact test was performed for enrichment p values. b) Expression of IpA isoforms leads to preferential loss of DNA-binding and PPI domains but retains active sites of enzymes compared to the other domains present in IpA genes. Fisher’s exact test was performed on IpA isoforms that retain at least one domain. c) Fraction of domains lost and retained determined from (b)
Supplementary Figure 4
Fraction of transcripts
Retained CDR < 25% 25% < Retained CDR < 50% 50% < Retained CDR < 75% 75% < Retained CDR < 100%
0.4
0.2
0.0
0
Coding probability
1
Supplementary Figure 4: Predicted coding probability for IpA isoforms. Coding probability was assessed according to CPAT and shown for IpA isoforms with different fractions of retained CDR. Most of the IpA isoforms that retain less than 25% of the CDR are predicted to be non-coding.
Supplementary Figure 5 b
a
MM group 1 IpA site usage 1 0.8 0.6 0.4 0.2 0 MM group 1 MM group 2 MM group 3
MM6
MM11
MM8
MM14
MM4
MM3
MM9
MM5
MM7
MM10
PC1
MM13
PC2
MM15
MM2
MM12
MM1
PC
5 336
139 MM group 2
Supplementary Figure 5: Hierarchical clustering defining MM patient groups. a) Hierarchical clustering of PCs and MM patient samples using 50% of the most variable IpA isoforms by median absolute deviation (n = 554) identifies three groups of MM samples. b) Overlap of significantly lower used IpA sites in MM group 1 and group 2. MM group 3 is not shown as IpA site usage was very similar to PCs and only one IpA isoform was differentially used.
Supplementary Figure 6 a
b
Number of IpA isoforms
8
6
4
2
0 1.0
5 0.7
0 0.5
5 0.2
0 0.0
−0
.25
0
Correlation values between 3'-seq and RNA−seq IpA usage
Group A
Group B
Group C
Group D
0
0.2
0.4
0.6
0.8
1
RNA-seq IpA site usage 1.0
−2.5
Group A-filtered Group B-filtered 2.5
0.6 0.4
P < 0.05 −5.0
0.0 −5
0 PC1
CNOT1
NT5C2
STAG1
5
PMS1
0
500
ETNK1
OXR1
GALNT2
CCT4
NMRK1
CUL3
1000 Days
1500
B4GALT4 SLC30A5
−5
2000
TBP
0 PC1
5
ANKRD17
ELMO1
SFT2D1
WIPF1
DNAJB6
TIPRL
ACSL3
DPY19L1
CUL1
BLOC1S6
RNA−seq IpA site usage
RAB10
0.0
−2.5 0.2
−5.0
1
e
0.8
0.0
f
Group A Group B
PC2
PC2
2.5
d
Group A Group B Group C Group D
Progression free survival
c
0
RNA−seq IpA site usage
1
0
MVD
TNRC6B
ANKLE2
PRRC2C
Supplementary Figure 6: MM patients with high IpA usage have significantly improved progression free survival a) Distribution of correlation values of IpA usage determined from 3'-seq and RNA-seq for genes that do not recognize IpA sites in MM when compared to PCs (n = 114, FDRadjusted P < 0.05, usage difference > 0.25) b) Hierarchical clustering of RNA-seq IpA site usage of genes with high correlation between 3'-seq and RNA-seq usage (Pearson r > 0.75, n = 28) determined for an independent cohort of patients (n = 319) with RNA-seq expression. This clustering show separation of patients largely into two groups, group A – low IpA usage patients (n = 139) and group B - high IpA usage patients (n = 176). c) Visualization of separation of patient groups by principal component analysis. d) Group B patients with progression-free survival information (n = 160) have significantly improved (P < 0.05) progression free survival than group A patients (n = 126). e) As (c) after removing heterogeneous samples from both groups, group A-filtered (n = 73) and group B-filtered (n = 115). f) RNA-seq IpA site usage of the gene signature (n = 28) for samples of group A-filtered and group B-filtered from (e). Genes are ordered from high correlation between RNAseq and 3'-seq IpA site usage.
Supplementary Figure 7 a
ZnF 1, 2, 3, 4, 5, 6
IKZF1 IKZF1 IpA
IKZF1 position: 139
b
DDB1
519 aa 77 aa
ROC1
168
HTGERPFQCN QCGASFTQKG NLLRHIKLHS Core lenalidomide degron
NTD
CUL4A
DDB1
CUL4A IpA NTD
CTD
Supplementary Figure 7: Potential role of IKZF1-IpA and CUL4A-IpA expression in resistance to lenalidomide b) Loss of core lenalidomide degron sequence from IKZF1 IpA. a) DDB1-CUL4A-ROC1 complex (PDB id: 2HYE) and DDB1-CUL4A IpA complex. Fulllength CUL4A has 41-759 aa (top) while CUL4A IpA has 41-171 aa (bottom).
IR
e
0.1 0.2
Sk m us cl
Breast
IR
r: 0.50
No_IR
Skmuscle Brain Ovary
No_IR
T
Testis
ES
MM
IR
d
oo
Bl
0.15
No_IR
y
NISCH
IR
va r
O
0.0
IR
ES
No_IR
0.00
n
NISCH
Br ai
10 kb 0.05
IR
NISCH
No_IR
0
Fraction of genes with IpA isoforms
150
is
ea st
0
Te st
Br
N B
3'-seq 100
No_IR
d
oo
Bl
RNA-seq
a
IR
No_IR
IR
No_IR
B
B
5+
D
N
B
em
M C
B
C
G
NISCH
IR
No_IR
IR
No_IR
IR
No_IR
IR
No_IR
d PC
250
IR
Sk Bra mu in scl e M Ov M ary E Te S stis B PC GC rea B ( st ex t) NB GCB (ex t) MeNB m CD B Blo 5+B o NB d T (ex t) NB
Number of introns with IR and IpA
c
No_IR
IpA site usage
Supplementary Figure 8 b Blood NB (external)
Blood T Blood NB
0.10 PC GCB (external) GCB CD5+B
NB
MemB NB (external)
0.3 0.4
Fraction of genes with IR
Expected Observed
200
150
100
50
0
1.00
0.75
0.50
0.25
0.00
Supplementary Figure 8: Co-occurrence of IpA events with intron retention. a) Co-occurrence of IpA with intron retention (IR). Shown as in Fig. 1a. b) Correlation between IR and IpA isoform expression. Tissues with a higher number of genes with IR also have more genes that express IpA isoforms (Pearson correlation coefficient, r = 0.5). b) Higher incidence of IpA events in introns with IR. The number of introns where IpA and IR are observed simultaneously is much higher than expected by chance. c) IpA site usage differs between introns with or without IR. The distribution of IpA site usage was calculated for the two groups and plotted for each cell type. The median usage of the IpA isoforms that co-occurred with IR is lower compared to those occurring in introns that are not retained.
Supplementary Table 1. Characterization of normal human immune cells investigated by 3'-seq Sample
Derived from CD5+B Tonsil NB Tonsil NB Blood MemB Tonsil GCB Tonsil PC BM T Blood BM, bone marrow
Sample name CD5+B3-CD5+B6 NB3-NB4 NB1-NB2 M1-M2 GC1-GC2 PC1-PC2 T2-T3
Markers for sorting CD5+, CD19+ CD19+, CD27CD19+, CD27CD19+, CD27+ CD19+, CD38+ CD138+ CD3+
No of samples 4 2 2 2 2 2 2
Accession number GSE111310 GSE111310 SRP029953 GSE111310 GSE111310 GSE111310 GSE111310
MM3
1
I
MM4 MM5 MM6 MM7 MM8 MM9 MM10 MM11 MM12 MM13 MM14 MM15
1 2 1 2 1 1 2 1 3 2 1 3
III II II I I I II II II II I III
N, No; Y, Yes
IgA IgA IgG LLC IgG KLC IgA KLC IgA KLC Kappa LC Kappa LC IgG IgA KLC IgM KLC IgM KLC IgG Lambda IgG kappa Non-secretory PC leukemia
3'-seq
I I
RNA-seq
3 3
Cytogenetics
ISS stage at sample collection
MM1 MM2
Type
MM group
Supplementary Table 2. MM patient characteristics
t(11;14) --> IGH-CCND1
Y Y
Y Y
Y
Y
Y Y Y Y Y Y Y Y Y N N N
Y Y Y Y Y Y Y Y Y Y Y Y
t(11;14) --> IGH-CCND1 t(4;14), loss of chr 13 and 17 and other complex cytogenetic changes Hyperdiploid t(11;14), del(13q), deletion P53 t(11;14) --> IGH-CCND1 t(11;14) --> IGH-CCND1 t(11;14) --> IGH-CCND1 Hyperdiploid t(11;14), del(13q), deletion P53 t(11;14), del(1q), inv(9)(p12q13) t(11;14), del(1q) Normal t(14;16) --> IGH-MAF Complex
Supplementary Table 3. Characterization of normal human immune cells investigated by RNA-seq Sample CD5+B CD5+B NB NB MemB MemB GCB PC NB GCB NB
Derived from Tonsil Blood Tonsil Blood Tonsil Blood Tonsil BM Tonsil Tonsil Blood
T
Blood
Sample name
Markers for sorting
CD5+B3-CD5+B4 CD5+B2 NB3-NB5 NB1-NB2, NB6 M2, M6 M3-M5 GC1-4 PC4-PC21
CD5+, CD19+ CD5+, CD19+ CD19+, CD27CD19+, CD27CD19+, CD27+ CD19+, CD27+ CD19+, CD38+ CD138+ IgD+ CD77+ CD19+ CD5- CD27-
No of samples 2 1 3 3 2 3 4 18 4 4 2
CD3+
Other 3'-seq profiles were from Lianoglou et al. 2013 (SRP029953).
Accession number GSE111310 GSE111310 GSE111310 GSE111310 GSE111310 GSE111310 GSE111310 GSE111310 GSE45982 GSE45982 ERX397853 ERX397892 GSM1576415