New Statistical Models for Copolymerization

4 downloads 3273 Views 996KB Size Report
Jun 22, 2016 - This set of ODEs can be solved fast but does not convey any information on the ... system and solved in Python. For the Monte ...... A web-server predicting complex biopolymer targets in plasmodium with entropy measures of.
polymers Article

New Statistical Models for Copolymerization Martin S. Engler 1 , Kerstin Scheubert 1 , Ulrich S. Schubert 2,3 and Sebastian Böcker 1,3, * 1 2 3

*

Chair of Bioinformatics, Friedrich Schiller University Jena, Ernst-Abbe-Platz 2, 07743 Jena, Germany; [email protected] (M.S.E.); [email protected] (K.S.) Laboratory of Organic and Macromolecular Chemistry (IOMC), Friedrich Schiller University Jena, Humboldtstr. 10, 07743 Jena, Germany; [email protected] Jena Center for Soft Matter (JCMS), Friedrich Schiller University Jena, Philosophenweg 7, 07743 Jena, Germany Correspondence: [email protected]; Tel.: +49-3641-9-46451

Academic Editor: Carlo Cavallotti Received: 9 May 2016; Accepted: 15 June 2016; Published: 22 June 2016

Abstract: For many years, copolymerization has been studied using mathematical and statistical models. Here, we present new Markov chain models for copolymerization kinetics: the Bernoulli and Geometric models. They model copolymer synthesis as a random process and are based on a basic reaction scheme. In contrast to previous Markov chain approaches to copolymerization, both models take variable chain lengths and time-dependent monomer probabilities into account and allow for computing sequence likelihoods and copolymer fingerprints. Fingerprints can be computed from copolymer mass spectra, potentially allowing us to estimate the model parameters from measured fingerprints. We compare both models against Monte Carlo simulations. We find that computing the models is fast and memory efficient. Keywords: copolymer kinetics; copolymer fingerprint; Markov model

1. Introduction Copolymerization is a random process, where two or more monomer species are mixed to form polymer chains. We investigate binary copolymers with two monomer types A and B. In the past, several approaches to model copolymerization were proposed. The well-known terminal model by Mayo and Lewis describes four propagation reactions and is determined by the reactivity ratios of the monomers [1]. There are three different computational approaches to such a basic reaction scheme and each approach has certain disadvantages. The reaction scheme can be modeled as a set of ordinary differential equations (ODE), a discrete Markov chain or simulated with Monte Carlo methods. Mayo and Lewis described the scheme as a set of ODEs and deduced the copolymer equation, which provides the copolymer composition. This set of ODEs can be solved fast but does not convey any information on the chain sequences. Kryven and Iedema advanced the ODE approach by applying population balance equations [2]. They showed the importance of recovering “distributions in a full form rather than averages, since average values may often be far from the most frequently occurring ones” ([2], p. 305). They were able to extract simple sequence patterns, but not the full distribution of sequences. Markov and Hidden Markov models are frequently used in the analysis of polymers and biopolymers, for example de novo peptide sequencing [3], detection of gene promoter regions [4], or prediction of quantitative structure–property relationships [5–8] for cellular recognition [9], or drug–DNA [10] or protein–protein interactions [11]. The transformation of the traditional Mayo–Lewis model to a Markov chain is straightforward and the resulting Markov chain can be used to compute the probability of a single copolymer chain [12], but not the distribution of all chains.

Polymers 2016, 8, 240; doi:10.3390/polym8060240

www.mdpi.com/journal/polymers

Polymers 2016, 8, 240

2 of 13

Gillespie introduced Monte Carlo methods to cheminformatics to simulate chemical reactions [13]. Gillespie’s algorithm has been frequently used to simulate copolymerizations by randomly growing copolymer chains [14–16]. Several times, Monte Carlo simulations have been evaluated against experimental data [17–19] and it has been shown that Gillespie’s algorithm can be used to compute copolymer fingerprints [20–22]. However, Monte Carlo simulations are time- and memory-intensive, in particular if an accurate representation of the distribution of copolymer chains is desired. The distribution of copolymer chains can be represented using fingerprints. A copolymer fingerprint is the 2D compositional distribution, i.e., the abundance of each possible combination of monomer counts. The fingerprints can be estimated experimentally using mass spectrometry (MS), though the computational transformation of the experimental data to a fingerprint is a non-trivial problem [23]. Matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF) MS is frequently applied to characterize (co-)polymers [24,25] and can be used to estimate fingerprints [23,26–28]. Most recently, we proposed a method to correct fingerprints for mass discrimination effects [29] a long-known issue of MALDI ionization [30–33]. In this publication, we propose two new Markov chain models for copolymerization kinetics: the Bernoulli and the Geometric model based on a simple reaction scheme. Different to Mayo and Lewis [1], our model allows for variable chain lengths and time-dependent monomer probabilities. The accuracy of Monte Carlo simulations depends on the number of simulated chains, the simulated distribution converges to the true distribution with an increasing number. This makes accurate computations time- and memory-intensive. In contrast to Monte Carlo simulations, our models are exact and fast. We implement a simple copolymerization scheme using ODEs and Monte Carlo simulations. We verify the Monte Carlo simulations with the ODE system. We evaluate our models against the fingerprints and copolymer chains computed by Monte Carlo. 2. Materials and Methods For evaluation against Monte Carlo methods, we consider the following simple copolymerization scheme. We denote an active center as X• , and a polymer chain ending with X as ∼X, where X can be one of the monomers A or B, or initiator I. We model two types of reactions, initiation and propagation reactions:  kIA I + A −− → ∼ A•  k

IB I + B −− → ∼ B•

Initiation



 k AA ∼A• + A −− → ∼ A•       k  AB • • ∼A + B −−→ ∼B  Propagation. k BA   ∼B• + A −− → ∼ A•      k  BB ∼B• + B −−→ ∼B• −1 We use the reaction rate coefficient kAA = 1.0 L · mol−1 · s−1 and the reactivity ratios rA = rB with the values 0.01, 0.05, and 0.1, plus values in the range from 0.25 to 2.0 with step size 0.25. All other reaction rate coefficients are 1.0 L · mol−1 · s−1 . The initial amounts of A, B, and I are nA = 1 mol, nB = 2 mol, and nI = 0.0094 mol, respectively. The reaction scheme was implemented as an ODE system and solved in Python. For the Monte Carlo simulations, we use 102 to 106 polymer chains and the same parameters as for the ODE system. The simulations are stopped at full conversion of A and B or if the simulated reaction time reaches 103 s. We repeat the simulations ten times for each reactivity ratio and chain number. We implemented the Monte Carlo simulation software in Java using the conventional Gillespie’s algorithm [13].

Polymers 2016, 8, 240

3 of 13

We compute the Monte Carlo fingerprints by calculating a histogram from the simulated polymer chains. To compare two fingerprints, we use the normalized root mean square error (NRMSE). The NRMSE( M1 , M2 ) between two matrices M1 and M2 of size n × m is defined as: q NRMSE( M1 , M2 ) = 100 ×

1 n×m || M1

− M2 ||22

max ( M2 )

.

(1)

Monte Carlo simulations produce a large random sample of polymer chains S. If a model M can compute the likelihood of a single chain P(S| M ), we can compare different models by computing and comparing the log likelihoods of the whole dataset D: P( D | M) =



log P(S| M).

(2)

S∈ D

We use this log likelihood to evaluate our models. 3. Results and Discussion 3.1. Bernoulli Model 3.1.1. Chain Lengths Consider the synthesis of a single polymer chain. We divide the continuous reaction time into T discrete time steps, which we call synthesis steps. At each step, there are two mutually exclusive events: adding a monomer or not. This random process is equivalent to conducting a series of T Bernoulli trials for every polymer chain and recording the chain lengths, i.e., how many monomers were added. Thus, the chain lengths are binomially distributed with parameters T, the number of trials, and p M , the probability of adding a monomer. 3.1.2. Fingerprint Model We extend the model to describe copolymer fingerprints. At each of the T discrete synthesis steps, three mutually exclusive events are possible: adding monomer A, monomer B, or nothing. However, in general, the proportion of A to B changes during the synthesis; therefore, the probabilities of adding A or B change. We define the monomer probability parameters pA (t) and pB (t), with pA (t) + pB (t) = 1 for all 1 ≤ t ≤ T. pA and pB are vectors of length T, describing the probability of encountering a monomer A or B at each synthesis step. We model copolymerization as an inhomogeneous Markov chain and call this basic model the Bernoulli model. We describe a copolymer fingerprint as a matrix M of size n × m, in which entry Ma,b gives the relative abundance of a copolymer with a monomers of type A and b monomers of type B. The states of the Markov chain correspond to the fingerprint entries. The transition probabilities correspond to the three possible events: append A, B, or nothing. The transition probability from state Ma,b to Ma+1,b is the probability of adding a monomer p M times the probability of encountering an A at synthesis step t: P( Ma,b → Ma+1,b ) = p M × pA (t). (3) Analogously, the transition probability from Ma,b to Ma,b+1 is the probability of adding a monomer times the probability of encountering a B: P( Ma,b → Ma,b+1 ) = p M × pB (t).

(4)

The transition probability for staying in state Ma,b is the probability of adding nothing: P( Ma,b → Ma,b ) = 1 − p M .

(5)

Polymers 2016, 8, 240

4 of 13

All other transition probabilities are zero. The starting distribution M(0) is a matrix of zeros, except for M0,0 (0) = 1. This means that, before starting the synthesis, all chains have zero monomer repeating units A and B. To conform to standard Markov chain notation, let M be a row vector. Let P be the matrix of transition probabilities. Starting with M(0), the copolymer fingerprint at synthesis step t is: M ( t ) = M ( t − 1) × P ( t ).

(6)

We are interested in the fingerprint after the completed synthesis, which is the fingerprint at the last synthesis step M ( T ). The transition matrix P is sparse; thus, Equation (6) can be simplified for a > 0 and b > 0 to: Ma,b (t) = p M × pA (t) × Ma−1,b (t − 1)

+

p M × pB (t) × Ma,b−1 (t − 1)

+

(1 − p M ) × Ma,b (t − 1).

(7)

If a = 0 or b = 0, one needs to delete from Equation (7) the first or second term, respectively. In each synthesis step 1 ≤ t ≤ T, we compute n × m fingerprint entries in constant time for each entry. Because n ≤ T and m ≤ T, the worst case running time is O( T 3 ). It is not necessary to save the fingerprints for each synthesis step as M (t) only depends on M (t − 1); therefore, the memory requirement is O( T 2 ). 3.1.3. Reactivity Ratios Thus far, our model has not taken reactivity ratios into account. The probability of a reaction equals the probability of adding a certain monomer times the probability of encountering that monomer. However, the reactivity ratios are known to influence the copolymerization process. For example, if monomer A has a strong affinity for monomer B, a weak affinity for A, and monomer B has the reverse affinity, then the result will be an alternating copolymer. To this end, we define a new model: the Bernoulli model with reactivity parameters. We define the reactivity parameters pAA , pAB , pBA , and pBB , which describe the probabilities of the reactions between the four possible pairings of chain ends and monomers. To be able to distinguish between chains ends, we use two fingerprints: MA , the distribution of chains ending with A, and MB , the distribution of chains ending with B. We are interested in the fingerprint after the final synthesis step T. The final fingerprint can be calculated by adding the final distributions of chains ending with A and B: M ( T ) = M A ( T ) + M B ( T ). (8) We define the transition probabilities for the four possible reactions of chain ends and monomers. For X ∈ {A, B}, the transition probabilities for adding A are: A P ( MX a,b → Ma+1,b ) = p M × cX × pXA × pA ( t ).

(9)

Analogously, the transition probabilities for adding B are: B P ( MX a,b → Ma,b+1 ) = p M × cX × pXB × pB ( t ).

(10)

An important property of Markov chains is that the rows of the transition matrix sum to one. Introducing the reactivity parameters violated this property; therefore, we use the normalization coefficients cA and cB in the Equations (9) and (10). The normalization coefficients are defined as: cX =

1 . pXA × pA (t) + pXB × pB (t)

(11)

Polymers 2016, 8, 240

5 of 13

Because empty chains end neither with A nor with B, we define the initiator state I. The transition probabilities to start a chain are: A P(I → M1,0 )=

P (I →

B M0,1 )

=

p M × pA ( t ), p M × pB ( t ).

(12)

The transition probabilities of the non-state-changing transitions are not affected by the reactivity parameters and are analogous to Equation (5). All other transition probabilities are zero. By applying the transition probabilities (Equations (9) and (10)) and the normalization coefficients (Equation (11)), the fingerprint MA can be calculated for a > 0 and b > 0 by: MA a,b ( t ) =

p M c A pAA pA (t) × MA a−1,b ( t − 1)

+

p M c B pBA pA (t) × MB a−1,b ( t − 1)

+

MA a,b ( t − 1).

(1 − p M ) ×

(13)

Analogously, fingerprint MB is: MB a,b ( t ) =

p M cA pAB pB (t) × MA a,b−1 ( t − 1)

+

p M cB pBB pB (t) × MB a,b−1 ( t − 1)

+

MB a,b ( t − 1).

(1 − p M ) ×

(14)

If a = 0 or b = 0, the appropriate terms can be deleted from Equations (13) and (14). For a = 1, b = 0 or b = 1, a = 0 Equations (13) and (14) change according to Equation (12). The running time and memory requirements change by a constant factor; therefore, the worst case running time is still O( T 3 ) and memory is O( T 2 ). 3.2. Geometric Model 3.2.1. Chain Length The Bernoulli model that we introduced above used T discrete synthesis steps to add monomers: A, B or nothing. Adding a monomer or not is a Bernoulli trial and the resulting chain lengths are binomially distributed. However, in practice, polymer lengths often show a long-tailed distribution, which is usually modeled by a gamma distribution [12,34,35]. Here, we modify our discrete model for a long-tailed chain length distribution. The discrete equivalent to the continuous gamma distribution is the negative binomial distribution. A random variable following a negative binomial distribution with parameters T and p equals the sum of T independent geometrically distributed random variables with parameter 1 − p. To this end, we model the discrete steps using the geometric distribution. Consider the synthesis of a single polymer chain. In each synthesis step, the number of monomers, which are added to the chain, is random. The probability of adding k monomers follows a geometric distribution with parameter pe , the “stop” probability: PG (k) = (1 − pe )k pe .

(15)

We call this the Geometric model. 3.2.2. Fingerprint Model Due the geometrically distributed number of monomers to add in each synthesis step, the number of possible transitions increases compared to the Bernoulli model. Given i ≥ 0 and j ≥ 0, the transition probability from Ma,b to any state with equal or higher numbers of A and B is the number of

Polymers 2016, 8, 240

6 of 13

combinations with i monomers of type A and j monomers of type B times the probability of adding i + j monomers times the probabilities of encountering i monomers of type A and j monomers of type B:  P( Ma,b → Ma+i,b+ j ) =

 i+j × PG (i + j) × pA (t)i × pB (t) j . j

(16)

To save computation time, the number of combinations (i+j j) can be calculated using Pascal’s triangle. As with the Bernoulli model, the memory requirements are O( T 2 ). However, the running time increases to O( T 5 ) because we need to iterate over all possible i and j. 3.2.3. Reactivity Ratios Analogous to the Bernoulli model, we define a Geometric model with reacitivity parameters. We use the reactivity parameters pAA , pAB , pBA , and pBB to model the reactivity ratios, the initiator state I, and two fingerprints MA and MB to describe the distributions of chains ending with A or B, respectively. In contrast to the Bernoulli model, the Geometric model is able to add more than one monomer per synthesis step. We need to determine the reactivity parameters for all possible combinations of added A and B. Consider one synthesis step of the Markov chain: we say that we start in a state X ∈ {I, A, B}, if the last added monomer of all previous steps was nothing, A, or B, respectively. We stop in state Y ∈ {A, B}, if the last added monomer of this or any previous step is an A or B, respectively. To this end, we introduce the matrix RXY . RXY a,b is the probability of starting in state X, adding a monomers A, b monomers B, and ending in state Y. We define RXY as: RXA a,b =

XB RXA a−1,b × pAA + R a−1,b × pBA ,

RXB a,b =

XB RXA a,b−1 × pAB + R a,b−1 × pBB .

(17)

To compute RXY for each possible combination of X and Y, we need to know the initial values. If no monomer is added, we start and end in the same state: RXX 0,0 =

1,

RXY 0,0

0 for X 6= Y.

=

(18)

If we start in the initiator state I and add one monomer, it is independent of the reactivity parameters: RIA 1,0 =

1,

RIB 0,1 =

1.

(19)

Analogous to the Bernoulli model with reactivity parameters, the rows of the transition matrix need to sum to one. We therefore define normalization coefficients and normalize the transition probabilities for all transitions that add the same number of monomers: cX ( k ) =

1 . XA + RXB ) × p (t ) a × p (t )b ( R ∑ A B a,b a,b

(20)

a+b=k

We now combine Equation (17) to Equation (20) to specify the transition probabilities for the Geometric model. For X ∈ {A, B}: Y XY i j P ( MX a,b → Ma+i,b+ j ) = cX (i + j ) × Ri,j × PG (i + j ) × pA ( t ) × pB ( t ) .

(21)

The transition probabilities from the initiator state I to any other state are given by: Y i j P(I → Mi,j ) = cI (i + j) × RIY i,j × PG (i + j ) × pA ( t ) × pB ( t ) .

(22)

Polymers 2016, 8, 240

7 of 13

The transition probability to not start a chain and stay in state I is: P(I → I) = PG (0).

(23)

We are interested in the fingerprint after the final synthesis step T. Analogous to the Bernoulli model, the final fingerprint can be calculated by adding the final distributions MA ( T ) and MB (t). Compared to the Geometric model without reactivity parameters, the running time and memory requirements change by a constant factor; therefore, the worst case running time is still O( T 5 ) and memory is O( T 2 ). 3.3. Single Chain Models The Bernoulli and Geometric models described above compute the copolymer fingerprints, and the distribution of all chains over the numbers of monomer repeating units. However, an additional interesting question is: what is the likelihood of a single copolymer chain under a given model? To compute the likelihood of a single chain, we only consider transitions which may lead to the chain in question and transitions which do not add a monomer, i.e., non-state-changing transitions. All other transition probabilities are zero. After progressing T synthesis steps, the likelihood of the chain is the probability of the last reachable state. For example, let us compute the likelihood of the chain “ABB”. In addition to the non-state-changing transitions, the Bernoulli model would allow M0,0 → M1,0 , M1,0 → M1,1 , and M1,1 → M1,2 . The likelihood of “ABB” is the probability of the state M1,2 . The likelihood under the Bernoulli model with reactivity parameters and the Geometric models can be computed analogously. 3.4. Parameter Estimation

0.18 0.16 0.14 0.12 0.10 0.08 0.06 0.04 0.02 0.000

6

102 chains 106 chains

5 fingerprint NRMSE [%]

relative abundance

The Bernoulli and Geometric models fully characterize the distribution of copolymer chains. Unfortunately, the true underlying distribution of copolymer chains is unknown and the Monte Carlo simulated chains are random samples. However, the larger the sample size is, the closer the empirical distribution is to the true distribution and the better we can use the sample to the evaluate our models. The accuracy of the Monte Carlo simulation strongly depends on the number of simulated chains 1 2 6 (Figure 1, left). For r A = r − B = 1.0, we compute 10 fingerprints M with 10 to 10 chains and compare 6 i them to the fingerprint Mtotal , which we compute using all 10 × ∑i=2 10 = 11, 111, 000 chains (Figure 1, right). For comparison, we use the normalized root mean square error NRMSE( M, Mtotal ).

4 3 2 1 5

10 15 20 degree of polymerization

25 102

103

104 105 number of chains

106

0

Figure 1. Left: Comparison of the distribution of chain lengths computed by the Monte Carlo simulations with 102 vs. 106 chains at reactivity ratio r A = 1.0; Right: Normalized root mean square error (NRSME) of the fingerprints computed by Monte Carlo simulations with different numbers of chains compared to the fingerprint computed from all chains produced by all Monte Carlo simulations at reactivity ratio r A = 1.0.

Polymers 2016, 8, 240

8 of 13

The error decays with the number of chains. The lowest mean errors are ∼2% and ∼0.5% using 105 and 106 chains, respectively. We observe that the error for 105 is still significantly above zero. Thus, if not stated otherwise, we use 106 chains for the Monte Carlo simulations in the following. For completeness, we evaluate the Monte Carlo simulations by comparing the simulated concentrations to the concentrations computed by solving the ordinary differential equation model of the living copolymerization (Supplementary Figure S1). The concentration curves are identical to the eye, strongly supporting the validity of the Monte Carlo simulations. We now compare the Bernoulli and Geometric models to the Monte Carlo simulations. The reactivity parameters can be calculated from the reactivity ratios. For X, Y ∈ {A, B}, the reactivity parameters are: rXY . (24) pXY = rXA + rXB Unfortunately, the other model parameters cannot be calculated intuitively from the Monte Carlo simulation parameters. In principle, it is possible to estimate the parameters by fitting the model fingerprint to the Monte Carlo fingerprint. However, to minimize the influence of the fitting algorithms, we apply a two-step estimation process. First, we estimate the number of synthesis steps and the probability of adding monomers. According to the Bernoulli and Geometric model, the chain lengths follow a binomial or negative binomial distribution, respectively. We fit a binomial and a negative binomial probability mass function (pmf) to each copolymer length distribution (Figure 2 and Supplementary Figure S2). The length distributions become broader with increasing reactivity ratios rA = rB −1. The broader the distribution, the better it is approximated by a negative binomial pmf, and the worse by a binomial pmf. However, for narrow distributions (rA = rB −1 ≤ 0.1, which correspond to a standard deviation σ ≤ 2.8), we do not observe such a clear distinction: The mode of the distribution is better approximated by the binomial pmf. In contrast, the negative binomial pmf is better able to fit the long tail of the distribution.

rA = rB−1 =0.01

relative abundance

0.20 0.15

rA = rB−1 =2.0

Monte-Carlo Bernoulli Geometric

Monte-Carlo Bernoulli Geometric

0.10 0.05 0.000

5

10 15 degree of polymerization

20 0

5

10 15 degree of polymerization

20

Figure 2. Comparison of the distribution of chain lengths computed by the Monte Carlo simulations with r A = 0.01 (left) and r A = 2.0 (right) vs. the length distributions computed by the Bernoulli and Geometric models.

Second, we estimate the monomer probabilities pA and pB . Because we defined pA + pB = 1, estimating pA is sufficient. We divide the reaction time of the Monte Carlo simulation into intervals. The number of intervals equals the number of synthesis steps T. We choose the left and right interval limits, such that the change in concentration is the same for each interval (Figure 3 and

Polymers 2016, 8, 240

9 of 13

Supplementary Figure S3). We calculate the mean concentrations [f A](t) and [f B](t) for each interval 1 ≤ t ≤ T. Then, the monomer probabilities pA (t) can be calculated as: pA ( t ) =

0.07

A](t) [f . f B](t) [A](t) + [f

(25)

0.8

[A] [B] [A]+[B]

0.05

0.7 0.6

[A] g [B] g

0.04 0.03

monomer probability

concentration [mol ·l−1 ]

0.06

pA pB

0.5 0.4

0.02

0.3

0.01 0.000

100

200 300 time [s]

400

500 0

2

4

6 8 10 synthesis steps

12

140.2

Figure 3. Left: Concentration of monomers [A] and [B] during the Monte Carlo simulation with r A = 2.0. We divided the time into discrete synthesis steps and determined the average concentrations A] and [f B]; Right: Monomer probabilities pA and pB for each synthesis step calculated from the [f average concentrations.

Please note that the parameter estimation using the concentrations from Monte Carlo simulations is for evaluation purposes only. When applying the models to experimental data, the parameters can be estimated by fitting the computed fingerprint to the observed fingerprint. 3.5. Model Evaluation Determining the model parameters allows us to compare the fingerprints computed by our models to the Monte Carlo fingerprints (Figure 4 and Supplementary Figure S4). Additionally, we can compute the NRMSE of the Monte Carlo fingerprints vs. the model fingerprints (Figure 5, left). Evidently, the reactivity parameters are crucial to model copolymerization. They determine the location and size of the distribution of abundances in the fingerprint. Both the Bernoulli and Geometric model fingerprints without the reactivity parameters have a significantly larger deviation than the models with reactivity parameters to the Monte Carlo fingerprints, except for the instances with −1 rA = rB = 1.0. This is to be expected because, in our setup, this corresponds to reactivity parameters of pXY = 1.0 for all A, B ∈ {AB}. Overall, the Geometric model provides a better fit than the Bernoulli model for all fingerprints −1 computed with rA = rB ≥ 0.5: the shapes of the distributions match closely and the deviations −1 to the Monte Carlo fingerprints are the lowest. For fingerprints computed with rA = rB < 0.5, we observe the reverse: the Bernoulli model provides a better fit than the Geometric model for narrow distributions. The Bernoulli and Geometric models are not only able to compute fingerprints, but also the likelihood of a single copolymer chain. Monte Carlo simulations produce a large random sample of copolymer chains. This allows us to compute and compare the log likelihoods of the sampled data under the different models (Figure 5, right) to further evaluate the models. A model that has a higher likelihood is “closer” to the sample.

Polymers 2016, 8, 240

10 of 13

10

Bernoulli Bernoulli (RP)

relative abundance [%]

8

4.8 4.2 3.6 3.0 2.4 1.8 1.2 0.6 0.0

Geometric Geometric (RP)

A units

6 4 2 00

2

4

B units

6

8

10 0

2

4

B units

6

8

10

Figure 4. Copolymer fingerprint computed by the Monte Carlo simulation with r A = 2.0 (filled contours) compared to the fingerprints computed by the statistical models (solid and dashed contours). Left: Bernoulli model with and without reactivity parameters (RP); Right: Geometric model with and without reactivity parameters (RP).

1e6 4.5

Bernoulli Bernoulli (RP) Geometric Geometric (RP)

fingerprint NRMSE [%]

8 7 6

Bernoulli Bernoulli (RP) Geometric Geometric(RP)

5.0 log likelihood

9

5.5

5 4

6.0

3 2

6.5

1 0 0.0

0.5

1.0 1.5 reactivity ratio rA =rB−1

2.0

0.0

0.5

1.0 1.5 reactivity ratio rA =rB−1

2.0

7.0

Figure 5. Left: Normalized root mean square error (NRMSE) of the copolymer fingerprints computed by Monte Carlo simulations compared to the fingerprints computed by the statistical models; Right: Log likelihoods of the polymer chains produced by the Monte Carlo simulations under the Bernoulli and Geometric models with and without RP. Note that the minimal and maximal log likelihoods are so close to the means that the error bars are indiscernible. −1 Except for rA = rB = 1.0, the log likelihood under the models without reactivity parameters are in all cases lower than their counterparts using reactivity parameters. This is consistent with the fingerprint comparisons. However, contrary to the fingerprint comparisons, the Geometric model has the best log likelihood for all instances. The running time and memory requirements of a Monte Carlo simulation increase with the number of simulated chains and for good accuracy the number should be high. The running time and memory of the Bernoulli and Geometric models are determined by the number of synthesis steps. Compared to the theoretical time complexity of O( T 3 ) for the Bernoulli model, the Geometric model has a higher theoretical time complexity of O( T 5 ). We measured running time (excluding Input/Output operations) and memory of the Monte Carlo simulations with 102 to 106 chains and of our models (Figure 6).

Polymers 2016, 8, 240

11 of 13

10

4

10

3

10

2

10

1

10

0

10

4

10

3

10

2

10

1

10

0

10

-1

-2

10

-1

10 10

-3

10

-2

10

-4

10

-3

102

103

104

105

Monte-Carlo

106

B B(RP) G G(RP)

102

103

104

105

106

B B(RP) G G(RP)

memory [MB]

running time [s]

Computing the Bernoulli model is the fastest. As expected, the measured running time of the Geometric model is higher. However, computing the fingerprints with the Geometric model is still 11.8 and 788.5 times faster than the Monte Carlo simulations with 105 and 106 chains, respectively. The reactivity parameters have no substantial impact on the running time. Both models require significantly less memory than the Monte Carlo simulations. The additional matrices required for the reactivity parameters increase the memory consumption only slightly.

Monte-Carlo

Figure 6. Comparison of the running time (left) and memory (right) measurements of the Monte Carlo simulations using Gillespie’s algorithm with 102 to 106 chains and the Bernoulli (B) and Geometric (G) models with and without RP.

4. Conclusions We introduced two new Markov chain models: the Bernoulli and Geometric models. The major differences to classical copolymer Markov chains based on the terminal model by Mayo and Lewis [1] are the variable number of added monomers per time step and the time-dependent monomer probabilities. The number of added monomers follows a Bernoulli or geometric distribution, respectively. The reactivity ratio has a major influence on synthesized copolymers and likewise the reactivity parameters of the models play a decisive role. In our setup, the Geometric model is able to provide a good fit to the fingerprints of the broad polymer distributions, while the fit of the Bernoulli model is particularly good to the mode but not as good to the long tail of the narrow polymer distributions. However, we observe that the likelihood of the copolymer chain sequences is always higher under the Geometric model. This shows that long chains play a major role in characterizing the distribution of copolymer chains. Our models require less memory than Monte Carlo simulations. The Bernoulli model is faster than Monte Carlo, and the Geometric model is significantly faster (1–3 folds) for a high number of simulated chains, which is necessary for accurate Monte Carlo simulations. In addition, computing our models can be parallelized for multiple cores in a straightforward way, computing different lines of the matrices in parallel. However, the main advantage of our models over Monte Carlo simulations is that they do not produce just a random sample, but characterize the complete distribution of copolymer chains. Our computations are exact and deterministic. In particular, we can calculate the exact likelihood of any polymer chain. The model parameters can be estimated from measured copolymer fingerprints and compute previously inaccessible statistical properties of the synthesized copolymers. This will be described in a forthcoming publication. Although the Geometric model was more accurate in our setup, the Bernoulli model is a good characterization for copolymer distributions without a long tail and in general can be used as a rapid first estimate. Supplementary Materials: Supplementary Materials can be found at www.mdpi.com/2073-4360/8/6/240/s1.

Polymers 2016, 8, 240

12 of 13

Acknowledgments: We thank Sarah Crotty for fruitful discussions. Funding was provided by the Thüringer Ministerium für Bildung, Wissenschaft und Kultur (grant no. 12038-514). Author Contributions: Martin S. Engler, Kerstin Scheubert, Ulrich S. Schubert, and Sebastian Böcker jointly contributed to model development; Kerstin Scheubert and Sebastian Böcker developed the Bernoulli model; Martin S. Engler developed the other models, implemented the algorithms, performed the experiments, and drafted the paper; and Martin S. Engler, Kerstin Scheubert, Ulrich S. Schubert, and Sebastian Böcker finalized the paper. Conflicts of Interest: The authors declare no conflict of interest.

Abbreviations The following abbreviations are used in this manuscript: ODE MS MALDI-TOF MS NRMSE

Ordinary differential equation Mass spectrometry Matrix-assisted laser desorption/ionization time-of-flight mass spectrometry Normalized root mean square error

References 1.

2. 3. 4. 5. 6.

7.

8.

9. 10.

11.

12. 13. 14.

Mayo, F.R.; Lewis, F.M. Copolymerization. I. A basis for comparing the behavior of monomers in copolymerization; the copolymerization of styrene and methyl methacrylate. J. Am. Chem. Soc. 1944, 66, 1594–1601. Kryven, I.; Iedema, P.D. Deterministic modeling of copolymer microstructure: Composition drift and sequence patterns. Macromol. React. Eng. 2015, 9, 285–306. Fischer, B.; Roth, V.; Roos, F.; Grossmann, J.; Baginsky, S.; Widmayer, P.; Gruissem, W.; Buhmann, J.M. NovoHMM: A hidden Markov model for de novo peptide sequencing. Anal. Chem. 2005, 77, 7265–7273. Wu, H.; Caffo, B.; Jaffee, H.A.; Irizarry, R.A.; Feinberg, A.P. Redefining CpG islands using hidden Markov models. Biostatistics 2010, 11, 499–514. González Díaz, H.; Molina, R.; Uriarte, E. Stochastic molecular descriptors for polymers. 1. Modelling the properties of icosahedral viruses with 3D-Markovian negentropies. Polymer 2004, 45, 3845–3853. González-Díaz, H.; Pérez-Bello, A.; Uriarte, E. Stochastic molecular descriptors for polymers. 3. Markov electrostatic moments as polymer 2D-folding descriptors: RNA-QSAR for mycobacterial promoters. Polymer 2005, 46, 6461–6473. González-Díaz, H.; Saíz-Urra, L.; Molina, R.; Uriarte, E. Stochastic molecular descriptors for polymers. 2. Spherical truncation of electrostatic interactions on entropy based polymers 3D-QSAR. Polymer 2005, 46, 2791–2798. Cruz-Monteagudo, M.; Munteanu, C.R.; Borges, F.; Cordeiro, M.N.D.S.; Uriarte, E.; Chou, K.C.; González-Díaz, H. Stochastic molecular descriptors for polymers. 4. Study of complex mixtures with topological indices of mass spectra spiral and star networks: The blood proteome case. Polymer 2008, 49, 5575–5587. González-Díaz, H.; Uriarte, E. Biopolymer stochastic moments. I. Modeling human rhinovirus cellular recognition with protein surface electrostatic moments. Biopolymers 2005, 77, 296–303. Pérez-Montoto, L.G.; Dea-Ayuela, M.A.; Prado-Prado, F.J.; Bolas-Fernández, F.; Ubeira, F.M.; González-Díaz, H. Study of peptide fingerprints of parasite proteins and drug-DNA interactions with Markov-Mean-Energy invariants of biopolymer molecular-dynamic lattice networks. Polymer 2009, 50, 3857–3870. Rodriguez-Soca, Y.; Munteanu, C.R.; Dorado, J.; Rabuñal, J.; Pazos, A.; González-Díaz, H. Plasmod-PPI: A web-server predicting complex biopolymer targets in plasmodium with entropy measures of protein-protein interactions. Polymer 2010, 51, 264–273. Brandrup, J.; Immergut, E.H. Polymer Handbook, 4th ed.; Grulke, E.A., Ed.; Wiley: Hoboken, NJ, USA, 1999. Gillespie, D.T. Exact stochastic simulation of coupled chemical reactions. J. Phys. Chem. 1977, 81, 2340–2361. Meimaroglou, D.; Kiparissides, C. Review of Monte Carlo Methods for the Prediction of Distributed Molecular and Morphological Polymer Properties. Ind. Eng. Chem. Res. 2014, 53, 8963–8979.

Polymers 2016, 8, 240

15.

16. 17. 18.

19. 20. 21.

22. 23. 24. 25. 26.

27. 28. 29. 30. 31.

32. 33. 34.

35.

13 of 13

D’hooge, D.R.; Van Steenberge, P.H.; Derboven, P.; Reyniers, M.F.; Marin, G.B. Model-based design of the polymer microstructure: Bridging the gap between polymer chemistry and engineering. Polym. Chem. 2015, 6, 7081–7096. Brandão, A.L.T.; Soares, J.B.P.; Pinto, J.C.; Alberton, A.L. When polymer reaction engineers play dice: Applications of Monte Carlo Models in PRE. Macromol. React. Eng. 2015, 9, 141–185. Willemse, R.X.E. New insights into free-radical (co)polymerization kinetics. Ph.D. Thesis, University of Technology Eindhoven, Eindhoven, The Netherlands, 2005. Drache, M.; Schmidt-Naake, G.; Buback, M.; Vana, P. Modeling RAFT polymerization kinetics via Monte Carlo methods: Cumyl dithiobenzoate mediated methyl acrylate polymerization. Polymer 2005, 46, 8483–8493. Drache, M. Modeling the product composition during controlled radical polymerizations with mono- and bifunctional alkoxyamines. Macromol. Symp. 2009, 275–276, 52–58. Szymanski, R. On the determination of the ratios of the propagation rate constants on the basis of the MWD of copolymer chains: A new Monte Carlo algorithm. e-Polymers 2009, 9, 538–552. Van Steenberge, P.H.M.; D’hooge, D.R.; Wang, Y.; Zhong, M.; Reyniers, M.F.; Konkolewicz, D.; Matyjaszewski, K.; Marin, G.B. Linear gradient quality of ATRP copolymers. Macromolecules 2012, 45, 8519–8531. Drache, M.; Drache, G. Simulating controlled radical polymerizations with mcPolymer—A Monte Carlo approach. Polymers 2012, 4, 1416–1442. Engler, M.S.; Crotty, S.; Barthel, M.J.; Pietsch, C.; Knop, K.; Schubert, U.S.; Böcker, S. COCONUT—An efficient tool for estimating copolymer compositions from mass spectra. Anal. Chem. 2015, 87, 5223–5231. Montaudo, M.S. Mass spectra of copolymers. Mass Spectrom. Rev. 2002, 21, 108–144. Pasch, H. MALDI-TOF Mass Spectrometry of Synthetic Polymers; Schrepp, W., Ed.; Springer: Berlin, Germany, 2003. Vivó-Truyols, G.; Staal, B.; Schoenmakers, P.J. Strip-based regression: A method to obtain comprehensive co-polymer architectures from matrix-assisted laser desorption ionisation-mass spectrometry data. J. Chromatogr. A 2010, 1217, 4150–4159. Weidner, S.M.; Falkenhagen, J.; Bressler, I. Copolymer composition determined by LC-MALDI-TOF MS coupling and MassChrom2D data analysis. Macromol. Chem. Phys. 2012, 213, 2404–2411. Horský, J.; Walterová, Z. Fingerprint multiplicity in MALDI-TOF mass spectrometry of copolymers. Macromol. Symp. 2014, 339, 9–16. Engler, M.S.; Crotty, S.; Barthel, M.J.; Pietsch, C.; Schubert, U.S.; Böcker, S. Abundance correction for mass discrimination effects in polymer mass spectra. Rapid Commun. Mass Spectrom. 2016, 30, 1233–1241. Raeder, H.; Schrepp, W. MALDI-TOF mass spectrometry in the analysis of synthetic polymers. Acta Polym. 1998, 49, 272–293. Schriemer, D.C.; Li, L. Mass discrimination in the analysis of polydisperse polymers by MALDI time-of-flight mass spectrometry. 1. Sample preparation and desorption/ionization issues. Anal. Chem. 1997, 69, 4169–4175. Schriemer, D.C.; Li, L. Mass discrimination in the analysis of polydisperse polymers by MALDI time-of-flight mass spectrometry. 2. Instrumental issues. Anal. Chem. 1997, 69, 4176–4183. Hoteling, A.J.; Erb, W.J.; Tyson, R.J.; Owens, K.G. Exploring the importance of the relative solubility of matrix and analyte in MALDI sample preparation using HPLC. Anal. Chem. 2004, 76, 5157–5164. Wilczek-Vera, G.; Danis, P.O.; Eisenberg, A. Individual block length distributions of block copolymers of polystyrene-block-poly(R-methylstyrene) by MALDI/TOF mass spectrometry. Macromolecules 1996, 29, 4036–4044. Teraoka, I. Polymer Solutions; Wiley: New York, NY, USA, 2002. c 2016 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access

article distributed under the terms and conditions of the Creative Commons Attribution (CC-BY) license (http://creativecommons.org/licenses/by/4.0/).