Research Paper II

Cumulativity in constraint grammars: An iterated learning view of probabilistic typology

Lucien Carroll
Main Reader: Eric Bakovic
Ancillary Readers: Andy Kehler and Sharon Rose

Abstract

A primary goal of the generative linguistics tradition has been to formally describe the categorical limit of cross-linguistic variation. However, much of the systematicity in cross-linguistic variation is probabilistic rather than categorical, consisting of trends with exceptions rather than absolute universals. This presents a higher standard for a typologically-oriented grammatical framework: that it should produce probabilistic predictions. This study shows that Harmonic Grammar, a framework based on weighted constraint interaction, is well-suited to meet that criterion. Harmonic Grammar is a predecessor of Optimality Theory, and strict constraint domination, the key innovation of Optimality Theory, was motivated by problematic typological predictions in Harmonic Grammar. The weighted constraint interaction of Harmonic Grammar permits cumulativity effects, where a candidate incurring a single violation of a high-weight constraint can be more optimal than one incurring multiple violations of low-weight constraints, and these effects often lead to language patterns that are not attested in the linguistically described languages of the world. A partial resolution of the typological non-cumulativity problem comes from the same insight that allows Harmonic Grammar to produce typological probabilities: Harmonic Grammar is interpretable as a maximum entropy model, where candidate harmonies are log-probabilities and the probability distribution of constraint weights is a learning bias. The evolution of language establishes a bidirectional relationship between individual learning bias and the typological distribution of languages. And probability distributions that lead, as learning biases, to generalization and regularization, are associated with reduced cumulativity in the typological probabilities. The predicted probabilities of three simple typologies (neutralization driven by contextual markedness, default-to-same-side single stress, and the basic CV typology of syllable structure) still diverge from observed probabilities, but they demonstrate Harmonic Grammar's potential as a typological theory and point the way forward through working out the relationships between language change and language typology.

Acknowledgements

This study has grown in fits and starts, and seen many incarnations and detours. In the process, it has benefited from the comments, advice and questions of many people. The project especially owes its existence to Rob Malouf, who unintentionally set me on the path of trying to understand the relationship between Optimality Theory and maximum entropy models, and to Eric Bakovic, who consistently posed key questions about the typological properties of Harmonic Grammar. Also important have been conversations with Sharon Rose and Andy Kehler, as well as with Roger Levy, Farrell Ackerman, Rebecca Colavin and Alex del Giudice, and discussions with SaD-PhIG, the EvoLang reading group and the Computational Psycholinguistics Lab. Each of these people has helped to make this work stronger, and none of them bears any blame for this study's faults.

Contents

1 Introduction
2 Non-cumulativity in typology
  2.1 Cumulativity is typologically unusual
    2.1.1 Constraint cumulativity
    2.1.2 Violation cumulativity
  2.2 Cumulativity effects do exist
    2.2.1 Local constraint conjunction
    2.2.2 Cumulativity in variable phenomena
  2.3 Universals and typology, without universals
3 From real-time bias to typological priors
  3.1 Harmonic Grammar as Maximum Entropy Grammar
  3.2 Modeling the individual versus modeling typology
    3.2.1 Constraint sets and weight priors
    3.2.2 Why are there typological tendencies at all?
  3.3 Hypotheses about the weight space
    3.3.1 Probability distributions of constraint weights
    3.3.2 Learning bias and regularization
    3.3.3 Violation cost functions
4 Evaluating typological predictions
  4.1 General-case neutralization
  4.2 Stress windows in default-to-same-side stress
  4.3 Jakobson's syllable structure typology
5 Conclusions
References
A Appendix: Tableaux in the neutralization typology
B Appendix: Syllable template typological data

1 Introduction

Real-time human language processing is deeply probabilistic, showing variability that is influenced by many different factors, including lexical frequency and neighborhood density (Pierrehumbert, 2002), contextual predictability (Tily et al., 2009), communication difficulty (Zhao & Jurafsky, 2009), cross-modal environmental factors (Boroditsky & Ramscar, 2002), interlocutor identity (Creel, Aslin, & Tanenhaus, 2008), social identity (Eckert & Rickford, 2001), genre (Biber, 1995), and discourse structure (Biber, Connor, & Upton, 2007), and these effects have been addressed in a wide variety of frameworks. There is also much cross-linguistic variation in attested grammatical patterns, with surprising diversity that has disconfirmed many hypothesized universals (Haspelmath, 2007; Evans & Levinson, 2009). Non-trivial typological generalizations often have exceptions, but there is nevertheless much probabilistic typological systematicity. A primary goal of the generative linguistics tradition has been to formally describe the categorical limit of cross-linguistic variation, but because so much of the systematicity in cross-linguistic variation is probabilistic rather than categorical, a complete model of typological variation should produce probabilistic predictions. For example, in a genetically balanced sample of 262 languages with phonologically predictable quantity-insensitive stress (that is, where syllable weight does not matter), Gordon (2002) found that 71% of these (187 languages) have a single stressed syllable in each word, 23% (61 languages) have various iterative stress systems, which in addition to primary stress, have secondary stress distributed at regular intervals, and just 5% (14 languages) have dual stress systems, with one primary stress and one secondary stress in each word. Stress is also systematically associated with the ends of the words. Of the languages with single stress, 90% have stress falling on the initial syllable, the final syllable, or the penult (1). The remaining languages stress the antepenultimate or peninitial syllables, with no languages in that sample stressing syllables further into the word. (1)

Typology of quantity-insensitive single stress systems

• final stress (σσσσσσσ́): 32.0%
• initial stress (σ́σσσσσσ): 30.2%
• penultimate stress (σσσσσσ́σ): 28.8%
• antepenultimate stress (σσσσσ́σσ): 5.3%
• peninitial stress (σσ́σσσσσ): 3.7%

Speaking phenomenologically, languages that count one syllable from the beginning of the word, one syllable from the end, and two syllables from the end are all about equally common. Languages that count two syllables from the beginning of the word or three syllables from the end are attested but uncommon, and languages that count further in are unattested, at least in that sample. How can we distinguish not just the impossible from the possible, but the common from the uncommon? One of the primary motivations in the development of Optimality Theory (Prince & Smolensky, 1993/2004) was the potential to distinguish possible language patterns from impossible language patterns in terms of exhaustive reranking of a set of functionally-motivated constraints. The analysis set forth by Gordon (2002) attempts to do exactly that, for the attested quantity-insensitive stress systems. But the constraints that are used in Gordon's analysis to generate the 26 stress patterns in the whole sample also generate an additional 128 unattested stress patterns. Considering that many of the attested stress patterns are quite rare, exemplified by just a few languages in Gordon's sample, some of the unattested generated stress patterns might simply correspond to actual languages that have not been documented phonologically. But the number of unattested patterns is quite large (5 times the number of attested patterns), and as originally intended, the model doesn't make any distinctions along the scale from the most frequently attested patterns through the uncommon patterns to the unattested patterns. In a factorial typology, some rerankings crucially affect the pattern of mappings from underlying forms to surface forms, while other rerankings do not change the patterns. Bane and Riggle (2008) pursue the idea that r-volume, the number of total constraint rankings that generate a given language pattern, is a predictor of the probability of that pattern.

Looking at Gordon's stress typology, they show that r-volumes are indeed correlated with typological frequencies, so that the r-volumes associated with each stress pattern do at least generally distinguish the common stress patterns from the unattested stress patterns. Harmonic Grammar (Legendre, Sorace, & Smolensky, 2006) is another framework of constraint interaction, which uses a quantitative optimization procedure based on weighted constraints, in contrast to the stepwise decision procedure in Optimality Theory. One result of the optimization procedure is that Harmonic Grammar allows cumulativity effects, where combined violations of less important constraints can overtake violations of more important constraints—which is impossible in Optimality Theory—and the typological scarcity of such effects was a major factor in the original development of Optimality Theory. However, there actually are some attested language patterns that are most elegantly analyzed as cases of cumulativity, and Harmonic Grammar has recently received new attention in work on variation and learning (e.g. Pater, 2008; Boersma & Pater, 2008; Jesney & Tessier, 2010; Coetzee & Kawahara, 2010), where cumulativity patterns are particularly evident. The software that Bane and Riggle have developed for calculating factorial typologies can also calculate Harmonic Grammar typologies (Bane & Riggle, 2009), and they show that for both Gordon's stress typology and the classic consonant-vowel factorial typology of syllable structure (Prince & Smolensky, 1993/2004), Harmonic Grammar predicts a much greater number of unattested patterns than Optimality Theory does. Just as multiple total constraint rankings in Optimality Theory may be associated with a single mapping pattern, a mapping pattern in Harmonic Grammar is generated by a range of ratios of constraint weights. Based on the analogy with r-volume and the success of r-volume as a predictor of typological frequency, Bane and Riggle have also implemented the ability to calculate the weight space volume associated with a pattern in Harmonic Grammar, as a potential means of predicting attested versus unattested languages.
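The weight-space analogue of r-volume lends itself to a simple Monte Carlo estimate. The sketch below illustrates the idea only; it is not Bane and Riggle's implementation, and the constraint set, candidate sets and violation profiles are hypothetical stand-ins for a toy onset/coda typology, with weights drawn uniformly from the unit cube.

```python
import random
from collections import Counter

# Toy tableaux: violation counts per candidate, in the order (Onset, NoCoda, Faith).
TABLEAUX = {
    "/V/":   {"V": (1, 0, 0),     # faithful but onsetless
              "CV": (0, 0, 1)},   # repaired onset, violating Faith
    "/CVC/": {"CVC": (0, 1, 0),   # faithful but closed
              "CV": (0, 0, 1)},   # repaired coda, violating Faith
}

def winner(weights, candidates):
    """Candidate with the highest harmony H = -sum(weight * violations)."""
    return max(candidates,
               key=lambda c: -sum(w * v for w, v in zip(weights, candidates[c])))

def pattern(weights):
    """The input-to-output mapping generated by one weighting."""
    return tuple((inp, winner(weights, cands)) for inp, cands in sorted(TABLEAUX.items()))

random.seed(0)
samples = 100_000
counts = Counter(pattern([random.random() for _ in range(3)]) for _ in range(samples))
for patt, n in counts.most_common():
    print(f"{n / samples:.3f}", dict(patt))
```

With uniformly sampled weights, each of the four resulting patterns occupies about a quarter of the weight space; the same sampling procedure, run over richer candidate sets, is what turns weight-space volume into an estimated typological probability.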


In this paper, I show that proportions of the Harmonic Grammar weight space are quite suitable estimators of typological probabilities, but that typological arguments about Harmonic Grammar have made questionable assumptions about the shape of the weight space. I then show that alternative assumptions, used in related models of language variation and learning, lead to typological predictions which are closer to attested frequencies for three simple factorial typologies: phoneme neutralization and contrast, quantity-sensitive single stress, and consonant-vowel syllable structure. I first discuss why Harmonic Grammar has been seen as problematic as a typological model but is still worthy of consideration, and then I explain how Harmonic Grammar can be understood as a special case of a maximum entropy model, a probabilistic formalism for constraint interaction used in many other fields. As such, previous hypotheses about the Harmonic Grammar weight space are formalizable in terms of probability distributions, and comparison with related work in language variation and learning suggests alternative hypotheses about the weight space. Furthermore, weight space hypotheses that produce generalization and regularization in learning (as human learners do) produce less cumulativity in typology, and because cognitive biases are linked through sociobiological and sociocultural evolution to the weight space of language typology, we should prefer weight space hypotheses in typological models which are consistent with what is found in models of learning and variation. The resulting Harmonic Grammar typologies still noticeably diverge from observed typologies, but they help bridge the gap between Optimality Theory and conventional Harmonic Grammar. They also point the way towards the development of a more complete probabilistic typological model, through fleshing out the connections between language typology, cognitive biases and sociocultural selection.

2 Non-cumulativity in typology

Generative typologically-oriented linguistic frameworks attempt to encode typological universals in the structure of the theory, so that the linguistic theory neither overgenerates nor


undergenerates the grammatical patterns of known languages. For example, Jakobson (1962/1971) wrote that even though some languages have no syllables with codas (e.g. forbidding CVC syllables) and some languages have no syllables without onsets (e.g. forbidding V syllables), one typological universal is that all languages permit syllables with onsets and syllables without codas (i.e. CV). Clements and Keyser (1983) formalize this statement as a theory of syllable construction: all languages start with CV syllables, and languages can use neither, one, or both of the following rules to create other syllables:

• Delete syllable-initial C
• Insert syllable-final C

This creates the following possible inventories of syllables, which I will call by the descriptions on the right, rather than Clements and Keyser's types or the syllable lists.

(2) Clements and Keyser syllable inventories

    Type    Syllables          Description
    I.      CV                 Minimal
    II.     CV, V              Codaless
    III.    CV, CVC            Onsetful
    IV.     CV, CVC, V, VC     All good

Clements and Keyser specifically identify several other conceivable inventories as not possible, including the inventory that forbids only V (CV, CVC, VC) and the one that forbids only VC (CV, CVC, V). Prince and Smolensky (1993/2004) adapted this analysis to Optimality Theory, framing the syllable inventories as the resolution of opposing constraints. In the simplest analysis, there are just three constraints:

• Onset: Syllables should start with a consonant
• NoCoda: Syllables should not end with a consonant
• Faith: The surface form should match the underlying form

In this analysis, regardless of what sequence of consonants and vowels are in the underlying form, the inventory will emerge from the constraint interaction. If Onset is ranked above Faith, then consonants will be inserted (or vowels deleted) to prevent syllable-initial vowels, violating Faith to satisfy Onset, and if Faith is ranked above Onset, then syllable-initial vowels will be permitted. The analysis for codas is similar, so that the four syllable inventory types emerge from the combination of the two independent rankings.

(3) Syllable inventory typology in Optimality Theory

                        Onset » Faith    Faith » Onset
    NoCoda » Faith      Minimal          Codaless
    Faith » NoCoda      Onsetful         All good

Unfortunately, because of the simplicity of these analyses, these typologies undergenerate the possible inventories of languages on two counts:

• Inventories that forbid only VC are observed in a few languages (to be discussed in §4.3) as well as during the course of acquisition of 'All good' inventories (Levelt, Schiller, & Levelt, 1999)
• Arrernte, an aboriginal Australian language, is described as having VC syllables only (Breen & Pensalfini, 1999)

The analyses of Clements and Keyser and of Prince and Smolensky each predict that neither of these inventories can exist, while Jakobson's original generalization does not exclude the 'No VC' languages but does exclude Arrernte. On the other hand, even though neither version of the generalization is universally true, both versions are true of a very large number of languages.


Optimality Theory diverged from Harmonic Grammar and other related weighted models of phonology (e.g. Goldsmith, 1993) by placing constraints in a partially ordered strict-domination hierarchy as opposed to a continuum of numeric weights. This move was motivated by the observation that weighted constraint interaction led to cumulativity effects, producing patterns that run contrary to typological generalizations (Legendre et al., 2006). Strict domination built non-cumulativity into the framework of Optimality Theory as a typological universal. However, like Jakobson's syllable structure generalization and many other putative typological 'universals' (Evans & Levinson, 2009), non-cumulativity is not without exceptions. Non-cumulativity is probabilistically true but not categorically true.

2.1 Cumulativity is typologically unusual

The numeric evaluation procedure of Harmonic Grammar produces two kinds of cumulativity that are typologically problematic, running contrary to two distinct typological generalizations. One kind is what I refer to as constraint cumulativity, which is when two or more low-weight constraints "gang up" on a higher-weight constraint. I refer to the other kind as violation cumulativity, which is when multiple violations of a low-weight constraint overcome the violations of a higher-weight constraint. This distinction corresponds to what Sorace and Keller (2005) call ganging up and cumulativity, what Jäger and Rosenbach (2006) call ganging-up cumulativity and counting cumulativity and what Bane and Riggle (2009) call gang effects and cartel effects. The stepwise optimization of Optimality Theory was developed because many conceivable cumulativity effects are rare or unattested in descriptions of the world's languages. The optimization procedure of Optimality Theory strikes both of these cumulativity effects from typological predictions. In Optimality Theory, if constraint X dominates constraints Y and Z, a candidate A that violates Y and Z is always preferred over a candidate B that only violates X (4a). Even if constraint Y and/or Z is violated many times (4b), A will be preferred over B.


(4) Optimality Theory Tableaux

a.)
      /input/    X     Y    Z
    ☞ A                *    *
      B          *!

b.)
      /input/    X     Y      Z
    ☞ A                ***    ***
      B          *!

In contrast, Harmonic Grammar constraints are sortable by their numeric weights, but the domination is violable. This means that if the weight of X is less than the sum of the weights of Y and Z, then candidate B, which violates only X, with a total penalty (or 'harmony') of H = −40, would be preferred over candidate A, which violates Y and Z, obtaining H = −50 (5a). This is an example of constraint cumulativity.

(5) Harmonic Grammar Tableaux

a.)
      weights    40    30    20
      /input/    X     Y     Z     H
      A                *     *    −50
    ☞ B          *                −40

b.)
      weights    40    30    20
      /input/    X     Y     Z     H
      A                **         −60
    ☞ B          *                −40

Similarly, if candidate A did not violate Z but violated constraint Y twice (5b), the harmony score would be H = −2 × 30 = −60, and candidate B would again be preferred, a case of violation cumulativity.
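To make the arithmetic explicit, the following minimal sketch recomputes the tableaux in (5), using the same weights and violation profiles as above; the winner is simply the candidate with the highest (least negative) harmony.

```python
WEIGHTS = {"X": 40, "Y": 30, "Z": 20}

def harmony(violations, weights=WEIGHTS):
    """Harmonic Grammar penalty: H = -sum of (weight * violation count)."""
    return -sum(weights[c] * n for c, n in violations.items())

tableau_5a = {"A": {"Y": 1, "Z": 1}, "B": {"X": 1}}
tableau_5b = {"A": {"Y": 2},         "B": {"X": 1}}

for name, tableau in [("5a", tableau_5a), ("5b", tableau_5b)]:
    scores = {cand: harmony(viols) for cand, viols in tableau.items()}
    best = max(scores, key=scores.get)
    print(name, scores, "->", best)
# 5a {'A': -50, 'B': -40} -> B   (constraint cumulativity)
# 5b {'A': -60, 'B': -40} -> B   (violation cumulativity)
```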

2.1.1 Constraint cumulativity

Constraint cumulativity often produces unattested grammatical patterns. A relatively simple example of this is the factorial typology associated with allophony driven by contextual markedness (McCarthy, 2002, 82–91). In this kind of allophony, the contextual markedness or special-case markedness constraint (S) is in a ‘Paninian relation’ (Prince & Smolensky, 1993/2004) with a general-case markedness constraint (G) that penalizes the output favored by S, and a faithfulness constraint (F) penalizes mismatches between input and output. To be more concrete we will use the constraints relevant in coronal palatalization, but many


allophonic alternations can be (and have been) analyzed with a similar set of constraints.¹ The constraints associated with coronal palatalization are:²

• Id(ant) (F): Segments in the output should have the same anteriority as the corresponding segments in the input. (Violated by unfaithful mappings like /s/→/S/.)
• No[−ant] (G): Output segments should not have [−ant] features. (Violated by palatals like /S, tS/.)
• Pal (S): Output segments should have [−ant] features when followed by high/front [−cons] segments. (Violated by sequences of an alveolar like /s, t/ and a high front vocoid like /i, j/.)

The special-case markedness constraint S is 'special' because the contexts in which it applies are a subset of the contexts in which the general markedness constraint applies. In this case, G applies to all segments, and S applies to segments in the context preceding high or front vocoids. In Optimality Theory, there are four possible outcomes: full contrast, complementary distribution, contextual neutralization in the special-case context, and context-free total neutralization. And these correspond well with what is observed for palatalization:

(6)

a) Full contrast (no alternation): Spanish contrasts /t/ and /tS/ and has no synchronic palatalization processes, even though the affricate is the result of diachronic palatalization processes (Baker, 2004).

b) Complementary distribution ('allophonic' alternation): Minas Gerais Portuguese (as well as several other dialects) has affricates /tS, dZ/ before /i, j, ĩ/ in complementary distribution with alveolar stops /t, d/ elsewhere (Cristófaro-Silva, 2003). Korean has /S, S͈/ before /i, y/, in complementary distribution with /s, s͈/ elsewhere (Kim, 2002).

c) Special-case neutralization ('phonemic' alternation): Polish has a dental series /s, z, t, d/ which contrasts with an alveolo-palatal series /ɕ, ʑ, tɕ, dʑ/ in most environments but neutralizes with the alveolo-palatal series before /i, j, E/ (Rubach, 2003). Korean has /t, tʰ/ which contrasts with /tS, tSʰ/ in general but neutralizes with them before /i, y/ (Kim, 2002).

d) Total neutralization (no alternation): Inhambane Gitonga has a large inventory of alveolar segments, but no palatal obstruents (Lanham, 1955). In contrast to other Brazilian dialects of Portuguese, São Paulo Portuguese has /t, d/ even when they are preceded and followed by /i/, with no affricates in the language (Cristófaro-Silva, 2003).

Each of these patterns can be represented in Harmonic Grammar just as well as in Optimality Theory. (Tableaux are provided in Appendix A.)

Footnote 1: For example, vowel nasalization (McCarthy & Prince, 1995), post-vocalic spirantization (Benua, 2000) and intervocalic voicing (Kager, 2006).

Footnote 2: This analysis assumes that only coronals are specified for anteriority. I have adapted these constraints from the analysis given by Kim (2002), which glosses over cross-linguistic variation in the triggers, targets and degree of palatalization (Bateman, 2007). The characteristics of interest here carry over to a more complete analysis.

(7)

a) Full contrast: Occurs in OT when F » {S, G}. Occurs in HG when WF > WG > WS − WF.

b) Complementary Distribution: Occurs in OT when S » G » F. Occurs in HG when WS − WF > WG > WF.

c) Special-case Neutralization: Occurs in OT when S » F » G. Occurs in HG when WS − WG > WF > WG.

d) Total Neutralization: Occurs in OT when G » {F, S}. Occurs in HG when WG > WF + WS.

However, the fact that each interaction type has a more complicated ordering relation in Harmonic Grammar allows for a fifth outcome: neutralization in the general case, or 'reverse positional neutralization', a reportedly unattested pattern (Smith, 2000; Kager, 2006).

(8) General-case Neutralization: Occurs in HG when WG > WF > |WS − WG|.

For example, if WS and WF are 20 and WG is 30, then the output neutralizes to the unmarked option in the general case, but remains faithful in the special case. So a phonological contrast would be maintained in the 'weaker' position targeted by the positional markedness constraint, while the contrast is lost in general.³ In a general-case neutralization pattern, the language could contrast /s, S/ (or /t, tS/) before /i/, but neutralize by depalatalization to /s/ (or /t/) elsewhere.

Footnote 3: Note that this general-case neutralization produces the unmarked allophone in 'strong' contexts (contexts that support contrastive cues), which should not be confused with neutralization involving positional faithfulness constraints (Beckman, 1998), where neutralization produces the unmarked allophone in general-case 'weak' contexts.

(9) General-case neutralization

      weights    30             20        20
      /sa/       G: No[−ant]    S: Pal    F: Id(ant)    H
    ☞ sa                                                  0
      Sa         *                        *             −50

      /Sa/       G: No[−ant]    S: Pal    F: Id(ant)    H
    ☞ sa                                  *             −20
      Sa         *                                      −30

      /si/       G: No[−ant]    S: Pal    F: Id(ant)    H
    ☞ si                        *                       −20
      Si         *                        *             −50

      /Si/       G: No[−ant]    S: Pal    F: Id(ant)    H
      si                        *         *             −40
    ☞ Si         *                                      −30

Ukrainian (Rubach, 2006) and Japanese (Ito & Mester, 2003) exhibit depalatalizing neutralization before /E/ only, but there do not appear to be attested cases of languages with depalatalizing neutralization before low or back vowels.
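The condition in (8) can be checked mechanically. This minimal sketch evaluates the four inputs of (9) under the weights used there (WG = 30, WS = WF = 20), with candidate sets and violation profiles taken from the tableaux above.

```python
WEIGHTS = {"G": 30, "S": 20, "F": 20}   # G = No[-ant], S = Pal, F = Id(ant)

TABLEAUX = {   # violation counts per candidate, as in tableau (9)
    "/sa/": {"sa": {},                 "Sa": {"G": 1, "F": 1}},
    "/Sa/": {"sa": {"F": 1},           "Sa": {"G": 1}},
    "/si/": {"si": {"S": 1},           "Si": {"G": 1, "F": 1}},
    "/Si/": {"si": {"S": 1, "F": 1},   "Si": {"G": 1}},
}

def harmony(violations):
    return -sum(WEIGHTS[c] * n for c, n in violations.items())

for inp, cands in TABLEAUX.items():
    best = max(cands, key=lambda c: harmony(cands[c]))
    print(inp, "->", best)
# /sa/ -> sa, /Sa/ -> sa (neutralized in the general case),
# /si/ -> si, /Si/ -> Si (contrast preserved before /i/)
```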

2.1.2 Violation cumulativity

Violation cumulativity also produces typologically unusual patterns, which run contrary to the generalization that 'grammars don't count', or at most they count 'up to two' (McCarthy & Prince, 1986/1996). That is, 'edgemost' phenomena (e.g. stress hierarchies (Hayes, 1995) and edge-oriented infixes (A. Yu, 2003)) and 'second position' phenomena (e.g. the placement of certain clitics and Germanic finite verbs (Anderson, 2000)) are not uncommon, but there are very few claims of phenomena associated with higher numbers. This generalization is encoded in two ways in Optimality Theory. Initially, this was implemented


in the basic optimization procedure, which prevents violations of different constraints from being quantitatively compared (Prince & Smolensky, 1993/2004), and McCarthy (2003) takes it a step further to argue that each constraint violation refers to structural categories, excluding constraint definitions that increase violations according to gradient distances. I will review the basic argument against violation cumulativity and then return to the question of gradient distance constraints.

Legendre et al. (2006) illustrate the violation cumulativity problem with a quantity-sensitive 'default to same side' stress system (Walker, 1997), like Golin or Aguacatec Mayan, characterized as: stress falls on the rightmost heavy syllable if there is a heavy syllable, or if there are no heavy syllables, stress falls on the last syllable. In Optimality Theory, this can be described by the following two constraints:⁴

• StressHeavy (StrH): Heavy syllables should bear stress. (Each unstressed heavy syllable produces a violation.)
• StressRight (StrR): The stressed syllable should be near the end of the word. (Each syllable intervening between the stress and the word end produces a violation.)

Footnote 4: These are conventionally known as the Weight-to-Stress Principle (WSP) and MainStressRight, but they are renamed here for simplicity. This analysis assumes that there is a single main stress per word, and that feet are not involved.

In this analysis, stress falling on the rightmost (heavy) syllable is obtained by comparing the number of violations of StressRight for each of the candidate forms.

(10) Comparison of violations in Optimality Theory

      /H H L L/    StrH    StrR
    ☞ H H́ L L      *       **
      H́ H L L      *       ***!
      H H L Ĺ      **!

The first two candidates violate StressHeavy once, while the third candidate violates it twice, so the third candidate is eliminated. StressHeavy must be ranked over StressRight so that violations of StressHeavy are evaluated first, as evaluating violations of StressRight first would lead to the third candidate being optimal. Once the third candidate is eliminated, StressRight determines that the first candidate is optimal because it violates it twice, while the second candidate violates it three times.

This tableau also works in Harmonic Grammar for any weighting where the weight of StressHeavy is more than twice that of StressRight. Regardless of weighting, two violations of StressHeavy are worse than one, and three violations of StressRight are worse than two, so as long as the cost of the third candidate's second violation of StressHeavy is greater than that of the first candidate's two violations of StressRight, the first candidate will be optimal. However, the suitability of the chosen weights depends on how long words can be. Under some weightings Harmonic Grammar changes behavior for long words, and if words can be arbitrarily long, no weighting succeeds. Consider the tableaux in (11), where the input has just one heavy syllable.

(11) a) /L H L L/ → Stress on heavy syllable

      weights      50      20
      /L H L L/    StrH    StrR    H
    ☞ L H́ L L              **     −40
      L H L Ĺ      *               −50

b) /H L L L/ → Stress on last syllable

      weights      50      20
      /H L L L/    StrH    StrR    H
      H́ L L L              ***    −60
    ☞ H L L Ĺ      *               −50

The weights are assigned in such a way that the correct candidate is selected as optimal in (11a), since two violations of StressRight have a lower total cost than a single violation of StressHeavy. However, when the heavy syllable is one step further from the right edge (11b), the total cost is greater than the violation of StressHeavy, and the stress shifts to the final syllable. This particular tableau could be fixed by increasing the weight of StressHeavy, but for any weighting of the constraints in (11), a sufficiently long word will result in a tableau in which the stress shifts to the final syllable.
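The window prediction can be made concrete with a small sketch. It assumes the weights from (11) (StrH = 50, StrR = 20) and considers only the two competing stress sites, the lone heavy syllable versus the final syllable, as the heavy syllable sits further from the right edge.

```python
W_STRH, W_STRR = 50, 20

def best_stress(num_light_after_heavy):
    """Compare stressing the single heavy syllable vs. the final syllable."""
    h_heavy = -W_STRR * num_light_after_heavy   # one StrR violation per syllable after stress
    h_final = -W_STRH                           # StrH violated by the unstressed heavy
    return "heavy" if h_heavy > h_final else "final"

for n in range(1, 6):
    print(f"H {'L ' * n}-> stress on {best_stress(n)} syllable")
# n = 1, 2 -> heavy; n = 3, 4, 5 -> final: a three-syllable window with these weights,
# and for any finite weights there is some word length at which stress shifts to the end.
```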

Since Harmonic Grammar is a typological theory, these tableaux show that this kind of stress-shifting pattern is predicted to be possible for windows of any length. As it turns out, this kind of stress-shifting pattern actually is known for two-syllable windows in 'right-headed right-edge' languages like Yapese (Jensen, 1977), and it is actually reported for a three-syllable window in Pirahã (Everett, 1988), but not for longer windows. Optimality Theory analyses of these languages obtain the window effect through additional constraints, while the Harmonic Grammar analysis uses fewer constraints but predicts windows that are longer than what is attested. The counting effect occurs because the cost of the second candidate's single violation of StressHeavy is quantitatively compared against the cost of the first candidate's multiple violations of StressRight. The evaluation procedure of Harmonic Grammar requires comparisons of this sort, while the procedure of Optimality Theory specifically prevents it, requiring that constraints are unranked only when no comparison is relevant, and otherwise they are strictly ranked and thus quantitatively incomparable.

McCarthy (2003) goes a step further, arguing that constraints like StressRight, which assign violations according to a distance, create further typological irregularities in Optimality Theory, and that these 'generalized alignment' constraints can be and should be replaced with categorical constraints acting on more structured forms. It is still possible to have multiple violations, but only when there are multiple structures that violate the constraint; for example, the third candidate in (10) above violates StressHeavy twice because two distinct syllables are heavy and not stressed. Restricting the constraint set in this way eliminates counting effects in this stress system,⁵ but it is not a general solution to the problem of violation cumulativity in Harmonic Grammar.

Footnote 5: The multiple violations of StressHeavy are not banished in McCarthy's analysis, but they would not create violation cumulativity here. If we replace StressRight with a categorical edgemost constraint StressRightMost (StrRM), we might expect to see pathological stress alternations in words with several heavy syllables, but only one violation of StressHeavy can be traded away for a violation of StressRightMost (12), so the problem reduces to a comparison of one violation of StressHeavy to one violation of StressRightMost.

(12) WStrH < WStrRM → Stress on final syllable

      weights    20      30
      /H H L/    StrH    StrRM    H
      H́ H L      *       *       −50
      H H́ L      *       *       −50
    ☞ H H Ĺ      **               −40

First of all, there are other cases where multiple violations of one constraint can be traded off against a single violation of another constraint. In one of the less simplistic versions of Prince and Smolensky's (1993/2004) syllable structure typology, the Faith constraint is split into three distinct constraints.

• DepV: Any vowel in the output should correspond to a vowel in the input
• DepC: Any consonant in the output should correspond to a consonant in the input
• Max: Any segment in an input form should correspond to a segment in the output

With the modified constraint set, Harmonic Grammar can produce violation cumulativity patterns like the following (Bane & Riggle, 2009), where the input consisting of a single vowel is deleted to satisfy Onset, but a /VC/ syllable is maintained because deleting it would violate Max twice.

(13) Violation cumulativity in syllable typology

      weights    40      30       20
      /V/        DepC    Onset    Max    H
      V                  *              −30
      CV         *                      −40
    ☞ ∅                           *     −20

      /VC/       DepC    Onset    Max    H
    ☞ VC                 *              −30
      CVC        *                      −40
      ∅                           **    −40

The alternative repair strategy, epenthesizing an onset, is unavailable in this case because it is more costly than violating Onset, but if the weight of DepC were reduced to 25, the optimality of the epenthetic candidate would again be due to violation cumulativity. Secondly, restricting gradient alignment constraints in Harmonic Grammar does not help resolve the violation cumulativity question, because there are some phenomena that are sensitive to lengths greater than two, and though eliminating alignment constraints may help describe the typology of non-variable phenomena, it does not help explain why counting is typologically unusual or what the distinction is between domains where it is less common and domains where it is more common.

2.2 Cumulativity effects do exist

In spite of the scarcity of cumulativity effects, both violation cumulativity and constraint cumulativity are attested. Evidence for cumulativity effects comes from three kinds of sources: Optimality Theory analyses that use local constraint conjunction, variable phenomena, and typological exceptions.


2.2.1 Local constraint conjunction

One set of evidence for cumulativity in phonology comes from a set of phenomena which have been addressed in Optimality Theory via the mechanism of local constraint conjunction, a hypothesis that violations of certain combinations of constraints within a certain domain constitute a new constraint that is ranked above either of the constraints that compose it. Local constraint conjunction suffers from a lack of consensus about the restrictions on constraint combinations and domains. Strict domination was introduced into Optimality Theory to preclude cumulativity effects, but one theoretical amendment permits ganging-up effects in certain conditions. Primarily in order to account for harmonically complete inventories, Smolensky (1995) introduced the concept of local constraint conjunction. Local conjunction creates a constraint C by conjoining two other constraints A and B within a prosodic domain D: C = A &D B. This was motivated by the reliability of markedness hierarchies. If a language’s segmental inventory contains some segment in a markedness hierarchy, segments of lower markedness are generally also in the inventory, and if an inventory does not contain a segment, segments that are more marked are also not in the inventory. For a single dimension this is unproblematic. For example, along the markedness hierarchy *[dorsal] » *[labial] » *[coronal], interposing a constraint Faith(place) between *[dorsal] and *[labial] causes dorsals to neutralize to coronals (or glottals), leaving labials and coronals in the inventory. Interposing Faith(place) between *[labial] and *[coronal] causes labials to neutralize as well, producing the smaller inventory. Along two dimensions, however, simple constraint ranking can not represent inventories that ‘ban only the worst of the worst’ (BOWOW). For example, English has dorsal, labial and coronal obstruents, and both fricatives (more marked) and stops (less marked), but there are no dorsal fricatives.


(14) English BOWOW inventory

                 [dorsal]    [labial]    [coronal]
    [+cont]      *x          f           s
    [−cont]      k           p           t

No simple constraint ranking can eliminate /x/ without also eliminating either /k/ or the other fricatives. This problem is solved with local conjunction since the conjunction of *[dorsal] and *[+cont] can outrank Faith(place) or Faith(cont), to eliminate /x/ from the inventory. Local conjunction has also been recruited to explain many other phenomena in Optimality Theory. Smolensky (2006) provides a survey of a dozen kinds of phenomena explainable as blocked or triggered by locally conjoined constraints, including source-conditioned and target-conditioned harmony, chain shifts, vowel nasalization and OCP effects. The greater power of local conjunction means that particularly odd unattested languages can be produced, so restrictions on the power of local conjunction have been pursued, by limiting what categories of constraints can be conjoined (Ito & Mester, 1998; Bakovic, 2000), and what local domains they can be conjoined in (Moreton & Smolensky, 2002; Lubowicz, 2005), but no consensus was reached on how to appropriately limit the typological consequences of local conjunction.

2.2.2 Cumulativity in variable phenomena

Jäger and Rosenbach (2006) show that both violation cumulativity and constraint cumulativity are found in the English alternation between of genitives (e.g. 'the nose of the other person') and s genitives (e.g. 'the other person's nose'). Two of the key factors in the alternation are the animacy and length of the possessor, which we express as Harmonic Grammar constraints in order to show the interaction tableaux.

• AnimatePossLeft (AL): Possessors that are animate should precede the genitive morpheme
• HeavyPossRight (HR): Possessors that are long should come after the genitive morpheme

Jäger and Rosenbach measure length of the possessor as the number of words in addition to the head word. This choice affects the numerical weights but not the qualitative results. The corpus data they examined showed an overall preference for animate possessors to use the s genitive, but when the possessor noun phrase had two premodifiers, just over half of the instances were s genitives, and when there were three or more premodifiers, the of genitive was strongly preferred. In the Harmonic Grammar analysis, this corresponds to making WAL about twice that of WHR. With 0 or 1 premodifiers, the cost of the violation of AnimatePossLeft was more than the cost of the violations of HeavyPossRight (15a), but for three premodifiers, the accumulated cost of the violations of HeavyPossRight exceeds the cost of the violation of AnimatePossLeft (15b).

(15) a) Animate possessor with one premodifier

      weights                          20    10
      daughter OF the doctor           AL    HR    H
      the daughter of the doctor       *          −20
    ☞ the doctor's daughter                  *    −10

b) Animate possessor with three premodifiers

      weights                                          20    10
      policy OF the right honorable gentleman          AL    HR     H
    ☞ the policy of the right honorable gentleman      *           −20
      the right honorable gentleman's policy                 ***   −30

This phenomenon cannot be represented in Optimality Theory, even in versions of Optimality Theory that deal with variation. Two other factors that are known to influence the genitive alternation are topicality of the possessor and prototypicality of the possessing relation. Possessors that are suitable topics in the discourse context, and certain possessive relationships like kinship and body parts, are more likely to occur with the s genitive. In both experimental and corpus data, animacy is shown to be the strongest factor of the three, and yet when the possessor is animate but the other two variables favor the of genitive, the of genitive is (slightly) preferred. That is, the constraints of prototypicality and topicality gang up over the animacy constraint.
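Using the weights from (15), a short sketch reproduces the crossover as the possessor grows; the premodifier counts are illustrative, and the harmonies are computed exactly as in the tableaux above.

```python
W_AL, W_HR = 20, 10

def preferred_genitive(n_premodifiers):
    h_of = -W_AL                      # of-genitive: animate possessor follows the morpheme
    h_s = -W_HR * n_premodifiers      # s-genitive: one HR violation per premodifier
    if h_of == h_s:
        return "either (tie)"
    return "s-genitive" if h_s > h_of else "of-genitive"

for n in range(5):
    print(n, "premodifiers ->", preferred_genitive(n))
# 0, 1 -> s-genitive; 2 -> tie; 3, 4 -> of-genitive, mirroring the corpus trend
# (repeated HR violations eventually overtake the single AL violation).
```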

2.3 Universals and typology, without universals

Cumulativity has been shown to be typologically problematic. Harmonic Grammar treats it as possible, and as a result, it predicts a variety of language patterns that are unattested cross-linguistically, even when only considering a few constraints. Optimality Theory treats cumulativity as impossible, but it then requires more constraints to describe attested patterns that are described in Harmonic Grammar with cumulative constraint interaction. The additional constraints that Optimality Theory requires are themselves typologically problematic because adding them to the analysis often creates predictions of many unattested language patterns. So the attested language patterns that can be described in Harmonic Grammar through cumulative interactions are exceptions to a typological generalization: cumulativity effects are possible but generally uncommon. How can this kind of gradient generalization be described or explained?

3 From real-time bias to typological priors

Harmonic grammars are actually well-suited for the task of describing probabilistic generalizations, because they are closely related to maximum entropy models, a class of probabilistic formalisms for constraint interaction. In addition, probabilistic models of language evolution have demonstrated a bidirectional link between learning biases and evolutionarily stable distributions for language typology. In maximum entropy models of synchronic linguistic variation, learning biases can favor or disfavor regularizing variation, and psycholinguistic studies


indicate that people generally use regularizing learning biases. Regularizing learning biases also disfavor cumulativity effects, which leads to the prediction that compared to the space of possible languages, cumulativity effects are relatively improbable in linguistic typology.

3.1 Harmonic Grammar as Maximum Entropy Grammar

Maximum entropy models (Kindermann & Snell, 1980; Berger, Della Pietra, & Della Pietra, 1996) are a class of mathematical formalisms for modeling event probabilities as determined by a system of interacting constraints. They are widely used in a variety of fields and are known by a number of names, including multiple logistic regression, log-linear models, and conditional random fields. The VARBRUL methodology from sociolinguistics (Cedergren & Sankoff, 1974) is also closely related. A maximum entropy model is a domain-general learner, as applicable to gambling odds or thermodynamics as it is to cognitive and linguistic tasks, making near-optimal use of the information represented in the constraint violations, and the mathematical properties of the formalism are rather well understood. These facts make it a useful baseline or null-hypothesis "ideal learner" in modeling language learning and its result. In addition, because of its foundation in Bayesian statistics, learning bias is an explicit object in the framework. A variety of research has discussed and taken advantage of similarities between the framework of Optimality Theory and maximum entropy models (Eisner, 2000; Goldwater & Johnson, 2003; Wilson, 2006; Jäger, 2007; Hayes & Wilson, 2008), and the similarity between Harmonic Grammar and maximum entropy models is even closer, to the extent that Harmonic Grammar can be considered a subclass of maximum entropy models (Goldwater & Johnson, 2003). A Harmonic Grammar interpreted as a maximum entropy model has become known as a Maximum Entropy Grammar (Hayes & Wilson, 2008; Zhang, Lai, & Sailor, 2009). The constraint weights and candidate harmonies of a Harmonic Grammar closely correspond with the constraint weights and event log-probabilities in a maximum entropy model.

A Harmonic Grammar tableau can be extended into a Maximum Entropy Grammar tableau (16) by exponentiating the harmonies to produce unnormalized probabilities ψ, which are then converted to proper probabilities p by dividing by their sum Z.

(16) Rightward default-to-same-side stress in a Maximum Entropy Grammar

      weights      50      20
      /H H L L/    StrH    StrR    H       ψ                      p
    ☞ H H́ L L      *       **     −90     2^−90 ≈ 10^(−3×9)      0.999
      H́ H L L      *       ***    −110    2^−110 ≈ 10^(−3×11)    0.000
      H H L Ĺ      **             −100    2^−100 ≈ 10^(−3×10)    0.001

      Z ≈ 1.001 × 2^−90

So in this example, the harmony of the first candidate (−90), produced by one violation of StressHeavy plus two violations of StressRight (−1 × 50 + −2 × 20), is converted to an unnormalized probability (2^−90) by exponentiating with base 2 (see Footnote 6). Since 2^−10 is just a little less than 10^−3 = 0.001, 2^−90 is approximately (10^−3)^9 = 10^(−3×9). Because the other two ψ terms are 1000 and 1000000 times smaller, they barely influence the sum Z, and the calculated probabilities for those two candidates are thus about 1/1000 and 1/1000000, leaving the rest of the probability on the first candidate. This interpretation of candidate harmonies as log-probabilities ties the harmonies to observed data in a way that goes beyond the criteria of conventional Harmonic Grammar, but any Harmonic Grammar tableau is interpretable as a Maximum Entropy Grammar without modification of the weights. When differences among the weights are particularly large (where 'large' is determined by the base of exponentiation) the probability mass assigned to non-optimal candidates is negligible. Since there is always some uncertainty about the probability of events that have not been observed, Maximum Entropy Grammars are completely capable of modeling non-variable phenomena, while on the other hand, the probability mapping also provides a framework for straightforward treatment of variable phenomena.

Footnote 6: Maximum entropy models conventionally use e as the basis of exponentiation, but I use base 2 here because it makes the scale of the weights more intuitive. A weight difference of 1 is an odds ratio of 2:1, a weight difference of 3 is an odds ratio of 8:1, and a weight difference of 10 is an odds ratio of about 1000:1, effectively eliminating variation.

Learning bias comes into the framework in the form of a prior probability distribution over possible constraint weights, a mathematical representation of the learner's expectation (or the modeler's expectation) about weight values, before exposure to the observable data of the particular language. The learning biases commonly used in maximum entropy models express vague expectations that small weights are more probable than large weights, but in the context of modeling typology, the typologist's expectation about probable constraint weights could be much more specific, reflecting implicational hierarchies or other typological trends. The trend of non-cumulativity, however, does not need an appeal to specific knowledge about constraint weights, since it is explainable as resulting from the general shape of the typologist's weight prior, not distinguishing among constraints. Before returning to the relationship between non-cumulativity and the shape of the prior probability of the weights, I need to clarify the relationship between learner models and typological models.
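A minimal sketch of this conversion, using the weights and violation profiles of (16) and base-2 exponentiation as in the footnote:

```python
WEIGHTS = {"StrH": 50, "StrR": 20}
CANDIDATES = {            # violation counts for input /H H L L/
    "HH́LL": {"StrH": 1, "StrR": 2},
    "H́HLL": {"StrH": 1, "StrR": 3},
    "HHLĹ": {"StrH": 2},
}

def harmony(violations):
    return -sum(WEIGHTS[c] * n for c, n in violations.items())

H = {cand: harmony(v) for cand, v in CANDIDATES.items()}
psi = {cand: 2.0 ** h for cand, h in H.items()}       # unnormalized probabilities
Z = sum(psi.values())
p = {cand: psi[cand] / Z for cand in psi}

for cand in CANDIDATES:
    print(f"{cand}: H = {H[cand]}, p = {p[cand]:.6f}")
# HH́LL receives about 0.999 of the probability mass; the same weights read as a
# categorical Harmonic Grammar pick it as the unique optimum.
```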

3.2 Modeling the individual versus modeling typology

Optimality Theory and Harmonic Grammar were originally developed with the aim of describing non-varying patterns and typological possibilities, without a mapping to the time course of processing or learning. But over the years, interest has blossomed in modeling the learning process within variationist forms of Optimality Theory and Harmonic Grammar (e.g. Boersma & Hayes, 2001; Jäger, 2003; Jarosz, 2006; Boersma & Pater, 2008; Jesney & Tessier, 2010). Although Maximum Entropy Grammar is quite suitable for modeling both the individual learner's variable behavior and typological patterning, the distinction between a model of an individual learner and a model of typological diversity is large and important. In addition to language-general constraints, which derive from social, cognitive and physiological pressures, the learner's constraint set is populated by language-specific constructed constraints, indexed to the phonological and morphosyntactic inventories of their language, and the learner's prior probability of constraint weights functions as a learning bias.

The typologist's constraint set, in contrast, is populated by the language-general constraints, and the typologist's prior probability of constraint weights translates into a probability distribution over possible language patterns. Research in language evolution has shown that the two kinds of models are closely related, but the details of that relationship have yet to be elaborated.

3.2.1 Constraint sets and weight priors

Recent work testing usage-based and exemplar-based models of language processing (e.g., Pierrehumbert, 2002; Gahl & Yu, 2006; Johnson, 2006; Hay & Bresnan, 2006; Goldinger, 2007; Davis & Gaskell, 2009) is showing that the linguistic behavior of individuals is dominated by their experience of individual utterances in context. Among other things, this research shows that lexical representations are phonetically detailed, construction-specific, socially indexed, and influenced by previous exposure to particular speakers' specific utterances. The emerging picture is that utterance events leave traces in episodic memory, and generalizations, associations or abstractions are generated from them. To a limited degree, the behavior of exemplar models can be reproduced within a Maximum Entropy Grammar. Each construction, lexical item, and sublexical unit generates a constraint, and other constraints are formed by partial abstraction from these constraints. For example, in a simple model of word/pseudoword acceptability judgements, every word in the subject's lexicon could be associated with one constraint that is violated by an occurrence of that word in an unfamiliar context and another constraint that is violated by every form that is not that word. One step of partial abstraction would be to form constraints violated by any word that has a different onset or a different rime, and another step would form constraints violated by any different segment sequence or any different segments. The constraints that are closely tied to utterance events are quite language-specific, while the most abstract constraints are language-general, the constraints of conventional interest in Optimality Theory.


The maximum entropy framework by itself does not dictate which event types or partial abstractions constitute constraints, depending on some other algorithm or expert analysis to determine which constraints are considered. Because the constraints of a Maximum Entropy Grammar are immutable symbolic entities that influence each other only through the mapping from weights to data probabilities, a Maximum Entropy Grammar is not ideal for representing an exemplar model, but the maximum entropy framework is broadly compatible with rich-memory connectionist models (Goldrick, 2007), and some of the characteristic behaviors of exemplar models fall out from the effects of training the Maximum Entropy Grammar. For example, the large behavioral difference between seen (word) and unseen (pseudoword) events stems from seen events being associated with more specific indexed constraints, while unseen events are only associated with more abstract and general constraints. In the case of the word/pseudoword acceptability task, in parsing an actual word brick, any competing candidates (perceptually similar) will violate the Lex[brIk] constraint, in addition to some faithfulness constraints, so brick will be mapped faithfully through parsing and production, and easily be recognized as acceptable. In parsing the pseudoword sphrick, the faithful mapping receives no support from lexical constraints, but receives some support from sequence constraints Seq[#sf] and Seq[frI]. The faithful mapping might be non-optimal in either perception or production, losing out to something like [sIfrIk], so it would be judged unacceptable. However, the faithful mapping would still be assigned a small probability, more than for a pseudoword like [sfrgtl], and so it would be judged more acceptable than that one. The indexed constraints have high-magnitude weights, producing unambiguous ceiling-effect judgements, while the abstract constraints interact with other constraints of similar magnitude. Similarly, sensitivity to the frequency of constructions, lexical items or sublexical units can result from the greater exposure to these events, which causes these events to play a larger role in shaping the grammar. Neighborhood density effects result when multiple partial abstraction constraints apply in a single context.


In contrast to the language-specific details in the model of the individual learner, the typological model represents an ideal typologist's knowledge about language patterns in general, representing the accumulated restrictiveness of the biases from production and perception, learning and utterance selection, iterated over many generations. The constraints of this model are just the abstract constraints of the individual learner, generalized away from the language-particular exemplars, to the extent that this is possible. The forces of language change tend to propagate abstract constraints through the lexicon and constructions, but the recent history of a language may leave patterns and exceptions that cannot be abstracted away from that history (Blevins, 2006; Pater, 2009a). But even when there are exceptions and subregularities whose details cannot be abstracted away from history, there are still cross-linguistic generalizations that can be expressed in terms of abstract constraints (Orgun, 1996). In the model of an individual learner, the learning bias is represented by a probability distribution over the constraint weights, equivalent to a child's 'expectations' about the linguistic environment prior to exposure to any language data. In other words, the learning bias expresses probabilistic knowledge about the space of learnable languages and their relative learnability. In the typological model, the comparable prior probability over the constraint weights represents a linguist's knowledge about attested language patterns, or expectations about what features a newly encountered language will show, prior to learning anything about that particular language. If we map the space of weight probabilities back through the grammar, we produce the typological distribution of language patterns.

3.2.2 Why are there typological tendencies at all?

To even begin to answer the question of why cumulativity effects are possible but rare, we need to be more explicit about why there are typological trends at all. Where do constraints on language typology come from, and how do they become expressed in languages? The battlefields of academia are strewn with manuscripts arguing for the supremacy of


language-specific innate knowledge (Chomsky, 1965; Pinker & Bloom, 1990), domain-general cognition (Elman, Bates, Johnson, & Karmiloff-Smith, 1996), sociocultural experience (Dik, 1997), or the experience of language itself (Hopper & Traugott, 2003). While it is still useful to distinguish these different influences on linguistic behavior, recent work in language evolution suggests that not only do these influences all interact in creating language, but that they all reshape each other. The Iterated Learning Model (ILM, Kirby, Smith, & Brighton, 2004) derives typological generalizations from the dynamics of language learning in the context of sociocultural and sociobiological evolution. For example, an early investigation (Kirby, 1999) developed models of how the implicational hierarchy of word order in noun phrases (e.g. in prepositional languages, nouns preceding their modifying adjectives implies that nouns precede their modifying genitives) could be seen as metastable states in the space of grammars, quasi-optimal because of the opposition between speaker economy and communicative success. Griffiths and Kalish (2007) provide a Bayesian analysis of iterated learning, showing that the learners' learning bias (their 'expectations' prior to exposure to language) is essential to predicting the distribution of languages that is stable under iterated learning. Learners are provided with language data and produce a hypothesized language model by considering how likely it is that each possible hypothesis produced the given data, hedging their bets based on their prior expectations about possible language models. In the analysis that Griffiths and Kalish provide, the next generation of language data is produced by sampling data generated by the acquired language model(s). Under the assumption that learners' utterances are produced by first sampling a language model hypothesis from the space of possibilities and using that to generate an utterance (i.e., sampling from the posterior distribution), iterated learning filters out any effect of the initial learning data, and the final evolutionarily stable distribution matches the distribution of the learners' prior expectations.


stable distribution is dominated by the languages with highest probabilities in the prior probability distribution, with the relative frequencies determined by several factors, including the structure of the hypothesis space, the quantity of data exposure, and communication noise. When people use each strategy for language learning (sampling the hypotheses versus using the optimal hypothesis) is something of an open question, closely entangled with the question of under what conditions people favor regularizing variation versus matching the variation probabilities (C. Hudson Kam & Newport, 2009; Reali & Griffiths, 2009). We will return to these questions in the next section. The model that Griffiths and Kalish use emphasized the role of innate bias, but Kirby, Dowman, and Griffiths (2007) extend the analysis to include a prior probability over meanings, characterizing the functional requirements of language in use (Fig. 1). Just as the learning bias acts to subtly reshape the language model hypothesis space (the mental I-language) during learning, utterance selection mechanisms shape the language data space (the community E-language). This meaning prior similarly affects the language typology. For example, when the meaning prior favors a few common meanings, the stable languages reproduce the well-known pattern of irregular morphology concentrated in the most common words. Even though this model still oversimplifies the system (for example, it does not include the influence that our cognitive models of social context have on learning, or the influence that physiological constraints have on production), it does provide a formal entry point for many of the functional pressures on language. Christiansen, Chater, and Reali (2009) show by simulation that this learner-external prior is the one that is most explanatory of how language differs from other cognitive abilities, since the learning bias consists of two parts: a general cognitive bias which predates language and is largely tethered to other cognitive tasks, and a linguistic bias which evolves biologically toward the language distribution that is determined by the external bias in conjunction with the general cognitive bias. Since biological evolution is so much slower than the evolution of language and culture, there is little opportunity for the linguistic bias to encode information. That is, migration of learned behavior into


Figure 1: Influence network of language evolution under learning bias and utterance selection. The learning process induces an I-language model from E-language data, under a prior probability composed of general Cognitive-perceptual and Linguistic biases. The production process generates E-language data from an I-language model, in concert with a Sociocultural prior probability over utterances, which is composed of the aggregated selection biases of shared experience and communicative and social goals.

learning biases is weak and only applies to features that are consistently emergent from the interaction of general cognition and sociocultural interaction. So regardless of where a selective pressure originated, it will be exaggerated in the typological stable distribution and reflected weakly in individuals’ learning bias. The effects of sociocultural selective pressures and general cognitive learning biases will each be magnified by the iterated learning process, and the linguistic bias will evolve to be mildly informative of the resulting stable distribution. Typological generalizations characterize the evolutionarily stable distribution, which on the one hand is quite restrictive compared to the distribution of learnable systems, and on the other hand is full of low probability but non-negligible exceptions, since it is generated by a complex dynamical system. On this basis, I hypothesize that the typological generalization against cumulative constraint interaction is reflected in


people’s learning biases. If the generalization derives primarily from general cognition, then learning biases may reflect it relatively strongly, and if the generalization derives primarily from utterance selection, the learning biases will reflect the generalization more weakly. Whether learning bias is primarily a cause or primarily a result of the non-cumulativity generalization, we can expect to find part of the explanation in the shape of the learning bias.
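To make the contrast between sampling and MAP learners concrete, the following is a minimal sketch of iterated Bayesian learning over two hypothetical languages. It is my own illustration rather than a reimplementation of the cited studies; the prior values, production probabilities, and one-datum bottleneck are arbitrary assumptions chosen only to show the qualitative difference.

```python
import random

# Toy iterated learning: two candidate "languages", each defined by how often
# it produces form A (vs. form B). The learner's prior favors language 0.
PRIOR = {0: 0.9, 1: 0.1}      # assumed prior over hypotheses
P_A   = {0: 0.8, 1: 0.2}      # P(form A | language)

def posterior(data):
    """Posterior over the two hypotheses given a list of observed forms."""
    scores = {}
    for h in (0, 1):
        like = 1.0
        for form in data:
            like *= P_A[h] if form == "A" else 1.0 - P_A[h]
        scores[h] = PRIOR[h] * like
    z = sum(scores.values())
    return {h: s / z for h, s in scores.items()}

def produce(h, n):
    return ["A" if random.random() < P_A[h] else "B" for _ in range(n)]

def chain(strategy, generations=5000, n_data=1):
    """Run a transmission chain; n_data is a deliberately narrow bottleneck."""
    counts = {0: 0, 1: 0}
    data = produce(0, n_data)
    for _ in range(generations):
        post = posterior(data)
        if strategy == "sample":              # sample a hypothesis from the posterior
            h = 0 if random.random() < post[0] else 1
        else:                                 # MAP: pick the most probable hypothesis
            h = max(post, key=post.get)
        counts[h] += 1
        data = produce(h, n_data)             # next generation's learning data
    return {h: c / generations for h, c in counts.items()}

if __name__ == "__main__":
    random.seed(1)
    print("sampling learners:", chain("sample"))  # long-run frequencies ~ the prior (0.9/0.1)
    print("MAP learners:     ", chain("map"))     # with this tiny bottleneck the prior mode takes over
```

With a larger bottleneck (more data per generation) the MAP chain tracks the data more closely, so the degree of prior dominance shown here should be read as an extreme point, not a general prediction.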

3.3 Hypotheses about the weight space

The literature on training maximum entropy models and learning Harmonic Grammars suggests several different hypotheses about the shape of the learning bias. Psycholinguistic studies do not definitively identify which of these best approximates human behavior, but these studies do indicate that some hypotheses are better than others. In particular, the hypotheses about the shape of the learning bias differ with respect to two characteristics of linguistic behavior: generalization from observed language data to new language data, and maintenance or regularization of inconsistent variation. If we are agnostic about any differences among kinds of constraints (e.g. markedness versus faithfulness, or exemplar-based versus substantive), the shape of the weight space can be projected from just two probability distributions. One is the distribution over the constraint weights (i.e., over the costs associated with the first violation of each constraint), and the other is the distribution over violation costs (i.e., the costs associated with additional violations of an already violated constraint). The distribution over constraint weights influences generalization and regularization in the individual learner model and both constraint cumulativity and violation cumulativity in the typological model. The distribution over violation costs has a smaller influence on the nature of variation in the individual learner, and also affects violation cumulativity in the typological model. The calculations of Harmonic Grammar tableaux have consistently assumed that the total violation cost is exactly proportional to the number of violations, while the constraint

weights, in the general case, are chosen somewhat arbitrarily to illustrate particular possibilities, or mapped onto a geometric series (e.g. 1, 10, 100, 1000, . . . ) when simulating Optimality Theory (Legendre et al., 2006). However, typological discussion of Harmonic Grammar has focused on what is possible rather than what is probable, generally leaving assumptions about these two distributions implicit. Because these assumptions do have important consequences in the typological probabilities, they merit explicit formulation and reconsideration.

3.3.1 Probability distributions of constraint weights

The literature on maximum entropy models makes use of three different hypotheses about the distribution of constraint weights. Literature on learning in variationist Harmonic Grammar suggests one more. The mathematically simplest prior weight distribution is a locally uniform distribution (or unbounded uniform distribution), which represents the expectation that the weights could be anything, e.g. as likely to be 100 or a million as it is to be 10 or 1. It is unbiased in the sense that the learner will accept whatever weight configuration is best supported by the observed data. The locally uniform prior distribution is often used in maximum entropy models that have a relatively small number of constraints that are general and non-redundant, but it is problematic for two reasons. In detailed language learning, which requires models that have a large number of constraints, many of them specific or overlapping, using a locally uniform prior leads to a language model that does not generalize to unobserved data, when in fact human language learners regularly extend the language in creating new utterances. In language typology, we cannot use a locally uniform distribution to predict language probabilities because the upper bound of the distribution is undefined. Typologically-oriented Harmonic Grammars have required all weights to be non-negative, because if general abstract constraints are permitted to have negative weights in some languages and positive weights in others, this leads to spurious typological effects (Prince, 2002; Pater, 2009b). So a lower bound for the weights is set at zero, but a uniform distribution on the semi-infinite


interval from zero to infinity has infinitesimal probability in any finite range, leading to undefined typological probabilities. Setting a finite upper bound on the interval (as done in the PyPhon software of Bane and Riggle (2009)) changes the shape of the weight space from semi-infinite to a finite hypercube. Figure 2 top shows a bounded uniform distribution and the corresponding weight space for two constraints, which have weights drawn from the same uniform distribution. The weight space is a rectangular volume, with length and width equal to the upper bound of the distribution, and the height is the constant probability. Within the square region below the arbitrary upper bound, any combination of weights is equally likely, while any combination outside that square is impossible. The most common weight distribution in maximum entropy language models is a normal (Gaussian) distribution centered around zero (Hoerl & Kennard, 1970; Chen & Rosenfeld, 1999), which represents the expectation that the weights have a finite variance and a mean of zero. There is some probability that constraint weights could be quite large, but the bias is towards weights that are small. The Maximum Entropy Phonotactic Learner (Hayes & Wilson, 2008) adapts the normal distribution to Maximum Entropy Grammar by removing the half of the distribution that falls on the negative side of zero, resulting in what is known as a half-normal distribution (Fig. 2 mid). In the weight space for two constraints, the probability mass forms a circular hill around the origin, where the contours of equal probability are concentric arcs. The surfaces of equal probability in the multidimensional space consist of sections of concentric spherical shells. Another common weight distribution in maximum entropy language models is a Laplace distribution (Tibshirani, 1996), which consists of two exponential distributions back to back, and it represents the expectation that the magnitudes of the weights have a finite sum. Compared to the normal distribution, the Laplace distribution has a bit less probability mass near median magnitudes, and more probability mass near zero and very large magnitudes. Because of this, learning a model using a Laplace distribution tends to reduce redundancy in the constraint set and increase generalization ability, by setting the weight of some of


Figure 2: Probability density functions in one dimension (left) and two dimensions (right) for the bounded uniform distribution (top), the half-normal distribution (mid), and exponential distribution (bottom). Lines of equal probability are projected onto the ceiling of the weight space plots. The median of each distribution is set to 10.


the redundant constraints to zero. If we remove the half of the distribution that falls below zero, the result is an exponential distribution (Fig. 2 bottom). The weight space for two constraints has the probability mass in a sharply peaked and gently sloping hill, where the contours of equal probability are parallel straight lines. The surfaces of equal probability in the multidimensional space are parallel planes. In Stochastic Optimality Theory (Boersma & Hayes, 2001) and Noisy Harmonic Grammar (Boersma & Pater, 2008), each constraint is associated with a specific weight, but at evaluation time, an additional normally distributed random noise is added to each constraint weight. This noise leads to variability in the constraint rankings and resulting candidate harmonies. The noise in the Harmonic Grammar ranking is formally equivalent to uncertainty in a maximum entropy model's posterior weight distribution, the probability distribution of the weights after learning is complete. This uncertainty is usually ignored in maximum entropy models, and the learning algorithm of Noisy Harmonic Grammar leaves the learning bias implicit, but the formal relationship indicates that Noisy Harmonic Grammar assumes something like a normally distributed weight prior. Because typologically spurious languages result when the weights are allowed to be negative (either with or without evaluation noise), Boersma and Pater argue that in order for Noisy Harmonic Grammar to be compatible with typology, the weights should be exponentiated, yielding what they call Noisy Exponential Harmonic Grammar. So while conventional non-variationist Harmonic Grammar calculates the harmony (of a candidate which has a set of violations {V_C} of constraints {C} with weights {w_C}) as

\[ H = \sum_{C} V_C \, w_C \]

this Noisy Exponential Harmonic Grammar calculates the harmony as

\[ H = \sum_{C} V_C \, e^{w_C + N(0,\sigma)} \]

where N(0, σ) represents the normally distributed random noise, with mean 0 and variance σ². The exponential can equivalently be written as e^{N(w_C, σ)}, which we can rename as a weight w′_C to restore the parallelism with conventional Harmonic Grammar weights and maximum entropy weights. The exponential of a normally distributed random variable is log-normally distributed, so Noisy Exponential Harmonic Grammar is assuming that the constraint weights w′_C = e^{N(w_C, σ)} are log-normally distributed. The uncertainty about the weight after learning is much less than the prior uncertainty, so the distribution variance parameter σ² for the weight prior probability will be much larger than the variance of the posterior probability, but it is unclear how much larger it would be. Figure 3 shows log-normally distributed weight probabilities for three values of the variance parameter, along with the corresponding weight spaces for two constraints. Under a log-normal prior probability, a weight equal to zero is impossible, but the highest probability region lies where both weights are close to zero. Each of these hypothesized weight distributions, when considered as a learning bias, predicts learning behavior that is biased in some ways and unbiased in other ways. The bounded uniform distribution would lead to unbiased learning if the bound is high and the Maximum Entropy Grammar which generated the learning data had all of its weights well below that bound. However, if some of the weights of the target language are beyond the bound, then the learner will be strongly biased and might be unable to reproduce the target language. The normal distribution is biased towards small weight values, but it is unbiased with respect to the ratios of weights. To see this, notice that the lines of equal probability in the two-dimensional weight space are concentric arcs around the origin. The learning bias tends to push the optimal solution uphill in the weight space, perpendicular to the lines of equal probability. In the case of the normal distribution, the direction of bias is always directly toward the origin, parallel to the lines of constant slope that characterize ratios of the weights. Furthermore, since the bounded uniform distribution is unbiased within the square, generalization only emerges if the constraints are restricted to be general, while each of the other distributions pushes the solution towards lower weight values, leading to


Figure 3: Probability density functions in one dimension (left) and two dimensions (right) for three log-normal probability distributions. Lines of equal probability are projected onto the ceiling of the weight space plots. The median of each function is set to 10, with the log-space variance σ² = 1 (top), σ² = 2 (mid), and σ² = 3 (bottom).


increased generalization in these distributions.
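As a concrete reference point, the sketch below (my own illustration, not taken from the works cited above) parameterizes the four candidate weight priors so that each has a median of 10, matching Figures 2 and 3, and draws samples from each. The specific scipy parameterizations are assumptions chosen only to reproduce that median; the printed tail probabilities give a rough sense of how much mass each prior places on very small and very large weights.

```python
import numpy as np
from scipy import stats

MEDIAN = 10.0  # matches the medians used in Figures 2 and 3

priors = {
    # uniform on [0, 20]: median 10, hard upper bound (the arbitrary wmax)
    "bounded uniform": stats.uniform(loc=0.0, scale=2 * MEDIAN),
    # half-normal with median 10: scale = median / Phi^{-1}(0.75)
    "half-normal": stats.halfnorm(scale=MEDIAN / stats.norm.ppf(0.75)),
    # exponential with median 10: scale = median / ln 2
    "exponential": stats.expon(scale=MEDIAN / np.log(2)),
    # log-normal with median 10 and log-space variance sigma^2 = 3
    "log-normal (s2=3)": stats.lognorm(s=np.sqrt(3.0), scale=MEDIAN),
}

rng = np.random.default_rng(0)
for name, dist in priors.items():
    w = dist.rvs(size=100_000, random_state=rng)
    print(f"{name:18s} median={np.median(w):5.2f}  "
          f"P(w < 1)={np.mean(w < 1.0):.3f}  P(w > 30)={np.mean(w > 30.0):.3f}")
```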

3.3.2 Learning bias and regularization

Some of these distributions are also biased with respect to the level of variation. The normal distribution learning bias is subtly biased towards higher levels of variation in the language, while the exponential distribution is unbiased in this respect, and the log-normal distribution is generally biased towards lower levels of variation. That is, it tends to regularize inconsistent variation, or convert meaningless variation to meaningful variation. This provides an empirical test of these distributions as learning biases. Regularizing behavior is a compromise between two rational strategies: frequency maximizing and frequency matching (Gaissmaier & Schooler, 2008). For example, if 80% of the people in a certain region call a certain kind of snack a ‘biscuit’, while the rest of the people call it a ‘cookie’, effective communication can be achieved 80% of the time by maximizing (always assuming it is a ‘biscuit’), whereas frequency matching (randomly guessing ‘biscuit’ 80% of the time) will lead to success only 80% × 80% + 20% × 20% = 68% of the time. So the simple strategy of selecting the most frequent possibility is advantageous for the immediate task. However, if there is structure to the variation (such as if the ‘biscuit’ people tend to pronounce their vowels differently), then someone using the maximizing strategy has no opportunity to test hypotheses about the use of ‘cookie’, while someone using the probability matching strategy can do so, learning the structured variation more quickly. Animals consistently use a maximizing strategy. On both linguistic and non-linguistic tasks, young children tend to maximize but shift towards matching as they mature (Ramscar & Gitcho, 2007). Adults generally use a frequency matching strategy but will shift towards a maximizing strategy either when distracted by another task or after significant training (Shanks, Tunney, & McCarthy, 2002). Interestingly, experiments with split brain patients (Wolford, Miller, & Gazzaniga, 2000; Miller & Valsangkar-Smyth, 2005) show asymmetries between left and right sides of the brain: for many sequence tasks, the left brain frequency


matches and the right brain maximizes, while for predicting sequences of faces (a task that is right lateralized), the left brain maximizes and the right brain frequency matches. Many examples of regularizing behavior in language processing are known, including pidginization and creolization, the morphological development of sign languages, children’s U-shaped learning of irregular English morphology, and some over-regularization among adult learners (see C. Hudson Kam and Newport (2009) for a review). Experimental results indicate that children do indeed favor regularizing more than adults do (C. Hudson Kam & Newport, 2009), that regularizing behavior in adults is subject to the same kinds of processing load effects seen in non-linguistic behavior (C. L. Hudson Kam & Chong, 2009), and that even when seeming to probability match, adults are still subtly regularizing, an effect that only becomes noticeable after accumulating over several ‘generations’ of learners (Reali & Griffiths, 2009). Reali and Griffiths (2009) also show that anti-regularizing behavior (with a bias towards making each option equally probable) can be produced for one task, namely flipping a coin. But in general and in language learning specifically, people show mildly regularizing behavior. To see the regularizing behavior of each of the weight distributions, consider a simple binary decision problem, where one candidate violates constraint C1 which has weight w1 , and the other candidate violates constraint C2 which has weight w2 . All other candidates have negligible probability due to violating some other high-weight constraint, and any other constraints violated by these two candidates have negligible weight. For example, this could be a faithful output and an unfaithful output, or two unfaithful outputs that violate different markedness constraints. The Maximum Entropy Grammar probabilities for this decision are determined by a single parameter t = w1 − w2 .
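Returning to the 'biscuit'/'cookie' example above, the expected success rates of the two strategies can be checked with a few lines of simulation. This is only an illustrative sketch; the 80/20 split is taken from the example, and the function and variable names are my own.

```python
import random

def simulate(strategy, p_biscuit=0.8, trials=100_000, seed=0):
    """Return the success rate of guessing what a randomly chosen speaker says."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        truth = "biscuit" if rng.random() < p_biscuit else "cookie"
        if strategy == "maximize":
            guess = "biscuit"                                   # always the majority form
        else:  # frequency matching
            guess = "biscuit" if rng.random() < p_biscuit else "cookie"
        hits += guess == truth
    return hits / trials

print("maximizing:", simulate("maximize"))   # ~0.80
print("matching:  ", simulate("match"))      # ~0.68 = 0.8*0.8 + 0.2*0.2
```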


(17) Equal violations binary decision

      weights          w1       w2
      /in/             C1       C2       H        ψ          p
      out1             *                 −w1      2^−w1      (1 + 2^(w1−w2))^−1
      out2                      *        −w2      2^−w2      (1 + 2^(w2−w1))^−1

      Z = 2^−w1 + 2^−w2

The probability p(out1) = 2^−w1 / (2^−w1 + 2^−w2) is simplified by dividing through by the numerator, and after making the substitution t = w1 − w2, we have p(out1) = (1 + 2^t)^−1 and p(out2) = (1 + 2^−t)^−1. The top left of Figure 4 shows lines of constant t (and thus equal output probability) passing diagonally through the normal distribution weight space, close to the origin where variation is non-negligible. The line t = 0 (where p(out1) = p(out2) = 0.5) passes directly down the hill of the weight space probabilities, and along that line the effect of the bias is uphill directly parallel to the line, generally reducing the magnitude of the weights, but not affecting t. In contrast, along the line t = 4 (where p(out1) = 1/(1 + 2^4) = 1/17), directly uphill points slightly towards smaller values of t. On the other side, at t = −4, the bias pushes towards larger values of t. As a result, the overall effect of the normal distribution weight space is to prefer solutions with higher levels of variation. This analysis extends to decisions with more complicated violation profiles. Suppose that candidate out1 violated C1 twice, and both candidates violate a third constraint C3. Just as in Optimality Theory, the constraint that both candidates violate becomes irrelevant to the calculation of optimality or even probability, and the new t parameter is 2w1 − w2.
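The tableau arithmetic can be reproduced directly. The sketch below is my own restatement of the calculation, using the base-2 convention of the ψ column above; it computes candidate probabilities from violation counts and weights and confirms that each decision depends only on the t parameter. The weight values are arbitrary.

```python
def maxent_probs(violations, weights):
    """Base-2 MaxEnt: psi(cand) = 2**H(cand), with H = -sum(V_C * w_C); p = psi / Z."""
    psis = []
    for cand in violations:
        harmony = -sum(v * w for v, w in zip(cand, weights))
        psis.append(2.0 ** harmony)
    z = sum(psis)
    return [psi / z for psi in psis]

# Equal-violation case (tableau 17): out1 violates C1 once, out2 violates C2 once.
w1, w2 = 6.0, 4.0
p1, p2 = maxent_probs([[1, 0], [0, 1]], [w1, w2])
print(p1, 1.0 / (1.0 + 2.0 ** (w1 - w2)))        # identical: p(out1) = (1 + 2**t)**-1

# Unequal-violation case (tableau 18 below): out1 violates C1 twice and C3 once,
# out2 violates C2 and C3; the shared C3 violation cancels out of the decision.
w3 = 9.0
q1, q2 = maxent_probs([[2, 0, 1], [0, 1, 1]], [w1, w2, w3])
print(q1, 1.0 / (1.0 + 2.0 ** (2 * w1 - w2)))
```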



Figure 4: Regularization and anti-regularization in the simple binary decision problem, for the equal violation case (left) and unequal violation case (right). The anti-regularizing half-normal prior (top) subtly favors solutions near t = 0, where each outcome has p = 0.5, while the exponential prior (mid) is unbiased with respect to level of variation, and the regularizing log-normal prior with σ² = 3 (bottom) subtly disfavors solutions near t = 0.

(18) Unequal violations binary decision

      weights          w1       w2       w3
      /in/             C1       C2       C3       H               ψ               p
      out1             **                *        −(2w1 + w3)     2^−(2w1+w3)     (1 + 2^(2w1−w2))^−1
      out2                      *        *        −(w2 + w3)      2^−(w2+w3)      (1 + 2^(w2−2w1))^−1

      Z = 2^−w3 (2^−2w1 + 2^−w2)

The weight of C3 falls out of the calculation because the ψ terms can be factored (e.g. ψ(out1) = 2^−2w1 · 2^−w3), and the 2^−w3 in the numerator of p cancels with the matching term in Z. After dividing the numerator through and making the substitution t = 2w1 − w2, we again have p(out1) = (1 + 2^t)^−1 and p(out2) = (1 + 2^−t)^−1. If instead of violating C1 twice, the first candidate violated a fourth constraint C4, the calculation would work out the same, because in a maximum entropy model, two constraints with the same violation profiles and learning bias are indistinguishable and receive the same weight, so w4 = w1, and H(out1) = −(2w1 + w3) again. The lines of constant t (Fig. 4 top right) run at a different angle and lie closer together, but the generalization remains the same, that the learning bias pushes the solution towards the t = 0 line. Under the exponential distribution (Fig. 4 mid), the lines of constant t all run directly downhill in the equal violation case, so the exponential distribution is unbiased with respect to the level of variation. In the unequal violation case, pushing the solution in the uphill direction is uniformly towards lower values of t (higher probabilities of out2), so the learning bias does change the variation level, but it is towards more variation if out1 is more probable in the learning data, towards less variation if out2 is more probable in the learning data, and unbiased overall. Under the log-normal distribution (Fig. 4 bottom), the learning bias tends to push the solution towards the axes, except if the solution lies very close to the axes. In the equal violation case, this tends to regularize variation, pushing solutions near t = 2 towards higher values of t and solutions near t = −2 towards lower values of t. In the unequal violation case, the log-normal distribution favors higher probabilities of out2,

regularizing more when it was already more probable, and regularizing less when it was less probable. On the basis of the regularizing or anti-regularizing behavior of these weight distributions, we can conclude that human language learning can be modeled better by the non-regularizing exponential distribution or regularizing log-normal distribution than by the anti-regularizing normal distribution. The bounded uniform distribution is also not a satisfactory model of learning, because it requires an arbitrary upper bound, where weight probabilities are relatively high up to that point and zero beyond that point.
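The geometric argument can also be checked numerically: project the gradient of each prior's log-density onto the direction in which t = w1 − w2 increases. A zero projection means the bias leaves the variation level alone. The sketch below is an illustration at one assumed point with assumed prior parameters, not a general proof.

```python
import numpy as np

def projection_on_t(grad_log_p, w):
    """Component of the learning-bias gradient along increasing t = w1 - w2."""
    t_dir = np.array([1.0, -1.0]) / np.sqrt(2.0)
    return float(np.dot(grad_log_p(w), t_dir))

sigma2, beta, mu, s2 = 25.0, 10.0, np.log(10.0), 3.0   # assumed prior parameters

half_normal = lambda w: -w / sigma2                     # d/dw of -w^2 / (2 sigma^2)
exponential = lambda w: -np.ones_like(w) / beta         # d/dw of -w / beta
log_normal  = lambda w: -(np.log(w) - mu) / (s2 * w) - 1.0 / w

w = np.array([6.0, 2.0])        # a point with t = 4, like the text's example
for name, grad in [("half-normal", half_normal),
                   ("exponential", exponential),
                   ("log-normal", log_normal)]:
    print(f"{name:12s} bias along t = {projection_on_t(grad, w):+.3f}")
# half-normal: negative here (pushes t back toward 0, i.e. toward more variation);
# exponential: exactly zero (no bias on the variation level);
# log-normal: positive over most of the space (regularizing), though the sign
#             can reverse very close to the axes, as noted above.
```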

3.3.3 Violation cost functions

Though the literature on Harmonic Grammar and even Maximum Entropy Grammar (Jäger & Rosenbach, 2006; Hayes & Wilson, 2008) has assumed that the cost scales linearly with the number of violations, the cost functions in maximum entropy models need not do so. In industrial applications, for computational efficiency the cost functions are generally taken to be binary indicator functions, equal to 1 when the related event occurred at least once, and 0 otherwise. When gradient cost is required, a scalar variable like word length or frequency may be divided into 'bins', with a separate constraint associated with each bin, or the cost function may be approximated via spline interpolation, with one constraint associated with each knot of the spline (D. Yu, Deng, & Acero, 2009), thus folding the fitting of the constraint weights and the fitting of the cost function into a single optimization problem. In statistical regression models, it is common to consider a power transformation, designed to alter the distribution of a scalar variable to make it more nearly normal (Box & Cox, 1964). Making the distribution of costs more like a normal distribution generally improves the accuracy and reliability of the regression model, especially when there are multiple variables in the model. For example, a power transformation could be used to convert violations associated with the mapping from input i to candidate o into the corresponding



Figure 5: The cost function as a power transformation

cost function.

\[ f_C(i \rightarrow o) \;=\; V_C^{(\lambda)} \;=\; \begin{cases} \dfrac{(V_C + 1)^{\lambda} - 1}{2^{\lambda} - 1} & \lambda \neq 0 \\[2ex] \log_2(V_C + 1) & \lambda = 0 \end{cases} \qquad (19) \]

Here, V_C is the number of violations of constraint C by the mapping from the input i to the candidate o, and f_C(i → o) is the corresponding cost. The power parameter λ characterizes how much the distribution is shifted. This power transformation is a family of functions that includes f_C(x) = V_C (when λ = 1), as well as f_C(x) ∝ √(V_C + 1) − 1 (when λ = 0.5) and f_C(x) = log₂(V_C + 1) (when λ = 0). In the limit as λ approaches −∞, f_C(x) approaches the binary indicator function: 1 for V_C > 0, and 0 otherwise (Fig. 5). Though some regression models of linguistic variation leave scalar variables like word frequency and phrase length untransformed, others use a logarithmic transform (λ = 0) for length variables (Bresnan, Cueni, Nikitina, & Baayen, 2007; Szmrecsanyi, 2005) or frequency


variables (Hay & Bresnan, 2006). A log transformation is often (though not always) useful for log-normally distributed variables, and many linguistic variables like phrase lengths and word frequencies have distributions that are approximately log-normally distributed. In the context of a population of language learners wandering the space of possible languages, the learners could develop a prior expectation that some values of λ would be more advantageous than other values, allowing them to learn language patterns more quickly.
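A direct implementation of the cost function in (19) is straightforward. The sketch below is only an illustration of the formula; the function name and the particular λ values printed are my own choices.

```python
import math

def cost(violations, lam):
    """Power-transformed violation cost, equation (19); lam = -inf gives a 0/1 indicator."""
    if lam == float("-inf"):
        return 1.0 if violations > 0 else 0.0
    if lam == 0:
        return math.log2(violations + 1)
    return ((violations + 1) ** lam - 1) / (2 ** lam - 1)

for lam in (1, 0.5, 0, float("-inf")):
    print(lam, [round(cost(v, lam), 3) for v in range(6)])
# lam = 1 is the conventional linear count; lam = 0 is logarithmic;
# as lam -> -inf the cost approaches a binary indicator function.
```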

4 Evaluating typological predictions

In the previous section, I showed that out of the hypothesized weight distributions, the uniform distribution shows neither generalization nor regularization behavior, the half-normal distribution shows generalization and anti-regularization, the exponential distribution shows generalization but not regularization, and the log-normal distribution shows generalization and regularization behavior. Since human learners show generalization and mild regularization, we should expect the exponential and log-normal distributions to more accurately reflect human language learning than the other two hypotheses. Because of the relationship between learning biases and typological distributions, we should expect the exponential and log-normal distributions to produce more accurate typological predictions as well. The predicted probabilities of each type of linguistic pattern are calculated by integrating the typologist’s weight prior over the portion of weight space that results in each linguistic pattern. The probability can easily be calculated exactly for low-dimensional problems with simple prior probability distributions, but the analytic calculation becomes intractable when the number of constraints is large or the prior probability is not explicitly integrable. We begin by calculating the typology analytically for the simplest cases, but the later typology probabilities are calculated by Monte Carlo integration. In each case, we find that the conventional assumptions of Harmonic Grammar predict more cumulativity than the alternative assumptions do, but even the typologies calculated under these alternative assumptions have


more cumulative effects than we seem to find in described languages.
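In code, the Monte Carlo procedure amounts to sampling weight vectors from the typologist's prior and counting how often each language type results. The following sketch shows only the mechanics; the prior parameters and the two-way classification rule are made-up illustrative assumptions, not one of the analyses in this paper.

```python
import numpy as np

def typology_probs(sample_weights, classify, n=100_000, seed=0):
    """Estimate P(language type) by Monte Carlo integration over the weight prior."""
    rng = np.random.default_rng(seed)
    counts = {}
    for _ in range(n):
        w = sample_weights(rng)
        lang = classify(w)
        counts[lang] = counts.get(lang, 0) + 1
    return {lang: c / n for lang, c in sorted(counts.items())}

# Illustrative prior: three exponentially distributed weights with median 10.
sample = lambda rng: rng.exponential(scale=10.0 / np.log(2), size=3)

# Illustrative classification: a 'cumulative' pattern wins when one constraint
# outweighs the other two combined; otherwise a 'non-cumulative' pattern wins.
classify = lambda w: "cumulative" if w[0] > w[1] + w[2] else "non-cumulative"

print(typology_probs(sample, classify))
```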

4.1 General-case neutralization

Recall that in the general-case neutralization typology, which demonstrated the typological strangeness of constraint cumulativity, there were three constraints involved, which I express here in their more abstract form.
• Faith (F): Specifications of feature α should match between underlying and surface forms
• NoAlpha (G): Feature α should not appear in the surface form
• Alpha/D (S): Only feature α should appear in context D
In the neutralization typology produced by the constraints F, S, and G, the inequalities given in (7, 8) divide up the space of possible weights into five volumes corresponding to each of the five language types. For example, the volume above the dividing planes in Fig. 6 corresponds to the faithful full contrast language, and the volume facing the viewer in Fig. 6 corresponds to the typologically unusual general-case neutralization language. Under the assumption of a uniform weight distribution, the probability of each language is easily computable. If each constraint's weight is drawn from the interval [0, wmax], the probability density function is $1/w_{max}^3$,

since the probability mass is evenly distributed over the cube. The probability of the

complete neutralization language (on the bottom right in Fig. 6) is

\[ \int_0^{w_{max}} \int_0^{W_G} \int_0^{W_G - W_S} \frac{1}{w_{max}^3} \, dW_F \, dW_S \, dW_G \;=\; \left[ \frac{W_G^3}{6\, w_{max}^3} \right]_0^{w_{max}} \;=\; 1/6
\]

Exploiting symmetries of the dividing planes, we can see that special neutralization and complementary distribution each occupy half the space (i.e. 1/12) that complete


Figure 6: Weight space volumes in the neutralization typology. The full contrast language (WF > WG > WS − WF ) lies above the dividing planes, the general-case neutralization language (WG > WF > |WS − WG |) is facing the viewer, complete neutralization (WG > WF + WS ) is bottom right, special neutralization (WS − WG > WF > WG ) on the far left, and complementary distribution (WS − WF > WG > WF ) at bottom left.


Table 1: Language probabilities in the neutralization typology

  Type                         Prior:   uniform   half-normal   exponential   log-normal
  full contrast                         41.7%     39.2%         37.5%         35.4%
  total neutralization                  16.7%     21.6%         25.0%         29.2%
  complementary distribution            8.3%      10.8%         12.5%         14.6%
  special neutralization                8.3%      10.8%         12.5%         14.6%
  general neutralization                25.0%     17.6%         12.5%         6.2%

neutralization does. Full contrast (5/12) plus special case neutralization together occupy half the cube, and general case neutralization is left with the remainder (1/4). Integration over alternative weight distributions is more difficult, but can be approximated numerically (Tab. 1).7 The alternative assumptions about the weight distribution tend to put more of the probability density around the three axes, with the result that languages near the axes have increased probability, and languages that are mostly not near one of the axes, especially general neutralization, have decreased probability. While the bounded uniform distribution predicts that general neutralization is relatively common (25%), the probability of general neutralization under the half-normal and exponential distributions is reduced, and under log-normal distributions with large variance (σ² ≫ 1), it is reduced still further. However, even considering that we do not have a systematic typological survey about the abundance of general neutralization patterns cross-linguistically, an estimate of 12.5% or 6.2% still seems high.

7 The log-normal probabilities are calculated with σ² = 3. Reducing the variance increases the general neutralization probabilities, and increasing the variance reduces them.
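The uniform-prior volumes above can be double-checked by simulation. This short sketch is mine, included only as a verification; wmax is normalized to 1, which does not affect the proportions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
wF, wG, wS = rng.uniform(0.0, 1.0, size=(3, n))   # wmax normalized to 1

# Complete neutralization occupies the region W_G > W_F + W_S.
print(np.mean(wG > wF + wS))   # ~ 0.167 = 1/6, matching the integral above
```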

4.2 Stress windows in default-to-same-side stress

Recall that the default-to-same-side stress typology demonstrated the typological strangeness of violation cumulativity. This typology is produced by the interaction of StressRight and StressHeavy, and there are actually three parameters: the weight of StressRight, the




Figure 7: Weight space regions in default-to-same-side stress typology, according to (a.) uniform weight distribution and λ = 1, and (b.) uniform weight distribution and λ = 0. Region A is consistent final stress, Region B has a two-syllable final window (like Yapese), Region C has a three-syllable final window (like Pirahã), Region D has a four-syllable window (unattested), and Region E has no window effect (like Aguacatec), assuming no words are longer than 10 syllables.

weight of StressHeavy, and the λ parameter of the cost function. But rather than integrate over a probability distribution in three dimensions, I take the simplifying assumption that the probability distribution of λ is focused in an infinitesimal region either around λ = 1 or around λ = 0. This allows us to visualize the typological space in just two dimensions (Fig. 7). I also leave aside the problem of words with more than one heavy syllable. The region where WStrR > WStrH (Region A) represents languages that always stress the final syllable. The region between WStrR = WStrH and 2(λ) WStrR = WStrH (Region B) represents languages that stress a heavy syllable if it is in the final two-syllable window, and otherwise stress the final syllable. The region between 2(λ) WStrR = WStrH and 3(λ) WStrR = WStrH (Region C) uses a final three-syllable window, and so forth. If word lengths are finite,


then there is a region (Region E) where languages would not show any window effect. That is, they have typical windowless default-to-same-side stress. The probability of languages that use an n-syllable window is the integral of the weight probability density over the corresponding area. When we assume that the weight distribution is uniform, this can be written as

\[ \int_0^{w_{max}} \int_{W_{StrH}/(n+1)^{(\lambda)}}^{W_{StrH}/n^{(\lambda)}} \frac{1}{w_{max}^2} \, dW_{StrR} \, dW_{StrH} \;=\; \left[ \frac{W_{StrH}^2}{2\, w_{max}^2} \right]_0^{w_{max}} \left( \frac{1}{n^{(\lambda)}} - \frac{1}{(n+1)^{(\lambda)}} \right) \;=\; \frac{1}{2} \left( \frac{1}{n^{(\lambda)}} - \frac{1}{(n+1)^{(\lambda)}} \right)
\]

For λ = 1, this reduces to

\[ \frac{1}{2n(n+1)} \;=\; \frac{1}{4}, \frac{1}{12}, \frac{1}{24}, \frac{1}{40}, \ldots \;\approx\; 0.25,\ 0.083,\ 0.042,\ 0.025, \ldots
\]

For λ = 0, this reduces to

\[ \frac{\log_2(n+2) - \log_2(n+1)}{2 \log_2(n+1) \log_2(n+2)} \;=\; \frac{\log_2 3 - 1}{2 \log_2 3}, \frac{2 - \log_2 3}{4 \log_2 3}, \frac{\log_2 5 - 2}{4 \log_2 5}, \ldots \;\approx\; 0.185,\ 0.065,\ 0.035,\ 0.022, \ldots
\]
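These closed-form values can be cross-checked by sampling. The sketch below is an illustrative verification of my own (wmax normalized to 1); it classifies each sampled grammar by the size of its stress window under λ = 1 and λ = 0.

```python
import numpy as np

def window_size(w_str_r, w_str_h, lam, max_len=10):
    """Window size m: final stress is m = 1; m > max_len counts as 'no window effect'."""
    if w_str_r > w_str_h:                               # Region A: consistent final stress
        return 1
    ratio = w_str_h / w_str_r
    for m in range(2, max_len + 1):
        boundary = m if lam == 1 else np.log2(m + 1)    # m^(lam) for lam in {1, 0}
        if ratio < boundary:
            return m
    return max_len + 1                                  # Region E: effectively windowless

rng = np.random.default_rng(0)
samples = rng.uniform(0.0, 1.0, size=(100_000, 2))
for lam in (1, 0):
    sizes = [window_size(r, h, lam) for r, h in samples]
    freqs = {m: round(sizes.count(m) / len(sizes), 3) for m in (1, 2, 3, 4)}
    print(f"lambda={lam}:", freqs)
# lambda=1: roughly 0.50, 0.25, 0.083, 0.042 for window sizes 1-4,
# lambda=0: roughly 0.50, 0.185, 0.065, 0.035, matching the series above.
```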

With the conventional λ = 1 assumption, languages with syllable windows of 2 syllables or more are already relatively rare, and the λ = 0 assumption taken from variationist studies reduces the probability a bit further. Compared to the weight space defined by the bounded uniform distribution, the exponential and log-normal distributions also reduce the probability of the pathological stress systems, by concentrating the probability mass near the axes. We can see this conceptually from the shape of the probability within the weight space. For example, the exponential



Figure 8: Default-to-same-side weight space with lines of equal probability under the exponential weight distribution. The shaded region (three-syllable windowed stress) has much of its area away from the origin, where the exponential distribution weight probabilities are low.

distribution extends to infinity, but the lines of uniform density (comparable to topographic elevation lines or dialect map isoglosses) run diagonally (Fig. 8), effectively making the space triangular rather than square. Just as the center diagonal line (at p = 0.0012) cuts off half the area for consistent final stress (Region A) but more than half the area for the languages with windows of length two (Region B), or three (Region C), the exponential distribution overall favors languages with consistent final stress or very large windows (Tab. 2). The log-normal distributions have about the same effect, but more so, while the half-normal distribution, with its circular profile, only decreases the probability of languages with a 2-syllable window.

Table 2: Language probabilities in the default-to-same-side stress typology

  Type             Prior:   uniform          half-normal      exponential      log-normal
                   λ:       1        0       1        0       1        0       1        0
  A) Final stress           50.0%    50.0%   50.0%    50.0%   50.0%    50.0%   50.0%    50.0%
  B) 2-syllable window      25.0%    18.5%   20.4%    14.2%   16.6%    11.4%   11.2%    7.5%
  C) 3-syllable window      8.3%     6.5%    9.0%     6.3%    8.3%     5.3%    6.2%     3.7%
  D) 4-syllable window      4.2%     3.5%    4.9%     3.6%    5.1%     3.3%    4.1%     2.3%
  E) n > 10 window          5.0%     14.5%   6.4%     18.0%   9.1%     22.4%   17.3%    30.5%

Comparisons between these probabilities and typological data are not straightforward because quite a few more constraints are hypothesized to be involved in quantity sensitive stress, and many of those constraints would interact with the two constraints in this toy analysis, influencing the relative probabilities of these language types. However, typological data is a weak confirmation that a logarithmic cost function is more appropriate than a linear cost function for the distance-based StressRight constraint. Consistent final stress (Type A) and windowless default-to-same-side stress (Type E) are relatively common, while default-to-same-side stress patterns with two-syllable (Type B) or especially three-syllable (Type C) windows are rare, and languages with larger windows (e.g. Type D) are unattested. Even with a linear cost function, languages with large windows are predicted to be less common than languages with small windows, but the logarithmic cost function (λ = 0) produces probabilities that are more in line with observed languages. For example, under the exponential weight prior with a linear cost function, languages with three-syllable windows (Type C) are predicted to be almost as common as languages with windowless default-tosame-side stress (Type E), whereas with the logarithmic cost function and the same weight prior, languages with windowless stress are predicted to be four times as common as threesyllable window languages. This does not resolve the issue of the absence of large-window stress systems (e.g. Type D), but it is a step towards explaining it.


4.3 Jakobson's syllable structure typology

Bane and Riggle (2009) show that the Optimality Theory factorial typology based on the original analysis of Prince and Smolensky (1993/2004) produces syllable inventories that are consistent with Jakobson's generalization about syllable structures (20),

(20) Prince and Smolensky syllable inventories
  a) Minimal (Onset required, Coda banned): {CV}
  b) Codaless (Onset optional, Coda banned): {CV, V}
  c) Onsetful (Onset required, Coda optional): {CV, CVC}
  d) All good (Onset optional, Coda optional): {CV, CVC, V, VC}

while the Harmonic Grammar typology generated by the same constraints includes some language types that have the syllable inventory {CV, CVC, VC}, which is inconsistent with Jakobson's generalization. The five constraints in the analysis are:
• Onset: Syllables should have an initial consonant
• NoCoda: Syllables should not end with a consonant
• DepV: Any vowel in the output should correspond to a vowel in the input
• DepC: Any consonant in the output should correspond to a consonant in the input
• Max: Any segment in an input form should correspond to a segment in the output
Bane and Riggle use all three-segment sequences of {C,V} as inputs. The outputs may consist of multiple syllables, a single syllable, or nothing, if all segments are deleted, but each syllable in the output is required to have exactly one V and no CC sequences. The Optimality Theory factorial typology produces 12 language types distinguished by their syllable inventories plus the strategies used to resolve divergence from the syllable template.

Table 3: Language types and total probability assigned to basic syllable inventories, according to Optimality Theory r-volume or Harmonic Grammar uniform prior weight space

                                          Optimality Theory       Harmonic Grammar
  Inventory   Syllables in Output         Types   Probability     Types   Probability
  Minimal     ∅, CV                       3       33.3%           3       33.3%
              CV                          1       13.3%           1       13.3%
  Codaless    ∅, CV, V                    1       6.7%            1       6.7%
              CV, V                       1       13.3%           1       13.3%
  Onsetful    ∅, CV, CVC                  3       12.5%           9       11.0%
              CV, CVC                     1       7.5%            1       7.5%
  All good    ∅, CV, CVC, V, VC           1       5.8%            2       5.8%
              CV, CVC, V, VC              1       7.5%            1       7.5%
  No V        ∅, CV, CVC, VC              0       0.0%            4       1.5%

The corresponding Harmonic Grammar typology has 23 distinct language types, which have the same possible syllable inventories as the Optimality Theory typology, except for four language types that have non-null outputs composed of {CV, CVC, VC} syllables, plus the null output for some inputs (Tab. 3). This inventory has the peculiar characteristic of no V syllables even though there are VC syllables. In such a language, onsets are optional only when there is a coda. The probabilities for these inventory types, calculated with Bane and Riggle’s software, indicate that the Optimality Theory typology and the Harmonic Grammar typology are not as different as the counts of language types suggest. The probability that Harmonic Grammar assigns to the ‘No V’ languages is just 1.5%. This 1.5% is taken away from the probability of the {∅, CV, CVC} languages, and all the other output syllable sets have the same probability in the two frameworks. The ‘No V’ languages have such low probability because they depend on the violation cumulativity effect discussed in section 2, repeated here: These languages delete a solitary vowel, because the cost of deletion is less than the cost of violating Onset or repairing the violation by epenthesis, but they faithfully keep a VC syllable because two deletions is more costly than the Onset violation.


(21) Violation cumulativity for 'no V' languages

      weights        DepC = 40    Onset = 30    Max = 20

      /V/        DepC    Onset    Max     H
       V                  *                −30
       CV        *                         −40
     ☞ ∅                           *       −20

      /VC/       DepC    Onset    Max     H
     ☞ VC                 *                −30
       CVC       *                         −40
       ∅                           **      −40

−40

So these languages only occur when wDepC > wOnset and 2wM ax > wOnset > wM ax . In addition, wN oCoda must be less than both wM ax and wDepV , to prevent coda repair. These restrictions reduce these languages to a small probability. Like for general neutralization and default-to-same-side stress windows, Harmonic Grammar with regularizing weight priors produce typological probabilities that are between those produced by Optimality Theory and the uniform weight prior. However, since the predictions of Optimality Theory and Harmonic Grammar with the uniform prior are not very different, the differences of among priors is not very noticeable. For example, the exponential distribution predicts 11.4% {∅, CV, CVC} languages and 1.1% {∅, CV, CVC, VC} languages, scarcely different than what the uniform prior predicts. However, typological data indicates than neither Optimality Theory nor Harmonic Grammar, with these constraints, is producing a reasonable estimate of cross-linguistic variation. Cross-linguistically, the most common inventory types are the ‘all good’ and ‘onsetful’ types. In the University of Leiden StressTyp database8 , the syllable templates are provided for 252 languages (see Appendix B). Over half of these (53%) have syllable inventories of the ‘all good’ type, while 27% have ‘onsetful’ inventories, 15% have ‘codaless’ inventories, and just 8

http://www.unileiden.net/stresstyp/index.htm

55

3% have ‘minimal’ inventories. Finally, three languages (Cayuga, Guguyimidjir and Sentani) have {CV, CVC, V} inventories (banning the ‘worst of the worst’ VC syllables, which violate both Onset and NoCoda), and one language (Arrernte) only has VC syllables. The sample is not balanced genetically or geographically, but it is large enough and diverse enough that it should still be indicative of relative abundance of syllable inventories. While the observed abundance of inventory types is:

all good > onsetful > codaless > minimal > no VC > VC only

Harmonic Grammar predicts that the abundance of inventory types would be:

minimal > codaless > onsetful > all good > No V

and Optimality Theory r-volume predicts that the abundance would be:

minimal >

codaless

> all good

onsetful That is, the predictions are almost the reverse of observed probabilities. The observed typology of syllable inventories indicates at least three facts that a probabilistic model of typology should be able to represent (22)

  a) all good > minimal: even though violations of Onset and NoCoda are avoided cross-linguistically, both are quite common in syllable inventories.
  b) onsetful > codaless: violations of NoCoda are cross-linguistically more probable than violations of Onset.
  c) no VC > No V, etc.: banning only the worst of the worst is a possible syllable inventory, while banning any other one syllable type is much less probable, if possible at all.

Table 4: Language types and total probability assigned to basic syllable inventories, with the constraint set supplemented by Faith, according to Optimality Theory r-volume or Harmonic Grammar uniform prior weight space

                                          Optimality Theory       Harmonic Grammar
  Inventory   Syllables in Output         Types   Probability     Types   Probability
  Minimal     ∅, CV                       3       15.3%           3       6.1%
              CV                          1       6.9%            1       3.3%
  Codaless    ∅, CV, V                    1       8.6%            1       7.2%
              CV, V                       1       10.8%           1       8.4%
  Onsetful    ∅, CV, CVC                  5       12.2%           9       5.3%
              CV, CVC                     1       7.2%            1       4.6%
  All good    ∅, CV, CVC, V, VC           2       18.9%           2       32.0%
              CV, CVC, V, VC              1       20.0%           1       32.6%
  No V        ∅, CV, CVC, VC              0       0.0%            4       0.5%

Arrernte, the only attested language with VC syllables only, can reasonably be left outside the domain of a predictive model of syllable typology, because it is a single data point that departs so radically from the trends of the other languages. An attempt to incorporate it into the model would necessarily be tailoring the model to fit the peculiarities of the sample. As such, it is simply a reminder that the model is only a model, and not a comprehensive theory of syllable structure. Observations 22a and 22b can be understood either as statements about the number of markedness versus faithfulness constraints involved in the syllable structure violations and their repairs, or about (probabilistic) universal hierarchies among the constraints involved. For example, the 'all good' inventory can be made more probable than the 'minimal' inventory by simply including redundant faithfulness constraints. If we supplement the previous constraint set with a generic Faith constraint that penalizes any epenthesis or deletion, the probabilities from either Optimality Theory or Harmonic Grammar are much more in line with observed frequencies (Tab. 4). The asymmetry between the onset condition and the coda condition could emerge if there were an asymmetric markedness constraint (an additional constraint that penalized onset violations but not coda violations), or if coda repairs

Table 5: Language types and total probability assigned to basic syllable inventories, with the constraint set {Faith, Max, Dep, DepV, Onset, Coda}

                                          Optimality Theory       Harmonic Grammar
  Inventory   Syllables in Output         Types   Probability     Types   Probability
  Minimal     ∅, CV                       2       15.0%           2       6.5%
              CV                          1       6.1%            1       1.5%
  Codaless    ∅, CV, V                    1       11.7%           1       9.3%
              CV, V                       1       5.5%            1       1.0%
  Onsetful    ∅, CV, CVC                  3       13.3%           5       7.7%
              CV, CVC                     1       7.2%            1       3.5%
  All good    ∅, CV, CVC, V, VC           2       26.7%           2       59.2%
              CV, CVC, V, VC              1       14.4%           1       10.7%
  No V        ∅, CV, CVC, VC              0       0.0%            2       0.6%

(consonant deletion and vowel insertion) were penalized more than onset repairs (consonant insertion and vowel deletion). One solution is to replace the DepC constraint with a general Dep constraint, so that vowel insertion is always more costly than consonant insertion. This leads to the probabilities in Table 5, which predicts the order ‘All good’ > ‘Onsetful’ > ‘Codaless’ > ‘Minimal’ > ‘No V’, though the probabilities for ‘Onsetful’, ‘Codaless’, and ‘Minimal’ are barely different. Finally, observation 22c, specifying that ‘No VC’ is a possible language, while ‘No V’ is not, is a ‘banning only the worst of the worst’ generalization, characteristic of constraint cumulativity, so it should be easily representable in the Harmonic Grammar. In fact, if we require all inputs to have non-null outputs (such as by a universally dominant ExpressMorph constraint), this eliminates the ‘No V’ languages, requiring them to faithfully keep V syllables or repair them by epenthesis. In addition, lone VC syllables can no longer be completely deleted but can be converted to V syllables by deleting the coda. As a result, ‘No VC’ languages are possible though improbable (Tab. 6). In order to maintain the ordering relation between Codaless and Minimal languages, as well as to make the constraint set more symmetric, the constraint MaxC has also been added to the constraint set. As previously, 58

Table 6: Language types and total probability assigned to basic syllable inventories, with the constraint set {Faith, Max, MaxC, Dep, DepV, Onset, Coda}, and the null output disallowed

                                          Optimality Theory       Harmonic Grammar
  Inventory   Syllables in Output         Types   Probability     Types   Uniform   Exponential
  Minimal     CV                          5       14.2%           5       3.4%      7.1%
  Codaless    CV, V                       4       19.1%           5       4.1%      11.5%
  Onsetful    CV, CVC                     9       19.1%           13      13.3%     17.9%
  All good    CV, CVC, V, VC              6       47.6%           10      79.1%     63.5%
  BOWOW       CV, CVC, V                  0       0.0%            3       0.07%     0.05%

the probabilities under the exponential distribution are in between the r-volume probabilities and the Harmonic Grammar probabilities under the uniform distribution.

5 Conclusions

In §3, I showed why we should expect the exponential and log-normal distributions to produce typological probabilities more in line with observations, and §4 showed that indeed, that was the case. The typologies of windowed default-to-same-side stress and syllable structure also suggest that the best analysis for Harmonic Grammar typologies might not be the same as for Optimality Theory, since the cumulativity effects in attested language patterns require additional constraints in Optimality Theory, but generally don't in Harmonic Grammar. A major open question is how and why observed typological frequencies still don't align with the predicted probabilities under regularizing weight spaces. To what extent do these represent systematic differences between the overall shape of the learning bias and the overall shape of the typological prior, versus differences in particular classes of constraints, due to physiology or social pressures that would not manifest the same in learning bias as in typology? A related question is: in what domains are cumulativity effects common? The link between regularization and non-cumulativity does explain why cumulativity seems to be more common in variable phenomena, but there's still a lack of explanation about the

difference between places we do or don’t see cumulativity in non-variable phenomena. An important first step in that direction is typological surveys of cumulativity effects. Another important question is the ways that utterance selection and the lexicon can influence cumulativity. Wedel (2007) shows how generalization over the lexicon in an exemplar model can produce a general pattern of constraints not interacting cumulatively, but in exceptional circumstances cumulativity still occurring. Generalizations in the exemplar model ignore some exceptions within a category, but the model is strongly influenced by a large number of exemplars that behave similarly. It predicts that strict domination is violated when a majority of examples in the lexicon have the constraints in conflict, whereas strict domination does hold true when constraints conflict less often in the lexicon. Melding the strengths of exemplar models and Harmonic Grammar is by no means trivial, but once we admit lexically indexed constraints into the grammar it at least becomes tractable.

References Anderson, S. (2000). Towards an optimal account of second position phenomena. In J. Dekkers, F. van der Leeuw, & J. van de Weijer (Eds.), Optimality theory: Phonology, syntax, and acquisition (pp. 302–333). Oxford University Press. Baker, G. K. (2004). Palatal phenomena in Spanish phonology. Unpublished doctoral dissertation, University of Florida. Bakovic, E. (2000). Harmony, dominance and control. Unpublished doctoral dissertation, Rutgers University. Bane, M., & Riggle, J. (2008). Three correlates of the typological frequency of quantityinsensitive stress systems. In Proceedings of the 10th ACL SIGMORPHON. Bane, M., & Riggle, J. (2009). The typological consequences of weighted constraints. In Proceedings of CLS 45. Bateman, N. (2007). A crosslinguistic investigation of palatalization. Unpublished doctoral


dissertation, UC San Diego. Beckman, J. (1998). Positional faithfulness. Unpublished doctoral dissertation, University of Massachusetts Amherst. Benua, L. (2000). Phonological relations between words. Psychology Press. Berger, A. L., Della Pietra, S. A., & Della Pietra, V. J. (1996). A maximum entropy approach to natural language processing. Computational Linguistics, 22 (1), 39–71. Biber, D. (1995). Dimensions of register variation: a cross-linguistic comparison. Cambridge University Press. Biber, D., Connor, U., & Upton, T. A. (2007). Discourse on the move: Using corpus analysis to describe discourse structure. John Benjamins. Blevins, J. (2006). Word-based morphology. Linguistics, 42 , 531–573. Boersma, P., & Hayes, B. (2001). Empirical tests of the gradual learning algorithm. Linguistic Inquiry, 32 , 45–86. Boersma, P., & Pater, J. (2008). Convergence properties of a gradual learning algorithm for harmonic grammar. ROA-970. Boroditsky, L., & Ramscar, M. (2002). The roles of body and mind in abstract thought. Psychological Science, 13 (2), 185–189. Box, G. E. P., & Cox, D. R. (1964). An analysis of transformations. Journal of the Royal Statistical Society B , 26 (2), 211–252. Breen, G., & Pensalfini, R. (1999). Arrernte: A language with no syllable onsets. Linguistic Inquiry, 30 (1), 1–25. Bresnan, J., Cueni, A., Nikitina, T., & Baayen, R. H. (2007). Predicting the dative alternation. In Cognitive foundations of interpretation. Amsterdam: Royal Netherlands Academy of Science. Cedergren, H. J., & Sankoff, D. (1974). Variable rules: Performance as a statistical reflection of competence. Language, 50 (2), 333–355. Chen, S. F., & Rosenfeld, R. (1999). A Gaussian prior for smoothing maximum entropy

61

models (Tech. Rep. No. CMU-CS-99-108). Carnegie Mellon Unviersity. Chomsky, N. (1965). Aspects of the theory of syntax. MIT Press. Christiansen, M. H., Chater, N., & Reali, F. (2009). The biological and cultural foundations of language. Communicative & Integrative Biology, 2 (3), 221-222. Clements, G. N., & Keyser, S. J. (1983). CV phonology: A generative theory of syllable structure. The MIT Press. Coetzee, A., & Kawahara, S. (2010). Frequency and other biases in phonological variation. ROA-1098. Creel, S. C., Aslin, R. N., & Tanenhaus, M. K. (2008). Heeding the voice of experience: The role of talker variation in lexical access. Cognition, 106 , 633–664. Cristófaro-Silva, T. (2003). Palatalisation in Brazilian Portuguese. In S. Ploch (Ed.), Living on the edge: 28 papers in honour of Jonathan Kaye (Vol. 62). Mouton De Gruyter. Davis, M. H., & Gaskell, M. G. (2009). A complementary systems account of word learning: neural and behavioural evidence. Philosophical Transactions of the Royal Society B , 364 , 3773–3800. Dik, S. C. (1997). The theory of functional grammar (K. Hengeveld, Ed.). Walter de Gruyter. Eckert, P., & Rickford, J. (Eds.). (2001). Style and sociolinguistic variation. Cambridge University Press. Eisner, J. (2000). Review of Kager “Optimality Theory”. Computational Linguistics, 26 (2), 286–290. Elman, J. L., Bates, E. A., Johnson, M. H., & Karmiloff-Smith, A. (1996). Rethinking innateness (J. L. Elman, Ed.). MIT Press. Evans, N., & Levinson, S. C. (2009). The myth of language universals: Language diversity and its importance for cognitive science. Behavioral and Brain Sciences, 32 , 429–492. Everett, D. (1988). On metrical constituent structure in Pirahã phonology. Natural Language and Linguistic Theory, 6 (2), 207–246.

62

Gahl, S., & Yu, A. C. L. (2006). Introduction to the special issue on exemplar-based models in linguistics. The Linguistic Review , 23 , 213–216. Gaissmaier, W., & Schooler, L. J. (2008). The smart potential behind probability matching. Cognition, 109 , 416–422. Goldinger, S. D. (2007). A complementary-systems approach to abstract and episodic speech perception. In Proceedings of ICPhS XVI. Goldrick, M. (2007). Constraint interaction: A lingua franca for stochastic theories of language. In C. T. Schütze & V. S. Ferreira (Eds.), MIT working papers in linguistics (Vol. 53, pp. 95–114). Goldsmith, J. (1993). Harmonic phonology. In J. Goldsmith (Ed.), The last phonological rule: reflections on constraints and derivations. University of Chicago Press. Goldwater, S., & Johnson, M. (2003). Learning OT constraint rankings using a maximum entropy model. In Proceedings of the workshop on variation within optimality theory. Gordon, M. (2002). A factorial typology of quantity-insensitive stress. Natural Language and Linguistic Theory, 20 , 491–552. Griffiths, T. L., & Kalish, M. L. (2007). Language evolution by iterated learning with Bayesian agents. Cognitive Science, 33 (3), 441-480. Haspelmath, M. (2007). Pre-established categories don’t exist: Consequences for language description and typology. Linguistic Typology, 11 (1), 119–132. Hay, J., & Bresnan, J. (2006). Spoken syntax: The phonetics of giving a hand in New Zealand English. The Linguistic Review , 23 (3), 321–349. Hayes, B. (1995). Metrical stress theory: principles and case studies. University of Chicago Press. Hayes, B., & Wilson, C. (2008). A maximum entropy model of phonotactics and phonotactic learning. Linguistic Inquiry. Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12 (1), 55–67.

63

Hopper, P., & Traugott, E. (2003). Grammaticalization. Cambridge University Press. Hudson Kam, C., & Newport, A. (2009). Getting it right by getting it wrong: When learners change languages. Cognitive Psychology, 59 , 30-66. Hudson Kam, C. L., & Chong, A. (2009). Investigating the cause of language regularization in adults: memory constraints or learning effects? Journal of Experimental Psychology: Learning, Memory, and Cognition, 35 , 815–821. Ito, J., & Mester, A. (1998). Markedness and word structure: OCP effects in Japanese. ROA-255. Ito, J., & Mester, A. (2003). Systemic markedness and faithfulness. In Proceedings of CLS 39. Jäger, G. (2003). Learning constraint sub-hierarchies: The bidirectional gradual learning algorithm. In R. Blutner & H. Zeevat (Eds.), Optimality theory and pragmatics. Palgrave Macmillan, Houndmills. Jäger, G. (2007). Maximum entropy models and stochastic optimality theory. In A. Zaenen, J. Simpson, T. H. King, J. Grimshaw, J. Maling, & C. Manning (Eds.), Architectures, rules, and preferences. variations on themes by Joan W. Bresnan (pp. 467–479). CSLI Publications. Jäger, G., & Rosenbach, A. (2006). The winner takes it all—almost: cumulativity in grammatical variation. Linguistics, 44 (5), 937–971. Jakobson, R. (1962/1971). Selected writings: Phonological studies (Vol. 1; S. Rudy, Ed.). The Hague: Mouton. Jarosz, G. (2006). Richness of the base and probalistic unsupervised learning in optimality theory. In Proceedings of 8th ACL SIGPHON. Jensen, J. T. (1977). Yapese reference grammar. Honolulu: University of Hawaii Press. Jesney, K., & Tessier, A.-M. (2010). Biases in harmonic grammar: the road to restrictive learning. Natural Language and Linguistic Theory, 28 (4). Johnson, K. (2006). Resonance in an exemplar-based lexicon: The emergence of social

64

identity and phonology. Journal of Phonetics, 34 , 485–499. Kager, R. (2006). Lexical irregularity and the typology of contrast. In The nature of the word: Essays in honor of Paul Kiparsky. MIT Press. Kim, G.-R. (2002). Korean palatalization in optimality theory: against the strict parallelism. Studies in Phonetics, Phonology and Morphology, 8 (1), 1–15. Kindermann, R., & Snell, J. L. (1980). Markov random fields and their applications. American Mathematical Society. Kirby, S. (1999). Function, selection, and innateness: The emergence of language universals. Oxford University Press. Kirby, S., Dowman, M., & Griffiths, T. L. (2007). Innateness and culture in the evolution of language. PNAS , 104 (12), 5241–5245. Kirby, S., Smith, K., & Brighton, H. (2004). From UG to universals. Studies in Language, 28 (3), 587-607. Lanham, L. W. (1955). A study of Gitonga of Inhambane. Johanneburg: Witwatersrant University Press. Legendre, G., Sorace, A., & Smolensky, P. (2006). The harmonic mind: From neural computation to optimality-theoretic grammar. In P. Smolensky & G. Legendre (Eds.), (Vol. 2, pp. 339–399). MIT Press. Levelt, C. C., Schiller, N. O., & Levelt, W. J. (1999). The acquisition of syllable types. Language Acquisition, 8 (3), 237–264. Lubowicz, A. (2005). Locality of conjunction. In J. Alderete (Ed.), Proceedings of WCCFL 24. McCarthy, J. J. (2002). A thematic guide to optimality theory. Cambridge University Press. McCarthy, J. J. (2003). OT constraints are categorical. Phonology, 20 (1), 75–138. McCarthy, J. J., & Prince, A. (1986/1996). Prosodic morphology (Technical Report No. 32). Rutgers University Center for Cognitive Science. McCarthy, J. J., & Prince, A. (1995). Faithfulness and reduplicative identity. UMass

65

Occasional Papers in Linguistics/ROA-60. Miller, M. B., & Valsangkar-Smyth, M. (2005). Probability matching in the right hemisphere. Brain and Cognition, 57 , 165–167. Moreton, E., & Smolensky, P. (2002). Typological consequences of local constraint conjunction. In L. Mikkelson & C. Potts (Eds.), Proceedings of WCCFL 21 (pp. 306–319). Orgun, C. O. (1996). Sign-based morphology and phonology. Unpublished doctoral dissertation, UC Berkeley. Pater, J. (2008). Gradual learning and convergence. Linguistic Inquiry, 39 , 334–345. Pater, J. (2009a). Morpheme-specific phonology: Constraint indexation and inconsistency resolution. In S. Parker (Ed.), Phonological argumentation: Essays on evidence and motivation. London: Equinox. Pater, J. (2009b). Weighted constraints in generative linguistics. Cognitive Science, 33 , 999–1035. Pierrehumbert, J. B. (2002). Word-specific phonetics. In C. Gussenhoven & N. Warner (Eds.), Laboratory phonology (Vol. 7, pp. 101–139). Mouton de Gruyter. Pinker, S., & Bloom, P. (1990). Natural language and natural selection. Behavioral and Brain Sciences, 13 (4), 707–784. Prince, A. (2002). Anything goes. In T. Honma, M. Okazaki, T. Tabata, & S. ichi Tanaka (Eds.), A new century of phonology and phonological theory: A festschrift for Professor Shosuke Haraguchi on the occasion of his sixtieth birthday (pp. 66–90). Tokyo: Kaitakusha. Prince, A., & Smolensky, P. (1993/2004). Optimality theory: Constraint interaction in generative grammar. ROA-537. Ramscar, M., & Gitcho, N. (2007). Developmental change and the nature of learning in childhood. Trends in Cognitive Sciences, 11 , 274–279. Reali, F., & Griffiths, T. L. (2009). The evolution of frequency distributions: Relating regularization to inductive biases through iterated learning. Cognition, 111 , 317-328.

66

Rubach, J. (2003). Polish palatalization in derivational optimality theory. Lingua, 113 , 197–237. Rubach, J. (2006). Mid vowel fronting in Ukrainian. Phonology, 22 , 1–36. Shanks, D. R., Tunney, R. J., & McCarthy, J. D. (2002). A re-examination of probability matching and rational choice. Journal of Behavioral Decision Making, 15 , 233–250. Smith, J. L. (2000). Prominence, augmentation, and neutralization in phonology. In L. Conathan, J. Good, D. Kavitskaya, A. Wulf, & A. Yu (Eds.), Proceedings of BLS 26 (pp. 247–257). Smolensky, P. (1995). On the internal structure of the constraint component Con of UG. Handout of talk at UCLA, April 7. ROA-86. Smolensky, P. (2006). The harmonic mind: From neural computation to optimality-theoretic grammar. In P. Smolensky & G. Legendre (Eds.), (Vol. 2, pp. 27–160). MIT Press. Sorace, A., & Keller, F. (2005). Gradience in linguistic data. Lingua, 115 , 1497–1524. Szmrecsanyi, B. (2005). Language users as creatures of habit: A corpus-based analysis of persistence in spoken English. Corpus Linguistics and Linguistic Theory, 1 (1), 113– 150. Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistics Society B , 58 (1), 267–288. Tily, H., Gahl, S., Arnon, I., Snider, N., Kothari, A., & Bresnan, J. (2009). Syntactic probabilities affect pronunciation variation in spontaneous speech. Language and Cognition, 1 (2), 147–165. Walker, R. (1997). Mongolian stress, licensing, and factorial typology. ROA-172. Wedel, A. B. (2007). Feedback and regularity in the lexicon. Phonology, 24 , 147–185. Wilson, C. (2006). Learning phonology with substantive bias: An experimental and computational study of velar palatalization. Cognitive Science, 30 , 945–982. Wolford, G., Miller, M. B., & Gazzaniga, M. (2000). The left hemisphere’s role in hypotheses formation. The Journal of Neuroscience, 20 (RC64), 1–4.

67

Yu, A. (2003). The morphology and phonology of infixation. Unpublished doctoral dissertation, UC Berkeley. Yu, D., Deng, L., & Acero, A. (2009). Using continuous features in the maximum entropy model. Pattern Recognition Letters, 30 , 1295–1300. Zhang, J., Lai, Y., & Sailor, C. (2009). Opacity, phonetics, and frequency in Taiwanese tone sandhi. In Current issues in unity and diversity of languages: Collection of papers selected from the 18th International Congress of Linguists (p. 3019-3038). Zhao, Y., & Jurafsky, D. (2009). The effect of lexical frequency and Lombard reflex on tone hyperarticulation. Journal of Phonetics, 37 , 231–241.

A   Appendix: Tableaux in the neutralization typology

(23) Full Contrast (WF > WG > WS − WF)

            weights   G: No[−ant]   S: Pal   F: Id(ant)     H
                          10          20         20
  /sa/   ☞ sa                                                 0
           Sa              *                      *         −30
  /Sa/     sa                                     *         −20
         ☞ Sa              *                                 −10
  /si/   ☞ si                          *                    −20
           Si              *                      *         −30
  /Si/     si                          *          *         −40
         ☞ Si              *                                 −10


(24) Complementary Distribution (WS − WF > WG > WF)

            weights   G: No[−ant]   S: Pal   F: Id(ant)     H
                          20          40         10
  /sa/   ☞ sa                                                 0
           Sa              *                      *         −30
  /Sa/   ☞ sa                                     *         −10
           Sa              *                                 −20
  /si/     si                          *                    −40
         ☞ Si              *                      *         −30
  /Si/     si                          *          *         −50
         ☞ Si              *                                 −20

(25) Special-case Neutralization (WS − WG > WF > WG)

            weights   G: No[−ant]   S: Pal   F: Id(ant)     H
                          10          40         20
  /sa/   ☞ sa                                                 0
           Sa              *                      *         −30
  /Sa/     sa                                     *         −20
         ☞ Sa              *                                 −10
  /si/     si                          *                    −40
         ☞ Si              *                      *         −30
  /Si/     si                          *          *         −60
         ☞ Si              *                                 −10


(26) Total Neutralization (WG > WF + WS)

            weights   G: No[−ant]   S: Pal   F: Id(ant)     H
                          30          10         10
  /sa/   ☞ sa                                                 0
           Sa              *                      *         −40
  /Sa/   ☞ sa                                     *         −10
           Sa              *                                 −30
  /si/   ☞ si                          *                    −10
           Si              *                      *         −40
  /Si/   ☞ si                          *          *         −20
           Si              *                                 −30
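The harmonies in tableaux (23)–(26) can be re-derived mechanically: a candidate’s harmony is the negated weighted sum of its constraint violations, and the candidate with the highest harmony wins. The following sketch (not code from this study; the names VIOLATIONS, WEIGHTS and harmony are illustrative) encodes the violation profiles and the four weightings shown above and reproduces the winner for each input.

```python
# Constraints: G = No[-ant], S = Pal, F = Id(ant).
# "S" in candidate strings stands for the postalveolar fricative.
VIOLATIONS = {
    ("/sa/", "sa"): {},
    ("/sa/", "Sa"): {"G": 1, "F": 1},
    ("/Sa/", "sa"): {"F": 1},
    ("/Sa/", "Sa"): {"G": 1},
    ("/si/", "si"): {"S": 1},
    ("/si/", "Si"): {"G": 1, "F": 1},
    ("/Si/", "si"): {"S": 1, "F": 1},
    ("/Si/", "Si"): {"G": 1},
}

WEIGHTS = {
    "(23) Full Contrast":               {"G": 10, "S": 20, "F": 20},
    "(24) Complementary Distribution":  {"G": 20, "S": 40, "F": 10},
    "(25) Special-case Neutralization": {"G": 10, "S": 40, "F": 20},
    "(26) Total Neutralization":        {"G": 30, "S": 10, "F": 10},
}

def harmony(violations, weights):
    """Harmonic Grammar harmony: the negated weighted violation count."""
    return -sum(weights[c] * n for c, n in violations.items())

for name, w in WEIGHTS.items():
    print(name)
    for inp in ("/sa/", "/Sa/", "/si/", "/Si/"):
        cands = [(out, harmony(v, w))
                 for (i, out), v in VIOLATIONS.items() if i == inp]
        winner = max(cands, key=lambda c: c[1])[0]
        row = ", ".join(f"{out} (H={h})" for out, h in cands)
        print(f"  {inp}: {row}  ->  {winner}")
```

Running it yields the four patterns above: full contrast, complementary distribution (the postalveolar only before i), special-case neutralization (contrast only before a), and total neutralization.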


B   Appendix: Syllable template typological data

These are the languages in the University of Leiden StressTyp database (http://www.unileiden.net/stresstyp/index.htm) indicated to have each category of syllable template. Syllable templates that were indicated as uncertain (with question marks) were excluded, but templates marked as ‘probably’ or ‘at least’ were included. Vowel length distinctions and the distribution of consonant clusters are ignored. No attempt was made to correct the percentages for uneven genetic and geographical sampling, but the languages are arranged by family to show the diversity of the sample. (A small arithmetic check of the category percentages follows the list.)

a) 8 languages (3.2%) from 6 families - Minimal ({CV}; Onset required, Coda banned):
   Arawakan: Banawá
   Cayuvava: Cayuvava
   Choco: Embera Saija
   Na-Dene: Slave
   Niger-Congo: Grebo; Senoufo, Supyire
   Uto-Aztecan: Luiseño; Tepehuan, Southeastern

b) 37 languages (14.7%) from 13 families - Codaless ({CV,V}; Onset optional, Coda banned):
   Arawakan: Paumarí
   Australian: Mpakwathi
   Malayo-Polynesian: Da’a; Fijian; Kambera; Kilivila ([m] codas); Kwaio; Ledo; Muna; Napu; Ngada; Nias; Padoe; Pamona; Rapanui; Sio; Tahitian; Tawala; Tuamotuan; Tukang Besi; Uma; Wolio
   Mura: Pirahã (no V, only VV)
   Niger-Congo: Kongo
   Tacanan: Cavineña; Tacana
   Trans-New Guinea: Ekari
   Tucanoan: Cubeo; Desano (glottal stop codas)
   Uto-Aztecan: Kawaiisu
   Warao: Warao
   West Papuan: Galela; Pa’disua; Tabaru; Tobelo
   Witotoan: Huitoto
   Yanomam: Sanumã

c) 69 languages (27.4%) from 22 families - Onsetful ({CV,CVC}; Onset required, Coda optional):
   Afro-Asiatic: Iraqw; Arabic, Beirut; Arabic, Damascene; Arabic, Egyptian spoken; Arabic, Gulf; Arabic, South Levantine Spoken; Aramaic; Cairene Arabic; Hebrew, Tiberian; Maltese; Palestinian Arabic; Arabic, Bedouin-Hijazi
   Algic: Unami
   Arawakan: Piro; Suriname Arawak
   Australian: Mangarayi; Wardaman; Gunwinggu; Ngalkbun; Dhurga; Gaalpu; Gayardilt; Gidabal; Juat; Juwalarai; Kuku-Yalanji; Kuuku Ya’u; Mantjiltjara; Mayapi; Muruwari; Pitjantjatjara; Thurawal; Walmajarri; Wangaybuwan-Ngiyambaa; Warrgamay; Yanyuwa; Bagundji
   Austro-Asiatic: Bhumij; Khmer, Central; Khmu’
   Bunaban: Gooniyandi
   Carib: Hixkaryána
   Chapacura-Wanham: Pakaásnovos
   Hokan: Karok; Yana
   Indo-European: Hindi
   Kutenai: Kutenai
   Malayo-Polynesian: Tiruray; Aklanon; Hanunóo
   Mayan: Pokomchí; Tzotzil, Zinacanteco; Jacalteco
   Mixe-Zoque: Zoque, Copainalá
   Muskogean: Koasati; Muskogee
   Na-Dene: Masset Haida
   Niger-Congo: Koromfe
   North Caucasian: Hunzib
   Penutian: Maidu, Mountain; Miwok, Northern Sierra; Miwok, Southern Sierra; Nez Perce; Yawelmani
   Torricelli: Ningil
   Uralic: Estonian; Nenets, Tundra
   Uto-Aztecan: Yaqui; Cahuilla

d) 134 languages (53.2%) from 46 families - All good ({CV,CVC,V,VC}; Onset optional, Coda optional):
   Afro-Asiatic: Tachelhit; Kera; Saho
   Algic: Menomini
   Altaic: Bashkir; Evenki; Uzbek, Northern
   Araucanian: Mapuche
   Arawakan: Achagua; Bare
   Australian: Burarra; Maranunggu; Mullukmulluk; Ngalakan; Nunggubuyu; Baadi; Jaru; Kalkutung; Kokata; Pintupi-Luritja (V only initially); Yindjibarndi; Tiwi; Djingili; Wambaya; Maung
   Aymaran: Aymara; Jaqaru
   Basque: Basque, Souletin
   Carib: Carib
   Caucasian: Tsakhur
   Chibchan: Rama
   Chon: Tehuelche
   Chukotko-Kamchatkan: Chukot
   Dravidian: Malayalam
   East Bird’s Head: Meah
   East Papuan: Lavukaleve
   Eskimo-Aleut: Greenlandic, West; Norton Sound; Yupik, Chevak; Yupik, Pacific Gulf; Yupik, St. Lawrence Island; Yupik, General Central
   Gilyak: Gilyak
   Hokan: Maricopa; Mesa Grande Diegueño; Quechan
   Indo-European: Breton; Cornish; Gaelic, Irish; Gaelic, Scottish; Manx; Dutch; Swedish; Greek, Modern; Gujarati; Kalami; Maithili; Italian
   Iroquoian: Seneca; Mohawk, Akwesasne
   Macro-Ge: Canela-Krahô
   Malayo-Polynesian: Biak; Bugis; Gayo; Javanese; Kara; Karo Batak; Kisar; Kola; Kuanua, Kunama; Lampung; Larike; Lauje; Leti; Malagasy; Mamasa; Manggarai; Mentu; Mongondow; Paama; Pulopetak; Rantepao; Ratahan; Sama Baangingi, Bajan; Sawai; Selayar; Taba; Tagalog; Timor; Toba-Batak; Tondano; Tarangan, West
   Mayan: Aguacateco; Mam
   Misumalpan: Mískito
   Nambiquaran: Nambiquara
   Niger-Congo: Diola-Fogny
   Nilo-Saharan: Krongo; Murle
   North Caucasian: Abkhaz; Archi; Avar, Ghodoberi
   Oto-Manguean: Chinanteco, Lealao
   Panoan: Shipibo-Conibo
   Penutian: Hanis Coos; Shoshone, Panamint
   Quechuan: Quechua; Quichua, Imbabura
   Sepik-Ramu: Alamblak; Murik; Yimas
   Sino-Tibetan: Bawm
   Siouan: Dakota
   Timicua: Mocama
   Torricelli: Yil
   Trans-New Guinea: Amele; Ketengban; Lower Grand Valley Dani; Ono; Usan; Wahgi; Wambon; Weri; Woisika
   Trumai: Trumai
   Tupi: Urubú-Kaapor
   Uralic: Finnish; Karelian; Mansi; Vod
   Uto-Aztecan: Hopi
   West Papuan: Mai Brat
   Yuchi: Yuchi
   Yukaghir: Yukaghir (onsetless only initially)

e) 3 languages (1.2%) from 3 families - BOWOW ({CV,CVC,V}; Onset optional if no coda):
   Australian: Guguyimidjir
   Iroquoian: Cayuga
   Trans-New Guinea: Sentani

f) 1 language (0.4%) from 1 family - VC only:
   Australian: Arrernte
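The category percentages quoted above follow directly from the raw counts, assuming the six templates exhaust the 252-language sample; the snippet below is simply that arithmetic and is not part of the original survey.

```python
# Raw counts per syllable-template category (from the lists above).
counts = {
    "a) Minimal":    8,
    "b) Codaless":  37,
    "c) Onsetful":  69,
    "d) All good": 134,
    "e) BOWOW":      3,
    "f) VC only":    1,
}
total = sum(counts.values())  # 252 languages in the sample
for label, n in counts.items():
    print(f"{label}: {n}/{total} = {100 * n / total:.1f}%")
```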