Word Length Effects in a Contextual Model of Verbal Perception

Francesco Fumarola (Dated: July 20, 2016)

arXiv:1607.05530v1 [q-bio.NC] 19 Jul 2016

Retrieved-context models have played a crucial role in the understanding of serial-position effects in free recall. In this paper, a simple model in that class is proposed and tested against the case of word-length experiments. In recent years, standard interpretations of the word-length effect have been undermined by a series of experimental results, culminating with data that display an inversion of the effect for mixed lists. The model discussed in this paper predicts the experimental behavior as an effect of the different level of localization of short and long words in semantic space. Events corresponding to the recognition of a nonlocal word have a clustering property in phase space, which facilitates associative retrieval. The standard word-length effect arises directly from this property, and the inverse effect from its breakdown. The theory predicts that the contiguity effect should be stronger for shorter words. Several other predictions are listed, and experiments are proposed to further test the model. Finally, a possible interpretation of the results is discussed.

I. INTRODUCTION

A. Free recall: lexical and serial-position effects

In 1894, with their pioneering work on free-recall experiments, Binet and Henry introduced a key tool for the controlled investigation of short-term memory (Binet and Henry, 1894). In its traditional form, a free-recall experiment is performed by presenting the subject with a list of words and then requesting him or her to recall it in any order (Murdock, 1960, 1962; Roberts, 1972; Standing, 1973). Several types of effects have been reported:

1) Effects depending on the lexical properties of individual words. In particular, lists of short words are recalled better than lists of long ones, a fact known in the literature as the word-length effect (Baddeley et al., 1975; Russo and Grammatopoulou, 2003; Tehan and Tolan, 2007; Bhatarah et al., 2009).

2) Effects in which the recall probability depends on the absolute position of words in the list. It has been observed that the first and last words in the list are recalled more easily ("primacy" and "recency" effects).

3) Effects depending on the relative position of words with respect to each other. Most notably, the recall probabilities of contiguous words correlate positively, a fact known as the contiguity effect (Murdock, 1960, 1962).

The need to understand serial-position effects led to the devising of retrieved-context models, such as the Temporal Context Model (Howard and Kahana, 2002). In these models the recall process, rather than retrieving a word directly, first retrieves the context associated with the word. Within this scenario, recency effects appear because the context at the time of the "memory test" is most similar to the context associated with recent items. When an item is retrieved, it reinstates the context active when that item was presented. Because this context overlaps with the encoding context of the item's neighbors, a contiguity effect results. Through these models, serial-position effects have been substantially understood over the past fifteen years (Howard and Kahana, 2002, 2002b; Sederberg, Howard, and Kahana, 2008; Polyn, Norman, and Kahana, 2009, 2009b; Lohnas, Polyn, and Kahana, 2015; Healey and Kahana, 2015). The same cannot be said, however, about the word-length effect.

B. Riddles of the word length effect

The word-length effect (WLE) has been a traditional testing ground for models of short-term memory (Campoy, 2011; Jalbert et al., 2011), and it has played a key role in establishing the working-memory paradigm and the phonological loop hypothesis (Baddeley and Hitch, 1974). The standard account of the effect (Baddeley, 2007) relies on a trade-off between memory decay (in the phonological store) and subvocal rehearsal via an articulatory control process. Because shorter words take less time to rehearse, more decaying traces of them can be refreshed than decaying traces of long items, and, therefore, more short items can be recalled. This picture, however, is not able to account for all experimental observations concerning this effect, and has been repeatedly called into question.

In (Neath et al., 2003), it was shown that with words having the same number of syllables but different pronunciation times, no unambiguous WLE arises. This result (extended in Jalbert et al., 2011) suggests that the effect depends on the number of syllables, and not on the time it takes to pronounce them. Experiments have also been performed in conditions where there was a delay between lists, making subvocal rehearsal possible in the interval. No appreciable difference in recall probabilities was found (Campoy, 2008). In the same study, experiments were performed in which subvocal rehearsal was prevented by a high presentation rate. No delay was allowed between the presentation of word lists and the memory test. Yet, the WLE occurred unperturbed. In the 2011 paper cited above, Jalbert et al. concluded: "the WLE may be better explained by the differences in linguistic and lexical properties of short and long words rather than by length per se".

C. Semantic predictors of word length

In the meantime, within the fields of experimental and computational linguistics, progress has been made in understanding the role of word length in verbal processing. Over the years, it has emerged that words with different lengths tend to have different semantic properties. The idea was first put forth in pedagogical studies (Klare, 1988). In (Elts, 1995), a correlation coefficient of 0.96 was found between a noun's length and its average tendency to be used as a technical term ("terminologicality"). Mikk et al. (2000), using data on the human-assessed complexity of a large sample of words, found a correlation coefficient of 0.86 between words' length and their semantic complexity.

Pinning down the precise semantic property that correlates with word length has proven difficult. Already in (Greenberg, 1966) it was argued that a word's length correlates positively with its conceptual "markedness" of meaning. Various notions of markedness have subsequently been discussed in the literature (Haspelmath, 2006). Piantadosi et al. (2011, 2011b) and later Mahowald et al. (2012) reported that the length of words correlates positively with their contextual information rate. More recently, Lewis and Frank (2016) have carried out a comprehensive experimental study across 80 languages. They found that, in all the languages considered, judgments of conceptual complexity for a sample of real words correlate highly with their length, even when controlling for frequency, familiarity, imageability, and concreteness. Their conclusion is: "While word lengths are systematically related to usage − both frequency and contextual predictability − our results reveal a systematic relationship with meaning as well".

In the light of these findings, it would be a natural step to attempt an explanation of the WLE in terms of the semantic differences among words. However, no such approach seems to have been attempted in the literature.

D. The inverse word length effect

Recently, new aspects of the WLE have emerged through the analysis of a large set of data from experiments by Miller et al. (Miller et al., 2012). The data analysis was performed by Katkov et al. (Katkov et al., 2014), who found no negative correlation between total length of presented items and number of recalled words, thus disproving both rehearsal-time theories and hypotheses based on the increasing complexity of longer items. Moreover, they reported an inversion of the effect in mixed lists, that is, lists where words are selected irrespective of their length. They observed that, in this type of list, the mean recall probabilities follow an increasing trend: long words are recalled better than short ones. An "inverse" WLE had been previously reported by at least two groups, but in somewhat less general circumstances: one of them (Hulme et al., 2006) embedded strictly pure lists with a single word of a different type, while the results of Xu et al. (Xu et al., 2009) may not bear direct comparison with data in languages other than Chinese.

If the inversion of the WLE for mixed lists is confirmed by further experiments, it will have to be taken into account by every general theory of the standard WLE. Let us consider, therefore, what requirements a model should fulfill to explain both phenomena simultaneously. Call $\gamma$ the fraction of long words in the list; call $P_l(\gamma)$ the probability of successfully recalling a given long word from a list in which a fraction $\gamma$ of words are long; and let $P_s(\gamma)$ be the probability of successfully recalling a short word from a list with a fraction $\gamma$ of long words.

Obviously, the function $P_l(\gamma)$ is only defined for $\gamma > 0$, and the function $P_s(\gamma)$ only for $\gamma < 1$. For $\gamma \in\, ]0, 1[$, both functions are defined. Theorists would have to reconcile two observations on the curves $P_s(\gamma)$, $P_l(\gamma)$:

1. $P_l(\gamma = 1) < P_s(\gamma = 0)$
2. $P_l(\gamma) > P_s(\gamma)$ for $\gamma \in\, ]0, 1[$.

The only way these two inequalities can be simultaneously satisfied is if both $P_l$ and $P_s$ are, on the whole, decreasing functions of $\gamma$. The simplest choice of these curves compatible with experiments is one where both are monotonically decreasing, that is: dPl |Xw2 |. Consider two words $w_1$, $w_2$ such that $X_{w_1} = \{-A, A\}$ and $X_{w_2} = \{-A/2, A/2\}$, with $A$ even and $> 2$. We have $|X_{w_1}| = |X_{w_2}|$, but

$$d_{w_1} = \frac{A^2}{2A+1} > d_{w_2} = \frac{A^2 + A}{2(2A+1)}.$$
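The two closed forms in this example can be checked by direct enumeration. The sketch below rests on two assumptions that are inferred from those closed forms rather than taken from a stated definition: the contextual space is the integer segment $X = \{-A, \dots, A\}$, and the reaching distance of a word at a point $x$ is the distance from $x$ to the nearest state described by that word. The function name is illustrative only.

```python
from fractions import Fraction

def average_reaching_distance(word_states, A):
    """Average over X = {-A, ..., A} of the distance to the nearest state of the word.

    Assumption (not from the text): X is the integer segment [-A, A] and the
    reaching distance at x is the minimum of |x - state| over the word's states.
    """
    space = range(-A, A + 1)
    total = sum(min(abs(x - s) for s in word_states) for x in space)
    return Fraction(total, len(space))

A = 4  # illustrative value; the text requires A even and > 2
d_w1 = average_reaching_distance([-A, A], A)
d_w2 = average_reaching_distance([-A // 2, A // 2], A)

# Compare with the closed forms quoted in the text
assert d_w1 == Fraction(A**2, 2 * A + 1)
assert d_w2 == Fraction(A**2 + A, 2 * (2 * A + 1))
print(d_w1, d_w2)
```

Both words have the same number of states, yet the word whose states sit at the edges of the space is, on average, farther from a generic point.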

A direct relation can be shown to exist between a word's average reaching distance and the size of its Voronoi cells. Call $\lambda_i$ the size of the $i$-th Voronoi cell of word $w$. Assuming $\bar\lambda \gg 1$, we may neglect boundary effects and write

$$d_w \sim \frac{2}{|X|} \sum_{i \in X_w} \sum_{d=1}^{\lambda_i/2} d \sim \frac{\bar\lambda}{4} \left( 1 + \frac{\overline{(\lambda - \bar\lambda)^2}}{\bar\lambda^2} \right) \qquad (5)$$

Thus, the average reaching distance of a word depends solely on the first two moments of its Voronoi length distribution. If $\overline{(\lambda - \bar\lambda)^2} \gtrsim \bar\lambda^2$ for a word $w$, its word structure contains strong fluctuations, so one may separate $X$ into regions where states described by $w$ are denser, and regions where they are sparser. Word $w$ is effectively "localized" inside the regions where such states are dense, that is, it expresses those semantic areas better than those where its states are sparse.
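The step from the double sum to the moment form of eq. (5) can be spelled out as follows. This is a sketch under the same assumptions as the text (cells much longer than one lattice spacing, boundary effects neglected); the symbol $\overline{\lambda^2}$ for the mean squared cell length is introduced here for convenience.

```latex
% Inner sum, for \lambda_i \gg 1:
\sum_{d=1}^{\lambda_i/2} d
  = \frac{1}{2}\,\frac{\lambda_i}{2}\left(\frac{\lambda_i}{2}+1\right)
  \simeq \frac{\lambda_i^2}{8}

% The cells tile the space, so |X| = \sum_{i \in X_w} \lambda_i = |X_w|\,\bar\lambda :
d_w \sim \frac{2}{|X|}\sum_{i \in X_w}\frac{\lambda_i^2}{8}
    = \frac{1}{4}\,\frac{|X_w|\,\overline{\lambda^2}}{|X_w|\,\bar\lambda}
    = \frac{\overline{\lambda^2}}{4\bar\lambda}

% Using \overline{\lambda^2} = \bar\lambda^2 + \overline{(\lambda-\bar\lambda)^2} :
d_w \sim \frac{\bar\lambda}{4}
    \left(1 + \frac{\overline{(\lambda-\bar\lambda)^2}}{\bar\lambda^2}\right)
```

For equally sized cells the variance term vanishes and the reaching distance reduces to $\bar\lambda/4$, one quarter of the cell length.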

The degree of word localization arguably has a strong effect on the dynamics, as shown in Figure 6.

FIG. 6: Example of two trajectories induced by the same input (shown as an array of circles). The two trajectories begin with the system in different states. The red (localized) word creates a narrow trapping region in semantic space, marked by the dashed ellipse. Once the localized word appears in the input, all trajectories are trapped within the ellipse.

C. Analysis of word-length related properties

In most experiments, the relevant number of word lengths is four, since words with more than four syllables are rare in English. For simplicity, here we will consider only two word lengths, short and long. Call $d_\alpha$ (with $\alpha = s, l$) the reaching distance averaged over space and over all words of the same length (short or long). From eq. (5), we see that the space-averaged reaching distance is the product of two factors, involving respectively the first and second moment of the Voronoi length distribution. Let us consider how these two moments depend on the word's length.

The average Voronoi length $\bar\lambda$ may be related to the frequency of a given word in a corpus of the language. Indeed, if the system, in its 'speaking mode', explores contextual space ergodically and uniformly, the frequency of a word is $\nu_w = \bar\lambda_w^{-1}$.

In a typical corpus of the English language, the frequency $\nu(S)$ of words with $S$ syllables is monotonically decreasing. As a consequence, since $\nu_w = \bar\lambda_w^{-1}$, we expect the average Voronoi length $\bar\lambda_w$ to be larger for long words than for short words. While this is correct for most languages, notice that there are exceptions, such as Turkish and Arabic, where the function $\nu(S)$ is peaked at $S = 2$ (Fucks, 1956; Grzybek, 2007). Let us now look at the relative fluctuations of the Voronoi length, described by $\overline{(\lambda - \bar\lambda)^2} / \bar\lambda^2$.

We will surmise their magnitude through a qualitative argument. As mentioned in the Introduction, various approaches have been taken to prove that long words are on average more ’technical’, ’specialized’, ’distinctive’ or ’marked’ than short words. Several claims made in (Elts, 1995), (Mikk et al., 2000), and (Lewis and Frank, 2016) may be rephrased as the statement that long words are, on average, conceptually more specific. A word is conceptually specific if it is localized in certain areas of semantic space. A correlation exists, therefore, between word-length and semantic localization. Localization, in turn, will occur if the scale over which the Voronoi length fluctuates is comparable or greater than its average value. This leads to the conclusion that the relative fluctuations of the Voronoi length will be larger for longer words. Thus, both factors in eq. (5) take a greater value if the word is long. It follows that dl > ds .

D. Higher dimensions

If $X$ is taken to be a connected subset of $\mathbb{Z}^n$, the definition we have given for the word structure $\{X_w\}_{w \in W}$ applies all the same. The points in space are now vectors, and equations (3) and (4) are still valid, the distance in eq. (4) being the Euclidean distance in $n$ dimensions. The Voronoi cells, however, become less simple to treat as they can be arbitrary polyhedra (for a complete treatment, see Aurenhammer et al., 2013). Formula (5) for the reaching distance must be modified, and it takes a geometry-dependent form. The Voronoi structure, however, directly affects only the process of verbal perception, not the process of memory retrieval, which will be the subject of the next section. Thus, while in the figures I will refer to the one-dimensional case, the mathematical results will apply to any number of dimensions.

III. VERBAL RECALL

A. Retrieved-context primer

We have seen that the rules of motion of the system become Markovian during the perception of verbal input. In retrieved-context models of short-term memory, the rules of motion are also Markovian during the search for memories, which is conducted through the principle of free association. In these models, the retrieval of memories per se is not a measurable phenomenon. What can be observed is the retrieval of words describing those memories. Therefore, a recall experiment must be seen as the composition of two processes: a process of memory retrieval, and a process of memory verbalization. The Markovian process of memory retrieval must be described here, in keeping with the spirit of context-driven models, as a random walk on $X$ that effects a retrieval whenever it meets a state corresponding to the experience to be recalled. The verbalization process, on the other hand, depends on the verbalizability of memories: a memory $x$, once retrieved, has a probability $q_x$ of leading the system to produce the word describing it.

The following mathematical problem arises. Supposing one is given 1) the structure $\{v_x, q_x\}_x$ of the vocabulary; 2) a word list $\vec w = (w_1, w_2, \dots, w_N)$ presented to the system; 3) the state of the system when the word list begins to be presented, that is, a probability distribution $\chi[y_0]$ on its position $y_0$; 4) the state of the system when the retrieval process begins, that is, a probability distribution $\psi[x_0]$ on the new position $x_0$; one wants to predict the probability that the $i$-th word will be among those recalled by the system. In the next section, this program will be carried out for the particular case of a vocabulary containing words of two different lengths.
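The two-stage structure described here (retrieval by a random walk, then verbalization with probability $q_x$) can be illustrated with a small Monte Carlo sketch. Everything specific in it is an assumption made for illustration: the space is the integer line, the walk takes symmetric unit steps, each memory is retrieved at most once, and the dictionaries `memories` and `q` are a hypothetical encoding of the list.

```python
import random

def simulate_recall(memories, q, x0, T, rng):
    """One recall trial on the integer line (illustrative assumptions only).

    memories: dict mapping a position to the word stored there
    q:        dict mapping a word to its verbalization probability q_x
    x0:       starting position of the retrieving walk
    T:        number of random-walk steps allowed (cutoff)
    Returns the words recalled, in order of retrieval.
    """
    recalled, retrieved = [], set()
    x = x0
    for _ in range(T + 1):
        if x in memories and x not in retrieved:
            retrieved.add(x)                  # memory retrieval
            word = memories[x]
            if rng.random() < q[word]:        # memory verbalization
                recalled.append(word)
        x += rng.choice((-1, 1))              # free-association step
    return recalled

rng = random.Random(0)
memories = {0: "cat", 3: "dog"}
print(simulate_recall(memories, {"cat": 1.0, "dog": 1.0}, x0=0, T=50, rng=rng))
```

Averaging the outcome over many trials and starting points would give an empirical estimate of the recall probability that the analytical treatment in the next section computes.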

B. Application to the double word-length scenario

We will begin by defining the probability $P_h(t)$ that a memory placed at distance $h$ from $x_0$ is met by the retrieving random walk for the first time after a time $t$. In one dimension, this is given by

$$P_h(t) = \sum_{\vec n:\ \sum_{i=1}^{h} n_i = t} f_{n_1} f_{n_2} \cdots f_{n_h}$$

where $f_{2n} = 0$ and $f_{2n-1} = \frac{(2n-3)!!}{n!\, 2^n}$.

FIG. 7: Trajectory during presentation and recall of a three-word list. Starting from a random position (stage A), the system moves under the effect of verbal input (stage B), and its discontinuous path leaves memory traces (stage C) that are pursued by a random walk in the retrieval stage (stage D). The points in space-time corresponding to retrieval are boxed.

The probability that a memory will be retrieved is therefore

$$p^{\mathrm{retrieval}}(h) = \sum_{t=0}^{T} P_h(t) \qquad (6)$$

where the cutoff $T$ is needed to obtain meaningful results in one or two dimensions, and can otherwise be taken to infinity. Suppose a list $\vec w = (w_1, w_2, \dots, w_N)$ has been presented. Call $p_i(\vec w; y_0, x_0)$ the probability of retrieving the memory created by the $i$-th word in the list, given a certain initial position $y_0$ for the trajectory during presentation, and a certain initial position $x_0$ for the trajectory during retrieval. We have:
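The one-dimensional first-passage formula and the cutoff sum of eq. (6) can be evaluated exactly with rational arithmetic. The sketch below assumes a symmetric walk with unit steps, uses the convention $(-1)!! = 1$ for the $n = 1$ term, and builds $P_h(t)$ as the $h$-fold convolution of $f$; the function names are illustrative.

```python
from fractions import Fraction
from math import factorial

def double_factorial(k):
    """k!! for odd k >= -1, with the convention (-1)!! = 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result

def first_step_passage(t):
    """f_t: probability of first reaching a site at distance 1 exactly at time t."""
    if t < 1 or t % 2 == 0:
        return Fraction(0)                       # f_{2n} = 0
    n = (t + 1) // 2                             # t = 2n - 1
    return Fraction(double_factorial(2 * n - 3), factorial(n) * 2 ** n)

def first_passage(h, T):
    """List [P_h(0), ..., P_h(T)], obtained as the h-fold convolution of f."""
    f = [first_step_passage(t) for t in range(T + 1)]
    P = [Fraction(1)] + [Fraction(0)] * T        # h = 0: the memory sits at x0
    for _ in range(h):
        P = [sum(P[s] * f[t - s] for s in range(t + 1)) for t in range(T + 1)]
    return P

def p_retrieval(h, T):
    """Eq. (6): probability of meeting a memory at distance h within T steps."""
    return sum(first_passage(h, T))

print(p_retrieval(1, 3), p_retrieval(2, 4))
```

With this construction $p^{\mathrm{retrieval}}(0) = 1$ for any cutoff, and the retrieval probability decreases with distance at fixed $T$.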

$$p_i(\vec w; y_0, x_0) = p^{\mathrm{retrieval}}\big( |\Xi_i(y_0) - x_0| \big) \qquad (7)$$

where $\Xi_i := \xi_{w_i} \circ \dots \circ \xi_{w_2} \circ \xi_{w_1}$. Our goal is to compute the average recall probability for an arbitrary list composed of $S$ short words and $L$ long words arranged in a given order. This amounts to averaging eq. (7) over all lists of the same type $\vec\alpha = (\alpha_1, \dots, \alpha_N)$, where $\alpha_i \in \{s, l\}$ for $i = 1, \dots, N$:

$$p_i(\vec\alpha; y_0, x_0) = \frac{1}{W_s^S W_l^L} \sum_{\substack{\vec w \in W^N \\ \alpha(w_i) = \alpha_i,\ i = 1, \dots, N}} p_i(\vec w; y_0, x_0) := \Big\langle p^{\mathrm{retrieval}}\big( |\Xi_i(y_0) - x_0| \big) \Big\rangle_{\vec\alpha} \qquad (8)$$

where $\alpha(w)$ is the type of word $w$ and $W_s$ ($W_l$) is the number of short (long) words in the vocabulary. We will perform the averaging over lists through a mean-field approach. Mean-field approaches, widely employed in physics, consist in inverting the order of the two steps: the computation of observables and the averaging. Instead of averaging the final probabilities, one averages an intermediate, non-observable quantity usually called the "field". In this case, we may average the functions $\xi_w$ themselves.

From section II C, we know that the average displacement induced by the function $\xi_w$ is equal to $d_{\alpha(w)}$ with $d_s < d_l$, while no constraints emerged concerning the direction of this displacement as a function of word type. Therefore, to average the functions $\Xi_i$ over all word lists of one type, we replace $\xi_w$ with the function $\bar\xi_{\alpha(w)}$ defined by $\bar\xi_\alpha(x) = x + d_\alpha \hat e$, where $\hat e$ is a unit vector randomly chosen from a suitable distribution $\Omega_0[\hat e]$.

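We can thus write $\Xi_i(y_0) \sim y_0 + \sum_{k=1}^{i} d_{\alpha_k} \hat e_k$, and eq. (8) becomes

$$p_i(\vec\alpha; y_0, x_0) = \Big\langle p^{\mathrm{retrieval}}\Big( \Big| \sum_{k=1}^{i} d_{\alpha_k} \hat e_k + y_0 - x_0 \Big| \Big) \Big\rangle_{\Omega} \qquad (9)$$

where $\Omega(\hat e_1, \dots, \hat e_i) = \prod_{k=1}^{i} \Omega_0[\hat e_k]$.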

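The initial positions $y_0$ and $x_0$ are not measurable quantities. In eq. (9), therefore, they must be averaged over through two suitable distributions $\chi(y_0)$ and $\psi(x_0)$, yielding

$$p_i(\vec\alpha) = \Big\langle p^{\mathrm{retrieval}}\Big( \Big| \sum_{k=1}^{i} d_{\alpha_k} \hat e_k + y_0 - x_0 \Big| \Big) \Big\rangle_{\chi,\psi,\Omega} = \big\langle p(y_i | x_0) \big\rangle_{\chi,\psi,\Omega} \qquad (10)$$

where $y_i := y_0 + \sum_{k=1}^{i} d_{\alpha_k} \hat e_k$ and $p(y|x) := p^{\mathrm{retrieval}}(|y - x|)$.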

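The function we need to average may be rewritten as

$$p(y_i | x_0) = P[i_1 = i] + \sum_{j \neq i} P[i_1 = j]\, p(y_i | y_j) \qquad (11)$$

where $P[i_1 = i]$ is the probability that the $i$-th memory, $y_i$, will be the first one to be found, and $\sum_{i=1}^{N} P[i_1 = i] < 1$. Substituting eq. (11) into eq. (10), we find

$$p_i(\vec\alpha) = \Big\langle \big\langle P[i_1 = i] \big\rangle_{\chi,\psi} + \sum_{j \neq i} \big\langle P[i_1 = j] \big\rangle_{\chi,\psi}\, p(y_i | y_j) \Big\rangle_{\Omega} \qquad (12)$$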

The distribution $\psi$ refers to the state $x_0$ of the system after the so-called retention interval, during which the subject freely elaborates the information gathered during presentation. A full understanding of such elaboration would require modeling the free motion of this system, but we have only been able to markovianize the equations of motion during a progressive searching task or in the presence of a driving input. Thus, a reasonable ansatz is necessary.

Neglecting recency effects, we can suppose $\psi$ to contain $N$ similar peaks at the locations $y_1, y_2, \dots, y_N$ explored during presentation. In this picture, the dependence of $\langle P[i_1 = i] \rangle_{\chi,\psi}$ on the index $i$ will be negligible, so we can approximate $\langle P[i_1 = i] \rangle_{\chi,\psi}$ with a constant value $p_0$. Substituting this into eq. (12), and rewriting the argument of $p^{\mathrm{retrieval}}$, we find

$$p_i(\vec\alpha) = p_0 \Bigg[ 1 + \sum_{j \neq i} \Big\langle p^{\mathrm{retrieval}}\Big( \Big| \sum_{k=\min(i,j)+1}^{\max(i,j)} d_{\alpha_k} \hat e_k \Big| \Big) \Big\rangle_{\Omega} \Bigg] \qquad (13)$$

where $\Omega$ is now the product of $|j - i|$ copies of the distribution $\Omega_0$, and the value of the summand depends solely on the number of long and short words located between word $i$ and word $j$. Using $p^{\mathrm{retrieval}}(0) = 1$ to include the $j = i$ term into the sum, and noticing that the labels of the $\hat e_k$'s are interchangeable, we can finally rewrite eq. (13) as

$$p_i(\vec\alpha) = p_0 \sum_{j} \pi(S_{ij}, L_{ij}) \qquad (14)$$

where we have introduced the quantities

$$L_{ij} := \sum_{k=\min(i,j)+1}^{\max(i,j)} \mathbb{1}(\alpha_k = l), \qquad S_{ij} := |i - j| - L_{ij} \qquad (15)$$

$$\pi(m, q) := \Big\langle p^{\mathrm{retrieval}}\Big( \Big| d_s \sum_{k=1}^{m} \hat g_k + d_l \sum_{k=1}^{q} \hat h_k \Big| \Big) \Big\rangle_{\Omega} \qquad (16)$$

and the unit vectors $\hat g_k$, $\hat h_k$ are independently distributed according to $m + q$ copies of the distribution $\Omega_0$. For pure lists, eq. (14) becomes

$$p^{\mathrm{pure}}_{i,\alpha} := p_i(\alpha, \alpha, \dots, \alpha) = p_0 \Bigg[ 1 + \Bigg( \sum_{h=1}^{N-i} + \sum_{h=1}^{i-1} \Bigg) \pi(\delta_{\alpha s} h, \delta_{\alpha l} h) \Bigg] \qquad (17)$$

C. Coexistence of word length effects

As mentioned above, the probability of recalling the i-th word of the list is equal to the probability of retrieving the i-th memory, multiplied by the factor qwi . In the previous section, we averaged the retrieval probability over all words of the same type; similarly, we must now average the verbalizability qx , which we take to be distributed independently of the word structure {vx }x . Defining qα as the average verbalizability of words of type α, we obtain the full recall probability

$$P_i(\alpha_1, \dots, \alpha_N) = q_{\alpha_i}\, p_i(\alpha_1, \dots, \alpha_N) \qquad (18)$$
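Equations (14), (15) and (18) translate directly into a few lines of code once a function $\pi(m, q)$ is supplied. In the sketch below the list is encoded as a string of `"s"` and `"l"` characters with 0-based positions, and `toy_pi` is a purely hypothetical stand-in for eq. (16), chosen only so that the example runs; none of these names or values come from the text.

```python
def segment_counts(alpha, i, j):
    """(S_ij, L_ij) of eq. (15): types of the words at positions min(i,j)+1 .. max(i,j)."""
    lo, hi = min(i, j), max(i, j)
    L_ij = sum(1 for k in range(lo + 1, hi + 1) if alpha[k] == "l")
    return (hi - lo) - L_ij, L_ij

def recall_probability(alpha, i, pi, q, p0):
    """Eq. (18): P_i = q_{alpha_i} * p0 * sum_j pi(S_ij, L_ij), with the j = i term included."""
    p_i = p0 * sum(pi(*segment_counts(alpha, i, j)) for j in range(len(alpha)))
    return q[alpha[i]] * p_i

def toy_pi(m, q):
    """Hypothetical pi(m, q): equals 1 at (0, 0) and decays faster per long word."""
    return 0.5 ** m * 0.2 ** q

q = {"s": 1.0, "l": 1.0}          # illustrative verbalizabilities
print(segment_counts("lsl", 0, 2))
print(recall_probability("lsslss", 2, toy_pi, q, p0=0.1))
```

Note that eq. (15) counts the word at position $\max(i,j)$ but not the one at $\min(i,j)$, which the range in `segment_counts` reproduces.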

As mentioned in the Introduction, the classical WLE is the experimental fact that pure lists made of shorter words are easier to remember. Of course such behavior may always be prevented, in principle, by making the verbalizability ratio $q_s/q_l$ sufficiently low. Yet, if the reported inversion of the WLE exists within this model, it must rely on the requirement opposite to that of the classical effect, namely, that the ratio $q_l/q_s$ is sufficiently high. The questions we are therefore supposed to answer are: first, whether there exists for this model a range of values of $q_s/q_l$ where both the classical and the inverse WLE occur; second, under what conditions on the parameters this may happen, and whether such conditions are relevant to current experiments.

Let us describe the typical experimental situation. In the experiments, word lists are generated by drawing words at random from a vocabulary $W$. In a double word-length scenario, this vocabulary will contain $W_s$ short words and $W_l$ long ones. We may consider, therefore, an ensemble of lists of length $N$, where each word has a probability $\gamma := \frac{W_l}{W_s + W_l}$ of being long, and a probability $1 - \gamma$ of being short. The recall probability for the $i$-th word of the list can be averaged over all lists whose $i$-th word is of type $\alpha$, yielding $p_{i,\alpha}(\gamma) := \langle p_i(\vec\alpha) \rangle_\gamma$. Substituting eq. (14), this becomes:

$$p_{i,\alpha}(\gamma) = p_0 \Bigg[ \sum_{\substack{S, L \ge 0 \\ 0 \le S+L \le N-i}} \big\langle \pi(S, L) \big\rangle_\gamma + \sum_{\substack{S, L \ge 0 \\ 0 \le S+L \le i-1}} \big\langle \pi(S + \delta_{\alpha s}, L + \delta_{\alpha l}) \big\rangle_\gamma \Bigg] \qquad (19)$$

In the first sum, which is the contribution from $j \ge i$ in eq. (14), $S$ and $L$ stand for the number of short or long words in positions $k$ such that $i + 1 \le k \le j$; in the second sum (the contribution from $j < i$) $S$ and $L$ stand for the number of short or long words in positions $k$ such that $j + 1 \le k \le i$. In eq. (19), the notation $\langle \dots \rangle_\gamma$ denotes an averaging over $S$, $L$ performed for each separate value of $S + L$ and summed together. This is done through the binomial distribution

$$\Phi(S, L) = \binom{L+S}{L} \gamma^L (1 - \gamma)^S \qquad (20)$$

Given that the total recall probability is $P_{i,\alpha}(\gamma) = q_\alpha\, p_{i,\alpha}(\gamma)$, the standard WLE ($P_{i,s}(0) > P_{i,l}(1)$) will occur if $q_l/q_s < \theta_{cl}$, where $\theta_{cl} = p_{i,s}(0)/p_{i,l}(1)$, and the inverse effect will occur if $P_{i,s}(\gamma) < P_{i,l}(\gamma)$, that is, if $q_l/q_s > \theta_{inv}$, where $\theta_{inv} = p_{i,s}(\gamma)/p_{i,l}(\gamma)$. The two effects coexist if $\theta_{inv} < q_l/q_s < \theta_{cl}$, which can only happen if $\theta_{inv} < \theta_{cl}$. This condition can be rewritten as

$$p_{i,l}(1)\, p_{i,s}(\gamma) < p_{i,s}(0)\, p_{i,l}(\gamma) \qquad (21)$$

Now, it can be seen that $p_{i,\alpha}(\gamma)$ is a decreasing function of $\gamma$, because by increasing $\gamma$ one transfers weight from the first to the second argument of $\pi$ inside both terms of eq. (19), which reduces the value of the function $\pi$. Hence, we have $p_{i,l}(1) < p_{i,l}(\gamma)$ and $p_{i,s}(\gamma) < p_{i,s}(0)$, from which it follows that the inequality (21) is identically satisfied. We conclude that, in this model, the WLE can undergo an inversion for any $\gamma \in\, ]0, 1[$, as sketched in Figure 1.

D. Formulas for recall probabilities: slow-diffusion regime

Consider a list of the type $(\alpha_1, \dots, \alpha_N)$, where $\alpha_i \in \{s, l\}$, containing $L$ long words and $S$ short ones. The trajectory during presentation is illustrated in Figure 8. The system begins from a random position $y_0$, and at each new word $w_i$ of type $\alpha_i$, its position is shifted forward by the operator $\xi_{w_i}$. The distance travelled at step $i$ is, in the mean-field approach, equal to $d_{\alpha_i}$. So a memory produced by the presentation of a short word will be formed in the vicinity of the latest memory (that is, within a distance of order $d_s$) whereas a memory produced by the presentation of a long word will be formed at a longer distance (of order $d_l$) from the memory preceding it. Consequently, memories are divided into clusters separated by a distance of order $d_l$ from each other. Each cluster spreads over a width of order $d_s$.

lsslsssslsssslss

FIG. 8: Typical trajectory during the presentation of a mixed list. The longer jumps correspond to the presentation of long (localized) words. The structure of the list is shown in the box, where s stands for short words and l for long ones.

These memory clusters correspond to different "segments" of the list. Call $\{l_i\}_{i=1}^{L}$ the index values corresponding to long words within a given list. If $l_1 = 1$, the segments are $\vec s_i = (w_{l_i}, w_{l_i+1}, \dots, w_{l_{i+1}-1})$ for $i = 1, \dots, L - 1$, and $\vec s_L = (w_{l_L}, w_{l_L+1}, \dots, w_N)$. If $w_1$ is short, there is an additional segment $\vec s_0 = (w_1, w_2, \dots, w_{l_1-1})$, and the number of segments is $L + 1$. Call $a_i$ the length of segment $\vec s_i$. All the $a_i$'s must be positive except $a_0$, which may be null, and $\sum_{i=0}^{L} a_i = N$.

The direction in which the trajectory moves at each step is defined by the unknown distribution $\Omega_0$, which depends on the details of the word structure $\{v_x\}_x$. For a generic choice of $\Omega_0$, clusters formed at longer time intervals from each other will lie further apart in contextual space. Hence, the trajectory during presentation is the composition of two processes: a clustering process and a diffusion process.

Notice that in eq. (13) the argument of $p^{\mathrm{retrieval}}$ is of the order of $\sqrt{S_{ij} d_s^2 + L_{ij} d_l^2}$. If the list is short enough (that is, if $p^{\mathrm{retrieval}}(\sqrt{S d_s^2 + L d_l^2}) \sim p^{\mathrm{retrieval}}(d_l)$, $p^{\mathrm{retrieval}}(\sqrt{S} d_s) \sim p^{\mathrm{retrieval}}(d_s)$ and $p^{\mathrm{retrieval}}(d_l) \ll p^{\mathrm{retrieval}}(\sqrt{S} d_s)$), the summand has two orders of magnitude: one of the order of $p^{\mathrm{retrieval}}(d_l) = p_l$ and one of the order of $p^{\mathrm{retrieval}}(d_s) = p_s$. This corresponds to the fact that the diffusion process is slow on the scale of the trajectory during presentation. In this regime, formula (13) for $p_i(\vec\alpha)$ may be estimated by replacing the summand with $p_l$ whenever there is at least one long word between $\min(i, j) + 1$ and $\max(i, j)$, and with $p_s$ otherwise. This is equivalent to approximating the matrix elements $\pi(m, q)$ of eq. (16) with $\pi(1, 0)$ if $q = 0$ and $m > 0$, and with $\pi(0, 1)$ if $q > 0$, thus ignoring the dependence of the retrieval process on the distance between segments of the list.

Call $y_{i_1}$ the first memory to be retrieved. The conditional retrieval probability for memory $y_j$ is $p^{\mathrm{retrieval}}(|y_j - y_{i_1}|)$. In the slow-diffusion limit, this is of the order of $p_s$ if memories $y_{i_1}$ and $y_j$ belong to the same cluster, and is of the order of $p_l$ otherwise. Averaging over the first retrieval, we find

$$p(i) = p_0 \big[ 1 + (c_i - 1) p_s + (N - c_i) p_l \big] \qquad (22)$$

where $c_i$ is the length of the segment of the list to which word $i$ belongs. The average recall probability $\bar P_\alpha$ for words of type $\alpha$ will thus be equal to:

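$$\bar P_s = \frac{q_s p_0}{S} \sum_{i=0}^{L} (a_i - 1 + \delta_{0i}) \big[ 1 + (a_i - 1) p_s + (N - a_i) p_l \big] \qquad (23)$$

$$\bar P_l = \frac{q_l p_0}{L} \sum_{i=1}^{L} \big[ 1 + (a_i - 1) p_s + (N - a_i) p_l \big] \qquad (24)$$

Defining $\mu = \frac{N - a_0}{L}$ and $\Delta = \frac{1}{L} \sum_{i=0}^{L} a_i^2$, we can rewrite this as

$$\bar P_s = q_s p_0 \Big[ N p_l - p_s + 1 + \frac{L}{N - L} (\Delta - \mu)(p_s - p_l) \Big] \qquad (25)$$

$$\bar P_l = q_l p_0 \Big[ N p_l - p_s + 1 + \mu (p_s - p_l) \Big] \qquad (26)$$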
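The equivalence between the segment sums of eqs. (23)-(24) and the compact forms (25)-(26) can be verified on any mixed list. The sketch below encodes a list as a string of `"s"` and `"l"` characters, uses exact rational arithmetic, and takes the list structure shown in Fig. 8 as its example; the parameter values for $p_s$ and $p_l$ are arbitrary illustrative choices, and the function names are not from the text.

```python
from fractions import Fraction

def segment_lengths(alpha):
    """[a_0, ..., a_L]: a_0 counts leading short words; each later segment starts at a long word."""
    a = [0]
    for t in alpha:
        if t == "l":
            a.append(1)
        else:
            a[-1] += 1
    return a

def mean_recall_direct(alpha, ps, pl, qs=1, ql=1, p0=1):
    """Eqs. (23)-(24): average of eq. (22) over short and over long words (mixed lists only)."""
    N, a = len(alpha), segment_lengths(alpha)
    S, L = alpha.count("s"), alpha.count("l")
    cluster = lambda c: 1 + (c - 1) * ps + (N - c) * pl
    Ps = qs * p0 * sum((a_i - 1 + (i == 0)) * cluster(a_i) for i, a_i in enumerate(a)) / S
    Pl = ql * p0 * sum(cluster(a_i) for a_i in a[1:]) / L
    return Ps, Pl

def mean_recall_closed(alpha, ps, pl, qs=1, ql=1, p0=1):
    """Eqs. (25)-(26): the same averages expressed through mu and Delta."""
    N, a = len(alpha), segment_lengths(alpha)
    L = alpha.count("l")
    mu = Fraction(N - a[0], L)
    Delta = Fraction(sum(x * x for x in a), L)
    base = N * pl - ps + 1
    Ps = qs * p0 * (base + Fraction(L, N - L) * (Delta - mu) * (ps - pl))
    Pl = ql * p0 * (base + mu * (ps - pl))
    return Ps, Pl

ps, pl = Fraction(1, 2), Fraction(1, 10)   # illustrative values with ps > pl
alpha = "lsslsssslsssslss"                 # list structure of Fig. 8
assert mean_recall_direct(alpha, ps, pl) == mean_recall_closed(alpha, ps, pl)
print(mean_recall_closed(alpha, ps, pl))
```

Because only $\mu$ and $\Delta$ enter the closed forms, two different orderings with the same values of these parameters give identical average recall probabilities in this regime.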

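All the dependence on the ordering of words in the list, therefore, enters the recall probabilities through the parameters $\mu$ and $\Delta$. The values of these parameters are shown in Table I for simple lists.

TABLE I: Values of the observables $\mu$, $\Delta$, and $\bar P_s / \bar P_l$ for simple lists

| List structure | $\mu$ | $\Delta$ | $\bar P_s / \bar P_l$ |
|---|---|---|---|
| $l\, l\, l \dots$ | $1$ | $1$ | |
| $\underbrace{s \dots s}_{S}\, l \dots l$ | $1$ | $1 + \frac{S^2}{N-S}$ | $\frac{q_s}{q_l} \frac{(N-S) p_l + (S-1) p_s + 1}{(N-1) p_l + 1}$ |
| $l \dots l\, s\, l \dots l$ | $\frac{N}{N-1}$ | $\frac{N+2}{N-1}$ | $\frac{q_s}{q_l} \frac{(N-1)\left[(N-2) p_l + p_s + 1\right]}{N^2 p_l + N - 2N p_l + p_s - 1}$ |
| $\underbrace{s \dots s}_{M}\, l\, s \dots s$ | $N - M$ | $N^2 - 2M(N-M)$ | $\frac{q_s}{q_l} \frac{N^2 p_s - 2N p_s + p_s + N - 1 + M(2M - 2N + 1)(p_s - p_l)}{(N-1)\left[(N-M) p_s + M p_l - p_s + 1\right]}$ |
| $l\, \underbrace{s \dots s}_{m}\, l\, \underbrace{s \dots s}_{m} \dots l\, \underbrace{s \dots s}_{m}$ | $m + 1$ | $(m+1)^2$ | $\frac{q_s}{q_l}$ |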

For a pure list, entirely composed of words of type $\alpha$, eqs. (25) and (26) become:

$$P^{\mathrm{pure}}_\alpha = q_\alpha p_0 \big[ 1 + (N - 1) p_\alpha \big] \qquad (27)$$

Data for mixed lists may be interpolated with formulas (25) and (26) to test the theory and fix the values of the internal parameters.

Finally, let us look at the range of occurrence of the WLEs in the slow-diffusion regime. The classical WLE ($P^{\mathrm{pure}}_s > P^{\mathrm{pure}}_l$) emerges for $q_l/q_s < \theta_{cl}$, where

$$\theta_{cl} = \frac{1 + (N - 1) p_s}{1 + (N - 1) p_l} \qquad (28)$$

The ratio $\bar P_l / \bar P_s$ for mixed lists can be estimated as follows. Given a fixed value of $L < N$, $\mu$ ranges between $\mu_{min} = 1$ (for $a_0 = N - L$) and $\mu_{max} = N/L$ (for $a_0 = 0$). The minimum value of $\Delta$ is obtained by starting the list with a short word and having $N - (L+1) \lceil \frac{N-L-1}{L+1} \rceil$ segments of $1 + \lceil \frac{N-L-1}{L+1} \rceil$ words and $(L+1)\big(1 + \lceil \frac{N-L-1}{L+1} \rceil\big) - N$ segments of $1 + \lfloor \frac{N-L-1}{L+1} \rfloor$ words. The maximal $\Delta$ is obtained by setting $a_0 = 0$ and lumping all the short words into one segment: $\Delta_{max} = \frac{(N-L+1)^2 + L - 1}{L}$.

Substituting ∆max and µmin in (25), (26), one obtains a strict upper bound on the ratio Ps /Pl :

Ps Pl


θinv ; hence, the possibility that the classical and inverse effects may coexist requires

θcl > θinv . If we rewrite this inequality by means of eq. (28) and (29), the ”microscopic” probability parameters ps and pl cancel out and we obtain the general condition L2 −LN +2(N −L) < 0, which is identically satisfied for any L > 2. It follows that, in the slow-diffusion regime, the system displays an inversion of the WLE for all mixed lists containing more than two long words. This statement, unlike the conclusions of the previous section, is not only true on average, but holds true regardless of the order in which short and long words are arranged within the list.
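The order-independence of this statement can be probed by brute force for small lists. The sketch below enumerates every arrangement of $L$ long words in a list of length $N$, computes the averages of eq. (22) over short and long words, and checks that their ratio (with $q_s = q_l$ factored out) stays below $\theta_{cl}$ of eq. (28), which is the condition for a window of $q_l/q_s$ values to exist where both effects occur. The values of $N$, $L$, $p_s$ and $p_l$ are arbitrary illustrative choices with $p_s > p_l$, and the function names are not from the text.

```python
from fractions import Fraction
from itertools import combinations

def mean_recall(alpha, ps, pl):
    """(P_s/(q_s p0), P_l/(q_l p0)) from eq. (22), averaged over short and over long words."""
    N = len(alpha)
    seg, current = [], 0
    for t in alpha:
        current += (t == "l")          # a long word opens a new segment
        seg.append(current)
    sums = {"s": Fraction(0), "l": Fraction(0)}
    for t, g in zip(alpha, seg):
        c = seg.count(g)               # length of the segment containing this word
        sums[t] += 1 + (c - 1) * ps + (N - c) * pl
    return sums["s"] / alpha.count("s"), sums["l"] / alpha.count("l")

def coexistence_holds(N, L, ps, pl):
    """True if, for every ordering, the mixed-list ratio lies below theta_cl of eq. (28)."""
    theta_cl = (1 + (N - 1) * ps) / (1 + (N - 1) * pl)
    for long_positions in combinations(range(N), L):
        alpha = "".join("l" if k in long_positions else "s" for k in range(N))
        Ps, Pl = mean_recall(alpha, ps, pl)
        if not Ps / Pl < theta_cl:
            return False
    return True

print(coexistence_holds(8, 3, Fraction(1, 2), Fraction(1, 10)))
```

Such a check covers only the parameter values tried; it complements, and does not replace, the algebraic argument.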

E. Formulas for recall probabilities: fast-diffusion regime

In the previous section we have considered the case where the diffusion process was much slower than the clustering process, which translates into upper bounds on the order of magnitude of S and L. Notice that we defined these bounds in terms of the parameters ds and dl, whose values may vary from subject to subject. Therefore, the very same list may be experienced in a fairly stationary regime by one subject, and in a regime of fast diffusion by a more easily distracted one. Here, we will consider the opposite case of very fast diffusion, defined as the regime where p_retrieval(ds + x) decreases rapidly on the length scale of ds, so that the matrix elements of π̂ decay quickly as the indices grow, and the recall dynamics is dominated by the contiguity effect. If a long word causes less diffusion than two short words, we may neglect all but the top-left elements of the π̂-matrix: π(0, 0) = 1, π(1, 0) = ps, and π(0, 1) = pl. Otherwise, we can work in the lowest approximation by also neglecting pl. Eq. (19) becomes

$$P_\alpha(\gamma) \sim p_0\, q_\alpha \left[\, 1 + p_s + p_\alpha - (p_s - p_l)\,\gamma \,\right] \qquad (30)$$

where I dropped the subscript i because all dependence on i vanishes as long as i is neither 1 nor N. The thresholds for the verbalizability ratio are

θcl = and for θinv
2 are relevant, and a polynomial interpolation can be performed, by truncating the sums in eq. (19). The crossover toward linearity of the curves Pα (γ) may be explored by tuning experimentally the amount of diffusion in the system, as we will see in the next section.
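To make the linearity of the fast-diffusion formula concrete, here is a minimal sketch that evaluates eq. (30) and checks that Pα(γ) lies on a straight line with slope −p0 qα (ps − pl). The numerical values of p0, qα, ps and pl are hypothetical placeholders chosen only for illustration, and γ is assumed here to range over [0, 1].

```python
import math


def recall_probability_fast(gamma, alpha, p0, q, p):
    """Eq. (30): P_alpha(gamma) ~ p0 * q_alpha * [1 + p_s + p_alpha - (p_s - p_l) * gamma].

    alpha is 's' (short) or 'l' (long); q and p are dicts keyed by 's' and 'l'.
    """
    return p0 * q[alpha] * (1.0 + p["s"] + p[alpha] - (p["s"] - p["l"]) * gamma)


# Hypothetical parameter values, for illustration only (not fitted to any data).
P0 = 0.5
Q = {"s": 0.6, "l": 0.8}
P = {"s": 0.3, "l": 0.1}


def slope(alpha):
    """Change in P_alpha between gamma = 0 and gamma = 1."""
    return (recall_probability_fast(1.0, alpha, P0, Q, P)
            - recall_probability_fast(0.0, alpha, P0, Q, P))


def is_linear(alpha, steps=10):
    """Check that P_alpha(gamma) lies on the chord joining gamma = 0 and gamma = 1."""
    start = recall_probability_fast(0.0, alpha, P0, Q, P)
    return all(
        math.isclose(
            recall_probability_fast(k / steps, alpha, P0, Q, P),
            start + slope(alpha) * k / steps,
            abs_tol=1e-12,
        )
        for k in range(steps + 1)
    )
```

In an experiment, the same function could serve as the fitting form for data collected at slow presentation rates, with p0, qα, ps and pl treated as free parameters.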

IV. TESTING THE THEORY

The foregoing analysis has shown that a contextual model of verbal recall can consistently display both the classical and the inverse WLE. There are, however, other mechanisms that could lead to a similar prediction. Katkov et al., for instance, propose a lexical explanation of both the direct and inverse WLE, based on possible differences in the long-term neural representation of long and short words (Katkov et al., 2014). To test the theory I have discussed, therefore, one needs qualitative predictions capable of teasing out which mechanism is really responsible for the effects.

A. Predictions on average recall probabilities

The data used by Katkov et al. to identify a reverse WLE refer to a distribution of word lengths with a number of syllables S between S = 1 and S = 4. A natural way of testing the present theory would be through experiments in which the words presented for recall are drawn from either of two sets, a set of very long words (number of syllables S = 3 or S = 4) and a set of very short ones (S = 1). If the lists are sufficiently short, we expect the system to operate in a slow-diffusion regime, so three parameters must be tracked in the experiment: ∆, µ, and the number L of long words in each list. While the recall probability of individual words depends on the details of the list structure, the average recall probability for short or long words is affected by the list structure only through those three parameters, which completely characterize the list for the purposes of this type of experiment. It is then feasible to test qualitatively the following predictions:

1. The average recall probability Ps of short words correlates positively with ∆ and negatively with µ.
2. The average recall probability Pl of long words correlates positively with µ.

These are straightforward consequences of (25), (26). In experiments with words of two lengths, data may be fitted with formulas (25) and (26) to test the theory and to fix the values of the internal parameters.

B. Predictions depending on word position

Serial-position effects related to word length can be isolated experimentally by using word lists constructed on two or more basic templates of segment structure, each described by a vector α = (α1, ..., αN) with αi = s, l. The experimenter would present equivalent lists to a number of participants and would correlate the recall probability of words with their positions within each template structure.

Large data sets from such experiments should be able to corroborate or to rule out the mechanisms I have described. Competing effects of familiar types, such as primacy and recency, may be easily subtracted. Two simple predictions can be made, easier to check in a slow-diffusion regime but equally valid with faster diffusion:

1. Words from longer segments have a higher recall probability than words of the same type from shorter segments.
2. Successful recall of words from one segment hinders the recall of words from others.

The first prediction follows from eq. (22), and should be easy to test. The second prediction must be compared with data by computing the correlation functions Cij, that is, the joint probability for the recall of the i-th and j-th words in the list. From there, it is straightforward to obtain the in-segment and cross-segment correlation functions:

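$$C^{\mathrm{in}}_{\alpha_1 \alpha_2}(d) = \Big\langle C_{ij} \Big\rangle_{\substack{|i-j| = d \\ i,j \,\in\, \text{same segment} \\ \alpha(w_i) = \alpha_1,\ \alpha(w_j) = \alpha_2}}, \qquad C^{\mathrm{cross}}_{\alpha_1 \alpha_2}(d) = \Big\langle C_{ij} \Big\rangle_{\substack{|i-j| = d \\ i,j \,\in\, \text{different segments} \\ \alpha(w_i) = \alpha_1,\ \alpha(w_j) = \alpha_2}} \qquad (32)$$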

and to verify whether $C^{\mathrm{in}}_{\alpha_1 \alpha_2}(d) > C^{\mathrm{cross}}_{\alpha_1 \alpha_2}(d)$, as the theory suggests. We may add to these predictions a third one, namely that the long word of each segment is the easiest one to recall. This results directly from the bounds we derived in the previous section on the ratio qs/ql.
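The in-segment and cross-segment correlation functions could be estimated from recall data along the following lines. This is a minimal sketch under stated assumptions: the segmentation rule (each long word opens a new segment, and the short words that follow belong to it) is an assumption of this example, the function names are hypothetical, and the recall matrix below is toy data invented purely for illustration.

```python
def segment_ids(template):
    """Assign a segment index to each list position.

    Assumption: every long word ('l') opens a new segment, and the short
    words ('s') that follow it belong to that segment. Short words placed
    before the first long word form segment 0.
    """
    ids, current = [], 0
    for kind in template:
        if kind == "l":
            current += 1
        ids.append(current)
    return ids


def joint_recall(recalls):
    """C[i][j]: fraction of trials in which words i and j were both recalled."""
    n_trials, n = len(recalls), len(recalls[0])
    return [[sum(t[i] * t[j] for t in recalls) / n_trials for j in range(n)]
            for i in range(n)]


def mean_or_none(values):
    """Arithmetic mean, or None when there is nothing to average."""
    return sum(values) / len(values) if values else None


def segment_correlations(template, recalls, d, a1, a2):
    """Return (C_in, C_cross) at lag d for word types a1 and a2, as in eq. (32)."""
    seg = segment_ids(template)
    C = joint_recall(recalls)
    same, cross = [], []
    for i in range(len(template)):
        for j in range(len(template)):
            if abs(i - j) != d or template[i] != a1 or template[j] != a2:
                continue
            (same if seg[i] == seg[j] else cross).append(C[i][j])
    return mean_or_none(same), mean_or_none(cross)


# Toy data (hypothetical): 4 trials on the template l-s-l-s; 1 = recalled.
template = "lsls"
recalls = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 0, 1, 0],
]
c_in, c_cross = segment_correlations(template, recalls, d=1, a1="l", a2="s")
```

With real data, the comparison of c_in and c_cross would be repeated for each lag d and each pair of word types, after subtracting primacy and recency effects.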

C. Predictions on the contiguity effect

The contiguity effect (Murdock, 1960, 1962) is the observation that the recall probabilities of contiguous words correlate positively. In other words, the recall of a word favors the recall of words that are contiguous to it within the list. In the model we have considered, no matter how slowly diffusion operates, the contiguity effect is stronger for short words than for long words. This follows from the fact that two consecutive short words necessarily belong to the same segment, whereas a short word and a long word, though contiguous, may belong to different segments, and two long words are sure to belong to different segments.

This prediction is especially important because, although it has been derived in a two-length model, it generalizes to models with multiple lengths. In the scenario we have described, it is evident that the shorter a word is (i.e., the smaller its reaching distance), the more important the role played by contiguity in its recall. The shorter the words involved, the stronger the contiguity effect. Comparison with existing databases should be sufficient to either disprove or confirm this prediction.

D. Inter-response intervals

A further prediction of the theory regards inter-response intervals, that is, the time elapsing between one recalled item and the next, whose measurement in free-recall experiments dates back to Murdock and Okada (1970). During retrieval from long-term memory, it was shown (Gruenewald and Lockhead, 1980) that clusters occur due to stable semantic associations between objects: a subject who is asked to list some animals, for instance, may recall first a set of farm animals, then a number of house pets, then several birds. The inter-response intervals are shorter within clusters than between clusters. The retrieval of examples in the experiment of Gruenewald and Lockhead depends entirely on the long-term representation of items. In the situation we have described, on the contrary, short-term memory is at play, and the retrieval process has to locate the vanishing traces of a recent experience. Nonetheless, it is easy to see that the time interval elapsing between the retrieval of two memories will be longer for memories belonging to different clusters, and shorter for memories belonging to the same cluster. Hence, the same is true among items of the list that are successfully verbalized. The inter-response intervals will be longer for the consecutive recall of two words belonging to different segments of the list, and shorter for the consecutive retrieval of two words belonging to the same segment.
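The comparison of inter-response intervals within and between segments can be sketched as follows. The segmentation rule (each long word opens a new segment), the function names, and the single toy trial are all assumptions made for illustration; they are not taken from any experimental data set.

```python
def segment_ids(template):
    """Segment index per list position, assuming each long word ('l') opens
    a new segment that also contains the short words ('s') following it."""
    ids, current = [], 0
    for kind in template:
        if kind == "l":
            current += 1
        ids.append(current)
    return ids


def mean_or_none(values):
    """Arithmetic mean, or None when there is nothing to average."""
    return sum(values) / len(values) if values else None


def mean_irt_by_transition(template, recall_order, recall_times):
    """Mean inter-response interval for within-segment and between-segment
    transitions in one recall sequence.

    recall_order: list positions in the order they were recalled.
    recall_times: time (in seconds) at which each recalled word was produced.
    """
    seg = segment_ids(template)
    within, between = [], []
    for k in range(len(recall_order) - 1):
        i, j = recall_order[k], recall_order[k + 1]
        interval = recall_times[k + 1] - recall_times[k]
        (within if seg[i] == seg[j] else between).append(interval)
    return mean_or_none(within), mean_or_none(between)


# Toy trial (hypothetical): template l-s-l-s, recalled in the order 0, 1, 3, 2.
irt_within, irt_between = mean_irt_by_transition(
    "lsls", recall_order=[0, 1, 3, 2], recall_times=[1.0, 1.5, 4.0, 4.5]
)
```

The prediction discussed above corresponds to irt_between exceeding irt_within once the two averages are pooled over many trials and participants.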

E. Experiments with varying presentation rate

As mentioned in the Introduction, the WLE has been reported in experiments where the time interval between the presentation of consecutive items was a controlled parameter. Experiments on the WLE with rapid presentation of the stimuli were first performed by Coltheart and Langdon (1998), who found the WLE by presenting an item every 114 ms, every 157 ms, and every 243 ms. In (Campoy, 2008), somewhat lower presentation rates were used (between 300 and 400 ms per item), and again the persistence of the effect was demonstrated over different rehearsal times. Here, I will argue that such experiments may offer an ideal tool to study the crossover between the "diffusive" and the "clustering" regimes. Indeed, if we modify the foregoing computation to allow the system to random-walk on its own for a time τ between the presentation of the i-th and (i+1)-th items, this will be equivalent to increasing the average distance d travelled between the memory yi generated by word wi and the memory yi+1 generated by word wi+1. This amounts to rescaling time while replacing dl and ds with larger effective distances. The matrix elements of π̂ will decay faster as their indices grow. Therefore, the system will move closer to the diffusive regime. On the other hand, the theory predicts (section III E) that the curve Pα(γ) will become linear in the fast-diffusion regime. Hence, by reducing the presentation rate, one should see the two curves in Fig. 1 becoming progressively linearized, at least up to values of τ so large that not only the clustering, but also the contiguity effect breaks down. In this limit, moreover, eq. (30) predicts that the curves Ps(γ) and Pl(γ) will become parallel. Other ways of controlling the crossover between the clustering and diffusive regimes (e.g. by pharmacological means, or through distractor tasks such as those of Bjork and Whitten, 1974) can be similarly applied.

V. INTERPRETATION AND CONCLUSIONS

I have proposed a simple contextual model of verbal perception and verbal recall. The model is based on the notion that a word does not have, in general, a single meaning. When an intelligent system is exposed to a stream of verbal input, it decides on the meaning of each new word on the basis of both the internal structure of its vocabulary and the meaning it has given to the words preceding it. This also applies to a list of random words, because the mind strives to interpret them as parts of a meaningful discourse. It may be instructive to think of such discourses as "narratives". Common experience tells us that a two-word list is already capable of creating a strong narrative sense (e.g.: picnic, lightning). When a word in the list has no semantic connection to the context created by the words preceding it, the mind perceives a "change of scenery" and assumes that a new narrative is beginning. A list of words is thus perceived as a collection of distinct "stories". When prompted to recall the list, the system remembers each story as a separate experience, and needs to re-experience a given story before retrieving the words responsible for creating it.

Words that have specific meanings are obviously less likely to fit into a randomly generated story. In other words, the words most likely to break the narrative are those with the highest level of localization in semantic space. We have argued that this correlates positively with word length. Hence, a list of N long words is likely to break into as many one-word stories, whereas a list of short words is more likely to be perceived as a single continuous narrative. Since a single narrative is easier to recall than many unrelated ones, the standard WLE ensues.

The clustering property of short words is at play in mixed lists as well, but its effect is hindered by the presence of long words breaking the narrative. As our analysis has shown, this can lead to an inversion of the WLE, analogous to recent experimental observations. In this scenario, the behavior depicted in Figure 1 becomes quite logical. By replacing a short word with a long word, one splits the list into a larger number of narratives, which makes every single word in the list (whether short or long) harder to reach during the retrieval process.
The interplay between the trajectory of the system during the presentation of lists and the trajectory during the memory test produces a nontrivial spectrum of behaviors, highly dependent both on the structure of the list and on the amount of "diffusion" that interferes with the clustering. I have proposed a number of measurements meant to provide conclusive empirical tests of the model. In particular, the theory predicts that the contiguity effect will be stronger for shorter words. Several directions stand open for future theoretical work. Some of them are:

1. generalizing the analysis to the case where more than two word lengths are available;
2. including possible competition between words for the verbalization of a given state;
3. singling out extra effects from the fluctuations around the mean-field behavior;
4. accounting for primacy and recency effects;
5. extending the method to predict the genesis of false memories.

Finally, by positing a suitable mechanism for spontaneous language production, it would be useful to derive equations linking the underlying word structure to emerging verbal patterns, thus providing a direct link between the hidden variables and the observables of the model.

VI. BIBLIOGRAPHY

Aurenhammer F., Klein R., and Lee D.T., 2013. Voronoi Diagrams and Delaunay Triangulations. Singapore: World Scientific Publishing Company.

Baddeley, A.D., Hitch, G., 1974. Working memory. In G.H. Bower (Ed.), The psychology of learning and motivation: Advances in research and theory, Vol. 8, pp. 47-89. New York: Academic Press.

Baddeley A.D., Thomson N., Buchanan M., 1975. Word length and the structure of short-term memory. Journal of Verbal Learning and Verbal Behavior, 14:575-589.

Baddeley, A.D., 2007. Working memory, thought and action. Oxford: Oxford University Press.

Bhatarah P., Ward G., Smith J., Hayes L., 2009. Examining the relationship between free recall and immediate serial recall: similar patterns of rehearsal and similar effects of word length, presentation rate and articulatory suppression. Memory and Cognition 37:689-713.

Binet A. and Henry V., 1894. La mémoire des mots. L'année psychologique, Bd. I 1:1-23.

Bjork, R. A. and Whitten, W. B., 1974. Recency-sensitive retrieval processes in long-term free recall. Cognitive Psychology, 6:173-189.

Campoy G., 2008. The effect of word length in short-term memory: Is rehearsal necessary? Quarterly Journal of Experimental Psychology, 61:5, 724-734.

Campoy, G., 2011. Retroactive interference in short-term memory and the word-length effect. Acta Psychol. 138, 135-142.

Coltheart, V., Langdon, R., 1998. Recall of short word lists presented visually at fast rates: Effects of phonological similarity and word length. Memory and Cognition, 26, 330-342.

Elts, J., 1995. Word length and its semantic complexity, in Family and textbooks: 115-126. Tartu: University of Tartu.

Fucks W., 1956. Die Mathematischen Gesetze der Bildung von Sprachelementen aus Ihren Bestandteilen, Nachrichtentechnische Fachberichte 3:7-21.

Greenberg, J., 1966. Universals of language. Cambridge, MA: MIT Press.

Gruenewald, P.J. and Lockhead, G.R., 1980. The free recall of category examples. J. Exp. Psychol. [Hum. Learn.] 6, 225-240.

Grzybek P., 2007. History and methodology of word-length studies, in "Contributions to the Science of Text and Language", Dordrecht: Springer, pp. 15-90.

Haspelmath, M., 2006. Against markedness (and what to replace it with). Journal of Linguistics, 42 (01), 25-70.

Healey, M. K. and Kahana, M. J., 2016. A four-component model of age-related memory change. Psychological Review, 123(1), 23-69.

Howard, M. W. and Kahana, M. J., 2002a. A distributed representation of temporal context. Journal of Mathematical Psychology, 46, 269-299.

Howard, M. W. and Kahana, M. J., 2002b. When does semantic similarity help episodic retrieval? Journal of Memory and Language, 46, 85-98.

Hulme, C., Surprenant, A. M., Bireta, T. J., Stuart, G., and Neath, I., 2004. Abolishing the word-length effect. J. Exp. Psychol. Learn. Mem. Cogn. 30, 98-106.

Jalbert, A., Neath, I., Bireta, T. J., and Surprenant, A. M., 2011. When does length cause the word length effect? J. Exp. Psychol. Learn. Mem. Cogn. 37, 338-353.

Kahana M. J., 1996. Associative retrieval processes in free recall. Memory and Cognition 24:103-9.

Katkov M., Romani S., Tsodyks M., 2014. Word length effect in free recall of randomly assembled word lists. Frontiers in Computational Neuroscience 8:129.

Klare, G.R., 1988. The formative years. In B.L. Zakaluk and S.J. Samuels (Eds.), Readability, its past, present and future (pp. 14-34). Newark, Delaware: IRA.

Kruglanski A.W. and Higgins E.T., 2007. Social Psychology: Handbook of Basic Principles, Second Edition. The Guilford Press (p. 642).

Mahowald K., Fedorenko E., Piantadosi S.T., Gibson E., 2012. Info/information theory: speakers actively choose shorter words in predictable contexts, Cognition, 126:313-318.

Lewis, M. L. and Frank M.C., 2016. The length of words reflects their conceptual complexity. Cognition 153:182-195.

Mikk J., Heli U., and Elts J., 2001. Word length as an indicator of semantic complexity, in Text as a linguistic paradigm: levels, constituents, constructs, Festschrift in honour of Ludek Hrebcek. Trier, 187-195.

Miller J.F., Weidemann C.T., Kahana M.J., 2012. Recall termination in free recall. Memory and Cognition 40:4, 540-550.

Murdock B.B., 1960. The immediate retention of unrelated words. Journal of Experimental Psychology 60:222-234.

Murdock B.B., 1962. The serial position effect of free recall. Journal of Experimental Psychology, 64(5):482-488.

Murdock B.B. and Okada R., 1970. Interresponse times in single-trial free recall. Journal of Experimental Psychology 86:263-267.

Neath I., Bireta T.J., Surprenant A.M., 2003. The time-based word length effect and stimulus set specificity, Psychonomic Bulletin and Review 10(2):430-434.

Neath I., Brown G. D. A., 2006. SIMPLE: further applications of a local distinctiveness model of memory, in The Psychology of Learning and Motivation, ed. Ross B. H. (San Diego, CA: Academic Press), 201-243.

Piantadosi S. T., Tily H. and Gibson E., 2011. Word lengths are optimized for efficient communication, Proceedings of the National Academy of Sciences, 108, 9:3526.

Piantadosi S. T., Tily H. and Gibson E., 2011B. Reply to Reilly and Kean: Clarifications on word length and information content, Proceedings of the National Academy of Sciences, 108, 20:E109.

Polyn, S. M., Norman, K. A., and Kahana, M. J., 2009a. A context maintenance and retrieval model of organizational processes in free recall. Psychological Review, 116, 129-156.

Polyn, S. M., Norman, K. A., and Kahana, M. J., 2009b. Task context and organization in free recall. Neuropsychologia, 47, 2158-2163.

Roberts W.A., 1972. Free recall of word lists varying in length and rate of presentation: a test of total-time hypotheses. Journal of Experimental Psychology 92:365-372.

Romani S., Pinkoviezky I., Rubin A., Tsodyks M., 2013. Scaling laws of associative memory retrieval. Neural Computation 25:2523-2544.

Russo R. and Grammatopoulou N., 2003. Word length and articulatory suppression affect short-term and long-term recall tasks. Memory and Cognition 31:728-737.

Sederberg, P. B., Howard, M. W., and Kahana, M. J., 2008. A context-based theory of recency and contiguity in free recall. Psychological Review, 115(4), 893-912.

Standing L., 1973. Learning 10,000 pictures. Quarterly Journal of Experimental Psychology 25:207-222.

Tehan G. and Tolan G.A., 2007. Word length effects in long-term memory. Journal of Memory and Language 56:35-48.

Xu Zhan and Li Bi-Qin, 2009. The Mechanism of Reverse Word Length Effect of Chinese in Working Memory. Acta Psychologica Sinica, Vol. 41 Issue (09): 802-811.