arXiv:1412.6616v1 [cs.CL] 20 Dec 2014

Under review as a conference paper at ICLR 2015

Outperforming Word2Vec on Analogy Tasks with Random Projections


Abram Demski
Department of Computer Science and Institute for Creative Technologies
Playa Vista, CA 90094 USA
[email protected]

Volkan Ustun
Institute for Creative Technologies
Playa Vista, CA 90094 USA
[email protected]

Paul Rosenbloom
Department of Computer Science and Institute for Creative Technologies
Playa Vista, CA 90094 USA
[email protected]

Cody Kommers
Department of Psychology
University of California, Los Angeles
[email protected]

Abstract

We present a distributed vector representation based on a simplification of the BEAGLE system, designed in the context of the Sigma cognitive architecture. Our method does not require gradient-based training of neural networks, matrix decompositions as with LSA, or convolutions as with BEAGLE. All that is involved is a sum of random vectors and their pointwise products. Despite the simplicity of this technique, it gives state-of-the-art results on analogy problems, in most cases better than Word2Vec. To explain this success, we interpret it as a dimension reduction via random projection.

1 Introduction

DVRS is a distributed representation of words designed for the Sigma cognitive architecture (Ustun et al., 2014). Inspired by the design of BEAGLE (Jones & Mewhort, 2007), DVRS constructs a vector representation for each word by summing vectors that represent its experiences. The details of DVRS were influenced strongly by the Sigma cognitive architecture.

Implementing distributed representations in a cognitive architecture such as Sigma provides design constraints, due to the underlying theory of cognition which the architecture embodies. Unlike pure machine learning, performance is not the only goal; we also hope to find cognitive principles. Nonetheless, the performance turned out to be quite good.

DVRS does not rely on the gradient-based training of neural embeddings (Mikolov et al., 2013; Mnih & Kavukcuoglu, 2013), or on the optimization of any other objective as with GloVe (Pennington et al., 2014). It does not involve a matrix factorization as with latent semantic analysis and similar techniques (Baroni et al., 2014). It does not even require the circular convolutions of BEAGLE and earlier holographic representations (Jones & Mewhort, 2007; Plate, 1995). The representations are built up from random vectors using only vector sums and pointwise products. As a result, it is especially simple to implement.


This paper presents further work on the DVRS system. The performance of DVRS and Word2Vec is examined side-by-side on analogy problems, and DVRS has a higher accuracy in a majority of cases. In order to explain the success of the algorithm, a theoretical account is provided, discussing its interpretation as a random projection method. The analysis shows that DVRS can handle a vocabulary size exponential in the vector size.

2 Distributed Representations of Words

Vector representations put words in a linear space, allowing for geometric computations and often capturing useful semantic information. Each word is associated with a vector of numbers which defines its "location" in the semantic space. If the representation is good, words representing similar concepts will end up near each other. Directions in the space can also carry meaning, which is useful in the solution of analogy problems (Sections 3.1 and 3.2).

Distributed representations of various kinds have been experimented with since the 1960s, according to the technical report of Hinton (1984). Already in that survey, distributed representations of words were a topic. Hinton (1984) also discussed the binding problem, which is the question of how to build up a representation of a pair of objects. Plate (1995) describes a system of distributed representations which solves the binding problem using a method called circular convolution. This is a modification of standard convolution that keeps the size of the representations from growing as a result of binding. The binding operation and a set of other operations on the vectors form a kind of algebra over concepts.[1]

The BEAGLE system (Jones & Mewhort, 2007) uses circular convolution binding to build up distributed representations of words. BEAGLE reads text and constructs vector representations in an unsupervised way based on co-occurrence and ordering information. Each word $w_i$ is assigned a random vector which does not change, called the environment vector (which we will denote as $\vec{e}_i$). These are used to build up the semantic representation, which in BEAGLE is called the lexical vector (we will write $\vec{l}_i$). This terminology will be carried over to the discussion of DVRS for the sake of standardization. The lexical vector for a word is constructed as the sum of context information and ordering information. Context information refers to simple co-occurrence: when word $i$ occurs in the same sentence as word $j$, we add the environment vector $\vec{e}_j$ into the lexical vector $\vec{l}_i$. Ordering information consists of vectors $\vec{o}_i$ which capture information about the n-tuples within which the word occurs. BEAGLE constructs these using a formula involving circular convolution, and then adds them into the lexical vector.

DVRS, BEAGLE, and their precursors were motivated partially by performance, but largely by cognitive-modeling considerations. This includes a concern for how human memory works, and a desire to find a set of cognitive mechanisms which can model it. This is very different from the machine learning approach, which puts performance first and unifying principles second. Latent Semantic Analysis (LSA) is a well-established and widely used algorithm in this category, but rapid progress has recently been made in neural-network-based embeddings. Neural models are compared extensively to LSA-style approaches by Baroni et al. (2014). Neural models, and in particular the Word2Vec model, are declared the winner. We will be using Word2Vec as our comparison.

In terms of a cognitive algebra, BEAGLE can be seen as the result of just two operations: vector addition and circular convolution. Vector addition gathers things together like a multiset or bag, while circular convolution (as the binding operation) establishes relationships between things, building ordered pairs.

[1] The algebraic view seems common in the field. Different systems have different basic operations. Algebraic features of the operations such as commutativity, associativity, and the existence of an inverse are of interest.



The design of DVRS is similar to this, but does not use the circular convolution operation from Plate (1995). Circular convolution is O(n log n), and the operation is not conveniently supported within the Sigma architecture. A more natural approach was pointwise multiplication: combining two vectors by multiplying their corresponding elements. This fit well with the existing operations in Sigma, which uses the summary-product algorithm (Ustun et al., 2014), and is less expensive, taking O(n) time.
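To make the difference concrete, here is a minimal sketch (assuming NumPy; the function and variable names are ours, not part of Sigma or BEAGLE) of the two candidate binding operations: circular convolution computed via the FFT, and the pointwise product used by DVRS.

    import numpy as np

    def circular_convolution(v, w):
        # O(d log d) binding used by BEAGLE: by the convolution theorem,
        # circular convolution is the inverse FFT of the product of FFTs.
        return np.real(np.fft.ifft(np.fft.fft(v) * np.fft.fft(w)))

    def pointwise_product(v, w):
        # O(d) binding used by DVRS: elementwise multiplication.
        return v * w

    rng = np.random.default_rng(0)
    d = 512                                   # dimensionality (hypothetical)
    v = rng.uniform(-1.0, 1.0, d)
    w = rng.uniform(-1.0, 1.0, d)
    print(circular_convolution(v, w)[:3])
    print(pointwise_product(v, w)[:3])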

3 The DVRS Model

Like BEAGLE, for each word $w_i$ DVRS has an associated environment vector $\vec{e}_i$, and we will build a lexical vector $\vec{l}_i$. The environment vectors are random unit vectors which remain fixed. They are formed by sampling their elements uniformly from $[-1, 1]$ and normalizing the result to length 1. All vectors in the system are of a set dimensionality $d$. When we need to indicate individual dimensions of a vector $\vec{v}$, they will be notated $v_i$ with no arrow above, so $\vec{v} = \langle v_1, v_2, \ldots, v_d \rangle$. An indexed vector variable like $\vec{v}_i$, on the other hand, represents one of a collection of vectors $\vec{v}_1, \vec{v}_2, \ldots$.

The lexical vectors are constructed from environment vectors and ordering vectors $\vec{o}$. The term feature (indicated as $\vec{f}_i$) will be used to refer generically to either an environment vector or an ordering vector; these play the role of individual features in the representation. Note that the number of $\vec{e}_i$ and the number of $\vec{l}_i$ are both equal to the number of words in the vocabulary, but the number of $\vec{o}_i$ will be larger, and the number of $\vec{f}_i$ (being the union of the $\vec{e}_i$ and $\vec{o}_i$) larger still.

The lexical vector $\vec{l}_i$ is built incrementally by summing the features relevant to each instance of word $w_i$. We will refer to the current instance being processed as the target word. For each target word, we collect co-occurrence information and ordering information. The co-occurrence information is very similar to BEAGLE, but the ordering information differs significantly. The information stored in these features is more similar to the skip-gram architecture of Mikolov et al. (2013) than to the n-gram architecture of BEAGLE. DVRS uses the binding operation only to make pairwise bindings of words and word locations, whereas BEAGLE applies it successively to bind short strings of words together.

Additionally, the choice of binding operation in DVRS is different. BEAGLE used circular convolution to bind two vectors together into a third. DVRS instead uses the pointwise product: for two vectors $\vec{v}$ and $\vec{w}$, the pointwise product is $\vec{v} \mathbin{.*} \vec{w} = \langle v_1 w_1, \ldots, v_d w_d \rangle$. We performed limited tests showing no advantage to circular convolution, detailed in Ustun et al. (2014). A similar formula to ours is mentioned in Mnih & Kavukcuoglu (2013).

The co-occurrence features consist of the environment vectors of all other words in the current paragraph.[2] Ordering information, on the other hand, is collected from words immediately surrounding the target word in the sentence. For each position in a window around the word, we have a sequence vector which encodes that location. The window is currently set to include four words in either direction, so we have eight sequence vectors, $\vec{s}_{-4}, \vec{s}_{-3}, \vec{s}_{-2}, \vec{s}_{-1}, \vec{s}_1, \vec{s}_2, \vec{s}_3, \vec{s}_4$. Like the environment vectors, these eight vectors remain constant, and are formed by uniformly sampling their entries from the range $[-1, 1]$ and normalizing. To compute the ordering vector $\vec{o}$ which will be used as a feature, the binding operation is applied to combine the environment vector of the context word in location $j$ (relative to the current target word) with the sequence vector $\vec{s}_j$. This gives a vector representing a specific word at a specific location (relative to the target word). These features are not collected across sentence boundaries, however; for example, when the target word is the last in a sentence, only four ordering vectors are collected.

[2] A typical set of stop-words was excluded from the co-occurrence information. Stop-words were not excluded from the ordering information, however.
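As a minimal illustration of this setup (NumPy assumed; the names, seed, and dimensionality are ours and purely hypothetical), the fixed environment and sequence vectors can be generated, and an ordering feature bound, as follows.

    import numpy as np

    rng = np.random.default_rng(1)
    d = 512                                    # dimensionality of all vectors

    def random_unit_vector(d):
        # Elements sampled uniformly from [-1, 1], then normalized to length 1.
        v = rng.uniform(-1.0, 1.0, d)
        return v / np.linalg.norm(v)

    # Fixed sequence vectors for the eight relative positions -4..-1 and 1..4.
    sequence = {j: random_unit_vector(d) for j in range(-4, 5) if j != 0}

    def ordering_feature(env_vec, j):
        # Bind the environment vector of the word at offset j to that offset's
        # sequence vector; the pointwise product is the binding operation.
        return env_vec * sequence[j]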
The co-occurrence information and the ordering information are separately summed and normalized, so that we have two vectors of length 1, representing the two kinds of information.



This allows us to re-weight the information before adding them together. We found that a weight of 1 for co-occurrence and 0.6 for ordering information worked well. The combined vector is then added to the lexical vector, and we set the target word to the next word in the corpus.

Putting it all together, define $occ(w_i, w_j)$ to be the set of co-occurrences between $w_i$ and $w_j$ in the paragraphs, and let $occ(w_i, w_j, n)$ be the subset where $w_j$ occurs exactly $n$ words after $w_i$ (with negative $n$ indicating an occurrence before $w_i$). The lexical vector for a word is obtained as follows:

$$\vec{l}_i = \sum_{occ(w_i, w_j)} \vec{e}_j + 0.6 \sum_{n} \sum_{occ(w_i, w_j, n)} \vec{e}_j \mathbin{.*} \vec{s}_n$$

where $n$ ranges from $-4$ to $4$, excluding zero. This simple sum of features performs surprisingly well.
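A minimal sketch of this update for a single target-word instance is given below (NumPy assumed; `lexical` and `environment` are hypothetical dictionaries from words to d-dimensional arrays, and `sequence` maps relative positions to the fixed sequence vectors, as in the earlier sketch). It is an illustration of the formula above rather than the reference implementation.

    import numpy as np

    def normalize(v):
        n = np.linalg.norm(v)
        return v / n if n > 0 else v

    def update_lexical(lexical, environment, sequence, paragraph, sentence, t,
                       stop_words, weight=0.6, window=4):
        """Add the features for one instance of the target word sentence[t]."""
        target = sentence[t]
        # Co-occurrence information: environment vectors of the other
        # (non-stop) words in the current paragraph.
        cooc = np.zeros_like(lexical[target])
        for w in paragraph:
            if w != target and w not in stop_words:
                cooc += environment[w]
        # Ordering information: each in-window, in-sentence neighbour's
        # environment vector, bound (pointwise product) to the sequence
        # vector for its offset from the target.
        order = np.zeros_like(lexical[target])
        for j in range(-window, window + 1):
            if j != 0 and 0 <= t + j < len(sentence):
                order += environment[sentence[t + j]] * sequence[j]
        # Normalize each kind of information separately, re-weight, and add in.
        lexical[target] += normalize(cooc) + weight * normalize(order)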

3.1 Google Word Analogy Test

Mikolov et al. (2013) showed that distributed vector representations capture various relational similarities, and that these similarities can be recovered and used to solve word analogy tasks. These tasks ask, for example, "What is the word that is similar to small in the same way that biggest is similar to big?" We would like to answer this question by first inferring the relation and then applying it to the first word, small in this case. Such questions can be answered by performing simple algebraic operations over the vector representations. To find the word that is most similar to small in the same way that biggest is similar to big, one can compute the vector $\vec{V} = (\vec{l}_{biggest} - \vec{l}_{big}) + \vec{l}_{small}$ and determine which word's lexical vector is closest to this vector according to cosine similarity. This is interpreted as finding the direction (in the embedding space) from big to biggest, and then looking in the same direction from small to find smallest.

The results reported in this section use the Google dataset introduced in Mikolov et al. (2013), GOOGLE_full, and a variant of it, GOOGLE_reduced, generated by randomly selecting approximately 10% of the questions in the Google dataset. The GOOGLE_full dataset contains 19,544 questions and GOOGLE_reduced contains 1,986 questions. The results obtained using the GOOGLE_full dataset are reported in parentheses in the tables. The GOOGLE_reduced dataset is representative of the GOOGLE_full dataset, as indicated by the results in the last column of Table 3. The reduced set was therefore used for most of the experiments.

Three different datasets are used to train DVRS models: (1) Enwik8 is the first $10^8$ bytes of the English Wikipedia dump of March 3, 2006, (2) Enwik9 is the first $10^9$ bytes of the same dump, and (3) Enwik2013 is the full English Wikipedia dump of August 5, 2013. Each training set is preprocessed to contain only lowercase characters and spaces, using the script from Mahoney.

Table 1 shows the accuracy of the DVRS model trained on Enwik8 and Enwik9 with various vector dimensions. The accuracy improves significantly with Enwik9 over Enwik8, and larger vector sizes result in better accuracy with diminishing returns; the accuracy plateaus for vector sizes over 1000. It is apparent from Table 2 that co-occurrence information is the main contributor to the accuracy achieved on the Google Word Analogy Test. The impact of the ordering information by itself is limited; however, the composition of co-occurrence and ordering information works substantially better than the sum of their individual performances.

Table 3 provides a comparison between DVRS and Word2Vec.[3] The Word2Vec code is run with the standard parameter set defined in the Word2Vec download. The accuracy number for Word2Vec trained on the large Enwik2013 dataset is taken from Levy & Goldberg (2014). DVRS performs better than Word2Vec on Enwik8 and Enwik9 for vector sizes of 200 and 512.

[3] The results in this paper use the 2013 Word2Vec. A new version of Word2Vec was released in September 2014, which our initial tests show to perform significantly better, outperforming DVRS on the Google analogy task. No publication yet explains the algorithmic changes involved, so it is difficult to draw a meaningful conclusion from this as of yet.
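The analogy procedure described at the start of this section reduces to a few lines. The sketch below (NumPy assumed; `lexical` is a hypothetical dictionary from words to unit-length lexical vectors) scores every candidate word by cosine similarity against the offset vector.

    import numpy as np

    def solve_analogy(lexical, a, b, c):
        """Return the word whose lexical vector is closest, by cosine
        similarity, to (l_b - l_a) + l_c, e.g. a='big', b='biggest', c='small'."""
        target = lexical[b] - lexical[a] + lexical[c]
        target = target / np.linalg.norm(target)
        best_word, best_sim = None, -np.inf
        for word, vec in lexical.items():
            if word in (a, b, c):              # the question words are excluded
                continue
            sim = np.dot(vec, target) / np.linalg.norm(vec)
            if sim > best_sim:
                best_word, best_sim = word, sim
        return best_word

    # Ideally: solve_analogy(lexical, 'big', 'biggest', 'small') == 'smallest'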



Table 1: Basic DVRS Results

Vector Size    Enwik8           Enwik9
100            15.5%            32.5%
200            22.9%            41.7%
512            28.3%            46.5%
1024           30.1% (30.5%)    47.1%

Table 2: Detailed DVRS Results for Enwik9

Vector Size    Co-occurrence    Ordering    Composite
100            25.4%            3.3%        32.5%
200            32.4%            4.6%        41.7%
512            35.2%            5.5%        46.5%
1024           36.2%            5.8%        47.1%

Surprisingly, however, the DVRS model did not show any improvement with the very large Enwik2013 dataset over Enwik9, unlike Word2Vec. This causes DVRS to trail significantly behind Word2Vec on that dataset. It seems possible that DVRS is hitting a plateau, unable to improve with more data.

Table 4 presents a breakdown of the accuracy of DVRS and Word2Vec trained on Enwik9 with a vector size of 512. The DVRS model captures the semantic information better, whereas the original Word2Vec captures the syntactic information better.

A straightforward implementation of the DVRS model in Common Lisp, without any optimized parallel processing, takes 18.5 minutes to train on Enwik8 with a vector size of 100 on a MacBookAir4,2. It takes 4.5 minutes for Word2Vec to train with the same data on the same machine using the standard configuration, which utilizes 20 threads.

3.2 Miller Analogies Test

The Miller Analogies Test (MAT) is an assessment of analytical skills used for graduate school admissions, similar to the Graduate Record Examinations (GRE). Performance on the MAT predicts individual graduate school grade point average better than GRE scores or undergraduate GPA. As a rigorous test of cognitive abilities for individuals of above-average intelligence, competitive results on the MAT indicate sophisticated reasoning and representation similar to that of a well-educated human.

We selected 150 MAT example questions from a free online test-preparation website to compare the performance of DVRS and Word2Vec. These analogies require sophisticated vocabulary and complex relational representation. Each analogy takes the form A:B::C:(a,b,c,d), that is, "A is to B as C is to which of the four options?" For example, germinal is to senescent as nascent is to (a) sophomoric, (b) covetous, (c) moribund, or (d) shrewd? (Answer: moribund.) Better is to worse as ameliorate is to (a) elucidate, (b) engender, (c) accredit, or (d) exacerbate? (Answer: exacerbate.)

Table 3: Comparison with Word2Vec

Dataset      Vector Size    DVRS             Word2Vec
Enwik8       200            22.9%            19.1%
Enwik8       512            28.3%            18.4%
Enwik9       200            41.7%            38.7%
Enwik9       512            46.5%            41.7%
Enwik2013    600            45.6% (46.6%)    (62.7%)



Table 4: Breakdown of relational similarities for Enwik9, Vector Size = 512

Relation                       Word2Vec    DVRS
capital-common-countries       73.53%      76.47%
capital-world                  47.53%      78.71%
currency                       3.49%       2.33%
city-in-state                  45.37%      43.17%
family (gender inflections)    68.18%      52.27%
Semantic                       45.73%      60.79%
gram1-adjective-to-adverb      7.07%       8.08%
gram2-opposite                 15.79%      18.42%
gram3-comparative              60.0%       28.46%
gram4-superlative              11.76%      9.56%
gram5-present-participle       36.36%      33.33%
gram6-nationality-adjective    66.85%      77.53%
gram7-past-tense               25.47%      23.6%
gram8-plural (nouns)           55.81%      58.91%
gram9-plural-verbs             46.59%      29.55%
Syntactic                      38.5%       34.95%
Overall                        41.7%       46.5%

In our tests, Word2Vec did a little better than DVRS: 46.9% for DVRS versus 50.0% for Word2Vec. Answering these analogies correctly requires complex representations of words that are seen relatively few times in the training data. Based on a conversion chart from California State University, Stanislaus, a raw MAT score of 46% converts to a scaled score of 408. Although the percentile ranking depends on the group with whom the test is taken, 400 tends to be near the 50th percentile because the test is scored from 200 to 600. Therefore, DVRS and Word2Vec appear to be on par with the average human graduate school applicant.

4 Random Projections

It would be helpful to explain the performance of DVRS with a mathematical model of what it is computing. Why does adding random vectors together give us anything of value? Why is it possible to solve analogy problems with vector addition and subtraction, and why is the cosine similarity semantically meaningful? The analysis of this section cannot completely answer these questions, but useful things can be said by viewing DVRS as a random projection method.

Random projections are a very simple dimensionality reduction technique, which applies a randomly chosen linear transformation (a random matrix) from a high-dimensional space to a lower-dimensional one. The method is justified by the Johnson-Lindenstrauss lemma, which establishes that relative distances are preserved with high probability under random projection (Indyk & Motwani, 1998). This shows that almost all dimension reductions are good (in the sense of preserving relative distance), so it suffices to choose one randomly. Random projections typically compare well to more traditional dimensionality reduction techniques such as singular value decomposition and principal component analysis (Bingham & Mannila, 2001). An early application to computational learning theory was via the notion of robust concepts: classes which have a large distance between their closest data points, that is, a large margin (Arriaga & Vempala, 1999).

Other useful results about random vectors include an application to clustering. When clustering very high-dimensional data, it is useful to reduce computational cost by first applying dimension reduction to the data. For learning Gaussian clusters, it was shown that random projections can perform well with a number of dimensions logarithmic in the number of clusters, whereas principal component analysis may require a number of dimensions linear in the number of clusters (Dasgupta, 2000). Another result, of great relevance to this paper, is the ability to fit exponentially many nearly-orthogonal vectors into a space. This will be discussed in more detail in Section 4.1. We will see that there is a similar logarithmic requirement for distributed representations of words, which means that we can represent a large vocabulary with reasonably small vector sizes.
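A small numerical experiment (NumPy assumed; the sizes are arbitrary) illustrates the distance-preservation property: after multiplying high-dimensional points by a random matrix, the ratios of pairwise distances stay close to 1.

    import numpy as np

    rng = np.random.default_rng(2)
    F, d, n_points = 2000, 128, 40            # original dim, reduced dim, sample size

    # A random projection is just multiplication by a random F x d matrix,
    # scaled so that lengths are preserved in expectation.
    R = rng.normal(0.0, 1.0 / np.sqrt(d), size=(F, d))
    X = rng.normal(size=(n_points, F))        # some high-dimensional points
    Y = X @ R                                 # their low-dimensional images

    def pairwise_dists(A):
        return np.linalg.norm(A[:, None, :] - A[None, :, :], axis=-1)

    mask = ~np.eye(n_points, dtype=bool)
    ratios = pairwise_dists(Y)[mask] / pairwise_dists(X)[mask]
    print(ratios.min(), ratios.max())         # both close to 1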

4.1 Properties of Random Vectors

The most important property for the system is the meaning of the cosine similarity. Why does this measure of similarity work well? What can we say about the numbers which come out of it? The argument of this section is that cosine similarity counts the fraction of features that two lexical vectors have in common.

The cosine similarity of two vectors $\vec{v}$ and $\vec{w}$ is their dot product divided by the product of their lengths, $\vec{v} \cdot \vec{w} / |\vec{v}||\vec{w}|$. (DVRS normalizes all vectors to length 1, so there is no difference between the cosine similarity and the dot product.) The dot product of two independent random vectors of unit length is very probably very close to zero. In other words, random vectors are nearly orthogonal. This has been known for some time, but was studied in detail recently by Cai et al. (2013). Due to this, it is possible to sum random vectors $\vec{s} = \sum_i c_i \vec{v}_i$ and recover the counts $c_i$ by taking the dot product $\vec{v}_i \cdot \vec{s}$. The dot product distributes over the sum, and $\vec{v}_i \cdot \vec{v}_i = 1$. Because the other vectors contribute almost zero to the result, the result is close to $c_i$. This allows us to recover feature frequencies from cosine similarity.

The near-orthogonality of random vectors is the key. A basic statistical analysis of the phenomenon is presented here to develop insight into why this happens. Each random vector $\vec{v} \in \mathbb{R}^d$ is created by sampling $d$ times from a random variable $X$ symmetric about zero with standard deviation $\sigma$. The cosine similarity of two such vectors, $sim(\vec{v}, \vec{w})$, can be written as:

$$\sum_{i=1}^{d} \frac{v_i w_i}{|\vec{v}||\vec{w}|} \qquad (1)$$

It is easy to see that the expected value is zero, since the symmetry about zero is preserved from the distribution of X. To show that it is in fact quite close to zero, we examine the variance. The variance of a sum of random variables is the sum of their pairwise covariances. The variance of (1) becomes:

$$\mathrm{Var}[sim(\vec{v}, \vec{w})] = \sum_{i=1}^{d} \sum_{j=1}^{d} E\left[ \frac{v_i w_i v_j w_j}{\sum_{k=1}^{d} v_k^2 \sum_{l=1}^{d} w_l^2} \right] \qquad (2)$$

$$= \sum_{i=1}^{d} \sum_{j=1}^{d} E\left[ \frac{v_i v_j}{\sum_{k=1}^{d} v_k^2} \right] E\left[ \frac{w_i w_j}{\sum_{l=1}^{d} w_l^2} \right] \qquad (3)$$

All the expected values in this are zero except when $i = j$, because whenever $v_i v_j$ is positive, it is equally likely to have been negative, with the same denominator (and similarly for the expression involving $\vec{w}$).

$$= \sum_{i=1}^{d} E\left[ \frac{v_i^2}{\sum_{k=1}^{d} v_k^2} \right] E\left[ \frac{w_i^2}{\sum_{l=1}^{d} w_l^2} \right] = \sum_{i=1}^{d} E\left[ \frac{v_i^2}{\sum_{k=1}^{d} v_k^2} \right]^2 \qquad (4)$$

The fraction inside the expectation is just the result of taking $d$ samples from $X^2$ and dividing the $i$-th by the sum of them all. By symmetry in $i$, this must have expected value $1/d$. Thus we have:

$$= \sum_{i=1}^{d} \frac{1}{d^2} = 1/d \qquad (5)$$


Intuitively, what's going on in this computation is that while the length normalization stops us from assuming independence of the different entries in the vector, they nonetheless have no covariance thanks to the symmetry of the distribution. As a result, the variance of the dot product can be computed from the variance of the individual products. These are $1/d^2$ as a result of the normalization, so their sum is $1/d$. Applying Chebyshev's inequality,

$$P(|sim(\vec{v}, \vec{w})| \geq \epsilon) \leq \frac{1}{\epsilon^2 d} \qquad (6)$$

Thus, for large enough $d$, the chance of similarity above any desired $\epsilon$ becomes arbitrarily small. This proof is similar to a typical proof of the law of large numbers. If we were dividing by $d$ rather than dividing by $|\vec{v}||\vec{w}|$, the law of large numbers would apply directly: we would be taking the mean of the pointwise product of vectors. Since the expected value is zero, we would then conclude that the mean will converge to zero. What has been shown here is that the vector normalization serves the same role, driving the result closer to zero as the size of the vectors is increased.

Cai et al. (2013) examine the properties of a set of $n$ random vectors of dimensionality $d$.[4] Their results go beyond simple pairwise properties of two random vectors. Since a distributed vector representation is likely to use a large number of vectors, this is quite useful to the current subject. They show (Theorem 6) that as $d \to \infty$, if $\log(n)/d \to 0$, then the minimum and maximum angles between any of the $n$ vectors approach 90°. This is an extremely strong result; it means that for large $d$, we would have to be using exponentially many random vectors before we would notice significant collisions between any two of them. Furthermore, they show (Theorem 8) that we can put a bound on accidental similarity with high probability: given a particular ratio $\beta = n/e^d$, the minimum and maximum similarities converge to $\pm\sqrt{1 - e^{-4\beta}}$. Therefore, for a fixed tolerance $\epsilon > 0$ for erroneous similarity, the maximum number of random vectors we can use is about:

$$n = -\frac{e^d \ln(1 - \epsilon^2)}{4} \qquad (7)$$
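For a rough sense of the scale implied by (7), the following sketch evaluates the bound in log space so that $e^d$ does not overflow; the chosen values of $d$ and $\epsilon$ are arbitrary.

    import math

    def log10_max_vectors(d, eps):
        # log10 of n = -e^d * ln(1 - eps^2) / 4 from equation (7),
        # computed in log space because e^d itself would overflow.
        return d * math.log10(math.e) + math.log10(-math.log(1.0 - eps**2) / 4.0)

    for d in (100, 512, 1024):
        print(d, round(log10_max_vectors(d, eps=0.1), 1))
    # d = 512 already allows on the order of 10^220 nearly-orthogonal vectors.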

Again, this is based on the worst accidental similarity between any two random vectors. This indicates that the cosine similarity will likely be less than $\epsilon$ for all pairs of random vectors in the system, until we exceed this number. (The rule is still probabilistic, however!)

To understand the vectors used in DVRS, it is also necessary to examine the properties of pointwise products (the binding operation). Recall that the pointwise product is given by the formula $\vec{v} \mathbin{.*} \vec{w} = \langle v_1 w_1, \ldots, v_d w_d \rangle$. In order for the pointwise product to act as if it were a new random vector, it should be nearly orthogonal to its constituent vectors. Consider the dot product $\vec{v} \cdot (\vec{v} \mathbin{.*} \vec{w}) = \sum_{i=1}^{d} \frac{v_i^2 w_i}{z_v^2 z_w}$, where $z_v$ and $z_w$ are the normalizing constants $|\vec{v}|$ and $|\vec{w}|$. The expected value of this is zero as before, and similarly, the covariance of different terms is zero. So, we compute the variance as:

$$\mathrm{Var}[\vec{v} \cdot (\vec{v} \mathbin{.*} \vec{w})] = \sum_{i=1}^{d} E\left[\left(\frac{v_i^2 w_i}{z_v^2 z_w}\right)^2\right] = \sum_{i=1}^{d} E\left[\frac{v_i^4}{z_v^4}\right] E\left[\frac{w_i^2}{z_w^2}\right] = \frac{1}{d}\sum_{i=1}^{d} E\left[\frac{v_i^4}{z_v^4}\right] \leq \frac{1}{d} \qquad (8)$$

To justify the final inequality, let $Y_i$ be $d$ random variables distributed as $X^2$. The expression $\sum_{i=1}^{d} E(v_i^4/z_v^4) = E\left(\sum_{i=1}^{d} v_i^4 / (\sum_{i=1}^{d} v_i^2)^2\right)$ is then $E\left((\sum_{i=1}^{d} Y_i^2)/(\sum_{i=1}^{d} Y_i)^2\right)$. This is the sum of the squares of positive numbers divided by the square of their sum, so it must be less than or equal to 1. This allows us to conclude that the variance of $\vec{v} \cdot (\vec{v} \mathbin{.*} \vec{w})$ goes to zero at least as fast as the variance of $\vec{v} \cdot \vec{w}$. As a result, we conclude that $\vec{v} \mathbin{.*} \vec{w}$ acts as if it is a new random vector, nearly orthogonal to both its constituent vectors.

[4] They make the assumption that $X$ is normally distributed, which we do not make in DVRS. This is an elegant assumption, because it is equivalent to sampling points uniformly from the unit hypersphere. However, the properties of random vectors are not very sensitive to this assumption, since the sum of iid random numbers approximates a normal distribution quickly. This can be verified with very simple experiments.
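These claims are easy to check empirically. The sketch below (NumPy assumed) works with normalized vectors rather than the explicit $z_v$, $z_w$ form above, so it is an illustration of the conclusions rather than a reproduction of the derivation: it estimates the variance of the cosine similarity of random vectors and checks that a vector is nearly orthogonal to its pointwise product with another.

    import numpy as np

    rng = np.random.default_rng(3)
    d, trials = 512, 2000

    def unit(v):
        return v / np.linalg.norm(v)

    sims, bound_sims = [], []
    for _ in range(trials):
        v = unit(rng.uniform(-1.0, 1.0, d))
        w = unit(rng.uniform(-1.0, 1.0, d))
        sims.append(np.dot(v, w))                   # cosine of two random vectors
        bound_sims.append(np.dot(v, unit(v * w)))   # cosine of v with v .* w

    print(np.var(sims), 1.0 / d)      # empirical variance vs. the 1/d prediction
    print(np.max(np.abs(sims)))       # worst accidental similarity observed
    print(np.max(np.abs(bound_sims))) # bound vectors are also nearly orthogonal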



Table 5: Example Random Projection

   f1        f2        f3        f4        f5
-0.7694    0.4210   -0.1594    0.4082   -0.7634
-0.3487    0.2248   -0.3334   -0.2077   -0.4312
 0.0840    0.3628   -0.7870   -0.5662   -0.7645

4.2 DVRS as a Random Projection

DVRS assigns random vectors to each word, and through the binding operation assigns vectors which act random to the ordering information as well.[5] This amounts to an $F \times d$ projection matrix, where $F$ is the number of features. We compute the matrix multiplication by iterating through the data and summing up the vectors for the features we observe.

In Table 5, we see a very small example projection of 5 features down to a 3-dimensional space. The features could be co-occurrences with words, for example. The numbers in the columns correspond to the environment vectors for each word. In order to generate the lexical vector for one of the words, we count co-occurrences in the data. This gives us a co-occurrence vector with 5 elements. Table 5 is the projection matrix which transforms this co-occurrence count vector into the lexical vector for the word.

The set of feature vectors can be thought of as a pseudo-basis for the reduced space. A true basis for a vector space is a set of vectors that are linearly independent and span the space. In an orthogonal basis, we additionally have that each basis vector is orthogonal to all the others. Unfortunately, there is only room for $d$ basis vectors, so it is not possible to assign independent vectors to each feature when projecting down from a higher-dimensional feature space. The surprising fact from the previous section is that although we cannot make them linearly independent, we can find a set that is very nearly orthogonal, accommodating an $F$ that grows exponentially in $d$. Furthermore, we can find such a pseudo-basis with almost no effort.

Consider the dot product of a feature vector and a lexical vector, $\vec{f}_i \cdot \vec{l}_j$. The lexical vector is the (normalized) sum of many feature vectors; we can write it as $(1/z) \sum_{k=1}^{F} c_k \vec{f}_k$, where $c_k$ is the number of occurrences of each feature $\vec{f}_k$ and $z$ is the normalizing constant. The dot product is a linear function in each argument, so $\vec{f}_i \cdot \vec{l}_j = (1/z) \sum_{k=1}^{F} c_k \vec{f}_k \cdot \vec{f}_i$. When $i = k$, the dot product is exactly 1. The rest of the sum constitutes noise from the random projection, which is equally likely to be positive or negative. Treating the normalization constant as fixed, the expected value of $\vec{f}_i \cdot \vec{l}_j$ is $c_i/z$, the normalized number of occurrences of feature $i$. This shows that the dot product will be proportional to the feature count.[6] The lexical vectors are approximately encoding a much larger feature-count table.

The noise from random projections will be less than $\sum_{k=1}^{F} c_k \epsilon / z$, the sum of the maximum error for all vectors over each component of the sum. Since the feature vectors are nearly orthogonal, $z$ will be close to $\sqrt{\sum_{k=1}^{F} c_k^2} = |\vec{c}|$, the length of the features before projection. This must be greater than or equal to $\sum_{k=1}^{F} c_k$, establishing that the noise will be less than $\epsilon$. (Since $\epsilon$ is the worst error in the system, the noise is likely to be much smaller than this.)

Similarly, the dot product of two lexical vectors $\vec{l}_i \cdot \vec{l}_j$ will approximate the agreement between their features: $\vec{l}_i \cdot \vec{l}_j = (1/z_i) \sum_{k=1}^{F} c_{ik} \vec{f}_k \cdot \vec{l}_j \approx (1/z_i z_j) \sum_{k=1}^{F} c_{ik} c_{jk}$, where $c_{ik}$ and $c_{jk}$ represent the value of feature $k$ for words $i$ and $j$ respectively, and $z_i$ and $z_j$ are the normalizing terms for their lexical vectors. This is related to the common observation that a random projection tends to preserve distance. The contribution of noise to this will be less than $(1/z_i) \sum_{k=1}^{F} c_{ik} \epsilon \leq \epsilon$, by the same argument made in the case of $\vec{f}_i \cdot \vec{l}_j$.

[5] We would not want to generate new random vectors for all possible pieces of ordering information as well, because this would require a large table to store. So, it is better to generate something deterministically which behaves as if it is random. The pointwise product achieves this. BEAGLE achieves something similar, but with more expensive operations.

[6] Although we cannot treat $z$ as fixed when computing the expected value of this dot product, $z$ is fixed across different choices of $\vec{f}_i$; so, the ratios of the dot products are as desired, modulo the noise from the projection.
What all of this indicates is that the vocabulary DVRS can handle is exponential in the vector length. The ability of sums of random vectors to approximately store so much information is rather striking.

4.3 Discussion

We presented DVRS, a novel distributed representation which compares well to Word2Vec in analogy tests. DVRS seems to perform especially well on semantic analogies, whereas Word2Vec tends to do better on grammar-based analogies. DVRS did not do better than Word2Vec on the Miller Analogies Test, and, more worryingly, it failed to improve significantly when fed significantly more data, causing it to do worse than Word2Vec on that dataset. This was a surprise.

DVRS does not require any gradient-descent training or other optimization process, nor does it use a matrix factorization or any probabilistic topic model. Instead, it constructs meaningful representations from sums and pointwise products of random vectors. The result can be described as a random projection, indicating that the cosine similarity of the lexical vectors can effectively measure the similarity of the feature counts.

For machine learning applications, DVRS demonstrates the viability of random projections for competitive-quality word embeddings. For cognitive modeling, DVRS shows that circular convolution can be replaced by the pointwise product as a binding operation, achieving very good performance.

References

Arriaga, Rosa I and Vempala, Santosh. An algorithmic theory of learning: Robust concepts and random projection. In Foundations of Computer Science, 1999. 40th Annual Symposium on, pp. 616–623. IEEE, 1999.

Baroni, Marco, Dinu, Georgiana, and Kruszewski, Germán. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, volume 1, 2014.

Bingham, Ella and Mannila, Heikki. Random projection in dimensionality reduction: applications to image and text data. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 245–250. ACM, 2001.

Cai, Tony, Fan, Jianqing, and Jiang, Tiefeng. Distributions of angles in random packing on spheres. The Journal of Machine Learning Research, 14(1):1837–1864, 2013.

Dasgupta, Sanjoy. Experiments with random projection. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pp. 143–151. Morgan Kaufmann Publishers Inc., 2000.

Hinton, Geoffrey E. Distributed representations. Technical report, 1984.

Indyk, Piotr and Motwani, Rajeev. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, pp. 604–613. ACM, 1998.

Jones, Michael N and Mewhort, Douglas JK. Representing word meaning and order information in a composite holographic lexicon. Psychological Review, 114(1):1, 2007.

Levy, Omer and Goldberg, Yoav. Linguistic regularities in sparse and explicit word representations. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning, Baltimore, Maryland, USA, June 2014. Association for Computational Linguistics.

Mahoney, Matt. http://mattmahoney.net/dc/textdata.html. Last accessed March 28th, 2014.

Mikolov, Tomas, Chen, Kai, Corrado, Greg, and Dean, Jeffrey. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.


Mnih, Andriy and Kavukcuoglu, Koray. Learning word embeddings efficiently with noise-contrastive estimation. In Advances in Neural Information Processing Systems, pp. 2265–2273, 2013.

Pennington, Jeffrey, Socher, Richard, and Manning, Christopher D. GloVe: Global vectors for word representation. In Proceedings of Empirical Methods in Natural Language Processing (EMNLP 2014), 12, 2014.

Plate, Tony A. Holographic reduced representations. IEEE Transactions on Neural Networks, 6(3):623–641, 1995.

Ustun, Volkan, Rosenbloom, Paul S, Sagae, Kenji, and Demski, Abram. Distributed vector representations of words in the Sigma cognitive architecture. In Proceedings of the 7th Conference on Artificial General Intelligence, 2014.
