Applying Interestingness Measures to Ansar Forum Texts

D.B. Skillicorn
School of Computing, Queen's University, Canada

[email protected]

ABSTRACT

Documents from the Ansar aljihad forum are ranked using a number of word-usage models. Analysis of overall content shows that postings fall strongly into two categories. A model describing Salafist-jihadi content generates a very clear single-factor ranking of postings. This ranking could be interpreted as selecting the most radical postings, and so could direct analyst attention to the most significant documents. A model for deception creates a multifactor ranking that produces a similar ordering, with low-deception postings identified with highly Salafist-jihadi ones. This suggests either that such postings are extremely sincere, or that personal pronoun use and intricate structuring are also markers of Salafist-jihadi language. Although the overall approach is relatively straightforward, the choice of parameters to maximize the usefulness of the results is intricate.

1. DATASET DETAILS

The Ansar aljihad forum is a mostly English-language forum with limited access: at one time by registration, but now only by referral from an existing member. Of the 29,056 posts in the dataset, about half come from a small subset of members. This paper applies 'bag of words' textual analysis of various kinds to the contents of the forum postings. It does not examine characteristics of the authors or timings of posts.

In general, it is not obvious what aspects of a set of forum data such as this will be interesting in any given context, so the approach is purely inductive. The lack of ground truth about these postings, together with the inductive approach, makes it hard to draw firm conclusions that would be useful in an intelligence setting. On the other hand, analysis using several different models focuses attention on the same small set of postings. This approach is therefore useful, because it is usually not practical for an analyst to read all of the postings in a timely way (and, of course, many relevant datasets would be much larger).

2. ISSUES


The texts of postings are mapped to a series of document-word matrices using sets of words that model phenomena of interest. The raw data for each analysis is therefore a matrix whose rows correspond to the 29,056 postings, whose columns correspond to members of the set of chosen words, and whose entries are counts of the frequency of each word in each document. The data is very skewed in both dimensions: a small set of authors have made very large numbers of postings, and the size of postings varies from literally a few words to tens of thousands of words. Nevertheless, even for models built using a small set of words, the matrices are extremely sparse.

A major difficulty with data in this form is deciding what kind of normalization is appropriate. The goal is to maximize the variation in markers relevant to the analysis, while minimizing irrelevant artifacts. The problem is that some words, typically function words, occur often, and it is changes in their rate of occurrence that are potentially significant. Other words, typically nouns, are significant if they occur at all in a document, but their subsequent repetition in the document is much less significant (the document is already 'about' the noun). This big difference in the significance of frequency makes normalization challenging. Some of the possible normalizations are:

• tfidf, a conventional normalization from IR. The intent of this normalization is to 'spread' or differentiate the stored documents more evenly, making it easier to find the appropriate neighbors when a query is mapped into the representation space (vector or LSI). This would disturb the inherent cluster structure in this data and so should not be used.

• Normalize based on the length of each document, either in terms of the total number of words present, or the total number of model-specific words present. Such a normalization is usually motivated by assuming a generative model for postings. The particular set of postings observed is regarded as a sample from an underlying distribution that describes the hypothetical process that creates postings, and so contains artifacts due to sampling. For example, long documents have drawn more often from a word-choice distribution and so contain more, and more different, words. Taking the length of documents into account allows some of this bias to be discounted (not all, because long documents, even when normalized, can take on a greater set of values; so the analysis technique should also be resistant to quantization variation, which eigendecompositions are). There remains the question of which document length to use to get normalized frequencies. Some of the choices are:

  – Normalize to the unit hypersphere. This is the standard approach in information retrieval, partly because queries are modelled as really short documents, and this normalization makes documents of all lengths comparable. The problem in a dataset that contains both very short and very long documents, as this one does, is that this normalization blurs the structure substantially.

  – Divide by the document length. This provides a more gentle increase in similarity between short and long documents. In short documents, though, this increases the apparent signal strength of individual words well above that of any word in a long document, and so is still quite distorting.

  – Choose some quantum of length, say k. Documents longer than k have their word frequencies divided by their length; documents k words or shorter have their word frequencies divided by k. k then behaves as the unit within which words that occur at predictable rates will have occurred a stable number of times. It should probably be chosen so that most documents are shorter than k. The document length distribution is very skewed for this dataset, with the 'knee' of the frequency histogram at about 500, so relatively few documents would be normalized by their actual length.

  – Do not normalize by length at all. This amounts to an assertion that word frequencies are attributes all on a comparable scale, and so raw values are meaningful. From a different perspective, this also asserts that documents are not simply samples from a distribution of word frequencies, but that document length is also a meaningful choice by the author. This has some plausibility for this dataset, because length correlates with the style of posting quite strongly, but it certainly emphasizes longer documents in the results.

• Apply a flattening transformation such as taking logarithms of word frequencies. This does not explicitly take the length of documents into account, but compresses the distribution of points corresponding to documents into something closer to a hyperspheric annulus. Such a normalization implicitly asserts that the presence of a word in a document is more important than its absence, but that significance does not increase linearly with frequency. For example, many authors have stylistic tics in which they use particular words frequently without altering overall meaning.

Different choices of normalization will completely alter the resulting analysis, but there seems to be no deeply principled way to make this choice. Fortunately, although different choices change the medium-scale structures for this dataset, they seem to have little effect on the extremal structure.

There is also an issue of how to normalize the columns of the matrix. After row normalization, the matrix entries remain non-negative, so measures of similarity between documents (rows) are always positive, and there is no concept of dissimilarity, just weaker similarity. However, an alternate view is that similarity measures should also allow dissimilarity, based on deviations from some 'typical' base frequency. In this case, normalizing the columns to z-scores is appropriate. There is a further complication, though. Because these matrices are sparse, computing means and standard deviations based on all entries of a column loses available information: the denominators are large regardless of how many documents each word appears in. Furthermore, the mass of zero entries typically ends up very slightly on one side of the origin. As a result, computing correlations in standard ways constructs similarity based partly on the absence of word use between two documents, which is not usually sensible. It is better to compute means, standard deviations, and so z-scores only from the non-zero entries of each column. A short code sketch of these row and column normalizations follows below.
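As a concrete illustration of the length-quantum and logarithmic row normalizations, and of column z-scores computed from non-zero entries only, the following Python sketch shows one way the computations could be written. The array names, the default quantum of 500, and the function boundaries are illustrative assumptions, not code from the study.

import numpy as np

def normalize_rows(counts, k=500, use_log=False):
    # Row normalization: either the length-quantum scheme or log flattening.
    counts = counts.astype(float)
    if use_log:
        # Flattening: presence matters most; significance grows sublinearly with frequency.
        return np.log(counts + 1.0)
    lengths = counts.sum(axis=1, keepdims=True)   # document length in (model) words
    divisor = np.maximum(lengths, k)              # short documents are divided by k,
    return counts / divisor                       # long documents by their actual length

def nonzero_zscore_columns(a):
    # Column z-scores computed from the non-zero entries only; zero entries
    # (word absent) stay at zero instead of being pushed slightly off-centre.
    out = np.zeros_like(a, dtype=float)
    for j in range(a.shape[1]):
        nz = a[:, j] != 0
        if nz.sum() < 2:
            continue
        mu, sigma = a[nz, j].mean(), a[nz, j].std()
        if sigma > 0:
            out[nz, j] = (a[nz, j] - mu) / sigma
    return out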

Assessment of the content of the postings is complicated by the fact that they are written in extremely different registers. Many postings, especially the shorter ones, are written in a very informal style typical of many write-once postings on the web (chat, comments, etc.). At the other extreme, there are postings, typically long, that are written in a very ornate and flowery style, often coupled with religious ornamentation. There is a tendency to express religious thought in the English of the 17th century, for example using archaic words like "thou" and "doth". Tools that were register-aware would be helpful for data such as this, but the lack of practical systemic functional parsing tools limits this level of sophistication at present.

In common with much informal writing, spelling is not standard across the postings. This is further complicated by the numerous possible transliterations of Arabic words into English, especially greetings, slogans, and names. These different spellings are not conflated in the analysis, since it is probable that they reflect cultural and geographical variations among authors that might be significant.

Examining the list of statistically significant phrases extracted from the document set using the Logik tool suggests that the majority of the discussion is driven by news. The focus is on incidents, people, and places that were likely to have been discussed in the media over the relevant time period, rather than on people within the jihadi movement (for example, mentions of Qari Mohammad Yousuf, one of the Taliban press spokesmen, are far more frequent than mentions of Mullah Omar).
The most frequent phrases extracted, in decreasing order, are: Allah, Quote, Afghanistan, Taliban, Islamic Emirate, Mujahideen, government, soldiers, military, American, Pakistan, attack, Brother, Iraq, police, video, militants, troops, Somalia, district, mujahid, local time, Salaam, President, attacks, army, brothers, wa, Mogadishu, country, Baghdad, Islamic, vehicle, Afghan, Iraqi, city, officials, puppet army, Islam, Peace, news, Obama, download, Acer, alaykum, terrorists, landmine, Security, MB [presumably megabyte], Muslims, Muslim, http, tank, fighters, war, alaikum, Pakistani, British, Swat, Soldier, Insha Allah, Somali, Bomb, civilians, enemy, report, html, rapidshare [a file sharing service], capital, Kandahar, explosion, ameen, killing, view, fight, akhi [brother], Jihad, Reuters, Qari Muhammad Yousuf, Israel, Sheikh, insurgents, amir, Gaza, Islamist, Assalamu, release, Zabihullah, God, Media, Aswat, Israeli, WMV, convoy, fileflyer [another file sharing service], al-Iraq, NATO.

The country focus is on Afghanistan (307 documents), Pakistan (202), Somalia (148), Israel (54), and Iran (39), and notably not on America (23), Britain (12), Canada (5), or Australia (2). Adjectival country names are more common, for example American (204 documents). Frequencies of references to news sources are: BBC (16 documents), al Jazeera (12), CNN (13), Reuters (55) (an interesting sidelight on technology), and Associated Press Writer (24).
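The per-country and per-source figures above are document frequencies: the number of postings that mention a term at least once. As a hedged sketch (the variable postings and the example term list are hypothetical), such counts can be reproduced directly:

import re

def document_frequency(postings, terms):
    # For each term, count the postings that contain it at least once.
    counts = {t: 0 for t in terms}
    for text in postings:
        lowered = text.lower()
        for t in terms:
            if re.search(r"\b" + re.escape(t.lower()) + r"\b", lowered):
                counts[t] += 1
    return counts

# e.g. document_frequency(postings, ["Afghanistan", "Pakistan", "Somalia", "Reuters"])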

3. ANALYSIS METHODOLOGY

The analysis that follows uses singular value decomposition (SVD) applied to document-word matrices built from different combinations of possible words and their frequencies. Suitably normalized, a singular value decomposition discovers axes (essentially, eigenvectors) along which the set of documents exhibits variation. The resulting space is then projected into a few dimensions (typically 2 or 3). In the resulting similarity space, proximity corresponds to global similarity (among both documents and words, since the SVD is completely symmetric); direction corresponds to global differences (that is, clusters); and distance from the origin corresponds to interestingness. This last holds because points that correlate well with many other points, and points that correlate with few other points, are both projected close to the origin; points that correlate only moderately with other points are mapped far from the origin in low-dimensional space. This kind of moderate correlation often captures useful notions of interestingness, since it avoids both documents whose word usage is exceedingly typical and those whose word usage is unique.

When the frequencies of large numbers of words are used, this approach is a kind of clustering. When particular words associated with some property of texts are used, the projection is more typically a spectrum representing the intensity of the property captured by the set of words. When the property is truly single-factorial, the resulting space contains a 1-dimensional manifold, or spectrum, along which the points corresponding to postings are placed. Most complex properties are multifactorial, so it is more typical to see a structure in two or three dimensions. Such a structure can be projected onto a line through the space spanned by the first two or three singular vectors to create a single score, representing interestingness with respect to the model words. In both cases, a plot provides a visualization of global similarities and differences, and distance from the origin provides a visualization of interestingness. The distance from the origin in some k dimensions can be computed, and the documents ranked according to it. Such a ranking loses information about direction, but provides a quick method of focusing attention on a small subset of the records.
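A minimal sketch of this pipeline, assuming a normalized document-word matrix A (postings by model words) held as a numpy array; the retained dimensionality k = 3 and the function name are illustrative choices, not the paper's code:

import numpy as np

def svd_interestingness(A, k=3):
    # Truncated SVD: coordinates for documents and words in the same k-dimensional space.
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    doc_coords = U[:, :k] * S[:k]        # document positions
    word_coords = Vt[:k, :].T * S[:k]    # word positions (the SVD is symmetric in this sense)
    # Distance from the origin as a surrogate for interestingness.
    distance = np.linalg.norm(doc_coords, axis=1)
    ranking = np.argsort(-distance)      # indices of postings, most 'interesting' first
    return doc_coords, word_coords, ranking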

4. ANALYSIS RESULTS

4.1 Overall word use

Word extraction on the set of forum postings produced a set of 198,211 distinct "words". As is typical of informal multilingual documents, a substantial fraction of these do not appear in any dictionary; some are 'wrapper' words from the context (such as "http"), and many are typos or transliterations of Arabic words or fragments of Arabic words. Stop words were not removed because, in second-language contexts, differences in stop word use might be significant, reflecting, for example, relative fluency in English.

From this set of words, the 779 words that occurred more than 50 times overall were retained to produce a document-word matrix. This threshold was chosen pragmatically, although it is in a linear range of the threshold-versus-words curve, and the resulting structure did not seem very sensitive to the choice. The matrix was processed as discussed above, without row normalization, but normalizing columns to non-zero z-scores.

Figure 1: Based on words occurring frequently, the postings fall into two well-separated classes.

Figure 1 shows that the documents form two very distinct clusters. One cluster, oriented vertically in the figure, contains postings about military and insurgent activity, focused on Afghanistan and Pakistan. These postings tend to be news or reportage of various kinds, some copied from mainstream organizations. The words associated with it are largely content words, and are visible in Figure 3: words such as "killed", "province", "district", "mujahideen" (apparently the preferred transliteration by mainstream news organizations), "America", "Islamic", and "enemy". The second cluster, oriented horizontally in the figure, contains postings that might be called Jihadi-religious. Figure 3 shows that the words associated with this cluster are primarily function words. This suggests that, for this cluster, it is not content that matters so much as persuasion, sentiment, power, and emotion.

It is surprising that the content of the forum separates so strongly into two clusters; the existence of these two distinct topics is not at all obvious from reading a subset of the postings. To human readers, both the tone and content of postings from the different clusters do not seem markedly different.

Because there is no normalization by length, the extremal documents tend to be the longer ones. Normalizing using a boundary of 500, which is about the knee of the curve of lengths, produces very similar structure in the words, but makes it clear that the number of postings in the horizontal cluster is much larger than in the vertical cluster. The overall structure does not change much, providing reassurance that normalization choices here do not dominate the results.
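A hedged sketch of how the 779-word document-word matrix described above might be assembled, using scikit-learn's CountVectorizer and keeping only words whose total frequency exceeds 50; the tokenizer and lower-casing are assumptions and will not reproduce the 198,211-word vocabulary or the 779 retained words exactly:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def build_matrix(postings, min_total=50):
    # Document-word counts, restricted to words occurring more than min_total
    # times across the whole corpus; stop words are deliberately not removed.
    vec = CountVectorizer(lowercase=True)
    X = vec.fit_transform(postings)                 # sparse matrix: postings x words
    totals = np.asarray(X.sum(axis=0)).ravel()      # total frequency of each word
    keep = np.where(totals > min_total)[0]
    vocab = np.array(vec.get_feature_names_out())[keep]
    return X[:, keep], vocab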

Figure 4: For most-frequent words, the mutual similarity between words and documents. Because of the symmetry of the SVD, rows of both matrices can be plotted as points in the same space. A word and a document are attracted to similar locations when the frequency of the word in the document is large. One interpretation of the SVD is that it is a global integration of this pairwise attraction.
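Because documents and words receive coordinates in the same space, plots such as Figures 2–4 are essentially biplots. A small sketch of the overlay, assuming doc_coords, word_coords, and the retained word list come from an SVD step like the one sketched in Section 3 (matplotlib is assumed to be available):

import matplotlib.pyplot as plt

def biplot(doc_coords, word_coords, words):
    # Overlay postings and words in the first two SVD dimensions.
    plt.scatter(doc_coords[:, 0], doc_coords[:, 1], s=4, alpha=0.3, label="postings")
    plt.scatter(word_coords[:, 0], word_coords[:, 1], s=12, color="red", label="words")
    for (x, y), w in zip(word_coords[:, :2], words):
        plt.annotate(w, (x, y), fontsize=7)
    plt.legend()
    plt.show()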

Figure 2: For most-frequent words, document clusters labelled with their posting number. Document 25606 lies between the two clusters; it is a long list of insurgent activities, in the style of the horizontal cluster, but with the content of the vertical cluster. As expected, it uses the word “in” at extremely high rates.


Figure 3: For most-frequent words, global word similarity.

Use of a tool such as Palantir would enable much of the basic content structure in this set of documents to be extracted in sophisticated ways. The advantage of the analysis here is that (a) it is purely inductive rather than analyst-driven, (b) it shows the high-level structure very directly, and (c) using distance from the origin as a surrogate for 'interestingness' allows the documents to be ranked, so that analyst attention can automatically be focused on the most significant postings within the set.

4.2 Finding radical postings

It would be most useful to exploit projection and ranking to select those postings with the greatest signs of radical Salafist, al Qaeda, or jihadist content. Koppel et al. [1] built an empirical model of Salafist-jihadi ideological word use in contrast to that of other ideologies (mainstream, Wahhabist, and Muslim Brotherhood), which we use as a surrogate for Salafist-jihadi content in this forum's postings. At best, this is only a rough approximation to the desired content and style; in particular, it is not designed to discriminate between Salafist-jihadi language and 'ordinary' language such as news reports.

We begin with the top 100 words from the Koppel model. Several of these Arabic words translate to the same English word, so we end up with 85 English words in the model (shown in Table 1). The frequencies of these words are extracted from the forum data, and the resulting matrix is row normalized by replacing each entry by log(a_ij + 1). The columns are then normalized to non-zero z-scores as before. Different forms of normalization were tried, but made little difference to the qualitative structure.

The results are shown in Figure 5. This model appears to be working well in the sense that it projects postings almost entirely onto a 1-dimensional structure that can be interpreted as a continuum from plentiful non-jihadist postings to rarer but more extreme jihadist postings. There is a second, roughly orthogonal component of postings with differentiated use of the words "said", "were", "the", and a few others. Especially the presence of "the" as such a strong marker suggests that second-language issues are relevant here; perhaps the postings in this smaller component are primarily quotations from mainstream news organizations. Removing this component, by removing the associated words from the model, leaves the large component almost unaffected.

Some of the extremal postings at the Salafist-jihadist end of the spectrum are:

15646 – "words for jihadis";
14621 – the Book of a Mujahid;
9916 – an extensive political/religious argument;
14736, 17431 – pro-jihadi religious tracts by al-Maqdisi, the spiritual mentor of al-Zarqawi.

At the other end of the spectrum are postings that are quite vicious in tone, but about other subjects (and shorter, which is partly why the extent of the spectrum is not symmetric around the origin). For example:

13494 – a comment on a visit by Huckabee to Jerusalem;
10416 – a brief comment suggesting that backlash to insurgent attacks came from drug lords, rather than the general population;
3201 – a posting about Kashmir;
23314 – almost entirely transliterated Arabic, so relevant words are not captured;
22406 – a brief news report.

Figure 6 shows that the words most strongly associated with Salafist-jihadi postings are function words such as "those", "who", "these", "they", and "when", suggesting that it is relationship and conviction rather than propositional discussion that are important. Content words are not strong markers, but there is perhaps a characteristic style associated with radical postings. This is supported by the results of Koppel et al. [1], who were able to classify documents with different ideologies with about 75% accuracy using only function words.

The existence of inflammatory postings at both ends of the spectrum suggests that sentiment analysis could be helpful for this problem, but it would need to be sophisticated, since the relevant words go far beyond adjectives and negations.
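A sketch of how the single-factor score described above could be computed, assuming the full count matrix X and its vocabulary from earlier, and a list koppel_words holding the translated model words of Table 1; the function name and the use of the leading singular vector as the score are illustrative assumptions:

import numpy as np

def radicalism_scores(X, vocab, koppel_words):
    # Restrict to the model words, apply log flattening and non-zero z-scoring,
    # then place each posting on the leading SVD axis.
    model = {w.lower() for w in koppel_words}
    cols = [i for i, w in enumerate(vocab) if w.lower() in model]
    A = np.log(np.asarray(X[:, cols].todense(), dtype=float) + 1.0)
    for j in range(A.shape[1]):                      # non-zero z-scores, column by column
        nz = A[:, j] != 0
        if nz.sum() > 1 and A[nz, j].std() > 0:
            A[nz, j] = (A[nz, j] - A[nz, j].mean()) / A[nz, j].std()
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, 0] * S[0]                            # signed position along the main axis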

Figure 5: For words related to Salafist-jihadi radicalism, the mutual similarity between words and documents.

Figure 6: Structure in the words associated with radicalism.

Figure 7: Postings and words using the deception model.



4.3 Looking for deception

We now turn to consider deception. The work of, among others, Pennebaker's group [2, 3] has shown that (a) deception causes characteristic changes in text or speech, and (b) these same changes can be observed over a large range of different activities that have an element of deceptiveness, from outright lies to negotiation. Since propaganda has an element of deception built into it, we consider whether postings that rank highly using Pennebaker's deception model are of interest. The model, which is determined empirically but has been widely validated, posits that the characteristic signature of deception is changes in the frequencies of four classes of words:

• first-person singular pronouns decrease;
• exclusive words, words that introduce a subsidiary phrase or clause that makes a sentence more complex, decrease;
• negative-emotion words increase; and
• action verbs increase.

The model uses 86 words in all; they are listed in Table 2. As before, the frequencies of the words in the model were extracted from the forum dataset. The entries were scaled by logarithms, and non-zero z-scoring was applied. The column entries, now symmetric about zero, were negated for columns 1–20, which correspond to the first-person singular pronouns and exclusive words, for which decreased frequencies are signals of deception. In the resulting matrix, a larger magnitude always represents a positive signal of deception. The same analysis process as before was applied to the resulting matrix.

The results are shown in Figure 7. The basic structure is fan-like, resulting from variation in the use of the words shown towards the right of the figure: "I", "or", and "but". However, the most striking feature is that the extremal postings to the left, the putatively least deceptive, are the same postings that ranked as highly jihadist. Examination of these extremal postings shows that they are off the charts in terms of first-person singular pronoun and exclusive word frequency. In other words, the reason that Salafist-jihadist postings look low in deceptiveness is that they tend to be intricate yet personal discussions/arguments. This may be a signal of passionate belief, or it may be a stylistic signature developed from particular kinds of religious activity.

The postings at the other end of the deception spectrum are primarily news reports copied from mainstream media. In the context of typical forum postings, such documents contain first-person singular pronouns only when someone is being quoted, and are written in a simple, expository style that uses very few exclusive words. Couple this with steady use of action verbs to keep the story moving, and a generally negative tone about war-relevant reporting, and it is clear why such stories rank at the deceptive end of the spectrum. This again emphasizes the need to consider the pool of documents when interpreting relative deceptiveness.

Figure 8: Structure of the words in the deception model.

The structure of the words from the deception model, shown in Figure 8, is 1-dimensional, aligned with the axis of deception in the postings, except for a small set of words roughly orthogonal to it. These words, "me", "my", and "I", tend to be strongly associated both with relative power and with deception in Western documents. It seems plausible that these pronouns are not so routinely used in Islamic culture, so their use frequencies may be author related.
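A minimal sketch of the sign flip that makes larger values always point in the deceptive direction, assuming a matrix D (postings by the 86 model words) that has already been log-scaled and non-zero z-scored, with its first 20 columns holding the first-person pronouns and exclusive words; the names and the retained dimensionality are assumptions:

import numpy as np

def deception_coordinates(D, n_negated=20):
    # Negate the pronoun and exclusive-word columns, so that in every column a
    # larger value is a positive signal of deception, then decompose as before.
    D = D.copy()
    D[:, :n_negated] *= -1.0
    U, S, Vt = np.linalg.svd(D, full_matrices=False)
    doc_coords = U[:, :3] * S[:3]
    return doc_coords, np.linalg.norm(doc_coords, axis=1)   # coordinates and ranking score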

5. DISCUSSION

The goal of this analysis is to provide shortcuts for analysts by ranking postings in order of properties of interest, so that only some top part of the ranking need be examined in detail.

Ranking using the content of documents shows that postings to this forum are of two quite distinct kinds. Ranking here is of limited usefulness, since length plays a large role in distance from the origin. Different normalizations are possible and might produce an interestingness ranking, but "interesting" here means roughly "on topic", so this may not be very useful.

Ranking using an existing model of Salafist-jihadist word usage patterns turns out to be surprisingly useful, producing a single-factorial ranking of postings where the top-ranked documents do indeed seem to be significant. Ranking using the deception model also turns out to be useful, although in a slightly surprising way. Documents that rank highly on the Salafist-jihadi scale rank low on the deception scale. This may be a signal of sincerity, or a result of stylistic markers acquired during radicalization.

6. REFERENCES

[1] M. Koppel, N. Akiva, E. Alshech, and K. Bar. Automatically classifying documents by ideological and organizational affiliation. In Proceedings of the IEEE International Conference on Intelligence and Security Informatics (ISI 2009), pages 176–178, 2009.

[2] M. Newman, J. Pennebaker, D. Berry, and J. Richards. Lying words: Predicting deception from linguistic style. Personality and Social Psychology Bulletin, 29:665–675, 2003.

[3] J. Pennebaker, M. Francis, and R. Booth. Linguistic Inquiry and Word Count (LIWC). Erlbaum Publishers, 2001.

Words in the Koppel model, ranked from top left to bottom right in the original layout:

Jihad, Parents, How, Platform, Religion, Much, Monotheism, Muslim, Family, Mujahideen, Worlds, Ye, Way, Oppressors, Alone, Unbelievers, Word, Understand, Infidelity, Idolaters, Say, Faithful, Nation, Was, Tyrants, War, Rahim, They, Abi, The, Fighting, Rahman, Were, God, More, Revealed, Themselves, Jewish, Taymiyyah, Faith, Command, When, Juggernaut, Right, Earth, Folk, Greater, Mercy, Believers, Those, Prophet, Combat, Under, Struggler, Killing, Iraq, Them, America, Falsehood, Companions, Some, You, Governance, Almighty, Kfar, Minimum, Country, Shirk, These, Afghanistan, Who, Youth, Enemy, People, Terrorism, Messenger, O, Said, Including, Entire, Force, Islam, Trial, Illusion, Name.

Table 1: Top-ranked words indicative of Salafist-jihadi ideology in contrast to other forms of Islamic thought, from Koppel et al. [1]. The word set is in Arabic and was translated using Google Translate, introducing some artifacts. For example, "rahman" would usually be written as "merciful" in English, but "kuffar" could appear either transliterated or translated as "infidel". We ignored such effects, since repeating the experiments with a set translated by a human made little difference. In practice, if an automated tool is 'good enough' it should be preferred, since Arabic speakers remain rare in intelligence settings. "Shirk" here is the Arabic word meaning "associating partners in the worship of Allah"; Taymiyyah was a 14th-century Islamic theologian whose ideas have strongly influenced the most conservative versions of Islam.

First-person pronouns: I, me, my, mine, myself, I'd, I'll, I'm, I've

Exclusive words: but, except, without, although, besides, however, nor, or, rather, unless, whereas

Negative-emotion words: hate, anger, enemy, despise, dislike, abandon, afraid, agony, anguish, bastard, bitch, boring, crazy, dumb, disappointed, disappointing, f-word, suspicious, stressed, sorry, jerk, tragedy, weak, worthless, ignorant, inadequate, inferior, jerked, lie, lied, lies, lonely, loss, terrible, hated, hates, greed, fear, devil, lame, vain, wicked

Motion verbs: walk, move, go, carry, run, lead, going, taking, action, arrive, arrives, arrived, bringing, driven, carrying, fled, flew, follow, followed, look, take, moved, goes, drive

Table 2: The 86 words used by the Pennebaker model of deception.