Identifying locally- and globally-distinctive urban place descriptors from heterogeneous user-generated content

R. Feick 1, *, C. Robertson 2

ABSTRACT

Place, which can be seen simply as space with meaning, has long been recognized as an important concept for understanding how individuals perceive, utilize and value their surroundings. There is increasing interest in mining information from geo-referenced user-generated content (UGC) and volunteered geographic information (VGI) to gain new insight into how people describe and delimit urban places such as neighbourhoods and vernacular landmarks and locales. In this paper, we aim to extend recent efforts to explore semantic similarity in these data by examining differences in place descriptors through tags across multiple scales for selected cities in the USA. We compute measures of tag importance using both a naïve aspatial approach and a method based on spatial relations. We then compare the results of these methods for understanding tag semantics, and reveal to what degree certain characterizations as represented in tag-space are also spatially structured. Tag metrics are computed for multiple fixed resolutions that approximate typical urban place sizes (e.g. city, block, neighbourhood) and a simple extension of a well-known tag-frequency metric is proposed to capture differences in locally distinctive and globally distinctive tags. We present this analysis as an adaptation of traditional text analysis methods with ideas from spatial analysis in order to reveal hidden spatial structure within UGC.

KEY WORDS: GIS, Internet/Web, Urban, Data mining, Understanding, Scale, Method, Multiresolution

This is a pre-publication version of a chapter accepted for Advances in spatial data handling and analysis, Y. Yeung, and F. Harvey (eds.). Springer (forthcoming)

1 School of Planning, University of Waterloo, Waterloo, Ontario, Canada [email protected] *Corresponding author.
2 Dept. of Geography and Environmental Studies, Wilfrid Laurier University, Waterloo, Ontario, Canada [email protected]

1. INTRODUCTION

People regularly use terms and concepts (e.g. neighbourhood, city center, tourist district, near, far) that are vague, context-specific and ambiguous to varying degrees when they communicate about the places they value, inhabit and interact within. Understanding how place is sensed and communicated is complex since place is a personal and socio-cultural construct that is influenced by individuals’ cognitive processes, experiences and the dynamic context in which place-sensing is situated (Cresswell, 2004). Traditional methods of gathering data related to how people sense and perceive place (e.g. photo-elicitation interviews, participant observation) provide rich qualitative data, but they are typically labour-intensive (Manzo, 2005). The increased availability of geo-referenced user-generated content (UGC), such as photographs, videos and social media posts, offers a complementary avenue to explore how place is sensed and characterized across larger populations and geographic extents. The volume and nature of this UGC or volunteered geographic information (VGI) varies substantially across sources and geographic regions. Generally, these data comprise at least: a) a geographic reference or object (e.g. coordinates of a social media post, toponym reference, GPS track) and, b) an associated set of largely unstructured text keywords, phrases or “tags”. Through joint and separate analysis of these spatial and tag components, new insights relevant to our understanding of place use and perception have been gained. For example, Hollenstein and Purves (2010) and Li and Goodchild (2012) demonstrated how the spatial extents of vernacular regions, such as city cores and place references, can be interpreted from geotagged photos (GTPs). Others have illustrated how landmark preferences, urban movement patterns, and place semantics can be inferred from these data (Jankowski et al., 2010; Mackaness and Chaudhry, 2013; Purves et al., 2011).
Place semantics, described by Rattenbury and Naaman (2009) as socially-defined locations associated with tag terms, offers a promising approach for identifying meanings and uncovering otherwise opaque tag and spatial structures in UGC. Inferring shared place meaning from multiple users’ UGC is challenging in part because of the absence of common ontologies, the often incomplete and inconsistent nature of UGC, and the presence of idiosyncratic abbreviations, colloquialisms and conflations of natural language terms (Shelton et al., 2014; Li et al., 2013; Janowicz et al., 2011). Recently, more attention has been directed at examining spatial structure in place references and descriptors across multiple scales (Rattenbury and Naaman, 2009; Mackaness and Chaudhry, 2013; Feick and Robertson, 2014). This vein of inquiry recognizes that individuals’ perceptions of place are often composed of locationally-specific, overlapping and/or hierarchical elements. In some cases, these perceptions may be anchored to explicit and formal entities (e.g. city-province-country), while many others are more personal or related to ephemeral events and experiences (e.g. “my neighbourhood”, “safe areas”, “music festival”). Our interest here lies in complementing recent methodological advances in the joint handling of space and semantics for the analysis of place meanings and descriptions encoded in UGC. In particular, we aim to build upon earlier work that examined tag dominance in GTPs across multiple place scales within a single city by investigating a method to uncover place references that are distinct within multiple urban centres (e.g. city hall) or are unique within a single urban entity (e.g. Empire State Building). We propose an adaptation of the well-known term frequency-inverse document frequency (tf-idf) measure that searches for relative uniqueness of terms (tags) across both global and local extents.

2. METHODS

Recent work by Hollenstein and Purves (2010) examined how people characterize urban centres through certain place-oriented keywords in geotagged photographs across the USA and used these place references to derive the spatial extent of city boundaries. Similarly, Feick and Robertson (2014) explored the spatial dynamics of tag-space in terms of neighbourhood similarity and dissimilarity, and how this changes with level of spatial aggregation. We can envisage these works as two ends of a spectrum of spatial-semantic analysis: in the former, the semantic focus is determined a priori (i.e., only place-oriented tags are considered), and in the latter only similarity and dissimilarity are examined and not the inherent meaning of the tags themselves. Here, we are interested in unearthing place semantics from UGC without prior manual identification of candidate place tags, using methods developed for text search and analysis (tf-idf). We explore this idea through a case study of a large dataset of Flickr geotagged photograph metadata (GTPs).

2.1 Data and Study Area

Data covering the years 2001-2013 were obtained from the Flickr API for a selection of 14 census urban areas identified by the US Census Bureau in 2012. Urban areas (UAs) are a census unit defined to represent the actual urban footprint comprised of built-up residential, commercial and industrial urban land uses. Officially defined UAs are stratified into urban clusters (2501-49,999 people) and urban areas (50,000 + people) (Figure 1). Here we consider only the larger of these urban area classifications. Additional census geography, including census tracts and census blocks, was obtained in order to investigate spatial semantics across multiple spatial scales within UAs.

Fig. 1 Urban areas of the United States with two sample UAs highlighted

GTPs for 14 UAs of varying sizes and locations were obtained from the Flickr API using the Python library flickrapi (http://stuvel.eu/flickrapi). A Python script queried the API using a grid of reference points spaced 5 km apart for each UA. At each grid point, a radius search was used to obtain nearby photographs. This provided more consistent search results from the API than more direct single point-radius and bounding box searches. Once duplicate photographs were removed, 669,099 photos with unique Flickr photo IDs and valid latitude and longitude values were used for the analysis (Table 1).

Table 1 GTP, unique users and tag counts per Urban Area (UA)

Urban Area                        # of photos   # of unique tags   # of unique users   # of untagged photos
Boston                                 83,360             83,686               7,138                  7,028
Chicago                               110,859            106,717               8,184                 12,133
Dallas-Fort Worth-Arlington            36,147             34,028               3,362                  5,595
Denver-Aurora                          14,230             24,685               1,705                  1,624
Detroit                                41,493             44,455               3,396                  5,290
Los Angeles-Long Beach-Anaheim         64,041             82,212               7,477                  6,375
Minneapolis-St. Paul                   33,166             42,483               2,963                  3,546
New York-Newark                       158,107            165,651              16,008                 14,263
Philadelphia                           48,440             47,171               4,191                  5,905
Pittsburgh                             16,308             24,535               1,640                  1,560
San Jose                               26,973             31,543               3,185                  2,264
St. Louis                              13,580             16,972               1,507                  1,686
Tampa-St. Petersburg                   22,395             27,627               2,593                  2,995
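The grid-based querying described above can be illustrated with a minimal sketch. It only generates the reference points; no Flickr API call is made. The function names, the bounding-box input and the suggested search radius are assumptions introduced for illustration, since the chapter does not report the radius used or how the grid was constructed.

```python
import math

KM_PER_DEG_LAT = 111.32  # approximate length of one degree of latitude


def grid_points(min_lat, min_lon, max_lat, max_lon, spacing_km=5.0):
    """Return (lat, lon) reference points spaced roughly spacing_km apart.

    Hypothetical helper: covers a bounding box rather than the exact UA
    footprint, and uses a simple spherical approximation.
    """
    dlat = spacing_km / KM_PER_DEG_LAT
    points = []
    lat = min_lat
    while lat <= max_lat:
        # A degree of longitude shrinks with the cosine of latitude.
        dlon = spacing_km / (KM_PER_DEG_LAT * math.cos(math.radians(lat)))
        lon = min_lon
        while lon <= max_lon:
            points.append((round(lat, 6), round(lon, 6)))
            lon += dlon
        lat += dlat
    return points


def covering_radius_km(spacing_km=5.0):
    """Smallest radius for which circles on a square grid leave no gaps."""
    return spacing_km * math.sqrt(2) / 2


points = grid_points(40.0, -75.0, 40.1, -74.9)
print(len(points), "grid points; radius of about",
      round(covering_radius_km(), 2), "km per search")
```

Because neighbouring search circles overlap, the same photo can be returned by several queries, which is consistent with the de-duplication on photo ID mentioned above.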

2.2 Spatial Data Processing

To support spatial-semantic analysis over multiple spatial scales, GTPs were aggregated using spatial joins over three levels of census geography: UAs, Census Tracts (CTs) and Census Blocks. With each aggregation, both the count of the number of GTPs and their full tag-sets were attached to each census geography unit. The number of tags associated with an individual GTP varied considerably with 70,264 of the 669,099 photos having no tags and a mean tag count of 98. To permit the relative uniqueness of tag words to be examined easily, the comma-separated tag array associated with each photo record was decomposed into n normalized rows in PostgreSQL database tables for each level of census geography. The one-to-many relationships established between tag records and their respective census geography tables enabled calculation of the metrics described in the following section and also supported examination of results across the hierarchy of UAs, census tracts and census blocks. It was found that census block geography was too fine to support meaningful analysis of tags except in specific localized areas. As a result, only results from the UA and CT level are reported here.

2.3 Analysis of Space, Scale and Semantics

The growth of the participant-web has led to vast amounts of text data being generated and a heightened need for methods such as sentiment analysis that can aid understanding of the meaning of users’ content. Analysis of tag-sets attached to GTPs can therefore leverage text-analysis tools to better understand embedded semantics and their spatial arrangement (Mackaness and Chaudhry, 2013). Much of the text modelling literature is devoted to learning characteristic descriptions of documents within large collections (corpora) to support tasks such as classification, similarity assessment, relevance analysis and detection of anomalies or unusual features, largely in the context of information retrieval (Janowicz et al. 2011; Baeza-Yates and Ribeiro-Neto, 1999; Blei et al. 2003). A widely used method for identifying characteristic words from a document is term-frequency/inverse-document frequency (tf-idf), which is a normalized measure of the occurrence of a specific term within a document, relative to the number of documents containing that term (Salton and McGill 1983). The notion behind tf-idf is that distinctive terms will be mentioned frequently within a document, whilst being relatively infrequent across other documents in the corpus. To extract place-semantics, we altered the tf-idf to compare distinctive tags local to a specific area, to those found to be distinctive across all locations.
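Both the tag decomposition of Section 2.2 and the per-unit tag counts that tf-idf relies on can be sketched in a few lines. This is an in-memory stand-in for the PostgreSQL tables described above, not the authors' implementation; the record layout, the function names and the lower-casing of tags are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def explode_tags(photos):
    """Yield one (unit_id, tag) row per tag; untagged photos yield nothing.

    photos: iterable of (unit_id, comma_separated_tags) pairs, where
    unit_id is the census unit a photo was spatially joined to.
    """
    for unit_id, tag_string in photos:
        for raw in tag_string.split(","):
            tag = raw.strip().lower()  # assumption: case-insensitive tags
            if tag:
                yield unit_id, tag


def tag_counts_by_unit(photos):
    """Aggregate exploded rows into a tag -> count table per census unit."""
    counts = defaultdict(Counter)
    for unit_id, tag in explode_tags(photos):
        counts[unit_id][tag] += 1
    return counts


# Hypothetical photo records for two census tracts.
photos = [
    ("tract_A", "park, city, NYC"),
    ("tract_A", "park"),
    ("tract_B", ""),              # an untagged photo
    ("tract_B", "city,skyline"),
]
counts = tag_counts_by_unit(photos)
print(dict(counts["tract_A"]))
```

Treating each census unit's counter as a "document" is what allows standard text-analysis measures to be applied to geography.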
By considering each local unit of geography (i.e., census block, census tract, UA) as a document, we count the number of times each tag in the tag-set occurs in each area, relative to the inverse count of occurrences in other geographical units. For what we term the global tf-idf, the set of eligible areas (i.e., the full tag-set) for the inverse tag-counts was set to all corresponding areas (i.e., blocks or census tracts) in the 14 UAs. In contrast, local tf-idf uses only the census units within a given UA. Global tf-idf therefore reflects the relative importance of a tag across all UAs, while local tf-idf reflects importance within a single urban context. Tf-idf measures are conceptualized as reflecting local and global geographies in tag-space as presented in Figure 2. Low values of global tf-idf indicate either within-unit (e.g. within census tracts) infrequency and/or across-unit frequency, while high values indicate within-unit frequency and/or across-unit infrequency. Constraining ourselves to urban settings, we might expect tags with low global tf-idf to be generic descriptors characteristic of urban settings (e.g., ‘street’, ‘park’) and tags with high values to be landmark-type place-tags, with high within-unit and low across-unit frequencies. Local tf-idf differs only in the reference set to which within-unit tag frequencies are compared, which is spatially constrained to geographic units in a common UA. As such, low values of local tf-idf may indicate either low within-unit frequency or high across-unit frequency in the city. In contrast, high values indicate high within-unit tag counts and low across-unit tag counts. Ultimately, we aim to compare values in both local and global tag-space in order to reveal characteristic place-semantics.

Fig. 2 Partitioning tag-space into local and global dimensions
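The difference between the two measures comes down to which set of units supplies the inverse document frequency. The sketch below uses one common tf-idf formulation (relative term frequency multiplied by the log of the number of units over the number of units containing the tag); the chapter does not state the exact weighting used, so this formula, the function name and the toy data are all assumptions.

```python
import math
from collections import Counter


def tf_idf(unit_counts, reference_units):
    """tf-idf of every tag in one unit, measured against a reference set.

    unit_counts: Counter of tag -> count for the unit of interest
    reference_units: list of Counters that includes the unit itself
    """
    n_units = len(reference_units)
    total = sum(unit_counts.values())
    scores = {}
    for tag, count in unit_counts.items():
        # Number of reference units in which the tag occurs at least once.
        doc_freq = sum(1 for unit in reference_units if tag in unit)
        scores[tag] = (count / total) * math.log(n_units / doc_freq)
    return scores


# Hypothetical census tracts grouped by urban area (UA).
tracts = {
    "NYC": [Counter(park=2, empirestate=6), Counter(park=3, city=1)],
    "BOS": [Counter(park=1, mit=4), Counter(city=2, harbor=2)],
}
target = tracts["NYC"][0]

# Local tf-idf: the reference set is limited to tracts in the same UA.
local_scores = tf_idf(target, tracts["NYC"])

# Global tf-idf: the reference set spans the tracts of every UA.
all_tracts = [unit for ua in tracts.values() for unit in ua]
global_scores = tf_idf(target, all_tracts)

for tag in target:
    print(tag, round(local_scores[tag], 3), round(global_scores[tag], 3))
```

In this toy data 'park' appears in every New York tract, so its local score is zero, while 'empirestate' scores highly against both reference sets, which matches the generic-descriptor versus landmark distinction drawn above.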


Following recent research in urban place semantics (Mackaness and Chaudhry 2013), we examine the tag-space relationships across spatial scale. Using census block and census tract geographies as spatial units, we compare the trajectories of randomly selected tags in both global tf-idf and local tf-idf. Within this paper, the ratio of local tf-idf to global tf-idf is used to operationalize Figure 2 and to explore similarity in place descriptors across space. Finally, we examine spatial patterns of local and global semantics using local measures of spatial autocorrelation to visualize spatial clustering in tag-space. With little theoretical basis upon which to determine the spatial weights matrix required for local measures of spatial analysis, we apply an iterative neighbourhood optimization modification of the local Gi* statistic (Getis and Ord 1992) described by Aldstadt and Getis (2006).

RESULTS

The distribution of the local/global tf-idf ratio (LGR) was approximately normal but skewed slightly towards lower values (i.e., between 0 and 1), with a maximum value of 1.28 for census tracts. In the context of this paper, statistical significance of the distribution is not examined; instead we simply use the tails to focus attention on extreme cases of local or global tag dominance. Values in the upper tail indicate areas where locally distinctive tags dominated, and values in the lower tail of the distribution indicate areas where globally distinctive tags dominated. Globally dominant tags are shown as a word cloud in Figure 3, in which place names for large geographies (e.g., state names and large city names) are prominent.

Fig. 3 Word cloud of tags from lower tail of the local/global tf-idf distribution (i.e., globally unique) at the census tract scale
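A minimal sketch of computing the LGR and picking out the tails of its distribution is given below. The tail fraction, the keying of scores by (tract, tag) pairs and the handling of zero global scores are all assumptions; the chapter does not report the cut-offs used to define the tails.

```python
def lgr(local_score, global_score):
    """Local/global tf-idf ratio; None when the global score is zero."""
    if global_score == 0:
        return None
    return local_score / global_score


def tails(ratios, fraction=0.25):
    """Return (lower_tail, upper_tail) lists of (key, ratio) by rank.

    fraction is a hypothetical cut-off; at least one item is kept per tail.
    """
    ranked = sorted(ratios.items(), key=lambda item: item[1])
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k], ranked[-k:]


# Hypothetical (local tf-idf, global tf-idf) pairs keyed by (tract, tag).
scores = {
    ("t1", "empirestate"): (0.52, 1.04),
    ("t1", "park"): (0.00, 0.07),
    ("t2", "thebean"): (0.90, 0.80),
    ("t2", "city"): (0.20, 0.25),
    ("t3", "misc"): (0.00, 0.00),  # undefined ratio, dropped below
}

ratios = {}
for key, (local_score, global_score) in scores.items():
    ratio = lgr(local_score, global_score)
    if ratio is not None:
        ratios[key] = ratio

lower, upper = tails(ratios)
print("globally dominant:", lower)
print("locally dominant:", upper)
```

Ranking rather than significance testing mirrors the descriptive use of the tails described above.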

Figure 4 presents tags that are locally distinctive in any of the 14 UAs. Tags that are place references often indicate types of places within an urban region, e.g., ‘park’, ‘city’ and ‘urban’; many non-place references are also included, e.g., ‘instagramapp’, ‘nikon’, ‘2010’. This is to be expected since candidate place tags were not selected a priori.


Fig. 4 Word cloud generated from upper tail of local/global tf-idf (i.e., locally unique tags) at the census tract scale

Spatial patterns of the LGR across three selected urban regions are presented in Figure 5. As symbolization is standardized across each map, it can be seen that overall values of the LGR were highest in New York and lowest in Boston, indicating more local uniqueness of UGC place descriptors in the former. The degree to which these patterns are spatially structured is visualized in Figure 6 through the results of an AMOEBA optimization of the local Gi* statistic. Spatial structure here relates to both the spatial scale of the geographic units (i.e., census tracts) and the place reference counts embedded within the tags.
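For readers unfamiliar with the local Gi* statistic, the sketch below computes it in its commonly used standardized (z-score) form with fixed binary weights. It does not implement AMOEBA, which grows each unit's neighbourhood iteratively; the fixed neighbour lists, the function name and the toy values are assumptions used only to show what the statistic measures.

```python
import math


def local_gi_star(values, neighbours):
    """Standardized Gi* score for each unit under binary weights.

    values: attribute value per unit (e.g. LGR per census tract)
    neighbours: list of neighbour index lists; unit i is always added to
                its own neighbourhood, which is what makes this Gi* and
                not Gi. Assumes values vary and that no neighbourhood
                contains every unit (otherwise the denominator is zero).
    """
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum(v * v for v in values) / n - mean ** 2)
    scores = []
    for i in range(n):
        members = set(neighbours[i]) | {i}
        w_sum = len(members)      # sum of binary weights w_ij
        w_sq_sum = len(members)   # sum of w_ij squared (same for 0/1)
        local_sum = sum(values[j] for j in members)
        denom = std * math.sqrt((n * w_sq_sum - w_sum ** 2) / (n - 1))
        scores.append((local_sum - mean * w_sum) / denom)
    return scores


# Hypothetical LGR values for six tracts arranged in a chain.
lgr_values = [1.2, 1.1, 1.0, 0.3, 0.2, 0.1]
chain = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4]]
z = local_gi_star(lgr_values, chain)
print([round(score, 2) for score in z])
```

Positive scores mark clusters of high values (here, locally distinctive tracts) and negative scores mark clusters of low values.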


Fig. 5 Local/global tf-idf ratios at the census tract scale: New York (top), Chicago (middle) and Boston (bottom) ranging from low (green), to moderate (yellow) to high (red)


Fig. 6 Spatial clustering of AMOEBA values for local/global tf-idf ratios at the census tract scale for New York (top), Chicago (middle) and Boston (bottom)


To explore the combined spatial-semantic characteristics revealed by the LGR further, tags and LGR values are presented for selected sub-areas in Figures 7-9. The symbolization scheme in these maps corresponds to that used in Figures 5 and 6, such that higher values of LGR are shades of red, moderate values are yellow, and lower values are green. We therefore expect to see locally distinctive tags in orange- and red-shaded census tracts and less locally distinctive tags in areas shaded green. While the census tracts in Figures 7-9 are symbolized according to LGR, the tag annotation includes only the four tags with the highest local tf-idf values. To reduce clutter, only census tracts in the top two (i.e. light red and orange) and bottom two (light and dark green) classes of LGR are labelled.

Fig. 7 New York City census tracts shaded by LGR with high local tf-idf tags labelled


Through these simple visualizations, several areas and features expected to display local uniqueness, such as landmark features, are apparent along with some more generic and often non-place terms. The centre of Figure 7, for example, shows tags for New York’s Times Square and the Flatiron and Empire State Buildings in areas with high local-global tf-idf values. These High-High values are in contrast to more generic tags in the few green-shaded census tracts in the lower right (e.g., ‘apostolic’, ‘garden harvest’). Figure 8 illustrates similar results in Chicago where locally distinct tags (e.g., ‘millennium park’, ‘illinoismedicaldistrict’, ‘thebean’) appear in areas with high LGR, in contrast to candidate place tags such as ‘union station’, ‘statues’, ‘skyscrapers’ which appear in areas with relatively higher global tf-idf values. Finally, while LGR values are generally more muted in Boston, some local place tags are evident in Figure 9 including ‘mit’, ‘bruins’ and ‘quincymarket’.

Fig. 8 Chicago census tracts shaded on LGR with high local tf-idf tags labelled


Fig. 9 Boston census tracts shaded on LGR with high local tf-idf tags labelled

DISCUSSION

The analysis presented here represents some preliminary findings on the joint spatial-semantic properties of GTPs as illustrated through a modification of a commonly used text analysis measure, the tf-idf. By modifying the basis from which the idf term is derived, we can compare the degree to which tags are more locally or globally distinctive, and start to explore the spatial patterns and relationships embedded within these tagging patterns when visualized in urban areas. While this method is a simple approach, the results indicate some degree of success in capturing locally distinctive tagging through analysis of the LGR measure alone, especially for well-known landmarks. While we envisage uses for this and similar methods in geographic information retrieval tools, such as optimizing local queries, we are keenly interested in the degree to which spatial-semantic analysis can shed light on automated detection of place and place-making activities in sources of UGC generally, with reduced need for human classification of training tag sets. We are particularly interested in how standard methods for natural language processing might be modified to incorporate elements common to spatial analysis, such as local spatial patterns and the examination of patterns over multiple spatial scales.

REFERENCES

Aldstadt, J., Getis, A. (2006) Using AMOEBA to create a spatial weights matrix and identify spatial clusters. Geogr. Anal. 38: 327–343
Baeza-Yates, R., Ribeiro-Neto, B. (1999) Modern information retrieval (Vol. 463). ACM, New York
Blei, D.M., Ng, A.Y., Jordan, M.I. (2003) Latent Dirichlet allocation. J. Mach. Learn. Res. 3: 993–1022
Cresswell, T. (2004) Place. Blackwell, Malden


De Choudhury, M., Feldman, M., Amer-Yahia, S., Golbandi, N., Lempel, R., Yu, C. (2010) Automatic construction of travel itineraries using social breadcrumbs. In: Chignell, M., Toms, E. (eds) Proceedings of the 21st ACM Conference on Hypertext and Hypermedia, ACM, New York, pp. 35–44
Feick, R., Robertson, C. (2014) A multi-scale approach to exploring urban places in geotagged photographs. Comput. Environ. Urban. Sys. http://dx.doi.org/10.1016/j.compenvurbsys.2013.11.006
Getis, A., Ord, J.K. (1992) The analysis of spatial association by use of distance statistics. Geogr. Anal. 24: 189–206
Hollenstein, L., Purves, R. (2010) Exploring place through user-generated content: using Flickr tags to describe city cores. J. Spat. Info. Sci. 1: 21–48
Jankowski, P., Andrienko, N., Andrienko, G., Kisilevich, S. (2010) Discovering landmark preferences and movement patterns from photo postings. Trans. GIS 14: 833–852
Janowicz, K., Raubal, M., Kuhn, W. (2011) The semantics of similarity in geographic information retrieval. J. Spat. Info. Sci. 2: 29–57
Li, L., Goodchild, M.F. (2012) Constructing places from spatial footprints. In: Goodchild, M.F., Pfoser, D., Sui, D. (eds) Proceedings of the 1st ACM SIGSPATIAL International Workshop on Crowdsourced and Volunteered Geographic Information (GEOCROWD '12), ACM, New York, pp. 15–21
Li, L., Goodchild, M.F., Xu, B. (2013) Spatial, temporal, and socioeconomic patterns in the use of Twitter and Flickr. Cart. Geog. Info. Sci. 40: 61–77
Mackaness, W.A., Chaudhry, O. (2013) Assessing the veracity of methods for extracting place semantics from Flickr tags. Trans. GIS 17: 544–562
Manzo, L.C. (2005) For better or worse: Exploring multiple dimensions of place meaning. J. Environ. Psychol. 25: 67–86
Purves, R., Edwardes, A., Wood, J. (2011) Describing place through user generated content. First Monday 16 (Sept. 5)
Rattenbury, T., Naaman, M. (2009) Methods for extracting place semantics from Flickr tags. ACM Trans. Web. 3: 1–30
Salton, G., McGill, M.J. (1983) Introduction to modern information retrieval. McGraw Hill, New York
Shelton, T., Poorthuis, A., Graham, M., Zook, M. (2014) Mapping the data shadows of Hurricane Sandy: uncovering the sociospatial dimensions of ‘big data’. Geoforum 52: 167–179