DETC2014-34767 - Mahidol University


Proceedings of The ASME 2014 International Design Engineering Technical Conferences & Computers and Information in Engineering Conference IDETC/CIE 2014 August 17-20, 2014, Buffalo, New York, USA



Suppawong Tuarob Computer Science and Engineering The Pennsylvania State University University Park, PA 16802 Email: [email protected]

Conrad S. Tucker Engineering Design and Industrial and Manufacturing Engineering Computer Science and Engineering The Pennsylvania State University University Park, PA 16802 Email: [email protected]

ABSTRACT An innovative consumer (a.k.a. a lead user) is a consumer of a product that faces needs unknown to the public. Innovative consumers play important roles in the product development process as their ideas tend to be innovatively unique and can be potentially useful for development of next generation, innovative products that better satisfy the market needs. Oftentimes, consumers portray their usage experience and opinions about products and product features through social networks such as Twitter and Facebook, making social media a viable, rich in information, and large-scale source for mining product related information. The authors of this work propose a data mining methodology to automatically identify innovative consumers from a heterogeneous pool of social media users. Specifically, a mathematical model is proposed to identify latent features (product features unknown to the public) from social media data. These latent features then serve as the key to discover innovative users from the ever increasing pool of social media users. A real-world case study, which identifies smartphone lead users in the pool of Twitter users, illustrates promising success of the proposed models.

space [12, 19, 24, 28]. Recently, an increasing number of companies have altered their product innovation paradigms by placing consumers at the center of product development, rather than seeing consumers merely as the market [4]. These innovative users have been shown to identify needs beyond those that products currently in the market space can satisfy. Such needs are often converted into product development ideas that could be incorporated in future products. For example, 3M assembled a team of lead users that included a veterinary surgeon, a makeup artist, doctors from developing countries, and military medics. The recruited lead users then brainstormed their ideas in a two-and-a-half-day workshop. As a result, 3M initiated 3 product lines (i.e. the Economy, Skin Doctor, and Armor lines), which were shown to be eight times more profitable than product lines developed with the traditional product development method [17]. However, a drawback of such consumer-innovator paradigms is that only a fraction of consumers have the potential to generate innovative ideas useful for development of the target products. This makes the selection of such innovative consumers (a.k.a. lead users) a challenging early task that requires large amounts of both time and financial resources. Society generates more than 2.5 quintillion (10^18) bytes of data each day [42]. A substantial amount of this data is generated through social media services such as Twitter, Facebook, and Google, which process anywhere between 12 terabytes (10^12) and 20 petabytes (10^15) of data each day [1]. Social media allows its users to exchange information in a dynamic, seamless manner almost anywhere and anytime. Knowledge extracted


Introduction

It has long been believed that consumers exist at the end of product chains, merely to buy and consume what producers create, while companies are the sole entities involved in product development [40]. However, multiple research studies such as [5, 23, 38, 39] have shown that this innovation paradigm no longer holds: consumers themselves are actually the source of the innovation reflected in today's products in the market



Copyright © 2014 by ASME

from social media has proven valuable in various applications. For example, real time analysis of Twitter data has been used to model earthquake warning detection systems [22], detect the spread of influenza-like-illness [10], predict the financial market movement [9, 43], and identify potential product features for development of next generation products [32]. Despite the range of applications, design methodologies that leverage the power of social media data to mine information about products in the market are limited. Researchers in the design community have studied the importance of integrating lead users into the product development processes by recruiting consumers for lead user studies and manually observing product discussion blogs/reviews. However, such methods may suffer from the following limitations:

In this paper, the authors propose a data mining driven methodology that automatically identifies lead users of a particular product/product domain from a pool of social media users. In particular, the authors develop a set of algorithms that first identify latent features discussed in social media. The discovered latent features are then used to identify potential product specific lead users (lead users who have expertise in a particular product) and global lead users (lead users who have critical, innovative ideas about all the products within the product domain). This paper has the following main contributions:

1. The authors adopt text mining techniques to extract product ground-truth features from product specification documents and user-discussed features from social media data.
2. The authors propose a mathematical model to identify the latent features from the extracted ground-truth and user-discussed features.
3. A probability-based mathematical model is developed to identify product specific and global lead users.
4. The authors illustrate the efficacy of the proposed methodology using a case study of real world smartphone data and Twitter data.

1. Immediacy: Companies want to receive feedback on innovative ideas from consumers as soon as possible, so that they can keep the advantage of being early to satisfy the market's needs. Scouting and recruiting human test subjects usually takes time (possibly months). Similarly, manually observing product discussion blogs requires manual labor, which can also take time if limited human resources are allocated to the task. Information in social media, however, is always available, and accessing and processing users' opinions about products can be done almost immediately with an automated system.
2. Reach: Companies want the best of the lead users to help generate innovative product ideas. However, scouting for human subjects for lead user studies may reach only a limited group of consumers, and could miss potential lead users. In contrast, social media consists of a heterogeneous pool of billions of users all over the world, most of whose information is accessible.
3. Cost Efficiency: The money spent on a user study campaign can be substantial because of the man hours and other logistics involved. In contrast, accessing social media data with an automated system is mostly free of charge. Furthermore, most social media services even provide tools for accessing the data anywhere and anytime.

The remainder of the paper is organized as follows: Section 2 discusses related literature. Section 3 discusses the proposed methodology used to address the challenges outlined above. Section 4 introduces the case study along with the experimental results and discussion. Section 5 concludes the paper.


Related Works

Literature on automatic identification of lead users for product development is still in its infancy. The literature discussed here is therefore limited to work closely related to this research.

2.1

Lead Users for Product Development

The applications of lead users for product development have long been studied in the literature. Most works emphasize the importance and the impact of user-innovative product development paradigms that involve consumers in providing innovative ideas for the development of next generation products. von Hippel et al. explored how lead users can be systematically spotted, and how lead user perceptions and preferences can be incorporated into industrial and consumer marketing research analyses of emerging needs for new products, processes and services [13, 38–40]. Batallas et al. modelled and analyzed information flows within product development organizations [5]. Their model helps to understand and identify information leaders in product development processes. Schreier et al. studied lead user participants and found that lead users have stronger domain-specific innovativeness than ordinary users. Moreover, they perceive new technologies as less complex, and hence are in a better position to adopt them [23]. These works illustrate the clear need for lead users to supply development ideas during product development processes. However, acquiring such lead users can be time-consuming and costly, and may not reach out to all the potential

Tuarob and Tucker have recently identified social media data as a viable source of information about products due to the ability to reflect consumers’ opinions and preferences towards particular products or product features beyond the scope of the product specification [32]. For example, messages that convey innovative product ideas such as “U know with all the glass in the iPhone 4 they really should think about integrating a solar panel to recharge the battery.” or “i wish i could use my iPhone as a universal remote control.” are ubiquitous in social media space and could potentially lead to future product development. Hence, the ability to identify such product-related innovative information in social media could prove to be useful when searching for product lead users in the pool of social media users.


lead users in the user space. This bottleneck calls for an automated system that can discover lead users from a heterogeneous pool covering a wide range of consumers, such as the users of social media.

uct features [36] and identifying relevant product features from a high dimensional feature set [35]. Huang et al. defined a feature as a set of connective features of edges or faces, such as convexity, concavity, and tangency. They proposed that features are classified into two main types, i.e. Isolate and Hybrid, and presented an algorithm for recognizing features in each category [41]. Popescu et al. presented OPINE, an unsupervised system for extracting product features from user reviews [21]. For a given product and a corresponding set of reviews, the system is able to extract features along with collective opinions of the users towards particular features. They used 7 product models along with their corresponding web-based reviews for the experiment. Such methods rely on the completeness of the content and correct use of language, and would fail to capture product features discussed in social media where colloquialness and noise are norms. Most of the above techniques utilize the data from product review sites, whose content pertains to products recently purchased, as opposed to content pertaining to product usage over time. Tuarob and Tucker proposed a topic modelling based feature extraction algorithm that takes a collection of social media messages related to a particular product as an input and extracts strong, weak, and controversial product features [32]. Their algorithm uses Latent Dirichlet Allocation [8] to mine topical knowledge from the collection of messages, then extracts product features from social media messages that have high chance of being related to product feature discussion. This approach works well with social media data; however, it cannot extract opinions associated with each extracted feature and does not scale well due to having to re-model topics every time new messages are added to the social media collection (i.e. The algorithm is non-updateable). Huang et al. 
proposed a feature extraction algorithm as part of the RevMiner project, which mines restaurant reviews from Yelp.com and summarizes the reviews to facilitate restaurant suggestion for travellers through a mobile app. Their feature extraction algorithm is updateable: it iteratively learns the language patterns in which feature phrases may occur from a collection of free-text documents, and extracts features along with the associated user opinions. The RevMiner feature extraction algorithm has two advantages over Tuarob and Tucker’s algorithm: 1) it can continue to extract features from newly added data without having to run the whole process again, and 2) it can extract opinions associated with each feature. Hence, the methodology in this work extends RevMiner’s feature extraction algorithm to extract product features from noisy data under social media settings.


Automatic Identification of Leaders

Literature in Computer Science and Information Retrieval has proposed methods to automatically identify leaders from pools of users in online communities. Zhao et al. proposed a machine learning based method to identify leaders in online cancer communities [44]. Their method is only applicable to the cancer domain, as the learning process of the algorithm requires cancer specific domain knowledge. Song et al. proposed the InfluenceRank algorithm for identifying opinion leaders in the Blogosphere [25]. Their algorithm utilizes network connectivity among users, which is not always available in social media services. Tang et al. proposed the UserRank algorithm, which combines link analysis and content analysis techniques to identify influential users in social network communities [26]. Multiple other works, such as [3, 16, 29], have also been devoted to building automated systems that identify leaders or influential users in online communities. Most of these works are not applicable to the problem addressed in this research, for 2 reasons: 1) most of the proposed algorithms in the literature require network structures among users, which are not always available in social media services such as Twitter², blogs, and product reviews; 2) the definition of a leader in most previous works pertains to how a user’s opinion propagates (or ‘influences’ other users) throughout the network, while a lead user in the product development sense is a user who experiences unknown needs. The difference in the definitions of a leader makes previous algorithms unsuitable for this research.

2.3

Product Feature Extraction Since the proposed methodology depends on reliable product features extracted from textual data, some of the related works about product feature extraction are discussed here. Lim et al. proposed a Bayesian network for modeling user preferences on product features [18]. The model is capable of expressing the uncertainty towards product features, and takes into account a user’s distribution of preferences over all features. A case study of 4 laptop product lines shows that their approach was successful in analyzing in-depth component and platform impact under drifting preferences. Tucker and Kim proposed a machine learning based approach for mining product feature trends in the market from the time series of user preferences [37]. Their proposed model predicts future product trends and automatically classifies product features into three categories: Obsolete, Nonstandard, and Standard features. Other works by Tucker and Kim include mining publicly available customer review data for prod-


Methodology The methodology in the Knowledge Discovery in Databases (KDD) is employed in this research from raw data collection from Social Media to preprocessing, mining and interpreting. First the social media and product specification data are collected and preprocessed. The preprocessed data then is used to model

2 Though one could infer the relationship among Twitter users by constructing communities based on the Reply-To connections, such connections are sparse and spurious. These are not taken into account in most network-based leader identification algorithms.




the methods to extract product features and identify potential lead users. 3.1

market space. The last step is to identify the lead users of each product s, and the global lead users across all the products in S. The three main components (as shown in bold-grey objective boxes in Figure 1) are proposed and comprehensively investigated in this work.

Overview and Definitions


[Figure 1, upper part: Product Specs Documents and Social Media Data are each passed to Objective 1: Extract Product Features, producing Ground-truth Features and User-Discussed Features respectively.]

Data Collection and Preprocessing

3.2.1 Collecting Product Specification Documents

A product specification document provides the actual non-biased features of the product. These documents will be used to construct the ground-truth features for each chosen product. Two sources of product specification documents are chosen: Wikipedia and the product technical specification manuals from the manufacturers. These two sources are chosen because they are rich in text, which suits the feature extraction algorithm, as it works well on textual data.

3.2.2 Social Media Data Collection

Social media provides a means for people to interact, share, and exchange information and opinions in virtual communities and networks [2]. For generalization, the proposed methodology minimizes the assumptions about the functionalities of social media data, and only assumes that a unit of social media is a tuple of unstructured textual content, a user ID, and a timestamp. Such a unit is referred to as a message throughout the paper. This minimal assumption allows the proposed methodology to generalize across multiple heterogeneous pools of social media such as Twitter, Facebook, Google+, etc.
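The message tuple described above, and the retrieval of messages by product name and variants used in the data selection step, can be sketched as follows. This is only a minimal illustration: the names `Message` and `select_messages` are hypothetical, and case-insensitive substring matching is an assumption, since the paper does not specify how the query is implemented.

```python
from datetime import datetime
from typing import Iterable, List, NamedTuple


class Message(NamedTuple):
    """One unit of social media: unstructured text, a user ID, a timestamp."""
    text: str
    user_id: str
    timestamp: datetime


def select_messages(messages: Iterable[Message],
                    name_variants: Iterable[str]) -> List[Message]:
    """Return messages whose text mentions any variant of a product name.

    Assumption: a case-insensitive substring test stands in for the query.
    """
    needles = [v.lower() for v in name_variants]
    return [m for m in messages
            if any(n in m.text.lower() for n in needles)]


stream = [
    Message("i wish i could use my iPhone as a universal remote control.",
            "u1", datetime(2012, 5, 1, 9, 30)),
    Message("Great weather today", "u2", datetime(2012, 5, 1, 9, 31)),
]
iphone_messages = select_messages(stream, ["iphone", "i-phone"])
```

Keeping the unit this small is what lets the same pipeline run over any service that exposes text, author, and time, without relying on follower graphs or reply links.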

[Figure 1, lower part: Objective 2: Identify Latent Features (output: Latent Features), followed by Objective 3: Identify and Rank Lead Users (output: Lead Users).]

Figure 1. Overview of the proposed methodology


Data Selection and Preprocessing

Social media messages corresponding to each product domain are retrieved by a query of the product’s name (and variants) within the large stream of social media data. The technique developed by Thelwall et al. is employed to quantify the emotion in a message [27]. The algorithm takes a short text as an input, and outputs two values, each of which ranges from 1 to 5. The first value represents the positive sentiment level, and the other represents the negative sentiment level. Two sentiment scores are used instead of just one (with a −/+ sign representing negative/positive sentiment) because research findings have determined that positive and negative sentiment can coexist [11]. The positive and negative scores are then combined to produce an emotion strength score using the following equation:

Figure 1 illustrates the proposed methodology at a high level. A product feature is defined as a noun phrase representing a property of a product. For example, features for smartphones include screen, app, camera, battery life, etc. Let S be the set of all the products in the same domain⁵, F be the set of all features, G be the set of all product specification documents, M be the set of all social media messages, and U be the set of all social media users. For a user u ∈ U, Mu is the set of social media messages composed by u. For s ∈ S, Gs and Ms represent the sets of specification documents and social media messages corresponding to product s, respectively. Similarly, F(Gs) and F(Ms) are the sets of product features extracted from Gs and Ms, respectively. According to Figure 1, for a product s ∈ S, the product specification documents (Gs) and social media messages (Ms) are first collected and preprocessed. The feature extractor then extracts features from both sets of documents and produces a set of ground-truth product features F(Gs) and a set of user-discussed product features F(Ms). Next, F(Gs) and F(Ms) are used to identify the set of product specific latent features F∗(s) and the set of global latent features F∗(S). A latent feature is a product feature that is discussed in social media but does not yet exist in the

$$\text{Emotion Strength (ES)} = \text{PositiveScore} - \text{NegativeScore} \qquad (1)$$

A message is then classified into one of 3 categories based on the sign of the Emotion Strength score (i.e. positive (+ve), neutral (0ve), negative (-ve)). The Emotion Strength scores will later be used to identify whether a particular message conveys a positive or negative attitude towards a particular product or product feature. The positive sentiment messages will then be used

5 A product domain is a set of products that belong to the same category, e.g. smartphone, automobile, laptop, etc.




to approximate the demand of a particular product, as proposed in [32]. The approximated demand will be used in the computation of the ranking scores in order to find the global product lead users.
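The sign-based classification and the demand approximation can be sketched as below. This is a minimal illustration rather than the authors' implementation: it assumes both scores are magnitudes in the range 1 to 5 and that Emotion Strength is the positive score minus the negative score, so that a positive sign corresponds to positive sentiment. The function names are hypothetical, and the sentiment scores themselves would come from an external sentiment tool.

```python
def emotion_strength(positive_score: int, negative_score: int) -> int:
    """Combine the two sentiment levels (each 1..5) into one signed score.

    Assumption: ES = positive - negative, so ES > 0 means positive sentiment.
    """
    for score in (positive_score, negative_score):
        if not 1 <= score <= 5:
            raise ValueError("sentiment levels must be between 1 and 5")
    return positive_score - negative_score


def classify(es: int) -> str:
    """Map the sign of the Emotion Strength score to a category label."""
    if es > 0:
        return "+ve"
    if es < 0:
        return "-ve"
    return "0ve"


def positive_volume(score_pairs) -> int:
    """Count positive messages; used as a proxy for product demand."""
    return sum(1 for pos, neg in score_pairs
               if classify(emotion_strength(pos, neg)) == "+ve")
```

For example, a message scored (4, 1) is classified as positive, one scored (2, 2) as neutral, and one scored (1, 5) as negative.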

Algorithm 1: The feature extraction algorithm from a collection of documents


Objective 1: Product Feature Extraction from Textual Data

For each product s ∈ S, the proposed methodology extracts the ground-truth product features F(Gs) from the set of non-biased product specification documents Gs that describe the actual features of the product, and extracts the user-discussed features F(Ms) from the set of social media messages Ms related to the product s. Since both Gs and Ms are merely collections of plain text documents, the same feature extraction algorithm can be applied to both. Note that the proposed methodology chooses to extract product features from textual data, even though well structured data on market products is available, because textual data allows distinct features that may not be well standardized to be captured. For example, iPhone products have the ability to interpret shaking as an input (to shuffle songs, etc.); however, this feature is not listed in popular product databases of the iPhone, including GSMArena and PhoneArena. The feature is, however, well described in the iPhone Wikipedia page and in the iPhone user guide provided by the manufacturer. (Moreover, the extraction results show that the feature extraction algorithm is able to capture the shaking feature of the iPhone model.) Extracting product features from textual data is one of the challenging extraction problems in the Information Retrieval literature. In this paper, a number of feature extraction algorithms proposed in [14, 15, 21, 32] are considered. Of these algorithms, the authors only have access to the core implementations of [15, 32], and choose to extend the algorithm proposed by Huang et al. [15].
Though neither feature extraction algorithm requires domain knowledge about the products and both are suitable for the task addressed in this research, Huang et al.’s algorithm is extended because it can process large, dynamic datasets more quickly and is able to extract consumers’ opinions associated with each extracted feature. The original feature extraction algorithm proposed by Huang et al. was used to extract features of restaurants in the Seattle area from Yelp reviews [15]. The algorithm is modified so that it can handle noisy data such as social media data more efficiently. The feature extraction algorithm used in this paper is outlined in Algorithm 1. The input is a collection of documents D. Note that this can be either a collection of product specification documents (i.e. Gs) or a set of social media messages (i.e. Ms). The algorithm then preprocesses each document by cleaning residuals such as symbols, hyperlinks, usernames, and tags, and correcting misspelled words. Such noise is ubiquitous



in social media and could cause erroneous results. The Stanford Part of Speech (POS) Tagger is used to tag each word with an appropriate part of speech. This step is required because a product feature is defined to be a noun phrase. The final step of the preprocessing phase extracts potential multi-word features and keeps them for later lookup. A multi-word feature is a feature composed of 2 or more words, such as on-screen keyboard and Facebook notification. The main part of the algorithm iteratively learns to identify features and generates a set of extractions (E) from the input collection of documents. Each extraction e ∈ E is a tuple ⟨feature, opinion, frequency⟩ such as ⟨‘onscreen keyboard’, ‘fantastic’, 5⟩, which indicates that the onscreen keyboard feature of this specific product was described as fantastic 5 times. The algorithm employs a bootstrapping method which is initialized with a small set of ground-truth features. The algorithm then repeatedly learns phrase templates surrounding the seed features, and uses the templates to extract more features. This process continues until the extraction set stops growing. Finally, the algorithm post-processes the extractions by disambiguating and normalizing the features. The disambiguation process involves stemming the features using Porter’s







en US/iphone user guide.pdf

Input: D: Set of free-text documents from which to extract product features.
Output: E: Set of extractions. Each e ∈ E is a tuple ⟨feature, opinion, frequency⟩, for example e = ⟨‘onscreen keyboard’, ‘fantastic’, 5⟩.

preprocessing;
for d ∈ D do
    Clean d;
    POS tag d;
    Extract multi-word features;
end
initialization;
E = ∅;
T = ∅;
F = Seed Features;
while E can still grow do
    Learn templates from seed features;
    Add new template to T;
    foreach d ∈ D do
        foreach Sentence s ∈ d do
            e ← Extract potential feature-opinion pair using T;
            Add e to E;
        end
    end
    Update F;
end
E ← Clustering and normalizing features;
return E;
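The bootstrapping loop of Algorithm 1 can be illustrated with a deliberately simplified sketch. It is not the RevMiner algorithm or the authors' extension of it: here a "template" is assumed to be just the single connector word that follows a known feature, the opinion is assumed to be the word right after that connector, and sentences are assumed to be pre-tokenized and cleaned. The function name `bootstrap_extract` is hypothetical.

```python
from collections import Counter


def bootstrap_extract(sentences, seed_features):
    """Toy bootstrapping: learn connector templates, extract, repeat.

    sentences: list of token lists. Returns Counter[(feature, opinion)].
    """
    features = set(seed_features)
    templates = set()       # connector words seen right after a known feature
    extractions = Counter()
    while True:
        # Learn templates from the currently known features.
        for tokens in sentences:
            for i in range(len(tokens) - 2):
                if tokens[i] in features:
                    templates.add(tokens[i + 1])
        # Apply every template to every sentence.
        found = Counter()
        for tokens in sentences:
            for i in range(len(tokens) - 2):
                if tokens[i + 1] in templates:
                    found[(tokens[i], tokens[i + 2])] += 1
        grown = features | {feature for feature, _ in found}
        # Stop once neither the extractions nor the feature set change.
        if found == extractions and grown == features:
            return extractions
        extractions, features = found, grown


corpus = [
    ["the", "screen", "is", "fantastic"],
    ["the", "battery", "is", "terrible"],
    ["my", "camera", "looks", "great"],
    ["battery", "looks", "weak"],
]
result = bootstrap_extract(corpus, {"screen"})
```

Starting from the single seed `screen`, the first pass learns the connector `is` and finds `battery`; the second pass learns `looks` from `battery` and thereby reaches `camera`, which the seed alone could not have found. The loop terminates because the template and feature sets can only grow and are bounded by the vocabulary.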



stemming algorithm and clustering them using WordNet synsets. This post-processing step groups together instances of the same feature that may be written differently (e.g. Screen, Monitor, Screens, and Monitors would be grouped together). Once the set of extractions E is generated, Feature(E) and Opinion(E) are defined to be the sets of distinct features and distinct opinions, respectively. Hence, given a collection of documents associated with a product s (either Gs or Ms), the feature extraction algorithm is able to extract a set of features related to the product (referred to as F(Gs) or F(Ms), respectively).
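The cleaning part of the preprocessing phase (removing hyperlinks, usernames, tags, and stray symbols) can be sketched with regular expressions. The patterns below are assumptions chosen for Twitter-style text and are not taken from the paper; spelling correction, POS tagging, stemming, and synset clustering are left out because they depend on external tools.

```python
import re

URL = re.compile(r"https?://\S+")
USERNAME = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")
# Assumption: keep letters, digits, whitespace and basic punctuation only.
SYMBOL = re.compile(r"[^A-Za-z0-9\s'.,!?-]")
SPACES = re.compile(r"\s+")


def clean_message(text: str) -> str:
    """Strip hyperlinks, usernames, tags and leftover symbols from a message."""
    for pattern in (URL, USERNAME, HASHTAG, SYMBOL):
        text = pattern.sub(" ", text)
    return SPACES.sub(" ", text).strip()


cleaned = clean_message("@bob I love the #iPhone4 screen!! http://t.co/xyz :)")
```

URLs are removed first so that the later symbol pattern does not break them into fragments such as "http" and "t.co", which the paper notes can otherwise surface as noisy features.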

3.5 Objective 2: Identifying Latent Features

The proposed methodology defines a latent feature of a product domain as a feature that does not exist in any existing product within the domain. In other words, a latent feature is a feature that has not yet been implemented in any product in the market space. Under this assumption, one could automatically identify the set of latent features by subtracting the set of ground-truth features of all products from the set of user-discussed features. Mathematically, given a product domain S, the set of global latent features F∗(S) and the set of product specific latent features of the product s, F∗(s), are defined as:

$$F^*(S) = \bigcup_{s \in S} F(M_s) - \bigcup_{s \in S} F(G_s) \qquad (2)$$

$$F^*(s) = F(M_s) \cap F^*(S) \qquad (3)$$

In order to quantify the meaningfulness of each extracted latent feature (since some features could be just noise or remnants caused by algorithmic flaws, such as “http://”, “i mean i”, etc.), the metric Feature Frequency-Inverse Product Frequency (FF-IPF) is developed, with an intuition borrowed from the Term Frequency-Inverse Document Frequency (TF-IDF) metric of the Information Retrieval (IR) field [20]. In the IR field, TF-IDF is widely used for ranking words by their importance with respect to the documents in which they appear and the whole collection of documents. TF-IDF has two components: the term frequency (TF) and the inverse document frequency (IDF). The TF is the frequency of a term appearing in a document. The IDF of a term measures how important the term is to the corpus, and is computed based on the document frequency, i.e. the number of documents in which the term appears. Similarly, one could think of a product as being a document that is composed of a bag of features (instead of a bag of words). With such an assumption, one could adopt a TF-IDF style metric in order to quantify the importance of each feature of a product. A latent feature can emerge from 2 sources: consumers’ needs and consumers’ creativity. If multiple products lack a certain feature that would satisfy a majority of consumers, then a high volume of discussion regarding the need for such a feature would be expected. Feature Frequency quantifies this. On the other hand, if a latent feature originates from users’ creativity, such a feature must be unique, which can be quantified by the Inverse Product Frequency. A good latent feature must be both needed and creative, hence a combined FF-IPF metric is used to quantify the meaningfulness of each latent feature. Mathematically, given a product collection S and a set of latent features F∗(S), the FF, IPF, and FF-IPF of a latent feature f ∈ F∗(S) are defined as:

$$FF(f, F^*) = 0.5 + 0.5 \cdot \frac{|Frequency(f)|}{\sum_{f' \in F^*} |Frequency(f')|} \qquad (4)$$

$$IPF(f, S) = \log \frac{|S|}{|\{s \in S : f \in s\}|} \qquad (5)$$

$$FF\text{-}IPF(f, F^*, S) = FF(f, F^*) \cdot IPF(f, S) \qquad (6)$$

The set of extracted latent features will be used to identify consumers who possess and express innovative ideas, who are referred to as lead users.

3.6 Objective 3: Identifying and Ranking Lead Users

Berthon defines lead users as those who experience needs still unknown to the public and who also benefit greatly if they obtain a solution to these needs [6]. This section discusses in detail how product specific and global lead users are identified and ranked from the heterogeneous pool of social media users. Recall that a product specific lead user is a consumer who has expertise and is knowledgeable in a particular product, while a global lead user has critical and innovative ideas about all the products in a particular domain.
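The latent feature definitions and the FF-IPF metric from Objective 2 can be sketched with plain set operations. This is a minimal illustration under stated assumptions: feature sets are given as dictionaries keyed by product, `frequency` is a hypothetical mapping from a feature to how often it is discussed, and the condition f ∈ s in the IPF is read as "f is discussed for product s", since a latent feature by definition appears in no ground-truth set.

```python
import math


def global_latent_features(discussed_by_product, truth_by_product):
    """F*(S): all user-discussed features minus all ground-truth features."""
    discussed = set().union(*discussed_by_product.values())
    truth = set().union(*truth_by_product.values())
    return discussed - truth


def product_latent_features(product, discussed_by_product, latent):
    """F*(s): latent features that are discussed for product s."""
    return discussed_by_product[product] & latent


def ff(feature, frequency, latent):
    """Feature Frequency, smoothed into the range (0.5, 1.0]."""
    total = sum(frequency[f] for f in latent)
    return 0.5 + 0.5 * frequency[feature] / total


def ipf(feature, discussed_by_product):
    """Inverse Product Frequency over products that discuss the feature."""
    containing = sum(1 for feats in discussed_by_product.values()
                     if feature in feats)
    return math.log(len(discussed_by_product) / containing)


def ff_ipf(feature, frequency, latent, discussed_by_product):
    return ff(feature, frequency, latent) * ipf(feature, discussed_by_product)


discussed = {"A": {"screen", "solar panel"},
             "B": {"screen", "solar panel", "remote control"}}
truth = {"A": {"screen"}, "B": {"screen", "camera"}}
latent = global_latent_features(discussed, truth)
frequency = {"solar panel": 3, "remote control": 1}
```

In this toy domain, "solar panel" is discussed for every product, so its IPF is log(2/2) = 0 and its FF-IPF vanishes despite a high frequency, whereas "remote control" is discussed for only one product and keeps a non-zero score. This shows how the metric trades off need (FF) against uniqueness (IPF).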

3.6.1 Identifying Lead Users for a Particular Product

The proposed methodology automatically identifies lead users in a pool of social media users by detecting users who express innovative ideas about the products that they use or are familiar with. Specifically, given a user u ∈ U and a product s ∈ S, the methodology computes P(u|s), the probability that the user u is a lead user of the product s. This probability is referred to as the product specific iScore (or the Innovative Score), which will be used later for ranking users. The users with the highest product specific iScores are regarded as the product specific lead users. Algorithm 2 outlines the procedure for assigning a product specific iScore to a user given a particular product s. P(u|s) can be thought of as the likelihood that the user u is a lead user for the product s, and is defined as:

$$P(u|s) = \sum_{f \in F(M_u)} P(u|f, s) \cdot P(f|s)$$

where:

$$P(u|f, s) = \begin{cases} 1 & ; f \in F^*(S) \\ 0 & ; \text{otherwise} \end{cases}, \qquad P(f|s) = \frac{1}{|F(G_s) \cup F(M_s)|}$$







Algorithm 2: Algorithm for identifying and ranking product specific lead users of a particular product s


This section introduces a case study used to verify the proposed methodology and discusses the results.

Input: s ∈ S: The product. U: The set of all users. F(Gs): Ground-truth features. F(Ms): User-discussed features. F∗(s): Latent features.
Output: Ranked list of users with respect to P(u|s)

initialization;
I = ∅;
foreach user u ∈ U do
    Mu ← The messages posted by u;
    Compute F(Mu) using Algorithm 1;
    iScore ← Compute P(u|s);
    Add ⟨u, iScore⟩ to I;
end
I ← Rank users in I by iScores;
return I
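Algorithm 2 can be sketched in a few lines once the per-user feature sets are available. The sketch below assumes that F(Mu) has already been extracted for each user and is passed in as a set, and it reads the product specific iScore as the number of latent features a user mentions divided by |F(Gs) ∪ F(Ms)|, which is what the definition of P(u|s) reduces to. The function names and the tie-breaking by user ID are illustrative choices, not part of the paper.

```python
def product_iscore(user_features, latent, truth_s, discussed_s):
    """P(u|s): latent features mentioned by the user, each weighted by P(f|s)."""
    denominator = len(truth_s | discussed_s)
    if denominator == 0:
        return 0.0
    hits = sum(1 for f in user_features if f in latent)
    return hits / denominator


def rank_lead_users(features_by_user, latent, truth_s, discussed_s):
    """Return (user, iScore) pairs, highest iScore first, ties by user ID."""
    scored = [(user, product_iscore(feats, latent, truth_s, discussed_s))
              for user, feats in features_by_user.items()]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


truth_s = {"screen", "camera"}
discussed_s = {"screen", "solar panel", "remote control"}
latent = {"solar panel", "remote control"}
features_by_user = {
    "alice": {"solar panel", "remote control", "screen"},
    "bob": {"screen"},
    "carol": {"solar panel"},
}
ranking = rank_lead_users(features_by_user, latent, truth_s, discussed_s)
```

Here the denominator is 4, so alice scores 2/4, carol 1/4, and bob 0; a user who only discusses features that already exist never gains any score.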


4.1.1 Smartphone Specification Data The groundtruth specifications of each smartphone model are collected from both and the product specification manual provided by the manufacturer (as a PDF document). Only textual information is extracted from each product specification document since the feature extraction algorithm used in this research only works with textual data.

3.6.2 Identifying Global Lead Users within the Product Domain In order to identify the global lead users across all the products in the product space S, the global iScore (or P(u)) is computed for each user. Top users with highest global iScores are regarded as the global lead users of the product domain S.

∑ P(u|s) · P(s)

4.1.2 Product Related Twitter Data Twitter14 is a microblog service that allows its users to send and read text messages of up to 140 characters, known as tweets. The Twitter dataset used in this research was collected randomly using the provided Twitter API, and comprises 2,117,415,962 (˜ 2.1 billions) tweets in the United States during the period of 31 months, from March 2011 to September 2013.



Based on the law of total probability, P(u) can be computed as the sum of proportional P(u|s) across each product s ∈ S. P(s) is the probability of the product s being known and demanded by the market. Tuarob and Tucker found that the volume of the positive sentiment in social media corresponding to a particular product can be used to quantify the product demand which they found to directly correlate with the actual product sales [32]. In this work, the proposed methodology instantiates such findings and proposes to approximate P(s) with the proportion of positive sentiment over all the products in the same domain, i.e.:

P(s) =

|Positive(s)| ′ ∑s ∈S |Positive(s′ )|

Case Study

A case study of 27 smartphone products is presented that uses social media data (Twitter data) to mine relevant product design information. Data pertaining to product specifications from the smartphone domain is then used to validate the proposed methodology. The selected smartphone models include BlackBerry Bold 9900, Dell Venue Pro, HP Veer, HTC ThunderBolt, iPhone 3G, iPhone 3GS, iPhone 4, iPhone 4S, iPhone 5, iPhone 5C, iPhone 5S, Kyocera Echo, LG Cosmos Touch, LG Enlighten, Motorola Droid RAZR, Motorola DROID X2, Nokia E7, Nokia N9, Samsung Dart, Samsung Exhibit 4G, Samsung Galaxy Nexus, Samsung Galaxy S 4G, Samsung Galaxy S II, Samsung Galaxy Tab, Samsung Infuse 4G, Sony Ericsson Xperia Play, and T-Mobile G2x.

Equation 7 is directly expanded using the law of total probability, which sums over the all the features expressed by the user u related to the product s (i.e. F(Mu )). P(u| f , s) is the probability of the user u being the lead user given a feature f , and is defined to be 1 if f is a latent feature, and 0 otherwise. Finally, P( f |s) is the probability of a user expressing the feature f , and can be computed directly from the pool of all features related to the product s.

P(u) =

Case Study, Results, and Discussion

Tweets related to a product are collected by detecting the presence of the product name (and variants), and preprocessed by cleaning and mapping sentiment level as discussed in Section 3.3. Table 1 lists the number of tweets, percentage positive sentiment, and number of unique Twitter users of each chosen smartphone model. The percentage positive sentiment of a prod|Positive(s)| uct s is calculated by |All Tweets(s)| · 100%, where Positive(s) is the number of positive tweets related to the product s. Figure 2 displays the monthly Twitter discussion share of each chosen smartphone model throughout the 31 month period of data collection. Note that, since some smartphone models (i.e. the iPhones) have enormous discussion shares compared to the others, the proposed methodology has to take normalization into account when comparing one product to another.


Positive(s) is the set of positive messages associated with the product s. 7
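As a minimal illustration of the sentiment-based quantities used here, the following Python sketch computes the percentage of positive sentiment of a product and the approximation of P(s) as a share of positive messages. The function names and the three-product counts are hypothetical and chosen for illustration only; they are not taken from the case study.

```python
from typing import Dict


def percent_positive(num_positive: int, num_tweets: int) -> float:
    """Percentage of positive tweets: |Positive(s)| / |All Tweets(s)| * 100."""
    if num_tweets == 0:
        return 0.0
    return 100.0 * num_positive / num_tweets


def product_priors(positive_counts: Dict[str, int]) -> Dict[str, float]:
    """Approximate P(s) as |Positive(s)| divided by the total over all products."""
    total = sum(positive_counts.values())
    if total == 0:
        raise ValueError("no positive messages in the product space")
    return {s: n / total for s, n in positive_counts.items()}


# Hypothetical positive-message counts for a three-product space (not from the paper).
positive_counts = {"phone_a": 60, "phone_b": 30, "phone_c": 10}
priors = product_priors(positive_counts)
share_a = percent_positive(25, 100)
```

Because P(s) is a share over the whole product space, the values sum to one, so a heavily discussed product receives a proportionally larger weight when the product specific scores are later combined.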


[Figure 2: chart of the discussion proportion (0% to 100%) of each smartphone model in Twitter, per month, with one series per smartphone model.]

FIGURE 2. Monthly distribution of Twitter discussion of each smartphone model across the 31 month period of data collection.


Objective 1: Product Feature Extraction from Textual Data

Given a product s ∈ S, the feature extraction algorithm (see Algorithm 1) is applied to the product specification documents (Gs) in order to obtain the ground-truth features (F(Gs)), and to the tweets related to the product (Ms) in order to extract the features discussed by the Twitter users (F(Ms)). Table 2 enumerates the number of extracted ground-truth features, the number of user-discussed features, and the number of product specific latent features for each model. Recall that a product specific latent feature of a product s is a feature that is mentioned in the set of social media messages related to s and does not appear in the ground-truth features of any product in the product space S.
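The definition of a product specific latent feature can be sketched with plain set operations. The Python fragment below is an illustration under stated assumptions: the feature sets are hypothetical stand-ins for the output of Algorithm 1, the function name is invented for this example, and the global latent set is assumed here to be the union of the product specific sets.

```python
from typing import Dict, Set


def latent_features(
    ground_truth: Dict[str, Set[str]],
    discussed: Dict[str, Set[str]],
) -> Dict[str, Set[str]]:
    """For each product s, keep the discussed features F(M_s) that appear in
    no product's ground-truth features F(G_s') anywhere in the product space."""
    all_ground_truth: Set[str] = set().union(*ground_truth.values())
    return {s: feats - all_ground_truth for s, feats in discussed.items()}


# Hypothetical feature sets (not from the paper).
ground_truth = {
    "phone_a": {"camera", "bluetooth"},
    "phone_b": {"camera", "waterproof"},
}
discussed = {
    "phone_a": {"camera", "waterproof", "solar panel"},
    "phone_b": {"solar panel", "tooth pick"},
}

latent = latent_features(ground_truth, discussed)
# Assumption: the global latent set is the union of the product specific sets.
global_latent = set().union(*latent.values())
```

In this toy data, "waterproof" is discussed for phone_a but is not latent, because it already appears in the ground-truth features of phone_b.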





[Figure 3: histogram of the number of latent features per FF-IPF bin (x-axis: FF-IPF (Bin), from 0 to 1.4), with a 2-period moving-average trend line.]

FIGURE 3. Histogram showing the distribution of the Feature Frequency-Inverse Product Frequency (FF-IPF) scores of 25,816 total extracted global latent features.


TABLE 1. Selected smartphone models, their associated number of tweets, proportion of positive sentiment tweets (in percent), and number of unique users who posted these tweets.

Model                      NumTweets  % Positive  NumUsers
BlackBerry Bold 9900             308      36.04%       252
Dell Venue Pro                    96      46.88%        64
HP Veer                          143      31.47%       110
HTC ThunderBolt                 1157      30.68%       851
iPhone 3G                       2154      25.63%      1874
iPhone 3GS                      3803      28.06%      3119
iPhone 4                       68860      28.92%     43957
iPhone 4S                      63500      29.53%     39145
iPhone 5                      211311      28.66%    124461
iPhone 5C                       5533      24.62%      4475
iPhone 5S                      15808      26.45%     12417
Kyocera Echo                      52      26.92%        42
LG Cosmos Touch                   23      39.13%        20
LG Enlighten                      18      16.67%        17
Motorola Droid RAZR             2535      32.54%      1981
Motorola DROID X2                471      26.75%       378
Nokia E7                          26      30.77%        18
Nokia N9                         208      34.13%       153
Samsung Dart                      29      20.69%        28
Samsung Exhibit 4G                23      39.13%        22
Samsung Galaxy Nexus            5218      31.07%      2988
Samsung Galaxy S 4G              188      31.91%       152
Samsung Galaxy S II             4599      31.12%      3517
Samsung Galaxy Tab              3989      30.96%      2578
Samsung Infuse 4G                284      34.15%       215
Sony Ericsson Xperia Play        481      26.20%       325
T-Mobile G2x                      83      32.53%        69

TABLE 2. Numbers of extracted ground-truth (base) features, user-discussed (user) features, and product specific latent features of each smartphone model.

Model                      # Base Features  # User Features  # Latent Features
BlackBerry Bold 9900                  1126              126                101
Dell Venue Pro                         497               50                 36
HP Veer                               1206               76                 56
HTC ThunderBolt                        627              335                281
iPhone 3G                             1330              532                420
iPhone 3GS                             891              775                652
iPhone 4                               995             6057               5720
iPhone 4S                              963             5922               5582
iPhone 5                              1020            13493              13050
iPhone 5C                              895              833                717
iPhone 5S                              973             1962               1740
Kyocera Echo                           895               22                 16
LG Cosmos Touch                        769               11                  6
LG Enlighten                          1084                5                  1
Motorola Droid RAZR                    582              593                496
Motorola DROID X2                      504              162                138
Nokia E7                               749               14                 10
Nokia N9                               745               83                 62
Samsung Dart                          1178               10                  6
Samsung Exhibit 4G                    1331               10                  7
Samsung Galaxy Nexus                   456             1147               1017
Samsung Galaxy S 4G                   1322               62                 37
Samsung Galaxy S II                   1319              801                662
Samsung Galaxy Tab                     771              884                762
Samsung Infuse 4G                     1121               85                 60
Sony Ericsson Xperia Play              726              132                102
T-Mobile G2x                           945               39                 23

Objective 2: Identifying Latent Features

A set of 25,816 global latent features (F*(S)) is extracted from the smartphone related social media data. An FF-IPF score is calculated for each latent feature. Figure 3 plots the distribution of the FF-IPF scores as a histogram, with a moving-average trend line. The distribution is concentrated heavily at the high end of the score range, with the counts growing roughly exponentially toward the higher FF-IPF bins. This suggests that a majority of the extracted latent features are meaningful (i.e. not noisy or erroneous features). Latent features with FF-IPF scores lower than 1.1 are treated as noise and eliminated, leaving a set of 22,285 global latent features for further processing. Table 3 lists the top 5 extracted global latent features with the highest FF-IPF scores, along with tweets that provide contextual information about these latent features. These top 5 latent features reflect actual consumer needs that have not been satisfied. These innovative opinions (as interpreted from the sample tweets associated with each latent feature) could be critical when designing next generation products. For example, consumers express a need for a waterproof feature for their iPhones, and some users believe that a solar panel could be embedded underneath the iPhone screen so that the phone could charge itself when exposed to sunlight. Note that the latent feature hybrid in the given example could be interpreted as either energy-source related or physical-feature related. This problem arises when a feature term is used to refer to more than one distinct feature, and motivates future work on semantic disambiguation of feature representations.
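The noise-filtering step of Objective 2 can be sketched as follows. Only the 1.1 cut-off comes from the text. The exact FF-IPF formula is not restated in this section, so the scoring function below assumes a plain TF-IDF analogue (a normalized feature frequency multiplied by the base-10 logarithm of the number of products over the number of products whose messages mention the feature); this form, the function names, and the sample scores are assumptions for illustration.

```python
import math
from typing import Dict


def ff_ipf(feature_freq: float, product_freq: int, num_products: int) -> float:
    """ASSUMED TF-IDF-style form: feature frequency * log10(N / product frequency).
    This is an illustrative stand-in, not the paper's stated definition."""
    if product_freq <= 0:
        raise ValueError("product_freq must be positive")
    return feature_freq * math.log10(num_products / product_freq)


def drop_noise(scores: Dict[str, float], threshold: float = 1.1) -> Dict[str, float]:
    """Eliminate latent features whose FF-IPF score is lower than the threshold."""
    return {f: s for f, s in scores.items() if s >= threshold}


# Hypothetical FF-IPF scores (not from the paper).
scores = {"waterproof": 1.3, "solar panel": 1.1, "http://": 0.2}
kept = drop_noise(scores)

# Under the assumed form, a feature tied to 1 of 27 products scores log10(27),
# while a feature mentioned for all 27 products scores 0.
rare = ff_ipf(1.0, 1, 27)
common = ff_ipf(1.0, 27, 27)
```

Under this assumed form, a feature that shows up for every product (such as a URL fragment) is pushed toward zero, which matches the intuition that such terms are extraction noise rather than product features.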

TABLE 3. Top 5 latent features across the chosen smartphone models, FF-IPF scores, and example tweets related to the latent features.

[FF-IPF column: values not recoverable.]

Latent Feature      Example
Waterproof          I hope Apple incorporates some of that new waterproof technology in the iPhone 5
                    iPhone 5 better be waterproof, shockproof, scratchproof, thisproof, thatproof, and all the rest of the proofs for $800
Solar Panel         ... and what else would make the iPhone 5 even better, built in solar power charging!
                    U know with all the glass in the iPhone 4 they really should think about integrating a solar panel to recharge the battery.
Hybrid              I wish there was an #android phone out there that was a hybrid of the best features on the droid razr maxx and the galaxy nexus.
                    I need a hybrid-iPhone 4s so the battery can hold on all day when I'm at #vmworld. Steve, are you listening? :)
Tooth Pick          I hope iPhone 5 borrows from Swiss Army and finally adds a removable tooth pick.
[not recoverable]   My life would be 827492916 times better if my iHome took my iPhone 5
                    First world problem: mad because my iPhone 5 is not compatible with this iHome dock in the hotel room.

Objective 3: Identifying and Ranking Lead Users

Once a set of latent features (F*(S)) is identified, the product specific and global iScores can be computed for each user in order to identify both product specific and global lead users. For each product s, P(u|s) is computed according to Equation 7 for each of the users in the pool of 198,974 Twitter users who tweet about their smartphone products. Then P(u) is computed according to Equation 9. Figure 4 plots the average iScore (both product specific iScore and global iScore) of the top 100 lead users for each product and of the top 100 global lead users (bound to the primary (left) Y-axis). The figure also plots, for these top 100 lead users, the average number of tweets about the product and the average number of all tweets about smartphones (bound to the secondary (right) Y-axis).

[Figure 4: chart of the average iScore of the top 100 lead users (Avg iScore @TOP100, primary Y-axis) and the number of Twitter messages (Avg Num Msg @TOP100 and Avg Num All Msg @TOP100, secondary Y-axis).]

FIGURE 4. Average product specific iScore (i.e. P(u|s)) and global iScore (i.e. P(u)) of the top 100 lead users across all the selected smartphone models, along with the average number of tweets related to each smartphone model (Avg Num Msg @TOP100) and the average number of tweets related to smartphones in general (Avg Num All Msg @TOP100).

Table 4 lists some Twitter comments of the top lead user of each of five sample smartphone models (i.e. Samsung Galaxy Nexus, HTC ThunderBolt, iPhone 5, Sony Ericsson Xperia Play, and Kyocera Echo). These tweets carry innovative ideas for improving these products. For example, a lead user suggests that the Siri functionality in the iPhone 5 should be able to do more than just talk (the user might be suggesting that the iPhone 5 could connect to external hardware to enable Siri to perform physical interactions). Furthermore, one lead user of the Sony Ericsson Xperia Play, a smartphone that emphasizes gaming functionality, suggests incorporating the ability to use PlayStation 3 controllers with the phone. These product specific lead users experience needs for product improvement during product usage. Identifying such product specific lead users would allow companies to seek solutions and innovative ideas for their next generation products without having to go through the traditional costly, time-consuming lead user selection process.

Oftentimes a lead user can be critical about product features across multiple products (not just his/her own products). Identifying these global lead users could bring out experts who could give better critical product development ideas. For this, all the users are ranked based on their P(u) scores. Table 5 lists Twitter messages that suggest innovative ideas about smartphone features, posted by the top global lead users with the highest global iScores.
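The scoring and ranking described for Objective 3 can be sketched in a few lines of Python. This is a minimal illustration of Equation 7 (product specific iScore) and Equation 9 (global iScore), assuming that the feature sets have already been extracted; the function names and all toy data are hypothetical and are not taken from the case study.

```python
from typing import Dict, List, Set, Tuple


def product_iscore(user_features: Set[str], latent: Set[str], feature_pool: Set[str]) -> float:
    """Equation 7: sum over f in F(M_u) of P(u|f,s) * P(f|s), where
    P(u|f,s) is 1 for latent features (0 otherwise) and
    P(f|s) = 1 / |F(G_s) ∪ F(M_s)| (feature_pool is that union)."""
    if not feature_pool:
        return 0.0
    p_f = 1.0 / len(feature_pool)
    return sum((p_f for f in user_features if f in latent), 0.0)


def global_iscore(per_product: Dict[str, float], priors: Dict[str, float]) -> float:
    """Equation 9: P(u) = sum over products s of P(u|s) * P(s)."""
    return sum(score * priors.get(s, 0.0) for s, score in per_product.items())


def rank_users(scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Rank users by iScore, highest first (last step of Algorithm 2)."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data for a single product (not from the paper).
feature_pool = {"camera", "bluetooth", "waterproof", "solar panel"}
latent = {"waterproof", "solar panel"}
user_features = {
    "user_1": {"camera", "waterproof", "solar panel"},
    "user_2": {"camera"},
}

scores = {u: product_iscore(f, latent, feature_pool) for u, f in user_features.items()}
ranking = rank_users(scores)

# Global score of one user over two hypothetical products with assumed P(s) values.
p_u = global_iscore({"phone_a": 0.5, "phone_b": 0.0}, {"phone_a": 0.6, "phone_b": 0.4})
```

In this toy example, user_1 mentions two latent features out of a pool of four and therefore receives a product specific iScore of 0.5, while user_2 mentions no latent feature and scores 0.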

TABLE 4. Sample tweets from the top lead user of each of five sample smartphone models. These tweets suggest innovative product improvements for each corresponding product.

Model                      Product iScore  Sample Twitter Message
Samsung Galaxy Nexus       0.0496          I wish there was an #android phone out there that was a hybrid of the best features on the droid razr maxx and the galaxy nexus.
HTC ThunderBolt            0.0308          HTC Thunderbolt #fail: Connect phone to PC to access drivers on included SD card ... but need drivers installed to access SD card from PC
iPhone 5                   0.0174          but unless Siri can do more that just talk ...I'm not sold! #iPhone5
Sony Ericsson Xperia Play  0.0085          Hmm.. Playing games supporting Xperia Play controls. Wish I could use PS3 controller .. Makes me want an LTE Xperia Play with Tegra3..
Kyocera Echo               0.0077          Kyocera Echo needs to develop its own apps.

TABLE 5. Sample tweets from the top 5 global lead users of the smartphone domain. These tweets suggest product innovation.

Global iScore  Sample Twitter Message
0.0127         I wish there were a tweak for the iPhone 4S that would indicate "4G" instead of just 3G when I'm connected with a HSDPA+ connection.
0.0126         If you trust my instinct, the iPhone 5S will come in multiple colors and two display sizes
0.0113         Very exciting Siri on the iPhone 4S activates when you "raise it to your ear" that'd b awesome.
0.0107         I wish i could use my iPhone as a universal remote control.
0.0105         Since iPhone already does fingerprint, Sumsung should scan eyes.

Conclusions and Future Works

This paper presents a data mining driven methodology to identify innovative consumers, or lead users, from a heterogeneous pool of social media users. The methodology comprises three main steps. First, ground-truth product features are extracted from the product specification documents, and user-discussed features are extracted from social media data. Second, latent features (unrealized features) are extracted from the ground-truth and user-discussed features across all the products in the product space. Third, the product specific and global innovative scores (iScores) are computed for each user in the user space. The users with the top product specific iScores are then regarded as the lead users of that product, and the users with the top global iScores are regarded as the global lead users. A case study of 27 real-world smartphone models with 31 months' worth of Twitter data is presented. The results and selected examples suggest that the proposed methodology is promising for automatically identifying potential lead users from the pool of social media users for next generation product development. Future work could strengthen the evaluation process by involving user studies, and verify the generalizability of the proposed methods by examining diverse case studies of different product domains and social media services. Machine learning based techniques that allow multiple machines to learn different aspects of social media data, such as [7, 30, 31, 33, 34], could be applied to enhance the performance of the feature extraction algorithm.


References

[1] What is big data? – Bringing big data to the enterprise. data/bigdata/, 2013. [Online; accessed 16 August 2013].
[2] Toni Ahlqvist. Social media roadmaps: exploring the futures triggered by social media. VTT, 2008.
[3] Sinan Aral and Dylan Walker. Identifying influential and susceptible members of social networks. Science, 337(6092):337–341, 2012.
[4] Carliss Baldwin and Eric Von Hippel. Modeling a paradigm shift: From producer innovation to user and open collaborative innovation. Harvard Business School Finance Working Paper, (10-038):4764–09, 2010.
[5] D.A. Batallas and A.A. Yassine. Information leaders in product development organizational networks: Social network analysis of the design structure matrix. Engineering Management, IEEE Transactions on, 53(4):570–582, 2006.
[6] Pierre R. Berthon, Leyland F. Pitt, Ian McCarthy, and Steven M. Kates. When customers get clever: Managerial approaches to dealing with creative consumers. Business Horizons, 50(1):39–47, 2007.
[7] Sumit Bhatia, Suppawong Tuarob, Prasenjit Mitra, and C. Lee Giles. An algorithm search engine for software developers. In Proceedings of the 3rd International Workshop on Search-Driven Development: Users, Infrastructure, Tools, and Evaluation, SUITE ’11, pages 13–16, New York, NY, USA, 2011. ACM.
[8] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, March 2003.
[9] J. Bollen, H. Mao, and X. Zeng. Twitter mood predicts the stock market. Journal of Computational Science, 2(1):1–8, 2011.
[10] Nigel Collier and Son Doan. Syndromic classification of twitter messages. CoRR, abs/1110.3094, 2011.
[11] E. Fox. Emotion science: cognitive and neuroscientific approaches to understanding human emotions. Palgrave Macmillan, 2008.
[12] Nikolaus Franke, Eric Von Hippel, and Martin Schreier. Finding commercially attractive user innovations: A test of lead-user theory*. Journal of Product Innovation Management, 23(4):301–315, 2006.
[13] Cornelius Herstatt and Eric von Hippel. From experience: Developing new product concepts via the lead user method: A case study in a low-tech field. Journal of Product Innovation Management, 9(3):213–221, 1992.
[14] Minqing Hu and Bing Liu. Mining opinion features in customer reviews. In Proceedings of the 19th National Conference on Artificial Intelligence, AAAI’04, pages 755–760. AAAI Press, 2004.
[15] Jeff Huang, Oren Etzioni, Luke Zettlemoyer, Kevin Clark, and Christian Lee. Revminer: An extractive interface for navigating reviews on a smartphone. In Proceedings of the 25th Annual ACM Symposium on User Interface Software and Technology, UIST ’12, pages 3–12, New York, NY, USA, 2012. ACM.
[16] Yung-Ming Li, Chia-Hao Lin, and Cheng-Yang Lai. Identifying influential reviewers for word-of-mouth marketing. Electronic Commerce Research and Applications, 9(4):294–304, 2010. Special Section: Social Networks and Web 2.0.
[17] Gary L. Lilien, Pamela D. Morrison, Kathleen Searls, Mary Sonnack, and Eric von Hippel. Performance assessment of the lead user idea-generation process for new product development. Manage. Sci., 48(8):1042–1059, August 2002.
[18] Soon Chong Johnson Lim, Ying Liu, and Han Tong Loh. An exploratory study of ontology-based platform analysis under user preference uncertainty. In Proc. ASME 2012 Int. Design Engineering Technical Conf. Computers and Information in Engineering Conf. (IDETC/CIE2012), 2012.
[19] C. Luthje. Characteristics of innovating users in a consumer goods field: An empirical study of sport-related product consumers. Technovation, 24(9):683–695, 2004.
[20] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA, 2008.
[21] Ana-Maria Popescu and Oren Etzioni. Extracting product features and opinions from reviews. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, HLT ’05, pages 339–346, Stroudsburg, PA, USA, 2005. Association for Computational Linguistics.
[22] Takeshi Sakaki, Makoto Okazaki, and Yutaka Matsuo. Earthquake shakes twitter users: real-time event detection by social sensors. In Proceedings of the 19th international conference on World wide web, WWW ’10, pages 851–860, New York, NY, USA, 2010. ACM.
[23] Martin Schreier, Stefan Oberhauser, and Reinhard Prügl. Lead users and the adoption and diffusion of new products: Insights from two extreme sports communities. Marketing Letters, 18(1-2):15–30, 2007.
[24] Sonali Shah. Sources and patterns of innovation in a consumer products field: Innovations in sporting equipment. Online verfügbar unter http://opensource.mit.edu/papers/shahsportspaper.pdf, zuletzt geprüft am, 18:2007, 2000.
[25] Xiaodan Song, Yun Chi, Koji Hino, and Belle Tseng. Identifying opinion leaders in the blogosphere. In Proceedings of the Sixteenth ACM Conference on Conference on Information and Knowledge Management, CIKM ’07, pages 971–974, New York, NY, USA, 2007. ACM.
[26] Xuning Tang and C.C. Yang. Identifying influential users in an online healthcare social network. In Intelligence and Security Informatics (ISI), 2010 IEEE International Conference on, pages 43–48, 2010.
[27] Mike Thelwall, Kevan Buckley, Georgios Paltoglou, Di Cai, and Arvid Kappas. Sentiment strength detection in short informal text. J. Am. Soc. Inf. Sci. Technol., 61(12):2544–2558, December 2010.
[28] Robert Tietz, Pamela D. Morrison, Christian Luthje, and Cornelius Herstatt. The process of user-innovation: a case study in a consumer goods setting. International Journal of Product Development, 2(4):321–338, 2005.
[29] Michael Trusov, Anand Bodapati, and Randolph E Bucklin. Determining influential users in internet social networks. Available at SSRN 1479689, 2009.
[30] S. Tuarob, S. Bhatia, P. Mitra, and C.L. Giles. Automatic detection of pseudocodes in scholarly documents using machine learning. In Document Analysis and Recognition (ICDAR), 2013 12th International Conference on, pages 738–742, Aug 2013.
[31] Suppawong Tuarob, Prasenjit Mitra, and C. Lee Giles. Taxonomy-based query-dependent schemes for profile similarity measurement. In Proceedings of the 1st Joint International Workshop on Entity-Oriented and Semantic Search, JIWES ’12, pages 8:1–8:6, New York, NY, USA, 2012. ACM.
[32] Suppawong Tuarob and Conrad S Tucker. Fad or here to stay: Predicting product market adoption and longevity using large scale, social media data. In Proc. ASME 2013 Int. Design Engineering Technical Conf. Computers and Information in Engineering Conf. (IDETC/CIE2013), 2013.
[33] Suppawong Tuarob, Conrad S. Tucker, Marcel Salathe, and Nilam Ram. Discovering health-related knowledge in social media using ensembles of heterogeneous features. In Proceedings of the 22nd ACM International Conference on Conference on Information & Knowledge Management, CIKM ’13, pages 1685–1690, New York, NY, USA, 2013. ACM.
[34] Suppawong Tuarob, Conrad S Tucker, Marcel Salathe, and Nilam Ram. An ensemble heterogeneous classification methodology for discovering health-related knowledge in social media messages. Journal of Biomedical Informatics, 2014.
[35] C. Tucker and H. Kim. Predicting emerging product design trend by mining publicly available customer review data. In Proceedings of the 18th International Conference on Engineering Design (ICED11), Vol. 6, pages 43–52, 2011.
[36] Conrad S Tucker and Harrison M Kim. Data-driven decision tree classification for product portfolio design optimization. Journal of Computing and Information Science in Engineering, 9(4):041004, 2009.
[37] C.S. Tucker and H.M. Kim. Trend mining for predictive product design. Transactions of the ASME-R-Journal of Mechanical Design, 133(11):111008, 2011.
[38] Eric Von Hippel. Successful industrial products from customer ideas. The Journal of Marketing, pages 39–49, 1978.
[39] Eric von Hippel. Lead users: A source of novel product concepts. Management Science, 32(7):791–805, 1986.
[40] Eric A von Hippel, Susumu Ogawa, and Jeroen PJ de Jong. The age of the consumer-innovator. 2011.
[41] Sha Wan, Yunbao Huang, Qifu Wang, Liping Chen, and Yuhang Sun. A new approach to generic design feature recognition by detecting the hint of topology variation. In Proc. ASME 2012 Int. Design Engineering Technical Conf. Computers and Information in Engineering Conf. (IDETC/CIE2012), 2012.
[42] Xindong Wu, Xingquan Zhu, G Wu, and Wei Ding. Data mining with big data. 2013.
[43] X. Zhang, H. Fuehres, and P. Gloor. Predicting asset value through twitter buzz. Advances in Collective Intelligence 2011, pages 23–34, 2012.
[44] Kang Zhao, Baojun Qiu, Cornelia Caragea, Dinghao Wu, Prasenjit Mitra, John Yen, Greta E Greer, and Kenneth Portier. Identifying leaders in an online cancer survivor community. In Proceedings of the 21st Annual Workshop on Information Technologies and Systems (WITS11), pages 115–120, 2011.
