Information Diffusion in Online Social Networks: A Survey - CiteSeerX

Titouan Fontai | Download | HTML Embed
  • Jun 19, 2013
  • Views: 51
  • Page(s): 12
  • Size: 854.59 kB
  • Report



1 Information Diffusion in Online Social Networks: A Survey Adrien Guille1 Hakim Hacid2 Ccile Favre1 Djamel A. Zighed1,3 1 ERIC Lab, Lyon 2 University, France {firstname.lastname} 2 Bell Labs France, Alcatel-Lucent, France [email protected] 3 Institute of Human Science, Lyon 2 University, France [email protected] ABSTRACT venting terrorist attacks, anticipating natural haz- Online social networks play a major role in the spread of ards), optimizing business performance (e.g. opti- information at very large scale. A lot of effort have been mizing social marketing campaigns), etc. Therefore made in order to understand this phenomenon, rang- researchers have in recent years developed a vari- ing from popular topic detection to information diffu- ety of techniques and models to capture informa- sion modeling, including influential spreaders identifi- tion diusion in online social networks, analyze it, cation. In this article, we present a survey of represen- extract knowledge from it and predict it. tative methods dealing with these issues and propose a Information diusion is a vast research domain taxonomy that summarizes the state-of-the-art. The ob- and has attracted research interests from many fields, jective is to provide a comprehensive analysis and guide such as physics, biology, etc. The diusion of in- of existing efforts around information diffusion in social novation over a network is one of the original rea- networks. This survey is intended to help researchers in sons for studying networks and the spread of disease quickly understanding existing works and possible im- among a population has been studied for centuries. provements to bring. As computer scientists, we focus here on the par- ticular case of information diusion in online so- cial networks, that raises the following questions : 1. INTRODUCTION (i) which pieces of information or topics are popu- Online social networks allow hundreds of millions lar and diuse the most, (ii) how, why and through of Internet users worldwide to produce and con- which paths information is diusing, and will be dif- sume content. They provide access to a very vast fused in the future, (iii) which members of the net- source of information on an unprecedented scale. work play important roles in the spreading process? Online social networks play a major role in the dif- The main goal of this paper is to review develop- fusion of information by increasing the spread of ments regarding these issues in order to provide a novel information and diverse viewpoints [3]. They simplified view of the field. With this in mind, we have proved to be very powerful in many situations, point out strengths and weaknesses of existing ap- like Facebook during the 2010 Arab spring [22] or proaches and structure them in a taxonomy. This Twitter during the 2008 U.S. presidential elections study is designed to serve as guidelines for scien- [23] for instance. Given the impact of online social tists and practitioners who intend to design new networks on society, the recent focus is on extract- methods in this area. This also will be helpful for ing valuable information from this huge amount of developers who intend to apply existing techniques data. Events, issues, interests, etc. happen and on specific problems since we present a library of evolve very quickly in social networks and their cap- existing approaches in this area. ture, understanding, visualization, and prediction The rest of this paper is organized as follows. are becoming critical expectations from both end- In Section 2 we detail online social networks basic users and researchers. This is motivated by the fact characteristics and information diusion properties. that understanding the dynamics of these networks In Section 3 we present methods to detect topics of may help in better following events (e.g. analyz- interest in social networks using information diu- ing revolutionary waves), solving issues (e.g. pre- sion properties. Then we discuss how to model in- SIGMOD Record, June 2013 (Vol. 42, No. 2) 17

2 formation diusion and detail both explanatory and of messages. Figure 2 represents the stream pro- predictive models in Section 4. Next, we present duced by the members of the network depicted in methods to identify influential information spread- the previous example. That stream can be viewed ers in Section 5. In the last section we summarize as a sequence of decisions (i.e. whether to adopt the reviewed methods in a taxonomy, discuss their a certain topic or not), with later people watching shortcomings and indicate open questions. the actions of earlier people. Therefore, individuals are influenced by the actions taken by others. This 2. BASICS OF ONLINE SOCIAL NET- eect is known as social influence [2], and is defined as follows: WORKS AND INFORMATION DIFFU- SION DEFINITION 2 (Social Influence). A so- An online social network (OSN ) results from the cial phenomenon that individuals can undergo or ex- use of a dedicated web-service, often referred to as ert, also called imitation, translating the fact that social network site (SNS ), that allows its users to (i) actions of a user can induce his connections to be- create a profile page and publish messages, and (ii) have in a similar way. Influence appears explicitly explicitly connect to other users thus creating social when someone retweets someone else for example. relationships. De facto, an OSN can be described as a user-generated content system that permits its DEFINITION 3 (Herd behavior). A social users to communicate and share information. behavior occurring when a sequence of individuals An OSN is formally represented by a graph, where make an identical action, not necessarily ignoring nodes are users and edges are relationships that can their private information signals. be either directed or not depending on how the SNS manages relationships. More precisely, it depends DEFINITION 4 (Information Cascade). on whether it allows connecting in an unilateral A behavior of information adoption by people in a (e.g. Twitter social model of following) or bilateral social network resulting from the fact that people (e.g. Facebook social model of friendship) manner. ignore their own information signals and make de- Messages are the main information vehicle in such cisions from inferences based on earlier peoples ac- services. Users publish messages to share or for- tions. ward various kinds of information, such as product recommendations, political opinions, ideas, etc. A message is described by (i) a text, (ii) an author, (iii) a time-stamp and optionally, (iv) the set of people (called mentioned users in the social net- working jargon) to whom the message is specifically targeted. Figure 1 shows an OSN represented by a directed graph enriched by the messages published by its four members. An arc e = (ux , uy ) means that the user ux is exposed to the messages pub- lished by uy . This representation reveals that, for example, the user named u1 is exposed to the content shared by u2 and u3 . It also indicates that no one receives the messages written by u4 . Figure 1: An example of OSN enriched by users messages. Users are denoted ui and DEFINITION 1 (Topic). A coherent set of messages mj . An arc (ux , uy ) means that ux semantically related terms that express a single ar- is exposed to the messages published by uy . gument. In practice, we find three interpretations of this definition: (i) a set S of terms, with |S| = 1, e.g. {obama} (ii) a set S of terms, with |S| > 1, e.g. {obama, visit, china} and (iii) a proba- bility distribution over a set S of terms. Every piece of information can be transformed Figure 2: The stream of messages produced into a topic [6, 30] using one of the common for- by the members of the network depicted on malisms detailed in Definition 1. Globally, the con- Figure 1. tent produced by the members of an OSN is a stream 18 SIGMOD Record, June 2013 (Vol. 42, No. 2)

3 Based on the social influence eect, information can spread across the network through the prin- ciples of herd behavior and informational cascade which we define respectively in Definition 3 and 4. In this context, some topics can become extremely popular, spread worldwide, and contribute to new trends. Eventually, the ingredients of an informa- tion diusion process taking place in an OSN can be summarized as follows: (i) a piece of information Figure 3: Temporal dynamics of popular top- carried by messages, (ii) spreads along the edges ics. Each shade of gray represents a topic. of the network according to particular mechanics, (iii) depending on specific properties of the edges and nodes. In the following sections, we will dis- ing the raw continuous data into a sequence of col- cuss these dierent aspects with the most relevant lection of messages published during equally sized recent work related to them as well as an analysis time slices. This principle is illustrated on Figure 4, of weaknesses, strength, and possible improvements which shows a possible discretization of the stream for each aspect. previously depicted in Figure 2. This pre-processing step is not trivial since it defines the granularity of 3. DETECTING POPULAR TOPICS the topic detection. A very fine discretization (i.e. One of the main tasks when studying information short time-slices) will allow to detect topics that diusion is to develop automatic means to provide were popular during short periods whereas a dis- a global view of the topics that are popular over cretization using longer time-slices will not. time or will become popular, and animate the net- work. This involves extracting tables of content to sum up discussions, recommending popular top- ics to users, or predicting future popular topics. Traditional topic detection techniques developed to analyze static corpora are not adapted to mes- sage streams generated by OSNs. In order to effi- ciently detect topics in textual streams, it has been Figure 4: A possible discretization of the suggested to focus on bursts. In his seminal work, stream of messages shown on Figure 2. Kleinberg [26] proposes a state machine to model the arrival times of documents in a stream in or- Shamma et al. [46] propose a simple model, PT der to identify bursts, assuming that all the docu- (i.e. Peaky Topics) , similar to the classical tf-idf ments belong to the same topic. Leskovec et al. [27] model [44] in the sense that it is based on a normal- show that the temporal dynamics of the most pop- ized term frequency metric. In order to quantify the ular topics in social media are indeed made up of a overall term usage, they consider each time slice as succession of rising and falling patterns of popular- a pseudo-document composed of all the messages in ity, in other words, successive bursts of popularity. the corresponding collection. The normalized term Figure 3 shows a typical example of the temporal tf frequency ntf is defined as follows: ntft,i = cft,i , t dynamics of top topics in OSNs. th where tft,i is the frequency of term t at the i time DEFINITION 5 (Bursty topic). A behav- slice and cft is the frequency of term t in the whole ior associated to a topic within a time interval in message stream. Using that metric, bursty topics which it has been extensively treated but rarely be- defined as single terms are ranked. However, some fore and after. terms can be polysemous or ambiguous and a single term doesnt seem to be enough to clearly identify a In the following, we detail methods designed to topic. Therefore, more sophisticated methods have detect topics that have drawn bursts of interest, i.e. been developed. bursty topics (see Definition 5), from a stream of AlSumait et al. [1] propose an online topic model, topically diverse messages. more precisely, a non-Markov on-line LDA Gibbs All approaches detailed hereafter rely on the com- sampler topic model, called OLDA. Basically, LDA putation of some frequencies and work on discrete (i.e. Latent Dirichlet Allocation [4]) is a statis- data. Therefore they require the stream of mes- tical generative model that relies on a hierarchi- sages to be discretized. This is done by transform- cal Bayesian network that relates words and mes- SIGMOD Record, June 2013 (Vol. 42, No. 2) 19

4 sages through latent topics. The generative process and more frequently, OSNs users publish non-textual behind is that documents are represented as ran- content such as URL, pictures or videos. To deal dom mixtures over latent topics, where each topic with non-textual content, Takahashi et al. [47] pro- is characterized by a distribution over words. The pose to use mentions contained in messages to iden- idea of OLDA is to incrementally update the topic tify bursty topics, instead of focusing on the textual model at each time slice using the previously gen- content. Mentioning is a social practice used to ex- erated model as a prior and the corresponding col- plicitly target messages and eventually engage dis- lection of messages to guide the learning of the new cussion. For that, they develop a method that com- generative process. This method builds an evolu- bines a mentioning anomaly score and a change- tionary matrix for each topic that captures the evo- point detection technique based on SDNML (i.e. lution of the topic over time and thus permits to Sequentially Discounting Normalized Maximum Like- detect bursty topics. lihood). The anomaly is calculated with respect Cataldi et al. [6] propose the TSTE method (i.e. to the standard mentioning behavior of each user, Temporal and Social Terms Evaluation) that con- which is estimated by a probability model. siders both temporal and social properties of the Table 1 summarizes the surveyed methods ac- stream of messages. To this end, they develop a cording to four axes. The table is structured ac- five-step process that firstly formalize the messages cording to four main criteria that allow for a quick content as vectors of terms with their relative fre- comparison: (i) how is a topic defined, (ii) which quencies computed by using the augmented normal- dimensions are incorporated into each method, (iii) ized term frequency [43]. Then, the authority of which types of content each method can handle, and the active authors is assessed using their relation- (iv) either the method detects actual bursts or pre- ships and the Page Rank algorithm [35]. It allows dicts them. It should be noted that the table is not to model the life cycle of each term on the basis of a intended to express any preference regarding one biological metaphor, which is based on the calcula- method or another, but rather to present a global tion of values of nutrition and energy that leverage comparison. the users authority. Using supervised or unsuper- dimension(s) content type vised techniques, rooted in the calculation of a crit- ical drop value based on the energy, the proposed definition task type reference method can identify most bursty terms. Finally, a solution is provided to define bursty topics as sets topic of terms using a co-occurence based metric. These methods identify particular topics that have set of terms distribution non-textual observation single term drawn bursts of interest in the past. Lu et al. [40] prediction develop a method that permits predicting which content textual social topics will draw attention in the near future. Au- thors propose to adapt a technical analysis indi- cator primary used for stock price study, namely PT x x x x MACD (i.e. Moving Average Convergence Diver- gence), to identify bursty topics, defined as a single OLDA x x x x term. The principle of MACD is to turn two trend- TSTE x x x x x following indicators, precisely a short period and a SDNML x x x x x longer period moving average of terms frequency, MACD x x x x into a momentum oscillator. The trend momentum is obtained by calculating the dierence between the Table 1: Summary of topic detection ap- long and the shorter moving averages. Authors give proaches w.r.t topic definition, incorporated two simple rules to identify when the trends of a dimensions, handled content and the task. term will rise: (i) when the value of the trend mo- mentum changes from negative to positive, the topic is beginning to rise; (ii) when the value changes from 4. MODELING INFORMATION DIFFU- positive to negative, the level of attention given to SION the topic is falling. The above methods are based on the detection Modeling how information spreads is of outstand- of unusual term frequencies in exchanged messages ing interest for stopping the spread of viruses, ana- to detect interesting topics in OSNs. However, more lyzing how misinformation spread, etc. In this sec- tion, we first give the basics of diusion modeling 20 SIGMOD Record, June 2013 (Vol. 42, No. 2)

5 and then detail the dierent models proposed to capture or predict spreading processes in OSNs. DEFINITION 6 (Activation Sequence). . An ordered set of nodes capturing the order in which the nodes of the network adopted a piece of infor- mation. DEFINITION 7 (Spreading Cascade). A directed tree having as a root the first node of the activation sequence. The tree captures the influence between nodes (branches represent who transmitted Figure 5: An OSN in which darker nodes the information to whom) and unfolds in the same took part in the diusion process of a par- order as the activation sequence. ticular information. The activation sequence The diusion process is characterized by two as- can be extracted using the time at which the pects: its structure, i.e. the diusion graph that messages were published: [u4 ; u2 ; u3 ; u5 ], with transcribes who influenced whom, and its temporal t1 < t2 < t3 < t4 . dynamics, i.e. the evolution of the diusion rate which is defined as the amount of nodes that adopts the piece of information over time. The simplest and are very useful to understand how information way to describe the spreading process is to consider propagated. that a node can be either activated (i.e. has re- Gomez et al. [15] propose to explore correla- ceived the information and tries to propagate it) or tions in nodes infections times to infer the struc- not. Thus, the propagation process can be viewed ture of the spreading cascade and assume that acti- as a successive activation of nodes throughout the vated nodes influence each of their neighbors inde- network, called activation sequence, defined in Def- pendently with some probability. Thus, the proba- inition 6. bility that one node had transmitted information to Usually, models developed in the context of OSNs another is decreasing in the dierence of their ac- assume that people are only influenced by actions tivation time. They develop NETINF, an iterative taken by their connections. To put it dierently, algorithm based on submodular function optimiza- they consider that an OSN is a closed world and tion for finding the spreading cascade that maxi- assume that information spreads because of infor- mizes the likelihood of observed data. mational cascades. That is why the path followed Gomez et al. [14] extend NETINF and propose by a piece of information in the network (i.e. the to model the diusion process as a spatially dis- diusion graph) is often referred to as the spread- crete network of continuous, conditionally indepen- ing cascade, defined in Definition 7. Activation se- dent temporal processes occurring at dierent rates. quences are simply extracted from data by collect- The likelihood of a node infecting another at a given ing messages dealing with the studied information, time is modeled via a probability density function i.e. topic, and ordering them according to the time depending on infection times and the transmission axis. This principle is illustrated in Figure 5. It rate between the two nodes. The proposed algo- provides knowledge about where and when a piece rithm, NETRATE, infers pairwise transmission rates of information propagated but not how and why did and the graph of diusion by formulating and solv- it propagate. Therefore, there is a need for models ing a convex maximum likelihood problem [9]. that can capture and predict the hidden mechanism These methods consider that the underlying net- underlying diusion. We can distinguish two cate- work remains static over time. This is not a satisfy- gories of models in this scope: (i) explanatory mod- ing assumption, since the topology of OSNs evolves els and (ii) predictive models. In the following, we very quickly, both in terms of edges creation and detail these two categories and analyze some repre- deletion. For that reason, Gomez et al. [16] extend sentative eorts in both of them. NETRATE and propose a time-varying inference algorithm, INFOPATH, that uses stochastic gradi- 4.1 Explanatory Models ents to provide on-line estimates of the structure The aim of explanatory models is to infer the un- and temporal dynamics of a network that changes derlying spreading cascade, given a complete acti- over time. vation sequence. These models make it possible to In addition, because of technical and crawling retrace the path taken by a piece of information API limitations, there is a data acquisition bottle- SIGMOD Record, June 2013 (Vol. 42, No. 2) 21

6 missing data properties reference supports network inferred transmission probability Figure 6: A spreading process modeled by transmission rate Independent Cascades in four steps. properties pairwaise pairwaise dynamic cascade 4.2.1 Graph based approaches static There are two seminal models in this category, namely Independent Cascades (IC ) [13] and Linear NETINF x x x Threshold (LT ) [17]. They assume the existence NETRATE x x x x of a static graph structure underlying the diusion INFOPATH x x x x x and focus on the structure of the process. They k-tree model x x x are based on a directed graph where each node can be activated or not with a monotonicity assump- Table 2: Summary of explanatory models tion, i.e. activated nodes cannot deactivate. The IC w.r.t the nature of the underlying network, model requires a diusion probability to be associ- inferred properties and the ability of the ated to each edge whereas LT requires an influence method to work with incomplete data. degree to be defined on each edge and an influence threshold for each node. For both models, the dif- fusion process proceeds iteratively in a synchronous neck potentially responsible for missing data. To way along a discrete time-axis, starting from a set overcome this issue, one approach is to crawl data of initially activated nodes, commonly named early as efficiently as possible. Choudhury et al. [7] adopters [37]: analysed how the data sampling strategy impacts the discovery of information diusion in social me- DEFINITION 8 (Early Adopters). A set dia. Based on experimentations on Twitter data, of users who are the first to adopt a piece of in- they concluded that sampling methods that con- formation and then trigger its diusion. sider both network topology and users attributes such as activity and localisation allow to capture In the case of IC, for each iteration, the newly information diusion with lower error in compari- activated nodes try once to activate their neigh- son to naive strategies, like random or activity-only bors with the probability defined on the edge joining based sampling. Another approach is to develop them. In the case of LT, at each iteration, the in- specific models that assume that data are missing. active nodes are activated by their activated neigh- Sadikov et al. [41] develop a method based on a k- bors if the sum of influence degrees exceeds their tree model designed to, given only a fraction of the own influence threshold. Successful activations are complete activation sequence, estimate the proper- eective at the next iteration. In both cases, the ties of the complete spreading cascade, such as its process ends when no new transmission is possible, size or depth. i.e. no neighboring node can be contacted. These We summarize the surveyed explanatory models two mechanisms reflect two dierent points of view: in Table 2. In the following, we detail the second IC is sender-centric while LT is receiver-centric. An category of models, namely, predictive models. example of spreading process modeled with IC is given by Figure 6. We detail hereafter models aris- 4.2 Predictive Models ing from those approaches and adapted to OSNs. These models aim at predicting how a specific dif- Galuba et al. [11] propose to use the LT model fusion process would unfold in a given network, from to predict the graph of diusion, having already ob- temporal and/or spatial points of view by learning served the beginning of the process. Their model from past diusion traces. We classify existing mod- relies on parameters such as information virality, els into two development axes, graph and non-graph pairwise users degree of influence and user proba- based approaches. bility of adopting any information. The LT model 22 SIGMOD Record, June 2013 (Vol. 42, No. 2)

7 is fitted on the data describing the beginning of the diusion process by optimizing the parameters us- ing the gradient ascent method. However, LT cant reproduce realistic temporal dynamics. Saito et al. [42] relax the synchronicity assump- tion of traditional IC and LT graph-based mod- els by proposing asynchronous extensions. Named AsIC and AsLT (i.e. asynchronous independent cascades and asynchronous linear threshold), they proceed iteratively along a continuous time axis and require the same parameters as their synchronous Figure 7: LIM forecasts the rate of diu- counterparts plus a time-delay parameter on each sion by summing the influence functions of edge of the graph. Models parameters are defined a given set of early adopters. Here, the early in a parametric way and authors provide a method adopters are u1 , u2 and u3 whose respective to learn the functional dependency of the model influence functions are Iu1 , Iu2 and Iu3 . parameters from nodes attributes. They formulate the task as a maximum likelihood estimation prob- lem and an update algorithm that guarantees the convergence is derived. However, they only exper- imented with synthetic data and dont provide a have adopted the information become susceptible practical solution. at the next time-step (i.e. = 1). This is a strong Guille et al. [19] also model the propagation pro- assumption since in real-world social networks, in- cess as asynchronous independent cascades. They fluence is not evenly distributed between all nodes develop theT-BaSIC model (i.e. Time-Based Asyn- and it is necessary to develop more complex mod- chronous Independent Cascades), which parameters eling that take into account this characteristic. arent fixed numerical values but functions depend- Yang et al. [50] start from the assumption that ing on time. The model parameters are estimated the diusion of information is governed by the in- from social, semantic and temporal nodes features fluence of individual nodes. The method focuses using logistic regression. on predicting the temporal dynamics of information diusion, under the form of a time-series describ- ing the rate of diusion of a piece of information, 4.2.2 Non-graph based approaches i.e. the volume of nodes that adopt the informa- Non-graph based approaches do not assume the tion through time. They develop a Linear Influ- existence of a specific graph structure and have been ence model (LIM ), where the influence functions mainly developed to model epidemiological processes. of individual nodes govern the overall rate of dif- They classify nodes into several classes (i.e. states) fusion. The influence functions are represented in and focus on the evolution of the proportions of a non-parametric way and are estimated by solv- nodes in each class. SIR and SIS are the two sem- ing a non-negative least squares problem using the inal models [21, 34], where S stands for suscepti- Reflective Newton Method [8]. Figure 7 illustrates ble, I for infected (i.e. adopted the information) how LIM forecasts the rate of diusion from a set and R for recovered (i.e. refractory). In both cases, of early adopters and their activation time. nodes in the S class switch to the I class with a fixed Wang et al. [48] propose a Partial Dierential probability . Then, in the case of SIS, nodes in Equation (PDE ) based model to predict the diu- the I class switch to the S class with a fixed prob- sion of an information injected in the network by a ability , whereas in the case of SIR they perma- given node. More precisely, a diusive logistic equa- nently switch to the R class. The percentage of tion model is used to predict both topological and nodes in each class is expressed by simple dier- temporal dynamics. Here, the topology of the net- ential equations. Both models assume that every work is considered only in term of the distance from node has the same probability to be connected to each node to the source node. The dynamics of the another and thus connections inside the population process is given by a logistic equation that models are made at random. the density of influenced users at a given distance of Leskovec et al. [28] propose a simple and intu- the source and at a given time. That definition of itive SIS model that requires a single parameter, the network topology allows to formulate the prob- . It assumes that all nodes have the same prob- lem simply, as for classical non-graph based meth- ability to adopt the information and nodes that ods while integrating some spatial knowledge. The SIGMOD Record, June 2013 (Vol. 42, No. 2) 23

8 mathematical network. They find that the most efficient spreaders dimension(s) are those located within the core of the network as non-parametric modeling reference identified by the k-core decomposition analysis [45], as defined in Definition 9. Basically, the principle of basis the k-core decomposition is to assign a core index ks to each node such that nodes with the lowest values non-graph based are located at the periphery of the network while nodes with the highest values are located in the graph based parametric center of the network. The innermost nodes thus forms the core of the network. Brown et al. [5] ob- content social serve that the results of the k-shell decomposition time on Twitter network are highly skewed. Therefore they propose a modified algorithm that uses a log- LT-based x x x x arithmic mapping, in order to produce fewer and AsIC, n/a n/a n/a x x more meaningful k-shell values. AsLT Cataldi et al. [6] propose to use the well known T-BaSIC x x x x x PageRank algorithm [35] to assess the distribution SIS-based x x x of influence throughout the network. The PageR- LIM x x x x ank value of a given node is proportional to the probability of visiting that node in a random walk PDE x x x x of the social network, where the set of states of the Table 3: Summary of diusion predic- random walk is the set of nodes. tion methods, distinguishing graph and non- The methods we have just described only exploit graph based approaches w.r.t incorporated the topology of the network, and ignore other im- dimensions and mathematical modeling. portant properties, such as nodes features and the way they process information. Starting from the observation that most OSNs members are passive parameters of the model are estimated using the information consumers, Romero et al. [38] develop Cubic Spline Interpolation method [12]. a graph-based approach similar to the well known We summarize the surveyed predictive models in HITS algorithm, IP (i.e. Influence-Passivity), that Table 3. In the following section, we discuss the assigns a relative influence and a passivity score role of nodes in the propagation process and how to to every users based on the ratio at which they identify influential spreaders. forward information. However, no individual can be a universal influencer, and influential members 5. IDENTIFYING INFLUENTIAL INFOR- of the network tend to be influential only in one MATION SPREADERS or some specific domains of knowledge. Therefore, Identifying the most influential spreaders in a net- Pal et al. [36] develop a non-graph based, topic- work is critical for ensuring efficient diusion of in- sensitive method. To do so, they define a set of formation. For instance, a social media campaign nodal and topical features for characterizing the can be optimized by targeting influential individuals network members. Using probabilistic clustering who can trigger large cascades of further adoptions. over this feature space, they rank nodes with a This section presents briefly some methods that il- within-cluster ranking procedure to identify the most lustrate the various possible ways to measure the influential and authoritative people for a given topic. relative importance and influence of each node in Weng et al. [49] also develop a topic-sensitive ver- an online social network. sion of the Page Rank algorithm dedicated to Twit- ter, TwitterRank. DEFINITION 9 (K-Core). Let G be a graph. Kempe et al. [24] adopt a dierent approach and If H is a sub-graph of G, (H) will denote the min- propose to use the IC and LT models (previously imum degree of H. Thus each node of H is adja- described in Section 4.2.1) to tackle the influence cent to at least (H) other nodes of H. If H is a maximization problem. This problem asks, for a maximal connected (induced) sub-graph of G with parameter k, to find a k-node set of maximum in- (H) >= k, we say that H is a k-core of G [45]. fluence in the network. The influence of a given set of nodes corresponds to the number of activated Kitsak et al. [25] show that the best spreaders nodes at the end of the diusion process according are not necessarily the most connected people in the 24 SIGMOD Record, June 2013 (Vol. 42, No. 2)

9 dimension(s) features incorporated of little interest. In contrast, OLDA defines a topic graph based reference as a distribution over a set of words but in turn has a high complexity, which prevents it from being ap- plied at large scale. Consequently, there is a need for new methods that could produce intelligible re- sults while preserving efficiency. We identify two possible ways to do so, through: (i) the conception users topic of new scalable algorithms, or (ii) improved imple- mentations of the algorithms using, e.g. distributed k-shell decomposition x systems (such as Hadoop). log k-shell decomposition x Social dimension. Furthermore, popular topic PageRank x detection could be improved by leveraging bursti- ness and people authority, as does TSTE, which Topic-sensitive PageRank x x relies on the PageRank algorithm. However, that IP x x possibility remains ill explored so far. Topical Authorities x x Data complexity. Currently the focus is set on k-node set x the textual content exchanged in social networks. However, more and more often, users exchange other Table 4: Summary of influential spreaders types of data such as images, videos, URLs point- identification methods distinguishing graph ing to those objects or Web pages, etc. This situa- and non-graph based approaches w.r.t incor- tion has to be fully considered and integrated at the porated dimensions. heart of the eorts carried out to provide a complete solution for topic detection. to IC or LT, using this set as the set of initially 6.2 Modeling Information Diffusion activated nodes. They provide an approximation for this optimization problem using a greedy hill- We distinguish two types of models, explanatory climbing strategy based on submodular functions. and predictive. Concerning predictive models, on The surveyed influence assessment methods are the one hand there are non-graph based methods, summarized in Table 4. that are limited by the fact that they ignore the topology of the network and only forecast the evo- 6. DISCUSSION lution of the rate at which information globally dif- fuses. On the other hand, there are graph based In this article, we surveyed representative and approaches that are able to predict who will influ- state-of-the-art methods related to information dif- ence whom. However, they cannot be used when fusion analysis in online social networks, ranging the network is unknown or implicit. Although a from popular topic detection to diusion modeling lot of eort have been performed in this area, gen- techniques, including methods for identifying influ- erally speaking, there is a need to consider more ential spreaders. Figure 8 presents the taxonomy of realistic constraints when studying information dif- the various approaches employed to address these fusion. In particular, the following issues have to be issues. Hereafter we provide a discussion regarding dealt with: their shortcomings and related open problems. DEFINITION 10 (Closed World). The 6.1 Detecting Popular Topics closed world assumption holds that information can The detection of popular topics from the stream only propagate from node to node via the network of messages produced by the members of an OSN re- edges and that nodes cannot be influenced by exter- lies on the identification of bursts. There are mainly nal sources. two ways to detect such patterns, by analyzing (i) term frequency or (ii) social interaction frequency. Closed world assumption. The major obser- In this area, the following challenges certainly need vation about modeling information diusion is cer- to be addressed: tainly that all the described approaches work under Topic definition and scalability. It is obvi- a closed world assumption, defined in Definition 10. ous that not all methods define a topic in the same In other words, they assume that people can only way. For instance Peaky Topics simply assimilates be influenced by other members of the network and a topic to a word. It has the advantage to be a low that information spreads because of informational complexity solution, however, the produced result is cascades. However, most observed spreading pro- SIGMOD Record, June 2013 (Vol. 42, No. 2) 25

10 Figure 8: The above taxonomy presents the three main research challenges arising from in- formation diusion in online social networks and the related types of approaches, annotated with areas for improvement. cesses in OSNs do not rely solely on social influ- important for predictive models to be topic-sensitive. ence. The closed-world assumption is proven incor- Romero et al. [39] have studied Twitter and found rect in recent work on Twitter done by Myers et significant dierences in the mechanics of informa- al. [32] in which authors observe that information tion diusion across topics. More particularly, they tends to jump across the network. The study shows have observed that information dealing with politi- that only 71% of the information volume in Twit- cally controversial topics are particularly persistent, ter is due to internal influence and the remaining with repeated exposures continuing to have unusu- 29% can be attributed to external events and influ- ally large marginal eects on adoption, which val- ence. Consequently they provide a model capable idates the complex contagion principle that stipu- of quantifying the level of external exposure and in- lates that repeated exposures to an idea are par- fluence using hazard functions [10]. To relax this ticularly crucial when the idea is controversial or assumption, one way would be to align users pro- contentious. files across multiple social networking sites. In this Dynamic networks. Finally, it is important way, it would be possible to observe the information to note that OSNs are highly dynamic structures. diusion among various platforms simultaneously Nonetheless most of the existing work rely on the as- (subject to the availability of data). Some work sumption that the network remains static over time. tend to address this type of problems by proposing Integrating link prediction could be a basis to im- to de-anonymize the social networks [33]. prove prediction accuracy. A more complete review Cooperating and competing diusion pro- of literature on this topic can be found in [20]. cesses. In addition, the described studies rely on the assumption that diusion processes are inde- 6.3 Identifying Influential Spreaders pendent, i.e. each information spreads in isolation. There are various ways to tackle this issue, rang- Myers et al. [31] argue that spreading processes ing from pure topological approaches, such as k- cooperate and compete. Competing contagions de- shell decomposition or HITS to textual clustering crease each others probability of diusion, while based approaches, including hybrid methods, such cooperating ones help each other in being adopted. as IP which combines the HITS algorithm with They propose a model that quantifies how dierent nodes features. As mentioned previously, there is spreading cascades interact with each other. It pre- no such thing as a universal influencer and therefore dicts diusion probabilities that are on average 71% topic-sensitive methods have also been developed. more or less than the diusion probability would be Opinion detection. The notion of influence is for a purely independent diusion process. We be- strongly linked to the notion of opinion. Numer- lieve that models have to consider and incorporate ous studies on this issue have emerged in recent this knowledge. years, aiming at automatically detecting opinions Topic-sensitive modeling. Furthermore, it is or sentiment from corpus of data. We believe that 26 SIGMOD Record, June 2013 (Vol. 42, No. 2)

11 it might be interesting to include this kind of work [8] T. F. Coleman and Y. Li. A reflective newton in the context of information diusion. Work deal- method for minimizing a quadratic function ing with the diusion of opinions themselves have subject to bounds on some of the variables. emerged [29] and it seems that there is an interest SIAM J. on Optimization, 6(4):10401058, to couple these approaches. Apr. 1996. [9] I. CVX Research. CVX: Matlab software for 6.4 Applications disciplined convex programming, version 2.0 Even if there are a lot of contributions in the beta., sep 2012. domain of online social networks dynamics analy- [10] R. C. Elandt-Johnson and N. L. Johnson. sis, we can remark that implementations are rarely Survival Models and Data Analysis. John provided for re-use. What is more, available imple- Wiley and Sons, 1980/1999. mentations require dierent formatting of the in- [11] W. Galuba, K. Aberer, D. Chakraborty, put data and are written using various program- Z. Despotovic, and W. Kellerer. Outtweeting ming languages, which makes it hard to evaluate or the twitterers - predicting information compare existing techniques. SONDY [18] intends cascades in microblogs. In WOSN 10, pages to facilitate the implementation and distribution of 311, 2010. techniques for online social networks data mining. [12] C. F. Gerald and P. O. Wheatley. Applied It is an open-source tool that provides data pre- numerical analysis with MAPLE; 7th ed. processing functionalities and implements some of Addison-Wesley, Reading, MA, 2004. the methods reviewed in this paper for topic de- [13] J. Goldenberg, B. Libai, and E. Muller. Talk tection and influential spreaders identification. It of the network: A complex systems look at features a user-friendly interface and proposes visu- the underlying process of word-of-mouth. alizations for topic trends and network structure. Marketing Letters, 2001. [14] M. Gomez-Rodriguez, D. Balduzzi, and 7. REFERENCES B. Scholkopf. Uncovering the temporal dynamics of diusion networks. In ICML 11, [1] L. AlSumait, D. Barbara, and C. Domeniconi. pages 561568, 2011. On-line lda: Adaptive topic models for mining [15] M. Gomez Rodriguez, J. Leskovec, and text streams with applications to topic A. Krause. Inferring networks of diusion and detection and tracking. In ICDM 08, pages influence. In KDD 10, pages 10191028, 2010. 312, 2008. [16] M. Gomez-Rodriguez, J. Leskovec, and [2] A. Anagnostopoulos, R. Kumar, and B. Schokopf. Structure and dynamics of M. Mahdian. Influence and correlation in information pathways in online media. In social networks. In KDD 08, pages 715, WSDM 13, pages 2332, 2013. 2008. [17] M. Granovetter. Threshold models of [3] E. Bakshy, I. Rosenn, C. Marlow, and collective behavior. American journal of L. Adamic. The role of social networks in sociology, pages 14201443, 1978. information diusion. In WWW 12, pages [18] A. Guille, C. Favre, H. Hacid, and D. Zighed. 519528, 2012. Sondy: An open source platform for social [4] D. Blei, A. Ng, and M. Jordan. Latent dynamics mining and analysis. In dirichlet allocation. The Journal of Machine SIGMOD 13, (demonstration) 2013. Learning Research, 3:9931022, 2003. [19] A. Guille and H. Hacid. A predictive model [5] P. Brown and J. Feng. Measuring user for the temporal dynamics of information influence on Twitter using modified k-shell diusion in online social networks. In decomposition. In ICWSM 11 Workshops, WWW 12 Companion, pages 11451152, pages 1823, 2011. 2012. [6] M. Cataldi, L. Di Caro, and C. Schifanella. [20] M. A. Hasan and M. J. Zaki. A survey of link Emerging topic detection on Twitter based on prediction in social networks. In Social temporal and social terms evaluation. In Network Data Analytics, pages 243275. MDMKDD 10, pages 413, 2010. Springer, 2011. [7] M. D. Choudhury, Y.-R. Lin, H. Sundaram, [21] H. W. Hethcote. The mathematics of K. S. Candan, L. Xie, and A. Kelliher. How infectious diseases. SIAM REVIEW, does the data sampling strategy impact the 42(4):599653, 2000. discovery of information diusion in social [22] P. N. Howard and A. Duy. Opening closed media? In ICWSM 10, pages 3441, 2010. SIGMOD Record, June 2013 (Vol. 42, No. 2) 27

12 regimes, what was the role of social media [37] E. M. Rogers. Diusion of Innovations, 5th during the arab spring? Project on Edition. Free Press, 5th edition, aug 2003. Information Technology and Political Islam, [38] D. Romero, W. Galuba, S. Asur, and pages 130, 2011. B. Huberman. Influence and passivity in [23] A. Hughes and L. Palen. Twitter adoption social media. In ECML/PKDD 11, pages and use in mass convergence and emergency 1833, 2011. events. International Journal of Emergency [39] D. M. Romero, B. Meeder, and J. Kleinberg. Management, 6(3):248260, 2009. Dierences in the mechanics of information [24] D. Kempe. Maximizing the spread of influence diusion across topics: idioms, political through a social network. In KDD 03, pages hashtags, and complex contagion on Twitter. 137146, 2003. In WWW 11, pages 695704, 2011. [25] M. Kitsak, L. Gallos, S. Havlin, F. Liljeros, [40] L. Rong and Y. Qing. Trends analysis of news L. Muchnik, H. Stanley, and H. Makse. topics on Twitter. International Journal of Identification of influential spreaders in Machine Learning and Computing, complex networks. Nature Physics, 2(3):327332, 2012. 6(11):888893, Aug 2010. [41] E. Sadikov, M. Medina, J. Leskovec, and [26] J. Kleinberg. Bursty and hierarchical H. Garcia-Molina. Correcting for missing data structure in streams. In KDD 02, pages in information cascades. In WSDM 11, pages 91101, 2002. 5564, 2011. [27] J. Leskovec, L. Backstrom, and J. Kleinberg. [42] K. Saito, K. Ohara, Y. Yamagishi, Meme-tracking and the dynamics of the news M. Kimura, and H. Motoda. Learning cycle. In KDD 09, pages 497506, 2009. diusion probability based on node attributes [28] J. Leskovec, M. Mcglohon, C. Faloutsos, in social networks. In ISMIS 11, pages N. Glance, and M. Hurst. Cascading behavior 153162, 2011. in large blog graphs. In SDM 07, pages [43] G. Salton and C. Buckley. Term-weighting 551556, (short paper) 2007. approaches in automatic text retrieval. Inf. [29] L. Li, A. Scaglione, A. Swami, and Q. Zhao. Process. Manage., 24(5):513523, 1988. Phase transition in opinion diusion in social [44] G. Salton and M. J. McGill. Introduction to networks. In ICASSP 12, pages 30733076, Modern Information Retrieval. McGraw-Hill, 2012. 1986. [30] J. Makkonen, H. Ahonen-Myka, and [45] S. B. Seidman. Network structure and M. Salmenkivi. Simple semantics in topic minimum degree. Social Networks, 5(3):269 detection and tracking. Inf. Retr., 287, 1983. 7(3-4):347368, Sept. 2004. [46] D. A. Shamma, L. Kennedy, and E. F. [31] S. Myers and J. Leskovec. Clash of the Churchill. Peaks and persistence: modeling contagions: Cooperation and competition in the shape of microblog conversations. In information diusion. In ICDM 12, pages CSCW 11, pages 355358, (short paper) 539548, 2012. 2011. [32] S. A. Myers, C. Zhu, and J. Leskovec. [47] T. Takahashi, R. Tomioka, and K. Yamanishi. Information diusion and external influence in Discovering emerging topics in social streams networks. In KDD 12, pages 3341, 2012. via link anomaly detection. In ICDM 11, [33] A. Narayanan and V. Shmatikov. pages 12301235, 2011. De-anonymizing social networks. In SP 09, [48] F. Wang, H. Wang, and K. Xu. Diusive pages 173187, 2009. logistic model towards predicting information [34] M. E. J. Newman. The structure and function diusion in online social networks. In of complex networks. SIAM Review, ICDCS 12 Workshops, pages 133139, 2012. 45:167256, 2003. [49] J. Weng, E.-P. Lim, J. Jiang, and Q. He. [35] L. Page, S. Brin, R. Motwani, and TwitterRank: finding topic-sensitive T. Winograd. The pagerank citation ranking: influential twitterers. In WSDM 10, pages Bringing order to the web. In WWW 98, 261270, 2010. pages 161172, 1998. [50] J. Yang and J. Leskovec. Modeling [36] A. Pal and S. Counts. Identifying topical information diusion in implicit networks. In authorities in microblogs. In WSDM 11, ICDM 10, pages 599608, 2010. pages 4554, 2011. 28 SIGMOD Record, June 2013 (Vol. 42, No. 2)

Load More