Thursday, February 23, 2012

The Pulse of News in Social Media: Forecasting Popularity

This is another paper in the series "Predicting popularity of Online Content", written by researchers from UCLA and HP Labs, including Bernardo Huberman. Unlike a previous paper I've read, here the authors study the problem of predicting the popularity of news articles based on properties of the articles themselves, not on early popularity records. This problem is very relevant for journalists, content providers, advertisers, and news recommendation systems because of the convergence of news and social media in the dissemination of content on the Web.

The dataset used in this study comes mainly from two sources. The first is Feedzilla, a news aggregator, and the second is Twitter. For a given news article from Feedzilla, all tweets that mentioned its URL were crawled. The properties of an article are: source, category, subjectivity, and named entities mentioned. For each of these properties, the authors define a score called t-density. The score of a category is the number of tweets per article in that category. Subjectivity is a binary attribute defined using a subjectivity classifier. This classifier was fed TV and radio show transcripts as the subjective class, and articles from C-SPAN (politics and history) and First Monday as the objective class. Named entities were identified using the Stanford NER tool. For a given article, the score is the average score of the entities it mentions. The source score is based on the success of the given source in another sample of tweets (different from the main dataset). Scores were smoothed and normalized to reduce the effect of temporal variations.

Based on information from Google News made available by KnewsKnife (a news site rating service), the paper investigates whether the top sources according to their score measure match the top sources on Google News. The conclusion is that they do not. Although the number of tweets and articles from a given source is somewhat positively correlated with the Google News rating, the proposed score is not correlated with the ratings. In fact, the top sources on Google News are traditional media websites (e.g., WSJ, Reuters), while the top sources identified in the study are web marketing blogs (e.g., Mashable, Google Blog).

Two versions of the popularity prediction problem are considered in the paper: regression and classification. In the regression problem, the goal is to predict the exact number of tweets mentioning a given article using the features just described. Three models (linear, KNN, and SVM) were applied, and the linear model achieved the best results (R^2=0.34). The result is even better when only one category (technology) is considered (R^2=0.43). The authors argue that this is due to the overlap across categories on Feedzilla. In general, the most important evidence is the source of the article. In the classification problem, three classes were defined (1-20, 20-100, and 100-2400 tweets). Four classification models were considered (SVM, decision tree, bagging, and naive Bayes). The bagging and decision tree algorithms achieved the best result (84% accuracy). Again, the most important feature is the source.

I did not like this paper very much. I think it is due to the lack of theoretical background and the excess of ad hoc decisions w.r.t. the datasets and the scoring schemes. The results show that the source determines the popularity of news articles, which was somewhat expected. Maybe a more important question would be: what makes an article from an ordinary blog or news website become popular? Or what makes an article from a famous blog or news website fail?

Link: http://arxiv.org/abs/1202.0332

Tuesday, February 21, 2012

Boolean Networks with Reliable Dynamics

This was an experience. I tried to understand a physics paper in order to compare research in physics and computer science. In general, I think physics is a better-defined and more beautiful science than computer science. If you take theoretical physics as an example, physicists are much more aware of what their "big questions" are than we are. They know how to describe these questions and their relevance in a more intuitive way. On the other hand, experimental physics has more statistical rigor than experimental computer science (e.g., the 5-sigma convention). Moreover, physicists are much better at defining a theoretical framework to study their problems. As a consequence, the progress of research in physics is clearer to me than in computer science. In a few words, physics is a harder science than computer science.

After three days of reading, I decided to give up on this paper. It is not that I haven't understood anything, but I certainly did not understand everything. Actually, it is the hardest paper I've read so far this year. In part, this is due to my unfamiliarity with the concepts involved, which are not very well described in the paper. Moreover, the authors often go too deeply and formally into small details that are not easy to follow.

The paper is authored by two researchers from the Technical University of Darmstadt, Germany. It studies several properties of Boolean networks (BNs) that have reliable trajectories. A Boolean network has a Boolean variable associated with each of its nodes. These variables are set according to input data from other nodes and Boolean functions. A trajectory of a BN is a sequence of consecutive states of the network, and it is said to be reliable if it does not depend on the order in which nodes are updated. BNs attract the interest of biologists, as a model for gene regulatory networks, and of statistical physicists, as a complex system with several interesting properties.

The state of a node is subject to a function as follows:

sigma_i(t+1) = f_i(sigma(t))u_i(t) + sigma_i(t)(1-u_i(t))

where t is time, f_i is a function that can be represented as a truth table over the inputs of vertex i, and u_i(t) defines whether the state of i is updated at time t. States can be updated according to different schemes (synchronous; asynchronous and deterministic; asynchronous and stochastic).
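A minimal sketch of this update rule may make it concrete. The two-node network, its functions, and the update mask below are toy values of my own, not from the paper:

```python
# Sketch of the Boolean-network update rule above. f[i] maps the tuple of
# input states to the next state of node i; u[i] says whether node i is
# updated at this step (all names here are my own).

def step(state, inputs, f, u):
    """One step: sigma_i(t+1) = f_i(sigma(t))*u_i + sigma_i(t)*(1-u_i)."""
    return [
        f[i](tuple(state[j] for j in inputs[i])) if u[i] else state[i]
        for i in range(len(state))
    ]

# Toy 2-node network: each node copies the negation of the other.
inputs = [[1], [0]]
f = [lambda s: 1 - s[0], lambda s: 1 - s[0]]

# Synchronous scheme (all u_i = 1): the states oscillate, an attractor of size 2.
s = [0, 0]
s = step(s, inputs, f, [1, 1])   # -> [1, 1]
s = step(s, inputs, f, [1, 1])   # -> [0, 0]
print(s)
```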

An attractor is a state or a set of states in which the network stabilizes (a loop). An attractor of size one is called a fixed point. The authors enforce the reliability of the trajectory by guaranteeing that two consecutive states always differ in exactly one value, i.e., that the state of only one node is changed at a time. They generate random trajectories and then apply an algorithm that produces the simplest networks and functions that generate the given trajectory. This algorithm generates the update function of each node v independently, based on the states of the nodes that changed their state right before v. More nodes may be necessary as inputs to a given vertex (different combinations are tested). A trajectory is characterized by the number of nodes involved and the average number of flips per node (Poisson distributed).

The first part of the results is about the topology of the generated networks. The authors show that the average degree of vertices is l (the average number of flips per node) and that it follows a Poisson distribution. The clustering coefficient of reliable networks is higher than that of random (shuffled-edge) networks. The number of loops in the network increases with l and decreases with N (the number of nodes). In general, denser subgraphs have a higher z-score, and the abundance of these subgraphs increases with l.

Regarding the update functions, the authors show that there are classes of functions that occur with similar probability. They show how these classes are related to restrictions on the local trajectory of nodes (e.g., functions that are insensitive to one or more inputs cannot be generated). Moreover, homogeneous functions (i.e., functions with a small number of minority outputs) are more likely. This property is proved in the paper.

The final part of the results is about the structure of the state space. Using a stochastic update scheme, the authors evaluate the probability of attractors of different sizes. Small networks produce short attractors. The attractor almost always has size one (a fixed point), but in some cases the attractors are longer than the defined trajectories.

I think reading this paper was worthwhile. It reminded me of my high school days, when I tried to read about relativity, particle physics, and cosmology. Reading about complex subjects makes other subjects look easier. Moreover, I got curious about other related systems, such as Bayesian networks and cellular automata. Finally, I still believe computer science, as a young science, can learn a lot from physics.

Link: http://arxiv.org/abs/0905.0925

Sunday, February 19, 2012

On the Geographic Distribution of On-line Game Servers and Players

This is another characterization paper I read as a reference for a paper I've just submitted about online game streaming. It was written by researchers from the Oregon Health and Science University, who also authored another paper about online gaming already summarized here. The idea of this particular paper is very simple: characterizing how players and servers are distributed over the world. The authors argue that this kind of analysis is relevant for improving player experience, especially in terms of provisioning server resources globally.

Even though I'm not much into gaming, I'm aware that lag really damages player experience. In order to reduce lag, it is interesting to place servers close to players in the network. As gaming services became a highly profitable business, companies put a lot of effort into resource allocation, including server placement. The authors hypothesize that players are usually close to servers as a natural consequence of lag.

The datasets used in this analysis are Counter-Strike server traces (for the player distribution) and the lists of game servers returned to game clients for different games (for the server distribution). Geographic locations were generated by a commercial geographic mapping tool.

I really liked the way game server and player distributions are visualized in the paper. The authors use longitude histograms, which show the volume of players/servers at a given meridian. I've done a similar analysis to show how streamers are distributed on Twitch.Tv for my paper on game streaming. In general, game servers are concentrated mostly in the U.S., especially on the west coast. There are also many servers in Europe and Asia. In terms of latitude, the great majority of the servers are in the northern hemisphere. Players follow a similar geographic distribution. However, the authors found that 50% of the accesses to the server come from outside North America. Several hypotheses for this result are enumerated, including disparities between geographic and network position, malfunctioning server selection mechanisms, and a shortage of servers overseas.

Another interesting result presented in the paper is that the geographic distribution of players is correlated with the time of day, which motivates shifting servers according to demand. Moreover, when compared to the geographic distribution of accesses to a department website, players are closer to the game server. Accesses to a website that provides game statistics and forums were also found to be close to the server.

This paper does exactly what it aims to do, and it is not very audacious. The analyses are intuitive and the visualizations are good.

Link: http://www.thefengs.com/wuchang/work/cstrike/netgames03_geo.pdf

2012 WWW Conference: Accepted Papers

The list of papers accepted at the WWW Conference was released! Here is the list of papers that I'm interested in:

A Habit Mining Approach for Discovering Similar Mobile Users
Actions speak as loud as words: Predicting relationships from social behavior data
An Exploration of Improving Collaborative Recommender Systems via User-Item Subgroups
Analyzing Spammers’ Social Networks For Fun and Profit — A Case Study of Cyber Criminal Ecosystem on Twitter
Are Web users really Markovian?
Care to Comment? Recommendations for Commenting on News Stories
Build Your Own Music Recommender by Modeling Internet Radio Streams
Community Detection in Incomplete Information Networks
Context-Aware Topic Models for Entity Disambiguation
Discovering Geographical Topics from Twitter Streams
Distributed Graph Pattern Matching
Document Hierarchies from Text and Links
Dynamical Classes of Collective Attention in Twitter
Framework and Algorithms for Network Bucket Testing
How Effective is Targeted Advertising?
Human Wayfinding in Information Networks
Information Transfer in Social Media
Learning and Predicting Behavioral Dynamics on the Web
Learning Causality for News Events Prediction
Learning from the Past: Answering New Questions with Past Answers
LINDEN: Linking Named Entities with Knowledge Base via Semantic Knowledge
Micropinion Generation: An Unsupervised Approach to Generating Ultra-Concise Summaries of Opinions
New Objective Functions for Social Collaborative Filtering
Online Team Formation in Social Networks
Optimizing Budget Allocation Among Channels and Influencers
Recommendations to Boost Content Spread in Social Networks
The Role of Social Networks in Information Diffusion
Understanding Task-driven Information Flow in Collaborative Networks
Using Content and Interactions for Discovering Communities in Social Networks
Vertex Collocation Profiles: Subgraph Counting for Link Analysis and Prediction
Winner Takes All: Competing Viruses or Ideas on fair-play Networks

Reading all these papers would take me about two months. I hope I can prune at least half of them to make it work.

Link: http://www2012.wwwconference.org/program/accepted-papers/main-scientific-tracks/

Saturday, February 18, 2012

Characterizing Online Games

I read this paper as a reference for a recent paper about live game streaming we submitted to the Workshop on Mining Social Network Dynamics, which will be part of the WWW Conference this year. When this paper was written, online gaming was something new, and the authors provide an extensive characterization of several aspects related to online gaming, including player and server data. The objective is to give guidelines for game workload management in order to maximize player satisfaction and minimize costs. The authors of the paper are from Zipper Interactive (a game development company), Portland State University, IBM Research, and Networked Alternate Reality Creations (a game development company).

The datasets used in the study are:
GameSpy: real-time player population data on individual games.
MSN Games: similar to GameSpy, but focused on casual games (i.e., simple games downloaded directly from portals).
CS server: traces of a busy Counter-Strike server.
EVE Online: the complete server session history for a popular game.

Provisioning Resources:

Game population sizes vary a lot, which makes them difficult to predict. Moreover, game popularity follows a power-law distribution. The authors found consistent daily and weekly variations in player activity. Applying a fast Fourier transform to the data shows that the cycles occur at 24-hour and 168-hour (1-week) intervals. Instantaneous load changes between identical points in time of consecutive weeks follow a t location-scale distribution, and most week-to-week variations are under 15%. Usage patterns are very similar across different games, and even the usage patterns of commercial websites are similar to those found for games, which reduces the benefits of resource sharing. Finally, games present strong diurnal geographic patterns, which motivates geographic shifts of resources.
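The Fourier trick can be illustrated with a toy example: a synthetic week of hourly "player counts" with a daily cycle, run through a naive DFT. The data and names below are mine, not the paper's:

```python
import cmath
import math

# Synthetic week of hourly load with a 24 h cycle (illustrative only;
# the paper's real traces are not reproduced here).
N = 168  # one week of hourly samples
load = [100 + 30 * math.sin(2 * math.pi * t / 24) for t in range(N)]

def dft_magnitude(x, k):
    """|X_k| of the discrete Fourier transform of x (naive O(N) per bin)."""
    return abs(sum(x[n] * cmath.exp(-2j * math.pi * k * n / len(x))
                   for n in range(len(x))))

# Skip k=0 (the mean); the strongest remaining component should be
# k = 7, i.e., 7 cycles per week = one cycle every 24 hours.
peak = max(range(1, N // 2), key=lambda k: dft_magnitude(load, k))
print(peak, N / peak)  # -> 7 24.0
```

A real analysis would use an FFT (e.g., numpy.fft) instead of this quadratic DFT, but the peak-picking idea is the same.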

Player Characterization:

Players do not try to reconnect many times; the number of accepted reconnects follows a negative exponential distribution. Most players try the game for a short time and quit, and the number of sessions of a player before quitting follows a Weibull distribution. The authors present an interesting discussion about why the ability of MMORPGs to attract new players decreases with time, since new players start at a great disadvantage. My brother thinks that updates are the usual solution to this problem, since they bring newcomers and 'professional' players close again, but the results disagree with him, showing that updates have a small impact on the player population and playing time. The paper also investigates how players that might be losing interest in the game can be identified automatically. Players that are quitting reduce their playtime. Moreover, they play shorter sessions and have increased intersession times.

This paper was very easy to read and understand. It is good that the subject of the paper is games and not video/audio, because I'm getting tired of those characterization papers. The authors seem to be specialists in the subject, and some discussions are really interesting.

Link: http://www.thefengs.com/wuchang/work/cstrike/ton09_characterizing.pdf

Tuesday, February 14, 2012

Social Network Sensors for Early Detection of Contagious Outbreaks

This is not a computer science paper. In fact, the authors are affiliated with arts, medical, and social science schools. Therefore, it is inevitable for me to contrast the methodology applied in this paper with the traditional CS research methodology. This paper studies the selection of social sensors for the detection of epidemics. However, instead of having the whole network available, as in the influence maximization problem, the authors consider a more restricted scenario in which they have access only to friends of random individuals (or nominated friends), and compare it against a random selection of sensors. The general idea is the friendship paradox, which can be stated as follows.

Friendship Paradox: The average degree of the neighbors of vertices is expected to be higher than the average degree of the vertices themselves. Let's say that the average degree of the vertices of a graph is u. Then the mean degree (avg) of the neighbors of vertices is u plus a quantity given by the variance of the degree distribution divided by u. The proof follows:

avg = Sum_{(u,v) in E} (deg(u) + deg(v)) / 2|E| = Sum_{v in V} deg(v)^2 / 2|E|
The degree of each vertex of the graph can be expressed as follows:
deg(v) = u + epsilon_v
where epsilon_v gives how much the degree of v deviates from the average. Substituting, we have:
avg = Sum_{v in V} (u+epsilon_v)^2 / 2|E|
Expanding (u+epsilon_v)^2:
avg = (|V|u^2 + 2u Sum_{v in V} epsilon_v + Sum_{v in V} epsilon_v^2) / 2|E|
Sum_{v in V} epsilon_v is 0 by definition, and Sum_{v in V} epsilon_v^2 is |V| times the variance (sigma^2) of the degree distribution. Therefore:
avg = (|V|u^2 + |V|sigma^2) / 2|E| = u + sigma^2/u
where the last step uses 2|E| = |V|u.
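The formula is easy to check numerically. Here is a quick sketch on a toy star graph (the code and example are mine, not from the paper):

```python
from statistics import mean, pvariance

def neighbor_avg_degree(edges):
    """Mean degree seen at the endpoints of edges: Sum_v deg(v)^2 / 2|E|."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    # Every endpoint is counted once per incident edge, which matches
    # the Sum_{v in V} deg(v)^2 / 2|E| term of the derivation above.
    return sum(deg[u] + deg[v] for u, v in edges) / (2 * len(edges))

# Star graph: center 0 connected to leaves 1, 2, 3.
edges = [(0, 1), (0, 2), (0, 3)]
degrees = [3, 1, 1, 1]
u = mean(degrees)            # 1.5
sigma2 = pvariance(degrees)  # 0.75 (population variance)
print(neighbor_avg_degree(edges))  # -> 2.0
print(u + sigma2 / u)              # -> 2.0, matching u + sigma^2/u
```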

It is common knowledge that central nodes are infected sooner during an outbreak. However, mapping the whole network may not be feasible, which is the main motivation of this work. According to the friendship paradox, while a random sample of vertices has the same mean degree as the population, the average degree of the neighbors of vertices is higher. As a consequence, friends of friends are more central than random vertices.

In order to evaluate the proposed strategy, the authors considered a network of 744 Harvard students over 4 months, of which 319 students are a random sample of the student population and 425 are friends of the first group. Cases of influenza (formally diagnosed and self-reported) among these students were tracked during the period of analysis. The incidence of influenza in the dataset is 8% for formal and 32% for self-reported cases.

The authors show that the curve for diagnosed flu incidence in the friend group is shifted 13.9 days earlier in time than the curve for random nodes. The curve for self-reported cases is shifted 3.2 days earlier. They also show that their strategy enables a prediction 46 days before the peak in daily incidence of visits to the health service for the diagnosed cases, and 83 days before the peak in daily incidence of self-reported symptoms for the self-reported cases.

The study also considered how different centrality measures could help in the identification of sensors when complete information about the network is available. The metrics considered are:

degree centrality: vertices with above-average degree were selected.
betweenness centrality: vertices with above-average betweenness were selected.
transitivity: the clustering coefficient of a vertex; vertices with below-average transitivity were selected.
coreness: the degree of a vertex after all its neighbors with smaller degrees are removed; vertices with above-average coreness were selected.

The shifts achieved by sensors selected using each metric were:

degree: 5.7 days (medical), 8 days (self-reported).
betweenness: 16.5 days (medical), 22.9 days (self-reported).
k-coreness: 4.3 days (medical), 7.5 days (self-reported).
transitivity: 31.9 days (medical), 15 days (self-reported).

I pretty much liked this paper, especially because of the friendship paradox. I spent some time proving it, but it was worthwhile. In general, I like the statistical rigor of these medical papers, but I do not appreciate some of the ad hoc decisions made (e.g., considering vertices with above-average degree as sensors). It would be interesting to consider this problem from a CS perspective, using a larger dataset and a more extensive evaluation of the proposed strategy.

Link: www.plosone.org/doi/pone.0012948

Wednesday, February 8, 2012

An Analysis of Live Streaming Workloads on the Internet

This is a characterization work by researchers from Carnegie Mellon University. The authors analyze a 3-month workload from Akamai Technologies, a big player in content distribution. The paper focuses not only on typical workload analyses (e.g., popularity, session arrival) but also studies other interesting aspects, such as user diversity and client lifetime.

The dataset used in this paper contains more than 70M requests for 5K URLs, comprising many publishers and kinds of content; examples include radio stations and short videos. The authors consider a URL an event and call a 24-hour chunk of an event a stream. About 70% of the streams are audio and only 7% are video. Most of the analyses are based on 3 distinct categories of content: All (everything), DailyTop40 (the top 20 audio and top 20 video streams for each encoding format for every day in the 3-month period), and Large (streams with peaks of more than 1K concurrent viewers). Moreover, streams are also classified as non-stop (24/7, 76%) or short (a few hours, 24%) and as recurring (97% of large streams) or one-time. Another interesting classification of streams regards the request arrival process: flash crowd streams are those that present a significant sudden increase in the arrival rate. All short streams and half of the large streams present flash crowd behavior.

Popularity: The popularity distribution of streams was found to follow a Zipf-like distribution with 2 modes, one for streams with 10K-7M requests and another for streams with less than 10K requests.

Arrival Process: 70% of the streams have a median interarrival interval of 1 second or less. Interarrival times are shorter for large streams than for short ones, since large streams have more requests. An exponential distribution can be used to model request interarrivals at shorter (stationary) timescales.

Temporal Variation: The workload presents clear temporal variation (daily and weekly), with peaks in the number of requests on weekends. Moreover, it is possible to distinguish the impact of American and European users on the number of requests over time.

Flash Crowds: Short-duration streams present less time-of-day and time-zone related behavior. The authors argue that this may be because the popularity of short content is driven by the content itself instead of by user behavior.

Tail Analysis of Session Duration: The tail is defined as the last 10% of the (CCDF) distribution. Non-stop streams present Pareto heavy-tailed behavior beyond some point in the plot. For short streams, there are clear cut-offs, and the session duration distribution follows a truncated Pareto distribution.

Transport Protocol: While UDP is considered the most appropriate protocol for streaming, 40% of the sessions use TCP.

Host Location Analysis: Based on host IPs, an analysis of the geographic location of hosts is presented at different levels (AS, city, country, timezone). For most streams, most of the clients are in the same timezone. Clients of large streams are spread over diverse AS domains (200-3,500) and locations at the city (200-3,500), country, and timezone levels. Large streams also present a large clustering index (the minimum number of locations needed to cover 90% of the IP addresses). Smaller streams present lower location diversity, but it is still significant (e.g., half of the streams cover 100+ cities and 50% cover at least 10 countries). When simultaneous clients are considered (i.e., clients that accessed the video within a 1-hour interval), the diversity is reduced but still significant.

Client Birthrate: Clients are identified by user IDs. The birthrate varies from 10% to 100% for DailyTop40 streams. Large streams have a lower birthrate, as do non-stop streams when compared to short ones.

Client Lifetime: Half of the users stay for only one day. For 90% of the events, more than 50% of the users are one-timers. Clients that are not one-timers have an average lifetime equal to the event duration, being responsible for steady memberships.

I did not like reading this paper very much. Most of the characterization papers I've read seem too bureaucratic to me: popularity follows this distribution, as everybody knew; session arrivals follow that one; and so on. Moreover, it is difficult to organize so many findings and explanations into a single flow of ideas. Maybe, instead of characterizing one scenario through several analyses, it would be better to provide an in-depth analysis of one property in many scenarios.

Link: http://esm.cs.cmu.edu/technology/papers/imc04.pdf

Monday, February 6, 2012

A Data-Based Approach to Social Influence Maximization

These theory-heavy papers are really holding me back from my reading plans; I may get more used to them with practice. This paper studies the social influence maximization problem, which is basically the problem of selecting a small set of nodes that maximizes the influence spread in a social network, and is co-authored by researchers from the University of British Columbia and Yahoo! Research. It differs from previous work, which considered a version of this problem where a real network is used but a propagation model simulates the influence instead of real influence traces. One funny thing is that I criticized this approach 2 years ago while looking for a new research project.

The propagation models applied in the influence maximization problem are the independent cascade model (IC), where each newly active node has a given probability of influencing each of its neighbors, and the linear threshold model (LT), where each node has a uniformly distributed threshold that must be reached by the total weight from its active neighbors for it to be activated too. The parameters of these models for a given network are usually assumed to be given, and only some recent work has focused on this important aspect of the problem. Given a seed set S, the standard approach is to simulate the spread of influence using one of these propagation models and then evaluate the influence of S. The objective is the identification of a seed set of size k that maximizes the influence, a problem known to be NP-hard (computing the expected influence exactly is itself #P-hard). In order to solve this problem for large datasets, previous work has shown that the expected influence function is monotone and submodular, concepts defined as follows.
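The standard way to evaluate the influence of S under the IC model is Monte Carlo simulation, which can be sketched in a few lines. The graph, probability, and seed set below are toy values of my own, not from the paper:

```python
import random

def ic_spread(graph, p, seeds, runs=2000, rng=random.Random(42)):
    """Average number of nodes activated starting from `seeds`.

    graph: dict node -> list of out-neighbors; p: the probability that a
    newly active node activates a neighbor (uniform for simplicity).
    """
    total = 0
    for _ in range(runs):
        active, frontier = set(seeds), list(seeds)
        while frontier:
            u = frontier.pop()
            for v in graph.get(u, []):
                # Each newly active node gets a single chance per neighbor.
                if v not in active and rng.random() < p:
                    active.add(v)
                    frontier.append(v)
        total += len(active)
    return total / runs

# Path 0 -> 1 -> 2: expected spread from {0} is 1 + 0.5 + 0.25 = 1.75.
chain = {0: [1], 1: [2], 2: []}
est = ic_spread(chain, 0.5, {0})
print(round(est, 2))
```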

Monotonicity: A function f is said to be monotone if f(S) <= f(T) whenever S is a subset of T.

Submodularity: A function f is said to be submodular if f(S + w) - f(S) >= f(T + w) - f(T) whenever S is a subset of T and w is not in T.

An important property of monotone submodular functions is that a greedy algorithm that selects items to be part of S based on marginal gain achieves a result that approximates the optimal one by a factor of (1 - 1/e).
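To make the greedy idea concrete, here is a tiny sketch with set coverage playing the role of the monotone submodular function (my own example, not the paper's influence function):

```python
# Greedy selection for a monotone submodular f: at each step, add the
# candidate with the largest marginal gain f(S + c) - f(S).

def greedy(candidates, f, k):
    """Pick k items by repeatedly maximizing the marginal gain."""
    S = []
    for _ in range(k):
        best = max((c for c in candidates if c not in S),
                   key=lambda c: f(S + [c]) - f(S))
        S.append(best)
    return S

# Coverage (size of the union of chosen sets) is monotone and submodular.
sets = {
    'a': {1, 2, 3},
    'b': {3, 4},
    'c': {4, 5, 6, 7},
    'd': {1, 7},
}
coverage = lambda S: len(set().union(*(sets[c] for c in S)) if S else set())
print(greedy(list(sets), coverage, 2))  # -> ['c', 'a'], covering 7 elements
```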

The first important result of the paper is that existing ad hoc propagation models are not able to learn influence probabilities that fit those found in real data. The authors show this by comparing the edge probabilities learned by an EM-based method against other existing methods. Two real datasets are used: one from flixster.com, a movie rating website, and the other from flickr.com. On Flixster, a user u_1 influences u_2 if u_2 rates a movie already rated by u_1, after they became friends. Similarly, on Flickr, u_1 influences u_2 whenever u_2 joins a group of which u_1 was already a member, and they were friends before it happened. It is important to notice that this definition of influence is a heuristic. The datasets were split into training and test sets. In the first experiment, the authors show that the seeds selected using the propagation models are very different from those based on the EM-based method, i.e., they do not fit the seeds selected using the propagation data. For all the methods, a greedy algorithm was applied to identify the seed sets; I assume the marginal gains were computed using Monte Carlo simulation. It is also shown that the expected influence computed using the propagation models differs significantly from the actual influence in the test set. The procedure used in this evaluation is to select the initiators of spreading processes in the test set and take as their propagation size the number of users that performed such actions. EM achieves the best results among the methods evaluated.

Further, the authors propose the Credit Distribution model (CD), which is the greatest contribution of the paper. The CD model uses propagation traces, which are tuples <user, action, time>, to measure the expected influence of a set of nodes S. The model is based on the effect of a given seed set S on each node of the network:

sigma_cd(S) = Sum_{u in V} k_S_u

where k_S_u is the credit relative to u and is defined as:

k_S_u = (1/A_u) * Sum_{a in A} Big_Gamma_S_u(a)

where Big_Gamma_S_u(a) is given as:

Big_Gamma_S_u(a) = 1, if u is part of S
Big_Gamma_S_u(a) = Sum_{w in N_in(u,a)} Big_Gamma_S_w(a) * small_gamma_w_u(a), otherwise

where small_gamma_w_u(a) = 1 / d_in(u,a), N_in(u,a) is the set of neighbors of u that performed action a before u, and d_in(u,a) is the indegree of u in the graph induced by action a. The authors also propose a better alternative for computing small_gamma_w_u(a) that depends on the time elapsed between w and u performing a, the average time that u takes to copy an action from w, and the influenceability of u. The intuition of the model is simply to start from u and walk in the reverse direction of the edges of the graph induced by a given action until finding a node from S, always multiplying the result by small_gamma. The authors show that the influence maximization problem under the CD model is NP-hard (the vertex cover problem can be reduced to it). I will not discuss the proof here. They also show that the CD function is monotone and submodular, using induction on path lengths.
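The recursive credit computation can be sketched on a toy action as follows. The graph, the names, and the equal-share small_gamma are my own simplifications, not the paper's code:

```python
# Big_Gamma_S_u(a) for one action: walk backwards from u, splitting the
# credit equally among the in-neighbors that acted before u
# (small_gamma_wu = 1/d_in(u)), until reaching a seed node.

def total_credit(u, S, N_in, memo=None):
    """Credit the seed set S receives for u performing the action."""
    if memo is None:
        memo = {}
    if u in S:
        return 1.0
    if u not in memo:
        memo[u] = sum(total_credit(w, S, N_in, memo) / len(N_in[u])
                      for w in N_in.get(u, []))
    return memo[u]

# Toy trace: x influenced y and z directly; y and z jointly influenced t.
N_in = {'y': ['x'], 'z': ['x'], 't': ['y', 'z']}
print(total_credit('t', {'x'}, N_in))  # -> 1.0: all credit flows back to x
```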

The last theoretical contribution of the paper is showing that the marginal gain of a given vertex, which is the basis of the greedy algorithm for the identification of seed sets, can be computed efficiently, without simulation or even without scanning the action log many times. Marginal gains can be computed using the following expression instead:

sigma_cd(S + x) - sigma_cd(S) = Sum_{a in A} (1 - Big_Gamma_S_x(a)) * Sum_{u in V} (1/A_u) * Big_Gamma_x_u^{V-S}(a)

The proof of this equation takes 1 entire page of the paper. The intuition behind the expression is that the marginal gain can be computed using the credit of the current seed set S for each action in the action log, together with the credit of x over the vertices that are not in the seed set. This comes from the fact that the score of each vertex of the seed set can be computed independently, by removing any other vertex of the seed set from the set of nodes considered, something that I understood but that just does not seem very right to me. It is important to note that the marginal gain must consider both the removal of x from the set of nodes that may be activated and its addition to the seed set as an initiator. In particular, the removal of x requires subtracting the scores of all paths between vertices of the seed set and any other vertex passing through x. I've followed the proof and I think it is really impressive, very good work.

The algorithms that apply the CD model to the identification of the seed set S are straightforward. First, the action log is scanned and the credit for each combination of vertices and actions is computed. This data is given as input to a greedy algorithm that applies the smart lazy strategy proposed by Leskovec et al., which re-evaluates the marginal gain of only the top candidate vertex, since the marginal gain of a vertex only decreases along the iterations. Marginal gains are computed using the expression presented above.
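The lazy strategy can be sketched with a priority queue (my own code, with set coverage standing in for sigma_cd). Candidates are keyed by their last computed marginal gain; only the top one is re-evaluated, which is safe because submodularity guarantees gains never increase as S grows:

```python
import heapq

def lazy_greedy(candidates, f, k):
    """Lazy (CELF-style) greedy for a monotone submodular f."""
    base = f([])
    # Max-heap via negated gains; start with each singleton's gain.
    heap = [(-(f([c]) - base), c) for c in candidates]
    heapq.heapify(heap)
    S = []
    while len(S) < k and heap:
        neg_gain, c = heapq.heappop(heap)
        fresh = f(S + [c]) - f(S)          # recompute the possibly stale gain
        if not heap or fresh >= -heap[0][0]:
            S.append(c)                    # still the best candidate: take it
        else:
            heapq.heappush(heap, (-fresh, c))  # push back with updated gain
    return S

sets = {'a': {1, 2, 3}, 'b': {3, 4}, 'c': {4, 5, 6, 7}}
cover = lambda S: len(set().union(*(sets[x] for x in S)) if S else set())
print(lazy_greedy(list(sets), cover, 2))  # -> ['c', 'a']
```

The output matches plain greedy, but on large inputs most candidates are never re-evaluated, which is where the speedup comes from.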

In the experimental evaluation, the authors apply the two datasets already described here. The CD model is compared with the IC and LT models. It is shown that, in general, the CD model predicts propagation sizes better than the other models. The seed sets identified using the IC and LT models have no overlap with the set found using CD, and the CD seeds achieve a higher influence spread, although not as much higher as I expected. Moreover, CD is at least one order of magnitude faster, although it may require a large memory space in the worst case; a pruning threshold is proposed to solve this issue. Finally, the authors show that the seed sets found using CD, and their influence spread, converge using only a small part of the action log.

This paper was a really great read. It certainly takes some courage to go through its theorems, definitions, and proofs, but I considered it worthwhile. However, I think there is a lot of room for improving the proposed model, especially because it seems to consider the effect of each node of the seed set independently, which is counterintuitive. I should be better at math, both to read a paper like this faster and to devise this kind of technique. Let's keep working!

Link: http://www.vldb.org/pvldb/vol5/p073_amitgoyal_vldb2012.pdf