Me and my research: janeiro 2012

segunda-feira, 30 de janeiro de 2012

Mining Top-K Large Structural Patterns in a Massive Network

After being far for a while, I finally found time to write about this paper. It is certainly on my top list, therefore, it took me a long time to read it carefully. The researchers involved in this work are from Singapore Management University, Peking University, UCSB, UIUC and UIC. Xifeng Yan, Jiawei Han, and Philip Yu are among them. The problem studied in the paper is finding frequent patterns in a single labeled graph. A frequent pattern is a subgraph for which the structure and attributes involved is frequent, more formally, the frequency of a subgraph is defined in terms of its isomorph subgraphs in the graph.

The basic idea of the work is that finding frequent subgraphs in a single graph is very expensive, what has been shown by previous work. Nevertheless, in practice, we are interested mainly in the largest subgraphs. But is it possible to mine only these large subgraphs efficiently? The authors propose an elegant approximate algorithm to solve this problem, which is called SpiderMine.

SpiderMine is based on the notion of a spider. A spider is a subgraph for which the distance between any vertex and a given head vertex is bounded by r (parameter). SpiderMine grows a random sample of spiders in order to find the largest patterns with high probability. In fact, they determine what is the required number of spiders required to find the k largest patterns with a 1-e probability, where e is a user-specified parameter. It is important to distinguish a subgraph from its embeddings. A subgraph is a pattern (labels+structure) while a embedding is a particular occurrence of a subgraph in the graph.

SpiderMine works as follows. First, all frequent spiders are identified. In the second step, M randomly selected frequent spiders are grown in D_max/2r iterations (D_max is a parameter). A pattern is grown by attaching spiders to its boundary vertices so that at each iteration a pattern has its radius increased by r. Subgraphs are merged in case they overlap during this process and are also frequent. Merged patterns are potential large subgraphs and are grown during a third step. In case a large pattern has at least two spiders as subgraphs, it will be find by SpiderMine. Spiders are also useful to speedup isomorphism checking, since isomorph subgraphs will have the same spider set.

One of my favorite parts of the paper is where the value of M is devised. Basically, the probability of a pattern P being hit by spider (i.e., this spider has one of the vertices of P as head) is given by:

P_hit(P) = Sum_{v in V(P)} |Spider(v)| / |S_all|

where Spider(v) is composed of spiders that have at least one embedding with v as head (I understood v as a label and Spider(v) as the set of spiders with v as head) and S_all is the complete set of spiders. The expected value of |Spider(v)| is |S_all|/|V(G)|, i.e. assuming that all vertices in V are equally likely to be part of a spider, this ratio gives the number of spiders with a given vertex. As a consequence, the probability of a pattern P being hit is lower bounded by |V(P)|/|V(G)|. The probability of a vertex not to be hit, P_fail, i.e. the probability of at most one vertex from the pattern being the head of a spider is given by:

P_fail(P) = (1-P_hit(P))^M + M.P_hit(P)(1-P_hit(P))^M-1

If P_hit <=2, then P_fail is upper bounded to (M+1)(1-P_hit(P))^M and the probability of hitting the K largest patterns is upper bounded by (1 - (M+1)(1-V_min/|V(G)|)^M)^K, where V_min is a parameter that gives the minimum size of a large pattern. Using this last expression, it is possible to set the value of M based on e. Though this math seems pretty much intuitive, I do not think I would be able of deducing it by myself.

The authors compare SpiderMine against existing algorithms for mining frequent subgraphs from single graphs using synthetic (ER and scale-free) and real datasets (DBLP and Jeti). They show that SpiderMine is more effective in finding the largest patterns and is also more scalable than competing techniques. In particular, they compare Spidermine and ORIGAMI, which is an algorithm for the identification of representative graph patterns, in the transaction setting and show that SpiderMine is able to find larger patterns. While the execution time of the proposed algorithm grows exponentially with r, it works well for small r. Finally, they show some interesting patterns found in real settings. In DBLP, they found collaboration patterns that are common to several research groups (e.g., senior and prolific authors organized in a particular way). In Jeti, which is a software, they found some frequent relationships among methods from different classes.

My general view of this paper is very positive. The authors propose a very interesting and powerful algorithm to solve a relevant problem. The technical contributions are sound and well described, except for some small details. One particular point that is still not very clear to me is whether the complete set of spiders is used in the growth process. It seems to be required in order to consider the complete set instead of only the M selected spiders in order to generate large patterns.

Link: http://www.vldb.org/pvldb/vol4/p807-zhu.pdf

terça-feira, 24 de janeiro de 2012

A Hierarchical Characterization of a Live Streaming Media Workload

This paper is co-authored by researchers from Universidade Federal de Minas Gerais and Boston University. It presents the first characterization of a live streaming workload at multiple levels. The main idea of the paper is comparing live streaming with other workloads, such as workloads from servers that stream stored objects. As a consequence, this paper supports the generation of realistic synthetic live streaming media workloads.

The authors mention two important differences between stored and live streaming media systems:
1) In live streaming, the value of the content is in its liveness. Therefore, different from stored media, rejecting new connections during a server overload is not a viable alternative.
2) Live streaming workloads are likely to exhibit stronger temporal patterns.

The workload studied was obtained from a popular live streaming media server during one month. It has two available live streaming media objects that cover a Brazilian reality show.

The hierarchical methodology applied in the paper analyzes the workload in three levels (layers):
1) Client layer: Based on individual users;
2) Session layer: Based on user sessions;
3) Transfer layer: Based on transfers that compose sessions.

Based on this methodology, the paper characterizes important aspects of each layer proposed.

Client Layer

The distributions of IPs/AS, transfers/AS and transfer/country suggest a Zipf-like profile and the number of active clients can be described by an exponential distribution. The workload shows clear temporal patterns, specially daily ones. An autocorrelation analysis confirms the strength of daily patterns. Client interarrival times can be described by a Pareto distribution, it is further shown that although the arrival process is non-stationary, it can be described by a sequence of stationary Poisson arrival processes for which the average arrival rate reflects the temporal variation found. The number of sections/user also follows a Zipf-like function.

Session Layer

Session size follows a lognormal distribution and OFF times follow an exponential. The number of transfers/session is described by a Pareto distribution and arrival times within a session can be fitted by a lognormal function.

Transfer Layer

The number of concurrent transfers follow an exponential distribution and transfers present a similar temporal pattern as the one found for client activity. Transfer interarrivals can be described by two Pareto distributions, which the authors argue to be related to the existence of two classes of content, the popular and the unpopular ones. Transfer length follows a lognormal distribution, a characteristic related to traffic self-similarity and with important implications to communication performance. It is interesting to notice that in the particular case of live streaming, transfer length is a direct result of user interactions instead of object characteristics.

Representativeness of Findings

The authors applied the same methodology to another live streaming workload from a news and sports radio station. They found similar results for both workloads, except regarding the parameters and also interarrival time, which is better described by a lognormal distribution in the new workload. The authors argue that this result is due to the nature of the interactions between clients and objects in the streaming media services.

Synthesis of Media Workloads

The Gismo toolset for the generation of synthetic workload was extended in order to enable the generation of live streaming workloads using the results found.

This paper is very well-written and presents a thorough analysis of a rich workload. One aspect that should have been more discussed is the impact of the small number of objects (only 2) of the results. Modern live streaming services, such as justin.tv, have a large number of streams available, which may have important implications for the results presented in this paper.

Link: http://conferences.sigcomm.org/imc/2002/imw2002-papers/180.pdf

segunda-feira, 23 de janeiro de 2012

Analyzing Client Interactivity in Streaming Media

This paper, co-authored by researchers from UFMG, studies how users interact with streaming media services. User interactions, such as pausing and jumping forward, have an important impact on the the design of realistic synthetic workloads and also on the performance of streaming media servers. The results presented are based on four real workloads from eTeach (educational video), TV/UOL (entertainment video), RADIO/UOL (entertainment audio), and an anonymous online radio refereed as ISP/RADIO. The authors argue that the study of large workloads from different domains enables interesting analyses that arise from the similarities and differences among the way users interact with these systems.

Several important results are presented along the paper. We summarize these results as follows:

Daily and Hourly Load Variations

eTeaches have few user requests, if compared with the other workloads, but delivers large amounts of content on average. Audio workloads also present a high average amount of content delivered, different from TV/UOL, which is composed mostly by small videos. In eTeach, the accesses peak around the middle of the week, while entertainment content presents a more even access distribution. During the day, eTeach accesses peaks in the middle of day and entertainment content is accessed evenly throughout the day, except in the first hours.

File Access Frequency

The frequency/popularity distribution of files is heavy-tailed for all the workloads studied. However, different distributions fit each workload well. Access distribution follows a Zipf for TV/UOL and is better described by two Zipfs for the other workloads. The authors argue that these two Zipfs may be related to the fact that users request files at most once, but it does not explain this pattern completely.

Session Arrival

The best function to fit session inter-arrival times depends on the workload. They found that an exponential function describes well the session arrival for TV/UOL and RADIO/UOL. However, inter-arrival times are better described by a weibull or a lognormal distribution for eTeach, depending on the file size. Moreover, a pareto distribution is the best fit for the ISP/RADIO workload.

Session Start Positions

Starting from the beginning of the file is not a rule, specially for video content. Nevertheless, access is skewed towards the beginning of the file. The authors show that skew faction, workload type and file size are correlated.

ON and OFF Times

The distribution that better describes ON and OFF times varies with day, file size and workload. ON times follow a Pareto (for small files) or a weibull (large files) and OFF times are well described by a weibull distribution in general.

Session Interactive Requests

Interactive requests are correlated to content type and file size. Large videos, specially educational ones, present a highly interactive user behavior. On the other hand, audio sessions usually have only one client request. The most popular type of request is pause, but as the file size increases, the number of jump forward interactions increases as well. Any interaction type is more frequently followed by another interaction of the same type (repetition). Moreover, jump distances are usually under 45 seconds but increase with file size, which has implications for buffering and client prefetching.

Profiles of Client Interactive Behavior

The larger the video, the shorter the portion of the file is requested at each interaction. For audio, users frequently request the entire file or stop at an arbitrary position.

Implications for Caching

While segments of popular files are accessed uniformly, less popular files have their access skewed towards the beginning of the file. As a consequence, different caching strategies may be more suitable according to the file popularity. In general, considering content popularity is key to achieve good performance, since a large amount of content is accessed sporadically.

This is not my favorite type of paper, but I think that I need to be able to read papers that are out of my comfort zone. The paper is well-written and the ideas are very clear. I missed more in-depth analysis of results and their implications at some points in the paper.

Link: www.iw3c2.org/WWW2004/docs/1p534.pdf

sábado, 21 de janeiro de 2012

Predicting Consumer Behavior with Web Search

This paper is very related to a previous paper I've read about how search volume data can be used for monitoring influenza epidemics. Here, the authors study a similar problem which is the prediction of consumer behavior (e.g., movie revenues, game sales, and song ranks) based on search data. The researchers involved in the work are from the Yahoo! Research Lab in New York City and include Duncan Watts, who is an authority on social network analysis.

In the recent years, several papers have shown how search volume is correlated to collective behavior, what is very intuitive. This is probably one of the most interesting among these papers because the authors do not assume that such a method would work. In fact, they show that search volume is not very useful in several important scenarios.

Using search data from the Yahoo! search engine and also from the Yahoo! music, the authors study the effectiveness of this data in the prediction of: (1) opening weekend box-office movie revenues, (2) first-month sales of video games, and (3) weekly ranks of songs on the Billboard Hot 100. The prediction model applied is linear correlation. Movie revenues are obtained from the IMDb website and game sales are available in vgchartz.com. A query is associated to a given movie if the IMDb page of this movie appears as one of the top results for this query on Yahoo! search. In the case of games, specific identifiers from URLs were applied to match a query to a particular game. Songs were linked to queries through normalized versions of their titles.

Results show that the accuracy of the models varies significantly depending on the application scenario. The predictions were evaluated in terms of how they were correlated to the actual results and achieved a 0.85, 0.76, and 0.56 correlation coefficients for the movie, game, and music prediction tasks. It is interesting to see that, in some cases, good prediction can be made even using data from one week in the past.

The authors also evaluate baseline models in the prediction tasks proposed. For the box-office revenue prediction, they apply the budget, number of screens opened and box-office predictions from hsx.com. To predict game sales, they use critic ratings from vgchartz.com and revenues of previous editions of games, if available. Music ranks for a given time t are predicted using ranks from two weeks before t. It is interesting to see that such baselines outperform search count based models. When hybrid models are generated, search volume data improves the results of the baselines only for non-sequel games (from 0.44 to 0.50) and music (from 0.70 to 0.87).

Not happy with the previous results, the authors also discuss the use of search data to monitor influenza. They show that an auto-regressive model, which applies previous reports on influenza caseloads achieves an accuracy of 0.86 while google flu trends achieved an accuracy of 0.94. While the results based on search volume are better, the gains are relatively small when compared against the auto-regressive model.

I appreciated reading this paper mainly because it is controversial. While several studies are going in the direction of showing the potential of search data in several tasks, the authors challenge this hypothesis and their results are very interesting. After reading so many papers using hadoop for data analysis, I start to wonder whether I should jump into hadoop as well.

Link: http://www.pnas.org/content/early/2010/09/20/1005962107

The Party is Over Here: Structure and Content in the 2010 Election

In this paper, some researchers from the University of Michigan studied the usage of Twitter by candidates in the 2010 U.S elections (senate and gubernatorial). The work focuses on content and structural features related to Democrat, Republican and Tea Party candidates. The objective is the understand the impact of such features over the election results. In other words, the general goal is to assess the potential of Twitter in predicting election results, what I consider to be a quite challenging task.

A dataset used in the paper contains 460,038 tweets and 137,376 pages posted by 687 candidates during the 3.5 years past the election day. The authors assume that web pages mentioned by the candidates express their opinions. Regarding the topology of the network, the paper shows that it represents very well the structure of the parties, with many edges between candidates from each party and the Tea Party as a part of the Republican network. In terms of density, the Democrats induce the sparsest network and the Republicans the densest one among the parties. Moreover, Tea Party members are more active than Republicans and Democrats on Twitter.

The content of tweets and pages is modeled through a language model. Language models enable the representation of a language through a probability distribution over terms.

Weight of the term t for a user u:

w(t,u) = TF(t,D_u).udf(t,D_u).idf(t,D)

where D_u is the content of the candidate u, D is the set of documents (pages + tweets), t is a term from the vocabulary V, udf(t, D_u) = df(t,D_u) / |D_u|, idf(t,D_u) = log(1 + |D| / df(t)) and TF(t,D_u) = Sum_{d in D_u} tf(t,d) / |D_u|.

Probability of a term t in the corpus D:

P(t|D) = TF(t,D).udf(t,D)

Normalized probabilities:

w(t,u) = w(t,u) / (Sum_{t in V} w(t,u))

P(t|D) = P(t|D) / (Sum_{t in V} P(t|D))

Smoothing:

P(t|u) = (1 - y).w(t,u) + y.P(t|D)

where y is a normalization factor.

Final (normalized again) probability of a term for a user:

P(t|u) = P(t|u) / (Sum_{t in V}P(t|u))

In order to compute the probability of term for a given party, it is necessary to replace the user u by the set of users from the respective party.

Two distributions are compared through the Kullback-Leibler (KL) divergence. The non-symmetric version of the KL divergence is expressed as:

D(P_1,P_2) = Sum_{t in V} P_1(t).log P_1(t) / log P_2(t)

Moreover, a symmetric version of the KL divergence is defined as follows:

D(P_1,P_2) = Sum_{t in V} P_1(t).log P_1(t) / log P_2(t) + Sum_{t in V} P_2(t).log P_2(t) / log P_1(t)

The KL divergence is applied in the comparison of candidates, candidates and parties, and candidates and the corpus. Moreover, the non-symmetric version is used in the comparison between a term and a distribution. The paper presents the top terms that define each party (i.e. terms with highest marginal KL compared to the corpus). The terms are education (democrats), spending (republicans) and barney_frank (Tea Party). Barney Frank is a very important democrat congressman who is also gay, but I do not have enough information to understand what Tea Party members are saying about him. By comparing the profile of each pair of candidates from the same party, it is shown that the Tea Party is the most cohesive party, while the Democrat is the less cohesive one. Moreover, content and network distance are correlated in the dataset (i.e., candidates that tweet similar content are close in the network).

The final part of the paper studies the problem of predicting the election results. The method applied in the predictions is logistic regression with the following variables: closeness (incoming, outgoing, and all paths), hits, pagerank, in/out degree, KL of the party, party, same-party (i.e., whether the candidate is a member of the party the occupied the position in the previous term), incumbent (i.e., whether the candidate occupied the position in the previous term), and Twitter statistics (tweets, retweets, hashtags, etc.). The single variables that achieve the best results are (in this order): same-party, incumbent, indegree, and closeness (all paths). An interesting finding is that being close the the party and the corpus, in terms of content, is better than being different, which means that focusing on centric issues is a good strategy. Moreover, Twitter statistics are uninformative for the prediction task. The results considering all variables show that the incumbent, party, and same-party variables achieve an accuracy of 81.5% and the inclusion of content and structure variables leads to an accuracy of 88% in predicting election winners.

I confess that I expected more from this paper. It was good to learn more about language models but the results presented in the paper are kind of disappointing. While the content and structure analysis is limited to simple statistics and expected results, the prediction task is not very conclusive. In particular, it seems that the models proposed are a good explanation for the 2010 election results instead of generally effective prediction models.

Link: misc.si.umich.edu/publications/69

sexta-feira, 20 de janeiro de 2012

I Tube, You Tube, Everybody Tubes: Analyzing the World's Largest User Generated Content Video System

I've read this paper, written by some researchers from KAIST and Telefonica Research in 2007, three years ago. It is a very strong paper (best IMC paper of that year) that present a thorough analysis of user generated content systems (UGCs), specially Youtube. His contributions go from an in-depth statistical analysis of the video popularity to the estimates of benefits that may result if Youtube was implemented using the P2P protocol. While being a set of interesting questions answered using large-scale data and cool plots, the paper is focuses on the popularity of videos and its effects from several relevant perspectives.

The motivation for such a study comes from the increasing popularity of UGC Vod systems and the lack of similar analysis in the literature. The main dataset used is composed of around 2M videos, corresponding to 4M views, from Youtube (Entertainment and Science and Technology categories). Data from Daum, which is a UGC from Korea, Netflix (ratings), Lovefilm (movie data), and Yahoo! (theater gross income) are also applied.

UGC x non-UGC: The paper lists some important characteristics that distinguish UGC from non-UGC content. In general, UGC has a higher production rate but a similar production per user when compared to non-UGC content. As a consequence, while top UGC producers release much more content than top non-UGC ones, the average size of UGC videos is much smaller. They also found a low user participation (e.g., comments and ratings) in UGC systems and a high correlation between the number of views and ratings of videos. Regarding how users find videos, half of them have external links. Interestingly, these videos represent 90% of the views, but only about 3% of such views are from users coming through external sites. I think Facebook has changed it recently.

UGC Popularity: My favorite section of the paper. The popularity distribution of videos is expected to be power-law or log-normal. Power-law distributions are a consequence of a rich-get-richer process, while the log-normal distribution is a result of proportionate effect. The main visual characteristic of a power-law is that its plot is a straight line across several orders of magnitude in a log-log scale. An analysis of the video popularity shows that it presents the pareto principle (10% of the videos correspond to 90% of the views) and this has strong implications in many scenarios (e.g., cache performance). The fit of the popularity distributions against some known distributions (power-law, log-normal, exponential, and power-law with exponential cutoff) shows that the tail of the curve may follow a power-law with exponential cutoff (i.e., a power-law followed by an exponential). The authors argue that the truncated tail of the curve can be generated by an aging or a fetch-at-most-once process, both from the literature, and show that fetch-at-most-once is the more likely explanation through simulations. The effect is amplified by the increase of the number of requests/user and the decrease of the number of videos available. The focus is then changed to the long tail (less popular videos) and the authors show that the number of views can be described by a Zipf with an exponential cutoff but also seems to follow a log-normal. This truncated tail may be due to the large number of uninteresting movies, sampling bias (towards interesting videos), and the information filtering (e.g., search engine rankings). Finally, the paper presents an analysis of the benefits of the removal of the truncated tail (e.g., popularity gains of 45% for the Entertainment category).

Popularity Evolution: In general, there is no strong correlation between video age and the number of views it attracts, except for very young videos that get more views than expected. Another interesting result is that most of the top popularity video of a day are recent, but, in the long range, the ranks of recent videos decrease. If a video does not get enough views during its early days, it is unlikely that it will get many requests in the future, in fact, there is a strong correlation (> 0.84) between the popularity of a video during its first two day and its future popularity. Moreover, different from old videos, young videos may suffer significant changes in their popularity rank.

Efficient UGC Design: The authors simulate the use of three cache schemes using the traces from Youtube. They show that a static finite cache with long-term popular content works very well (27% of the requests are missed) and a hybrid scheme with space to store the daily top popularity videos reduces the missing rate to 17%, which is a direct consequence of the skewness of the popularity distributions. Moreover, they simulate the performance of a P2P VoD system using Youtube data. Since they have daily popularity of videos, they estimate the inter-arrival time and concurrent requests for a given video. They found that 95% of the videos are expected to be requested only once in a 10 minute interval. However, assuming different scenarios where users share content while watching the video, during a session (28 min.) or one extra hour or day after watching a video, they show that P2P can be very effective. Basically, they simulated a system where 2 concurrent users share content if possible or access the server otherwise. By sharing content only during the video, the server workload may be reduced by 41%. Furthermore, if users share content during an extra day, the server workload is decreased by 98.7%.

Aliasing and Illegal Uploads: Aliases are copies of a video and may dilute content popularity. From a sample of aliases tagged by volunteers, the paper shows that the removal of some aliases may increase the popularity of some videos by two orders of magnitude. Most of the aliases are from one-timers and the maximum number of aliases/user found was 15. Moreover, a small fraction of the videos deletions were due to copyright violations.

This is certainly a good example of a well-written characterization paper. It presents many interesting conclusions motivated by relevant questions. The paper presentation is amazing and the authors show good expertise in statistics and system design.

Link: an.kaist.ac.kr/traces/papers/imc131-cha.pd

terça-feira, 17 de janeiro de 2012

Cheaters in the Steam Community Gaming Social Network

I found this paper in Technology Review some weeks ago and, since I've been working with gaming data, I decided to read it. Although being a pre-print version, this work has attracted a lot of interest recently. Researchers from University of South Florida, University of Delaware, and University of British Columbia collaborated in this paper that studies social interaction in the Steam Community, which is the most important gaming community in the world. In particular, the work characterizes several aspects of cheaters (i.e., people that violate the rules games using software) through social network analysis techniques. I must confess that this is not an easy paper to summarize.

The study is based on a crawled dataset composed profiles of 30M users that interact through a social network. A second dataset, containing data about 10K players, was collected from a particular server. Users are classified as cheaters or non-cheaters in a given game by an anti-cheat system provided by Steam. More than 720K users from the dataset are classified as cheaters, who are blocked by the majority of the available servers.

Cheaters were found to present several interesting characteristics that distinguish them from the other gamers. They own less games, play less, are more interested in multi-player games, and use more restrict privacy settings. Moreover, cheaters are likely to be friends with others cheaters (homophily) and loose more friends than non-cheaters. On the other hand, cheaters and non-cheaters have similar numbers of friends. The results show also a strong correlation between friendship and in-game interaction. Cheaters interact less with the remaining of the community, indicating a possible reaction against their behavior. Regarding group relationships, the authors found that cheaters are not equally distributed among groups, which is an evidence that cheaters learn about cheating on groups. A geographic analysis of cheaters and non-cheaters shows that both playing and cheating are not well distributed among countries. The authors argue that cultural aspects may be related to cheating behavior, something already found by other studies. Another interesting result presented in the paper is that cheating propagates over time, which means that new cheaters are likely to be friends with previously discovered cheaters.

This is one of those papers many relevant findings but few technical contributions. It was hard to read due to the number of analysis made and the brevity of the discussions. I would like to see more general conclusions instead of punctual findings but I confess that I'm not an expert on online gaming. The authors provide an extensive coverage of the online gaming literature and also seem to be familiar with online gaming concepts in general. I hope to see a new paper combining the discoveries of this paper and machine learning techniques in the identification of cheaters.

Link: http://arxiv.org/abs/1112.4915

segunda-feira, 16 de janeiro de 2012

Does Bad News Go Away Faster?

Today, I've read this short paper about the correlation between persistence and content in information diffusion. This work was developed by some people from Cornell University, including Jon Kleinberg. The general idea of the paper is that it is possible to identify information that will persist by analyzing content features associated to it. A piece of information is said to be persistent if it still attracts attention some time after its peak of popularity. On the other hand, non-persistent information fades-away very quickly after its peak.

The authors apply a dataset of 21K URLs from Twitter as a case study. The decay time of a URL is defined as the time it takes to this URL to reach 75% of its mentions after its peak of popularity. They show that the decay time of URLs seems to follow a power-law distribution. Moreover, the mean decay time is 217 hours and the median decay time is 19 hours. Based on its decay time, a URL is classified as persistent if its decay time is longer than 24 hours. Non-persistent URLs are those for which the decay time is shorter than 6 hours. It is interesting to see that the time series that represent these two categories are clearly different.

The content features applied in the study are extracted from the html page pointed by the URL. They consider combinations of the header, the body and the URL links. These features are used to train a SVM classifier, which achieves around 70% of accuracy in the identification of persistent URLs. In particular, the more data (i.e. features) is given to the classifier, the more effective it gets.

In a further analysis, the authors study linguistic properties of the content associated to the two categories of URLs. Using LIWC (that stands for Linguistic Inquiry and Word Count), which is a taxonomy composed of 60 pre-defined categories of words, they show that affective content is related to persistence. Moreover, they found out that content associated to cognitive processes (e.g., thinking, knowing) is more viral (i.e., less persistent) and rapidly fading URLs point to content with more words related action and time (e.g., news). In terms of the words that better represent each class, the authors show that, while persistent URLs are links to pages that contain words related to entertainment, positive emotions, advertisement and marketing, non-persistent URLs are references to pages with words associated to news, blogs, and names.

I consider the problem studied in the paper to be very interesting to a broad range of professionals. This idea that the information itself affects the way it propagates is very intuitive but I've never seem a scientific study about it. However, I also believe that the paper is not very conclusive regarding any unexpected result. In fact, the conclusions of the paper while interesting are not surprising at all.

Link: http://www.cs.cornell.edu/home/kleinber/icwsm11-longevity.pdf

domingo, 15 de janeiro de 2012

Statistical Properties of Sampled Networks

This paper studies topological properties of sampled networks. It is a work done by three physicists from KAIST in 2009. Sampling appears in data analysis for several reasons. In the particular case of network analysis, sampling is usually a consequence of the crawling technique applied or a requirement due to the large size of real graphs. However, results found using sampled networks may be different from those obtained from the original network. This paper compares the topological properties of sampled networks produced by three standard sampling techniques with the properties of the original networks.

Random sampling is probably the most widespread sampling technique in the literature. Nevertheless, while random sampling works very well in several scenarios, networks have an important property that affects the performance of random sampling, which is the existence of two distinct entities: nodes and links. As an example, the degree is not an individual characteristic of a node. The authors also discuss the reverse relationship between sampling and attacks (or breakdowns) in networks, something that is very interesting.

There are several sampling techniques in the literature. In this paper, they consider three techniques: node, link, and snowball sampling. In node sampling, a certain number of nodes are chosen randomly and the links among these nodes are kept. Link sampling can be seen as the opposite, in the sense that links are selected at random and nodes connected by such links are kept. Snowball sampling starts from a single node and recursivelly picks nodes connected to already selected nodes until the desired number of nodes is reached. Partially crawled networks are usually a result of snowball sampling.

In order to study the sampling techniques described, the authors consider four networks: a synthetic network generated using the Barabasi-Albert model, a protein interaction network, a network of autonomous systems, and co-authorship network. The topological properties of interest are: degree distribution, betweeness centrality, average shortest path, assortativity and clustering coefficient.

Degree distribution: The authors present the following analytical formulation for the degree distribution p' of a sampled network based on the degree distribution p of the original network:

p'(k) = Sum_{k_0=k to inf} p(k_0) C(k_0,k) a^k(1-a)^k_0-k

Where a is the probability of selecting a node from the original network (size of sample / size of network). This formulation works for both node and link sampling strategies. As part of my master's thesis research project, it took me one week to devise a similar formulation by myself. I should have read this paper one year ago. Something that I did not know is that this formulation may not work well for small values of a. This property has important implications for the normalization approach I proposed in my research project. The authors found that, while the node and link sampling overestimate the exponent of the degree distribution, the snowball sampling underestimates it, due to its bias towards high degree nodes.

Average path length: For node and link sampling, the average path length is larger than the one from the original network. On the other hand, the snowball sampling gives an average path that is lower than the original value.

Betweenness centrality distribution: The betweenness centrality of a node is the ratio of the number of shortest paths running through it to the number of shortest paths between pairs of nodes from the network. The betweenness centrality is known to follow a power-law distribution in several real graphs. The betweenness centrality of nodes from sampled networks are similar to those from the original network in terms of degree distribution: node and link sampling overestimate the exponent, while snowball sampling underestimates it.

Assortativity: Assortativity is the correlation between the degrees of connected nodes. It is known to be positive in social networks and negative in biological and technological networks. The authors found that node and link sampling conserve the assortativity of networks. However, sampled networks from snowball sampling are more disassortative than the orginal networks. The authors argue that this result is due to the links to the nodes in the last layer, which are lost in the snowball sampling process.

Clustering coefficient: The clustering coefficient of a node is given by:

C_i = 2y / k_i(k_i-1)

Where y is the number of links between neighbors of i and k_i is the degree of i. The clustering coefficient of a network is the average clustering coefficient of its nodes. For node and snowball sampling, the clustering coefficient of sampled networks are close to those from the original ones. Link sampling underestimates the clustering coefficient of networks.

This is certainly one of the most interesting papers I've ever read. It is very related the research I've done in my master's degree project. Moreover, it provides the reader with plenty of theoretical and empirical results. I've been thinking about comparing several properties of random samples and graphs induced by homophily in attributed graphs. I think this kind of analysis may give a broader view of the correlation between node attributes and the network topology in real graphs.

Link: http://arxiv.org/abs/cond-mat/0505232

sábado, 14 de janeiro de 2012

Predicting the popularity of online content

This was certainly the busiest week of the year, so this is the first paper I've read this week. It is a paper about the problem of predicting the popularity of content in the web, a work done by Gabor Szabo and Bernardo Huberman, from HP Social Computing Lab. There is a shorter version of this paper published by the ACM Communications Magazine.

The basic premise of the paper is that we can predict the popularity of a given content from early measurements of its popularity. The authors study statistical techniques to solve this problem using data from Digg (story popularity) and Youtube (video popularity). In order to reduce the effect of daily and weekly popularity variations, the authors apply a logarithm transformation of the popularity counts. They also propose the digg hour as a measure of time that is less affected by such variations. A digg hour takes longer at low-activity periods.

Given an indicator time t_i and a reference time t_r, t_i < t_r, the problem consists of predicting the popularity of a content c at t_r given its popularity at t_i. The authors show that there are significant linear correlations between popularity counts at t_i and t_r, specially when a logarithm transformation is applied. As a consequence, they propose the following linear model for popularity prediction:

ln N_s(t_r) = ln[r(t_i,t_r)N_s(t_i)] + noise(t_i,t_r)

Where N_s is the popularity at a given time, r(t_i,t_r) is the linear relationship between popularities at different times and noise describes the randomness in the data. The authors show that the noise is normally distributed in log-scale. Moreover, a motivation for the linear model proposed is the log-normal popularity distribution found in digg stories. Three prediction models, with respective error functions to be minimized, are proposed: a linear model, a constant scaling model and a growth profile model (from the literature).

Linear model: Described by the equation given above. The error function is the least-squares absolute error. However, instead of minimizing the log-transformed error, the objective is minimizing the linear error. Therefore, the authors apply a transformation that gives the linear popularity at t_r using the popularity at t_i in log-scale.

Constant scaling model: The model is defined by the following equation:

N_s(t_i,t_r) = alpha(t_i,t_r) N_s(t_i)

The relative squared error is the function to be minimized and the value of alpha that minimizes the error is found using training data. The relative squared error function is given by:

RSE = Sum_c [N_c(t_i,t_r)/N_c(t_r) - 1]^2

Growth profile model: This model is based on average growth profiles devised from the training set. The growth profile is the average popularity of a content at time t_i and is normalized by the popularity at t_r. The growth profile is defined according to the following equation:

P(t_0,t_1) = <N_c(t_0)/N_c(t_1)>_c

Where <>_c is the average value of the ratio over the content from the training set. Given the profile P, the prediction model is defined as follows:

N_s(t_r) = N_s(t_i)/P(t_i,t_r)

The three proposed models are evaluated using the least-squared error and the relative squared error functions. The datasets are divided into training and test sets of equal size. While the linear model achieves the best results for small values of t_i in terms of LSE using data from Digg, the three models perform very similarly when t_i approaches to t_r. For the Youtube data, it is not possible to draw significant conclusions w.r.t which model is the best one. Moreover, the variations in the value of LSE are higher than for the Digg dataset. In terms of the RSE function, the Constant Scaling Model achieved the best results, as expected, since it minimizes this error function.

The authors argue that the different results found for the Digg and the Youtube datasets can be explained by usage of such applications. While the saturation of the popularity of digg stories occurs very early, youtube videos are accessed during a much longer range of time. Therefore, while the average relative error converges to 0, with small variations, as we approach to the reference time for digg content, this error does not decrease monotonically and presents high variations until t_i is very close to t_r for youtube videos.

I appreciated reading this paper for several reasons. First of all, the problem studied is very relevant and also challenging. The techniques presented are theoretically sound and their experimental evaluation presented several interesting insights about the properties of both the techniques and the datasets. I will keep reading about more work done by Bernardo Huberman and other members of his lab.

Link: http://arxiv.org/abs/0811.0405

domingo, 8 de janeiro de 2012

Sampling for Large Graphs

During the past months, I've been thinking a lot about topological properties (e.g., degree distribution, clustering coefficient, diameter) of random subgraphs, which can be seen as samples of graphs. So, I decided to read this paper again. In statistics, sampling is basically the technique of selecting a limited number of instances to represent a larger population. This paper studies the sampling from graphs, which is the selection of a smaller, but still representative, subgraph from an original graph. These samples are useful in the analysis of really large graphs, for which processing is impractical.

The authors study the problem of sampling large graphs from two perspectives: scale-down and back-in-time. The scale-down sampling is defined as the the selection of a sample S of size n' from a graph G that has n vertices such that n' < n and that S has similar properties as G. In the back-in-time sampling, the sample S is expected to have the same properties of the graph G when it had n' vertices. Samples are evaluated in terms of several static (e.g., in and out-degree, clustering coefficient distribution) and temporal (e.g., densification power-law, shrinking diameter) properties of the original graph. Two distributions are compared using the Komolgorov-Smirnov D-statistic. The sampling techniques studied in the paper are:

Random Node: Vertices are selected at random.
Random Walk Page-rank Node: Vertices are selected with a probability that is proportional to their page-rank weight.
Random Degree Node: Vertices are selected with a probability that is proportional to their degrees.
Random Edge: Vertices are selected at random.
Random Node-edge: Vertices are selected at random and then a random edge from them is selected randomly as well.
Hybrid: Applies the random node-edge sampling with a probability p and the random edge sampling with a probability 1 - p (p = 0.8).
Random Node Neighbor: A vertex is selected at random together with all its neighbors.
Random Walk: Picks a vertex at random and simulates a random walk starting from from it, the algorithm goes back to the starting vertex with a probability c (c = 0.15).
Random Jump: Similar to the random walk, but jumps to a random vertex instead of the same starting vertex.
Forest Fire: Randomly picks a vertex and then selects x of its out-going neighbors where x is geometrically distributed with mean pf/(1-pf) and pf is a parameter.

The graph datasets used in the evaluation are a citation network from arXiv, an autonomous system network, a bipartite affiliation network from arXiv and a network of trust from epinions. The results show that though there is not a best overall strategy, the Random Walk Node and the Forest Fire sampling achieved the best results in the scale-down sampling task. Regarding the back-in-time task, the Random Page Rank Node and the Forest Fire strategies obtained the best results in terms of Komolgorov-Smirnov D-statistic on average. The paper also presents an evaluation of the impact of the sample size over the quality of the sampling. For small samples (< 40%), the Random Degree Node, Random Jump, and Random Walk Node achieved the best results in the scale-down task and the Forest Fire, Random Node and Random Page-Rank Node achieved the best resutls in the back-in-time task. Moreover, methods based on edge selection did not performed well in general.

My first conclusion from this paper is that it is not exactly what I'm looking for. I seems that the paper Statistical properties of sampled networks is much more related to the properties of random subgraphs. However, I enjoyed reading this paper, specially because of its experimental section. The evaluation metric applied (the Komolgorov-Smirnov D-statistical) may be handy in the future.

Link: http://cs.stanford.edu/people/jure/pubs/sampling-kdd06.pdf

Patterns of Temporal Variation in Online Media

During the past years I've studied how URLs become popular on Twitter and this is very strong related work. In this paper, Jaewon Yang and Jure Leskovec, both from Stanford University, study patterns that describe how the popularity of a given content (e.g., URLs, hashtags, memes) evolves along time, which they call patterns of temporal variation. Each content produces a time series that describes its popularity during a given time interval and the authors propose a technique for the identification of clusters of contents that present a similar variation pattern. This problem has several relevant applications, such as understanding how online content appears, gets popular, and fades over time, and also predicting content that is likely to become popular.

The clustering algorithm for time series proposed in the paper, which is called K-Spectral Centroid (k-SC) is very similar to the k-means algorithm. However, the authors define a new distance metric that is more suitable for the particular scenario of temporal variation of online content. The proposed metric is defined as an optimization problem that finds the best match between two time series both in terms of the x and y scales. The objective is to shift and scale the time series in order to minimize the effects of the absolute popularity values and the particular time in which such series peaked. Based on this metric, they define an overall evaluation function to be minimized and also an optimization problem for the identification of a cluster's centroid in each iteration of the algorithm. K-SC receives the number of clusters as an input parameter and initial centroids are assigned at random. The algorithm iterates assigning each time series to its closest centroid and then updating the centroid of each cluster until the assignments converge. The identification of the centroids requires the computation of the eigenvectors of a matrix of size L, which takes O(L^3). Moreover, the performance of the algorithm is very sensitive to the initial assignments. In order to make K-SC more scalable, the authors propose an incremental version of the algorithm using the Discrete Haar Wavelet Transform.

The K-SC algorithm is applied in the analysis of 2 datasets. The first one is a set of phrases extracted from blogs and news articles through the MemeTracker methodology. Each time series is truncated to 128 hours and the number of clusters is chosen empirically. Experiments show that K-SC outperforms two variations of K-means in terms of two evaluation metrics: one the measures the distance between points and their respective centroids and another that evaluates the distance between centroids. While the first variation of the K-means algorithm aligns the peaks of the time series, the second also scales peak volumes. They also show that the incremental version of K-SC is more scalable than the naive one.

The 6 clusters identified by K-SC are very interesting. They represent intuitive and distinct patterns, such as fast and short popularization and popularity rebounding. The authors argue that these patterns are related to the time in which different types of media mention the phrases and that this information may be useful to classify phrases into each of the 6 categories. They show that these temporal features are more effective than volume features (i.e., the fraction of the total volume of citations corresponding to a particular website) and also tf-idf features based on the citations of a particular website to a given phrase and the overall citations of this website. Similar results are found in the application of K-SC to the analysis of URLs on Twitter.

This paper studied a very interesting problem and presents several contributions both in terms of the techniques proposed and the case studies presented. I think that, at some points, the authors make assertions that too strong, specially regarding the classification accuracy of the techniques, which seems quite similar to me. However, my overall evaluation of the paper is very positive.

Link: http://cs.stanford.edu/people/jure/pubs/memeshapes-wsdm11.pdf

sexta-feira, 6 de janeiro de 2012

It's Who you Know: Graph Mining using Recursive Structural Features

This paper studies the problem of extracting features for nodes. Node features are useful in many graph mining tasks, such as classification, de-anonymization, and outlier detection. The focus of this work is the discovery of structural features, i.e. feature that are related to the topology of the graph. Structural features are specially interesting for transfer learning tasks, where the knowledge discovered from one graph is applied to another one. The authors cite as an example of a transfer learning task the problem of classifying IPs from a given network in terms of traffic type using labelled IPs from another network.

Node features are classified as follows. Local features are vertex degrees. Egonet features are extracted from the graph induced by a given node and its neighborhood (e.g., the number of edges that connect at least one node in this induced graph). Neighborhood features combine both the local and egonet features. Recursive features, which definition are the most important contribution of the paper, are recursive aggregations (sums and means) of neighborhood and even other recursive features. Finally, the set of regional features comprises the neighborhood and recursive features.

Recursive features have their range of values reduced to small integers through a vertical logarithm binning procedure. According to the authors, this procedure is effective in discriminating values that follow a power-law distribution. A feature graph containing edges between every pair of feature values that never disagree by more than s (input parameter) is used to prune the number of features. Each connected component of this graph is replaced by the simplest feature (i.e., the one generated with the minimum number of recursions).

The authors apply regional features in three graph mining tasks: within network classification, across network classification, and node de-anonymization. Such features are given as input to a log-forest model, which is a bagging of logistic regression classifiers. For within network classification, they use a dataset where nodes are IPs, edges represent communication, and classes are related to the type of traffic (e.g., DNS, Web, P2P). Their technique outperforms the weighted vote relational neighbor classifier, which makes use of neighbor's labels, in cases where the labels are sparse and achieve competitive results when 90% of nodes are labelled. The same dataset is used for across-network classification, in which different networks are generated using data from distinct periods of time and a completely labelled network is used to classify nodes from another completely unlabelled network. The results show that regional features achieve accuracies varying from 70% to 90% in this scenario.

The de-anonymization task consists of, given two networks with shared nodes, identifying corresponding nodes (i.e., those that represent the same identity). They solve this problem in IM, communication and SMS networks, generated in different periods of time. Moreover, they also apply features from a who-follows-whom network to de-anonymize a who-mentions-whom network from Twitter. Regional features outperform local and neighborhood features and also a baseline strategy that makes random choices in most of the datasets. While such features are able to set top-1% scores to the target node in up to 78% of the cases in the IP network and up to 55% in the IM network, the results on Twitter are poor. Moreover, regional features does not achieve the best results for the SMS data, unless it is able to select the nodes to be classified.

In general, the paper is not very clear. The authors should give at least one example of the generation of recursive features. Furthermore, although the proposed technique is shown to be effective, I found it to be arbitrary in several points, specially the pruning strategy. One of the strengths of the paper is its thorough experimental evaluation. Nevertheless, the results would be more easily understood if a standard evaluation procedure was employed. In particular, I was not able to deduce in which extent the regional features were more effective in the de-anonymization of nodes in the IP than in the Twitter case study.

terça-feira, 3 de janeiro de 2012

Detecting Influenza Epidemics using Search Engine Query Data

This paper was in my TO-READ stack for a long time. It is a work developed by some researchers from Google and the Centers for Disease Control and Prevention and was published by the Nature magazine in 2009. The paper shows how query data is correlated with influenza illness epidemics in a given region, being useful for monitoring purposes. More specifically, a linear regression using the relative frequency of some queries submitted to the google search engine estimates accurately the percentage of physician visits in which a patient presents influenza-like symptoms. Official reports were applied in the identification of the queries and also in the validation of the models.

An interesting property of the strategy proposed is the fact that it is completely automatic. Instead of applying machine learning techniques to classify a set of 50 million queries, they applied the official data to rank and select the queries that produced the most accurate estimates. The weekly counts of each query were normalized for each state during a five-year period. Although the study considered data from each region of the United States, the selected model (i.e., set of queries + linear regression) was the one that obtained the best global results. The final set of queries are related to influenza complications, cold/flu remedy, symptoms, etc.

The mean correlation of the model with the official reports was 0.90. The model achieved a mean correlation of 0.97 when applied to untested data (i.e., not used during the fitting phase) and a further evaluation using data from the state of Utah obtained a mean correlation of 0.90. Moreover, the authors found that the proposed model is able to provide estimates 1-2 weeks ahead of the publication of the reports.

An interesting aspect of the paper is that it combines a very simple model (linear regression) with a huge dataset. In academy, we accustomed to the opposite approach, which is the application of a complex model to fit a small dataset. I like to call the strategy used in the paper the google approach. According to the paper, hundreds of machines running map-reduce were used in the processing of the models. Pretty cool, isn't it? Searching for related papers, I found one which argues that the query data is not that useful for this kind of monitoring. Nevertheless, using web data to monitor epidemics has become a hot topic.

Link: http://research.google.com/archive/papers/detecting-influenza-epidemics.pdf

Echoes of Power: Language Effects and Power Differences in Social Interaction

Today I've read a very interesting paper entitled Echoes of Power: Language Effects and Power Differences in Social Interaction. In this paper, some researchers from Cornell University and Yahoo! study the effect of people's position (or power) over the way they influence each other in terms of language usage. The basic idea of the paper is that in communication a higher-powered person (e.g., a justice) is more likely to affect the language of others than a less-powered person (e.g., a lawyer). Moreover, a higher-powered person has its language less affected by the others than a less-powered one. The authors show these patterns using the concept of linguistic coordination, which is, basically, the way people tend to mimic the usage of function words (e.g., articles, pronouns) from those who they are communicating with.

Based on the exchange theory framework, the authors consider two types of power in social interactions: status and dependence. Measures of coordination between two groups are defined for a specific linguistic style marker and also for aggregated markers. Such measures are, basically, a function of the increasing of the probability of a speaker 'a' from one group A to use a given marker in a reply to a target 'b' from a group B given that 'b' has used this marker while communicating with 'a'. The author say that, in this case, 'a' coordinated to 'b'.

They study these exchanges in language patterns in two real-life scenarios: Wikipedia and Supreme Court arguments. In the Wikipedia, admins (higher-powered) and non-admins (lower-powered) communicate about article editions. In terms of power, the authors found that users coordinate more to admins than to non-admins, as expected. However, the results also shown that admins coordinate more to other users than non-admins, which is counterintuitive. A further analysis revealed that admins reduce their levels of coordination after they are promoted. Considering dependence, they found that an user coordinates more to thoses users who he/she disagrees with. In the Supreme Court arguments, justices (higher-powered) and lawyers (lower-powered) argue about cases. As expected, they found that lawyers coordinate more to justices than the opposite. This coordination is even more likely for the cases in which the laywer and justice disagreed. Moreover, favorable justices coordinate more.

The authors also show how coordination features (i.e., features that state whether a user x coodinates more to y than the opposite) are useful for the prediction of the most powerful user in social interactions. In fact, while other features, such as textual content, may be more effective in this task, coordination features work in cross domain settings. Finally, the paper also studies activity and gender bias in coordination.

Link: http://www.cs.cornell.edu/~cristian/Echoes_of_power.html

Journals to watch

After listing a few conferences that may publish papers of my interest, I though it could be a good idea to do the same for journals as well. I must confess that I do not read journal papers as often as I should. My current strategy has being finding interesting journal papers through conference papers, but it seems that many people are using the same strategy. As a consequence, some journal papers that are relevant to my research are not cited by relevant conference papers. I'm aware that many journal papers in computer science are just extended versions of conference papers. Nevertheless, even in these cases the journal paper may be worth reading because it gives more detail on a given study. By my surprise, there are too many journals out there. While some of them have a specific topic, such as Data Mining, others are about Computer Science. Journals about science in general (e.g., Nature, Science) were not included because I'm already used to check them occasionally.

Journal	Publisher
Transactions on the Web	ACM
AI Magazine	AAAI
Communications of the ACM	ACM
Computing Surveys	ACM
Data & Knowledge Engineering	Elsevier
Data Mining and Knowledge Discovery	Springer
Decision Support Systems	Elsevier
IEEE Computer	IEEE
Information Processing Letters	Elsevier
Information Systems	Elsevier
Intelligent Systems	IEEE
Internet Computing	IEEE
Internet Mathematics	Taylor & Francis
Journal of Intelligent Information Systems	Springer
Journal of Machine Learning Research	SPARC
Journal on Very Large Data Bases	VLDB
Knowledge and Information Systems	Springer
Machine Learning	Springer
Parallel Computing	Elsevier
Pattern Recognition Letters	Elsevier
Proceedings of the National Academy of Sciences	American Academy of Sciences
SIGKDD Explorations	ACM
SIGMOD Record	ACM
Social Network Analysis and Mining	Springer
Statistical Analysis and Data Mining	Wiley
Transactions on Database Systems	ACM
Transactions on Intelligent Systems and Technology	ACM
Transactions on Knowledge and Data Engineering	IEEE
Transactions on Knowledge Discovery from Data	ACM
Transactions on Management Information Systems	ACM
Transactions on Pattern Analysis and Machine Intelligence	IEEE

domingo, 1 de janeiro de 2012

Conferences to watch in 2012

2012 has just begun and this year I'm supposed to read a lot. So, I decided to make a list of conferences to watch in 2012. The list contains papers on Data Mining, Web, Information Retrieval, Databases, Artificial Intelligence, Machine Learning, e-Commerce, Recommender Systems, Digital Libraries, and Parallel and Distributed Computing. The plan is to read at least one paper on every business day and to post small summaries of them here.

Conference	Date
ACM Conference on Web Search and Data Mining	Feb 8-12
ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming	Feb 25-29
International Conference on Extending Database Technology	Mar 26-30
Conference on Uncertainty in Artificial Intelligence	Aug 15-17
IEEE International Conference on Data Engineering	Apr 1-5
World Wide Web Conference	Apr 16-20
ACM SIGMOD/PODS Conference	May 20-25
SIAM Conference on Data Mining	Apr 26-28
IEEE Parallel and Distributed Processing Symposium	May 21-25
Pacific-Asia Conference on Knowledge Discovery and Data Mining	May 29-Jun 1
AAAI Conference on Weblogs and Social Media	Jun 4-8
ACM Conference on Electronic Commerce	Jun 4-8
ACM/IEEE Joint Conference on Digital Libraries	Jun 10-14
ACM Web Science	Jun 22-24
AAAI Conference on Artificial Intelligence	Jul 22-26
International Conference on Machine Learning	Jun 26-Jul 1
ACM SIGKDD Conference on Knowledge Discovery and Data Mining	Aug 12-16
ACM SIGIR Conference	Aug 12-16
Conference on Advances in Social Networks Analysis and Mining	Aug 26-29
Conference on Very Large Databases	Aug 27-31
Conference on Social Computing	Sept 3-6
Conference on Practice of Knowledge Discovery in Databases	Sept 24-28
Internet Measurement Conference	Nov 14-16
Conference on Neural Information Processing Systems	Dec 3-8
IEEE Conference on Data Mining	Dec 10-13
ACM Conference on Information and Knowledge Management	TBD
International Conference on Recommender Systems	TBD