sábado, 28 de julho de 2012

Finding high-quality content in social media

This work was authored by researchers from Emory University and Yahoo! Research. The research question studied here is how we can find high-quality content in social media websites. In particular, the authors have considered how several attributes associated with question/answering data can support the identification of high-quality questions and answers.

The main motivation for this work comes from the huge volume of user-generated content on the Web. More specifically, community-driven question/answering portals were becoming popular at the time. The authors had access to a dataset from Yahoo! Answers, a very popular question/answering platform.

The general idea of the paper is to combine a large set of attributes, associated with questions, answers, and users, in order to predict/classify high-quality questions and answers. These attributes can be classified into the following groups:

  • Content features: Extracted from the textual information (questions and answers). These include punctuation and typos, syntactic and semantic complexity, and grammaticality. I will not cover the specific strategy applied to extract these features here.
  • User relationships: The paper mentions two types of graph: user-item and user-user. The user-user graphs are very intuitive. Each relationship defined (e.g., answering a question, giving a star to a question, giving thumbs up/down to an answer) defines a graph. Link analysis algorithms (PageRank and HITS) are applied in order to find authorities/hubs in such graphs. However, I did not get how the user-item graph is employed.
  • Usage statistics: These are basically variations of the number of clicks received by a question. In particular, normalizing the number of clicks by category is very important.
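Just to make the link analysis idea concrete, here is a minimal sketch (not the authors' implementation) of PageRank over a hypothetical user-user "answering" graph, where an edge u -> v means u's answer gives authority to v. The damping factor, iteration count, and toy graph are all illustrative choices.

```python
def pagerank(edges, damping=0.85, iters=50):
    """Power-iteration PageRank over a list of directed (src, dst) edges."""
    nodes = {n for e in edges for n in e}
    out = {n: [] for n in nodes}
    for src, dst in edges:
        out[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            if out[n]:
                share = damping * rank[n] / len(out[n])
                for dst in out[n]:
                    new[dst] += share
            else:  # dangling node: redistribute its rank uniformly
                for m in nodes:
                    new[m] += damping * rank[n] / len(nodes)
        rank = new
    return rank

# Toy "answering" graph: users b and c both point to a, and a points to c.
scores = pagerank([("b", "a"), ("c", "a"), ("a", "c")])
```

Users who attract many such edges (here, user a) end up with the highest scores, which is the sense in which link analysis surfaces authoritative answerers.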
Given these attributes, the authors propose a binary classification problem that consists of distinguishing high-quality content from the rest. The authors evaluated several classifiers and found that the best performance was achieved by a technique known as stochastic gradient boosted trees. A stochastic gradient boosted tree is, roughly speaking, an ensemble of decision trees.
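As a rough illustration of what "stochastic" means here, the sketch below boosts tiny regression stumps, each fit on a random subsample of the residuals. Everything about it (1-D feature, learning rate, subsample fraction, number of rounds) is invented for illustration and has nothing to do with the authors' actual implementation.

```python
import random

def fit_stump(xs, residuals):
    """Best single-threshold regression stump (1-D) by squared error."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm = sum(left) / len(left) if left else 0.0
        rm = sum(right) / len(right) if right else 0.0
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    return best[1:]

def boost(xs, ys, rounds=20, lr=0.5, subsample=0.7, seed=0):
    """Stochastic gradient boosting: each stump is fit to the residuals
    of a random subsample (the 'stochastic' part)."""
    rng = random.Random(seed)
    stumps, pred = [], [0.0] * len(xs)
    for _ in range(rounds):
        idx = rng.sample(range(len(xs)), max(2, int(subsample * len(xs))))
        res = [ys[i] - pred[i] for i in idx]
        t, lm, rm = fit_stump([xs[i] for i in idx], res)
        stumps.append((t, lm, rm))
        pred = [p + lr * (lm if x <= t else rm) for p, x in zip(pred, xs)]
    return stumps

def predict(stumps, x, lr=0.5):
    return sum(lr * (lm if x <= t else rm) for t, lm, rm in stumps)

# Toy binary task: label is 1 exactly when x > 4.
xs = list(range(10))
ys = [0.0] * 5 + [1.0] * 5
model = boost(xs, ys)
```

The ensemble of weak stumps quickly learns the threshold, and scoring above/below 0.5 recovers the binary label.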

The question/answering data can be seen as a set of interactions among users, questions, and answers, as shown in the following schema.


Moreover, answer and question features can be seen as paths in a tree, as follows:



Another set of features proposed by the authors is related to the properties of the text of a question and its answer. This set includes the Kullback-Leibler divergence between the language models of the question and the answer, their non-stopword overlap, and the ratio between their lengths.
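A minimal sketch of these three question/answer text features, assuming simple add-alpha-smoothed unigram language models and a tiny hand-picked stopword list (the paper's actual smoothing and stopword choices are not specified here):

```python
import math
from collections import Counter

def lm(text, vocab, alpha=0.1):
    """Unigram language model with add-alpha smoothing (illustrative choice)."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: (counts[w] + alpha) / (total + alpha * len(vocab)) for w in vocab}

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) over a shared vocabulary."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

question = "how do I sort a list in python"
answer = "use the sorted function to sort a list"
vocab = set((question + " " + answer).lower().split())
p, q = lm(question, vocab), lm(answer, vocab)

stopwords = {"a", "the", "in", "to", "do", "i", "how", "use"}
overlap = (set(question.split()) & set(answer.split())) - stopwords
length_ratio = len(question.split()) / len(answer.split())
```

The intuition is that an on-topic answer has a language model close to the question's (low KL divergence) and shares content words with it.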


The question/answering dataset used in this work contains 6,665 questions and 8,366 answers. Questions and answers were labeled as high-quality content or not by human editors. Further, this dataset was extended to a larger set by adding users whose answers or questions were part of the original dataset, together with all their questions and respective answers.

A characterization of the dataset shows that the user interaction graphs have highly skewed distributions. Furthermore, it is shown that users act as both askers and answerers (i.e., there are no fixed roles).


The following table shows the accuracy of the classifier in the question classification task using different sets of features and their combinations. The baseline applies only n-grams, and the intrinsic set combines several textual (content) features. The evaluation metrics considered are precision (P), recall (R), and the Area Under the ROC Curve (AUC).


The most significant features in question classification identified using a chi-squared test are:

  1. Number of stars received by the asker
  2. Punctuation density in the question's subject
  3. Question's category
  4. Number of clicks received by the question (normalized by category)
  5. Avg number of thumbs up received by questions written by the asker
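For intuition on how such a chi-squared ranking works, here is a sketch over a hypothetical mini-dataset with binarized features; the feature names and data are invented, not taken from the paper.

```python
def chi2(feature_vals, labels):
    """Chi-squared statistic for a binary feature against a binary label."""
    n = len(labels)
    stat = 0.0
    for f in (0, 1):
        for y in (0, 1):
            obs = sum(1 for v, l in zip(feature_vals, labels) if v == f and l == y)
            exp = (sum(1 for v in feature_vals if v == f) *
                   sum(1 for l in labels if l == y)) / n
            if exp:
                stat += (obs - exp) ** 2 / exp
    return stat

# Hypothetical mini-dataset: label = high-quality question (1) or not (0).
labels    = [1, 1, 1, 0, 0, 0]
has_stars = [1, 1, 1, 0, 0, 0]  # perfectly associated with quality
typo_free = [1, 1, 0, 1, 0, 0]  # weakly associated
```

Features with higher statistics are more strongly associated with the label, which is how a ranked list like the one above is produced.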
Next, we show the effectiveness of the answer classification method.

Top features for answer classification are:


  1. Answer length
  2. Number of words in the answer with a frequency larger than c in the dataset
  3. (Number of thumbs up minus number of thumbs down) received by the answerer, divided by the total number of thumbs received
My opinion on this paper is negative. I think the results are not reproducible due to the complexity of the techniques used. Some of the attributes considered are too specific and were applied in an ad hoc fashion. Finally, I could not draw many relevant lessons/conclusions from the results presented.


Link: http://www.mathcs.emory.edu/~eugene/papers/wsdm2008quality.pdf

quinta-feira, 26 de julho de 2012

Identifying relevant social media content leveraging information diversity and user cognition

This is another paper related to a research project that I'm working on. It studies the problem of finding relevant social media content (e.g., tweets, URLs). The research was developed by people from Rutgers University and Microsoft Research.

The main motivation for this paper is the popularity of social media applications such as Twitter. The authors argue that these kinds of media are especially useful for information seeking and exploration of timely happenings. However, given the huge volume of user-generated content in these social media websites, how can we identify relevant content? This is the main question studied in this paper. The large scale and high rate of growth of social media, combined with the rich set of attributes (e.g., network properties, location, time), make this question really challenging from a research perspective. One specific point that the authors make in order to motivate their research is that static structural metrics, such as PageRank and HITS, are not suitable for social media because authorities in this scenario are likely to be emergent and temporally dynamic.

The first results presented in the paper come from a small survey showing that most Twitter users apply the Twitter search interface or external search/exploration tools (e.g., Bing Social) in order to search for and explore content on Twitter.

One specific aspect emphasized throughout the paper is the importance of diversity in social media content. In other words, in several scenarios, users may be interested in heterogeneous information on a given topic. The example given is the oil spill, for which broad coverage is desirable.

Different from most of the papers on social media I've read so far, this paper is focused on more cognitive and subjective user perceptions of the content. Therefore, the evaluation is based on user studies instead of traditional metrics such as precision and recall. In fact, the authors state that there is no ground-truth information to support a large-scale study of the quality metrics they are interested in. More specifically, relevant content is defined by them as content that is interesting, informative, engaging, and better remembered. The empirical studies presented in the paper are all based on Twitter data.

To be more formal, the problem studied in the paper is based on the definition of entropy, which is a measure of the degree of uncertainty of a given variable. Moreover, the authors define the following set of content attributes for Twitter content (i.e., tweets):

  • Diffusion property (is it an RT?)
  • Responsivity nature (is it a reply?)
  • Temporal relevance (when was it posted?)
  • Thematic association (topic given by OpenCalais)
  • Authority dimension (how many followers does the author have?)
  • Geographic dimension (timezone)
  • Hub dimension (how many followees/friends does the author have?)
  • Degree of activity (how many tweets has the author produced so far?)
Given these attributes, the problem studied in the paper consists of, given a stream of tweets T, a diversity parameter w, and a sample size s, determining a tweet set T_w such that |T_w| = s and the diversity level (entropy of attributes) of T_w is w. In practice, T_w should have a diversity level as close to w as possible. Moreover, it is important to define an ordering of T_w in terms of diversity.
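For a concrete feel of the diversity measure, here is a sketch of Shannon entropy over the empirical distribution of a single tweet attribute (the paper combines several attributes; the topic values below are invented):

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (bits) of the empirical distribution of an attribute."""
    counts = Counter(values)
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# Hypothetical topic attribute: a single topic is minimally diverse,
# a uniform mix is maximally diverse (log2 of the number of distinct values).
skewed = entropy(["oil", "oil", "oil", "oil"])
uniform = entropy(["oil", "iphone", "sports", "music"])
```

So a sample of four tweets on one topic has entropy 0, while four tweets on four different topics reach the maximum of 2 bits; w targets a point between these extremes.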

The authors asked a group of 11 users about the importance of each of these attributes to them, but I could not find anything interesting in these results. I believe that asking users was not a good idea because they may not be aware of the importance of the attributes for themselves. The authors applied the results of this survey as weights to attributes, but found that these weights do not improve the proposed technique significantly.

The proposed solution to the problem of selecting relevant content is divided into two parts: (1) A content selection method and (2) a content organization method.

Content selection: First, a set of tweets is filtered according to the particular topic of interest. Each piece of content (tweet) is represented as a set of attributes, and it seems like the text of the tweet is not used. Then, the following heuristic is applied in the selection of a sample of tweets:
  1. Create an empty set T_w
  2. Pick a tweet at random
  3. For each remaining tweet, compute the distortion of the entropy (l_1 norm) of the current sample in case the given tweet is added.
  4. Add the tweet that makes the entropy level closest to the desired level w
  5. In case the sample has s tweets, finish. Otherwise, go back to 3
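The steps above can be sketched as follows. For simplicity this version measures entropy over a single attribute (so the distortion is just an absolute difference rather than an l_1 norm over several attributes) and, unlike the paper, seeds the sample with the first tweet instead of a random one so the example is deterministic:

```python
import math
from collections import Counter

def entropy(values):
    counts = Counter(values)
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def select_sample(tweets, attr, w, s):
    """Greedily add the tweet that brings the sample's attribute entropy
    closest to the target diversity level w."""
    sample = [tweets[0]]  # the paper seeds with a random tweet
    remaining = list(tweets[1:])
    while len(sample) < s and remaining:
        best = min(remaining,
                   key=lambda t: abs(entropy([x[attr] for x in sample + [t]]) - w))
        sample.append(best)
        remaining.remove(best)
    return sample

tweets = [{"topic": t} for t in ["oil", "oil", "oil", "iphone", "iphone", "sports"]]
low_diversity = select_sample(tweets, "topic", w=0.0, s=3)
```

With w = 0 the heuristic keeps picking tweets on the seed's topic; a larger w would pull in tweets from other topics.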

Content organization: Tweets are ordered in terms of their distortion of the normalized entropy with respect to w.

The dataset used in the evaluation of the proposed method contains 1.4B tweets. The authors considered as baselines some variations of the proposed method, one method that always returns the most recent tweet, and another that always returns the most tweeted content. The particular content of interest is URLs (but some of the attributes considered are associated with the tweet that contains each URL).


The above results show that the diversity of the returned samples is close to the desired levels (w).

The user study described in this paper involved 67 users. Moreover, the authors considered two topics: oil spill and iPhone. The evaluation criteria considered are: interestingness, informativeness, engagement, and memorableness. A set of 12 samples containing 10 tweets each was presented to users. Each user was asked the following questions:

  1. How long did you spend reading the sample?
  2. How interesting is it?
  3. How diverse is it?
  4. How informative is it?

The actual time taken was compared against the perceived time as a measure of engagement. Moreover, they checked whether users were able to remember the tweets they saw after a time interval, as a measure of memorableness. The level of diversity was also varied.

The level of interaction between the variables involved in the experiments was checked using a repeated-measures ANOVA procedure. The results show that the interactions are not significant. The following results show that the proposed method (PM) outperforms the baselines in terms of all metrics considered.


Moreover, PM generates results for which the diversity is better perceived by the users.


It can be noticed that the relationship between the actual and perceived diversity is not linear. The next result shows how different quality measures are affected by the diversity level.


In general, good results are achieved at low and high, but not intermediate, diversity levels. This result is intriguing.

I was not expecting much from this paper, but it ended up being a good read. I think that the approach taken to assign weights to attributes was naive. Also, some of the most interesting results in the paper are not well supported. As an example, I would not expect that very low and very high diversity are appreciated by users, but the explanations given by the authors for this fact were weak. I believe that some of the findings of this paper may be due to the particular entropy measure applied.

Link: http://research.microsoft.com/en-us/um/people/counts/pubs/relevantsocialmedia_ht2011.pdf

segunda-feira, 23 de julho de 2012

The structure of online diffusion networks

This research was developed by some people who were working at Yahoo! Labs at the time. One of the authors is Duncan Watts, who is now affiliated with Microsoft Research. The main topic of the paper is viral marketing. More specifically, the authors study how much of the total adoption of an object (e.g., URL, product) is due to viral spreading. In order to support their findings, they analyze spreading processes in 7 different scenarios.

For a long time, the research community has tried to model diffusion as an epidemic process. In particular, these disease-like models are expected to produce an "S-shaped" cumulative adoption curve. Further, threshold models, in which users have minimum adoption levels to be reached before they become adopters, became popular. Moreover, network models of adoption, where users' actions depend on a small set of neighbors and both local and global structural features can influence the size and the likelihood of cascades, attracted the interest of the research community. In particular, these models motivated several studies on the selection of a small number of nodes that could maximize the influence over the network. From an optimization perspective, other studies focused on the trade-off between the cost of activating a seed set and the global influence achieved by these nodes in the diffusion process. However, the generation of large cascades due to diffusion processes still lacks empirical evidence.

A cascade is composed of a seed node, which took action independently of other nodes, and a set of non-seed nodes influenced either directly or indirectly by the seed to take the same action. In this paper, the authors consider tree representations of cascades. The source of influence is always the earliest parent of the affected node. This study characterizes these cascades in terms of size, depth, and the proportion of the total adoption due to a given isomorphic class of cascades.
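The size and depth statistics are straightforward once a cascade is stored as a tree. A minimal sketch, assuming a child-to-parent map with the seed's parent set to None (the representation is my own, not the paper's):

```python
def cascade_stats(tree):
    """Return (size, depth) of a cascade tree given as {node: parent}."""
    def depth_of(node):
        d = 0
        while tree[node] is not None:  # walk up to the seed
            node = tree[node]
            d += 1
        return d
    return len(tree), max(depth_of(n) for n in tree)

# Seed 'a' influences 'b' and 'c' directly; 'c' then influences 'd'.
size, depth = cascade_stats({"a": None, "b": "a", "c": "a", "d": "c"})
```

Here the cascade has size 4 and depth 2, i.e., node 'd' is two generations away from the seed.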

The datasets applied in this work are:

  • Yahoo! kindness: Website that asked users to describe and share their acts of kindness through Facebook, Twitter, etc. Diffusion is tracked through a unique URL given to each user. The dataset contains 59K users.
  • Zync: Plugin for watching videos with contacts in the Yahoo! IM applications. A diffusion occurs whenever an invitation to share a stream is accepted. This dataset contains 374K users.
  • The secretary game: Online game. Players were encouraged to share their (unique) game URL. The total number of adopters is 2.9K.
  • Twitter news stories: Collection of 80K news stories posted by NYT, CNN, MSNBC, Y! News, and Huffington Post on Twitter. An adoption means sharing a URL, and the number of adoption events is 288K.
  • Twitter videos: 540K YouTube videos posted on Twitter. An adoption occurs whenever a user posts a link to a video. There are 1.3M adoption events.
  • Friendsense: Third-party Facebook application that asked users about their political views and also their beliefs about their friends' political views. The dataset contains 2.5K users, 100K answers and 80 questions.
  • Yahoo! Voice: Paid VoIP service supported by Yahoo! Messenger. An adopter is someone who bought Y! Voice credits, and cascades are identified according to the interactions in the Yahoo! IM network. The dataset is composed of 1.8M users.


Results show that, in general, cascade trees are small. The most frequent cascade size is 1, and the cascade size distribution is very skewed. The authors investigate whether a small number of large cascades would account for a significant proportion of the adoptions and find that this is not the case: less than 10% of the adoptions occur in cascades with more than 10 nodes. Moreover, the vast majority of adoptions occur within 1 generation of the seed node.

Based on these results, the authors make several interesting points about the impact of their discoveries on our understanding of viral marketing. In fact, they challenge the common belief that a small set of users may produce large cascades in real scenarios. On the other hand, they also discuss how different results could be found in case adoptions were associated with explicit incentives, or in scenarios where the adoption is automatic, such as in email viruses. The following figure shows some of the largest cascades found in the Twitter datasets. Colors represent cascade levels (i.e., distance from the seed node).


This paper is very well written and has a very clear contribution. I appreciated the literature review, which gave a historical view of information diffusion research. However, the paper lacks a further step in the direction of alternative models able to capture the main properties of the adoption cascades found. In fact, it seems that this further step was left as future work. Let's wait and see what comes next.

Link: http://5harad.com/papers/diffusion.pdf

domingo, 22 de julho de 2012

Congress of the Brazilian Computer Society



I attended the 2012 edition of the Congress of the Brazilian Computer Society (SBC), which ran from July 16-19 in Curitiba, Parana. The conference was nice, except for the tragic death of several students who were traveling by bus from Belem, Para in order to attend the Congress. At several moments, it was sad to think that these fellow students could have been there learning and sharing experiences with us but, for reasons I can't explain, they were prevented from doing so.

I had the opportunity to present my M.Sc. research on correlating vertex attributes and community structure in attributed graphs at the Thesis and Dissertation Contest, and I'm very honored that my work was selected as the best M.Sc. thesis of 2011. The other nine competitors presented really great research work on several topics, such as Software Engineering, Linked Data, Graph Theory, Grid Computing, Information Retrieval, and Optimization. In fact, I was not expecting that my thesis would be selected as the best one among so many interesting works. I think that my talk was very convincing, especially because I had practiced it a lot and also because I took several comments from my colleagues and advisor into account in order to improve it. Another aspect that may have played a key role is that my M.Sc. research goes from a general and intuitive problem to some cool theoretical and empirical results. As a consequence, I was able to catch the attention of the audience and the reviewers with a lot of cool examples, a little bit of math, some algorithmic techniques, and performance evaluations. Organizing a research project as a sequence of related research questions seems to be a powerful presentation strategy, and I should keep employing it in the near future. I think this award is a good sign that I'm evolving as a researcher, which is a great motivation to keep working hard.

I also presented recent work on detecting relevant content and influential users on Twitter at the first edition of the Brazilian Workshop on Social Network Mining. In this case, my presentation was terrible. I was tired because of the previous talk on the same day. Also, the slides were poor and I had not practiced the talk at all. Unfortunately, I went beyond the time limit, which made things even worse. Nevertheless, our paper won the best paper award at the event (to my surprise again). That was my first best paper award, and I'm very proud of it too. Now I'm working to turn those preliminary results into a full conference paper. The workshop was a lot of fun, especially because I had the opportunity to ask a lot of questions after the talks. I don't know why, but Brazilian researchers usually do not ask questions. I think this behavior is very disappointing. Questions indicate that the talk was relevant to the audience and have several benefits, such as helping the presenter improve his presentation skills and supporting more and better research. In particular, I think that questions are very important for students and young scientists.

The congress was a great opportunity to talk with several colleagues from UFMG and other universities as well. The conference dinner was really nice, though I've heard several complaints about the event organization in general.

My first impression of Curitiba was very positive. Unlike my friends, I liked the weather (about 10 °C during the winter) and the people. Moreover, the traffic and the public transport are much better than those of the Belo Horizonte Metropolitan Area, where I live. I had a few hours to take a very fast bus tour around the city. The picture included in this post was taken at Tangua Park during the sunset.

Information seeking: Convergence of search, recommendation and advertising

This paper is one of those that try to give a broader view on a recent research problem. In particular, this paper is focused on the idea of integrating search, recommendation, and advertising into a unique system. The authors are from Stanford and include Hector Garcia-Molina who is pretty famous in the Database Management research community.

The motivation for the convergence among search, recommendation, and advertising is the so called "deluge of data" (or information overload). In this scenario, a fundamental problem consists of identifying objects that satisfy a user's information needs. The authors highlight three important mechanisms related to information providing:

  • Search: Return a set of objects from a collection based on a user query;
  • Recommendation: Return a set of objects of interest to a user based on contextual information;
  • Advertisement: Similar to recommendation, but specifically for products and services.
The authors argue that these three mechanisms are strongly related and, as a consequence, an integrated view that takes them into account may bring several benefits. In fact, these mechanisms share the common goal of matching a context (e.g., a query) to a collection of information objects. The following figure illustrates this unified view.



As means to characterize search, recommendation and advertisement, the paper describes them in terms of the following properties:


The pull delivery mode refers to those scenarios where there is an explicit user request; in the push delivery mode, by contrast, there is no explicit user request. The unexpected good aspect occurs whenever a novel result may be beneficial. Collective knowledge is related to the use of a historical database in order to provide a better service.

A short coverage of the search, recommendation, and advertising mechanisms is given. I will summarize some of the main points presented by the authors.

Search: This is the most mature among the three technologies. It can be divided into two main steps: filtering and ranking. The filtering step is usually based on a query, and the ranking step aims to sort the filtered results in terms of relevance.

Recommendation: Can be classified by the strategy employed (content-based or collaborative filtering) and the recipient of the recommendations (a user or a group). Content-based strategies recommend objects that are similar to those the user liked in the past. Collaborative filtering strategies identify interesting associations between users and the objects consumed by them in order to make recommendations. Moreover, there is a distinction between memory-based and model-based collaborative filtering techniques. While the former applies statistical measures to match users with similar tastes, the latter builds a model that captures such relationships. Regarding the recommendation recipient, recommendation strategies can be divided into those that recommend objects to a single user and those that recommend objects to a group of users (e.g., a movie).
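To illustrate the memory-based flavor, here is a minimal sketch of user-based collaborative filtering with cosine similarity; the users and ratings are invented for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two users' item-rating dicts."""
    num = sum(u[i] * v[i] for i in set(u) & set(v))
    den = (math.sqrt(sum(x * x for x in u.values())) *
           math.sqrt(sum(x * x for x in v.values())))
    return num / den if den else 0.0

def recommend(target, others, k=1):
    """Recommend items rated by the k most similar users that the
    target has not rated yet, best-rated first."""
    ranked = sorted(others, key=lambda o: cosine(target, o), reverse=True)
    scores = {}
    for o in ranked[:k]:
        for item, rating in o.items():
            if item not in target:
                scores[item] = max(scores.get(item, 0), rating)
    return sorted(scores, key=scores.get, reverse=True)

alice = {"matrix": 5, "inception": 4}
bob = {"matrix": 5, "inception": 5, "memento": 5}
carol = {"titanic": 5}
recs = recommend(alice, [bob, carol])
```

Since bob's tastes overlap with alice's and carol's do not, alice gets bob's unseen item recommended; a model-based method would instead learn these associations offline.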

Ads: Sponsored search is the mechanism by which relevant ads are delivered as part of the search experience. Therefore, sponsored search must satisfy both the user's need for relevant results and the advertiser's desire for qualified traffic to their websites. The usual approach applied by search engines is offering keywords through auctions. The price of a keyword depends on its relevance. Whenever the user submits a query, the search engine matches the query against the keywords and returns the ads that are relevant w.r.t. the query.

Currently, the aforementioned information-providing mechanisms are separated because they cater to different needs. Moreover, these mechanisms usually bring important performance issues that may prevent their integration in real applications. However, the authors present several points showing that several technologies are evolving towards this integration, which they call an Information Consolidation Engine (ICE). An ICE provides a single interface for all information needs, sharing technologies across search, recommendation, and advertising, and covering a broad set of objects. In the backend, an ICE should merge functionalities of these three mechanisms, supporting better modularization and reuse.

The main challenges in the development of ICEs are:
  • Complex targets: ICEs may return a set (or package) of objects in order to satisfy the users' needs.
  • Constraints: The user may provide a set of filtering, package, sequencing and prohibitive constraints over the set of objects of interest.
  • Personalization: The more information the system has about the user, the better the recommendations it is able to provide. However, it is known that personalization is suitable only for active users and special types of queries. Moreover, personalization narrows the view of users and brings important privacy issues.
  • Metrics: Evaluating ICEs is a challenge because it involves highly subjective answers. Existing evaluation metrics are not powerful enough to measure the quality of ICEs. In particular, different metrics may be required for different domains.
  • Other challenges: Other challenges include trust (from both the system and user side), explainability, and user feedback gathering. 
The paper finishes with a short case study describing CourseRank, a social site where Stanford students can review courses and plan their academic program. The site provides course recommendations considering university requirements, among other services. The authors' current research has been focused on formalizing a language for the personalization of recommendations, providing consumption plans based on sequencing information, and using tags to create a unified interface to browse and receive search results and recommendations.

I did not like this paper very much. In fact, I was expecting better examples and concepts related to the convergence of the aspects covered in the paper. Also, I believe that social media plays a key role in the integration of search, recommendation, and advertising. However, the authors do not emphasize the role of social media in the paper.

Link: http://ilpubs.stanford.edu:8090/963/

quinta-feira, 12 de julho de 2012

Everyone's an influencer: Quantifying Influence on Twitter

Another paper on influence in social networks using Twitter data. The first author of this paper wrote it while he was a Ph.D. student at the University of Michigan. The other authors, including the legendary Duncan Watts, were at Yahoo! Research at the time, though most of them may have already left Y!R by now. In this paper, the authors conduct several experiments in order to understand and model the role played by users, especially the so-called influentials, and by content in the generation of information cascades on Twitter.

An information cascade is a popular representation of the diffusion of a given piece of content. It describes the chain of propagation from the seed to the leaf users. A directed edge (u,v) in the cascade means that there is a path from v to u (e.g., v follows u). The definition of influence used in the paper is the logarithm of the sum of the sizes of the cascades seeded by a given user. The type of content considered is URLs, more specifically those shortened using bit.ly.

All the analyses are based on a Twitter dataset containing 87M tweets and 74M diffusion events - I suppose this means the number of (URL, seed) pairs - crawled over a 2-month time frame, and also a follower network collected using the users that took part in at least one event as seeds.

In case the cascade reaches an ambiguous step - a user who published a given content has two neighbors who also published it - the authors considered three possibilities:

  • First influence: The first publisher takes the credit
  • Last influence: The most recent (last) publisher takes the credit
  • Split credit: The credit is divided equally between the publishers.
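A minimal sketch of the three credit-assignment rules, assuming each prior publisher is a (user, timestamp) pair; the data below is invented for illustration.

```python
def assign_credit(prior_posts, rule):
    """prior_posts: list of (user, timestamp) pairs for the users the adopter
    follows who posted the content earlier. Returns {user: credit}."""
    if rule == "first":
        return {min(prior_posts, key=lambda p: p[1])[0]: 1.0}
    if rule == "last":
        return {max(prior_posts, key=lambda p: p[1])[0]: 1.0}
    if rule == "split":
        share = 1.0 / len(prior_posts)
        return {user: share for user, _ in prior_posts}
    raise ValueError(rule)

# Hypothetical ambiguous step: the adopter follows alice and bob,
# who posted the URL at times 10 and 25, respectively.
posts = [("alice", 10), ("bob", 25)]
```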

In practice, the three strategies achieved similar results, but it is good to know them anyway.

It is interesting to notice that most of these papers on propagation/diffusion on Twitter consider URLs and hashtags as the types of content to be propagated. The authors discuss some of the implications of this choice. In particular, URLs are more inclusive than other information items such as tweets; the same can be said about hashtags. In particular, they discuss the potential effect of other sources of correlation, especially homophily, on the occurrence of cascades. In fact, the first author worked on this problem later, and I have already summarized his paper here.

Examples of cascades are shown in the following figure:


Cascade sizes seem to follow a power-law distribution, and the cascade depth distribution looks like an exponential.

In order to model influence, the authors propose the use of a regression tree model. Regression tree models are an elegant way to integrate several simple predictors into a tree whose branches are based on the values of variables. This property is expected to handle cascades with varied sizes well. The attributes of the regression model are the following:

  • User attributes (log-transformed)
    • # followers
    • # friends
    • # tweets
    • date of joining
  • Past influence (in the first month):
    • avg, min, and max total influence
    • avg, min, and max local (immediate followers) influence.
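To illustrate how such a fitted tree is read at prediction time, here is a hypothetical sketch: internal nodes test log-transformed features, and leaves hold the log of the predicted mean cascade size. All thresholds and leaf values are invented; the actual fitted tree is shown in the paper's figure.

```python
import math

def predict_log_cascade(past_influence, followers):
    """Walk a hypothetical two-level regression tree over log-transformed
    features and return the log of the predicted mean cascade size."""
    if math.log1p(past_influence) < 1.0:      # low past influence branch
        return 0.1 if math.log1p(followers) < 5.0 else 0.3
    return 0.8 if math.log1p(followers) < 8.0 else 1.5  # high-influence branch
```

Even this toy version reflects the paper's finding: past influence and follower count are the features that move the prediction.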

Data from the first month are used in the generation of features to predict influence in the second month. Here is the regression tree model found. Conditions indicate partitions of the feature space (left if the condition holds and right otherwise) and leaf nodes give the logarithm of the predicted mean cascade size.


Past influence and number of followers are the only relevant features. Moreover, while the model works well for average values, which are very low in general, it does not predict specific cascade sizes well.

Further, the authors study the effect of content on information diffusion. URLs were stratified into groups according to their cascade sizes, and a sample from each group was selected uniformly at random. These samples were evaluated in terms of several criteria by humans from Mechanical Turk (it sounds funny). This figure shows average cascade sizes for different URL types and categories:


They also found that URLs classified as interesting and associated with positive feelings tend to generate larger cascades. However, adding the content attributes to the regression model did not improve it.

In the final (and most interesting) part of the evaluation, the authors tried to answer a very challenging question: if you are a marketing professional and want to hire some twitterers to spread your content, which users would you choose? Intuitively, two opposing strategies would be selecting many "low-rank" users or a few "star" users to do the job. In fact, the solution is likely to be a trade-off between these strategies. The authors defined a cost function as follows:

c_i = c_a + f_i * c_f

where c_a is a fixed cost per user, c_f is a cost per follower, and f_i is the number of followers of the user u_i. The value of c_a is considered to be of the form alpha*c_f. Therefore, the value of alpha defines the trade-off between selecting many less influential users or a few more influential ones. If alpha is small, hiring many less influential users becomes more attractive. On the other hand, if alpha is large, then selecting a few influential users tends to be the better strategy. They binned users according to their predicted influence given by the regression model and computed the mean actual influence / cost for each group. The results found are the following:


If alpha = 0, then selecting users with small mean predicted influence is the best choice. As alpha is increased, selecting influential users becomes more attractive, as expected. However, as the authors point out, the transition between these two regimes is slow. This evidence challenges the widely accepted belief that selecting a small set of influentials is a good strategy for viral marketing campaigns.
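The cost trade-off is easy to simulate with made-up numbers. Here, a hypothetical influence_per_cost helper shows how alpha flips the influence-per-dollar ranking between an ordinary user and a star:

```python
# Cost model from the paper: c_i = c_a + f_i * c_f, with c_a = alpha * c_f.
# User figures below are invented for illustration only.
users = [
    {"name": "ordinary", "followers": 100,       "influence": 2.0},
    {"name": "star",     "followers": 1_000_000, "influence": 5_000.0},
]

def influence_per_cost(user, alpha, c_f=0.01):
    cost = alpha * c_f + user["followers"] * c_f  # c_i = c_a + f_i * c_f
    return user["influence"] / cost

# With a negligible per-user fixed cost, ordinary users win per dollar...
low = {u["name"]: influence_per_cost(u, alpha=0) for u in users}
# ...while with a large fixed cost, the star becomes the better buy.
high = {u["name"]: influence_per_cost(u, alpha=10_000_000) for u in users}
print(low, high)
```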

This is the second paper authored by Eytan Bakshy that has been summarized in this blog. Both papers try to answer fundamental questions about social influence using simple but sound empirical analyses and comprehensive datasets. In particular, I liked this paper more than the other one because the question here is clearer. Moreover, reading the introduction of this paper was delightful (great work!). One question about the last part of the evaluation (ordinary users vs. influentials) is: if the model does not work and the overall average size of cascades is small, does this analysis make sense? I mean, after thinking about this analysis a little bit, I realized that the results obtained were highly expected. However, a better model may support a more conclusive analysis. I would like to see the results of an optimal model - one that always predicts the actual influence - in order to fill this gap.

Link: research.yahoo.com/files/wsdm333w-bakshy.pdf

terça-feira, 10 de julho de 2012

Finding Trendsetters in Information Networks

I found this paper to be very related to the research project I've been working on for the past few months. It is still unpublished, but the authors have already made it available, which is great. This research seems to have been developed inside my department (DCC-UFMG) and also involved people from Universitat Pompeu Fabra (Spain), Yahoo! Research, and Universidade Federal de Ouro Preto (Brazil). One of the authors is Ricardo Baeza-Yates, who is pretty famous in the Information Retrieval research community.

The problem studied in this research was identifying trendsetters, which are those people who adopt and spread new ideas before they become popular, in information networks such as Twitter. Given the graph induced by a topic, the authors propose the generation of a new graph where the edges are based on the influence between users expressed in the data. A PageRank based algorithm over this influence graph gives a score to every user based on his capacity of: (1) adopting the trend (or topic) early and (2) spreading the topic through his followers.

The idea of identifying trendsetters is supported by previous studies on influence. In particular, the authors describe two very interesting concepts: follower hubs and innovative hubs. Follower hubs have a broad audience but also present a high threshold for the adoption of new ideas. Therefore, these hubs are influencers. On the other hand, innovative hubs are less popular and have a lower adoption threshold, but are responsible for spreading new ideas through the network (i.e., they are trendsetters). Though this idea is somewhat appealing, I think it does not describe the information diffusion process very well (more on this later).

The definition of a topic applied in this work is very simple. In fact, they do not provide any automatic strategy for topic identification. A topic is defined as a set of contents (hashtags, URLs, memes, etc.). Each topic induces a graph that contains the users who spread it and the edges among them. Based on this graph, the two following functions are proposed:


The function s_1(v)_i identifies the users who spread the topic i. Moreover, the function s_2(u,v)_i defines whether the user v has influenced the user u. In this case, there must be an edge from v to u in the graph. The function s_2(u,v)_i (0 <= s_2 <= 1) determines an exponential decay of the influence score with time, and the parameter alpha allows the analyst to define the time span of the analysis (short- or long-term influence). The influence of a user v over u is defined as follows:

where the symbol '.' means the scalar product, ||x|| is the Euclidean norm of x, and L is the number of non-zero positions in the vector s_2. Influence values are normalized according to the following expression:


Based on the influence scores given by the function defined above, the Trendsetter (TS) score of vertices for a specific topic k is given as:


This formulation is very similar to the PageRank equation, including the use of a damping factor d. The following figure compares the time when the top TS user posted a message about the Iran election against the time when the top PageRank (PR) user did.


The idea is that top TS users are early adopters while top PR users engage on the topic much later.
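The exact TS equation and influence functions appear only as figures, so the following is just a generic sketch of the two ingredients described above: an influence weight that decays exponentially with adoption delay (an assumed form of s_2), normalized per influenced user, followed by a damped PageRank-style power iteration over the resulting graph. All names, delays, and the decay form are assumptions.

```python
import math

# Hypothetical influence graph: edges[(v, u)] holds the delays (in days)
# between v's posts and u's adoptions of the topic.
edges = {
    ("early", "mid"):  [1.0],
    ("early", "late"): [2.0],
    ("mid", "late"):   [1.0],
}

def influence_weights(edges, alpha=0.5):
    """Time-decayed influence of v over u, normalized per influenced user u."""
    raw = {e: sum(math.exp(-alpha * d) for d in delays)
           for e, delays in edges.items()}
    totals = {}
    for (v, u), w in raw.items():
        totals[u] = totals.get(u, 0.0) + w
    return {(v, u): w / totals[u] for (v, u), w in raw.items()}

def ts_scores(nodes, W, d=0.85, iters=100):
    """Damped PageRank-style iteration: rank flows from influenced to influencer."""
    score = {v: 1.0 / len(nodes) for v in nodes}
    for _ in range(iters):
        score = {v: (1 - d) / len(nodes)
                    + d * sum(w * score[u] for (src, u), w in W.items() if src == v)
                 for v in nodes}
    return score

W = influence_weights(edges)
scores = ts_scores(["early", "mid", "late"], W)
print(scores)
```

As expected, the early adopter who influenced the others ends up with the highest score.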

The evaluation of the TS ranking technique is performed using a large (and old) Twitter dataset from 2009, containing about 1.6B tweets and the complete follower network. The topics considered are based on the most frequent hashtags, classified in a previous work. Each frequent hashtag (370 in total) defines a topic. Moreover, new hashtags were added to topics based on their co-occurrence with frequent hashtags.

The first part of the evaluation section is dedicated to evaluating how top TS users are early adopters compared to the top PR and ID (in-degree) users. Next, the authors group topics based on their popularity shapes using the KSC algorithm and show how early the top TS users adopt the topics for different popularity curve shapes. It is also shown that the influence ratio, which is the fraction of one's followers who have been influenced at least once by him, of the top TS users is higher than that of top PR and ID users. When the rankings given by TS, PR, ID, and EA (early adopters) are compared, it is possible to notice that TS gives rankings that are distinct from the ones given by the other strategies. The authors argue that TS strikes a nice balance between them. Finally, the authors show that top TS users may be identified using a small amount (e.g., 10%) of the initial data, which is not true for PR and ID.
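The influence ratio is simple to compute; here is a tiny sketch with made-up user names and adoption records:

```python
def influence_ratio(followers, influenced_events):
    """Fraction of a user's followers influenced by him at least once.

    influenced_events: list of (follower, topic) adoption records.
    """
    influenced = {f for f, _ in influenced_events if f in followers}
    return len(influenced) / len(followers)

followers = {"a", "b", "c", "d"}
events = [("a", "#iranelection"), ("a", "#fb"), ("c", "#fb"), ("x", "#fb")]
# "a" counts once despite two adoptions; "x" is not a follower and is ignored.
print(influence_ratio(followers, events))
```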

Reading this paper was interesting. I appreciated some parts of it a lot, especially the related work section, which included some references that were not familiar to me. However, the point that attracted my attention the most was how the authors partially supported their technique and findings using the definition of follower and innovative hubs, because it changes the role played by influentials in the diffusion process. First, it basically assumes that influentials do not produce new content, and I can find many counterexamples to that (e.g., CNN, Perez Hilton). Moreover, based on the results found, influentials arrive too late in the process (sometimes after the peak of popularity). Well, I would hardly call someone like this an influential in the first place. But whenever these "influentials" are not engaged in the diffusion of a given piece of information, I would expect them to arrive around the popularity peak like any other user. The fact that they used hashtags as data makes things even more complex, because the space of hashtags is small and, thus, some popular hashtags are expected to be introduced by a random user. But these users are not necessarily key in information diffusion. I would like to see a similar study using another type of content in order to make things clearer.

Link: http://www.decom.ufop.br/fabricio/download/KDDrt614-Saez-Trumper.pdf

terça-feira, 3 de julho de 2012

TWITOBI: A Recommendation System for Twitter Using Probabilistic Modeling

I've been reading this paper for quite a long time. It is related to a project I'm working on about the relevance and influence of content in diffusion data. The reason it took me so long to read this paper is that I decided to learn Expectation-Maximization first. In fact, learning EM was more worthwhile than reading this paper, though it is not all bad.

The authors of this paper are from Seoul National University and I've never heard of them before. The problem studied in the paper is recommending tweets and users to follow on Twitter. The approach used to solve the problem is probabilistic modelling.

I won't discuss the relevance of the problem here because it seems pretty relevant to me.

The following dataset illustrates the problem studied in the paper:


Given training data with users, followees, and tweets, the problem consists of recommending new tweets and followees to them.

The probabilistic model proposed by the authors is based on observed and unobserved data. Observed data are the set of users (U), tweets (T), and words (W). The function n(t,w) gives the number of occurrences of the word w in the tweet t. The sets T_u, O_u, and I_u contain the tweets, followees, and followers of u. Unobserved data (or hidden variables) are the topic z of each tweet t in T. The set of topics is denoted by Z = {z_1, z_2, ... z_n}. 


The model is based on the following parameters:

p(w|z) - Probability that a tweet with topic z has the term w.
p(z|u) - Probability that the user u selects the topic z.
p(f|u) - Probability that the user u selects a topic influenced by his followee f.


The following figure illustrates how tweets are created according to the model:



The algorithm for generating data is as follows:

1) User chooses whether he is going to select a topic by himself (with probability alpha) or a topic from one of his followees (with probability 1-alpha).
     a) In case he chooses to select a topic by himself, it is done according to p(z|u).
     b) Otherwise, he selects one of his followees with probability p(f|u) and one of his followee's topics based on p(z|f).
2) Given the selected topic z, a word w from z is selected according to p(w|z).
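The generative steps above can be simulated directly. This toy sketch uses made-up parameters (two topics, one followee) just to illustrate the sampling process:

```python
import random

random.seed(1)
alpha = 0.7  # probability the user picks a topic by himself (assumed value)

p_z_given = {            # p(z|u): each user's own topic preferences
    "u": {"sports": 0.9, "tech": 0.1},
    "f": {"sports": 0.1, "tech": 0.9},
}
p_f_given_u = {"u": {"f": 1.0}}  # p(f|u): u's only followee is f
p_w_given_z = {                  # p(w|z): words per topic
    "sports": {"goal": 0.8, "code": 0.2},
    "tech":   {"goal": 0.2, "code": 0.8},
}

def draw(dist):
    """Sample a key from a {item: probability} distribution."""
    r, acc = random.random(), 0.0
    for item, p in dist.items():
        acc += p
        if r < acc:
            return item
    return item

def generate_word(user):
    if random.random() < alpha:            # step 1a: own topic via p(z|u)
        z = draw(p_z_given[user])
    else:                                  # step 1b: followee f, then p(z|f)
        f = draw(p_f_given_u[user])
        z = draw(p_z_given[f])
    return draw(p_w_given_z[z])            # step 2: word from topic via p(w|z)

words = [generate_word("u") for _ in range(1000)]
print(words.count("goal") / len(words))
```

With these parameters, "goal" should appear in roughly 60% of samples (0.7*0.9 + 0.3*0.1 of tweets are about sports, and sports emits "goal" with probability 0.8).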

The log-likelihood of the dataset is given by:


Formulating this log-likelihood function is the hardest step in the construction of a probabilistic model like this. I think I understood all the steps, but I may need some practice in order to do this kind of modelling by myself.

It is easy to recognize the model parameters in the above formula. But it is still necessary to define the hidden variables. These variables capture the topic of tweets and whether the user chooses a topic by himself or based on his followees' topics. They are defined as follows:


p(phi=z|w,u) - Probability that a tweet is about z given that it has the word w and is from the user u.
p(theta=u|z,u) - Probability that a user u produces a tweet about z by himself.
p(theta=f|z,u) - Probability that a user u produces a tweet about z influenced by his followee f.

These hidden variables are computed based on the parameters of the model, as shown in the following figure:



In the Expectation (E) step, the hidden variables are estimated based on the current values of the parameters, which are formulated as follows:


In the Maximization (M) step, the values of the parameters that maximize the log-likelihood function are computed using Lagrange multipliers. One thing that I did not understand very well is how the above expressions are applied together with the optimization method.
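Since the paper's E and M steps are given only as figures, here is the same E/M alternation on a classic toy problem (two biased coins with unknown biases), just to illustrate the loop structure that TWITOBI's updates follow over p(w|z), p(z|u), and p(f|u):

```python
from math import comb

# Each trial: 10 flips of one of two coins; we observe only the head counts
# and never which coin was used (the hidden variable).
data = [9, 8, 9, 1, 2, 1, 8, 2]  # heads out of 10 flips per trial
theta = [0.6, 0.4]               # initial bias guesses for coins A and B

for _ in range(50):
    # E step: responsibility of each coin for each trial, given current theta.
    resp = []
    for h in data:
        like = [comb(10, h) * t**h * (1 - t)**(10 - h) for t in theta]
        s = sum(like)
        resp.append([l / s for l in like])
    # M step: re-estimate each bias from its expected head/flip counts
    # (the closed form that Lagrange multipliers yield for this model).
    theta = [
        sum(r[k] * h for r, h in zip(resp, data)) /
        sum(r[k] * 10 for r in resp)
        for k in (0, 1)
    ]

print(theta)
```

The biases converge near 0.85 and 0.15, the per-coin head rates of the two clearly separated groups of trials.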

The paper also presents a parallel algorithm that computes the parameters of the probabilistic model using map-reduce, but I will jump directly to the results instead of describing it.

The experimental evaluation applies a dataset containing 12M tweets and 8K users. Part of the data, containing 400 users, is selected as the test set. Users in the test set have at least 10 followees and 10 tweets. From these users, 10 followees and 10 tweets are selected at random. The idea is to predict these 10 followees through user recommendation and the 10 tweets through tweet recommendation. The baseline for user recommendation is a TF-IDF based method that I have already summarized here, and the baseline for tweet recommendation is also based on TF-IDF. The results show that TWITOBI outperforms the baselines in both the top-k followee and the top-k tweet recommendation tasks in terms of recall-at-k, precision-at-k, and average hit-rank. Moreover, the parallel version of the algorithm achieves linear speedup.
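For reference, precision-at-k and recall-at-k can be computed as follows (the ranked list and held-out set here are made up):

```python
def precision_recall_at_k(ranked, relevant, k):
    """Precision and recall over the top-k items of a ranked list."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k, hits / len(relevant)

ranked = ["t3", "t7", "t1", "t9", "t2", "t8"]  # recommended tweets, best first
relevant = {"t7", "t9", "t4"}                  # held-out tweets for this user
p, r = precision_recall_at_k(ranked, relevant, k=4)
print(p, r)
```

Two of the top-4 recommendations are relevant, giving precision 2/4 and recall 2/3.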

This paper studies a very interesting problem, but the solution does not seem well presented to me. In fact, I'm not convinced that such a complex approach is even necessary to achieve these results. Maybe the authors should have given more details on the proposed method instead of presenting its parallelization using map-reduce.

Link: http://www.computer.org/portal/web/csdl/doi/10.1109/ICDM.2011.150