Tuesday, July 3, 2012

TWITOBI: A Recommendation System for Twitter Using Probabilistic Modeling

I've been reading this paper for quite a long time. It is related to a project I'm working on about the relevance and influence of content in diffusion data. The reason it took me so long to read is that I decided to learn Expectation-Maximization first. In fact, learning EM was more worthwhile than reading the paper itself, though the paper is not all bad.

The authors of this paper are from Seoul National University, and I had never heard of them before. The problem studied in the paper is recommending tweets and users to follow on Twitter, and the approach used to solve it is probabilistic modeling.

I won't discuss the relevance of the problem here because it seems pretty relevant to me.

The problem studied in the paper can be stated as follows: given training data with users, their followees, and their tweets, the goal is to recommend new tweets and new followees to each user.

The probabilistic model proposed by the authors is based on observed and unobserved data. The observed data are the set of users (U), tweets (T), and words (W). The function n(t,w) gives the number of occurrences of the word w in the tweet t, and the sets T_u, O_u, and I_u contain the tweets, followees, and followers of the user u, respectively. The unobserved data (or hidden variables) are the topic z of each tweet t in T, where the set of topics is denoted by Z = {z_1, z_2, ..., z_n}.
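
To make the notation concrete, here is a minimal sketch in Python of how this observed data could be represented. The toy users, tweets, and words are made up by me; only the names mirror the paper's notation.

from collections import Counter

# Each tweet is reduced to its bag of words, i.e. the counts n(t, w).
tweets = {
    "t1": Counter({"soccer": 2, "goal": 1}),   # n(t1, "soccer") = 2
    "t2": Counter({"election": 1, "vote": 1}),
}
T_u = {"alice": ["t1"], "bob": ["t2"]}   # tweets of each user
O_u = {"alice": ["bob"], "bob": []}      # followees of each user
I_u = {"alice": [], "bob": ["alice"]}    # followers of each user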


The model is based on the following parameters:

p(w|z) - Probability that a tweet with topic z contains the term w.
p(z|u) - Probability that the user u selects the topic z.
p(f|u) - Probability that the user u selects a topic influenced by his followee f.

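Each parameter is a conditional distribution, so in a table representation every row must sum to one. The sketch below is my own illustration, not code from the paper; in practice p(f|u) would be nonzero only for the actual followees of u.

import numpy as np

rng = np.random.default_rng(0)
n_users, n_topics, n_words = 4, 3, 10

def normalize(m):
    # Turn each row into a probability distribution.
    return m / m.sum(axis=1, keepdims=True)

p_z_given_u = normalize(rng.random((n_users, n_topics)))   # p(z|u)
p_f_given_u = normalize(rng.random((n_users, n_users)))    # p(f|u)
p_w_given_z = normalize(rng.random((n_topics, n_words)))   # p(w|z)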

Tweets are created according to the following generative process (a small code sketch follows the list):

1) The user chooses whether to select a topic by himself (with probability alpha) or to take a topic from one of his followees (with probability 1 - alpha).
     a) If he selects a topic by himself, it is chosen according to p(z|u).
     b) Otherwise, he selects one of his followees with probability p(f|u) and then one of that followee's topics according to p(z|f).
2) Given the selected topic z, a word w is selected according to p(w|z).
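
Here is the sketch of the generative process promised above. It is my own toy implementation on top of the arrays from the previous snippet; in particular, falling back to self-selection when a user has no followees is my assumption, not something stated in the paper.

import numpy as np

rng = np.random.default_rng(1)

def generate_word(u, followees, alpha, p_z_given_u, p_f_given_u, p_w_given_z):
    # Sample one word for user u following the two steps above.
    if rng.random() < alpha or not followees:
        # 1a) The user selects a topic by himself, according to p(z|u).
        z = rng.choice(p_z_given_u.shape[1], p=p_z_given_u[u])
    else:
        # 1b) He selects a followee f with probability p(f|u), restricted
        #     to his actual followees, then a topic z according to p(z|f).
        weights = p_f_given_u[u, followees]
        f = rng.choice(followees, p=weights / weights.sum())
        z = rng.choice(p_z_given_u.shape[1], p=p_z_given_u[f])
    # 2) Given the selected topic z, emit a word w according to p(w|z).
    return rng.choice(p_w_given_z.shape[1], p=p_w_given_z[z])

For example, generate_word(0, [1, 2], 0.8, p_z_given_u, p_f_given_u, p_w_given_z) returns the index of one sampled word for user 0.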

The log-likelihood of the dataset is given by:

L = sum_{u in U} sum_{t in T_u} sum_{w in t} n(t,w) log[ alpha sum_z p(z|u) p(w|z) + (1 - alpha) sum_{f in O_u} p(f|u) sum_z p(z|f) p(w|z) ]

Formulating this log-likelihood function is the hardest step in building a probabilistic model like this. I think I understood all the steps, but I would need some practice before doing this kind of modeling by myself.
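
As a sanity check on my reading of the formula, here is a direct (and deliberately naive) transcription into Python. It is my own sketch; word_counts[u] aggregates n(t,w) over all tweets of u, which is equivalent since the formula only sums the counts.

import numpy as np

def log_likelihood(word_counts, followees, alpha,
                   p_z_given_u, p_f_given_u, p_w_given_z):
    ll = 0.0
    for u, counts in word_counts.items():
        # Mixture over topics: the user's own choice plus followee influence,
        # i.e. alpha * p(z|u) + (1 - alpha) * sum_f p(f|u) p(z|f).
        mix = alpha * p_z_given_u[u]
        for f in followees[u]:
            mix = mix + (1 - alpha) * p_f_given_u[u, f] * p_z_given_u[f]
        for w, n in counts.items():
            ll += n * np.log(np.dot(mix, p_w_given_z[:, w]))
    return ll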

It is easy to recognize the model parameters in the formula above, but it is still necessary to define the hidden variables. These capture the topic of each tweet and whether the user chose the topic by himself or took it from one of his followees. They are defined as follows:


p(phi=z|w,u) - Probability that a tweet is about topic z given that it contains the word w and is from the user u.
p(theta=u|z,u) - Probability that the user u selected the topic z by himself.
p(theta=f|z,u) - Probability that the user u took the topic z from his followee f.

These hidden variables are computed from the current values of the parameters. In the Expectation (E) step, they are estimated by applying Bayes' rule to the generative process. Writing m(z|u) = alpha p(z|u) + (1 - alpha) sum_{f in O_u} p(f|u) p(z|f) for the mixture over topics, the estimates (as far as I can reconstruct them) are:

p(theta=u|z,u) = alpha p(z|u) / m(z|u)
p(theta=f|z,u) = (1 - alpha) p(f|u) p(z|f) / m(z|u)
p(phi=z|w,u) = m(z|u) p(w|z) / sum_{z'} m(z'|u) p(w|z')

In the Maximization (M) step, the values of the parameters that maximize the log-likelihood are computed using Lagrange multipliers, which enforce that each distribution sums to one. One thing I did not understand very well is how the expressions above are combined with this optimization method.
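
To convince myself of how the E and M steps fit together, I wrote the sketch below of one full EM iteration. This is my own derivation from the generative process, not the paper's code: the E step computes the posterior over (source, topic) for each word occurrence, and the M-step lines are the normalized expected counts that the Lagrange-multiplier solution yields.

import numpy as np

def em_step(word_counts, followees, alpha,
            p_z_given_u, p_f_given_u, p_w_given_z):
    # Expected counts accumulated during the E step.
    c_wz = np.zeros_like(p_w_given_z)   # for p(w|z)
    c_zu = np.zeros_like(p_z_given_u)   # for p(z|u)
    c_fu = np.zeros_like(p_f_given_u)   # for p(f|u)

    for u, counts in word_counts.items():
        F = followees[u]
        for w, n in counts.items():
            # E step: joint posterior over (source, topic) given (u, w).
            q_self = alpha * p_z_given_u[u] * p_w_given_z[:, w]
            q_fol = [(1 - alpha) * p_f_given_u[u, f]
                     * p_z_given_u[f] * p_w_given_z[:, w] for f in F]
            norm = q_self.sum() + sum(q.sum() for q in q_fol)
            q_self /= norm
            # Accumulate expected counts for the M step.
            c_zu[u] += n * q_self
            c_wz[:, w] += n * q_self
            for f, q in zip(F, q_fol):
                q = q / norm
                c_zu[f] += n * q        # the topic came from f's distribution
                c_wz[:, w] += n * q
                c_fu[u, f] += n * q.sum()

    # M step: renormalize the expected counts into probabilities.
    # (Assumes every user and topic receives some expected mass; users
    # without followees keep an all-zero row in p(f|u).)
    p_w_given_z = c_wz / c_wz.sum(axis=1, keepdims=True)
    p_z_given_u = c_zu / c_zu.sum(axis=1, keepdims=True)
    row = c_fu.sum(axis=1, keepdims=True)
    p_f_given_u = np.divide(c_fu, row, out=np.zeros_like(c_fu), where=row > 0)
    return p_z_given_u, p_f_given_u, p_w_given_z

Note that the update for p(z|u) receives counts both from u's own tweets and from followers who copied a topic from u; this coupling between users is what makes the optimization less obvious than in plain pLSA.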

The paper also presents a parallel algorithm that computes the parameters of the probabilistic model using map-reduce, but I will jump directly to the results instead of describing it.

The experiments use a dataset containing 12M tweets from 8K users. Part of the data, containing 400 users, is selected as the test set. Users in the test set have at least 10 followees and 10 tweets, and from each of them 10 followees and 10 tweets are held out at random. The task is to predict the held-out followees through user recommendation and the held-out tweets through tweet recommendation. The baseline for user recommendation is a method based on TF-IDF that I have already summarized here, and the baseline for tweet recommendation is also based on TF-IDF. The results show that TWITOBI outperforms the baselines in both the top-k followee and the top-k tweet recommendation tasks in terms of recall-at-k, precision-at-k, and average hit-rank. Moreover, the parallel version of the algorithm achieves linear speedup.
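
For reference, these metrics are simple to compute from a ranked recommendation list. The sketch below uses what I take to be the standard definitions; the paper may define average hit-rank slightly differently.

def precision_recall_at_k(ranked, relevant, k):
    # ranked: recommendations, best first; relevant: the held-out items.
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k, hits / len(relevant)

def average_hit_rank(ranked, relevant):
    # Mean 1-based rank of the held-out items found in the ranking.
    ranks = [i + 1 for i, item in enumerate(ranked) if item in relevant]
    return sum(ranks) / len(ranks) if ranks else float("inf")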

This paper studies a very interesting problem, but the solution does not seem well presented to me. In fact, I'm not convinced that such a complex approach is even necessary to achieve these results. Maybe the authors should have given more details on the proposed method instead of presenting the parallelization using map-reduce.

Link: http://www.computer.org/portal/web/csdl/doi/10.1109/ICDM.2011.150
