Tuesday, July 10, 2012

Finding Trendsetters in Information Networks

This paper is closely related to the research project I have been working on for the past few months. It is still unpublished, but the authors have already made it available, which is great. This research seems to have been developed inside my department (DCC-UFMG) and also involved people from Universitat Pompeu Fabra (Spain), Yahoo! Research, and Universidade Federal de Ouro Preto (Brazil). One of the authors is Ricardo Baeza-Yates, who is pretty famous in the Information Retrieval research community.

The problem studied in this research is identifying trendsetters, the people who adopt and spread new ideas before they become popular, in information networks such as Twitter. Given the graph induced by a topic, the authors propose generating a new graph whose edges are based on the influence between users expressed in the data. A PageRank-based algorithm over this influence graph scores every user by their capacity to (1) adopt the trend (or topic) early and (2) spread the topic to their followers.

The idea of identifying trendsetters is supported by previous studies on influence. In particular, the authors describe two very interesting concepts: follower hubs and innovative hubs. Follower hubs have a broad audience but also a high threshold for the adoption of new ideas. Therefore, these hubs are influencers. Innovative hubs, on the other hand, are less popular and have a lower adoption threshold, but they are responsible for spreading new ideas through the network (i.e., they are trendsetters). Though this idea is somewhat attractive, I think it does not describe the information diffusion process very well (more on this later).

The definition of a topic used in this work is very simple. In fact, the authors do not provide any automatic strategy for topic identification. A topic is defined as a set of contents (hashtags, URLs, memes, etc.). Each topic induces a graph that contains the users who spread it and the edges among them. Based on this graph, the following two functions are proposed:


The function s_1(v)_i gives the users who spread the topic i. Moreover, the function s_2(u,v)_i defines whether the user v has influenced the user u. In this case, there must be an edge from v to u in the graph. The function s_2(u,v)_i (0 <= s_2 <= 1) determines an exponential decay of the influence score with time, and the parameter alpha lets the user define the time span of the analysis (short- or long-term influence). The influence of a user v over u is defined as follows:

where the symbol '.' means the scalar product, ||x|| is the Euclidean norm of x, and L is the number of non-zero positions in the vector s_2. Influence values are normalized according to the following expression:


Based on the influence scores given by the function defined above, the Trendsetter (TS) score of vertices for a specific topic k is given as:


This formulation is very similar to the PageRank equation, including the use of a damping factor d. A figure in the paper compares the time when the top TS user posted a message about the Iran election against the time when the top PageRank (PR) user did.


The idea is that top TS users are early adopters, while top PR users engage with the topic much later.
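To make the ranking step more concrete, here is a minimal Python sketch of the general idea. It is not the authors' implementation. It assumes that the influence of v over a follower u decays as exp(-delta/alpha), where delta is the delay between the two adoptions, that the influence weights received by each user are normalized to sum to one, and that scores come from a PageRank-style power iteration with damping factor d. The names adoption_time, followers, influence_weights, and trendsetter_scores are hypothetical, and using a single adoption time per user ignores the vector formulation over multiple contents described above.

```python
import math


def influence_weights(adoption_time, followers, alpha=1.0):
    """Return weights[u][v]: normalized influence of v over u (assumed form).

    adoption_time: dict user -> time the user first posted about the topic
    followers:     dict user v -> set of users who follow v
    """
    raw = {}
    for v, fans in followers.items():
        if v not in adoption_time:
            continue
        for u in fans:
            # v can only have influenced u if u adopted the topic after v
            if u in adoption_time and adoption_time[u] > adoption_time[v]:
                delta = adoption_time[u] - adoption_time[v]
                raw.setdefault(u, {})[v] = math.exp(-delta / alpha)
    # Normalize so that the influence received by each user sums to one
    weights = {}
    for u, received in raw.items():
        total = sum(received.values())
        weights[u] = {v: w / total for v, w in received.items()}
    return weights


def trendsetter_scores(adoption_time, followers, alpha=1.0, d=0.85, iterations=50):
    """PageRank-style scores: each influenced user passes score to its influencers."""
    users = sorted(adoption_time)
    n = len(users)
    weights = influence_weights(adoption_time, followers, alpha)
    scores = {u: 1.0 / n for u in users}
    for _ in range(iterations):
        new = {u: (1.0 - d) / n for u in users}
        for u in users:
            received = weights.get(u)
            if received:
                for v, w in received.items():
                    new[v] += d * scores[u] * w
            else:
                # u was influenced by nobody: spread its score uniformly,
                # the usual PageRank treatment of dangling nodes
                share = d * scores[u] / n
                for x in users:
                    new[x] += share
        scores = new
    return scores


if __name__ == "__main__":
    # Toy data: "hub" has many followers but adopts the topic last
    times = {"a": 1.0, "b": 2.0, "c": 3.0, "hub": 10.0}
    follows = {"a": {"b", "c", "hub"}, "b": {"c"}, "hub": {"a", "b", "c"}}
    for user, score in sorted(trendsetter_scores(times, follows).items()):
        print(user, round(score, 3))
```

In this toy example the early adopter "a" ends up ranked above the late-arriving "hub", even though "hub" has just as many followers, which is the kind of contrast between TS and follower-count-based rankings that the paper reports.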

The evaluation of the TS ranking technique is performed on a large (and old) Twitter dataset from 2009 containing about 1.6B tweets and the complete following network. The topics considered are based on the most frequent hashtags, which were classified in previous work. Each frequent hashtag (370 in total) defines a topic. Moreover, new hashtags were added to topics based on co-occurrence with a frequent hashtag.

The first part of the evaluation section is dedicated to evaluating how early the top TS users adopt topics compared to the top PR and ID (in-degree) users. Next, the authors group topics based on their popularity shapes using the KSC algorithm and show how early the top TS users adopt the topics for different popularity curve shapes. It is also shown that the influence ratio, which is the fraction of a user's followers who have been influenced by them at least once, is higher for the top TS users than for the top PR and ID users. When the rankings given by TS, PR, ID, and EA (early adopters) are compared, it is possible to notice that TS gives rankings that are distinct from the ones given by the other strategies. The authors argue that TS gives a nice balance between them. Finally, the authors show that top TS users may be identified using a small amount (e.g., 10%) of the initial data, which is not true for PR and ID.
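As an illustration of the influence ratio mentioned above, here is a small Python sketch. It assumes that a follower counts as "influenced" when they adopt some topic strictly after the user they follow did. That is my reading of the definition, not necessarily the exact criterion used in the paper. The function name influence_ratio and the data layout are hypothetical.

```python
def influence_ratio(user, followers, adoptions_by_topic):
    """Fraction of `user`'s followers influenced by them at least once.

    followers:          dict user -> set of users who follow them
    adoptions_by_topic: dict topic -> dict user -> adoption time
    """
    fans = followers.get(user, set())
    if not fans:
        return 0.0
    influenced = set()
    for times in adoptions_by_topic.values():
        t_user = times.get(user)
        if t_user is None:
            continue  # the user never posted about this topic
        for fan in fans:
            t_fan = times.get(fan)
            # Assumed criterion: the follower adopted the topic later
            if t_fan is not None and t_fan > t_user:
                influenced.add(fan)
    return len(influenced) / len(fans)
```

For example, a user with four followers, two of whom posted about some topic after the user did, would get an influence ratio of 0.5 under this assumption.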

Reading this paper was interesting. I really appreciated some parts of it, especially the related work section, which included some references that were not familiar to me. However, the point that caught my attention the most was how the authors partially supported their technique and findings using the definition of follower and innovative hubs, because it changes the role played by influentials in the diffusion process. First, it basically assumes that influentials do not produce new content, and I can find many counterexamples to that (e.g., CNN, Perez Hilton). Moreover, based on the results found, influentials arrive too late in the process (sometimes after the peak of popularity). Well, I would hardly call someone like this an influential in the first place. But whenever these "influentials" are not engaged in the diffusion of a given piece of information, I would expect them to arrive around the popularity peak, like any other user. The fact that they used hashtags as data makes things even more complex, because the space of hashtags is small and so some popular hashtags are expected to be introduced by a random user. But these users are not necessarily key to information diffusion. I would like to see a similar study using other types of content in order to make things clearer.

Link: http://www.decom.ufop.br/fabricio/download/KDDrt614-Saez-Trumper.pdf
