Saturday, July 28, 2012

Finding high-quality content in social media

This work is authored by researchers from Emory University and Yahoo! Research. The research question studied here is how to find high-quality content on social media websites. In particular, the authors consider how several attributes associated with question/answering data can support the identification of high-quality questions and answers.

The main motivation for this work comes from the huge volume of user-generated content on the Web. More specifically, community-driven question/answering portals were becoming popular at the time. The authors had access to a dataset from Yahoo! Answers, a very popular question/answering platform.

The general idea of the paper is to combine a large set of attributes associated with questions, answers, and users in order to predict/classify high-quality questions and answers. These attributes can be classified into the following groups:

  • Content features: Extracted from the textual information (questions and answers). These include punctuation and typos, syntactic and semantic complexity, and grammaticality. I will not cover the specific strategy applied to extract these features here.
  • User relationships: The paper mentions two types of graphs: user-item and user-user. The user-user graphs are very intuitive: each relationship (e.g., answering a question, giving a star to a question, giving thumbs up/down to an answer) defines a graph. Link analysis algorithms (PageRank and HITS) are applied in order to find authorities/hubs in these graphs. However, I did not get how the user-item graph is employed.
  • Usage statistics: These are basically variations of the number of clicks received by a question. In particular, normalizing the number of clicks by category is very important.
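To make the link-analysis idea concrete, here is a minimal power-iteration PageRank sketch on a toy user-user graph. The edge semantics ("u answered a question by v"), the damping factor, and the graph itself are illustrative assumptions, not the paper's exact setup:

```python
# Minimal PageRank via power iteration on a toy user-user graph.
# An edge (u, v) hypothetically means "user u answered a question by v";
# the graph and the damping factor d are illustrative choices.

def pagerank(edges, n, d=0.85, iters=50):
    out_deg = [0] * n
    for u, v in edges:
        out_deg[u] += 1
    rank = [1.0 / n] * n
    for _ in range(iters):
        new = [(1 - d) / n] * n
        for u, v in edges:
            new[v] += d * rank[u] / out_deg[u]
        # Mass from dangling nodes (no out-links) is spread uniformly.
        dangling = sum(rank[u] for u in range(n) if out_deg[u] == 0)
        new = [r + d * dangling / n for r in new]
        rank = new
    return rank

# Users 1 and 2 both answer user 0's questions; user 0 answers user 1's.
ranks = pagerank([(1, 0), (2, 0), (0, 1)], n=3)
print(max(range(3), key=lambda i: ranks[i]))  # user 0 accumulates the most authority
```

Users who attract many answers from well-connected users end up with high rank, which is exactly the "authority" signal the paper feeds into its classifier.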
Given these attributes, the authors propose a binary classification problem that consists of distinguishing high-quality content from the rest. They evaluated several classifiers and found that the best performance was achieved by a technique known as stochastic gradient boosted trees, which is, roughly speaking, an ensemble of decision trees trained on random subsamples of the data.
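The classification setup can be sketched with scikit-learn's gradient boosted trees, where setting subsample below 1.0 gives the stochastic variant. The two toy features and the labels below are made up for illustration; they merely stand in for the paper's content/user/usage attributes:

```python
# Binary classification sketch with (stochastic) gradient boosted trees.
# The features and labels are synthetic stand-ins, not the paper's data.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.RandomState(0)
X = rng.rand(200, 2)                       # two toy feature columns
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)  # 1 = "high quality"

# subsample=0.8 fits each tree on a random 80% of the data (the
# "stochastic" part of stochastic gradient boosting).
clf = GradientBoostingClassifier(n_estimators=100, subsample=0.8,
                                 random_state=0)
clf.fit(X, y)
print(clf.score(X, y))  # training accuracy on this easily separable toy data
```

The subsampling both speeds up training and acts as a regularizer, which is one reason this family of models performs well on noisy, heterogeneous feature sets like the one used here.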

The question/answering data can be seen as a set of interactions among users, questions, and answers, as shown in the following schema.


Moreover, answer and question features can be seen as paths in a tree, as follows:



Another set of features proposed by the authors is related to the relationship between the text of a question and its answer. This set includes the Kullback-Leibler divergence between the language models of the question and the answer, their non-stopword overlap, and the ratio between their lengths.
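These three features can be sketched as follows, using smoothed unigram language models. The smoothing scheme (additive, with a hypothetical alpha) and the tiny stopword list are my own illustrative choices, not the paper's exact ones:

```python
# Question/answer relationship features: KL divergence between smoothed
# unigram language models, non-stopword overlap, and length ratio.
# Smoothing scheme and stopword list are illustrative assumptions.
import math
from collections import Counter

STOPWORDS = {"the", "a", "an", "is", "to", "of", "in", "for"}  # tiny toy list

def kl_divergence(question, answer, alpha=0.01):
    q, a = Counter(question.lower().split()), Counter(answer.lower().split())
    vocab = set(q) | set(a)
    qn, an = sum(q.values()), sum(a.values())
    kl = 0.0
    for w in vocab:
        p = (q[w] + alpha) / (qn + alpha * len(vocab))  # question LM
        r = (a[w] + alpha) / (an + alpha * len(vocab))  # answer LM
        kl += p * math.log(p / r)
    return kl

def overlap_and_ratio(question, answer):
    qw = set(question.lower().split()) - STOPWORDS
    aw = set(answer.lower().split()) - STOPWORDS
    overlap = len(qw & aw) / max(len(qw | aw), 1)     # Jaccard on non-stopwords
    ratio = len(question.split()) / max(len(answer.split()), 1)
    return overlap, ratio

q = "how do I boil an egg"
good = "boil the egg in water for ten minutes"
off_topic = "buy a new phone tomorrow"
print(kl_divergence(q, good) < kl_divergence(q, off_topic))
```

Intuitively, a relevant answer shares vocabulary with the question, so its language model is closer to the question's and the KL divergence is lower.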


The question/answering dataset used in this work contains 6,665 questions and 8,366 answers, each labeled as high-quality content or not by human editors. Further, this dataset was extended to a larger set by adding the users whose answers or questions were part of the original dataset, together with all their questions and respective answers.

A characterization of the dataset shows that the user interaction graphs have highly skewed degree distributions. Furthermore, it is shown that users act as both askers and answerers (i.e., there are no fixed roles).


The following table shows the accuracy of the classifier on the question classification task using different sets of features and their combinations. The baseline uses only n-grams, while the intrinsic set combines several textual (content) features. The evaluation metrics considered are precision (P), recall (R), and the area under the ROC curve (AUC).


The most significant features in question classification identified using a chi-squared test are:

  1. Number of stars received by the asker
  2. Punctuation density in the question's subject
  3. Question's category
  4. Number of clicks received by the question (normalized by category)
  5. Avg number of thumbs up received by questions written by the asker
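The chi-squared ranking the authors use can be sketched for a single binarized feature against the high-quality label: build a 2x2 contingency table and compare observed counts with the counts expected under independence. The toy counts below are made up:

```python
# Chi-squared statistic for one binarized feature vs. the binary
# "high quality" label, from a 2x2 contingency table (toy counts).

def chi_squared(table):
    """table[i][j]: count of examples with feature=i and label=j."""
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(table[i][j] for i in range(2)) for j in range(2)]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            stat += (table[i][j] - expected) ** 2 / expected
    return stat

# A feature strongly associated with the label vs. a nearly independent one.
informative = [[40, 10], [10, 40]]
uninformative = [[25, 25], [24, 26]]
print(chi_squared(informative) > chi_squared(uninformative))  # True
```

Ranking features by this statistic surfaces the ones whose values are least plausibly independent of the quality label, which is how lists like the one above are obtained.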
Next, we show the effectiveness of the answer classification method.

Top features for answer classification are:


  1. Answer length
  2. Number of words in the answer with a frequency larger than c in the dataset
  3. (Number of thumbs up minus number of thumbs down) received by the answerer, divided by the total number of thumbs received
My opinion of this paper is negative. I think the results are not reproducible due to the complexity of the techniques used. Some of the attributes considered are too specific and were applied in an ad hoc fashion. Finally, I could not draw many relevant lessons/conclusions from the results presented.


Link: http://www.mathcs.emory.edu/~eugene/papers/wsdm2008quality.pdf
