The main motivation for this work comes from the huge volume of user-generated content on the Web. More specifically, community-driven question/answering portals were becoming popular at the time. The authors had access to a dataset from Yahoo! Answers, which is a very popular question/answering platform.
The general idea of the paper is to combine a large set of attributes associated with questions, answers, and users in order to classify questions and answers as high quality or not. These attributes fall into the following groups:
- Content features: Extracted from the textual information (questions and answers). They include punctuation and typos, syntactic and semantic complexity, and grammaticality. I will not cover the specific strategy applied to extract these features here.
- User relationships: The paper mentions two types of graphs: user-item and user-user. The user-user graphs are very intuitive. Each type of relationship (e.g., answering a question, giving a star to a question, giving thumbs up/down to an answer) defines a graph. Link analysis algorithms (PageRank and HITS) are applied to find authorities and hubs in these graphs. However, I did not understand how the user-item graph is employed.
- Usage statistics: Basically, these are variations of the number of clicks received by the question. In particular, normalizing the number of clicks by category is very important.
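To make the link analysis step on the user-user graphs more concrete, here is a minimal PageRank sketch in plain Python. It is only an illustration under my own assumptions: the edge direction (asker to answerer), the damping factor, the number of iterations, and the user names are all hypothetical and are not taken from the paper. HITS is not covered here.

```python
from collections import defaultdict

def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a directed graph given as (source, target) pairs."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = defaultdict(list)
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread uniformly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n for node in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank

# Hypothetical "answers" graph: an edge (asker, answerer) means that the
# answerer replied to a question posted by the asker.
answer_edges = [("alice", "carol"), ("bob", "carol"), ("dave", "carol"), ("carol", "bob")]
scores = pagerank(answer_edges)
for user, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{user}: {score:.3f}")
```

In this toy graph, the user who answers the most askers ends up with the highest score, which is the kind of authority signal that could then be used as a feature.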
Given these attributes, the authors formulate a binary classification problem that consists of distinguishing high-quality content from the rest. They evaluated several classifiers and found that the best performance was achieved by a technique known as stochastic gradient boosted trees. Roughly speaking, this technique builds an ensemble of decision trees sequentially, fitting each new tree on a random subsample of the training data to correct the errors of the trees built so far.
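As a rough illustration of the idea behind stochastic gradient boosting, the sketch below boosts one-split decision trees (stumps) on the logistic loss, with each round fit on a random subsample. This is a simplified toy version and not the authors' implementation: real libraries use deeper trees and more careful leaf estimates, and the two features (answer length, stars received by the author), the data, and all hyperparameters are made up for the example.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_stump(rows, targets):
    """Best single-threshold split (by squared error) of the targets; None if no split exists."""
    best = None
    for j in range(len(rows[0])):
        for threshold in sorted({row[j] for row in rows}):
            left = [t for row, t in zip(rows, targets) if row[j] <= threshold]
            right = [t for row, t in zip(rows, targets) if row[j] > threshold]
            if not left or not right:
                continue
            left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
            error = (sum((t - left_mean) ** 2 for t in left)
                     + sum((t - right_mean) ** 2 for t in right))
            if best is None or error < best[0]:
                best = (error, j, threshold, left_mean, right_mean)
    return best[1:] if best else None

def stump_value(stump, row):
    j, threshold, left_value, right_value = stump
    return left_value if row[j] <= threshold else right_value

def fit_boosted_stumps(X, y, rounds=30, learning_rate=0.3, subsample=0.7, seed=0):
    rng = random.Random(seed)
    scores = [0.0] * len(X)  # current log-odds estimate per training example
    stumps = []
    for _ in range(rounds):
        # "Stochastic": each round is fit on a random subsample of the training set.
        sample = rng.sample(range(len(X)), max(2, int(subsample * len(X))))
        # Negative gradient of the logistic loss: label minus predicted probability.
        residuals = [y[i] - sigmoid(scores[i]) for i in sample]
        stump = fit_stump([X[i] for i in sample], residuals)
        if stump is None:
            continue
        j, threshold, left_mean, right_mean = stump
        # Shrink the leaf values by the learning rate before adding the stump.
        stump = (j, threshold, learning_rate * left_mean, learning_rate * right_mean)
        stumps.append(stump)
        scores = [s + stump_value(stump, row) for s, row in zip(scores, X)]
    return stumps

def predict_proba(stumps, row):
    return sigmoid(sum(stump_value(stump, row) for stump in stumps))

# Made-up training data: [answer length, stars received by the author]; 1 = high quality.
X = [[10, 0], [15, 1], [20, 0], [25, 1], [120, 5], [150, 4], [180, 6], [200, 7]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
model = fit_boosted_stumps(X, y)
p_low = predict_proba(model, [5, 0])
p_high = predict_proba(model, [300, 9])
```

The subsampling is what distinguishes the stochastic variant from plain gradient boosting; setting `subsample=1.0` in this sketch recovers the deterministic version.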
The question/answering data can be seen as a set of interactions among users, questions, and answers, as shown in the following schema.
Moreover, answer and question features can be seen as paths in a tree, as follows:
Another set of features proposed by the authors relates to the properties of the text of a question and its answer. This set includes the Kullback-Leibler divergence between the language models of the question and the answer, their non-stopword overlap, and the ratio between their lengths.
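The three question-answer text features above can be sketched as follows. Several details are my own assumptions, since I have not described the paper's exact definitions: unigram language models with add-one smoothing over the shared vocabulary, the divergence computed in the question-to-answer direction, overlap measured as a raw count of shared non-stopwords, length measured in tokens, and a tiny made-up stopword list.

```python
import math
from collections import Counter

# Tiny illustrative stopword list (an assumption, not the one used in the paper).
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and", "how", "do", "i", "you", "it"}
PUNCTUATION = ".,!?;:()\"'"

def tokenize(text):
    tokens = (token.strip(PUNCTUATION) for token in text.lower().split())
    return [token for token in tokens if token]

def unigram_model(tokens, vocabulary, alpha=1.0):
    """Add-alpha smoothed unigram distribution over a shared vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocabulary)
    return {word: (counts[word] + alpha) / total for word in vocabulary}

def kl_divergence(p, q):
    """KL(p || q) for two distributions defined over the same vocabulary."""
    return sum(p[word] * math.log(p[word] / q[word]) for word in p)

def qa_text_features(question, answer):
    q_tokens, a_tokens = tokenize(question), tokenize(answer)
    vocabulary = set(q_tokens) | set(a_tokens)
    q_model = unigram_model(q_tokens, vocabulary)
    a_model = unigram_model(a_tokens, vocabulary)
    q_content = set(q_tokens) - STOPWORDS
    a_content = set(a_tokens) - STOPWORDS
    return {
        "kl_question_answer": kl_divergence(q_model, a_model),
        "nonstop_overlap": len(q_content & a_content),
        "length_ratio": len(q_tokens) / max(1, len(a_tokens)),
    }

features = qa_text_features("How do I boil an egg?", "Boil the egg in water for ten minutes.")
print(features)
```

Smoothing matters here because the divergence is undefined whenever the answer model assigns zero probability to a word that appears in the question.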
The question/answering dataset applied in this work contains 6,665 questions and 8,366 answers. Questions and answers were labeled as high-quality content or not by human editors. Further, this dataset was extended to a larger set by adding users whose answers or questions were part of the original dataset together with all their questions and respective answers.
A characterization of the dataset shows that the user interaction graphs have highly skewed distributions. Furthermore, it shows that users act as both askers and answerers (i.e., there are no fixed roles).
The following table shows the performance of the classifier in the question classification task using different sets of features and their combinations. The baseline uses only n-grams, while "intrinsic" combines several textual (content) features. The evaluation metrics considered are precision (P), recall (R), and the area under the ROC curve (AUC).
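For readers less familiar with these three metrics, here is a small sketch of how they can be computed from labels and classifier scores. The labels, scores, and the 0.5 decision threshold are made up for illustration and have nothing to do with the numbers reported in the paper.

```python
def precision_recall(labels, predictions):
    """Precision and recall for binary labels and binary predictions (1 = high quality)."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Made-up example: true labels and classifier scores for six questions.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
predictions = [1 if s >= 0.5 else 0 for s in scores]

precision, recall = precision_recall(labels, predictions)
auc = roc_auc(labels, scores)
print(f"P={precision:.3f} R={recall:.3f} AUC={auc:.3f}")  # P=0.667 R=0.667 AUC=0.889
```

Precision and recall depend on the chosen threshold, while AUC summarizes the ranking quality over all thresholds, which is why it is useful to report all three.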
The most significant features in question classification identified using a chi-squared test are:
- Number of stars received by the asker
- Punctuation density in the question's subject
- Question's category
- Number of clicks received by the question (normalized by category)
- Average number of thumbs up received by questions written by the asker
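A chi-squared ranking like the one above can be sketched for the simplest case, a binary feature against the binary quality label. This is an assumption-laden illustration: numeric features such as click counts would first have to be binned, I do not know how the authors handled that, and the two example features and their values are hypothetical.

```python
def chi_squared(feature_values, labels):
    """Pearson chi-squared statistic for a binary feature against a binary label."""
    n = len(labels)
    observed = {(f, y): 0 for f in (0, 1) for y in (0, 1)}
    for f, y in zip(feature_values, labels):
        observed[(f, y)] += 1
    statistic = 0.0
    for f in (0, 1):
        for y in (0, 1):
            row_total = observed[(f, 0)] + observed[(f, 1)]
            column_total = observed[(0, y)] + observed[(1, y)]
            # Count expected in this cell if feature and label were independent.
            expected = row_total * column_total / n
            if expected > 0:
                statistic += (observed[(f, y)] - expected) ** 2 / expected
    return statistic

# Hypothetical data: 1 = high-quality question.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
asker_has_many_stars = [1, 1, 1, 0, 0, 0, 0, 1]  # mostly agrees with the label
posted_on_even_day = [1, 0, 1, 0, 1, 0, 1, 0]    # unrelated to the label

print(chi_squared(asker_has_many_stars, labels))  # 2.0
print(chi_squared(posted_on_even_day, labels))    # 0.0
```

Features are then ranked by the statistic, with larger values indicating a stronger association with the label.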
Next, we show the effectiveness of the answer classification method.
Top features for answer classification are:
- Answer length
- Number of words in the answer with a frequency larger than c in the dataset
- Number of thumbs up minus number of thumbs down received by the answerer, divided by the total number of thumbs received
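The last feature in the list is simple enough to write down directly. The function name and the handling of users with no votes are my own assumptions.

```python
def net_thumbs_ratio(thumbs_up, thumbs_down):
    """(thumbs up - thumbs down) / (thumbs up + thumbs down), in the range [-1, 1]."""
    total = thumbs_up + thumbs_down
    if total == 0:
        return 0.0  # assumption: answerers with no votes get a neutral score
    return (thumbs_up - thumbs_down) / total

print(net_thumbs_ratio(8, 2))  # 0.6
```

Dividing by the total number of thumbs keeps very active answerers from dominating the feature just because they have received more votes overall.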
My opinion on this paper is negative. I think the results are not reproducible due to the complexity of the techniques used. Some of the attributes considered are too specific and were applied in an ad-hoc fashion. Finally, I could not get many relevant lessons/conclusions from the results presented.
Link: http://www.mathcs.emory.edu/~eugene/papers/wsdm2008quality.pdf