Friday, January 20, 2012

I Tube, You Tube, Everybody Tubes: Analyzing the World's Largest User Generated Content Video System

I read this paper, written by researchers from KAIST and Telefonica Research in 2007, three years ago. It is a very strong paper (best IMC paper of that year) that presents a thorough analysis of user generated content (UGC) systems, especially YouTube. Its contributions range from an in-depth statistical analysis of video popularity to estimates of the benefits that could result if YouTube were implemented on top of a P2P protocol. While it answers a set of interesting questions using large-scale data and cool plots, the paper focuses on the popularity of videos and its effects from several relevant perspectives.

The motivation for such a study comes from the increasing popularity of UGC VoD systems and the lack of similar analyses in the literature. The main dataset is composed of around 2M videos, corresponding to 4M views, from YouTube (Entertainment and Science and Technology categories). Data from Daum, a UGC service from Korea, Netflix (ratings), Lovefilm (movie data), and Yahoo! (theater gross income) are also used.

UGC vs. non-UGC: The paper lists some important characteristics that distinguish UGC from non-UGC content. In general, UGC has a higher production rate but a similar production per user when compared to non-UGC content. As a consequence, while top UGC producers release much more content than top non-UGC ones, the average length of UGC videos is much smaller. The authors also found low user participation (e.g., comments and ratings) in UGC systems and a high correlation between the number of views and the number of ratings of a video. Regarding how users find videos, half of the videos have external links. Interestingly, these videos represent 90% of the views, but only about 3% of those views come from users arriving through external sites. I think Facebook has changed this recently.

UGC Popularity: My favorite section of the paper. The popularity distribution of videos is expected to be power-law or log-normal. Power-law distributions are a consequence of a rich-get-richer process, while the log-normal distribution is the result of a proportionate effect. The main visual characteristic of a power-law is that its plot is a straight line across several orders of magnitude on a log-log scale. An analysis of video popularity shows that it follows the Pareto principle (10% of the videos account for 90% of the views), which has strong implications in many scenarios (e.g., cache performance). Fitting the popularity distributions against some known distributions (power-law, log-normal, exponential, and power-law with exponential cutoff) shows that the tail of the curve may follow a power-law with exponential cutoff (i.e., a power-law followed by an exponential). The authors argue that the truncated tail of the curve can be generated by an aging or a fetch-at-most-once process, both from the literature, and show through simulations that fetch-at-most-once is the more likely explanation. The effect is amplified as the number of requests per user increases and the number of available videos decreases. The focus then shifts to the long tail (less popular videos), and the authors show that the number of views can be described by a Zipf with an exponential cutoff but also seems to follow a log-normal. This truncated tail may be due to the large number of uninteresting videos, sampling bias (towards interesting videos), and information filtering (e.g., search engine rankings). Finally, the paper presents an analysis of the benefits of removing the truncated tail (e.g., popularity gains of 45% for the Entertainment category).
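To get a feeling for why fetch-at-most-once truncates the head of the curve, here is a minimal sketch of my own (not the authors' simulator). It assumes that every user draws requests from a Zipf distribution with exponent 1 and that, under fetch-at-most-once, a repeated request for an already watched video is simply dropped. The function name, the parameters, and the sizes are all hypothetical.

```python
import random


def simulate_views(n_videos, n_users, requests_per_user,
                   fetch_at_most_once, seed=0):
    """Return per-video view counts, indexed by popularity rank (0 = top)."""
    rng = random.Random(seed)
    # Assumption: Zipf weights with exponent 1 (rank r gets weight 1/r).
    weights = [1.0 / rank for rank in range(1, n_videos + 1)]
    views = [0] * n_videos
    for _ in range(n_users):
        picks = rng.choices(range(n_videos), weights=weights,
                            k=requests_per_user)
        if fetch_at_most_once:
            # Assumption: a user never re-watches a video, so duplicate
            # requests from the same user are dropped.
            picks = set(picks)
        for video in picks:
            views[video] += 1
    return views


# Same seed, so both runs see exactly the same stream of requests.
repeat = simulate_views(100, 50, 20, fetch_at_most_once=False)
once = simulate_views(100, 50, 20, fetch_at_most_once=True)
print("top video, repeated fetches allowed:", repeat[0])
print("top video, fetch-at-most-once:", once[0])
```

With fetch-at-most-once no video can collect more views than there are users, so the most popular videos are capped while the unpopular ones are barely affected. Raising `requests_per_user` or lowering `n_videos` makes the cap bite harder, which matches the amplification effect described above.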

Popularity Evolution: In general, there is no strong correlation between the age of a video and the number of views it attracts, except for very young videos, which get more views than expected. Another interesting result is that most of the top videos of a given day are recent but, in the long run, the ranks of recent videos decrease. If a video does not get enough views during its early days, it is unlikely to get many requests in the future. In fact, there is a strong correlation (> 0.84) between the popularity of a video during its first two days and its future popularity. Moreover, unlike old videos, young videos may undergo significant changes in their popularity rank.
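As a rough illustration of the kind of measurement behind that number, the sketch below computes a Pearson correlation coefficient between early and later view counts. I am assuming Pearson here; I did not check which coefficient the paper actually uses. The view counts are made-up toy values, not data from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences.

    Assumes both sequences have non-zero variance.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy data: views in the first two days vs. views later on.
early_views = [5, 12, 40, 150, 900]
later_views = [30, 45, 210, 600, 5200]
print(round(pearson(early_views, later_views), 3))
```

Since view counts are heavily skewed, it may make sense to apply the same function to the logarithm of the counts, so that a handful of very popular videos does not dominate the coefficient.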

Efficient UGC Design: The authors simulate three cache schemes using the traces from YouTube. They show that a static finite cache holding long-term popular content works very well (27% of the requests are missed) and that a hybrid scheme with extra space to store the most popular videos of each day reduces the miss rate to 17%, which is a direct consequence of the skewness of the popularity distributions. Moreover, they simulate the performance of a P2P VoD system using YouTube data. Since they have the daily popularity of videos, they estimate the inter-arrival time and the number of concurrent requests for a given video. They found that 95% of the videos are expected to be requested only once in a 10-minute interval. However, assuming different scenarios where users share content while watching the video, during a session (28 min.), or for one extra hour or day after watching a video, they show that P2P can be very effective. Basically, they simulated a system where two concurrent users share content if possible and access the server otherwise. By sharing content only while watching the video, the server workload may be reduced by 41%. Furthermore, if users share content for an extra day, the server workload decreases by 98.7%.
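Both experiments are easy to approximate on a toy trace. The sketch below is my own simplification, not the paper's simulator: `static_cache_miss_rate` fills the cache with the most requested videos of the same trace it is evaluated on (an optimistic, oracle-like assumption), and `server_load` treats a request as peer-served whenever the previous request for the same video arrived within a sharing window. All names and the example values are hypothetical.

```python
from collections import Counter


def static_cache_miss_rate(trace, capacity):
    """Miss rate of a static cache holding the `capacity` most requested videos.

    `trace` is a list of video ids, one entry per request.
    """
    counts = Counter(trace)
    cached = {video for video, _ in counts.most_common(capacity)}
    misses = sum(1 for video in trace if video not in cached)
    return misses / len(trace)


def server_load(request_times, share_window):
    """Fraction of the requests for ONE video that the server has to serve.

    Assumption: every viewer keeps sharing the video for `share_window`
    seconds after requesting it, and one sharing peer is enough.
    """
    served_by_server = 0
    last_request = None
    for t in sorted(request_times):
        if last_request is None or t - last_request > share_window:
            served_by_server += 1  # nobody is sharing: go to the server
        last_request = t
    return served_by_server / len(request_times)


# Toy trace: video "a" is requested 6 times, "b" 3 times, "c" once.
trace = ["a"] * 6 + ["b"] * 3 + ["c"]
print(static_cache_miss_rate(trace, capacity=1))

# Toy arrival times (seconds) for a single video.
arrivals = [0, 60, 5000]
print(server_load(arrivals, share_window=600))
```

The longer the sharing window, the more requests find a peer that still holds the video, which is the intuition behind the jump from the watch-only scenario to the extra-day scenario.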

Aliasing and Illegal Uploads: Aliases are copies of a video and may dilute content popularity. From a sample of aliases tagged by volunteers, the paper shows that removing aliases may increase the popularity of some videos by two orders of magnitude. Most of the aliases come from one-timers, and the maximum number of aliases per user found was 15. Moreover, only a small fraction of the video deletions were due to copyright violations.

This is certainly a good example of a well-written characterization paper. It presents many interesting conclusions motivated by relevant questions. The presentation is excellent, and the authors show solid expertise in statistics and system design.

Link: an.kaist.ac.kr/traces/papers/imc131-cha.pd
