Me and my research: An Analysis of Live Streaming Workloads on the Internet

This is a characterization study by researchers from Carnegie Mellon University. The authors analyze a 3-month workload from Akamai Technologies, a big player in content distribution. The paper covers not only typical workload analysis (e.g. popularity, session arrival) but also other interesting aspects such as user diversity and client lifetime.

The dataset used in this paper contains more than 70M requests for 5K URLs and covers many publishers and kinds of content, such as radio stations and short videos. The authors treat a URL as an event and call a 24-hour chunk of an event a stream. About 70% of the streams are audio and only 7% are video. Most of the analyses are based on 3 distinct categories of content: All (everything), DailyTop40 (top 20 audio and top 20 video for each encoding format for every day in the 3-month period) and Large (streams with peaks of more than 1K concurrent viewers). Streams are also classified by duration as non-stop (24/7, 76%) or short (a few hours, 24%), and as recurring (97% of large streams) or one-time. Another interesting classification concerns request arrival: flash crowd streams are those that show a significant, sudden increase in the arrival rate. All short streams and half of the large streams show flash crowd behavior.

Popularity: The popularity distribution of streams was found to follow a Zipf-like distribution with 2 modes, one for streams with 10K-7M requests and the other for streams with fewer than 10K requests.
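A quick way to check for Zipf-like popularity is to fit a straight line to log(request count) against log(rank). The sketch below is only an illustration of that idea, not the authors' method; the function name zipf_exponent and its input (a list of per-stream request counts) are my own assumptions. For a two-mode distribution like the one reported, one would probably fit each range of request counts separately.

```python
import math

def zipf_exponent(request_counts):
    """Estimate the Zipf exponent by least squares on a log-log rank plot.

    request_counts: hypothetical list with one request total per stream.
    Returns alpha such that count is roughly proportional to rank ** -alpha.
    """
    counts = sorted((c for c in request_counts if c > 0), reverse=True)
    if len(counts) < 2:
        raise ValueError("need at least two streams with requests")
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(c) for c in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # The fitted slope is negative for a decaying popularity curve.
    return -sxy / sxx
```

A least-squares fit on a log-log plot is a rough diagnostic; it is sensitive to the noisy tail of unpopular streams.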

Arrival Process: 70% of the streams have a median interarrival interval of 1 second or less. Interarrival time is shorter for large than for short streams, since large streams have more requests. An exponential distribution can be used to model request interarrivals at shorter (stationary) timescales.
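To make the exponential model concrete, here is a minimal sketch that computes interarrival times from request timestamps and fits the rate by maximum likelihood (rate = 1 / mean gap). The function names and the input format (timestamps in seconds within one stationary window) are assumptions for illustration; the paper's own fitting procedure may differ.

```python
import math

def interarrival_times(timestamps):
    """Gaps between consecutive request timestamps (seconds)."""
    ordered = sorted(timestamps)
    return [b - a for a, b in zip(ordered, ordered[1:])]

def exponential_rate(gaps):
    """Maximum-likelihood rate of an exponential distribution: n / sum(gaps)."""
    total = sum(gaps)
    if not gaps or total <= 0:
        raise ValueError("need at least one positive gap")
    return len(gaps) / total

def exponential_cdf(t, rate):
    """P(gap <= t) under the fitted exponential model."""
    return 1.0 - math.exp(-rate * t) if t > 0 else 0.0
```

Comparing exponential_cdf against the empirical distribution of the gaps over a short window is one simple way to see how well the model holds.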

Temporal Variation: The workload shows clear temporal variation (daily and weekly), with peaks in the number of requests on weekends. Moreover, it is possible to distinguish the impact of American and European users on the number of requests over time.

Flash Crowds: Short duration streams show less time-of-day and time-zone related behavior. The authors argue that this may be because the popularity of short content is driven by the content itself rather than by user behavior.

Tail Analysis of Session Duration: The tail is the last 10% of the (CCDF) distribution. Non-stop streams present Pareto heavy-tailed behavior at some point in the plot. For short streams, there are clear cut-offs and the session duration distribution follows a truncated Pareto distribution.
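One common way to estimate the tail index of a Pareto-like tail is the Hill estimator applied to the largest observations. The sketch below uses it on the top 10% of session durations, matching the tail definition above. This is my own choice of technique for illustration and is not claimed to be what the authors used; hill_tail_index and its parameters are hypothetical names.

```python
import math

def hill_tail_index(durations, tail_fraction=0.1):
    """Hill estimate of the Pareto tail index from the largest durations.

    durations: hypothetical list of session durations (any positive unit).
    tail_fraction: share of the largest samples treated as the tail.
    """
    ordered = sorted((d for d in durations if d > 0), reverse=True)
    k = int(len(ordered) * tail_fraction)
    if k < 1 or k >= len(ordered):
        raise ValueError("not enough samples for the requested tail")
    threshold = ordered[k]  # largest value just outside the tail
    log_excess = sum(math.log(x / threshold) for x in ordered[:k])
    if log_excess == 0:
        raise ValueError("tail values are all equal to the threshold")
    return k / log_excess
```

The estimator assumes an untruncated Pareto tail, so for short streams with a clear cut-off it would likely overestimate the index.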

Transport Protocol: While UDP is considered the most appropriate protocol for streaming, 40% of the sessions use TCP.

Host Location Analysis: Based on host IP addresses, the authors analyze the geographic location of hosts at different levels (AS, city, country, timezone). For most of the streams, most of the clients are in the same timezone. Clients of large streams come from diverse AS domains (200-3,500) and are diverse at the city (200-3,500), country and timezone levels. Large streams also have a large clustering index (minimum number of locations needed to cover 90% of the IP addresses). Smaller streams show lower location diversity, but it is still significant (e.g. half of the streams cover 100+ cities, 50% cover at least 10 countries). When only simultaneous clients are considered (i.e. clients that accessed the video in a 1-hour interval), the diversity is reduced but is still significant.
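The clustering index described above can be computed greedily: count clients per location, take locations from largest to smallest, and stop once 90% of the IP addresses are covered. A minimal sketch, assuming a hypothetical mapping from IP address to location (AS, city, country or timezone):

```python
from collections import Counter

def clustering_index(location_of_ip, coverage_percent=90):
    """Minimum number of locations needed to cover the given share of IPs.

    location_of_ip: hypothetical dict mapping each client IP to a location.
    coverage_percent: integer percentage of IPs that must be covered.
    """
    total = len(location_of_ip)
    if total == 0:
        return 0
    per_location = Counter(location_of_ip.values())
    covered = 0
    # Taking the most populated locations first minimizes the count.
    for used, (_, size) in enumerate(per_location.most_common(), start=1):
        covered += size
        if covered * 100 >= coverage_percent * total:
            return used
    return len(per_location)
```

For example, with 6 clients in one city, 3 in a second and 1 in a third, two cities already cover 90% of the addresses.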

Client Birthrate: Clients are identified by user ids. Birthrate varies from 10% to 100% for DailyTop40 streams. Large streams have lower birthrate, as well as non-stop streams when compared to short ones.
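The post does not spell out how birthrate is computed, so the sketch below assumes one plausible definition: for each day, the fraction of that day's distinct clients that were never seen on an earlier day. The function name and the input format (one set of user ids per day, in chronological order) are hypothetical.

```python
def daily_birthrate(clients_by_day):
    """Fraction of each day's clients that are new (assumed definition).

    clients_by_day: hypothetical list of sets of user ids, one per day.
    Returns one value per day, or None for days with no clients.
    """
    seen = set()
    rates = []
    for clients in clients_by_day:
        if not clients:
            rates.append(None)
            continue
        new_clients = clients - seen
        rates.append(len(new_clients) / len(clients))
        seen |= clients
    return rates
```

Under this definition the first day always has a birthrate of 100%, so it is usually more informative to look at later days.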

Client Lifetime: Half of the users stay only for one day. For 90% of the events more than 50% of the users are one-timers. Clients that are not one-timers have an average lifetime equal to the event duration, being responsible for steady memberships.
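As a rough illustration of these two measures, the sketch below counts one-timers (clients seen on exactly one day) and computes a lifetime in days for everyone else, taken here as the span from first to last appearance. Both the lifetime definition and the input format (a dict from user id to the set of day indices on which the user appeared) are my assumptions.

```python
def one_timer_fraction(days_seen_by_client):
    """Share of clients that appear on exactly one day."""
    if not days_seen_by_client:
        return 0.0
    one_timers = sum(1 for days in days_seen_by_client.values() if len(days) == 1)
    return one_timers / len(days_seen_by_client)

def lifetimes_of_returning_clients(days_seen_by_client):
    """Lifetime in days (first to last appearance, inclusive) of non-one-timers."""
    return {
        client: max(days) - min(days) + 1
        for client, days in days_seen_by_client.items()
        if len(days) > 1
    }
```

Comparing the average of these lifetimes with the event duration is one way to check the steady-membership observation.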

I did not like reading this paper very much. Most of the characterization papers I've read seem too bureaucratic to me (popularity follows this distribution, as everybody knew, session arrival follows this other one, etc.). Moreover, it is difficult to organize so many findings and explanations in a single flow of ideas. Maybe, instead of characterizing one scenario using several analyses, it would be better to provide an in-depth analysis of one property in many scenarios.

Link: http://esm.cs.cmu.edu/technology/papers/imc04.pdf

Wednesday, February 8, 2012