Saturday, April 14, 2012

Computer Science Classics and The Emperor’s Old Clothes

While I'm stuck in the middle of one of the hardest papers I've ever tried to read, I've decided to spend some time reading Computer Science classics. The fact that I have a paper to submit, a dissertation to finish, dozens of other papers to read, and a long-term project to work on seems to be the right motivation to procrastinate wisely. While reading the latest issue of the Communications of the ACM, an article by Selma Tekir drew my attention to CS classics, which could be defined as those papers everyone wants to have read but nobody wants to read. Basically, she argues that the classics, mostly theoretical CS papers written in the 50's and 60's, are important for: 1) providing fundamental theoretical knowledge on the field, 2) providing new perspectives and insights, 3) inspiring young professionals, 4) providing a common history to unite the community, and 5) facilitating the recognition of computer science as an independent science and profession.

The list of classics given by the author is the following:

  • “The Emperor’s Old Clothes,” C.A.R. Hoare
  • “An Axiomatic Basis for Computer Programming,” C.A.R. Hoare
  • “Gödel’s Undecidability Theorem,” S.F. Andrilli
  • “Computing Machinery and Intelligence,” A.M. Turing
  • “Reflections on Trusting Trust,” K. Thompson
  • “The Humble Programmer,” E.W. Dijkstra
  • “An Interview with Edsger W. Dijkstra,” P. Frana 
  • “Computer Programming as an Art,” D. Knuth
  • “The ‘Art’ of Being Donald Knuth,” E. Feigenbaum
  • “Donald Knuth: A Life’s Work Interrupted,” E. Feigenbaum

As an exercise, I've decided to read “The Emperor’s Old Clothes,” by C.A.R. Hoare. Hoare is the inventor of Quicksort and received the 1980 Turing Award for his "fundamental contributions to the definition and design of programming languages". This specific paper is basically a transcript of his Turing Award lecture, in which he describes some of his experiences while working on several important projects, such as:

  • Quicksort
  • The first compiler for Algol 60
  • The Elliott 503 Mark II software system (a complete failure)
  • The formal definition of programming languages (Hoare logic and CSP)
  • The study of parallel programming and language constructs for operating systems (e.g., monitors)
  • Algol X (also a failure)
  • Algol W, which is based on Algol X
  • Algol 68 (Hoare considers it to be a failure, but some don't)
  • PL/I 
  • Ada, probably the most controversial topic of the text, about which he states: 
    • "Do not allow this language in its present state to be used in applications where reliability is critical, i.e., nuclear power stations, cruise missiles, early warning systems, anti-ballistic missile defense systems".

There are many interesting messages throughout the paper, but the strongest of them is about simplicity and elegance. Hoare argues that the lack of simplicity was the cause of the failure of the ambitious Elliott 503 system. In programming languages, simplicity and elegance matter mainly because the programmer should never be trusted, something already known in 1960. Another lesson is that a team of smart people working on a very challenging task and an interesting problem can still fail miserably due to lack of focus, resources, documentation, communication, and control. It was therefore inevitable that I would recall some of my software engineering classes while reading the text.

After the Elliott 503 project, Hoare became an advocate of simplicity and elegance in every project he participated in. Although he seemed to be right several times, I think he may have been too conservative: some modern programming languages widely used today do not follow his standards. However, he makes an interesting point that nothing can stand against money. In other words, it is difficult to evaluate his arguments after all the effort and money invested in some of the programming languages in use today.

Another conclusion from Hoare's speech is about how committees make such bad decisions in practice. This phenomenon is not new, but its causes are still not clear to me. Hoare, together with the programming language experts of the time, was part of several committees whose objective was to define a new programming language, and some of the resulting languages were a disaster.

An interesting point about Hoare's career is that he moved from industry to academia in order to work on the formal definition of programming languages, something he was not able to do as a business. This kind of transition still happens in Computer Science, but I think it is much rarer today. On the other hand, I've found several examples of people leaving academia for industry, especially to work for big companies such as Google, Facebook, and Microsoft.

Link: http://zoo.cs.yale.edu/classes/cs422/2011/bib/hoare81emperor.pdf

Monday, April 2, 2012

Sequential Influence Models in Social Networks

This paper is authored by researchers from Cornell University, including Jon Kleinberg, and from Yahoo! Research. It studies two influence models, one based on snapshots of a network and another based on detailed temporal dynamics. In these models, vertices are activated with respect to communities, and the metric of interest is how the number of active neighbors affects the probability that a node becomes active. The two models are compared, and a technique for estimating the detailed temporal dynamics based on snapshot data is proposed and evaluated using a dataset from Wikipedia.

Previous work has studied social influence by measuring how the number of active neighbors of a node affects the likelihood of this node becoming active. In fact, this idea is the basis of a standard epidemic model, the Linear Threshold Model. This paper considers two definitions of influence, according to the network information available:
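As a refresher, the Linear Threshold Model can be sketched in a few lines. This is a generic, unweighted variant (my own illustration, not code or notation from the paper): each node draws a random threshold and activates once the fraction of its active neighbors reaches that threshold.

```python
import random

def linear_threshold(neighbors, seeds, seed=0):
    """One run of an unweighted Linear Threshold model.

    neighbors: dict mapping each node to its set of neighbors
    seeds: initially active nodes
    Each node u draws a threshold theta_u uniformly at random and
    activates once the fraction of its active neighbors >= theta_u.
    """
    rng = random.Random(seed)
    theta = {u: rng.random() for u in neighbors}
    active = set(seeds)
    changed = True
    while changed:  # iterate until no node changes state
        changed = False
        for u in neighbors:
            if u in active or not neighbors[u]:
                continue
            frac = len(neighbors[u] & active) / len(neighbors[u])
            if frac >= theta[u]:
                active.add(u)
                changed = True
    return active
```

The process is monotone (nodes never deactivate), so it always terminates.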

Snapshot definition: Based on information from snapshots of the network at different points in time. Given the number of inactive nodes with k active neighbors at time t_1 (d_s(k)) and the number of those nodes that are active at time t_2 (n_s(k)), the metric of interest is defined as p_s(k) = n_s(k) / d_s(k). 
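A minimal sketch of how p_s could be computed from two snapshots (my own illustration; function and variable names are mine, and I assume a static undirected graph given as adjacency sets):

```python
from collections import defaultdict

def snapshot_influence(neighbors, active_t1, active_t2):
    """Estimate p_s(k) = n_s(k) / d_s(k) from two snapshots.

    neighbors: dict mapping each node to its set of neighbors
    active_t1, active_t2: sets of nodes active at times t1 and t2
    """
    d_s = defaultdict(int)  # inactive nodes with k active neighbors at t1
    n_s = defaultdict(int)  # ... among those, the ones active by t2
    for u, nbrs in neighbors.items():
        if u in active_t1:
            continue  # only nodes still inactive at t1 are "at risk"
        k = len(nbrs & active_t1)
        d_s[k] += 1
        if u in active_t2:
            n_s[k] += 1
    return {k: n_s[k] / d_s[k] for k in d_s}
```

Note that everything between the two snapshots is invisible here; that blind spot is exactly what the ordinal-time definition below addresses.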

Ordinal-time definition: Complete information, including link formation and the activation of each node over time, is available. In this case, d_o(k) is the number of nodes exposed to k active neighbors at some point in time, and n_o(k) is the number of nodes that were k-exposed and became active before becoming (k+1)-exposed. The influence ratio is defined as p_o(k) = n_o(k) / d_o(k).
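For contrast, here is a sketch of p_o on a static graph (a simplification of the paper's setting, which also has link-formation events; names are mine). We replay the activation sequence and record each exposure level a node reaches:

```python
from collections import defaultdict

def ordinal_influence(neighbors, activation_order):
    """Estimate p_o(k) = n_o(k) / d_o(k) from a full activation sequence.

    neighbors: dict mapping each node to its set of neighbors
    activation_order: list of nodes in the order they became active
    """
    d_o = defaultdict(int)  # nodes that were k-exposed at some point
    n_o = defaultdict(int)  # ... and activated before becoming (k+1)-exposed
    active = set()
    exposure = {u: 0 for u in neighbors}  # current number of active neighbors
    for u in neighbors:
        d_o[0] += 1  # every node starts out 0-exposed
    for u in activation_order:
        n_o[exposure[u]] += 1  # u activates at its current exposure level
        active.add(u)
        for v in neighbors[u]:
            if v not in active:  # still-inactive neighbors reach a new level
                exposure[v] += 1
                d_o[exposure[v]] += 1
    return {k: n_o[k] / d_o[k] for k in d_o}
```

Since exposure only grows, each (node, k) pair is counted in d_o at most once, matching the definition above.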

These two definitions are evaluated using a dataset from Wikipedia. An undirected social network is defined based on links between pairs of editors of which at least one has written on the other's user-talk page. Users become active by editing an article; therefore, each article has a community of editors that can spread over the network. A dataset containing approximately 510K users of the English Wikipedia is used in the study. Moreover, for some analyses, results using the German and French versions of Wikipedia are also presented.

The first important result of the paper is that the two influence definitions produce different results. In the following figure, the authors show the values of p_o and p_s for different values of k, under the ordinal-time and snapshot definitions respectively.

These results come directly from the values of n_o, d_o, n_s, and d_s, which are shown in the following figure. These functions seem to follow a power-law distribution.

Further, the authors study the relation between the two models based on the following sets:

  • B(t_1) = {(u,C,k_1) | u joined C before t_1 and u had k_1 neighbors in C at t_1}. These events do not affect p_s but do affect p_o. As a consequence, the values of n_o and d_o are shifted upwards with respect to n_s and d_s.
  • J(t_1,t_2) = {(u,C,k_1,k_2) | u had k_1 neighbors in C at t_1, u joined C between t_1 and t_2, and u had k_2 neighbors in C at t_2}. These events contribute to both p_s and p_o. However, the contribution to p_o is stretched towards higher values of k.
  • N(t_2) = {(u,C,k_2) | u did not join C before t_2 and u had k_2 neighbors in C at t_2}. Such tuples contribute only to p_o.

The analysis of the slope and the y-intercepts of the fit of the functions n_o, d_o, n_s, and d_s to a power-law distribution shows the upward effect of B and N. The stretching effect of J is small.

The authors also present a very simple technique for simulating ordinal-time data from snapshots. For the events in B, it is assumed that u joined C at a uniformly random time between 0 and t_1. The same is done for J (u is assumed to have joined C at a uniformly random time between t_1 and t_2). The events in N are not considered. It is shown that this technique works well in practice, producing values of p_o that are similar to those obtained from ordinal-time data. The accuracy of the technique improves with the number of snapshots and as the time between snapshots gets longer.
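The sampling step of this technique, as I understand it, fits in a few lines (function names and the tuple layout are my own, mirroring the B and J sets defined above):

```python
import random

def simulate_join_times(B_events, J_events, t1, t2, seed=0):
    """Assign uniformly random join times to snapshot events.

    B_events: tuples (u, C, k1), nodes that joined C before t1
    J_events: tuples (u, C, k1, k2), nodes that joined between t1 and t2
    Events in N are ignored, as in the paper's technique.
    Returns (u, C, join_time) tuples sorted by join time, from which
    an ordinal-time replay can be simulated.
    """
    rng = random.Random(seed)
    joins = [(u, C, rng.uniform(0, t1)) for (u, C, k1) in B_events]
    joins += [(u, C, rng.uniform(t1, t2)) for (u, C, k1, k2) in J_events]
    joins.sort(key=lambda event: event[2])
    return joins
```

The simulated sequence can then be fed to an ordinal-time estimator to approximate p_o without ever observing the true join times.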

The problem studied in this paper is very specific, but it is certainly relevant. I would have liked to see more datasets in the evaluation; at several points, I wondered whether the results discussed were general enough. In particular, the authors' arguments were too complex for me, and I think they should be better supported with more experimental results. Finally, although the authors argue that their technique for simulating ordinal-time influence from snapshots works well, they do not present any quantitative measure to support this claim. Moreover, I did not see any reason why such a technique is actually effective.

Link: http://www.cs.cornell.edu/home/kleinber/icwsm10-seq.pdf