Monday, December 17, 2012

On Compressing Weighted Time-evolving Graphs

Just finished reading this short paper published in the CIKM conference this year. It is authored by several researchers from the University of Melbourne, Australia, some of whom are also co-authors of the ciForager paper I summarized here a month ago. Jian Pei, from Simon Fraser University, is also one of the authors of this paper.

The problem studied by the authors is compressing a network whose edge weights are dynamic. One example they mention is the DBLP co-authorship network, where pairs of authors may have different co-authorship strengths over time. This setting differs from the traditional graph compression problem studied in the literature, which focuses on compressing the structure of a static graph. In general, we compress data to save storage space and communication cost when dealing with large databases, such as the real-life graphs that arise from social networks, the Web, etc.

The approach described in the paper is based on a 3D tensor representation of a dynamic graph. Tensors are basically multi-dimensional arrays, and they have been applied to several graph mining problems in recent years. Here, the graph is a cube (vertices x vertices x time) with weights in the cells. In practice, because we will be dealing with sparse graphs, we can store the sparse version of this tensor as a set of locations (indices) and non-zero values (weights). The idea is to reduce the entropy of this tensor representation by merging subsets of weights across dimensions. For instance, the edge weights 4, 5, and 6 can all be mapped to the value 6. The above figure shows an illustrative dynamic graph and how edge weights can be merged as a means to reduce the entropy of the edge values. The challenge is selecting the appropriate subsets of values to be merged while preserving relevant properties of the dynamic graph (e.g., weighted shortest path distances for snapshots). The authors propose the use of a hierarchical cluster tree to merge edge values in a way that leads to an optimal compression given an error threshold.
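
As a rough illustration of the idea, here is a minimal Python sketch (my own, not the authors' code; the entropy helper and the choice of merging {4, 5, 6} into 6 are assumptions for the example):

    from collections import Counter
    import math

    # Sparse tensor: (source, target, snapshot) -> weight; only non-zeros are stored.
    tensor = {(0, 1, 0): 4, (0, 1, 1): 5, (1, 2, 0): 6, (1, 2, 1): 9}

    def entropy(weights):
        """Shannon entropy (bits per value) of the multiset of stored weights."""
        weights = list(weights)
        counts = Counter(weights)
        n = len(weights)
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    print(entropy(tensor.values()))  # 2.0 bits: four distinct values

    # Merge the weight subset {4, 5, 6} into a single value (here, 6):
    merged = {k: (6 if w in {4, 5, 6} else w) for k, w in tensor.items()}
    print(entropy(merged.values()))  # ~0.81 bits: only {6, 9} remain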

Non-zero edge weights are always rounded to their nearest integer. The locations are represented as a sequence of integers, where each position gives the number of zeros before the next non-zero position. These integers are concatenated with the sequence of values into a single string LocVal, which is then encoded using an entropy encoding method (e.g., arithmetic coding, Huffman coding). Integers are represented as variable-length quantities (VLQs), which represent large integers as sequences of binary octets. In particular, the first bit of each octet tells whether there is a next octet that is part of the number being represented. As a consequence, an integer x can be represented using 8.ceil(log2(x+1)/7) bits. The graph is represented as a sequence |V||T|Meta(LocVal)H(LocVal), where the first two values are the dimensions (number of vertices and snapshots) of the dynamic graph and Meta(LocVal) contains the unique edge weights (rounded to integers) and their frequencies. H(LocVal) is the entropy of LocVal, but I have no idea why it appears like that in the representation. The total length of the graph representation depends on the lengths of |V|, |T|, and Meta(LocVal), and also on the entropy of LocVal. Therefore, we can compress this representation by reducing the entropy (or heterogeneity) of LocVal. The maximal compression would be achieved by averaging all the edge values across snapshots, but this approach would certainly destroy important properties of the original dynamic graph.
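
A small Python sketch of the VLQ encoding and the length formula above (my own illustration; the paper itself does not give code):

    import math

    def vlq_encode(x):
        """Variable-length quantity: 7 payload bits per octet; the first bit
        of each octet flags whether another octet follows."""
        octets = [x & 0x7F]                   # last octet: continuation bit clear
        x >>= 7
        while x:
            octets.append((x & 0x7F) | 0x80)  # continuation bit set
            x >>= 7
        return bytes(reversed(octets))

    def vlq_bits(x):
        """Length in bits: 8 * ceil(log2(x + 1) / 7), with a guard for x = 0."""
        return 8 * max(1, math.ceil(math.log2(x + 1) / 7))

    print(vlq_encode(300).hex())  # '822c': two octets
    print(vlq_bits(300))          # 16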

The solution presented by the authors (in the form of the MaxErrComp algorithm) uses a cluster tree to merge values. Two values are merged whenever they differ by less than a cutoff value, and merged values are set to their average. This cutoff is decreased gradually while the error of the compression, measured as the maximum error of the weighted shortest paths over all snapshots, remains below a threshold epsilon.
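
My reading of the greedy merging idea, as a rough sketch (the error_of function is a placeholder for the paper's maximum weighted-shortest-path error, and the agglomerative loop is a simplification of their cluster tree):

    def greedy_merge(values, epsilon, error_of):
        """Repeatedly merge the two closest remaining weight clusters into
        their average while the overall compression error stays within epsilon."""
        clusters = [[v] for v in sorted(set(values))]
        while len(clusters) > 1:
            avgs = [sum(c) / len(c) for c in clusters]
            # closest adjacent pair (cluster averages are kept sorted)
            i = min(range(len(clusters) - 1), key=lambda k: avgs[k + 1] - avgs[k])
            trial = clusters[:i] + [clusters[i] + clusters[i + 1]] + clusters[i + 2:]
            if error_of([sum(c) / len(c) for c in trial]) > epsilon:
                break  # the previous clustering was the last acceptable one
            clusters = trial
        return [sum(c) / len(c) for c in clusters]

    # With a dummy zero-error measure, everything collapses to the global average:
    print(greedy_merge([4, 5, 6, 9], epsilon=1.0, error_of=lambda avgs: 0.0))  # [6.0]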

They compare the proposed approach against two baselines. The first, called MaxErrRandom, merges values randomly (i.e., the hierarchical clustering method is not applied). The second is StatComp, a technique for compressing static graphs. In order to apply StatComp in the dynamic setting, they compress each snapshot of the input graph separately.

The datasets used in the evaluation are a portion of the DBLP co-authorship network (3K vertices, 56K edges, 12 years) and the Enron email network (2K vertices, 100K edges, 28 months). The following figure shows the encoding cost achieved by the 3 different techniques for varying error thresholds on both datasets. MaxErrComp outperforms the baseline approaches, and StatComp achieves the worst results. MaxErrRandom and MaxErrComp perform similarly only in extreme cases: at 0 error (no compression) or when the error is so high that the compressed graphs degenerate to the average graph.


In another experiment discussed in the paper, the compression rate was fixed and then the error was analyzed (the opposite of the previous experiment). MaxErrComp is shown to significantly outperform the baseline techniques according to a statistical test.

This is a curious paper. First, I was surprised that it is not a full paper, because both the problem studied and the solution proposed revealed potential for a greater contribution than what we often find in short papers. Then, when I started reading the paper, I raised my expectations as I found some nice background topics (e.g., information theory, coding, tensors). But I got frustrated by the technical and experimental sections. MaxErrComp sounds like a naive approach to me; the idea of merging values using a greedy approach and then evaluating the compressed graph using an overall error metric is not very challenging. Maybe the problem would become more interesting if a relative error metric were considered instead. I also missed some discussion about how hard the original problem is. Finally, the evaluation was poor, and the results achieved by MaxErrComp do not seem very impressive even against the random baseline.

Link: http://people.eng.unimelb.edu.au/akan/dgraphs/dyn-graphs-compr.pdf

Tuesday, November 27, 2012

Wavelet Synopses with Error Guarantees

I just finished reading this cool paper about providing approximate answers to queries, with bounded expected error for individual queries, using wavelet-based representations of the data. The problem solved is very clear, and the techniques proposed are supported by a rich background theory, especially regarding wavelets, probability theory, optimization, and dynamic programming. The paper was published in the ACM SIGMOD Conference 2002, when the authors were working at Bell Labs. There is also a journal version, published in ACM TODS two years later, that is basically a copy of this paper with some additional details (e.g., proofs, explanations, and results). I've lost track of how many hours I spent on this paper, and I still think I should understand it more deeply, but I decided to let it go for now.

In several scenarios, providing a fast approximate answer to a query is more effective than taking a long time to give the exact answer. Previous work has applied wavelets, a harmonic analysis technique for function decomposition, to represent a database as a compact set of coefficients, known as a synopsis. Based on a good set of coefficients, it is possible to retrieve specific values in the database with small overall error. However, existing wavelet-based strategies for building synopses did not give any guarantee on the error of individual queries. In other words, given a specific query, there were no guarantees regarding the expected error of its answer. This is the problem addressed in this paper.

Wavelet synopses are based on the Haar wavelet function. The idea is very simple: given a sequence of values, we keep pairwise averages and detail coefficients at different levels (or resolutions). Detail coefficients capture (half) the difference between pairs of values. We start from pairs of values in the database and stop when there is a single average (i.e., the average of all the values in the database) and a single detail coefficient. These averages and detail coefficients enable the full reconstruction of the values in the database. Starting from the root of the hierarchy (single average and coefficient), we generate the values of the next level by adding (left child) and subtracting (right child) the detail coefficient from the average value. The original values are obtained at the last level of the hierarchy. The following table shows an example of how we compute the averages and coefficients for a set of values shown in the first line.
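
Here is a minimal sketch of the decomposition (my own illustration; the sequence [2, 2, 0, 2, 3, 5, 4, 4] is just an example):

    def haar_decompose(values):
        """Haar decomposition of a length-2^k sequence: returns the overall
        average and the detail coefficients, coarsest level first."""
        details, level = [], list(values)
        while len(level) > 1:
            averages = [(level[i] + level[i + 1]) / 2 for i in range(0, len(level), 2)]
            # detail coefficient: half the difference within each pair
            coeffs = [(level[i] - level[i + 1]) / 2 for i in range(0, len(level), 2)]
            details = coeffs + details   # coarser levels go to the front
            level = averages
        return level[0], details

    avg, details = haar_decompose([2, 2, 0, 2, 3, 5, 4, 4])
    print(avg, details)  # 2.75 [-1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]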


The error tree shows the coefficients (c_0, c_1, ..., c_k) and how they are associated with the data values.


Any value can be reconstructed from the coefficients on the path from the root to it in the error tree. In the same fashion, range queries can be answered based on the paths to the values involved. In previous work, the coefficients chosen to compose the synopsis were those with the largest normalized values (c_i/sqrt(2^L), where L is the level of c_i in the error tree). This selection of coefficients is known to minimize the L^2 error of the reconstruction, but it does not provide any guarantee on individual answers. In fact, the authors show an example where the optimal choice of coefficients leads to very bad answers for some values. This kind of result is mainly due to an unbalanced selection of coefficients in the error tree.
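
A sketch of the path-based reconstruction and of the conventional selection rule (assuming the heap-ordered coefficient layout produced by the sketch above; the normalization follows the c_i/sqrt(2^L) rule quoted in the text):

    import math

    def reconstruct(avg, details, j, n):
        """Rebuild value j (0-based) of an n-value array by walking the
        root-to-leaf path in the error tree (children of node i: 2i, 2i+1)."""
        coeffs = [avg] + list(details)
        value, i, lo, hi = avg, 1, 0, n
        while i < n:
            mid = (lo + hi) // 2
            if j < mid:
                value, i, hi = value + coeffs[i], 2 * i, mid      # left: add
            else:
                value, i, lo = value - coeffs[i], 2 * i + 1, mid  # right: subtract
        return value

    def keep_top_b(details, b):
        """Conventional deterministic synopsis: keep the b details with the
        largest |c_i| / sqrt(2^level), zeroing out the rest."""
        score = lambda i: abs(details[i - 1]) / math.sqrt(2 ** int(math.log2(i)))
        keep = set(sorted(range(1, len(details) + 1), key=score, reverse=True)[:b])
        return [c if i in keep else 0.0 for i, c in enumerate(details, start=1)]

    avg, details = 2.75, [-1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
    print([reconstruct(avg, details, j, 8) for j in range(8)])
    # [2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0] -- exact with all coefficients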

In order to solve this problem, the authors apply the idea of randomized rounding. Each coefficient c_i is associated with a random variable C_i that is gamma_i with probability c_i/gamma_i, or 0 with probability 1 - c_i/gamma_i, where 0 < c_i/gamma_i <= 1. This means that we replace the coefficients by variables that can be rounded to values larger than c_i or set to 0, but that still keep the same expected value c_i as the original coefficients. Moreover, the variance of each rounded coefficient is (gamma_i - c_i).c_i. The authors show that the estimations of values and range sums using coefficients generated with randomized rounding are unbiased, which means that their expected values are the correct ones for any value of gamma_i. However, increasing gamma_i increases the variance of the estimations and decreases the expected number of coefficients kept. The focus of the paper is how to set these gamma_i values so as to minimize the expected error of the estimations for a fixed expected number of coefficients (or budget).
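
The rounding step itself is tiny; here is a sketch (assuming a positive coefficient for simplicity) plus an empirical check of the unbiasedness claim:

    import random

    def rounded(c, gamma):
        """Randomized rounding: C = gamma with probability c/gamma, else 0,
        so E[C] = c (unbiased) and Var[C] = (gamma - c) * c."""
        assert 0 < c / gamma <= 1
        return gamma if random.random() < c / gamma else 0.0

    c, gamma = 0.5, 2.0
    draws = [rounded(c, gamma) for _ in range(200000)]
    print(sum(draws) / len(draws))                 # ~0.5: the mean stays c
    print(sum(d > 0 for d in draws) / len(draws))  # ~0.25 = c/gamma kept on average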

First, it is shown how the expected mean squared error (L^2) can be minimized. Due to specific properties of the minimization problem, the authors are able to provide an O(n log n) algorithm for the problem, where n is the number of values in the database. The following formula gives the optimal values of gamma_i without constraining the values of c_i/gamma_i to the interval (0,1]:


where y_i = c_i/gamma_i and k_i = c_i^2/2^level(c_i). In order to get optimal values that respect the constraint, they set any value of y_i greater than one to one.

But the main contribution of the paper is not the bound on the L^2 error but a bound on the relative error of individual queries. The approach to achieve such a bound is similar: an optimal selection of the gamma_i values. Because we are talking about relative errors, which can be highly influenced by small absolute values, the authors apply a sanity bound S in the computation of errors: whenever an absolute value is lower than S, the error to be minimized is relative to S and not to the specific value anymore. As with the mean error, the relative error is also related to the variance of the estimates, but now we are interested in improving the worst estimate, which is the one with the highest variance. The authors propose a dynamic programming algorithm for setting the values of y_i, which is equivalent to setting the corresponding values of gamma_i (y_i = c_i/gamma_i). Based on the fact that, in an optimal setting for an error subtree, the highest error is associated with the minimum data value, the algorithm finds the optimal solution. However, some perturbation of the coefficients may be required in order to make this property hold.

The idea of the dynamic programming algorithm is to minimize the relative error of the maximum-error path by trying different values of y_i at the root of an error subtree and then dividing the remaining budget (coefficients) optimally between its left and right subtrees.


The value of j is greater than N for the leaves of the error tree (i.e., data values). One issue with this strategy is that y_i is continuous, so we cannot check all possible values in its range. The solution the authors offer is to consider a set of q discrete values in the interval, so the possible values of y_i are 1/q, 2/q, ..., 1. The proposed algorithm has time complexity O(N q^2 B) and space complexity O(qB log N).

The authors also propose an alternative scheme that does not apply randomized rounding, deterministically retaining or discarding coefficients so as to minimize the error instead. This approach may introduce bias (the expected value of C_i is no longer c_i), but a careful choice of the gamma_i values can minimize this bias as well.

The overall algorithm is the following:


Notice that they generate several trials and always select the synopsis that minimizes the mean relative error.

In the experimental evaluation, they compare MinL2 (probabilistic synopses minimizing the L^2 error), MinRelVar (probabilistic synopses minimizing the relative error), MinRelBias (probabilistic synopses without randomized rounding), and the deterministic approach. Synthetic data was generated from a Zipf distribution of value frequencies, and different permutation strategies were applied to decide which values were assigned to which frequencies. A real database (Cover Type) with 581K tuples was also used in the evaluation; in this case, they selected a given attribute with cardinality at least 256. The evaluation metrics are the mean relative error, the maximum relative error, and the 25th percentile of the relative error. I won't show any plots here, but the main conclusions of the evaluation are:

  • MinL2 achieves relative error worse than all the other strategies;
  • MinRelVar and MinRelBias outperform the deterministic approach and MinL2, especially for small synopses. Using only 10 coefficients (4% of the data), they reduce the relative error of the deterministic strategy from 1.5 to 0.5;
  • The more skewed the data, the greater are the gains of MinRelBias and MinRelVar. 
  • Similar results were found for the real dataset.
My general view of this paper is very positive. I don't think the writing is very good, but the proposed strategies are really clever. One thing still missing for me is a good intuition for why the idea of randomized rounding enables the guarantees on the relative error. I believe it works, but I don't have an intuitive reasoning for its use in this particular scenario. Another thing that is not clear to me is whether using randomized rounding to minimize the L^2 error makes sense, since the deterministic strategy already minimizes the L^2 error and is much simpler. Another missing point is whether solving the original optimization problem over the relative error using a general optimization approach would give much better results, and at what cost. Finally, the meaning of 'error guarantees' in the paper is not very clear to me. I think that one specific probabilistic synopsis can still give a very poor answer to a given query, though it is unlikely. Therefore, maybe they are making poor answers unlikely but are still not providing guarantees regarding them.


Link: http://dl.acm.org/citation.cfm?id=564746


Sunday, November 18, 2012

ciForager: Incrementally Discovering Regions of Correlated Change in Evolving Graphs

I'm back to reading! This paper is authored by researchers from the University of Melbourne (Australia) and the National Institute of Informatics (Japan). It was published in the ACM TKDD journal this year. The problem studied by the authors is the discovery of subgraphs (or regions) that change in a similar (or correlated) fashion in an evolving graph (or network). The graph is given as a sequence of snapshots, and changes can represent edge activity. Patterns of interest correspond to sets of edges that are similar in both the temporal and the spatial (topological) dimensions; the authors call these patterns regions of correlated change. One interesting application for these patterns is the analysis of a brain network, where vertices are brain regions and edges connect regions that are close to each other in a 3D view of the brain. As the brain is exposed to a sequence of stimuli, some of its regions (i.e., subgraphs) are expected to present correlated activity levels along time.


The above figure shows an illustrative example of the discovery of regions of correlated change, which are sets of edges that are close in the graph topology and active in the same snapshots. It is easier to see these regions, and their specific changes, based on waveforms, as shown in the following figure.


More formally, the problem consists of, given a sequence of graph snapshots <G_1, ..., G_T> with a set of changed edges E_C^{1,T} (edges that change their status in the interval [1,T]), partitioning E_C^{1,T} into regions R = {R_1^{ts_1,te_1}, ..., R_L^{ts_L,te_L}} such that the temporal and spatial distances between edges inside the same region are minimized. The temporal distance is measured using an extended version of the Euclidean distance, and the spatial distance is measured as the shortest path distance in the union graph of all snapshots.
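
For the temporal dimension, a plain Euclidean distance between edge waveforms already conveys the idea; this sketch is my simplification (the paper uses an extended variant):

    import math

    def temporal_distance(waveform_a, waveform_b):
        """Euclidean distance between two edge-activity waveforms
        (one value per snapshot)."""
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(waveform_a, waveform_b)))

    print(temporal_distance([1, 0, 1, 1], [1, 0, 1, 1]))  # 0.0: same changes
    print(temporal_distance([1, 0, 1, 1], [0, 1, 0, 1]))  # ~1.73: uncorrelated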

The problem of identifying these regions of correlated change was previously proposed by the same authors; in fact, the two previous papers on this subject were both authored by the same group (curious). The main contribution of this particular paper is the design of efficient algorithmic techniques that enable the discovery of these patterns in large graphs. These techniques are combined into a new algorithm called ciForager, which builds upon cSTAG, a previous algorithm proposed by the authors for this same problem.

The general idea of cSTAG is to divide the snapshots into a sequence of consecutive and overlapping windows, discover the regions of correlated change for each window, and then merge them. Regions for each window are discovered using a bi-objective (time + space) clustering algorithm. There are several ways to combine the temporal and spatial optimization functions; the authors list some variations, but I won't describe them here. Regions from different windows are merged whenever their average temporal and spatial distances are below a user-defined threshold.


The above figure shows the process diagrams of cSTAG (left) and ciForager (right). The main difference between the two algorithms is the use of the following strategies for the efficient discovery of regions of correlated change, which constitute the main contributions of this paper:

  • Equivalence classes and synchronized connected components: An equivalence class is defined by a unique waveform, and a synchronized connected component (synCC) is a maximal set of connected edges that present the same waveform. The idea is to compute distances between synCCs instead of between isolated edges. SynCCs are updated efficiently through a set of 4 operations (extend, split, shrink, and merge). The worst-case complexity of updating the set of synCCs from a window W_k to a window W_{k+1} is O(|E_C^k|) + O(|V_{S_{kAvg}}|.|S_k|) + O(|E_C^{k+1}|), where E_C^k is the set of changed edges in W_k, |V_{S_{kAvg}}| is the average size of a synCC in S_k, and E_C^{k+1} is the set of changed edges in W_{k+1}. The number of synCCs is expected to be much smaller than the number of changed edges, which leads to performance gains of ciForager over cSTAG.
  • Voronoi diagrams for shortest path computation: A Voronoi diagram divides the space into cells such that points in each cell are closer to the center of their own cell than to the center of any other cell. In this paper, the authors apply a graph Voronoi diagram in order to compute approximate shortest paths efficiently (see the sketch after this list). The center of a cell is a synchronized connected component, and a cell is composed of the edges and vertices that are closer to that center than to any other center in the graph. Distances between synCCs are computed over a metagraph, where each vertex is a synCC, edge weights are minimum distances between vertices of two synCCs, and vertex weights are average distances between vertices within a given synCC. The distance between two synCCs is approximated as the sum of the edge and vertex weights on the path from one to the other. The authors show how this metagraph can be efficiently updated through merging, splitting, creating, and deleting operations. The idea is that only distances involving a limited number of cells need to be updated by these operations whenever the synCCs change. For instance, if two synCCs are merged, only the vertices and edges in the cells whose centers represent these synCCs are updated. Each of these operations takes O(E + V log V) time, where E is the set of edges and V is the set of vertices in the metagraph. This complexity comes directly from the fact that these operations are based on Dijkstra's algorithm.
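
Here is the sketch promised above: the basic graph Voronoi construction as a single multi-source Dijkstra pass (my own illustration; the paper's contribution is updating such diagrams incrementally, which this sketch does not do):

    import heapq

    def graph_voronoi(adj, centers):
        """Assign each vertex to its nearest center. `adj` maps a vertex to a
        list of (neighbor, weight) pairs; `centers` is an iterable of vertices."""
        dist = {v: float('inf') for v in adj}
        cell = {v: None for v in adj}
        heap = [(0.0, c, c) for c in centers]
        for c in centers:
            dist[c], cell[c] = 0.0, c
        heapq.heapify(heap)
        while heap:
            d, v, c = heapq.heappop(heap)
            if d > dist[v]:
                continue                      # stale queue entry
            for u, w in adj[v]:
                if d + w < dist[u]:
                    dist[u], cell[u] = d + w, c
                    heapq.heappush(heap, (d + w, u, c))
        return cell, dist

    # Toy path graph 0-1-2-3-4 with unit weights and centers {0, 4}:
    adj = {i: [] for i in range(5)}
    for i in range(4):
        adj[i].append((i + 1, 1.0))
        adj[i + 1].append((i, 1.0))
    cell, dist = graph_voronoi(adj, [0, 4])
    print(cell)  # {0: 0, 1: 0, 2: 0, 3: 4, 4: 4}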

The overall complexity of ciForager for a transition from window W_k to W_{k+1} is:


The first three parts are due to the updating of equivalence classes and synCCs. The fourth part comes from the update of the graph Voronoi diagrams (U_k is the number of synCC updates). The fifth part is due to the pairwise temporal distance computations between equivalence classes. The sixth part is related to the computation of shortest path distances in the metagraph, and the final component is the cost of running a clustering algorithm.

In the evaluation section, the authors show that ciForager is much faster than cSTAG while maintaining similar accuracy. These gains rely on localized changes and increase with the size of the graphs considered. The authors argue that ciForager enables the analysis of the global BGP (Border Gateway Protocol) network, something that was not feasible before. They also support their case using synthetic datasets and the 1998 World Cup Web site access data.

The metrics applied in the evaluation can be divided into two groups:

  1. External measures: Based on a generated set of regions (ground truth). Given the known regions and the regions discovered by the algorithms, the problem consists of matching these regions; the authors solve this matching as a mass transportation linear programming problem.
  2. Internal measures: Average intra-region and inter-region distances for both the spatial and temporal dimensions. 
Eight different clustering methods are evaluated (remember that ciForager does not depend on a specific clustering method), and ciForager outperforms cSTAG (100-800 times faster) for all of them. However, cSTAG is more accurate than ciForager, and the authors argue that this is because some datasets have regions with small diameter, which leads to edges being mistakenly grouped into the same regions. The performance gap between cSTAG and ciForager is shown to increase as the number of snapshots increases, without loss in accuracy. As expected, the larger the number of synCCs, the smaller the performance gains of ciForager over cSTAG; the opposite happens when the number of changed edges increases, because the number of synCCs is expected to grow more slowly than the number of changed edges. It is interesting to notice that ciForager does not achieve similar gains on the BGP data, being only about 40 times faster than cSTAG. The authors attribute this to randomly distributed changes in the BGP network. Two specific regions related to the damage caused by Hurricane Katrina, not only to the US network but also in Europe, are discussed.

Concluding, this paper studies an interesting problem, and the performance gains shown are good. However, I don't usually like papers that describe the result of several combined but somewhat uncorrelated strategies to improve algorithmic performance. The fact that the algorithm opens possibilities for analyzing new and larger data is amazing, but I think the authors could have done a better job describing these results. I understand, though, that the focus of the paper was the performance gains of ciForager over cSTAG. Finally, I found some typos and other small errors that sometimes disrupted my reading; I don't expect to find this kind of issue in a TKDD paper.


Link: http://dl.acm.org/citation.cfm?id=2362385

Sunday, October 7, 2012

Naturally Obsessed: The Making of a Scientist

This weekend I watched an amazing documentary film called Naturally Obsessed, which shows the routine of the Shapiro Lab at Columbia University.



The main idea of this documentary is to give a realistic view of the research process to a broad audience. This view is centered on the experiences of three PhD students (Rob, Kilpatrick, and Gabrielle) advised by Prof. Larry Shapiro. In the process of becoming scientists by exploring challenging research problems, these students must deal with constant failure and pressure. I believe the documentary is very successful in demystifying the scientist as a genius and portraying how much the evolution of science relies on hard work and a kind of obsession. Prof. Shapiro begins the documentary with the following words:

              "One of the best things you can do as a scientist is to suffer from obsessive 
              compulsive disorder, so that you become obsessed with a problem and 
              can't stop working on it until you get your answer."



I'm pretty sure that many psychologists and psychiatrists would disagree with Prof. Shapiro's opinion, but I agree that obsession is to some extent important to drive the scientist along the research process, especially in failure situations.

The documentary was shot over three years and was also experimental in the sense that the filmmakers did not know the results of the research projects the students were working on and, as a consequence, the future of the students taking part in the documentary.

All the students in the documentary were working on similar problems. Their main goal was understanding the functioning of a protein by looking into its structure. In order to get such a structure, they had to engineer the proteins as crystals and then analyze how they diffract X-rays by running experiments at a synchrotron. The main challenge was getting crystals that would enable the diffraction.

Rob Townley, the main character of the documentary, is focused on a specific protein called AMPK, which controls the burning and storage of fat and is linked to diabetes and obesity. Rob is certainly not the successful, young, and smart kind of PhD student most people are used to hearing about. However, he is persistent and ends up providing many emotional moments throughout the documentary, especially at the end. But I won't spoil it here, so please watch it. This is my favorite quote from Rob about the research process:

                "Either you got to be a fool or an optimist."

But this is one of those typical cases where you don't have the skills to perceive yourself as a fool if you are one, right?

Links:
http://www.naturallyobsessed.com/
http://www.thirteen.org/naturally-obsessed/

Sunday, September 23, 2012

38th International Conference on Very Large Data Bases

Last month, I attended the 2012 edition of the VLDB conference in Istanbul, Turkey. VLDB is a top-tier conference in computer science with a focus on data management; I consider SIGMOD (ACM) and ICDE (IEEE) to be its "sisters" in the computer science conference family. My paper entitled Mining attribute-structure correlated patterns in large attributed graphs, written in collaboration with Wagner Meira Jr. and Mohammed Zaki, was presented as part of the Graph Statistics and Summarization session of VLDB. Moreover, I had the opportunity to watch many other talks, interact with the data management research community in general, and get to know the beautiful city of Istanbul.

About my talk, I think it was one of the best ones I've given so far. In fact, I was a little bit nervous about it, since it was my first talk at a major CS conference. To make things even harder, my current advisor and my two past advisors were in the audience -- some people feel more comfortable with a familiar audience, but in my case it works exactly the other way around. I put a good amount of work into the preparation of the slides, but I still have to improve my slide design skills a lot. My main goal was to combine textless slides with a clear speech. As expected, the speech part was the most challenging one. In the end, I finished my talk on time and also answered one question about it. Moreover, the overall feedback I received after the talk was mostly positive.

Regarding the other presentations, I was able to watch some of them. However, choosing which session to attend among 7 parallel sessions was difficult. I hope someday these big conferences will offer some kind of session recommendation, so that you upload some of your papers and then get a list of sessions/talks aligned with your interests. I attended all the keynote talks but did not really pay attention to them; in my experience, keynote talks are usually boring, and I've never been able to get something useful out of them. On the other hand, I enjoyed the research sessions a lot. Here is the list of the talks I attended:


  1. Size-l Object Summaries for Relational Keyword Search;
  2. Injecting Uncertainty in Graphs for Identity Obfuscation;
  3. Measuring Two-Event Structural Correlations on Graphs;
  4. Efficient Reachability Query Evaluation in Large Spatiotemporal Contact Datasets;
  5. Boosting Moving Object Indexing through Velocity Partitioning;
  6. Multiple Location Profiling for Users and Relationships from Social Network and Content;
  7. Fast and Exact Top-k Search for Random Walk with Restart;
  8. The Filter-Placement Problem and Its Applications to Minimizing Information Multiplicity;
  9. Labeling Workflow Views with Fine-Grained Dependencies;
  10. gSketch: On Query Estimation in Graph Streams;
  11. Answering Top-k Queries Over a Mixture of Attractive and Repulsive Dimensions;
  12. Bayesian Locality Sensitive Hashing for Fast Similarity Search;
  13. Challenging the Long Tail Recommendation;
  14. Ranking Large Temporal Data;
  15. Indexing the Earth Mover's Distance Using Normal Distributions;
  16. Stochastic Databases Cracking: Towards Robust Adaptive Indexing in Main-Memory Column-Stores;
  17. A Moving Object Index for Efficient Query Processing with Peer-Wise Location Privacy;
  18. A Data-Based Approach to Social Influence Maximization;
  19. The Complexity of Social Coordination;
  20. Who Tags What? An Analysis Framework;
  21. Whom to Ask? Jury Selection for Decision Making Tasks on Micro-blog Services;

As soon as I set my reading pace to normal levels again, I may summarize some of these papers here.


Visiting Istanbul was an amazing experience. The city has a great history -- it was the capital of both the Byzantine and the Ottoman Empires -- that can still be explored through museums, mosques, churches, castles, etc. Also, the city's topography and hydrography -- it is divided by the Bosphorus Strait and the Golden Horn -- provide great views from several parts of the city. The picture included in this post was taken during a boat tour through the Bosphorus Strait; it shows the Yoros Castle and the Turkish flag. Because VLDB took place during Victory Day, a very important national holiday celebrating the independence of Turkey, we got used to seeing Turkish flags all over the city.






Thursday, September 20, 2012

How to succeed as a Ph.D student?

How can you maximize your chances of succeeding as a Ph.D student? This is the question I've been thinking about since I received the results of my applications to grad school earlier this year. Nothing is more appropriate than applying the scientific method to identify a set of powerful guidelines for acing graduate school. After reading a good amount of material on graduate school, research methodology, and personal management, and also applying some of the proposed techniques, I think it is time to summarize some ideas here.

What is a Ph.D?

The most important question regarding a Ph.D is: what is a Ph.D, exactly? According to Wikipedia, a Ph.D (Doctor of Philosophy) is a postgraduate academic degree awarded by universities. The term 'philosophy' means 'love of wisdom' and is not necessarily related to philosophy as a field of study. In practice, a Ph.D is a qualification to conduct independent research in industry, academia, government, etc.

A Ph.D degree usually requires coursework, qualifying exams, teaching experience, and research work presented in the form of a dissertation and an oral defense. While working on research, the Ph.D candidate is supervised by a more experienced researcher, usually a professor. The dissertation and its oral defense are evaluated by a committee of experts in the topic of the candidate's research. If the committee recognizes the scientific merit of the dissertation, the Ph.D degree is granted to the candidate.



Though composed of a set of requirements, the main component in the course of a Ph.D candidate towards his or her degree is the development of original research. Matt Might describes this research and contextualizes it in terms of human knowledge using pictures in his Illustrated Guide to a Ph.D.


Why?

The second most important question when you think about a Ph.D is: why would you get a Ph.D in the first place? Motivation plays a key role in the success of a Ph.D student. As a consequence, it must be very clear in the student's mind what a Ph.D can add to his or her life and career.



Whether getting a Ph.D is worth it or not is a controversial topic. In practical terms, the main goal of graduate school is to train a new generation of university professors and researchers. I believe that in order to be objective about the reasons to pursue a Ph.D, it is interesting to compare it against a job in industry, which is the most likely alternative for someone considering grad school. According to Matt Welsh, things that can be learned doing a Ph.D but usually not in industry include:
  • Reading and writing research papers;
  • Running experiments;
  • Giving talks;
  • Figuring out what problems need to be solved;
  • Becoming a leading expert in an area;
  • Exploring the boundary of human knowledge;
  • Dealing with lots of freedom and a lack of structure;
  • Working for yourself;
  • Working on the hardest problems.
Peter Bentley gives broader reasons to pursue a Ph.D:
  • To achieve something significant (other than money);
  • To discover or learn something new;
  • To improve yourself and your life;
  • It fits you.
Jamie Lawrence offers a more pessimistic view on the motivations for going to graduate school -- after quitting his Ph.D. He mentions the following ones:
  • Qualification;
  • Dr.
  • Curiosity;
  • etc.
Regarding professional prospects, a Ph.D degree may increase someone's opportunities to find a satisfying job. However, as Thomas Benton (a pseudonym) fiercely describes, the market for Ph.Ds, especially in the humanities, is disappointing. The truth is that there are too many Ph.Ds for too few tenure positions, and graduate students end up being the cheap fuel that keeps the higher education engine running. David Goodstein, from Caltech, has a more apocalyptic view of the oversupply of Ph.Ds as a result of the exponential growth of science over the past 300 years. He makes a very interesting analogy between the evolution of science and the universe based on the Big Bang-Big Crunch theory. Basically, an average university professor advises about 15 Ph.Ds during his career, but only one of them will become another university professor, because society cannot support such exponential growth in science anymore.

Moreover, the cost of a Ph.D degree is definitely high. The great majority of Ph.D students do not actually pay tuition, being supported during grad school instead. However, because this support is usually much lower than salaries in the job market, a Ph.D costs approximately the price of a house. To make things even worse, the earning premium for a Ph.D is, on average, 26%, while that for a master's degree is 23%, which means that the tough years in graduate school do not actually lead to a big paycheck. In fact, a consequence of the current scenario is that the most brilliant American students have been avoiding graduate school for the past 40 years. However, these missing students have been replaced by foreign students, myself included, who are more willing to tolerate poor conditions and prospects.

Dan Pink, a motivational guru, argues that there are three factors that motivate people to work: purpose, mastery, and autonomy. Ph.D students deal with these factors in a very interesting fashion. Regarding purpose, Ph.D students are expected to solve relevant problems as part of their dissertations. Moreover, a Ph.D degree is essentially about mastering a topic. Finally, Ph.D students have a high degree of autonomy, since they are developing the capacity to conduct independent research. Back to the purpose aspect, I found this amazing quote from Albert Szent-Györgyi about it:


                "If a student comes to me and says he wants to be useful to mankind and go into 
                research to alleviate human suffering, I advise him to go into charity instead.
               Research wants real egotists who seek their own pleasure and satisfaction, 
                but find it in solving the puzzles of nature."



The reason I like this somewhat selfish quote so much is that it removes a great part of the weight carried by science in general. It is hard to imagine that someone is paying a fair amount of money for people to "solve puzzles", but nobody said that the puzzle cannot be curing cancer or abolishing world hunger.



Courses/Qualification

Taking courses is usually part of the qualifying requirements in graduate school. There is a common agreement that graduate school is more about depth than breadth, differing from undergraduate school in this respect. In fact, in graduate school you are expected to learn more outside of classes. Ronald Azuma gives a very good description of learning in graduate school:

                "It's like teaching swimming by tossing students into the deep end of the pool
                and seeing who makes it to the other end alive and who drowns"

Grad students have more freedom and flexibility than undergrads to choose the courses they take. However, this freedom and flexibility must be handled wisely, and it is recommended to ask your advisor and other students which courses to take. Dianne O'Leary emphasizes that two priorities of a graduate student should be: (1) taking elementary courses to get prepared for the advanced graduate-level courses, and (2) fulfilling the course requirements and exams defined by the department. Another good recommendation (from Frank Vahid) is taking the easier classes first, since the beginning is a critical period in graduate school.

I like to see courses and qualification in graduate school as part of the broader goal of becoming a world expert in a given topic. As a consequence, graduate-level courses require more initiative and preparation hours (from 3 to 4 hours for each hour in class). However, more than learning the topics, it is important to remember what you have learned. Periodic reviewing is the only way to retain knowledge, but it takes even more discipline than learning itself.

Another important aspect of graduate-level courses is learning advanced topics. As we get closer to the frontier of human knowledge, mastering complex techniques becomes more important. Calvin Newport once mentioned the case of a highly successful young researcher whose secret has been learning new promising topics and then applying them to solve hard problems. One good strategy for mastering complex topics that I have already experimented with is the so-called Feynman Technique, which is basically learning as if you were supposed to explain the topic to a new student, using a top-down approach, examples, analogies, and simple language. In this post, I'm applying some ideas from the Feynman Technique to learn about graduate school.

Being a good student in graduate courses can be a good way to get close to faculty members who may become future advisors and research collaborators, and who may write recommendation letters for you. Also, advanced projects developed in these courses may lead to publications and even be included as part of your dissertation, so keep in mind that an extra effort may be rewarding.




A more controversial view of graduate courses is given by Matt Might, who says that focusing on grades or coursework is one of the easy ways to fail a Ph.D: straight As do not give you a Ph.D, and after you finish, nobody really cares about your grades. Stephen Stearns goes even further and says that lectures are an inefficient way for Ph.D students to learn; they should apply more active learning strategies, such as focused seminars.

Advisor

According to the guide How To Do Research In the MIT AI Lab, the choice of thesis advisor is the most important decision made in graduate school. Advisors are often the ones who apply for funding, meet weekly with students to talk about progress, review papers, recommend topics, introduce students to other professors and researchers, promote the student's work, etc. Nevertheless, advising is a highly subjective matter, which may lead to all sorts of conflicts.



Students should consider prospective advisors as one of the criteria when selecting graduate schools to apply to. Moreover, searching for an advisor should be a top priority for first-year Ph.D students. In general, good advisors are those whose students finish their Ph.Ds and have successful careers. Therefore, it is important to talk to previous and current students of a given professor in order to gather information about him or her as an advisor. It is also relevant to know whether the advisor is tenured -- tenure means a guaranteed position as a professor -- or not. Non-tenured professors are generally more available and motivated; the downside is that they have less funding available and are less experienced in research.

Besides these more general guidelines, choosing an advisor also depends a lot on the specific student under consideration. In particular, the student's and the professor's research interests and preferences (e.g., level of guidance required/given, individual or group projects) should be as compatible as possible. Advising styles range from a strong master/apprentice style to a passive hands-off style; Gordon Davis recommends avoiding these extremes. Moreover, he makes a very interesting point regarding the students' evolution along their Ph.Ds and how different advising styles fit better with each stage of this evolution. Matt Might also refers to this evolution, stating that the advisor should be hands-on in the early stages but that, towards the end, the student should know more than the advisor about her topic. Finding the exact time for this inversion is a challenge, and bad timing is one of the main reasons Ph.Ds fail. Another very good recommendation by Gordon Davis is discussing a 10-year plan with the student and then working on a Ph.D that best fits this plan.

In my limited experience (two advisors), communication is a key aspect of a successful advisor-advisee relationship. Each party should be aware of what is expected from the other (like a contract). Miscommunication -- rather than misbehavior -- was the source of most of the conflicts between students and their advisors that I have seen so far. In particular, I believe that many students do not realize that in graduate school they are the ones who take on the duties and pressure. In fact, the typical advisor is a source of duties and pressure rather than someone supposed to alleviate them.




A very funny view of advising is given in Thesis Prevention, a set of guidelines for a fictitious adversarial advisor. I have yet to meet a student who does not recognize his or her advisor at some point while reading this guide. My favorite guidelines are:

  1. Do not prepare for supervisions;
  2. Don't concentrate during meetings;
  3. Focus on vague (philosophical) ideas.

Time management

One of the most important lessons I've learned about how to develop successful research is the importance of time management. Navigating from long-term goals (e.g., becoming a successful researcher, getting a Ph.D, publishing a paper) to very short, specific tasks (e.g., reading a paper, plotting a result) is maybe the greatest challenge in the life of a Ph.D student. Therefore, students should become experts in avoiding procrastination and making constant progress as a means to increase their productivity and, as a consequence, their success in graduate school.

A Ph.D is a long-term, complex project that depends on many people (student, collaborators, advisor, thesis committee, journal and conference committees), costs a good amount of money (enough to buy a house, according to Matt Welsh), and involves high risks. Nevertheless, most of the Ph.D students I know refuse to make use of very simple time management techniques in graduate school. The lack of proper time management is hard to perceive, since it has indirect implications for several other aspects of graduate school. But don't blame yourself: "failing to plan is planning to fail".


My major sources of material about time management are Calvin Newport's blog Study Hacks and Scott Young's blog Get More from Life. Calvin is a computer science professor who got his Ph.D from MIT and is somewhat obsessed with productivity and impact. One of my favorite posts from Calvin is about how he achieves outstanding levels of productivity working 8-5 by applying what he calls fixed-schedule productivity. The idea is to first fix your schedule and then work backwards to make everything fit. However, in order to make this method work, it is necessary to be ruthlessly results-oriented, focusing on what really matters. Frank Vahid makes a similar point, saying that the key to balancing multiple tasks in graduate school is refusing some of them. Moreover, having a regular schedule, setting priorities, keeping to-do lists, avoiding tasks of negligible importance (e.g., checking email and Facebook, instant messaging, reading news), working in blocks of uninterrupted time, and getting enough sleep are effective time management strategies. Scott Young shares a similar opinion, advocating that you can actually achieve higher productivity by working less, if you are smart about it.



Matt Might disagrees with Cal Newport about the regular schedule, mentioning it as one of the main reasons Ph.D students fail: according to him, graduate school requires contemplative labor during days, nights, and weekends. Jamie Lawrence also questions the effectiveness of time management in graduate school, but for different reasons; he believes that complex tasks such as inventing and comprehending are simply unmanageable in practice.



In my opinion, a regular schedule and time management techniques in general are indeed useful in graduate school, so I decided to apply them systematically. Some of the guidelines I want to follow are:
  • Setting yearly, monthly, weekly, and daily goals with their respective priorities;
  • Breaking goals into sets of specific tasks;
  • Working in 3-hour chunks with 30-minute breaks;
  • Tracking the number of hours of focused working;
  • Setting a regular schedule (8-9 hours a day, 5-6 days a week);
  • Avoiding any source of procrastination;
  • Working on one project per day;
  • Working on up to 3 projects at the same time.
Randy Pausch proposed a list of questions that may help in priority setting:
  • Why am I doing this?
  • What is the goal?
  • Why will I succeed?
  • What happens if I choose not to do it?


Philip Guo, in his The PhD Grind, describes some of his productivity rages -- one of them consisting of 72 consecutive days of programming, with 5 days of break and a 10-hour work routine -- in which he did most of the work that led him to graduate. Though this kind of "mode" may be useful, I truly believe it should be avoided as much as possible. A good motivation to avoid these rages is that they are not compatible with the kind of creative work that leads to relevant research.

One of the most interesting and polemical articles about time management in research that I know of was published in Nature. It describes the daily routine of a "24/7 lab" -- the Brain Tumor Stem Cell Laboratory at Johns Hopkins University School of Medicine, directed by the neuroscientist and neurosurgeon Alfredo Quinones-Hinojosa. In this lab, students are supposed to work over holidays (including Christmas), meetings run until after 10:00 pm, and members receive calls at 6:00 am. Quinones-Hinojosa and his lab are very successful, but their working routine has been condemned by the research community. Nevertheless, it is hard to argue against a person who attributes his trajectory from illegal immigrant from Mexico to Berkeley and Harvard alumnus, before joining one of the US's leading research hospitals, to hard work. This same article mentions a study that found that the average scientist works 50 hours a week and that more working hours tend to lead to more publications. However, there is also evidence that highly creative scientists have broader interests and more hobbies than less creative ones. The idea of a 24/7 lab was later challenged by Julie Overbaugh, who argued that this intense routine penalizes creativity and the flow of ideas in research.

Money

Money does not seem to be a big problem in the life of a graduate student. I have already mentioned that a Ph.D is an expensive degree, but this cost actually comes from the low value of graduate school stipends compared to salaries in the job market. As Marie desJardins said, "Although nobody ever got rich being a graduate student, you probably won't starve either".


The typical sources of money for Ph.D candidates are:
  • Fellowships: Based on merit and given by government agencies (e.g., NSF, DOE) or private organizations (Google, Microsoft, Yahoo, Samsung);
  • Research Assistantships: Part of a professor's grants, associated with a research project;
  • Teaching Assistantships: Given by the department to students who help professors in course-related activities (e.g., grading, tutoring).
TAships are usually given to first-year students who are still looking for advisors and research topics. Around the second year, students move to RAships in order to start research. Fellowships require applications and have deadlines throughout the year. While a student with a fellowship is free to work with any professor -- and, as a consequence, on any topic -- it is harder for fellows to interact with professors in comparison to TAs and RAs. In the particular case of an RA, it is important to notice that, once funded by a given project, the student's research is constrained to the topics associated with that project. An interesting analysis of the difference between an RA position and a fellowship in the life of a graduate student is given in The PhD Grind, in which the fellowship did not work in the student's favor due to the lack of motivation for him to work on a particular problem and collaborate with a specific professor.


Fellowships, TAships, and RAships usually cover tuition and also provide a stipend for the student. However, TAships and RAships are often available for only nine months of the year. During the summer, students can apply for internships, which pay about 5 times the graduate school stipend, as a means to cover their expenses and also get some work experience. Moreover, students also need funding to attend conferences, which may come from travel fellowships or grants related to a project.

A general guideline for graduate students is to keep living expenses in check. Running into debt may shift the student's focus from graduate school to alternative ways of earning more money, and this extra money will likely require some kind of effort unrelated to finishing the Ph.D.

Teaching

Besides supporting graduate students, being a teaching assistant is also an effective way to get teaching experience, which is useful for students who aim to become professors. As a side effect, teaching assistants alleviate the burden on professors, who then have more time to do research. Today, there is a recurrent complaint that TAships, and even graduate school in general, have as one of their main purposes reducing the budget of universities: TAs are a much cheaper replacement for professors, playing a key role in the economic viability of mass education. Following the same line of thought, RAs are cheap manpower for conducting research projects.


TAs usually receive some kind of training from the university or department before they start working. Dianne O'Leary gives a list of good recommendations for TAs:
  • Be well prepared (read the textbook, know what was covered in class, solve homework problems);
  • Be fair in grading;
  • Treat students with respect;
  • Give quick feedback;
  • Be alert to any attempt of cheating.
Teaching is a personally rewarding experience, but being a good teaching assistant does not give you a Ph.D degree. Therefore, I include teaching among those tasks that may conflict with graduation if not managed sensibly.

Internships

Internships are a great opportunity for students to get some work experience outside academia. In particular, for students who do not like the idea of staying in academia, internships may become job opportunities after graduation. The typical period for internships in graduate school is the summer: companies and other organizations select students based on applications, and each internship lasts about three months.



Internships can be focused on research or development, depending on the interests of the organization that provides the position. Philip Guo describes his internship at Google as development-oriented, since he spent his time working on his own open-source project. On the other hand, while he was at Microsoft Research, he collaborated with a famous researcher and wrote 3 papers.

Internships and teaching assistantships are similar in the sense that they do not necessarily help the student graduate. Internships focused on research may lead to publications and become part of a dissertation, as happened to Jure Leskovec, a former Carnegie Mellon Ph.D student now at Stanford. Jure was an intern at Yahoo! Research, HP Labs, and Microsoft Research, publishing at least one paper in collaboration with researchers from each of these laboratories. However, his case is the exception and not the rule. Calvin Newport describes that he once asked one of his professors about a summer internship he was considering, and the professor told him that spending those three months working on paper submissions was a better idea.

Research / Thesis

A Ph.D degree testifies that someone is able to conduct original research, extending human knowledge on a given subject. The research developed during a Ph.D has a thesis as its main product. The function of the advisor is to assist the student in the development of his or her thesis, which will be evaluated by a committee of experts who decide whether the Ph.D degree should be granted to the student or not.

Finding a topic/problem to work on is one of the greatest challenges in the life of a Ph.D student. According to Richard Hamming, working on important problems is a prerequisite for doing great research, because if you don't work on important problems, it is unlikely that you will do important work. Moreover, there is an agreement that the student must enjoy the subject and have fun solving the problem; working on a problem you don't like for several years is academic suicide. Hamming goes further, saying that emotional commitment is a necessary condition for the development of first-class work.

One of my favorite pieces of advice about finding a research problem, given by Edsger Dijkstra, is to always try to work on the boundary of your abilities, so that you avoid routine problems as much as you can but, at the same time, remain realistic about your capabilities. It is good to keep a list of interesting open problems and then discuss them with your advisor and colleagues. Tracking technology trends and changes may also lead to the discovery of interesting research problems.




A Ph.D has a limited duration, which constrains the space of problems a Ph.D student can solve. For instance, it took 358 years of intense work by many great mathematicians to prove Fermat's Last Theorem, so proving this theorem was certainly not a good research problem for a Ph.D student. On the other hand, derivative dissertations, in the sense that they solve a slightly twisted version of an existing problem and/or improve an existing solution by a negligible margin, are usually boring. Given this trade-off between difficulty and feasibility, Marie desJardins recommends taking a central problem that is solvable and acceptable and then extending it to riskier versions in order to make the thesis more exciting. As general advice, she says that a good problem should require both solid theoretical work and good empirical evaluation. According to Matt Might, aiming too high or aiming too low are two of the top reasons to fail a Ph.D. Uri Alon has an interesting scheme based on difficulty x interest to describe the search space of research problems.



The idea is that ideal problems are very relevant but also relatively easy to solve. As researchers gain experience, they can navigate towards harder problems. Along the same line, Calvin Newport says that we should avoid complexity when seeking problems. A very important lesson he learned from one of his professors was to always reduce problems to their purest, simplest form; once you understand the underlying math of the problem, then add more complexity to it. On the other hand, he recommends seeking complexity in your technical skills: once you master an important and complex skill, you can apply it to solve many relevant problems. Still on this topic, Lasse Pedersen says that a great paper often has a result that everyone can understand and apply, but not everyone could have derived.

Michael Nielsen defines two interesting types of researcher that can be extremely successful: the problem-seeker and the problem-solver. The problem-seeker is always exploring new problems and then giving simple solutions to them, while the problem-solver takes a well-known relevant problem and then applies his or her arsenal of techniques to solve it. Based on these definitions, Maurice Herlihy can be considered a problem-seeker, judging by what he said:

               "I am opportunistic. I have a general area of interest, namely, concurrency, but I worry
               about becoming too specialized, so I make an effort to balance my attention between 
               more applied and more theoretical areas, because, often, that's where you have 
              the opportunity to write the first "easy" paper and let the ones who write the difficult
              papers cite your paper."

In his guide on how to publish in the SIGKDD conference, Eamonn Keogh states that a good research problem is one where, if you solve it, you can make money, save lives, or help children learn a new language. It is important to get real data, make incremental progress, and evaluate your success using a clear metric. Furthermore, your research statement should be falsifiable.


A good way to find new problems and key results is attending (or watching) seminars and talks in general. I've never been good at getting useful information from talks, so I've decided to apply an off-the-shelf method proposed by Ravi Vakil, known as the "Three Things" exercise. The idea is that if you can get three things out of a talk, then it was a successful talk. These things can be:
  • A definition;
  • A theorem;
  • A key example;
  • A motivating problem;
  • A question you want to ask;
  • Anything else similar.
Once you think you have a relevant and well-defined research problem, it is time to review the literature related to it. According to Dianne O'Leary, the subject of the problem should be timely and active, but the area should not be so crowded that your chances of being scooped are high. Richard Hamming mentions that one important characteristic of great scientists is that they tolerate ambiguity very well:

"They believe the theory enough to go ahead; they doubt it enough to notice the errors and faults so they can step forward and create the new replacement theory."

From a problem-solver point of view, after understanding the problem and the key results related to it, it is time to do the creative work. In my experience, most researchers do not spend a sufficient amount of time with a pen and a piece of paper, working on original research. Proper time management and a results-oriented approach should lead you to spend a good amount of time developing original research, since researchers are judged in terms of their problem-solving capacity. However, this does not mean that the problem must be solved from scratch; the idea is to fill the gaps between existing knowledge and the particular problem you are tackling. According to Richard Hamming, you should have a reasonable attack on your problem. Stefan Savage emphasizes that having a "secret weapon" or "unfair advantage" is desirable.

John Steele gives a general algorithm for solving research problems that works as follows:
  1. Take a problem of interest;
  2. Become familiar with the key results;
  3. Strip away as much mumbo jumbo as you can (e.g., fix definitions);
  4. Identify the core phenomena of the area;
  5. Pose simple questions that articulate concrete features of the phenomena of interest;
  6. Put a little work into these questions (keep notes);
  7. Whenever you get a concrete theoretical or empirical result, LaTeX it out.
A more general piece of advice from John Steele, one that I find very comforting, is:

                  "You need to put in an honest days work, say three genuine research hours 
                  (pencil in hand or fingers on the keys) and two genuine reading hours. I have
                  never in thirty years known or heard of a student who did this five days a week
                  for two years who did not get a thesis."

In the research community, there is a famous algorithm, said to be based on Richard Feynman's approach to solving problems, that is more concise, albeit less realistic, than John Steele's. For the sake of completeness, here is Feynman's algorithm:
  1. Write down the problem;
  2. Think real hard;
  3. Write down the solution.
Paul Ronney also offers a set of guidelines -- somewhat associated with the fail-fast principle -- that are useful for the development of effective research:
  1. Design something simple;
  2. Build quickly;
  3. Test crudely;
  4. Modify;
  5. Improve;
  6. Repeat.
There is a particular topic that I consider paramount for a Ph.D student but that is not covered by most of the guides I've found: when is the right time to give up. Courage, persistence, and patience are important attributes of a successful researcher, but at some point it is expected that the student will give up on a specific problem because he or she is somehow unable to solve it. In my opinion, if after a reasonable amount of time the student is not able to make any progress on the problem, he or she should at least put it in a latent mode. In fact, keeping a dozen research problems in latent mode is good advice given by Gian-Carlo Rota. He said that this strategy was applied by Richard Feynman, who would check any new method he learned against his set of favorite unsolved problems. Matt Might says that, as a Ph.D student, you have to be willing to fail "from the moment you wake up to the moment your head hits the pillow". The point is that a Ph.D student should always avoid getting stuck for a long time.



One question that I often face regarding a researcher's self-development is how much time should be spent on learning new skills instead of doing actual research. Steven Weinberg argues that you should start doing research and then learn what is needed along the way. This advice somewhat contradicts some points made by Calvin Newport and Michael Nielsen about learning a rich and varied set of tools to attack research problems. Another interesting tip from Steven Weinberg is to "go for the messes", meaning the subjects that are in an intense process of evolution, instead of those that are more stable. The reason is that you can make greater contributions when the topic is "boiling".

Martin Schwartz devotes a whole article to the importance of stupidity in research. The idea is that scientists are able to handle stupidity in a productive fashion, actively seeking out new opportunities to feel stupid. This idea resonates with something said by Richard Dawkins:

"Science, as opposed to technology, does violence to common sense."

Stuart Firestein made a similar point when he defined science as:

                   "Form of ignorance characterised by knowing what you don't
                   know, and being able to ask the right questions".

A corollary of the stupidity assumption, which I took from Terry Tao, is that we should be skeptical about our own work. Moreover, he summarizes what it takes to solve mathematical problems, which I think generalizes to other research problems, as:

  • Knowledge;
  • Experience;
  • Patience;
  • Hard work.

One of the most useful recommendations I found when I started studying effective research methodologies was to keep a notebook (or journal). Things that should be recorded in this notebook include:
  • General research speculations (e.g., a problem that sounds interesting);
  • Remarks about a particular paper you are reading or talk you are watching;
  • List of tasks for the day, week, and month;
  • Outlines of publications and talks;
  • Experiments to be run;
  • Preliminary results and discussion;
  • Formulations, theorems, and math notes in general;
  • Papers to be read later;
  • Etc.
Another aspect that is known to increase the quality of research is collaboration. Collaborators are an important source of feedback, new ideas, and motivation. Furthermore, you may learn new skills from experienced collaborators and also connect with the research community with their help. It is important to keep in mind that the student should be primarily responsible for the work included in the dissertation; however, this requirement still allows collaboration in the development of a dissertation. I found it interesting that in mathematics researchers usually apply the so-called Hardy-Littlewood rule, which states that authors should appear in alphabetical order and all authors share the same credit for publications. Nevertheless, this practice is not popular in other research fields.

A last tip about doing research as a Ph.D student is to never lose sight of the big picture. In other words, you should be able to:
  • Study the proposed problem at several levels and from different points of view;
  • Understand the broader implications of your work to society;
  • Have a clear picture of how your work complements existing research in different fields;
  • Comprehend how your work may be a step towards more ambitious goals.
Reading

Reading is an essential part of the research process. An average researcher is expected to spend more than half of his or her time reading. Matt Might estimates that a typical Ph.D student needs to read about 50 to 150 papers in order to defend the novelty of the proposed thesis.


The two main purposes of reading for a Ph.D student are:
  • Finding unsolved research problems (e.g., future work);
  • Reviewing the literature on a given problem.
While reading a paper, you should try to answer the following questions (by Dianne O'Leary):
  • From where does the author seem to draw the ideas?
  • What exactly was accomplished by this piece of work?
  • How does it seem to relate to other work in this field?
  • What would be a reasonable next step to build upon this work?
  • What ideas from related fields might be brought to bear upon this subject?
Reading a paper carefully is a task that takes several hours. Michael Mitzenmacher gives some important guidelines for reading research papers:
  • Read critically;
  • Read creatively;
  • Make notes;
  • Try to summarize the paper in a few sentences;
  • Compare the paper to other works.


I found this very interesting quote from Paul Halmos about the kind of active reading required for academic papers:

                "Don’t just read it; fight it! Ask your own questions, look for your 
                own examples, discover your own proofs. Is the hypothesis necessary? 
               Is the converse true? What happens in the classical special case? 
               What about the degenerate cases? Where does the proof use the 
               hypothesis?"


Write summaries of the papers you read, so that you do not need to read a paper again if you forget its main ideas or need to write a related work section. Moreover, summaries help you avoid cryptomnesia, which may lead you to claim credit for other people's ideas. I keep summaries of papers I've read in this blog.

It is important to notice that reading must be selective. There is no time to read many papers in full, so it is important to read papers at different levels of detail. According to Michael Nielsen, there is a small number of papers in any field that are worth reading carefully; most papers should be skimmed, and deep reading should be applied only to a few good ones.



Richard Hamming has a very peculiar, albeit interesting, view on the relationship between great research and reading:

             "Reading is important to know what is going on, but reading to get the 
              solutions is not the way to do great research."

Writing

Writing should be a routine activity in graduate school. As I have already mentioned, a notebook is one of the most effective tools for researchers. However, unlike the notebook, which is for your own consumption, at some point as a graduate student you will write papers, book chapters, and your dissertation, which are supposed to be read by someone else. Due to the "publish or perish" philosophy, writing is part of the job of any researcher, as emphasized by Richard Hamming:


           "But the fact is everyone is busy with their own work. You must present it
            so well that they will set aside what they are doing, look at what you've 
            done, read it, and come back and say, ``Yes, that was good.'' "





A very common mistake of young researchers is to separate research time from writing time. The best strategy is to write along the way, and the notebook helps a lot with that. Writing down research ideas, experiments to be run, expected results, partial results, partial conclusions, paper outlines, sketches of figures and plots, etc. is a way to keep a detailed log of the research progress. This log may later become the starting point for drafts of papers and the dissertation. Moreover, it is also very useful to support discussions in meetings with your advisor and other collaborators.

Writing well is difficult, and the best way to improve your writing skills is to practice. Therefore, I decided to make writing a habit by creating this blog. At first, creating a blog seemed like a dumb idea but, in fact, it turned out to be useful to me from several perspectives, allowing me to:
  • Practice writing;
  • Keep summaries of papers I read;
  • Share ideas with colleagues;
  • Pressure myself to put good habits into practice.
One of the ways your writing skills will be put to the test is by submitting peer-reviewed papers. Peer review is, roughly speaking, an evaluation process in which a committee of experts assesses the scientific merit of your paper. It is a self-regulated method, in the sense that researchers evaluate the work of their colleagues, and many researchers are likely to both submit and review papers. Peer review is an established evaluation process in academia -- not only for papers but also for project proposals, awards, etc. -- but it has been the target of criticism in recent years. It is a common saying that committees settle on the lowest common denominator. The main shortcomings of peer review are:
  • It is slow;
  • It favors elitism and the suppression of dissenting ideas;
  • It is subject to conflicts of interest;
  • Results are not checked with rigor;
  • There are no incentives for good reviewing.
Due to these shortcomings, and also the tough competition among researchers to have their papers published in prestigious venues, good writing has become even more decisive for paper acceptance. The idea is to assume the reviewer is a busy (or lazy) person and then reduce to minimal levels the effort he or she needs to assess the merit of the paper. For instance, Eamonn Keogh states the following:

             "If you can save the reviewer one minute of their time, by spending one 
              extra hour of your time, then you have an obligation to do so".

There is general agreement that the typical reviewer forms an impression based on the first pages of the paper, and most reviewers will not "waste" their time reading your paper in full. As Daniel Lemire once wrote in his blog:

           "In science, everybody is busy promoting their own idea and gathering up
            evidence to convince others."

Another important thing I've learned about writing is that it requires a lot of feedback and several iterations. In fact, many people say that writing is re-writing, in the sense that you will need to improve your paper several times until it reaches a high level of quality. In graduate school, the burden of reviewing falls mainly on the advisor, but in practice the advisor may be too busy to contribute throughout the whole writing process. The solution I've applied quite successfully so far is getting feedback from colleagues first (especially those from your collaboration network) and then invoking the wisdom of the advisor only in the final stages. Actually, the wisdom will come in the final stages whether you are expecting it or not.



In the Researcher's Bible, the authors make several interesting recommendations about writing academic papers:
  • Learn to write well;
  • Your paper should have a message;
  • You must present it so that it cannot be misunderstood;
  • Do not try to say too much;
  • Basic framework: claim/hypotheses and then evidence for them;
  • Keep a particular reader in mind;
  • Use examples;
  • State what is new and better about what you have done.
Besides the direct benefit of publishing your work, the guide How To Do Research In the MIT AI Lab makes an interesting point about the benefits of writing for debugging ideas. Duane Bailey recommends stealing stylistic ideas from other papers, something that I also found very useful. In particular, it is a good strategy to take papers from the venue you are aiming for as inspiration for writing a paper that is compatible with that venue.



It is interesting to note that publication patterns vary a lot across research areas. While some fields (e.g., biology and physics) focus on journals, others (e.g., computer science) give more value to conferences. Moreover, publication volumes vary not only among fields but also among researchers. Because the number of publications is sometimes considered a measure of a researcher's success, many researchers publish a lot in order to get recognition. On the other hand, there are several successful researchers who have a dozen papers or even fewer, as described by Daniel Lemire. I do agree with him that some people publish too much without caring about the impact of what they publish. It seems that the attributes a paper must have in order to be published are not exactly the same ones it must have in order to achieve great impact.

                "Today, many assistant professors have published more than star researchers 
                like Claude Shannon, John  Nash or Richard Feynman have in their entire 
                life. Maybe we spend too much time on the publication process, and too 
                little time on the actual science?"


One of the most important pieces of advice regarding writing is the famous saying: "To catch a thief, you must think like a thief." What is funny about it is that in peer review everybody is a thief and everybody is trying to catch a thief as well. Therefore, it should be easy for researchers to review and improve their papers from a reviewer's perspective.

I believe that most of the recommendations for effective writing given here also apply to a dissertation. However, the dissertation has some specific properties that should be treated accordingly. First of all, the typical structure of a dissertation is:

  • Introduction;
  • Technical Contribution 1;
  • Technical Contribution 2;
  • ...
  • Technical Contribution N;
  • Conclusions.




Though it is not necessarily true, I like to see each chapter of a dissertation as the work developed during one year of the Ph.D. Moreover, it is a good idea to submit each chapter of the dissertation as a journal paper, so that the contributions are vetted by the research community. Each of these journal papers can be an extension of one or two conference papers. I've seen this integrated view of dissertation writing and publishing in practice, and it works very well. By spending at least 3 years focused on research, the student can publish about 3 journal papers and 6 conference papers.


Giving talks

Part of selling your work, and yourself, as a student and researcher is giving talks. In fact, talks are a great opportunity to interact with the research community, so leave a good impression. Ronald Azuma points out that if you can speak well, you will earn recognition and distinguish yourself from other students. I have to confess that I do not like most of the research talks and lectures I've seen so far, but I have found some inspiring ones. Sean Carroll summarizes well the importance of giving talks in graduate school:

                 "I’ve actually heard some students say that they love science, but
                     don’t like writing papers or giving talks. That’s like saying you love
           being a butcher, just aren’t very fond of cutting up animals."

The paper How to give a good research talk (Simon Peyton Jones et al.) gives useful guidelines for delivering effective talks. Some of these guidelines deserve emphasis:

  • A talk is a taster, not an in-depth treatment of a subject;
  • Use motivating examples, in particular, start your presentation with one;
  • Treat some aspects with more detail than others;
  • Say enough without saying too much;
  • Don't hide the problems;
  • Don't read the slides;
  • For long talks, have slides to orient the audience;
  • Make eye contact with the audience;
  • Don't overrun; it is selfish and rude;
  • Allow 2-3 minutes for each slide;
  • Know your audience;
  • Keep in mind that people will remember only one or two things from your talk; choose what should be remembered.

My favorite guide on giving research talks is entitled Small guide to giving presentations and was written by Markus Puschel. In this guide, the author applies several useful strategies in his own slides, which makes his points much clearer. These are the pieces of advice from him that I think are most relevant for a graduate student:

  • Preparation is key;
  • Speak clearly, not too fast;
  • Don't put your hands in your pockets or cross your arms;
  • Avoid text. Your brain cannot process reading and listening at the same time, but listening and images are fine;
  • Avoid equations, they are understood only by experts;
  • Use images:
    • Diagrams;
    • Plots;
    • etc.
  • Use humor if you can;
  • Start your presentation by thanking your co-authors;
  • Try not to lose the audience; use slides to win them back;
  • Strive for quality, people remember good presentations;
  • Start with an interesting slide;
  • Don't forget to explain every plot:
    • axes;
    • scales;
    • higher/lower is better;
    • example point.
  • Use contrasts (colors, fonts, sizes) wisely and be consistent;
  • Watch other presentations critically.

In the following figure, Markus Puschel shows how a presentation is judged considering both its visual quality and technical content.



Most of the guides about giving research talks focus on formal talks, but informal talks are also important. Another general recommendation is that you should be able to summarize your research into one-sentence, one-paragraph, and five-minute answers. Discussing your research with your colleagues is a very effective way to get early feedback on your ideas. But you should also give feedback on other people's research, so that you create a network for research collaboration. This network also works for exchanging interesting references, reviewing drafts of your papers, and sharing research experience in general. This idea is based on the Secret Paper Passing Network, described in the guide How To Do Research In the MIT AI Lab.


A very important topic related to giving research talks is how to deal with criticism and questions. As a researcher, you should always be open to criticism and questions, but be sure that you understand them well. For instance, it is a common mistake to answer a question that is different from the one the audience actually asked. Furthermore, it is a good strategy to answer the question in a way that is comfortable for both you and the person who asked it. Try to make everybody happy.

It is expected that you will be nervous before and during your talk; therefore, you should learn how to deal with your nerves. First of all, practice a lot, so that you become familiar with your talk. Matt Might recommends listening to music and doing physical exercises (e.g., push-ups) as means to relax. I've watched some of Matt Might's talks available online and they are impressive. Another great recommendation he gives in his blog is to anticipate questions and have backup slides that address them. In fact, he says he can sometimes even predict who will ask a given question.

Final advice:

Begin with the end in mind!


References

What is a Ph.D?

Ronald Azuma. So long, and thanks for the Ph.D!
David Goodstein. The Big Crunch.
The Economist. The disposable academic: Why doing a PhD is often a waste of time. 
Dan Pink. The Surprising Science of Motivation.

Courses/Qualification
Duane Bailey. A letter to research students.
Total Drek. Unhelpful hints.
Jamie Lawrence. Things I learnt during and about my PhD.
Philip Guo. The PhD grind.
Ronald Azuma. So long, and thanks for the Ph.D!

Time Management
Matt Might. 10 easy ways to fail a Ph.D.
Alan Bundy et al. The researcher's bible. 
Calvin Newport. Time management: How an MIT postdoc writes 3 books, a PhD defense, and 6+ peer-reviewed papers - and finishes by 5:30pm.
Calvin Newport. Some Thoughts on Graduate School.
Julie Overbaugh. 24/7 isn't the only way: A healthy work-life balance can enhance research.
Scott Young. Work less to get more done.
Jamie Lawrence. Things I learnt during and about my PhD.
Philip Guo. The PhD grind.
Richard Hamming. You and your research.
Randy Pausch. Time management tips.
Heidi Ledford. The 24/7 lab.

Money
Philip Guo. The PhD grind.
Marie desJardins. How to be a good graduate student.
Calvin Newport. Time management: How an MIT postdoc writes 3 books, a PhD defense, and 6+ peer-reviewed papers - and finishes by 5:30pm.
Dianne O'Leary. Graduate study in the computer and mathematical sciences: a survival manual.

Teaching
Dianne O'Leary. Graduate study in the computer and mathematical sciences: a survival manual.
Calvin Newport. Time management: How an MIT postdoc writes 3 books, a PhD defense, and 6+ peer-reviewed papers - and finishes by 5:30pm.
Randy Pausch. Time management tips.
Richard Hamming. You and your research.
David Goodstein. The Big Crunch.

Internships
Calvin Newport. Some More Thoughts on Graduate School.
Philip Guo. The PhD grind.

The Thesis
Eamonn Keogh. How to do great research, get it published in SIGKDD and get it cited!
Daniel Lemire. How to write a good research paper.
Uri Alon. How to choose a good scientific problem?
Edsger Dijkstra. The three golden rules for successful scientific research.
Richard Hamming. You and your research.
Calvin Newport. Decoding the impact instinct.
Marie desJardins. How to be a good graduate student.
Randy Pausch. Time management tips.
Timothy Novikoff et al. Education of a model student.
Total Drek. Unhelpful hints.
Jamie Lawrence. Things I learnt during and about my PhD.
Edward Wilson. Advice to young scientists.
Philip Guo. The PhD grind.
Stephen Stearns. Some modest advice for graduate students.
Scott Young. Work less to get more done.
Scott Young. Learn Faster with the Feynman Technique.
Ravi Vakil. The "Three Things" Exercise for Getting Things out of Talks.

Reading
Michael Mitzenmacher. How to read a research paper?
David Chapman. How to do research at the MIT AI lab.
Matt Might. 10 easy ways to fail a Ph.D.
Michael Nielsen. Principles of Effective Research.
Lance Fortnow. Finding Problems to Work on.

Writing
Apoorva Mandavilli. Peer review: Trial by Twitter.
David Chapman. How to do research at the MIT AI lab.
Philip Guo. The PhD grind.
Daniel Lemire. How to write a good research paper.
Eamonn Keogh. How to do great research, get it published in SIGKDD and get it cited!
Alan Bundy et al. The researcher's bible. 



Giving Talks
Simon Peyton Jones et al. How to give a good research talk.
Markus Puschel. Small guide to giving presentations.
David Chapman. How to do research at the MIT AI lab.