Monday, October 20, 2014

Hierarchical In-Network Attribute Compression via Importance Sampling

Our most recent work, entitled "Hierarchical In-Network Attribute Compression via Importance Sampling", was just accepted at ICDE'15. This research was done in collaboration with my advisor Ambuj Singh and with Petko Bogdanov, who is now at SUNY Albany.

This paper focuses on the problem of compressing real values defined over a network structure. Such values can represent average speeds in a road network (see the figure below), opinions in a social network, or gene expression in a PPI network, to name a few.

Speeds at different locations in the California (US) highway network. Large/red means low speed, medium/yellow means average speed, and small/green means high speed.
When managing such network data, especially in scenarios involving large dynamic networks, processing, storing, or communicating a given state (or snapshot) of the network might not be feasible. For instance, consider the individual opinions of all Twitter users regarding a polemical topic (e.g., politics, religion) over time. While managing the individual opinions might be costly, we know that the network structure can explain many of these opinions due to patterns such as homophily and influence. Therefore, one way to approximate the user opinions given the structure is to partition the network into groups that have similar opinions. As long as these partitions can be described in a compact form, this lossy compression is a powerful way to compress values embedded in a network. We have proposed a compression model called Slice Tree, which is a hierarchical partition of a network that minimizes the compression error.
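To make the partition-and-average idea concrete, here is a minimal sketch (my own illustration, not the actual Slice Tree construction): given node values and a partition of the nodes, every value is replaced by its group mean, and the reconstruction error is the resulting sum of squared differences.

from collections import defaultdict

def compress_by_partition(values, partition):
    """values: dict node -> real value; partition: dict node -> group id.
    Returns the group means and the sum of squared errors of reconstructing
    every node value by the mean of its group (a toy stand-in for the
    compression error minimized by Slice Tree)."""
    groups = defaultdict(list)
    for node, group in partition.items():
        groups[group].append(values[node])
    means = {g: sum(v) / len(v) for g, v in groups.items()}
    sse = sum((values[n] - means[partition[n]]) ** 2 for n in values)
    return means, sse

# Toy example: homogeneous groups compress with little error.
values = {0: 1.0, 1: 1.2, 2: 0.9, 3: 5.0, 4: 5.3}
partition = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B"}
print(compress_by_partition(values, partition))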

Computing slice trees is a computationally hard problem, so we spent a lot of time trying to scale their construction to large networks. One of the main contributions of the paper is showing that slice trees can be computed using importance sampling. In particular, we have shown that by biasing the sampling process towards values that deviate from the mean of the network, we can quickly spot the best partition/slice. Therefore, an efficient way to build slice trees is to recursively sample values as a means of extending the tree so that the partitions generated are as homogeneous as possible.
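A rough sketch of the biased sampling step, under the assumption that nodes are drawn with probability proportional to their absolute deviation from the global mean (the estimators that the paper builds on top of such samples are not reproduced here):

import random

def biased_sample(values, k, seed=0):
    """Sample k nodes with probability proportional to |value - global mean|.
    Nodes whose values deviate the most from the mean are the ones most
    likely to reveal a good slice, which is the intuition described above."""
    random.seed(seed)
    mean = sum(values.values()) / len(values)
    nodes = list(values)
    weights = [abs(values[n] - mean) + 1e-12 for n in nodes]  # avoid all-zero weights
    return random.choices(nodes, weights=weights, k=k)

print(biased_sample({0: 1.0, 1: 1.1, 2: 0.9, 3: 9.5}, k=3))  # node 3 dominates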

We applied this idea to compress several real networks and compared its performance against existing compression approaches, such as the graph Fourier transform (GF). The results show that Slice Tree achieves up to 2-fold gains over the best baseline. We have also presented several results supporting the scalability and compression power of our approach on million-node networks.

Compression quality of Slice Tree (ST) against three baseline approaches.
We are now investigating the temporal aspect of the problem (when values are updated), which brings some interesting challenges and was one of the main issues the reviewers of the current paper raised. More on this subject soon!

Monday, September 29, 2014

Back to blogging!

After more than a year of silence, I've decided to resuscitate this blog. Life as a PhD student has been busy and I've tried to use my free time on activities other than blogging about research. But I'm glad that a few people read some of my posts, especially the one about grad school.

Here is a summary of my PhD so far:

1) I've worked for more than a year on the problem of compressing values over a graph. After trying many different approaches, we came up with an algorithm called slice-tree, which decomposes the network into center-radius hierarchical partitions and compresses the values inside each partition to their average.



I should write a post about this project here soon. We have submitted a paper on this called Hierarchical In-Network Attribute Compression via Importance Sampling and are now waiting for the reviews. A previous version of this paper was rejected by VLDB some months ago.

2) I've also taken some courses towards fulfilling the requirements for my degree at UC Santa Barbara: Computational Geometry, Combinatorial Algorithms, Advanced Topics in Databases, Searching over Big Data, and Advanced Topics in Data Mining.

3) Last summer, I was an intern at the IBM Thomas J. Watson Research Center, working on spatio-temporal data management. I'm still working on this project and I should also summarize it here at some point in the future. Working at IBM was really cool and I hope I can get similar internship opportunities in the coming years.

More posts soon.

Monday, December 17, 2012

On Compressing Weighted Time-evolving Graphs

I just finished reading this short paper, published at the CIKM conference this year. It is authored by several researchers from the University of Melbourne, Australia, some of whom are also co-authors of the ciForager paper I summarized here a month ago. Jian Pei, from Simon Fraser University, is also one of the authors.

The problem studied by the authors is compressing a network whose edge weights are dynamic. One example they mention is the DBLP co-authorship network, where pairs of authors may have different co-authorship strengths over time. This setting differs from the traditional graph compression problem studied in the literature, which focuses on compressing the structure of a static graph. In general, we compress data to save storage space and communication cost when dealing with large databases, such as real-life graphs that arise from social networks, the Web, etc.

The approach described in the paper is based on a 3D tensor representation of a dynamic graph. Tensors are basically multi-dimensional arrays and they have been applied to several graph mining problems in recent years. Here, the graph is a cube (vertices x vertices x time) with weights in the cells. In practice, because we will be dealing with sparse graphs, we can store the sparse version of this tensor as a set of locations (indices) and non-zero values (weights). The idea is to reduce the entropy of this tensor representation by merging subsets of weights across dimensions. For instance, edge weights 4, 5, and 6 can all be mapped to the value 6. The above figure shows an illustrative dynamic graph and how edge weights can be merged as a means of reducing the entropy of the edge values. The challenge is to select the appropriate subsets of values to be merged while keeping some relevant properties of the dynamic graph (e.g., weighted shortest path distances for snapshots). The authors propose the use of a hierarchical cluster tree to merge edge values, leading to an optimal compression given an error threshold.
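To illustrate the entropy-reduction effect with a toy example of my own (not code from the paper), the snippet below stores a tiny dynamic graph as a sparse (u, v, t) -> weight map and shows how merging the close weights 4, 5, and 6 into a single value lowers the entropy of the weight values:

import math
from collections import Counter

def entropy(weights):
    """Shannon entropy (bits) of the empirical distribution of the weights."""
    weights = list(weights)
    n = len(weights)
    return -sum(c / n * math.log2(c / n) for c in Counter(weights).values())

# Sparse tensor of a toy dynamic graph: (u, v, t) -> weight.
tensor = {(0, 1, 0): 4, (0, 1, 1): 5, (1, 2, 0): 6, (1, 2, 1): 20}
print(entropy(tensor.values()))  # 2.0 bits: all four weights are distinct

# Map 4, 5, and 6 to a single value; fewer distinct values -> lower entropy.
merged = {key: (5 if w in (4, 5, 6) else w) for key, w in tensor.items()}
print(entropy(merged.values()))  # ~0.81 bits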

Non-zero edge weights are always rounded to their nearest integer. The locations are represented as a sequence of integers where each position gives the number of zeros before the next non-zero position. These integers are concatenated with the sequence of values into a single string LocVal, which is then encoded using an entropy encoding method (e.g., arithmetic coding, Huffman coding). Integers are represented as variable-length quantities (VLQs), which represent large integers as sequences of binary octets. In particular, the first bit of each octet tells whether there is a next octet that is part of the number being represented. As a consequence, an integer x can be represented using 8*ceil(log2(x+1)/7) bits. The graph is represented as a sequence |V||T|Meta(LocVal)H(LocVal), where the first two values are the dimensions (number of vertices and snapshots) of the dynamic graph and Meta(LocVal) contains the unique edge weights (rounded to integers) and their frequencies. H(LocVal) is the entropy of LocVal, but I have no idea why it appears like that in the representation. The total length of the graph representation depends on the lengths of |V|, |T|, and Meta(LocVal), and also on the entropy of LocVal. Therefore, we can compress this representation by reducing the entropy (or heterogeneity) of LocVal. The maximal compression can be achieved by averaging all the edge values across snapshots, but this approach would certainly remove important properties of the original dynamic graph.
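For concreteness, here is a minimal VLQ encoder along the lines described above (byte-order and flag conventions vary between implementations, so treat this as a sketch rather than the paper's exact format):

def vlq_encode(x):
    """Encode a non-negative integer as a variable-length quantity:
    7 payload bits per octet, with the high bit set on every octet
    except the last one."""
    octets = [x & 0x7F]
    x >>= 7
    while x:
        octets.append((x & 0x7F) | 0x80)
        x >>= 7
    return bytes(reversed(octets))

print(vlq_encode(300).hex())  # '822c': 2 octets = 16 bits = 8*ceil(log2(301)/7)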

The solution presented by the authors (in the form of the MaxErrComp algorithm) is the use of a cluster tree to merge values. Two values are merged whenever they differ by less than a cutoff value, and merged values are set to their average. This cutoff is decreased gradually as long as the error of the compression, measured as the maximum error of the weighted shortest path distances over all snapshots, remains below a threshold epsilon.
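Below is a hedged sketch of that greedy merging idea as I read it (not the paper's exact MaxErrComp procedure); error_of is a placeholder for the maximum weighted-shortest-path error over all snapshots, which is not implemented here:

def merge_closest_values(values, error_of, epsilon):
    """Repeatedly merge the two closest distinct edge-weight values into
    their average, stopping as soon as the compression error would exceed
    epsilon. `error_of` maps a list of surviving weight values to the error
    of the corresponding compressed dynamic graph (placeholder)."""
    current = sorted(set(values))
    while len(current) > 1:
        # index of the closest pair of remaining distinct values
        i = min(range(len(current) - 1), key=lambda j: current[j + 1] - current[j])
        merged = current[:i] + [(current[i] + current[i + 1]) / 2.0] + current[i + 2:]
        if error_of(merged) > epsilon:
            break
        current = merged
    return current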

They compare the proposed approach against two baselines. The first, called MaxErrRandom, merges two values randomly (i.e. the hierarchical clustering method is not applied). The second is a technique for compressing static graphs called StatComp. In order to apply StatComp to a dynamic setting, they compress each snapshot of the input graph.

The datasets used in the evaluation are a portion of the DBLP co-authorship network (3K vertices, 56K edges, 12 years) and the Enron email network (2K vertices, 100K edges, 28 months). The following figure shows the encoding cost achieved by the three techniques for varying error thresholds on both datasets. MaxErrComp outperforms the baseline approaches, and StatComp achieves the worst results. MaxErrRandom and MaxErrComp only achieve similar results in extreme cases, namely zero error (no compression) or errors so high that the compressed graphs degenerate into the average graph.


In another experiment discussed in the paper, the compression rate was fixed and the error was then analyzed (the opposite of the previous experiment). A statistical test shows that MaxErrComp significantly outperforms the baseline techniques.

This is a curious paper. First, I was surprised that it is not a full paper, because both the problem studied and the solution proposed reveal potential for a greater contribution than what we often find in short papers. Then, when I started reading it, my expectations increased as I found some nice background topics (e.g., information theory, coding, tensors). But I was frustrated by the technical and experimental sections. MaxErrComp sounds like a naive approach to me. The idea of merging values using a greedy approach and then evaluating the compressed graph with an overall error metric is not very challenging. Maybe the problem would become more interesting if a relative error metric were considered instead. I also missed some discussion about how hard the original problem is. Finally, the evaluation was weak and the results achieved by MaxErrComp do not seem very impressive, even compared against the random baseline.

Link: http://people.eng.unimelb.edu.au/akan/dgraphs/dyn-graphs-compr.pdf

Tuesday, November 27, 2012

Wavelet Synopses with Error Guarantees

I just finished reading this cool paper about providing approximate query answers with bounded expected error for individual queries, based on wavelet representations of the data. The problem solved is very clear and the techniques proposed are supported by a rich background theory, especially regarding wavelets, probability theory, optimization, and dynamic programming. The paper was published at the ACM SIGMOD Conference 2002, when the authors were working at Bell Labs. There is also a journal version, published in ACM TODS two years later, that is basically a copy of this paper with some additional details (e.g., proofs, explanations, and results). I've lost track of how many hours I spent on this paper and I still think I should understand it more deeply, but I decided to let it go for now.

In several scenarios, providing a fast approximate answer to a query is more effective than taking a long time to give the exact answer. Previous work has applied wavelets, a harmonic analysis technique for function decomposition, to represent a database as a compact set of coefficients, known as a synopsis. Based on a good set of coefficients, it is possible to retrieve specific values in the database with small overall error. However, existing wavelet-based strategies for building synopses did not give any guarantee on the error of individual queries. In other words, given a specific query, there were no guarantees regarding the expected error of its answer. This is the problem addressed in this paper.

Wavelet synopses are based on the Haar wavelet function. The idea is simple: given a sequence of values, we keep pairwise averages and detail coefficients at different levels (or resolutions). Detail coefficients give half the difference between pairs of values. We start from pairs of values in the database and stop when there is a single average (i.e., the average of all the values in the database) and a single detail coefficient. These averages and detail coefficients enable the full reconstruction of the values in the database. Starting from the root of the hierarchy (single average and coefficient), we generate the values for level L by adding (left child) and subtracting (right child) the detail coefficient from the average value. The original values are obtained at the last level of the hierarchy. The following table shows an example of how we compute the averages and coefficients for the set of values shown in the first line.
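As a concrete complement, here is a minimal sketch of this averaging/differencing cascade on a toy sequence of my own (not necessarily the example used in the paper):

def haar_decompose(values):
    """Full Haar decomposition of a length-2^k sequence, using the
    unnormalized convention described above. Returns the overall average
    and the detail coefficients ordered from the coarsest level down to
    the finest."""
    details = []
    current = list(values)
    while len(current) > 1:
        averages, level_details = [], []
        for i in range(0, len(current), 2):
            averages.append((current[i] + current[i + 1]) / 2.0)
            level_details.append((current[i] - current[i + 1]) / 2.0)
        details = level_details + details  # coarser levels end up first
        current = averages
    return current[0], details

print(haar_decompose([2, 2, 0, 2, 3, 5, 4, 4]))
# (2.75, [-1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0])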


The error tree shows the coefficients (c_0, c_1, ..., c_k) and how they are associated with the data values.


Any value can be reconstructed from the coefficients on the path from the root of the error tree to it. In the same fashion, range queries can be answered based on the paths to the values involved. In previous work, the coefficients chosen to compose the synopsis were those with the largest normalized value (c_i/sqrt(2^L), where L is the level in the error tree). This selection of coefficients is known to minimize the L^2 error of the reconstruction, but it does not provide any guarantee on individual answers. In fact, the authors show an example where the optimal choice of coefficients leads to very bad answers for some values. This kind of result is mainly due to an unbalanced selection of coefficients in the error tree.
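The classical deterministic selection is then just a thresholding step; here is a small sketch, assuming each coefficient is given together with its level in the error tree and using the normalization above:

import math

def largest_normalized(coeffs_with_levels, budget):
    """Keep the `budget` coefficients with the largest |c| / sqrt(2^level);
    all other coefficients are implicitly set to zero. This is the classical
    selection that minimizes the overall L^2 error."""
    ranked = sorted(coeffs_with_levels,
                    key=lambda cl: abs(cl[0]) / math.sqrt(2 ** cl[1]),
                    reverse=True)
    return ranked[:budget]

# Toy (coefficient, level) pairs.
print(largest_normalized([(2.75, 0), (-1.25, 0), (0.5, 1), (-1.0, 2)], budget=2))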

In order to solve this problem, the authors apply the idea of randomized rounding. Each coefficient c_i is associated with a random variable C_i that takes the value gamma_i with probability c_i/gamma_i, or 0 with probability 1 - c_i/gamma_i, where 0 < c_i/gamma_i <= 1. This means that we replace each coefficient by a variable that is either rounded to a value at least as large as c_i or set to 0, while keeping the same expected value c_i as the original coefficient. Moreover, the variance of each rounded coefficient is (gamma_i - c_i)*c_i. The authors show that the estimates for values and range sums using coefficients generated with randomized rounding are unbiased, which means that their expected values are the correct ones for any value of gamma_i. However, gamma_i increases the variance of the estimates and decreases the expected number of coefficients kept. The focus of the paper is how to set these gamma_i values so as to minimize the expected error of the estimates for a fixed expected number of coefficients (the budget).
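A minimal sketch of the rounding step for a single positive coefficient, just to make the expectation and variance claims tangible:

import random

def randomized_round(c, gamma, rng=random):
    """Round a positive coefficient c with rounding value gamma
    (assuming 0 < c/gamma <= 1): return gamma with probability c/gamma,
    and 0 otherwise, so that the expected value is exactly c."""
    return gamma if rng.random() < c / gamma else 0.0

random.seed(1)
samples = [randomized_round(0.5, 2.0) for _ in range(100000)]
print(sum(samples) / len(samples))  # close to 0.5; variance is (2.0 - 0.5) * 0.5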

First, it is shown how the expected mean squared error (L^2) can be minimized. Due to specific properties of the minimization problem, the authors are able to provide an O(n log n) algorithm for it, where n is the number of values in the database. The following formula gives the optimal values of gamma_i without constraining c_i/gamma_i to the interval (0,1]:


where y_i = c_i/gamma_i and k_i = c_i^2/2^level(c_i). In order to get optimal values that respect the constraint, any y_i greater than 1 is set to one.

But the main contribution of the paper is not a bound on the L^2 error, but one on the relative error of individual queries. The approach to achieve such a bound is similar: an optimal selection of the gamma_i values. Because we are talking about relative errors, which can be highly influenced by small absolute values, the authors apply a sanity bound S in the computation of errors. Whenever an absolute value is lower than S, the error to be minimized is relative to S and no longer to the specific value. Similar to the mean error, the relative error is also related to the variance of the estimates, but now we are interested in improving the worst estimate, i.e., the one with the highest variance. The authors propose a dynamic programming algorithm for setting the values of y_i, which is equivalent to setting the corresponding values of gamma_i (y_i = c_i/gamma_i). Based on the fact that, in an optimal setting for an error sub-tree, the highest error is associated with the minimum data value, the algorithm finds the optimal solution. However, some perturbation of coefficients may be required in order to make this property hold.

The idea of the dynamic programming algorithm is to minimize the relative error of the maximum-error path by trying different values of y_i at the root of an error sub-tree and then dividing the remaining budget (coefficients) optimally between its left and right sub-trees.


The value of j is greater than N for the leaves of the error tree (i.e., the data values). One issue with this strategy is that y_i is continuous, and thus we cannot check all possible values in the range. The solution the authors offer is to consider a set of q discrete values in the interval, so the possible values of y_i are 1/q, 2/q, ..., 1. The proposed algorithm has time complexity O(Nq^2B) and space complexity O(qB log N).

An alternative scheme is also proposed that does not apply randomized rounding, instead retaining or discarding coefficients so as to minimize the error. This approach may lead to bias (the expected value of C_i is no longer c_i), but the choice of gamma_i values can minimize this bias as well.

The overall algorithm is the following:


Notice that they generate several trials and always select the synopsis that minimizes the mean relative error.

In the experimental evaluation, they compare MinL2 (probabilistic synopses minimizing the L^2 error), MinRelVar (probabilistic synopses minimizing the relative error), MinRelBias (probabilistic synopses without randomized rounding), and the deterministic approach. Synthetic data was generated based on a Zipf distribution for value-count pairs, and different permutation strategies were then applied to decide which values were assigned to which frequencies. A real database (Cover Type) with 581K tuples was also used in the evaluation; for this dataset, they selected an attribute with cardinality at least 256. The evaluation metrics are the mean relative error, the maximum relative error, and the 25th percentile of the relative error. I won't show any plots here, but the main conclusions of the evaluation are:

  • MinL2 achieves worse relative error than all the other strategies;
  • MinRelVar and MinRelBias outperform the deterministic approach and MinL2, especially for small synopses. Using only 10 coefficients (4% of the data), they reduce the relative error of the deterministic strategy from 1.5 to 0.5. 
  • The more skewed the data, the greater are the gains of MinRelBias and MinRelVar. 
  • Similar results were found for the real dataset.
My general view on this paper is very positive. I don't think the writing is very good, but the proposed strategies are really clever. One thing still missing for me is a good intuition for why randomized rounding enables the guarantees on the relative error. I believe it works, but I don't have an intuitive reasoning for its use in this particular scenario. Another thing that is not clear to me is whether using randomized rounding for minimizing the L^2 error makes sense: the deterministic strategy already minimizes the L^2 error and is much simpler. Another missing point is whether solving the original optimization problem over the relative error using a general optimization approach would get much better results, and at what cost. Finally, the meaning of 'error guarantees' in the paper is not very clear to me. I think that one specific probabilistic synopsis can still give a very poor answer for a given query, though it is unlikely. Therefore, maybe you are making poor answers unlikely, but you are still not providing guarantees regarding them.


Link: http://dl.acm.org/citation.cfm?id=564746


Sunday, November 18, 2012

ciForager: Incrementally Discovering Regions of Correlated Change in Evolving Graphs

I'm back to reading! This paper is authored by researchers from the University of Melbourne (Australia) and the National Institute of Informatics (Japan). It was published in the ACM TKDD journal this year. The problem studied by the authors is the discovery of subgraphs (or regions) that change in a similar (or correlated) fashion in an evolving graph (or network). The graph is given as a sequence of snapshots and changes can represent edge activity. Patterns of interest correspond to sets of edges that are similar in both the temporal and spatial (topological) dimensions. The authors call these patterns regions of correlated change. One interesting application for these patterns is the analysis of a brain network, where vertices are brain regions and edges connect regions that are close to each other in a 3D view of the brain. As the brain is exposed to a sequence of stimuli, some of its regions (i.e., subgraphs) are expected to present correlated activity levels over time.


The above figure shows an illustrative example of the discovery of regions of correlated change, which are sets of edges that are close in the graph topology and active in the same snapshots. It is easier to see these regions, and their specific changes, through their waveforms, as shown in the following figure.


More formally, the problem consists of, given a sequence of graph snapshots <G_1, ..., G_T> with a set of changed edges E_C^{1,T} (edges that change their status in the interval [1,T]), partitioning E_C^{1,T} into regions R = {R_1^{ts1,te1}, ..., R_L^{tsL,teL}} such that the temporal and spatial distances of edges inside the same region are minimized. Temporal distance is measured using an extended version of the Euclidean distance, and spatial distance is measured as the shortest path distance in the graph given by the union of all snapshots.
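To make the two distances concrete, here is a small sketch under simplifying assumptions of mine: waveforms are plain activity vectors compared with the ordinary Euclidean distance (the paper uses an extended variant), and the spatial distance is the hop distance in the unweighted union graph between the closest endpoints of two edges:

import math
from collections import deque

def temporal_distance(waveform1, waveform2):
    """Euclidean distance between two edge waveforms (activity per snapshot)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(waveform1, waveform2)))

def spatial_distance(union_adj, edge1, edge2):
    """BFS hop distance in the union graph between the closest endpoints of
    two edges; union_adj maps each vertex to an iterable of its neighbors."""
    sources, targets = set(edge1), set(edge2)
    dist = {v: 0 for v in sources}
    queue = deque(sources)
    while queue:
        v = queue.popleft()
        if v in targets:
            return dist[v]
        for u in union_adj.get(v, ()):
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return float("inf")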

The problem of identifying these regions of correlated change was previously proposed by the same authors. In fact, the two previous papers on this subject were both authored by the same group (curious). The main contribution of this particular paper is the design of efficient algorithmic techniques that enable the discovery of these patterns in large graphs. These techniques are combined into a new algorithm called ciForager, which builds upon a previous algorithm for the same problem, named cSTAG, proposed by the authors.

The general idea of cSTAG is to divide the snapshots into a sequence of consecutive, overlapping windows, discover the regions of correlated change in each window, and then merge them. Regions in each window are discovered using a bi-objective (time + space) clustering algorithm. There are several ways to combine the temporal and spatial objective functions; they list some variations, but I won't describe them here. Regions from different windows are merged whenever their average temporal and spatial distances are below user-defined thresholds.
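Here is a small sketch of the windowing and of one simple way to fold the two objectives into a single distance (the weighted sum below is only an illustrative choice, not necessarily one of the combinations used in the paper):

def sliding_windows(num_snapshots, width, step):
    """Consecutive, overlapping windows of snapshot indices (step < width)."""
    return [list(range(start, start + width))
            for start in range(0, num_snapshots - width + 1, step)]

def combined_distance(temporal_d, spatial_d, alpha=0.5):
    """Weighted combination of the temporal and spatial distances."""
    return alpha * temporal_d + (1 - alpha) * spatial_d

print(sliding_windows(6, width=3, step=1))
# [[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5]]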


The above figure shows the process diagrams of cSTAG (left) and ciForager (right). The main difference between these two algorithms, which constitutes the main contribution of this paper, is the use of the following strategies for the efficient discovery of regions of correlated change:

  • Equivalence classes and synchronized connected components: An equivalence class is defined by a unique waveform and a synchronized connected component (synCC) is a maximal set of connected edges that present the same waveform. The idea is to compute distances between synCCs instead of between individual edges. SynCCs are updated efficiently through a set of 4 operations (extend, split, shrink, and merge). The worst-case complexity of updating the set of synCCs from a window W_k to a window W_{k+1} is O(|E_C^k|) + O(|V_{S_kAvg}|*|S_k|) + O(|E_C^{k+1}|), where E_C^k is the set of changed edges in W_k, V_{S_kAvg} is the average size of a synCC in S_k, and E_C^{k+1} is the set of changed edges in W_{k+1}. The number of synCCs is expected to be much smaller than the number of changed edges, which leads to the performance gains of ciForager over cSTAG.
  • Voronoi diagrams for shortest path computation: A Voronoi diagram divides the space into cells such that points in each cell are closer to the center of their own cell than to the center of any other cell. In this paper, the authors apply a graph Voronoi diagram in order to compute approximate shortest paths efficiently. The center of a cell is a synchronized connected component, and a cell is composed of the edges and vertices that are closer to its center than to any other center in the graph (a generic construction of such a diagram is sketched right after this list). Distances between synCCs are computed based on a metagraph, where each vertex is a synCC, edge weights are minimum distances between vertices of two synCCs, and vertex weights are average distances between vertices within a given synCC. The distance between two synCCs is approximated as the sum of the edge and vertex weights on the path from one to the other. The authors show how this metagraph can be efficiently updated through merging, splitting, creating, and deleting operations. The idea is that only distances involving a limited number of cells need to be updated whenever the synCCs change. For instance, if two synCCs are merged, only the vertices and edges in the cells whose centers represent these synCCs are updated. Each of these operations is performed in O(E + V log V) time, where E is the set of edges and V is the set of vertices in the metagraph. This complexity comes directly from the fact that these operations are based on Dijkstra's algorithm.
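Below is a generic sketch of a graph Voronoi construction via multi-source Dijkstra; for simplicity each center is a single vertex, whereas in the paper a center is a whole synCC, and the metagraph maintenance is not reproduced:

import heapq

def graph_voronoi(adj, centers):
    """Assign every vertex to its nearest center via multi-source Dijkstra.
    adj is a weighted adjacency dict (adj[u] = {v: weight}); returns a map
    vertex -> owning center and a map vertex -> distance to that center."""
    dist, cell = {}, {}
    heap = []
    for c in centers:
        dist[c], cell[c] = 0.0, c
        heapq.heappush(heap, (0.0, c, c))
    while heap:
        d, v, c = heapq.heappop(heap)
        if d > dist.get(v, float("inf")):
            continue  # stale heap entry
        for u, w in adj.get(v, {}).items():
            nd = d + w
            if nd < dist.get(u, float("inf")):
                dist[u], cell[u] = nd, c
                heapq.heappush(heap, (nd, u, c))
    return cell, dist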

The overall complexity of ciForager for a transition from window W_k to W_{k+1} is:


The first three terms are due to the updating of equivalence classes and synCCs. The fourth term comes from the update of the graph Voronoi diagrams (U_k is the number of synCC updates). The fifth term is due to the temporal distance computations between pairs of equivalence classes. The sixth term is related to the computation of shortest path distances in the metagraph, and the final term is the cost of running a clustering algorithm.

In the evaluation section, the authors show that ciForager is much faster than cSTAG maintaining similar accuracy. These gains rely on localized changes and increase with the size of the graphs considered. The authors argue that ciForager enables the analysis of the global BGP (Border Gateway Protocol) network, something that was not feasible before. They also support their case using synthetic datasets and the 1998 World Cup Web site access data.

The evaluation metrics can be divided into two groups:

  1. External measures: Based on a set of generated regions (ground truth). Given the known regions and the regions discovered by the algorithms, the problem consists of matching these regions. The authors solve this matching problem as a mass transportation linear programming problem.
  2. Internal measures: Average intra-region and inter-region distances for both the spatial and temporal dimensions. 
Eight different clustering methods are evaluated (remember that ciForager does not depend on a specific clustering method) and ciForager outperforms cSTAG (100-800 times faster) for all of them. However, cSTAG is more accurate than ciForager, and the authors argue that this is because some datasets have regions with small diameter, which leads to edges being mistakenly grouped in the same region. The performance gap between cSTAG and ciForager is shown to increase as the number of snapshots increases, without loss in accuracy. As expected, the larger the number of synCCs, the smaller the performance gains of ciForager over cSTAG; the opposite happens when the number of changed edges increases, because the number of synCCs is expected to grow more slowly than the number of changed edges. It is interesting to notice that ciForager does not achieve similar gains on the BGP data, being only about 40 times faster than cSTAG. The authors attribute these results to randomly distributed changes in the BGP network. Two specific regions related to the damage caused by Hurricane Katrina, not only in the US network but also in Europe, are discussed.

To conclude, this paper studies an interesting problem and the performance gains shown are good. However, I don't usually like papers that describe the result of several combined but somewhat unrelated strategies to improve algorithmic performance. The fact that the algorithm opens possibilities for analyzing new and larger data is amazing, but I think the authors could do a better job describing these results. I understand, though, that the focus of the paper was the performance gains of ciForager over cSTAG. Finally, I found some typos and other small errors that sometimes disrupted my reading; I don't expect to find this kind of issue in a TKDD paper.


Link: http://dl.acm.org/citation.cfm?id=2362385

Sunday, October 7, 2012

Naturally Obsessed: The Making of a Scientist

This weekend I watched an amazing documentary film called Naturally Obsessed, which shows the routine of the Shapiro Lab at Columbia University.



The main idea of this documentary is to give a realistic view of the research process to a broad audience. This view is centered on the experiences of three PhD students (Rob, Kilpatrick, and Gabrielle) who are advised by Prof. Larry Shapiro. In the process of becoming scientists by exploring challenging research problems, these students must deal with constant failure and pressure. I believe the documentary is very successful in demystifying the scientist as a genius and in portraying how much the evolution of science relies on hard work and a kind of obsession. Prof. Shapiro begins the documentary with the following words:

              "One of the best things you can do as a scientist is to suffer from obsessive 
              compulsive disorder, so that you become obsessed with a problem and 
              can't stop working on it until you get your answer."



I'm pretty sure that many psychologists and psychiatrists would disagree with Prof. Shapiro's opinion, but I do agree that obsession is, to some extent, important to drive a scientist through the research process, especially in the face of failure.

The documentary was shot over three years and was also experimental in the sense that the filmmakers did not know the outcomes of the research projects the students were working on and, as a consequence, the futures of the students taking part in the documentary.

All the students in the documentary were working on similar problems. Their main goal was to understand the functioning of a protein by looking at its structure. In order to obtain such a structure, they had to grow the protein as crystals and then analyze how these crystals diffract X-rays by running experiments at a synchrotron. The main challenge was obtaining crystals that would enable the diffraction.

Rob Townley, the main character of the documentary, is focused on a specific protein called AMPK, which controls the burning and storage of fat and is linked to diabetes and obesity. Rob is certainly not the successful, young, and smart kind of PhD student most people are used to hearing about. However, he is persistent and ends up providing many emotional moments throughout the documentary, especially at the end. But I won't spoil it here, so please watch it. This is my favorite quote from Rob about the research process:

                "Either you got to be a fool or an optimist."

But this is one of those typical cases where you don't have the skills to perceive yourself as a fool in case you are one, right?

Links:
http://www.naturallyobsessed.com/
http://www.thirteen.org/naturally-obsessed/

Sunday, September 23, 2012

38th International Conference on Very Large Data Bases

Last month, I attended the 2012 edition of the VLDB conference in Istanbul, Turkey. VLDB is a top-tier conference in computer science with a focus on data management. I consider SIGMOD (ACM) and ICDE (IEEE) to be "sisters" of VLDB in the computer science conference family. My paper entitled Mining attribute-structure correlated patterns in large attributed graphs, written in collaboration with Wagner Meira Jr. and Mohammed Zaki, was presented as part of the Graph Statistics and Summarization session of VLDB. Moreover, I had the opportunity to watch many other talks, interact with the data management research community in general, and get to know the beautiful city of Istanbul.

About my talk, I think it was one of the best I've given so far. In fact, I was a bit nervous about it, since it was my first at a major CS conference. To make things even harder, my current advisor and my two past advisors were in the audience -- some people feel more comfortable with a familiar audience, but in my case it works exactly the other way around. I put a good amount of work into the preparation of the slides, but I still have to improve my slide design skills a lot. My main goal was to combine textless slides with a clear speech. As expected, the speech part was the most challenging one. In the end, I finished my talk on time and also answered one question about it. Moreover, the overall feedback I received after the talk was mostly positive.

Regarding the other presentations, I was able to watch some of them. However, choosing which session to attend among 7 parallel sessions was difficult. I hope someday they will develop some kind of session recommendation for these big conferences, so that you upload some of your papers and then get a list of sessions/talks aligned with your interests. I attended all the keynote talks, but did not really pay attention to them. In my experience, keynote talks are usually boring and I've never been able to get something useful out of them. On the other hand, I enjoyed the research sessions a lot. Here is the list of talks I attended:


  1. Size-I Object Summaries for Relational Keyword Search;
  2. Injecting Uncertainty in Graphs for Identity Obfuscation;
  3. Measuring Two-Event Structural Correlations on Graphs;
  4. Efficient Reachability Query Evaluation in Large Spatiotemporal Contact Datasets;
  5. Boosting Moving Object Indexing through Velocity Partitioning;
  6. Multiple Location Profiling for Users and Relationships from Social Network and Content;
  7. Fast and Exact Top-k Search for Random Walk with Restart;
  8. The Filter-Placement Problem and Its Applications to Minimizing Information Multiplicity;
  9. Labeling Workflow Views with Fine-Grained Dependencies;
  10. gSketch: On Query Estimation in Graph Streams;
  11. Answering Top-k Queries Over a Mixture of Attractive and Repulsive Dimensions;
  12. Bayesian Locality Sensitive Hashing for Fast Similarity Search
  13. Challenging the Long Tail Recommendation;
  14. Ranking Large Temporal Data;
  15. Indexing the Earth Mover's Distance Using Normal Distributions;
  16. Stochastic Databases Cracking: Towards Robust Adaptive Indexing in Main-Memory Column-Stores;
  17. A Moving Object Index for Efficient Query Processing with Peer-Wise Location Privacy;
  18. A Data-Based Approach to Social Influence Maximization;
  19. The Complexity of Social Coordination;
  20. Who Tags What? An Analysis Framework;
  21. Whom to Ask? Jury Selection for Decision Making Tasks on Micro-blog Services;

As soon as I set my reading pace to normal levels again, I may summarize some of these papers here.


Visiting Istanbul was an amazing experience. The city has a great history -- former capital of the Byzantine and Ottoman Empires -- that can still be explored through museums, mosques, churches, castles, etc. Also, the city's topography and hydrography -- the city is divided by the Bosphorus Strait and the Golden Horn -- provide great views from several parts of the city. The picture included in this post was taken during a boat tour through the Bosphorus Strait; it shows the Yoros Castle and the Turkish flag. Because VLDB took place during Victory Day, a very important national holiday commemorating Turkey's independence, we got used to seeing Turkish flags all over the city.