Tuesday, January 3, 2012

Detecting Influenza Epidemics using Search Engine Query Data

This paper sat in my to-read stack for a long time. It was written by researchers from Google and the Centers for Disease Control and Prevention and was published in the journal Nature in 2009. The paper shows that search query data is correlated with influenza epidemics in a given region, which makes it useful for monitoring purposes. More specifically, a linear regression on the relative frequency of selected queries submitted to the Google search engine accurately estimates the percentage of physician visits in which a patient presents influenza-like symptoms. Official reports were used both to identify the queries and to validate the models.
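To make the idea concrete, here is a minimal sketch of that kind of fit. As I read the paper, the regression is done on the logit (log-odds) scale of both quantities. Everything else below is my own illustration and not the authors' code: the function names are hypothetical and the weekly numbers are made up.

```python
import math

def logit(p):
    """Log-odds of a proportion p in (0, 1)."""
    return math.log(p / (1.0 - p))

def inv_logit(x):
    """Inverse of logit: maps a real number back to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Made-up weekly data for one region (assumption, not from the paper):
# fraction of all queries that are flu-related, and fraction of
# physician visits with influenza-like illness (ILI).
query_fraction = [0.0010, 0.0015, 0.0022, 0.0030, 0.0041, 0.0035]
ili_visit_fraction = [0.011, 0.015, 0.021, 0.028, 0.037, 0.032]

# Fit logit(ILI) = intercept + slope * logit(query fraction).
intercept, slope = fit_line(
    [logit(q) for q in query_fraction],
    [logit(p) for p in ili_visit_fraction],
)

def estimate_ili(q):
    """Estimate the ILI visit fraction from a query fraction q."""
    return inv_logit(intercept + slope * logit(q))

print(round(100 * estimate_ili(0.0025), 2), "% of visits (estimated)")
```

With a single explanatory variable the whole model is just two numbers (intercept and slope), which is part of why it is so cheap to refit for every candidate query.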

An interesting property of the proposed strategy is that it is completely automatic. Instead of applying machine learning techniques to classify a set of 50 million queries, the authors used the official data to rank the queries and select those that produced the most accurate estimates. The weekly counts of each query were normalized for each state over a five-year period. Although the study considered data from each region of the United States, the selected model (i.e., set of queries + linear regression) was the one that obtained the best global results. The queries in the final set are related to influenza complications, cold/flu remedies, symptoms, etc.
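The rank-and-select step can be sketched roughly as follows. This is a simplified illustration under my own assumptions, not the procedure from the paper: it scores each query by its plain Pearson correlation with the official series, adds queries in that order, and keeps the prefix whose summed series correlates best. It also scores on the same weeks it ranks on, which a real study would avoid by holding data out. The query names and numbers are invented.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def rank_queries(query_series, official):
    """Order query names from most to least correlated with official data."""
    return sorted(
        query_series,
        key=lambda name: pearson(query_series[name], official),
        reverse=True,
    )

def select_top_n(query_series, official):
    """Add queries in rank order; keep the prefix with the best correlation."""
    ranked = rank_queries(query_series, official)
    combined = [0.0] * len(official)
    best_n, best_r = 0, -1.0
    for n, name in enumerate(ranked, start=1):
        # Sum the series of the top-n queries into one signal.
        combined = [c + v for c, v in zip(combined, query_series[name])]
        r = pearson(combined, official)
        if r > best_r:
            best_n, best_r = n, r
    return ranked[:best_n], best_r

# Invented weekly data (assumption): official ILI percentages and
# normalized counts for three candidate queries.
official = [1.0, 1.5, 2.5, 4.0, 3.0, 1.8]
queries = {
    "flu symptoms": [10, 14, 26, 41, 29, 17],
    "cold remedy": [20, 24, 30, 38, 33, 26],
    "basketball scores": [50, 48, 52, 47, 51, 49],
}

selected, best_r = select_top_n(queries, official)
print(selected, round(best_r, 3))
```

No one tells the procedure which queries are about the flu. An unrelated query such as the invented "basketball scores" drops out simply because adding it makes the combined signal track the official data less well.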

On the data used for fitting, the mean correlation of the model with the official reports was 0.90. The model achieved a mean correlation of 0.97 when applied to held-out data (i.e., data not used during the fitting phase), and a further evaluation using data from the state of Utah obtained a mean correlation of 0.90. Moreover, the authors found that the proposed model is able to provide estimates 1-2 weeks ahead of the publication of the official reports.

An interesting aspect of the paper is that it combines a very simple model (linear regression) with a huge dataset. In academia, we are accustomed to the opposite approach, which is the application of a complex model to a small dataset. I like to call the strategy used in the paper the Google approach. According to the paper, hundreds of machines running MapReduce were used to process the models. Pretty cool, isn't it? Searching for related papers, I found one which argues that query data is not that useful for this kind of monitoring. Nevertheless, using web data to monitor epidemics has become a hot topic.

Link: http://research.google.com/archive/papers/detecting-influenza-epidemics.pdf
