Prompt detection is a cornerstone in the control and prevention of infectious diseases. The Integrated Disease Surveillance Programme (IDSP) of India identifies outbreaks, but it does not predict them. This study was conducted to assess the temporal correlation between Google Trends and IDSP data and to determine the feasibility of using Google Trends for the prediction of outbreaks or epidemics.
Google search queries related to malaria, dengue fever, chikungunya, and enteric fever for Chandigarh union territory and Haryana state of India in 2016 were extracted and compared with presumptive form data of the IDSP. Spearman correlation and scatter plots were used to depict the statistical relationship between the two datasets. Time trend plots were constructed to assess the correlation between Google search trends and disease notification under the IDSP.
Temporal correlation was observed between the IDSP reporting and Google search trends. Time series analysis of the Google Trends showed strong correlation with the IDSP data with a lag of −2 to −3 weeks for chikungunya and dengue fever in Chandigarh (
The results were similar to those of previous studies on specific diseases, and many other diseases should be studied at the national and sub-national levels.
Prompt detection is a cornerstone for the control and prevention of infectious diseases. The Integrated Disease Surveillance Programme (IDSP) of India (
Data generated from queries fed into search engines is recorded and can be used for surveillance purposes, just as it is used for marketing purposes. Targeted sources include Internet-search metrics, online news stories, social network data, and blog/microblog data [
A cross-sectional study design was used.
Under the IDSP, three types of forms are to be submitted, namely ‘S’, ‘P’, and ‘L’ forms. The ‘S’ form includes suspected cases based on syndromic surveillance done by health workers at a health subcentre and its community, which covers a population of 3,000 to 5,000. The ‘P’ or the presumptive form is filled by medical officers of various health facilities (from primary health centres to tertiary care hospitals), including private medical practitioners, based on clinical examination. The ‘P’ form reports around 22 diseases. The ‘L’ or laboratory form is filled at laboratories (both public and private) and reports 12 types of laboratory confirmed cases. The cases identified from Monday to Sunday are reported using different forms on successive Mondays. The reporting units submit their reports to the next level every Monday. After verification and compilation, the data reaches the District Surveillance Units by Wednesday. It is further transmitted to the State Surveillance Units (SSU) at all State/UT headquarters, and finally, it is sent to the Central Surveillance Unit (CSU) in New Delhi.
Haryana is one of the northern states of India. It is amongst the wealthiest states in India and has the third highest per capita income in the country. The wireless teledensity (number of telephone connections for every hundred individuals living within an area) in the state is around 117.53. Chandigarh, a union territory, is the common capital of Haryana and Punjab, and the teledensity is around 107.88 [
Google is one of the most commonly used search engines, where a very high volume of queries is carried out every day. The current market share of Google among the existing search engines is around 97% [
Google Trends data is a randomly collected sample of real time (of the last 7 days) and non-real time (data from 2004 to 36 hours prior to search) Google search queries. After removal of personal information, each piece of data is categorized and tagged with a topic. Each data point is divided by the total searches in a specific geographical area over a period of time to compare relative popularity. Google Trends depicts search frequency output as a normalized data series, and the resulting numbers are scaled on a range of 0 to 100 based on a topic's proportion to all searches on all topics (
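The scaling described above can be illustrated with a minimal sketch. It assumes hypothetical weekly counts for one topic and for all searches in a region, and it assumes that the week with the highest share is mapped to 100. Google's actual sampling and tagging pipeline is not reproduced here, and the function name `relative_search_index` is illustrative only.

```python
def relative_search_index(topic_counts, total_counts):
    """Scale weekly topic searches to a 0-100 index (illustrative only).

    Each week's topic count is divided by that week's total searches,
    and the resulting shares are rescaled so the largest share is 100
    (assumed convention, not taken from the study).
    """
    shares = [t / total for t, total in zip(topic_counts, total_counts)]
    peak = max(shares)
    if peak == 0:
        # No searches for the topic in any week.
        return [0 for _ in shares]
    return [round(100 * s / peak) for s in shares]


# Hypothetical counts for three weeks (not study data).
topic = [5, 20, 10]
total = [1000, 2000, 1000]
index = relative_search_index(topic, total)
print(index)
```

Because the index is built from shares rather than raw counts, weeks two and three receive the same value in this example even though the raw topic counts differ.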
The same disease may be searched for using different terms owing to differences in education, primary language, ethnicity, pronunciation, etc. The Google search engine takes these differences into account and gives results from every possible related query. The different queries or terms used to search for a single disease were identified using Google Correlate. Google Correlate is another domain of Google (
In our study, similar search terms for each of the four diseases (dengue, chikungunya, malaria, and typhoid) used in the study areas were obtained using Google Correlate. The top 5 search queries having maximum correlation with the main disease under the IDSP were downloaded from Google Correlate for each notifiable disease and were further used for retrieving the trend data through Google Trends. For example, the top 5 terms for dengue included dengue (
Data reported in ‘P’ form of the IDSP on four diseases, namely, dengue, chikungunya, malaria, and enteric fever from January to December 2016 for Haryana and Chandigarh was used. For the above study period, the Google Trends data reported for Haryana and Chandigarh was used.
The week-wise compiled number of cases at SSU of Haryana and Chandigarh pertaining to all four diseases was entered in Microsoft Excel 2016. Google Trends weekly search metrics for each disease were downloaded in the .CSV format. Data from both sources was then exported to RStudio (
Scatter plots, Spearman rank correlation, and time series analysis were applied to assess the association between the two datasets. Cross-correlation results are obtained as product-moment correlations between the two time series. The advantage of using cross-correlations is that they account for time dependence between two time-series variables. The time dependence between two variables is termed the lag. Lag values indicate the degree and direction of associations. A lag of −1 for assessing correlation suggests that Google Trends data has been shifted backward by one week from the IDSP data and the opposite is true for +1. Positive correlations for lag values of ≥1 week were considered significant. Considering the objectives of this study, a positive association between the two time-series (Google Trends data preceding the presumptive disease notification under the IDSP) verifies its suitability for use as an early warning tool. A
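As an illustration of the two measures named above, the following sketch computes a Spearman rank correlation and a lagged correlation between two weekly series using only the Python standard library. The weekly values are synthetic, the function names are hypothetical, and the lag convention (a negative lag pairs earlier search data with later case counts) is an assumption chosen to match the description here. The lagged estimate is a Pearson correlation on the overlapping weeks, which can differ slightly from the estimate returned by statistical packages that normalise by the full-series variance.

```python
from math import sqrt


def pearson(x, y):
    """Product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def lagged_correlation(trends, cases, lag):
    """Correlate trends[t + lag] with cases[t] over the overlapping weeks."""
    if lag < 0:
        x, y = trends[:lag], cases[-lag:]
    elif lag > 0:
        x, y = trends[lag:], cases[:-lag]
    else:
        x, y = trends, cases
    return pearson(x, y)


# Synthetic weekly series (not study data): searches rise two weeks
# before the reported cases do.
cases = [0, 0, 0, 1, 3, 7, 12, 7, 3, 1, 0, 0, 0, 0]
trends = cases[2:] + [0, 0]

by_lag = {lag: lagged_correlation(trends, cases, lag) for lag in range(-4, 5)}
best_lag = max(by_lag, key=by_lag.get)
print(best_lag, round(by_lag[best_lag], 3))
print(round(spearman(trends, cases), 3))
```

In this synthetic example the search series is the case series moved two weeks earlier, so the strongest correlation appears at a lag of −2 weeks.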
The results of the correlational analysis between the IDSP data and Google Trends are presented in
Time trend graphs for surveillance data and Google Trends data for the respective diseases for the study areas are presented in
The linear association between disease surveillance and Google Trends pattern was assessed using time-series cross-correlations as shown in
The Google Trends-based prediction system has the capability to identify disease outbreaks well in advance for the studied diseases with modest reliability [
The investigation and application of Internet-based surveillance are widely recognised [
The present study demonstrates that an Internet-search-based surveillance system has the potential to effectively contribute to the control of various diseases. However, correlations alone should not be viewed as definitive evidence of impending outbreaks or epidemics as the analyses performed were univariate and exploratory in nature. The results of this study should be interpreted with caution keeping in mind the biological plausibility and natural history of the disease concerned.
The Internet-based surveillance system collects data and provides necessary information instantly, circumventing traditional administrative structures that impede information flow [
The lag period used in this study was −4 to +4 weeks. This range was nearly two times the incubation period of any febrile illness studied. The negative lag period helps to estimate the approximate time of primary case occurrence and to guide further analysis of biologically plausible associations. The observed maximum correlation 2 to 3 weeks before the actual outbreak provides sufficient time to deploy RRTs for timely action. Similarly, the positive lag period may help the surveillance team to confirm that the outbreak is over.
The spike of Internet searches, for example, for ‘chikungunya’ may be attributed to various factors. It may be due to an increased number of cases in the community and increased attention from social media. Media can be a source of bias, as it may seriously affect the trending of searches for a particular disease [
The studied febrile illnesses are common in India. Therefore, whenever a patient with fever visits any health facility, a battery of lab investigations is conducted depending on previous experience from the community. This list also serves as a driver for the searches related to the diseases. However, these two processes, i.e., Internet searches as per Google Trends and the actual number of cases in the community and their notification, may not be mutually exclusive.
The study had the following limitations. First, the study used only the ‘P’ form data of malaria, enteric fever, chikungunya, and dengue. The ‘S’ form data was not used because the form did not differentiate the fever cases reported. Similarly, ‘L’ form data was not included in the analysis because case reporting is usually delayed for laboratory confirmation. There is also a need to test and establish the correlation of Internet search data with other diseases and other forms of IDSP data. Similarly, there is a need to demonstrate the applicability of this Internet search data for use by all states. Second, in a country like India with varied cultures, a variety of languages are used as primary languages by mobile and Internet users. However, only English was used as the main language to retrieve the search results, which may have caused underreporting of cases and thus errors in the correlation. Third, the established correlation may not help to identify the exact place of an outbreak or epidemic at the intrastate and intra-district levels because Google Trends does not provide data at these levels. Fourth, this study assessed the performance of only one term that had the maximum correlation with the febrile illnesses included in the study. Other search terms may also add to the burden of the searches related to the particular disease. Despite this, we observed a positive correlation with all the febrile illnesses, though the strength varied. Finally, seasonal differencing could not be applied to cross-correlations to remove cyclic seasonal trends as IDSP data was available for only 1 year.
We recommend the use of an Internet-based surveillance system to supplement the existing IDSP system. Such a system can be tested at the field level for taking timely action, especially for epidemic prone diseases. Future studies should focus on forecasting epidemics and outbreaks for various other diseases by using mathematical modelling that adjusts for other parameters. The search trends from social media platforms can also be assessed further along with Google or other portal site trends for disease surveillance.
In conclusion, similar results were obtained when applying the results of previous studies to specific diseases, and it is considered that many other diseases should be studied at national and sub-national levels. Internet-based surveillance systems have broader applicability for the surveillance of infectious diseases than is currently recognised, especially in resource-constrained areas. Despite the huge potential of this approach, this method cannot be used as an alternative to traditional surveillance systems and can only be used to supplement the existing system. However, the results of this study suggest that Internet-based surveillance systems have a potential role in forecasting emerging infectious disease events.
No potential conflict of interest relevant to this article was reported.
Supplementary materials can be found via
Top 5 terms extracted from Google Correlate for each of the four febrile illnesses
IDSP: Integrated Disease Surveillance Programme.