Abstract
According to the World Health Organization (WHO), vector-borne diseases such as malaria and dengue account for 17% of all infectious disease cases and lead to more than 700,000 deaths per year. Tracking and predicting the spread of vector-borne diseases is a vital task that could save hundreds of thousands of lives annually. Often, the first reports of vector-borne disease outbreaks appear in emails and online reporting systems long before they are officially documented. Tracking and predicting the emergence and spread of these outbreaks therefore requires extracting data from such unstructured sources and combining it with historical weather and climate data to understand the underlying background triggers and disease dynamics. In this work, we develop a data extraction pipeline for the online outbreak reporting website ProMED-mail that uses a web scraper, a transformer neural network summarizer, and a named entity recognizer to obtain a dataset of malaria, dengue, Zika, and chikungunya outbreaks over the last 30 years. We further analyze this scraped dataset in association with global rainfall anomalies derived from NASA's Integrated Multi-satellitE Retrievals for GPM [Global Precipitation Measurement] (IMERG) dataset, as a preliminary step toward understanding the effect of global rainfall patterns on the spread of vector-borne diseases. Analysis of the ProMED-mail and GPM data shows that vector-borne disease outbreaks are clustered toward the tropics and are often amplified during the rainy seasons. Our scraped dataset can be a valuable tool in creating comprehensive georeferenced disease records for modeling and predicting future outbreaks.
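The abstract does not say how the rainfall anomalies were computed. As a rough illustration only, the sketch below uses the common definition of an anomaly as the observed value minus the long-term mean for the same calendar month. The function name `monthly_anomalies` and the `(year, month, rainfall_mm)` record layout are assumptions made for this example, not part of the paper or of the IMERG product.

```python
from collections import defaultdict
from statistics import mean


def monthly_anomalies(records):
    """Return rainfall anomalies keyed by (year, month).

    `records` is an iterable of (year, month, rainfall_mm) tuples for one
    location. This layout is hypothetical and chosen for illustration.
    The anomaly is the observed rainfall minus the mean rainfall of the
    same calendar month across all years present in `records`.
    """
    records = list(records)  # allow generators; we iterate twice

    # Group rainfall totals by calendar month to build a simple climatology.
    by_month = defaultdict(list)
    for _year, month, rain in records:
        by_month[month].append(rain)
    climatology = {month: mean(values) for month, values in by_month.items()}

    # Anomaly = observation minus the climatological mean for that month.
    return {
        (year, month): rain - climatology[month]
        for year, month, rain in records
    }
```

For instance, with January totals of 100 mm and 140 mm in two consecutive years, the January mean is 120 mm, so the anomalies are -20 mm and +20 mm. Positive anomalies in such a series mark wetter-than-usual months, which is the kind of signal that could then be compared against outbreak dates.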
Original language | English |
---|---|
Title of host publication | Proceedings - 2021 IEEE International Conference on Big Data, Big Data 2021 |
Editors | Yixin Chen, Heiko Ludwig, Yicheng Tu, Usama Fayyad, Xingquan Zhu, Xiaohua Tony Hu, Suren Byna, Xiong Liu, Jianping Zhang, Shirui Pan, Vagelis Papalexakis, Jianwu Wang, Alfredo Cuzzocrea, Carlos Ordonez |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
Pages | 4156-4164 |
Number of pages | 9 |
ISBN (Electronic) | 9781665439022 |
State | Published - 2021 |
Externally published | Yes |
Event | 2021 IEEE International Conference on Big Data, Big Data 2021 - Virtual, Online, United States; Duration: Dec 15 2021 → Dec 18 2021 |
Publication series
Name | Proceedings - 2021 IEEE International Conference on Big Data, Big Data 2021 |
---|---|
Conference
Conference | 2021 IEEE International Conference on Big Data, Big Data 2021 |
---|---|
Country/Territory | United States |
City | Virtual, Online |
Period | 12/15/21 → 12/18/21 |
Bibliographical note
Publisher Copyright: © 2021 IEEE.
Keywords
- NLP
- ProMED
- Web scraping
- data mining
- epidemiology
- transformers
- vector-borne disease