Mathias Unberath, assistant research professor in the Malone Center, and his team have published an open-source, machine-readable dataset of socioeconomic factors that may affect the spread and/or consequences of epidemiological outbreaks, particularly the novel coronavirus (COVID-19).

Access the dataset here >>

From the project website:

Overview: Despite overoptimistic promises of an “American Resurrection” by Easter Sunday, many scientists and citizens fear that current mitigation strategies are likely insufficient to avert the collapse of the US healthcare system. Confirmed COVID-19 cases, hospitalizations, and – unfortunately – deaths are rapidly increasing; implementing an aggressive suppression strategy – “The Hammer” – seems to be the only viable option to buy time. How can we make best use of the time these measures buy?

The machine learning community should actively engage in these discussions and contribute possible solutions to actionable problems. One interesting direction could be to identify the effect that different mitigation and suppression strategies have in terms of benefits and costs. “Benefits” in this case would correspond to reductions in the effective reproduction number R, potential lives saved and long-term socio-economic benefits, while “costs” could reflect the resulting burden on the healthcare system, short-term economic consequences and possible long-term economic restructuring.

Many of the recent epidemiological predictions and analyses are performed for the US as a whole. However, identifying relationships between “benefits” and “costs” will likely require a much higher granularity of analysis. This is because highly localized contextual factors, such as population density, demographics or primary means of transportation, will affect critical parameters for computational epidemiological modeling, including the effective reproduction number R.

To facilitate research on such questions, we present a machine-readable dataset that aggregates relevant data from around 10 governmental and academic sources at the county level. In addition to county-level time-series data from the JHU CSSE COVID-19 Dashboard, our dataset contains more than 300 variables that summarize population estimates, demographics, ethnicity, housing, education, employment and income, climate, transit scores, and healthcare system-related metrics. A detailed description of all variables can be found here.
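
As a rough illustration of the kind of county-level analysis described above, the sketch below joins a case time series with static county covariates and derives two simple quantities. It does not use the dataset's actual files or schema: the field names (`fips`, `population`, `density_per_sq_mile`, `confirmed`), the placeholder county codes, and all numbers are assumptions invented for the example.

```python
# Hypothetical static covariates per county, keyed by a placeholder FIPS code.
# All values are made up for illustration; they are not taken from the dataset.
covariates = {
    "00001": {"population": 800_000, "density_per_sq_mile": 1400.0},
    "00002": {"population": 600_000, "density_per_sq_mile": 7400.0},
}

# Hypothetical cumulative confirmed cases as (fips, ISO date, count) records.
cases = [
    ("00001", "2020-03-30", 100),
    ("00001", "2020-03-31", 120),
    ("00002", "2020-03-30", 150),
    ("00002", "2020-03-31", 195),
]


def join_cases_with_covariates(cases, covariates):
    """Attach county covariates to each time-series record.

    Adds cases per 100,000 residents and the day-over-day relative growth
    in cumulative cases (None for the first record of each county).
    """
    rows = []
    previous = {}  # last cumulative count seen for each county
    # Sorting the tuples orders records by county, then chronologically,
    # because ISO dates sort correctly as strings.
    for fips, date, confirmed in sorted(cases):
        county = covariates[fips]
        prev = previous.get(fips)
        growth = (confirmed - prev) / prev if prev else None
        rows.append({
            "fips": fips,
            "date": date,
            "confirmed": confirmed,
            "confirmed_per_100k": confirmed / county["population"] * 100_000,
            "daily_growth": growth,
            "density_per_sq_mile": county["density_per_sq_mile"],
        })
        previous[fips] = confirmed
    return rows


table = join_cases_with_covariates(cases, covariates)
for row in table:
    print(row)
```

Keying both tables on a county identifier keeps the join explicit, so a real analysis could swap in any of the dataset's variables as covariates once their actual names are known.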