brought to you by the Reich Lab Zoltar Development Team:
Matt Cornell, Khoa Le, Abdul Hannan Kanji, Katie House,
Yuxin Huang, Evan Ray, Nick Reich
http://zoltardata.com
October 6, 2020
Zoltar is a research data repository that stores time-series forecasts made by external models and provides tools for programmatic data access and scoring.
In development as a research tool since 2018. Focused COVID-oriented development over the last 6 months has made it a viable, if early-stage, “production” system.
We have a preprint describing the vision and general forecast data model: https://arxiv.org/abs/2006.03922.
GitHub is not a sustainable, long-term architecture for large-scale forecast storage.
A structured database can provide systematic access to just the pieces of the data that you need.
All aspects of the project are open-source (contributions and feature requests welcome!).
The Zoltar API allows you to access forecast data programmatically (without having to read the whole repository) for evaluation, visualization or ensemble building.
The first step is always to establish a connection with the Zoltar server.
library(zoltr)
zoltar_connection <- new_connection()
zoltar_authenticate(zoltar_connection, Sys.getenv("Z_USERNAME"), Sys.getenv("Z_PASSWORD"))
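Sys.getenv() returns an empty string when a variable is unset, so a missing Z_USERNAME or Z_PASSWORD would only surface later as a failed login. A minimal sketch of a guard that fails early instead; the helper name get_required_env and the variable Z_EXAMPLE_ONLY are hypothetical and not part of zoltr:

```r
# Hypothetical helper (not part of zoltr): read an environment variable
# and stop with a clear message if it is unset or empty.
get_required_env <- function(name) {
  value <- Sys.getenv(name, unset = NA)
  if (is.na(value) || !nzchar(value)) {
    stop("Environment variable '", name, "' is not set", call. = FALSE)
  }
  value
}

# Demonstration with a throwaway variable so the sketch runs on its own.
Sys.setenv(Z_EXAMPLE_ONLY = "demo-user")
demo_value <- get_required_env("Z_EXAMPLE_ONLY")
Sys.unsetenv("Z_EXAMPLE_ONLY")

# Intended use with the connection code above (left commented out here):
# zoltar_authenticate(zoltar_connection,
#                     get_required_env("Z_USERNAME"),
#                     get_required_env("Z_PASSWORD"))
```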
The Zoltar API follows RESTful design principles. As a result, every resource is associated with, and accessed via, a unique URL. For example, the COVID-19 Forecast Hub project has its own URL, held here in covidhub_project_url and passed to the query below:
fcasts <- do_zoltar_query(zoltar_connection,
    project_url = covidhub_project_url,
    is_forecast_query = TRUE,
    models = c("MOBS-GLEAM_COVID", "IHME-CurveFit", "COVIDhub-ensemble", "COVIDhub-baseline"),
    targets = paste(1:20, "wk ahead inc death"),
    units = "48", ## FIPS code for Texas
    types = c("point"), ## only retrieving point forecasts
    timezeros = "2020-06-22")
## # A tibble: 32 x 6
## model timezero unit target class value
## <chr> <date> <chr> <chr> <chr> <dbl>
## 1 IHME-CurveFit 2020-06-22 48 1 wk ahead inc death point 242.
## 2 IHME-CurveFit 2020-06-22 48 10 wk ahead inc death point 896.
## 3 IHME-CurveFit 2020-06-22 48 11 wk ahead inc death point 1033.
## 4 IHME-CurveFit 2020-06-22 48 12 wk ahead inc death point 1263.
## 5 IHME-CurveFit 2020-06-22 48 13 wk ahead inc death point 1510.
## 6 IHME-CurveFit 2020-06-22 48 14 wk ahead inc death point 1779.
## 7 IHME-CurveFit 2020-06-22 48 15 wk ahead inc death point 2062.
## 8 IHME-CurveFit 2020-06-22 48 16 wk ahead inc death point 2352.
## 9 IHME-CurveFit 2020-06-22 48 17 wk ahead inc death point 2642.
## 10 IHME-CurveFit 2020-06-22 48 18 wk ahead inc death point 2923.
## # … with 22 more rows
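The query above reads the project's URL from covidhub_project_url. One way to obtain such a URL is to look the project up by name in a table of projects. To our understanding, zoltr's projects() function returns a data frame of this kind with name and url columns, but treat that as an assumption and check the package documentation. The sketch below uses a made-up stand-in table so that it runs without a server connection; find_project_url is a hypothetical helper:

```r
# Stand-in for a project listing; names and URLs are invented for
# illustration and do not correspond to real Zoltar projects.
the_projects <- data.frame(
  name = c("Example Project A", "Example Project B"),
  url  = c("https://example.org/api/project/1/",
           "https://example.org/api/project/2/"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the URL of exactly one project matched by name.
find_project_url <- function(projects_df, project_name) {
  matches <- projects_df$url[projects_df$name == project_name]
  if (length(matches) != 1) {
    stop("Expected exactly one project named '", project_name,
         "', found ", length(matches), call. = FALSE)
  }
  matches
}

example_project_url <- find_project_url(the_projects, "Example Project B")
print(example_project_url)
```

Stopping when the name matches zero or several rows avoids silently querying the wrong project.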
library(tidyverse)
library(covidcast)
library(MMWRweek)
source("https://raw.githubusercontent.com/reichlab/covid19-forecast-hub/master/code/processing-fxns/get_next_saturday.R")
## adding dates to week-ahead targets for easier plotting
fcasts <- fcasts %>%
    ## the first two characters of the target hold the horizon in weeks
    mutate(week_ahead = as.numeric(substr(target, 1, 2)),
           target_end_date = get_next_saturday(timezero + 7*(week_ahead-1)))
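The date arithmetic above can be checked on a small example. The sketch below parses the horizon with a regular expression rather than fixed character positions, and uses next_saturday(), a stand-in written for this example that is assumed to behave like the sourced get_next_saturday() (returning the Saturday on or after a given date):

```r
# Stand-in for get_next_saturday(): the Saturday on or after each date.
# as.POSIXlt()$wday codes Sunday as 0 and Saturday as 6.
next_saturday <- function(date) {
  date <- as.Date(date)
  date + (6 - as.POSIXlt(date)$wday) %% 7
}

targets  <- c("1 wk ahead inc death",
              "2 wk ahead inc death",
              "10 wk ahead inc death")
timezero <- as.Date("2020-06-22")  # a Monday

# Drop everything from " wk ahead" onward, leaving only the horizon.
week_ahead <- as.integer(sub(" wk ahead.*$", "", targets))

# The k-week-ahead target ends on the Saturday of the k-th week.
target_end_date <- next_saturday(timezero + 7 * (week_ahead - 1))

print(data.frame(target = targets, week_ahead, target_end_date))
```

For the Monday timezero 2020-06-22, the 1 wk ahead target ends on Saturday 2020-06-27 and the 2 wk ahead target on 2020-07-04.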
## downloading truth data from covidcast
jhu_dat <- covidcast_signal(data_source = "jhu-csse",
        signal = "deaths_incidence_num",
        start_day = "2020-04-01", end_day = "2020-10-03",
        geo_type = "state", geo_values = "tx") %>%
    mutate(epiweek = MMWRweek(time_value)$MMWRweek) %>%
    group_by(epiweek) %>%
    summarize(value = sum(value)) %>%
    mutate(target_end_date = MMWRweek2Date(rep(2020, n()), epiweek, rep(7, n())),
           model = "observed data (JHU)")
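MMWR epiweeks run Sunday through Saturday, so the weekly totals above are sums over the days that share a week-ending Saturday. A base R sketch of the same aggregation on synthetic data (one count per day, not real observations), keyed on the week-ending date rather than the epiweek number; week_ending() is a helper written for this example. Because 2020-04-01 is a Wednesday, the first week of a series starting on that day is partial, which the day count makes visible:

```r
# Helper for this example: the Saturday on or after each date
# (as.POSIXlt()$wday codes Sunday as 0 and Saturday as 6).
week_ending <- function(date) {
  date <- as.Date(date)
  date + (6 - as.POSIXlt(date)$wday) %% 7
}

# Synthetic daily series: one count per day, for illustration only.
daily <- data.frame(
  time_value = seq(as.Date("2020-04-01"), as.Date("2020-04-18"), by = "day"),
  value      = 1,
  n_days     = 1
)
daily$target_end_date <- week_ending(daily$time_value)

# Sum counts and days within each week-ending Saturday.
weekly <- aggregate(cbind(value, n_days) ~ target_end_date,
                    data = daily, FUN = sum)

# Keep only weeks with all seven days present.
complete_weeks <- weekly[weekly$n_days == 7, ]

print(weekly)
```

Whether to drop or keep partial weeks at the edges of the series is a judgment call; the sketch only shows how to detect them.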
## plot the data!
ggplot(fcasts, aes(x=target_end_date, y=value, color=model)) +
    geom_point() +
    geom_line() +
    geom_point(data=jhu_dat) +
    geom_line(data=jhu_dat) +
    scale_color_brewer(type = "qual") +
    theme_bw() + xlab(NULL) +
    ggtitle("Incident deaths in Texas, observed and forecasted")
fcasts <- do_zoltar_query(zoltar_connection,
    project_url = covidhub_project_url,
    is_forecast_query = TRUE,
    models = c("COVIDhub-ensemble"),
    targets = paste(1:4, "wk ahead inc death"),
    units = "48", ## FIPS code for Texas
    types = c("quantile"),
    timezeros = seq.Date(as.Date("2020-06-01"), as.Date("2020-10-05"), by="28 days"))
## # A tibble: 368 x 6
## model timezero unit target quantile value
## <chr> <date> <chr> <chr> <dbl> <dbl>
## 1 COVIDhub-ensemble 2020-06-29 48 1 wk ahead inc death 0.01 153.
## 2 COVIDhub-ensemble 2020-06-29 48 1 wk ahead inc death 0.025 163.
## 3 COVIDhub-ensemble 2020-06-29 48 1 wk ahead inc death 0.05 173.
## 4 COVIDhub-ensemble 2020-06-29 48 1 wk ahead inc death 0.1 187.
## 5 COVIDhub-ensemble 2020-06-29 48 1 wk ahead inc death 0.15 197.
## 6 COVIDhub-ensemble 2020-06-29 48 1 wk ahead inc death 0.2 205.
## 7 COVIDhub-ensemble 2020-06-29 48 1 wk ahead inc death 0.25 213.
## 8 COVIDhub-ensemble 2020-06-29 48 1 wk ahead inc death 0.3 220.
## 9 COVIDhub-ensemble 2020-06-29 48 1 wk ahead inc death 0.35 227.
## 10 COVIDhub-ensemble 2020-06-29 48 1 wk ahead inc death 0.4 234.
## # … with 358 more rows
## add dates and pivot to wide-form data
fcasts_wide <- fcasts %>%
    filter(quantile %in% c(0.025, 0.1, 0.25, 0.5, 0.75, 0.9, 0.975)) %>%
    mutate(week_ahead = as.numeric(substr(target, 1, 2)),
           target_end_date = get_next_saturday(timezero + 7*(week_ahead-1))) %>%
    ## one column per quantile level: q0.025, q0.1, ..., q0.975
    pivot_wider(names_from = quantile, names_prefix="q")
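The seven quantile levels kept above are the median plus the endpoints of the central 50%, 80%, and 95% prediction intervals, which the plot draws as ribbons. A short sketch of that mapping from interval level to quantile pair; the helper name interval_quantiles is hypothetical:

```r
# Hypothetical helper: lower and upper quantile levels of a central
# prediction interval with the given nominal coverage.
interval_quantiles <- function(level) {
  alpha <- 1 - level
  # round() guards against floating-point noise introduced by 1 - level
  round(c(lower = alpha / 2, upper = 1 - alpha / 2), 3)
}

interval_levels <- c(0.5, 0.8, 0.95)

# One row per interval, with columns "lower" and "upper".
bounds <- t(sapply(interval_levels, interval_quantiles))
rownames(bounds) <- paste0(interval_levels * 100, "%")
print(bounds)

# Median plus all interval endpoints, in increasing order.
kept <- sort(unique(c(0.5, bounds)))
print(kept)
```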
## plot the data!
ggplot(fcasts_wide, aes(x=target_end_date)) +
    geom_line(aes(y=q0.5, color=timezero, group=timezero)) +
    geom_ribbon(aes(ymin=q0.1, ymax=q0.9, fill=timezero, group=timezero), alpha=.3) +
    geom_ribbon(aes(ymin=q0.025, ymax=q0.975, fill=timezero, group=timezero), alpha=.3) +
    geom_ribbon(aes(ymin=q0.25, ymax=q0.75, fill=timezero, group=timezero), alpha=.3) +
    geom_point(data=jhu_dat, aes(y=value)) +
    geom_line(data=jhu_dat, aes(y=value)) +
    theme_bw() + xlab(NULL) +
    theme(legend.position = "none") + ylab("incident deaths") +
    ggtitle("Incident deaths in Texas, observed and forecasted")
For Python users, zoltpy enables API access.
Examples similar to the above are in this notebook.