Zoltar: a forecast repository

brought to you by the Reich Lab Zoltar Development Team:
Matt Cornell, Khoa Le, Abdul Hannan Kanji, Katie House,
Yuxin Huang, Evan Ray, Nick Reich
http://zoltardata.com




October 6, 2020

Zoltar: big picture goals

Zoltar is a research data repository that stores time-series forecasts made by external models and provides tools for programmatic data access and scoring.

Zoltar has been in development as a research tool since 2018. Focused COVID-oriented development over the last 6 months has made it a viable, if early-stage, “production” system.

We have a preprint describing the vision and general forecast data model: https://arxiv.org/abs/2006.03922.

Zoltar vs. GitHub

GitHub is not a sustainable, long-term architecture for large-scale forecast storage.

A structured database can provide systematic access to just the pieces of the data that you need.

Zoltar “ecosystem”

All aspects of the project are open-source (contributions and feature requests welcome!).

Quick web tour

3 steps to getting set up with Zoltar

  1. Request an account
  2. Install zoltr (R) and/or zoltpy (Python)
devtools::install_github("reichlab/zoltr")
pip install git+https://github.com/reichlab/zoltpy/
  3. Set up authentication using environment variables, either for R specifically or system-wide for both R and Python.
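
One way to confirm the authentication setup before connecting is to check that the credentials are visible to the R session. The sketch below assumes the variable names Z_USERNAME and Z_PASSWORD (the names read by Sys.getenv() in the demo code) and that they were set either in a per-user ~/.Renviron file (R-specific) or in the shell profile (system-wide). The helper name missing_credentials is hypothetical.

```r
# Hypothetical helper: report which credential variables are unset or empty.
# Z_USERNAME and Z_PASSWORD are assumed names, matching the demo's Sys.getenv() calls.
missing_credentials <- function(vars = c("Z_USERNAME", "Z_PASSWORD")) {
  values <- Sys.getenv(vars, unset = "", names = FALSE)
  vars[values == ""]
}

unset <- missing_credentials()
if (length(unset) > 0) {
  message("Not set in this session: ", paste(unset, collapse = ", "))
} else {
  message("Zoltar credentials found.")
}
```

In ~/.Renviron, each variable goes on its own line as NAME=value. R reads the file at startup, so restart the session after editing it.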

Using data from the Zoltar API (demo)

The Zoltar API allows you to access forecast data programmatically (without having to read the whole repository) for evaluation, visualization or ensemble building.

The first step is always to establish a connection with the Zoltar server.

library(zoltr)
zoltar_connection <- new_connection()
zoltar_authenticate(zoltar_connection, Sys.getenv("Z_USERNAME"), Sys.getenv("Z_PASSWORD"))

The Zoltar API follows RESTful design principles. As a result, every resource is associated with, and accessed via, a unique URL. For example, the URL for the COVID-19 Forecast Hub project is:

covidhub_project_url <- "https://www.zoltardata.com/api/project/44/"
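
Because every resource has its own URL, a project URL can be looked up by name instead of being hard-coded. The sketch below is a minimal illustration on a made-up data frame with name and url columns. The helper project_url_by_name, the project names and the example.org URLs are all hypothetical.

```r
# Toy stand-in for a project listing; every name and URL here is made up.
toy_projects <- data.frame(
  name = c("Example project A", "Example project B"),
  url  = c("https://example.org/api/project/1/",
           "https://example.org/api/project/2/"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the URL of the single project with this exact name.
project_url_by_name <- function(projects_df, project_name) {
  hits <- projects_df$url[projects_df$name == project_name]
  if (length(hits) != 1) {
    stop("expected exactly one project named '", project_name,
         "', found ", length(hits))
  }
  hits
}

project_url_by_name(toy_projects, "Example project A")
```

With a live connection, the listing would presumably come from zoltr's projects(zoltar_connection). Treat that call and its name/url columns as an assumption to check against the zoltr documentation.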

Example 1: comparing multiple forecasts

  1. Submit a query to the API
fcasts <- do_zoltar_query(zoltar_connection, 
    project_url =  covidhub_project_url,
    is_forecast_query = TRUE,
    models = c("MOBS-GLEAM_COVID", "IHME-CurveFit", "COVIDhub-ensemble", "COVIDhub-baseline"), 
    targets = paste(1:20, "wk ahead inc death"),
    units = "48", ## FIPS code for Texas
    types = c("point"), ## only retrieving point forecasts
    timezeros = "2020-06-22")
dplyr::select(fcasts, model, timezero, unit, target, class, value)
## # A tibble: 32 x 6
##    model         timezero   unit  target                class value
##    <chr>         <date>     <chr> <chr>                 <chr> <dbl>
##  1 IHME-CurveFit 2020-06-22 48    1 wk ahead inc death  point  242.
##  2 IHME-CurveFit 2020-06-22 48    10 wk ahead inc death point  896.
##  3 IHME-CurveFit 2020-06-22 48    11 wk ahead inc death point 1033.
##  4 IHME-CurveFit 2020-06-22 48    12 wk ahead inc death point 1263.
##  5 IHME-CurveFit 2020-06-22 48    13 wk ahead inc death point 1510.
##  6 IHME-CurveFit 2020-06-22 48    14 wk ahead inc death point 1779.
##  7 IHME-CurveFit 2020-06-22 48    15 wk ahead inc death point 2062.
##  8 IHME-CurveFit 2020-06-22 48    16 wk ahead inc death point 2352.
##  9 IHME-CurveFit 2020-06-22 48    17 wk ahead inc death point 2642.
## 10 IHME-CurveFit 2020-06-22 48    18 wk ahead inc death point 2923.
## # … with 22 more rows

Example 1: comparing multiple forecasts (cont’d)

  2. Wrangle and plot the data. (We are working on additional functions to make this step easier!)
library(tidyverse)
library(covidcast)
library(MMWRweek)
source("https://raw.githubusercontent.com/reichlab/covid19-forecast-hub/master/code/processing-fxns/get_next_saturday.R")

## adding dates to week-ahead targets for easier plotting
fcasts <- fcasts %>%
    mutate(week_ahead = as.numeric(substr(target, 1, 2)), ## first two characters hold the horizon, e.g. "1 " or "10"
        target_end_date = get_next_saturday(timezero + 7*(week_ahead-1)))

## downloading truth data from covidcast
jhu_dat <- covidcast_signal(data_source = "jhu-csse", 
    signal ="deaths_incidence_num",
    start_day = "2020-04-01", end_day = "2020-10-03",
    geo_type = "state", geo_values = "tx") %>% 
    mutate(epiweek=MMWRweek(time_value)$MMWRweek) %>%
    group_by(epiweek) %>%
    summarize(value = sum(value)) %>%
    mutate(target_end_date = MMWRweek2Date(rep(2020, n()), epiweek, rep(7, n())),
        model="observed data (JHU)")

## plot the data!
ggplot(fcasts, aes(x=target_end_date, y=value, color=model)) +
    geom_point() + 
    geom_line() + 
    geom_point(data=jhu_dat) + 
    geom_line(data=jhu_dat) +
    scale_color_brewer(type = "qual") + 
    theme_bw() + xlab(NULL) +
    ggtitle("Incident deaths in Texas, observed and forecasted")
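
The date arithmetic in the wrangling step can also be sketched in base R, which makes the logic explicit: read the horizon from the target string, step forward that many weeks minus one from the timezero, then move to the Saturday that ends that week. The helpers parse_week_ahead, saturday_on_or_after and target_end_date_for are hypothetical names. saturday_on_or_after is a stand-in that is assumed, not verified, to behave like the sourced get_next_saturday().

```r
# Extract the leading integer horizon from targets such as "10 wk ahead inc death".
parse_week_ahead <- function(target) {
  as.integer(sub("^([0-9]+) wk ahead.*$", "\\1", target))
}

# Assumed stand-in for get_next_saturday(): the first Saturday on or after `date`.
saturday_on_or_after <- function(date) {
  date <- as.Date(date)
  wday <- as.POSIXlt(date)$wday      # 0 = Sunday, ..., 6 = Saturday
  date + (6 - wday) %% 7
}

# Saturday ending the week that a given target refers to.
target_end_date_for <- function(timezero, target) {
  saturday_on_or_after(as.Date(timezero) + 7 * (parse_week_ahead(target) - 1))
}

target_end_date_for("2020-06-22", c("1 wk ahead inc death", "10 wk ahead inc death"))
# [1] "2020-06-27" "2020-08-29"
```

Parsing the horizon with a regular expression avoids relying on the number occupying exactly the first two characters of the target name.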

Example 2: forecasts from one model over time

  1. Submit a query to the API
fcasts <- do_zoltar_query(zoltar_connection, 
        project_url =  covidhub_project_url,
        is_forecast_query = TRUE,
        models = c("COVIDhub-ensemble"), 
        targets = paste(1:4, "wk ahead inc death"),
        units = "48", ## FIPS code for Texas
        types = c("quantile"),
        timezeros = seq.Date(as.Date("2020-06-01"), as.Date("2020-10-05"), by="28 days")) 
dplyr::select(fcasts, model, timezero, unit, target, quantile, value)
## # A tibble: 368 x 6
##    model             timezero   unit  target               quantile value
##    <chr>             <date>     <chr> <chr>                   <dbl> <dbl>
##  1 COVIDhub-ensemble 2020-06-29 48    1 wk ahead inc death    0.01   153.
##  2 COVIDhub-ensemble 2020-06-29 48    1 wk ahead inc death    0.025  163.
##  3 COVIDhub-ensemble 2020-06-29 48    1 wk ahead inc death    0.05   173.
##  4 COVIDhub-ensemble 2020-06-29 48    1 wk ahead inc death    0.1    187.
##  5 COVIDhub-ensemble 2020-06-29 48    1 wk ahead inc death    0.15   197.
##  6 COVIDhub-ensemble 2020-06-29 48    1 wk ahead inc death    0.2    205.
##  7 COVIDhub-ensemble 2020-06-29 48    1 wk ahead inc death    0.25   213.
##  8 COVIDhub-ensemble 2020-06-29 48    1 wk ahead inc death    0.3    220.
##  9 COVIDhub-ensemble 2020-06-29 48    1 wk ahead inc death    0.35   227.
## 10 COVIDhub-ensemble 2020-06-29 48    1 wk ahead inc death    0.4    234.
## # … with 358 more rows

Example 2: forecasts from one model over time (cont’d)

  2. Wrangle and plot the data
## add dates and pivot to wide-form data
fcasts_wide <- fcasts %>%
    filter(quantile %in% c(0.025, 0.1, 0.25, 0.5, 0.75, 0.9, 0.975)) %>%
    mutate(week_ahead = as.numeric(substr(target, 1, 2)), ## first two characters hold the horizon, e.g. "1 " or "4 "
        target_end_date = get_next_saturday(timezero + 7*(week_ahead-1))) %>%
    pivot_wider(names_from = quantile, values_from = value, names_prefix = "q") ## one column per quantile: q0.025, ..., q0.975

## plot the data!
ggplot(fcasts_wide, aes(x=target_end_date)) +
    geom_line(aes(y=q0.5, color=timezero, group=timezero)) + 
    geom_ribbon(aes(ymin=q0.1, ymax=q0.9, fill=timezero, group=timezero), alpha=.3) +
    geom_ribbon(aes(ymin=q0.025, ymax=q0.975, fill=timezero, group=timezero), alpha=.3) +
    geom_ribbon(aes(ymin=q0.25, ymax=q0.75, fill=timezero, group=timezero), alpha=.3) +
    geom_point(data=jhu_dat, aes(y=value)) + 
    geom_line(data=jhu_dat, aes(y=value)) +
    theme_bw() + xlab(NULL) +
    theme(legend.position = "none") + ylab("incident deaths") +
    ggtitle("Incident deaths in Texas, observed and forecasted")
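
Each ribbon in the plot is bounded by a symmetric pair of quantiles, so it can be read as a central prediction interval whose nominal level is the difference between the two quantile levels. A small sketch of that mapping follows; the data frame name ribbons is hypothetical.

```r
# Quantile pairs used for the three ribbons in the plot above.
ribbons <- data.frame(
  lower = c(0.25, 0.1, 0.025),
  upper = c(0.75, 0.9, 0.975)
)

# Nominal level of each central interval, in percent.
ribbons$level_pct <- round(100 * (ribbons$upper - ribbons$lower), 1)
ribbons
```

The three rows correspond to 50%, 80% and 95% intervals, from the narrowest ribbon to the widest.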

Using zoltpy

For Python users, zoltpy enables API access.

Examples similar to those above are available in this notebook.

Some underused features in Zoltar (as of now)

Next steps