Zoltar: big picture goals

Zoltar is a research data repository that stores time-series forecasts made by external models and provides tools for programmatic data access and scoring.

Zoltar has been in development as a research tool since 2018. Focused COVID-19-oriented development over the last 6 months has made it a viable, if early-stage, “production” system.

We have a preprint describing the vision and general forecast data model: https://arxiv.org/abs/2006.03922.

Zoltar vs. GitHub

GitHub is not a sustainable, long-term architecture for large-scale forecast storage.

No internal structure to the data/file storage.
Space restrictions may become a limiting factor. (Current repo size: ~15GB)

A structured database can provide systematic access to just the pieces of the data that you need.

Zoltar “ecosystem”

Zoltar website: http://zoltardata.com
zoltr R package: http://reichlab.io/zoltr/
zoltpy Python library: https://github.com/reichlab/zoltpy

All aspects of the project are open-source (contributions and feature requests welcome!).

Quick web tour

3 steps to getting set up with Zoltar

Request an account
Install zoltr and/or zoltpy

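## R: install zoltr from GitHub (requires the devtools package)
devtools::install_github("reichlab/zoltr")

## shell: install zoltpy from GitHub
pip install git+https://github.com/reichlab/zoltpy/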

Set up authentication by storing your Zoltar username and password in environment variables, either for R only or system-wide so that both R and Python can read them.
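
As a minimal sketch of the third step, assuming the credentials live in environment variables named Z_USERNAME and Z_PASSWORD (the names used later in this demo), the check below confirms that R can see them before any connection is attempted. The helper name `zoltar_credentials_set` is hypothetical and not part of zoltr. The variables would typically be defined in a `~/.Renviron` file (R only) or in a shell profile (system-wide).

```r
## Hypothetical helper (not part of zoltr): TRUE only if both credential
## environment variables are set to non-empty strings.
zoltar_credentials_set <- function(vars = c("Z_USERNAME", "Z_PASSWORD")) {
    all(nzchar(Sys.getenv(vars)))
}

if (!zoltar_credentials_set()) {
    message("Set Z_USERNAME and Z_PASSWORD, e.g. in ~/.Renviron, then restart R.")
}
```

Keeping credentials in environment variables means they never need to appear in analysis scripts.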

Using data from the Zoltar API (demo)

The Zoltar API allows you to access forecast data programmatically (without having to read the whole repository) for evaluation, visualization or ensemble building.

The first step is always to establish a connection with the Zoltar server.

library(zoltr)
zoltar_connection <- new_connection()
zoltar_authenticate(zoltar_connection, Sys.getenv("Z_USERNAME"), Sys.getenv("Z_PASSWORD"))

The Zoltar API follows RESTful design principles. As a result, each resource is associated with, and accessed via, a unique URL. For example, the URL for the COVID-19 Forecast Hub project is:

covidhub_project_url <- "https://www.zoltardata.com/api/project/44/"
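
To illustrate the one-URL-per-resource idea, the sketch below composes a project URL from a host and a numeric project id, and recovers the id from a URL. Both helpers are hypothetical and not part of zoltr. Only the `/api/project/<id>/` pattern is taken from the URL above, and no other endpoint paths are assumed.

```r
## Hypothetical helper: build a project URL from its numeric id.
project_url_sketch <- function(project_id, host = "https://www.zoltardata.com") {
    paste0(host, "/api/project/", project_id, "/")
}

## Hypothetical helper: pull the numeric id back out of a project URL.
project_id_sketch <- function(url) {
    as.integer(sub(".*/api/project/([0-9]+)/$", "\\1", url))
}

project_url_sketch(44)
## [1] "https://www.zoltardata.com/api/project/44/"
project_id_sketch("https://www.zoltardata.com/api/project/44/")
## [1] 44
```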

Example 1: comparing multiple forecasts

Submit a query to the API

fcasts <- do_zoltar_query(zoltar_connection, 
    project_url =  covidhub_project_url,
    is_forecast_query = TRUE,
    models = c("MOBS-GLEAM_COVID", "IHME-CurveFit", "COVIDhub-ensemble", "COVIDhub-baseline"), 
    targets = paste(1:20, "wk ahead inc death"),
    units = "48", ## FIPS code for Texas
    types = c("point"), ## only retrieving point forecasts
    timezeros = "2020-06-22")

dplyr::select(fcasts, model, timezero, unit, target, class, value)

## # A tibble: 32 x 6
##    model         timezero   unit  target                class value
##    <chr>         <date>     <chr> <chr>                 <chr> <dbl>
##  1 IHME-CurveFit 2020-06-22 48    1 wk ahead inc death  point  242.
##  2 IHME-CurveFit 2020-06-22 48    10 wk ahead inc death point  896.
##  3 IHME-CurveFit 2020-06-22 48    11 wk ahead inc death point 1033.
##  4 IHME-CurveFit 2020-06-22 48    12 wk ahead inc death point 1263.
##  5 IHME-CurveFit 2020-06-22 48    13 wk ahead inc death point 1510.
##  6 IHME-CurveFit 2020-06-22 48    14 wk ahead inc death point 1779.
##  7 IHME-CurveFit 2020-06-22 48    15 wk ahead inc death point 2062.
##  8 IHME-CurveFit 2020-06-22 48    16 wk ahead inc death point 2352.
##  9 IHME-CurveFit 2020-06-22 48    17 wk ahead inc death point 2642.
## 10 IHME-CurveFit 2020-06-22 48    18 wk ahead inc death point 2923.
## # … with 22 more rows

Example 1: comparing multiple forecasts (cont’d)

Wrangle and plot the data. (We are working on additional functions to make this step easier!)

library(tidyverse)
library(covidcast)
library(MMWRweek)
source("https://raw.githubusercontent.com/reichlab/covid19-forecast-hub/master/code/processing-fxns/get_next_saturday.R")

## adding dates to week-ahead targets for easier plotting
fcasts <- fcasts %>%
    mutate(week_ahead = as.numeric(substr(target, 1, 2)), ## horizon in weeks, from the start of the target name
        target_end_date = get_next_saturday(timezero + 7*(week_ahead-1)))

## downloading truth data from covidcast
jhu_dat <- covidcast_signal(data_source = "jhu-csse", 
    signal ="deaths_incidence_num",
    start_day = "2020-04-01", end_day = "2020-10-03",
    geo_type = "state", geo_values = "tx") %>% 
    mutate(epiweek=MMWRweek(time_value)$MMWRweek) %>%
    group_by(epiweek) %>%
    summarize(value = sum(value)) %>%
    mutate(target_end_date = MMWRweek2Date(rep(2020, n()), epiweek, rep(7, n())),
        model="observed data (JHU)")

## plot the data!
ggplot(fcasts, aes(x=target_end_date, y=value, color=model)) +
    geom_point() + 
    geom_line() + 
    geom_point(data=jhu_dat) + 
    geom_line(data=jhu_dat) +
    scale_color_brewer(type = "qual") + 
    theme_bw() + xlab(NULL) +
    ggtitle("Incident deaths in Texas, observed and forecasted")
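
The date arithmetic in the wrangling step can be checked on a few targets by hand. The sketch below uses a local stand-in for `get_next_saturday()`. It assumes that function returns the first Saturday on or after the given date, which is an assumption about the sourced helper rather than something shown here.

```r
## Hypothetical stand-in for get_next_saturday(): first Saturday on or after `date`.
next_saturday_sketch <- function(date) {
    date <- as.Date(date)
    ## as.POSIXlt()$wday counts days from Sunday (0) to Saturday (6)
    date + (6 - as.POSIXlt(date)$wday)
}

timezero <- as.Date("2020-06-22")   # a Monday
target <- c("1 wk ahead inc death", "2 wk ahead inc death", "10 wk ahead inc death")

## take the leading integer of the target name as the horizon in weeks
week_ahead <- as.numeric(sub(" wk ahead.*$", "", target))
target_end_date <- next_saturday_sketch(timezero + 7 * (week_ahead - 1))
target_end_date
## [1] "2020-06-27" "2020-07-04" "2020-08-29"
```

Under that assumption, a forecast made on Monday 2020-06-22 has its 1-week-ahead target ending on the Saturday of the same week.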

Example 2: forecasts from one model over time

Submit a query to the API

fcasts <- do_zoltar_query(zoltar_connection, 
        project_url =  covidhub_project_url,
        is_forecast_query = TRUE,
        models = c("COVIDhub-ensemble"), 
        targets = paste(1:4, "wk ahead inc death"),
        units = "48", ## FIPS code for Texas
        types = c("quantile"),
        timezeros = seq.Date(as.Date("2020-06-01"), as.Date("2020-10-05"), by="28 days"))

dplyr::select(fcasts, model, timezero, unit, target, quantile, value)

## # A tibble: 368 x 6
##    model             timezero   unit  target               quantile value
##    <chr>             <date>     <chr> <chr>                   <dbl> <dbl>
##  1 COVIDhub-ensemble 2020-06-29 48    1 wk ahead inc death    0.01   153.
##  2 COVIDhub-ensemble 2020-06-29 48    1 wk ahead inc death    0.025  163.
##  3 COVIDhub-ensemble 2020-06-29 48    1 wk ahead inc death    0.05   173.
##  4 COVIDhub-ensemble 2020-06-29 48    1 wk ahead inc death    0.1    187.
##  5 COVIDhub-ensemble 2020-06-29 48    1 wk ahead inc death    0.15   197.
##  6 COVIDhub-ensemble 2020-06-29 48    1 wk ahead inc death    0.2    205.
##  7 COVIDhub-ensemble 2020-06-29 48    1 wk ahead inc death    0.25   213.
##  8 COVIDhub-ensemble 2020-06-29 48    1 wk ahead inc death    0.3    220.
##  9 COVIDhub-ensemble 2020-06-29 48    1 wk ahead inc death    0.35   227.
## 10 COVIDhub-ensemble 2020-06-29 48    1 wk ahead inc death    0.4    234.
## # … with 358 more rows
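
The `targets` and `timezeros` arguments in the query are ordinary R vectors, so they can be inspected before a query is submitted. This sketch only evaluates the same base R expressions used above and does not contact the server.

```r
## target names for horizons 1 to 4
targets <- paste(1:4, "wk ahead inc death")
targets
## [1] "1 wk ahead inc death" "2 wk ahead inc death" "3 wk ahead inc death"
## [4] "4 wk ahead inc death"

## forecast dates spaced four weeks apart
timezeros <- seq.Date(as.Date("2020-06-01"), as.Date("2020-10-05"), by = "28 days")
timezeros
## [1] "2020-06-01" "2020-06-29" "2020-07-27" "2020-08-24" "2020-09-21"
```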

Example 2: forecasts from one model over time (cont’d)

Wrangle and plot the data

## add dates and pivot to wide-form data
fcasts_wide <- fcasts %>%
    filter(quantile %in% c(0.025, 0.1, 0.25, 0.5, 0.75, 0.9, 0.975)) %>%
    mutate(week_ahead = as.numeric(substr(target, 1, 2)), ## horizon in weeks, from the start of the target name
        target_end_date = get_next_saturday(timezero + 7*(week_ahead-1))) %>%
    pivot_wider(names_from = quantile, names_prefix="q")

## plot the data!
ggplot(fcasts_wide, aes(x=target_end_date)) +
    geom_line(aes(y=q0.5, color=timezero, group=timezero)) + 
    geom_ribbon(aes(ymin=q0.1, ymax=q0.9, fill=timezero, group=timezero), alpha=.3) +
    geom_ribbon(aes(ymin=q0.025, ymax=q0.975, fill=timezero, group=timezero), alpha=.3) +
    geom_ribbon(aes(ymin=q0.25, ymax=q0.75, fill=timezero, group=timezero), alpha=.3) +
    geom_point(data=jhu_dat, aes(y=value)) + 
    geom_line(data=jhu_dat, aes(y=value)) +
    theme_bw() + xlab(NULL) +
    theme(legend.position = "none") + ylab("incident deaths") +
    ggtitle("Incident deaths in Texas, observed and forecasted")
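
To see what the pivot step produces, the sketch below applies the same kind of `pivot_wider()` call to three rows copied from the query result above, with values as rounded in the printed output. The data frame is typed in by hand for illustration, so it stands in for `fcasts` rather than reproducing it.

```r
library(tidyr)

## three quantile rows for one model/timezero/target, values as printed above
toy <- data.frame(
    model = "COVIDhub-ensemble",
    timezero = as.Date("2020-06-29"),
    target = "1 wk ahead inc death",
    quantile = c(0.025, 0.1, 0.25),
    value = c(163, 187, 213)
)

## one column per quantile level, named q<level>
toy_wide <- pivot_wider(toy, names_from = quantile, values_from = value,
    names_prefix = "q")
names(toy_wide)
## [1] "model"    "timezero" "target"   "q0.025"   "q0.1"     "q0.25"
```

The resulting `q...` columns are what the `geom_ribbon()` calls refer to as `ymin` and `ymax`.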

Using zoltpy

For Python users, the zoltpy library provides API access.

Examples similar to those above are available in this notebook.

Some underused features in Zoltar (as of now)

different forecast representations, e.g., distributions can be represented by samples or parametric densities
programmatic pushing of forecasts into Zoltar by teams (this can be part of the model workflow; right now, forecasts are pushed automatically from GitHub every 6 hours)
accessing scores directly from Zoltar (also available via the API)
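
As a rough illustration of the first item, the sketch below summarizes one predictive distribution held in two different representations: a set of samples, and a parametric family with its parameters. The numbers are made up, the normal family is chosen only for convenience, and nothing here uses Zoltar's own data format or API.

```r
set.seed(1)
probs <- c(0.025, 0.5, 0.975)

## (a) sample representation: made-up draws standing in for a model's samples
samples <- rnorm(1000, mean = 200, sd = 30)
q_from_samples <- quantile(samples, probs = probs, names = FALSE)

## (b) parametric representation: a distribution family plus its parameters
q_from_params <- qnorm(probs, mean = 200, sd = 30)

## both representations can be reduced to the same quantile levels
round(cbind(probs, q_from_samples, q_from_params), 1)
```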

Zoltar: a forecast repository