---
config:
  theme: neutral
  themeVariables:
    fontFamily: "Arial, sans-serif"
    lineColor: "#555555"
    edgeLabelBackground: "#FFFFFF"
    primaryTextColor: "#333333"
  flowchart:
    useMaxWidth: true
    wrappingWidth: 420
    nodeSpacing: 40
    rankSpacing: 50
---
flowchart TB
A(["Data Producers"])
B1["<b>Layer 1 · EDI Archive</b><br/>Field and monitoring data published as versioned, DOI-bearing packages with EML metadata"]
B2["<b>Layer 1 · Alternative Archive (TBD)</b><br/>Complex data objects (e.g., EO imagery) requiring a DOI-minting repository appropriate to data type"]
C["<b>Layer 2 · Ingestion & Standardization</b><br/>Automated pipelines harvest, validate, and standardize to the HRL data model"]
D["<b>Layer 3 · Storage & Serving</b><br/>Analysis-ready products in open formats in a central, queryable cloud store"]
E["<b>Layer 4 · Access & Applications</b><br/>Data reaches synthesis teams and the public via R packages, Python libraries, discovery-friendly data catalogs with download options, and dashboards"]
A --> B1
A --> B2
B1 -->|"harvest & validate"| C
B2 -->|"harvest & validate"| C
C -->|"standardize & store"| D
D -->|"query & download"| E
classDef producer fill:#E3EFF5,stroke:#2E7DA1,stroke-width:2px,color:#0C425C
classDef l1 fill:#EDF5F0,stroke:#2E6E3D,stroke-width:2px,color:#1A4A2A
classDef l2 fill:#FEF9ED,stroke:#B07820,stroke-width:2px,color:#4A3000
classDef l3 fill:#F1EDF8,stroke:#6A4A9A,stroke-width:2px,color:#3A1E6A
classDef l4 fill:#FDEEEC,stroke:#A03428,stroke-width:2px,color:#5C1A10
class A producer
class B1,B2 l1
class C l2
class D l3
class E l4
HRL Data Infrastructure Proposal
Healthy Rivers and Landscapes Science Program
This document is currently published in draft form and should not be cited as authoritative HRL policy.
Executive Summary
The HRL Science Program is committed to open, reproducible science in an eight-year, multi-agency program that will generate hundreds of datasets, analyses, and decision-support products. That commitment requires technical infrastructure to support open data and synthesis science: a system that can receive data from a federated network of producers, standardize and integrate it reliably, store it in a durable and accessible way, and make it discoverable and available to synthesis teams, external stakeholders, and the general public.
This document proposes a phased data architecture to deliver on those commitments. The architecture is designed around three core realities of the HRL context:
- The team is small now and must grow. The HRL Science Program’s data science and engineering capacity is currently extremely limited. The architecture must be buildable by a small team—perhaps even a single person with a broad skillset—while providing a credible path for development and maturation as capacity grows.
- The program is inherently interagency. Partners will both contribute and consume data. The infrastructure must be cloud-neutral, built on open standards, and governed by clear data agreements rather than technical lock-in and ad hoc decisions about data publication, access, and reuse.
- The data landscape is heterogeneous and may include large spatial datasets. The architecture must handle tabular monitoring data, geospatial products, and potentially large raster datasets (LiDAR, remote sensing) without requiring separate, siloed systems for each.
The proposed approach is a four-layer model—source and publication, ingestion and standardization, storage and serving, and access and applications—implemented in three phases over the program lifetime. The proposed Year 1 deliverable is a set of functional, reproducible ingestion pipelines for the first wave of HRL datasets and initial storage solutions, paired with a metadata catalog that makes those datasets discoverable and queryable by partners.
Achieving this requires investment in two things the program currently lacks: dedicated technical capacity and basic cloud infrastructure. The recommended Year 1 staffing ask is one data engineer and one scientific data scientist. The staffing and infrastructure costs are low relative to program scale.
This document also surfaces several consequential decisions that require HRL program leadership engagement or endorsement. The most important of these are not technical choices—they are governance and resource decisions:
- Governance and technical leadership. There is currently no established structure for funding, building, hosting, or maintaining the program’s shared data infrastructure and no defined process for making technical decisions—from strategic choices (what technology stack to adopt) to operational ones (how a specific pipeline or API is implemented). Both gaps need to be resolved before substantive technical work can begin. This includes clarity on which agency or body holds decision-making authority, where the infrastructure lives institutionally, and how technical leadership is accountable to program leadership.
- The publication contract. Before any dataset enters the shared system, it must be published to a public archive with standardized metadata and tracked provenance. Common data standards may also need to be established to facilitate later dataset integration. Establishing and enforcing this requirement across all contributing agencies is the single most important thing the Science Committee can do to make the architecture function.
- Dedicated staffing. The architecture cannot be built as a side responsibility of existing staff. Hiring data engineers and data scientists is a prerequisite, not an optional extra.
- Cloud storage and large-data archiving. The program needs a shared, public data store and a clear answer for where large or complex data objects (LiDAR, remote sensing imagery, model outputs) will be archived. These decisions are not blocking for early work, but must be made before the first spatial or large-data pipelines are built.
- Public access. The program has committed to open science. This proposal operationalizes that commitment through a public data catalog with download links and an R package that synthesis teams can use to discover and download data without knowing anything about the underlying infrastructure. What access looks like for non-R users and the general public beyond this is an open question for later phases.
Program Context & Requirements
The core challenge
HRL faces a structural tension that is common to large interagency monitoring programs but rarely addressed explicitly: data production is federated, but data use and synthesis science require centralization. Field teams, tributaries/system governance entities, and partner agencies will collect and publish data independently, using their own methods, systems, and timelines, with methodological comparability tracked by the HRL Science Committee. Synthesis teams and adaptive managers, however, need data that are consistent and integrated across sites, time periods, and data types.
Bridging that gap—turning federated outputs into integrated, analysis-ready products discoverable across the program—is the core job of the data infrastructure described here. Without deliberate investment in this bridge, the program risks ending up with a pile of well-documented individual datasets that nobody can practically use for cross-site synthesis and that fail to meet the true spirit of open science that HRL has adopted.
The HRL data landscape
The program will generate and use a wide variety of data types, each with distinct storage, processing, and access requirements:
- Tabular monitoring data: time series of flow, water quality, fish counts, invertebrate surveys, and similar field measurements. High volume, regular cadence, generally well-understood formats.
- Geospatial vector data: habitat mapping, restoration site boundaries, species occurrence, network features. Require consistent coordinate reference systems and topology standards.
- Large raster and point cloud data: LiDAR-derived terrain models, remote sensing imagery. Potentially very large; require cloud-native formats and specialized processing pipelines. This is the most technically demanding data type and warrants specific architectural attention.
- Model outputs and derived products: habitat suitability models, hydrologic model outputs, synthesis data products. Variable structure; must be versioned and linked to the code and inputs that produced them.
- Metadata and documentation: EML packages, data dictionaries, schemas, provenance records. Must travel with data through every stage of the lifecycle.
Design principles
The infrastructure design is shaped by HRL’s commitments to FAIR and CARE principles, reproducible science, and open data. Translated into infrastructure design, those commitments produce four guiding principles:
- Open by default. Open-source tools, open standards, and open formats at every layer where feasible. Vendor-specific dependencies are introduced only where they provide clear, justified value.
- Interoperable, not integrated. The architecture should not require partners to adopt HRL-specific systems. Data enters through standard interfaces (EDI publication, documented schemas) and exits through standard interfaces (APIs, R package, open formats).
- Phased and maintainable. Each phase should be independently useful and maintainable by the team that exists at that time, not dependent on the team that is hoped for.
- Reproducibility is non-negotiable. Every pipeline, transformation, and derived product must be scripted, open-source, versioned, and executable without manual intervention.
Infrastructure Overview
The proposed infrastructure is organized into four layers, each corresponding to a stage in the HRL data lifecycle. Data flows from producers through each layer, culminating in analysis-ready products accessible to synthesis teams and external stakeholders.
Layer 1 — Source and Static Publication
Data enters the HRL system from a distributed network of producers. The program cannot and should not control what systems producers use internally. What it can and must establish is a clear publication contract: before any dataset is eligible for ingestion into the HRL platform, it must exist as a versioned, DOI-bearing package in a stable public archive with compliant metadata. For field and monitoring data, that archive is EDI with EML metadata. For data types that do not fit EDI’s model, the appropriate archive is an open question (see below).
This is a governance decision as much as a technical one, and it’s the single most important thing the Science Committee can do to simplify the architecture. It shifts the burden of data formatting upstream (to the people who best understand their own data) while providing the Central Data Team with a consistent interface to harvest from.
EDI (Environmental Data Initiative) is well-established in the freshwater ecology community; supports EML metadata (the most detailed, standardized metadata specification for ecological data); provides free, stable DOIs; and has R tooling (EDIutils, EMLassemblyline) that makes publication and harvesting scriptable. These considerations make it the right choice given HRL’s field and monitoring data types and community context.
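To illustrate how scriptable the harvesting side of this interface is, the sketch below pulls one data entity and its metadata from an EDI package with EDIutils; the package identifier is a placeholder, not a real HRL package, and the entity is assumed to be a CSV table.

```r
# Minimal sketch of a scripted EDI harvest; the package identifier is a placeholder.
library(EDIutils)
library(readr)

package_id <- "edi.1234.1"   # scope.identifier.revision (placeholder)

# List the data entities (tables) published in the package, then download one.
entities  <- read_data_entity_names(package_id)
raw_bytes <- read_data_entity(package_id, entities$entityId[1])

# readr accepts a raw vector, so no temporary file is needed.
dat <- read_csv(raw_bytes, show_col_types = FALSE)

# The package-level EML metadata is harvested alongside the data for provenance.
eml <- read_metadata(package_id)
```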
- Where should large or complex data objects be archived? EDI may not be appropriate for Earth observation products, LiDAR point clouds, or large model outputs. The CNRA Open Data Portal does not mint DOIs and therefore cannot be the primary archive location. Candidates include Zenodo, Dryad, and domain-specific repositories. Is a single alternative archive sufficient, or should the choice be data-type-specific? What location(s) should be used?
- How will real-time or near-real-time sensor data be handled? Continuous monitoring feeds (e.g., CDEC, in-situ sensor networks) do not fit EDI’s data-package publication model. Does the publication contract require a batched EDI snapshot at a defined frequency, or is a different path needed for streaming or high-frequency data?
- What is the DOI and archiving requirement for derived products and synthesis outputs? Should program-produced synthesis data products also go through EDI, or is a different archive appropriate? This determines whether the publication contract applies only to raw monitoring data or to all program outputs.
Layer 2 — Ingestion and Standardization
This is the most technically demanding layer to build and the Year 1 focus. Ingestion pipelines are responsible for:
- Harvesting newly published or updated datasets from EDI and other archive repositories
- Parsing metadata and extracting provenance information
- Validating incoming data against HRL program schemas (e.g., column names, units, coordinate reference systems, and allowed values)
- Reshaping and standardizing data to conform to HRL’s program-wide data model
- Writing outputs to cloud storage in open, analysis-ready formats (e.g., Parquet for tabular data, GeoPackage or Cloud-Optimized GeoTIFF for spatial data)
- Producing provenance records linking every output to its source EDI package, input version, and pipeline commit hash
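As a concrete illustration of the last item above, a provenance record can be as simple as a small JSON document written next to each output; the field names below are illustrative placeholders, not a settled provenance schema.

```r
# Illustrative provenance record written alongside each analysis-ready output.
# Field names and values are placeholders; the actual provenance schema is a design decision.
library(jsonlite)

provenance <- list(
  output_file     = "fish_counts/2024/fish_counts.parquet",
  source_package  = "edi.1234.2",             # EDI package ID and revision (placeholder)
  pipeline_repo   = "ingest-fish-counts",     # pipeline repository name (placeholder)
  pipeline_commit = system("git rev-parse HEAD", intern = TRUE),
  ingested_at     = format(Sys.time(), tz = "UTC", usetz = TRUE)
)

write_json(provenance, "fish_counts/2024/fish_counts_provenance.json",
           auto_unbox = TRUE, pretty = TRUE)
```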
Routine ingestion jobs will run automatically on GitHub Actions, triggered by new EDI publications or on a schedule; the GitHub Actions free tier is sufficient for most tabular and vector workloads. Compute-intensive jobs (LiDAR processing, large remote sensing stacks, or multi-year raster time series) will likely exceed what GitHub Actions can handle cost-effectively and will require dedicated cloud compute resources (e.g., Azure Batch, AWS Batch, or on-demand VMs). Containerization ensures reproducibility across environments and makes pipelines portable across compute substrates: the same container that runs a pipeline in CI can run it on a cloud VM or HPC node without modification. The orchestration framework, container tooling, and cloud compute platform have not yet been selected; leading candidates are described in the decisions below.
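To make the shape of one such ingestion pipeline concrete, here is a minimal sketch using targets, one of the candidate orchestration frameworks weighed in the decisions below; the EDI package ID and the helper functions (latest_edi_revision(), harvest_edi(), validate_schema(), standardize(), write_parquet_output()) are hypothetical.

```r
# _targets.R: minimal sketch of a single ingestion pipeline with the targets package.
# The helper functions and the EDI package ID are hypothetical placeholders.
library(targets)

tar_option_set(packages = c("EDIutils", "dplyr", "arrow"))

list(
  # Track the current EDI revision so downstream targets rebuild when a new
  # version of the source package is published.
  tar_target(edi_revision, latest_edi_revision("edi.1234")),
  tar_target(raw_data,     harvest_edi("edi.1234", edi_revision)),
  tar_target(validated,    validate_schema(raw_data, schema = "fish_counts")),
  tar_target(standardized, standardize(validated)),
  tar_target(parquet_path, write_parquet_output(standardized), format = "file")
)
```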
Schema design, i.e., defining the program-wide data model to which ingestion pipelines conform incoming data, is the most intellectually demanding work in this layer. It requires deep domain knowledge, cross-agency buy-in or strong delegated authority, and careful handling of the tension between standardization and the legitimate variability across HRL study designs.
- What pipeline orchestration framework should be used? This is a foundational decision that shapes the entire Layer 2 architecture. How likely is the pipeline language mix to expand beyond R? Will any pipelines need to run on HPC or cloud compute beyond GitHub Actions? Is there interagency tooling precedent that should inform the choice? Leading candidates:
  - targets — R-native, excellent dependency graph and caching, reproducibility-focused, well-matched to an R-first team. Lower learning curve for R practitioners; less suited to polyglot or HPC workloads.
  - Nextflow — polyglot DSL (runs R, Python, shell), widely adopted in computational biology and genomics, strong cloud and HPC support, built-in provenance tracking. Steeper learning curve; potentially overkill if pipelines remain entirely R-based.
  - Snakemake — Python-based but language-agnostic; popular in bioinformatics; similar dependency-graph model to targets. Less natural for an R-first team than targets, but more portable than either if the language mix broadens. Not as user-friendly as Nextflow.
- What analysis-ready open format should be used for vector spatial data? GeoPackage is commonly used, but GeoParquet (Apache Parquet with a geospatial extension) is emerging as a more cloud-native alternative with first-class interoperability with the Arrow/DuckDB ecosystem. The choice has implications for how synthesis teams query spatial data and which tooling is required. This should be decided before the first spatial pipeline is built.
- What format should be used for raster analysis-ready products? Cloud-Optimized GeoTIFF (COG) is commonly used. Zarr is gaining traction for multidimensional raster data (e.g., time-series remote sensing stacks). What is the expected raster data volume and dominant use pattern? Will synthesis teams primarily download full scenes, or do sub-region/time-slice queries need to be fast?
- What coordinate reference system(s) will HRL spatial products be standardized to? All output spatial data should share a consistent CRS. What is the right choice for the Sacramento Valley and Bay-Delta context? (Candidates include EPSG:4326 for maximum interoperability and EPSG:3310 California Albers for locally appropriate metric units.)
- What schema validation framework will be used? Options include R packages (e.g., pointblank, validate), Frictionless Data specifications, and JSON Schema. The choice determines what the validation error reports look like and how easily producers can debug failing submissions; a brief sketch of one candidate appears after this list.
- Will pipelines be dataset-specific or generalized by data type? A single highly parameterized pipeline template per data type is more maintainable than one bespoke pipeline per EDI package, but requires more upfront design. This architectural decision significantly affects long-term maintenance burden.
- What cloud compute platform will be used for compute-intensive jobs? GitHub Actions is sufficient for routine ingestion of tabular and vector data, but LiDAR processing and large raster pipelines will need dedicated compute. Azure Batch is the natural candidate alongside Azure Blob Storage; AWS Batch and cloud VMs are alternatives. The choice should be confirmed with IT early, as enterprise agreements, egress costs between storage and compute, and policy constraints on account types all affect the decision.
- What container tooling will be used? Docker is the most widely adopted option and has strong GitHub Actions integration, but others may be worth considering if DWR system policies restrict Docker daemon access on shared compute. The choice is largely an implementation detail, but should be confirmed against DWR IT constraints early.
- How will the pipeline handle missing or inconsistent temporal metadata? Datasets will vary in temporal precision (year, date, datetime, sub-daily). Is there a program-wide standard for temporal resolution and time zone handling, and how do pipelines respond when incoming data does not meet it?
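To illustrate the schema validation question above, the sketch below uses pointblank, one of the candidate R packages; the table, column names, and allowed values are illustrative only and do not represent an agreed HRL schema.

```r
# Illustrative schema check with pointblank; columns and allowed values are placeholders.
library(pointblank)

agent <- create_agent(tbl = incoming_data, label = "fish_counts schema check") |>
  col_exists(columns = c(site_id, sample_date, species_code, count)) |>
  col_vals_not_null(columns = c(site_id, sample_date)) |>
  col_vals_gte(columns = c(count), value = 0) |>
  col_vals_in_set(columns = c(species_code), set = c("CHN", "STH", "RBT")) |>
  interrogate()

# A pipeline would halt (or route the dataset to a review queue) when checks fail;
# the agent report doubles as the validation error report returned to producers.
if (!all_passed(agent)) stop("Schema validation failed; see the agent report.")
```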
Layer 3 — Storage and Serving
The central data store holds standardized, analysis-ready products and makes them accessible through multiple interfaces. The proposed approach is a tiered storage model:
- Cloud object storage as the primary substrate. Files are stored in open formats (see the notes in Layer 2 for pending format decisions), organized in a logical partition structure (e.g., by data type, watershed, year). Azure Blob Storage is the leading candidate given DWR’s existing enterprise agreement, but the choice has not been made (see decisions below).
- PostgreSQL + PostGIS for relational and spatial query capability. A lightweight managed instance supports catalog metadata, cross-dataset joins, and spatial queries.
- STAC catalog for spatial and raster data discovery. The SpatioTemporal Asset Catalog standard is the modern, widely adopted approach to organizing geospatial datasets. It is used by entities such as NASA, USGS, and Microsoft Planetary Computer. Building HRL’s spatial catalog on STAC means interoperability with the broader geospatial ecosystem and native support for LiDAR and remote sensing products.
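To show what STAC-based discovery looks like from R, the sketch below queries a hypothetical HRL STAC endpoint with rstac; the URL, collection name, bounding box, and date range are placeholders.

```r
# Hypothetical STAC query with rstac; the endpoint, collection, and filters are placeholders.
library(rstac)

items <- stac("https://stac.example.org/hrl") |>
  stac_search(
    collections = "lidar-dem",
    bbox        = c(-122.2, 38.0, -121.4, 38.6),   # xmin, ymin, xmax, ymax (illustrative)
    datetime    = "2023-01-01/2023-12-31"
  ) |>
  get_request()

# Each matched item carries links to its assets (e.g., COG files), which clients can
# download or stream cloud-natively.
assets_url(items)
```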
Essentially all HRL data will be publicly accessible, with the primary exception being Tribal data as dictated by data sharing agreements. File-level access control and relational row-level security are both well-supported by the leading storage candidates; the specific mechanisms will follow from the storage platform decision.
- What cloud object storage platform should be used? Azure Blob Storage is the leading candidate given DWR’s existing enterprise agreement and the low procurement overhead that implies. AWS S3 and Google Cloud Storage are functionally equivalent alternatives. Because all data will be stored in open formats (Parquet, GeoPackage, COG), the storage layer is swappable — but the decision has implications for access control tooling, cost structure, hosting agency, and interagency friction.
- What partition structure should be used in cloud object storage? The folder hierarchy determines how efficiently clients can list and filter available data without downloading catalog metadata. Options include partitioning by data type, watershed, year, or hierarchical combinations (e.g., /{data_type}/{watershed}/{year}/). The right choice depends on expected query patterns from synthesis teams; a brief sketch after this list shows how a partitioned store is queried.
- Is PostgreSQL + PostGIS necessary in the initial implementation, or can it be deferred? A relational database adds operational overhead. If the catalog is initially a static file (e.g., a Parquet or CSV manifest), PostGIS could be deferred to a later phase when spatial query needs are clearer.
- What storage access tier strategy is appropriate for different data classes? Most providers offer tiered pricing by access frequency (hot, cool, cold, archive). Analysis-ready products warrant a hot or cool tier; raw ingested snapshots retained for provenance may qualify for cold or archive storage. This decision affects both cost and the latency seen by data users.
- What is the data retention and versioning policy for raw ingested snapshots? When a producer publishes a new EDI version, should the pipeline retain the previously ingested raw snapshot indefinitely, or is only the current version kept? How long are superseded analysis-ready products retained?
- How will sensitivity classification and access control for Tribal data be implemented at the storage layer? This requires clear policy decisions in consultation with Tribal partners before storage is provisioned.
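To illustrate why the partition structure matters, the sketch below reads a directory-partitioned Parquet store with arrow so that only the partitions matching the filter are scanned; the path, partition keys, and column names are placeholders.

```r
# Reading a partitioned Parquet store with arrow; paths and columns are placeholders.
library(arrow)
library(dplyr)

# Folder hierarchy assumed to follow /{data_type}/{watershed}/{year}/ (placeholder layout).
ds <- open_dataset(
  "hrl-data/analysis-ready/tabular",
  partitioning = c("data_type", "watershed", "year")
)

# Partition pruning: only files under fish_counts/feather/2023/ are actually read.
fish_2023 <- ds |>
  filter(data_type == "fish_counts", watershed == "feather", year == 2023) |>
  select(site_id, sample_date, species_code, count) |>
  collect()
```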
Layer 4 — Access and Applications
The access layer is how data reaches synthesis teams, partner agencies, and the public. Multiple access modes serve different users:
- An R package SDK (e.g., hrldata) as the primary interface for R-based synthesis science. Wraps catalog discovery, data download, and format handling so synthesis teams don’t need to know anything about the underlying storage architecture.
- A data catalog portal — a lightweight, browsable interface to catalog metadata. Initially a rendered Quarto document or static site; evolves into a STAC browser in a later phase.
- Direct download links for data products, generated automatically as part of ingestion pipeline outputs.
- Application dashboards and Quarto reports for science communication and decision support; applications may be built with frameworks such as Shiny, Streamlit, or Node.js + React.
- A REST API for programmatic access by non-R users. Later phase.
- Application hosting infrastructure — interactive applications and APIs require persistent server compute, distinct from the pipeline compute described in Layer 2. Options include Posit Connect (tight Shiny/Quarto integration, requires a license), Azure App Service or Azure Container Apps (cloud-native, integrates naturally with the Azure storage stack), and shinyapps.io (simple managed hosting, less control). The right choice depends on application complexity, authentication requirements, and whether DWR can provide or procure the needed platform.
Synthesis teams access HRL data through the hrldata package and direct download from their own compute environments. The core infrastructure does not need to provision compute for synthesis science; however, if the program later wants to provide shared interactive environments (e.g., a hosted RStudio or JupyterHub instance), that would be a separate platform decision beyond the scope of this proposal.
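To ground what this looks like from a synthesis team's perspective, here is a hypothetical hrldata session; the package does not exist yet, so the function names and arguments below illustrate the intended interface rather than a real API.

```r
# Hypothetical hrldata session; these functions do not exist yet and are shown only
# to illustrate the intended interface for synthesis teams.
library(hrldata)
library(dplyr)

# Browse the program catalog and find relevant datasets without touching the
# underlying storage platform.
catalog <- hrl_catalog()
catalog |> filter(grepl("salmonid", keywords))

# Download an analysis-ready product by its catalog ID; provenance (source EDI
# package, version, pipeline commit) travels with the returned object.
fish <- hrl_get("fish_counts_feather_river")
```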
- Will a Python SDK be developed alongside the R package? The HRL science community is R-first, but some partners and the broader open data audience may expect a Python interface. Is a Python SDK in scope for any phase of the program, or is the REST API sufficient for non-R users?
- What platform will host the data catalog portal? Options include GitHub Pages (free, simple, version-controlled), Posit Team (if the program obtains a license), and Azure Static Web Apps. The right choice depends on whether the catalog needs to be dynamic (server-side search, filtering) or can be fully static-rendered.
- Should the catalog be built on an existing open-source catalog solution or developed as a custom Quarto site? Existing tools such as CKAN or Datasette provide richer functionality out of the box but add operational complexity. A custom Quarto-rendered catalog is simpler to build and maintain but has limited interactivity until later-phase STAC integration.
- How will the hrldata package handle authentication for access-controlled datasets (e.g., Tribal data), or should access-controlled datasets be handled differently (not by the SDK at all)? The package needs a credential management strategy that is secure and does not require users to handle raw Azure credentials directly.
- Where will interactive applications and APIs be hosted? Shiny apps, dashboards, and any REST API require persistent server infrastructure that must be provisioned, maintained, and paid for. Leading options: Posit Connect (best Shiny/Quarto integration; requires a license that must be procured) and Azure App Service / Azure Container Apps (aligns with Azure Blob Storage; supports any containerized app, not just R).
- Should the program offer shared compute for synthesis science? The proposed architecture assumes synthesis teams supply their own compute and use the hrldata package from their own environments. If partner agencies lack adequate compute, the program could consider hosting a shared interactive environment (e.g., Posit Workbench, a JupyterHub instance).
Technology Stack
| Layer | Tools and technologies |
|---|---|
| Data archiving & metadata | EDI / EML, other storage locations (TBD), EMLassemblyline, EDIutils, hrlpub |
| Pipeline & orchestration | R / Python, targets / Nextflow / Snakemake (TBD), Docker / other containerization option (TBD), GitHub Actions |
| Pipeline compute | GitHub Actions (routine jobs), Azure Batch / AWS Batch / cloud VMs (TBD, for compute-intensive workloads) |
| App & API hosting | Posit Connect / shinyapps.io / Azure App Service / Azure Container Apps (TBD) |
| Data processing | arrow, dplyr, sf, terra, stars, vroom, etc. |
| Storage & database | Azure Blob Storage / AWS S3 / GCS (TBD), PostgreSQL + PostGIS, Apache Parquet, GeoPackage / GeoParquet (TBD), COG |
| Catalog | STAC, rstac, stac-fastapi, etc. |
| Access & applications | hrldata R package, Shiny / Posit Connect, Quarto, Plumber / other REST API framework |
| Version control & CI/CD | GitHub, GitHub Actions, lintr, styler, testthat, etc. |
The cloud object storage platform has not been finalized. Azure Blob Storage is the leading candidate given DWR’s existing enterprise agreement, but the architecture is deliberately designed so that the storage layer is an implementation detail, not a dependency. All data is stored in open formats (Parquet, GeoPackage, COG) that are readable from any cloud or local environment. Switching between Azure Blob Storage, AWS S3, or Google Cloud Storage would be a configuration change, not a re-architecture.
Phased Roadmap
The architecture is designed to be implemented in phases, each of which is independently useful and maintainable. Phases are defined by what the team can realistically deliver given available staffing.
Phase 0 — Foundations (Now — Month 3)
TBD.
Staffing & Resource Requirements
The architecture described here is realistic only if the program makes genuine investment in dedicated technical capacity. Scientific software and data engineering are skilled professions; they cannot be accomplished reliably as a side responsibility of scientists whose primary obligation is research or program administration and who do not have specific training in these domains.
| Role | FTE | Phase | Primary responsibilities |
|---|---|---|---|
| Data Engineer | 1.0 | TBD | Ingestion pipeline development, schema implementation, cloud storage, CI/CD, R package development, catalog build-out |
| Data Scientist | 2.0 | TBD | Domain schema coordination, scientific QA of ingested products, synthesis support, documentation and producer training, decision-support tools, dashboards, reporting |
| Cloud / DevOps Engineer | 1.0 | TBD | Cloud infrastructure, STAC server deployment, API deployment, monitoring, security |
| Program Data Lead | 1.0 | All phases | Architecture governance, external partnerships, publication contract, Science Committee liaison |
The recommended Year 1 team is one data engineer, one data scientist, and one program data lead. The data engineer is responsible for pipeline implementation, storage integration, automation, and catalog infrastructure. The data scientist is responsible for schema coordination, scientific QA/QC of ingested products, data producer support, documentation, and early synthesis support. Attempting to implement this architecture without both roles is not recommended: the pipeline development work is full-time, and the scientific validation and producer coordination work requires domain fluency that should not be treated as a side responsibility of the data engineer.
Infrastructure costs
Cloud infrastructure costs in early phases are modest. Estimated costs for storage and pipeline compute are in the range of $500–$2,000 per month, depending on data volume and ingestion cadence. Application hosting adds to this: a Posit Connect license runs roughly $15,000–$20,000/year (coarse estimate); Azure App Service or Container Apps hosting for a small number of applications typically $100–$500/month. A more detailed cost estimate will be developed during early implementation, once data volumes, ingestion cadence, and application hosting requirements are better characterized.
Risks & Open Questions
| Risk | Severity | Likelihood | Mitigation |
|---|---|---|---|
| TBD | TBD | TBD | TBD |
Decisions Required
This proposal asks the Science Committee to make or endorse the following decisions to enable early work to proceed:
TBD.
Infrastructure proposals age quickly. This document is intended to be version-controlled, revisited at each phase transition, and updated as the program learns. Suggestions and revisions are welcome via GitHub issues on the hrl-data-infrastructure repository.