Data Pipeline Architecture

Author: Lucy Andrews

Published: May 2026

Warning: Working draft

This is an internal planning document, not a policy or design specification.

This document lays out a stack for the ingestion, validation, standardization, storage, and publication of HRL spatial data. It focuses on spatial data given immediate mapping needs but represents the first instance of a broader data infrastructure that will expand to serve other HRL data types as the program matures.

Restoration spatial datasets (i.e., datasets describing where restoration is happening, along with basic attributes of each project) will be emailed to Lucy for upload into this data system. These datasets are not suited to external repositories because of their content and update/versioning patterns. For other data types, HRL data producers publish datasets to external repositories such as the Environmental Data Initiative (EDI). EDI and analogous external repositories are the right place for static, citable archival data, but they are not a sufficient end state for a program that needs to integrate data across agencies, serve live applications, and present a coherent picture of restoration and environmental flow activity across the watershed. Data arriving from different producers will vary in structure, field names, field types, controlled vocabularies, coordinate reference systems, geometry types, and metadata completeness. Without a standardization layer, every downstream use (a map, a query, a join) would require repetitive manual reconciliation.

The infrastructure described here fills that gap. It receives submitted spatial files (and eventually other data type files harvested from repositories such as EDI), validates them against a shared schema, transforms them to a common standard, loads them into a managed spatial database, and exposes them through an API and mapping application. It also provides the hosting environment that external repositories do not: a place to run the pipeline itself, serve an API, and deploy interactive applications for staff, partner agencies, and the public.
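As a concrete illustration of the validation stage, the sketch below checks a submission's attribute fields against a machine-readable schema with required fields and controlled vocabularies. The schema structure, field names, and vocabulary values shown here are illustrative assumptions, not the contents of the actual spatial_submission_schema.json.

```python
# Minimal sketch of attribute validation against a machine-readable schema.
# SCHEMA below is an illustrative stand-in, not the real
# spatial_submission_schema.json / controlled_vocabularies.json.

SCHEMA = {
    "required_fields": ["project_id", "agency", "habitat_type"],
    "controlled_vocabularies": {
        "habitat_type": ["floodplain", "riparian", "tidal wetland"],
    },
}

def validate_record(record: dict, schema: dict = SCHEMA) -> list[str]:
    """Return a list of human-readable issues; an empty list means the record passes."""
    issues = []
    # Required fields must be present and non-empty.
    for field in schema["required_fields"]:
        if field not in record or record[field] in (None, ""):
            issues.append(f"missing required field: {field}")
    # Values in vocabulary-controlled fields must come from the allowed list.
    for field, allowed in schema["controlled_vocabularies"].items():
        value = record.get(field)
        if value is not None and value not in allowed:
            issues.append(f"{field}={value!r} not in controlled vocabulary")
    return issues
```

In practice the geometry and CRS checks would sit alongside this attribute pass, and the accumulated issue list would become the validation report returned to the submitter.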

The primary architectural decision running through this design is platform-as-a-service (PaaS) over infrastructure-as-a-service (IaaS). With IaaS, the team would provision and manage virtual machines directly — handling OS updates, security patching, and uptime. PaaS offloads that to Azure, so the team is responsible only for the application and data layers. For a program of this scale and team size, PaaS is the right default: it substantially reduces operational overhead without meaningfully constraining what the infrastructure can do.

Note

This page covers the data engineering and hosting layer: ingestion, validation, standardization, and storage. It also covers the infrastructure needed to host applications (e.g., maps) and APIs until Posit Connect is procured and configured. It does not cover the tooling that lets data scientists develop and publish analytical content without needing to interact with the Azure complexity described here; that is described separately on the Posit Data Science Platform page.

flowchart TD
    A[Manual upload of spatial data files<br/>GeoPackage, zipped shapefile] --> B[<b>Azure Storage / ADLS Gen2</b><br/>raw-submissions/]

    C[Machine-readable schema<br/><b>schemas/</b>] --> D[<b>Azure Container Apps Job</b><br/>Validation + transformation script]
    B --> D

    D --> E{Validation passes?}

    E -- No --> F[Write validation report<br/><b>validation-reports/</b>]
    F --> G[Return issues to data submitter<br/>Schema errors, geometry errors, CRS issues]

    E -- Yes --> H[Transform to common standard fields,<br/>CRS, geometry type, metadata]
    H --> I[<b>Azure Database for PostgreSQL</b><br/>Flexible Server + <b>PostGIS</b>]

    H --> J[Standardized export files<br/>GeoPackage, GeoJSON, CSV, GeoParquet<br/><b>standardized-exports/</b>]

    I --> K[API layer<br/><b>Azure App Service</b> or <b>Azure Container Apps</b>]
    K --> L[<b>Azure API Management</b><br/>External access, auth, throttling, versioning]

    L --> M[External clients<br/>GIS users, analysts, partner agencies, dashboards]

    I --> N[Interactive map application<br/><b>Azure Static Web Apps</b>, <b>App Service</b>,<br/>or <b>Container Apps</b>]
    L --> N

    J --> M

    I --> O[Metadata and catalog layer<br/><b>Microsoft Purview</b> and/or<br/>metadata landing page]
    J --> O
    F --> O

    O --> M

    classDef azure fill:#D2EAF4,stroke:#2E7DA1,stroke-width:2px,color:#0C425C
    class B,D,I,K,L,N,O azure
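The branch in the flowchart above can be sketched as a small orchestration function: a submission either produces a validation report for the submitter or flows on to transformation, the PostGIS load, and standardized exports. The function and stage names here are illustrative, not an actual implementation.

```python
# Sketch of the flowchart's decision point. The validate/transform/load/export
# stages are passed in as callables; their names and return shapes are
# assumptions for illustration only.

def run_pipeline(submission: dict, validate, transform, load, export) -> dict:
    issues = validate(submission)
    if issues:
        # "No" branch: write a validation report and return issues to the submitter.
        return {"status": "rejected", "report": issues}
    # "Yes" branch: standardize, load into PostGIS, write export files.
    standardized = transform(submission)   # common fields, CRS, geometry type
    load(standardized)                     # Azure Database for PostgreSQL + PostGIS
    paths = export(standardized)           # files under standardized-exports/
    return {"status": "accepted", "exports": paths}
```

Wiring the real stages into this shape keeps the "reject with report" and "accept and publish" paths explicit, which matches the audit-trail goals of the storage layout below.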

Component reference

| Need | Azure component | Why |
| --- | --- | --- |
| Raw uploaded spatial files | Azure Blob Storage or Azure Data Lake Storage Gen2 | Store submitted GeoPackages, zipped shapefiles, validation reports, transformed outputs, and archived versions. Azure Storage can also host static websites if needed. |
| Schema / controlled vocabularies | GitHub/Azure DevOps repo + Blob Storage copy | Keep the machine-readable schema version-controlled; optionally publish a frozen copy with each pipeline run. |
| Validation and transformation script | Azure Container Apps Jobs, Azure Functions, or Azure Data Factory orchestration | Azure Functions is event-driven and suited for lightweight, short-running scripts — for example, triggering validation when a file lands in storage. Container Apps Jobs run a full container, making them a better fit for pipelines that need geospatial dependencies like GDAL or Python/R packages, or that require longer runtimes. Data Factory can orchestrate either as part of a larger scheduled pipeline. |
| Spatial database | Azure Database for PostgreSQL Flexible Server + PostGIS | This is the cleanest managed Postgres/PostGIS option. Azure's own architecture guidance identifies PostgreSQL/PostGIS as a fit for geospatial apps, and Azure Database for PostgreSQL is managed by Azure rather than by the team. |
| Standardized public/download dataset | PostGIS + GeoPackage/GeoJSON/Parquet outputs in Blob Storage | PostGIS is the authoritative operational store; files are useful for external clients, archival, and reproducible snapshots. |
| Metadata/catalog | Microsoft Purview if DWR uses it; otherwise static metadata files + landing page | Purview is Azure's governed data catalog/data estate tool and supports metadata, lineage, access policies, data quality concepts, and data discovery. |
| API access | Azure App Service / Container Apps API + Azure API Management | API Management is designed to publish and govern APIs for external, partner, and internal users. |
| Interactive map app | Azure App Service, Azure Static Web Apps, or Azure Container Apps | App Service is managed PaaS web hosting; Container Apps is better for containerized apps; static hosting works if the app is a static front end consuming APIs. |
| Map rendering | Azure Maps Web SDK, kepler.gl, Shiny, MapLibre, Leaflet, or OpenLayers | Azure Maps has a Web SDK for interactive maps and Azure-native geospatial services. kepler.gl is a React/WebGL visualization tool (built on deck.gl) well suited to large geospatial datasets; it can load GeoJSON, CSV, or Arrow/Parquet directly from Blob Storage or the API layer and deploys as a static web app. Shiny and Streamlit are also possible frameworks. |
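To make the spatial-database row concrete, the sketch below derives a PostGIS table definition from a standardized field list. The field names, Postgres types, geometry type, and SRID 4326 are assumptions for illustration; the real column set would come from the pipeline's schema files.

```python
# Sketch: build a CREATE TABLE statement for the PostGIS store from a
# standardized field list. FIELDS, the geometry type, and the SRID are
# illustrative assumptions, not the project's actual schema.

FIELDS = [("project_id", "text"), ("agency", "text"), ("start_year", "integer")]

def build_ddl(table: str, fields: list[tuple[str, str]],
              geom_type: str = "MULTIPOLYGON", srid: int = 4326) -> str:
    """Return DDL with one typed geometry column alongside the attribute columns."""
    cols = ",\n  ".join(f"{name} {pg_type}" for name, pg_type in fields)
    return (
        f"CREATE TABLE IF NOT EXISTS {table} (\n"
        f"  {cols},\n"
        f"  geom geometry({geom_type}, {srid})\n"
        f");"
    )
```

Generating DDL from the same machine-readable schema used for validation would keep the database layout and the submission standard from drifting apart.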

Implementation repositories

The spatial data pipeline is expected to be implemented across several repositories rather than as a single monorepo. This separation reflects the different responsibilities of the data model, validation code, Azure deployment configuration, database migrations, APIs, maps, and downstream applications. See Repository structure and separation of concerns for the recommended repository layout.

Proposed storage structure

The following container layout organizes all pipeline inputs and outputs in a single Azure Storage account. The structure separates concerns by pipeline stage, makes automated writes and reads straightforward to script, and keeps a clear audit trail.

spatial-data-pipeline/
  raw-submissions/
    agency-name/
      project-name/                  # this level may change depending on submission structure
        submission_date/
          submitted_file.gpkg
          submission_metadata.json   # optional; spatial files expected to be self-documenting
  schemas/
    current/
      spatial_submission_schema.json
      controlled_vocabularies.json
    archive/
      v1.0.0/
      v1.1.0/
  validation-reports/
    agency-name/
      project-name/                  # this level may change depending on submission structure
        submission-date/
          validation_report.json
          validation_report.html
  standardized-exports/
    current/
      restoration_projects.gpkg
      restoration_projects.geojson
      restoration_projects.csv
      restoration_projects.parquet
    snapshots/
      YYYY-MM-DD/
  metadata/
    data_dictionary.md               # file type tbd; this is a placeholder
    field_definitions.csv            # file type tbd; this is a placeholder
    change_log.md
    lineage.json
  archive/
    raw-submissions/
    standardized-exports/
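Because the layout above is regular, pipeline code can derive read and write locations rather than hard-coding them. The sketch below builds blob paths that follow that structure; the agency and project names used are examples only.

```python
# Sketch: derive blob paths that follow the container layout above so pipeline
# writes land in predictable, auditable locations. Agency/project values are
# placeholders for illustration.
from datetime import date

CONTAINER = "spatial-data-pipeline"

def raw_submission_path(agency: str, project: str, filename: str,
                        submitted: date) -> str:
    """Path for an incoming file under raw-submissions/."""
    return (f"{CONTAINER}/raw-submissions/{agency}/{project}/"
            f"{submitted.isoformat()}/{filename}")

def snapshot_path(filename: str, snapshot_date: date) -> str:
    """Path for a dated copy under standardized-exports/snapshots/."""
    return (f"{CONTAINER}/standardized-exports/snapshots/"
            f"{snapshot_date.isoformat()}/{filename}")
```

Using ISO dates (YYYY-MM-DD) in the path keeps snapshots naturally sorted and matches the snapshot folder convention shown in the layout.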