flowchart TD
A[Manual upload of spatial data files<br/>GeoPackage, zipped shapefile] --> B[<b>Azure Storage / ADLS Gen2</b><br/>raw-submissions/]
C[Machine-readable schema<br/><b>schemas/</b>] --> D[<b>Azure Container Apps Job</b><br/>Validation + transformation script]
B --> D
D --> E{Validation passes?}
E -- No --> F[Write validation report<br/><b>validation-reports/</b>]
F --> G[Return issues to data submitter<br/>Schema errors, geometry errors, CRS issues]
E -- Yes --> H[Transform to common standard fields,<br/>CRS, geometry type, metadata]
H --> I[<b>Azure Database for PostgreSQL</b><br/>Flexible Server + <b>PostGIS</b>]
H --> J[Standardized export files<br/>GeoPackage, GeoJSON, CSV, GeoParquet<br/><b>standardized-exports/</b>]
I --> K[API layer<br/><b>Azure App Service</b> or <b>Azure Container Apps</b>]
K --> L[<b>Azure API Management</b><br/>External access, auth, throttling, versioning]
L --> M[External clients<br/>GIS users, analysts, partner agencies, dashboards]
I --> N[Interactive map application<br/><b>Azure Static Web Apps</b>, <b>App Service</b>,<br/>or <b>Container Apps</b>]
L --> N
J --> M
I --> O[Metadata and catalog layer<br/><b>Microsoft Purview</b> and/or<br/>metadata landing page]
J --> O
F --> O
O --> M
classDef azure fill:#D2EAF4,stroke:#2E7DA1,stroke-width:2px,color:#0C425C
class B,D,I,K,L,N,O azure
Data Pipeline Architecture
This is an internal planning document, not a policy or design specification.
This document lays out a stack for the ingestion, validation, standardization, storage, and publication of HRL spatial data. It focuses on spatial data given immediate mapping needs but represents the first instance of a broader data infrastructure that will expand to serve other HRL data types as the program matures.
Restoration spatial datasets (datasets describing where restoration is happening, along with basic attributes of each project) will be emailed to Lucy for upload into this data system; their content and update/versioning patterns make them unsuited to external repositories. For other data types, HRL data producers publish to external repositories such as the Environmental Data Initiative (EDI). EDI and analogous repositories are the right place for static, citable archival data, but they are not a sufficient end state for a program that needs to integrate data across agencies, serve live applications, and present a coherent picture of restoration and environmental flow activity across the watershed. Data arriving from different producers will vary in structure, field names, field types, controlled vocabularies, coordinate reference systems, geometry types, and metadata completeness. Without a standardization layer, every downstream use (a map, a query, a join) would require repeated manual reconciliation.
The infrastructure described here fills that gap. It receives submitted spatial files (and, eventually, files of other data types harvested from repositories such as EDI), validates them against a shared schema, transforms them to a common standard, loads them into a managed spatial database, and exposes them through an API and mapping application. It also provides the hosting environment that external repositories do not: a place to run the pipeline itself, serve an API, and deploy interactive applications for staff, partner agencies, and the public.
The primary architectural decision running through this design is platform-as-a-service (PaaS) over infrastructure-as-a-service (IaaS). With IaaS, the team would provision and manage virtual machines directly — handling OS updates, security patching, and uptime. PaaS offloads that to Azure, so the team is responsible only for the application and data layers. For a program of this scale and team size, PaaS is the right default: it substantially reduces operational overhead without meaningfully constraining what the infrastructure can do.
This page covers the data engineering and hosting layer (ingestion, validation, standardization, and storage), plus the infrastructure needed to host applications (e.g., maps) and APIs until Posit Connect is procured and configured. It does not cover the tooling that lets data scientists develop and publish analytical content without needing to interact with the Azure complexity described in this document; that tooling is described separately on the Posit Data Science Platform page.
Component reference
| Need | Azure component | Why |
|---|---|---|
| Raw uploaded spatial files | Azure Blob Storage or Azure Data Lake Storage Gen2 | Store submitted GeoPackages, zipped shapefiles, validation reports, transformed outputs, and archived versions. Azure Storage can also host static websites if needed. |
| Schema / controlled vocabularies | GitHub/Azure DevOps repo + Blob Storage copy | Keep the machine-readable schema version-controlled; optionally publish a frozen copy with each pipeline run. |
| Validation and transformation script | Azure Container Apps Jobs, Azure Functions, or Azure Data Factory orchestration | Azure Functions is event-driven and suited for lightweight, short-running scripts — for example, triggering validation when a file lands in storage. Container Apps Jobs run a full container, making them a better fit for pipelines that need geospatial dependencies like GDAL or Python/R packages, or that require longer runtimes. Data Factory can orchestrate either as part of a larger scheduled pipeline. |
| Spatial database | Azure Database for PostgreSQL Flexible Server + PostGIS | This is the cleanest managed Postgres/PostGIS option. Azure’s own architecture guidance identifies PostgreSQL/PostGIS as a fit for geospatial apps, and Azure Database for PostgreSQL is managed by Azure rather than by the team. |
| Standardized public/download dataset | PostGIS + GeoPackage/GeoJSON/Parquet outputs in Blob Storage | PostGIS is the authoritative operational store; files are useful for external clients, archival, and reproducible snapshots. |
| Metadata/catalog | Microsoft Purview if DWR uses it; otherwise static metadata files + landing page | Purview is Azure’s governed data catalog/data estate tool and supports metadata, lineage, access policies, data quality concepts, and data discovery. |
| API access | Azure App Service / Container Apps API + Azure API Management | API Management is designed to publish and govern APIs for external, partner, and internal users. |
| Interactive map app | Azure App Service, Azure Static Web Apps, or Azure Container Apps | App Service is managed PaaS web hosting; Container Apps is better for containerized apps; static hosting works if the app is a static front end consuming APIs. |
| Map rendering | Azure Maps Web SDK, kepler.gl, MapLibre, Leaflet, OpenLayers, Shiny, or Streamlit | Azure Maps has a Web SDK for interactive maps and Azure-native geospatial services. kepler.gl is a React/WebGL visualization tool (built on deck.gl) well suited to large geospatial datasets; it can load GeoJSON, CSV, or Arrow/Parquet directly from Blob Storage or the API layer and deploys as a static web app (see the sketch after this table). Shiny and Streamlit fit when the map is part of a broader analytical app. |
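For illustration, one of the rendering options above can be exercised entirely in Python: the keplergl package (an assumed choice; any of the listed frameworks would work) can bundle a standardized export into a self-contained HTML map suitable for static hosting. File and layer names are placeholders.

```python
# Build a self-contained HTML map from a standardized export with keplergl.
import geopandas as gpd
from keplergl import KeplerGl

gdf = gpd.read_file("restoration_projects.geojson")  # a standardized export

m = KeplerGl()
m.add_data(data=gdf, name="restoration_projects")
m.save_to_html(file_name="map.html")  # deployable as a static asset
```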
Recommended implementation pattern
The preferred architecture is PaaS-first: use Azure-managed services wherever possible and reserve containerized compute for steps that require custom geospatial dependencies. Avoid general-purpose VMs unless DWR IT policy requires them; as noted above, IaaS would shift OS maintenance and patching responsibilities back onto the team.
Storage. Azure Storage / ADLS Gen2 is the backbone of the pipeline. It holds raw submissions as received, the versioned schema and controlled vocabularies used for validation, validation reports (both passes and failures), standardized export files, and archived snapshots. Keeping all of these in a single storage account with a consistent folder hierarchy makes the pipeline auditable and reproducible.
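As a sketch of how a submission would land in this layout (assuming the azure-storage-blob Python package, a container named spatial-data-pipeline, and a connection string in an environment variable; all three are assumptions, not decisions):

```python
# Land one raw submission in Blob Storage under the proposed
# raw-submissions/ hierarchy. Connection string, container name, and
# agency/project names are illustrative assumptions.
import os
from datetime import date
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"]
)
container = service.get_container_client("spatial-data-pipeline")

blob_name = (
    f"raw-submissions/agency-name/project-name/{date.today().isoformat()}/"
    "submitted_file.gpkg"
)
with open("submitted_file.gpkg", "rb") as f:
    container.upload_blob(name=blob_name, data=f, overwrite=False)
```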
Validation and transformation. Azure Container Apps Jobs are the preferred compute layer for this step. Geospatial validation and transformation workflows commonly require GDAL, Python geospatial packages (e.g., geopandas, pyproj, shapely), or R spatial packages, and often take longer than the execution limits of serverless functions. A containerized job can package the full dependency stack, run to completion, and exit — with no persistent infrastructure between runs.
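A minimal sketch of what such a job might run, assuming geopandas and a schema JSON of the rough shape {"required_fields": [...], "expected_crs": "EPSG:..."} (the real schema lives in schemas/ and may differ):

```python
# Collect schema, CRS, and geometry issues for one submitted file.
# An empty return value means the submission passes.
import json
import geopandas as gpd

def validate(path: str, schema_path: str) -> list[str]:
    with open(schema_path) as f:
        schema = json.load(f)  # assumed shape: required_fields, expected_crs
    issues: list[str] = []

    gdf = gpd.read_file(path)

    # Schema errors: required fields missing from the submission
    missing = set(schema["required_fields"]) - set(gdf.columns)
    if missing:
        issues.append(f"missing required fields: {sorted(missing)}")

    # CRS issues: undeclared or unexpected coordinate reference system
    if gdf.crs is None:
        issues.append("no CRS declared")
    elif gdf.crs.to_string() != schema["expected_crs"]:
        issues.append(f"CRS is {gdf.crs}, expected {schema['expected_crs']}")

    # Geometry errors: invalid or empty geometries
    bad = (~gdf.geometry.is_valid) | gdf.geometry.is_empty
    if bad.any():
        issues.append(f"{int(bad.sum())} invalid or empty geometries")

    return issues
```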
Spatial database. Azure Database for PostgreSQL Flexible Server with the PostGIS extension is the authoritative store for standardized spatial data. All validated and transformed submissions are written here. Downstream services — the API layer, the map application, the metadata catalog — read from PostgreSQL/PostGIS as the primary source of truth.
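A minimal load sketch, assuming geopandas with sqlalchemy/geoalchemy2 and a connection string in an environment variable; the table name is a placeholder:

```python
# Append one standardized submission to the authoritative PostGIS table.
# PG_DSN (e.g. postgresql+psycopg2://user:pw@host/db) is an assumption.
import os
import geopandas as gpd
from sqlalchemy import create_engine

engine = create_engine(os.environ["PG_DSN"])

gdf = gpd.read_file("standardized.gpkg")  # output of the transformation step
gdf.to_postgis(
    name="restoration_projects",
    con=engine,
    if_exists="append",  # successive submissions accumulate in one table
    index=False,
)
```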
Export files. Standardized GeoPackage, GeoJSON, CSV, and GeoParquet exports written to Blob Storage serve a different purpose than the database: they support external clients who cannot query an API, enable reproducible analytical snapshots, and provide archival copies that are independent of database availability.
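A sketch of producing all four export formats from the standardized layer with geopandas; file names follow the standardized-exports/current/ layout proposed below:

```python
# Write the standardized layer to each distribution format.
import geopandas as gpd

gdf = gpd.read_file("standardized.gpkg")

gdf.to_file("restoration_projects.gpkg", driver="GPKG")
gdf.to_file("restoration_projects.geojson", driver="GeoJSON")
gdf.to_parquet("restoration_projects.parquet")  # GeoParquet; needs pyarrow

# CSV is non-spatial, so carry the geometry as WKT alongside the attributes
gdf.assign(geometry_wkt=gdf.geometry.to_wkt()).drop(columns="geometry").to_csv(
    "restoration_projects.csv", index=False
)
```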
API and application layers. The API and map application should run on managed Azure hosting (App Service, Container Apps, or Static Web Apps) rather than a self-managed VM. This keeps operational overhead low for a small team.
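As an illustration only, a minimal read-only endpoint; FastAPI is an assumed framework choice, and the route and table names are placeholders:

```python
# Serve the standardized layer as GeoJSON from PostGIS.
import os
import geopandas as gpd
from fastapi import FastAPI, Response
from sqlalchemy import create_engine

app = FastAPI()
engine = create_engine(os.environ["PG_DSN"])

@app.get("/restoration-projects")
def restoration_projects() -> Response:
    gdf = gpd.read_postgis(
        "SELECT * FROM restoration_projects", engine, geom_col="geometry"
    )
    return Response(content=gdf.to_json(), media_type="application/geo+json")
```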
Implementation repositories
The spatial data pipeline is expected to be implemented across several repositories rather than as a single monorepo. This separation reflects the different responsibilities of the data model, validation code, Azure deployment configuration, database migrations, APIs, maps, and downstream applications. See Repository structure and separation of concerns for the recommended repository layout.
Proposed storage structure
The following container layout organizes all pipeline inputs and outputs in a single Azure Storage account. The structure separates concerns by pipeline stage, makes automated writes and reads straightforward to script, and keeps a clear audit trail.
spatial-data-pipeline/
  raw-submissions/
    agency-name/
      project-name/                  # this level may change depending on submission structure
        submission-date/
          submitted_file.gpkg
          submission_metadata.json   # optional; spatial files expected to be self-documenting
  schemas/
    current/
      spatial_submission_schema.json
      controlled_vocabularies.json
    archive/
      v1.0.0/
      v1.1.0/
  validation-reports/
    agency-name/
      project-name/                  # this level may change depending on submission structure
        submission-date/
          validation_report.json
          validation_report.html
  standardized-exports/
    current/
      restoration_projects.gpkg
      restoration_projects.geojson
      restoration_projects.csv
      restoration_projects.parquet
    snapshots/
      YYYY-MM-DD/
  metadata/
    data_dictionary.md               # file type tbd; this is a placeholder
    field_definitions.csv            # file type tbd; this is a placeholder
    change_log.md
    lineage.json
  archive/
    raw-submissions/
    standardized-exports/
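A small helper along these lines would keep pipeline stages writing to consistent prefixes; the function name and arguments are illustrative:

```python
# Build blob prefixes that follow the layout above, so every pipeline
# stage reads and writes consistent, auditable locations.
from datetime import date

def submission_prefix(stage: str, agency: str, project: str, submitted: date) -> str:
    """stage is e.g. 'raw-submissions' or 'validation-reports'."""
    return f"{stage}/{agency}/{project}/{submitted.isoformat()}"

print(submission_prefix("raw-submissions", "agency-name", "project-name", date(2025, 1, 15)))
# raw-submissions/agency-name/project-name/2025-01-15
```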