YAML-Driven Terraform: Building a Self-Service Infrastructure Catalog
In this article
- The Problem with “Just Use Modules”
- Architecture: How It Works
- The Catalog Layer
- The Provisioning Layer
- The CI/CD Pipeline
- From YAML Catalog to API-Backed Catalog
- Why This Works Better Than Terraform Modules Alone
- Why Most Platform Catalogs Fail
- Why Not Terragrunt?
- Patterns That Matter
  - One stack per resource type
  - Use try() liberally for optional fields
  - Flatten nested structures into keyed maps
  - Merge tags from multiple sources
  - Separate state per stack, per environment
- Getting Started
- Putting It All Together

Every platform team hits the same wall. Application teams want cloud resources. They don’t want to learn Terraform. They want to fill in a form (or better, commit a YAML file) and get a working environment. Meanwhile, the platform team wants consistency, governance, and the ability to sleep at night.
The usual answer is Terraform modules. You write reusable modules, document them, and ask teams to use them. This works for a while. Then you end up with 30 slightly different main.tf files across 30 repositories, each one a creative interpretation of your module documentation.
A pattern that works well in larger cloud environments is a catalog-driven provisioning model. Infrastructure intent is captured in YAML. Terraform reads the YAML, transforms it into resource maps, and applies the infrastructure. The platform team controls the provisioning engine. Application teams interact with a catalog contract, whether that’s through YAML files, a portal form, or an API.
The Problem with “Just Use Modules”
Terraform modules are the right building block, but they are the wrong interface for application teams.
Application developers think in YAML, JSON, or environment variables, not in resource blocks and data sources. When teams copy a Terraform root module and modify it, you get drift: different backend configurations, inconsistent naming, forgotten tags. If the platform team reviews every Terraform PR, you become the bottleneck. If you do not, you get surprises in production. And updating a module version across 30 consumers means 30 PRs, 30 plan reviews, and 30 deployment windows.
The YAML-driven approach solves all of these by separating what teams want from how it gets provisioned.
Related reading: Terraform module best practices · Azure CAF Terraform modules
Architecture: How It Works
At a high level, the pattern separates catalog data from provisioning logic, with a thin orchestration layer connecting the two.
The Catalog Layer
Infrastructure intent is defined here. A service registry maps resource types to their configuration, and each type has its own directory with YAML files per environment:
# services.yml - the service registry
networking:
  config_path: resources/networking
secrets:
  config_path: resources/secrets
storage:
  config_path: resources/storage
compute:
  config_path: resources/compute
catalog/
├── services.yml              # Service registry
└── resources/
    ├── networking/
    │   ├── dev/
    │   │   └── platform.yml  # Dev network config
    │   └── prd/
    │       └── platform.yml  # Prod network config
    ├── secrets/
    │   ├── dev/
    │   │   └── platform.yml
    │   └── prd/
    │       └── platform.yml
    └── compute/
        ├── dev/
        │   └── workstations.yml
        └── prd/
            └── app-servers.yml
The YAML files capture infrastructure intent at a level that is readable by platform consumers, while the provisioning layer maps that intent to provider-specific resources:
# resources/networking/dev/platform.yml
vnets:
  - name: vnet-app-dev-01
    resource_group: rg-network-dev-01
    location: westeurope
    address_spaces:
      - 10.20.0.0/16
    subnets:
      - name: snet-app-dev-01
        address_prefix: 10.20.1.0/24
        nsg: nsg-app-dev-01
        route_table: rt-app-dev-01
        service_endpoints:
          - Microsoft.KeyVault
          - Microsoft.Storage
      - name: snet-data-dev-01
        address_prefix: 10.20.2.0/24
        nsg: nsg-data-dev-01
        delegation:
          name: databricks
          service: Microsoft.Databricks/workspaces

nsgs:
  - name: nsg-app-dev-01
    resource_group: rg-network-dev-01
    location: westeurope
    rules:
      - name: AllowHTTPS
        priority: 100
        direction: Inbound
        access: Allow
        protocol: Tcp
        source_prefix: VirtualNetwork
        destination_ports: ["443"]
      - name: DenyAllInbound
        priority: 4096
        direction: Inbound
        access: Deny
        protocol: "*"
        source_prefix: "*"
        destination_ports: ["*"]

route_tables:
  - name: rt-app-dev-01
    resource_group: rg-network-dev-01
    location: westeurope
    enable_bgp_route_propagation: false
    routes:
      - name: to-firewall
        address_prefix: "0.0.0.0/0"
        next_hop_type: VirtualAppliance
        next_hop_in_ip_address: "192.0.2.10"
Application teams submit this in a pull request. It is easy to read, easy to review, and the schema is well-understood. No HCL knowledge needed.
The Provisioning Layer
The provisioning layer contains one Terraform stack per resource type. Each stack follows the same structure: a locals.tf that loads and transforms the YAML, and a main.tf that creates resources with for_each.
The critical piece is the YAML-to-map transformation:
# Load the service registry and resolve the config file path
locals {
  services    = yamldecode(file("../../catalog/services.yml"))
  config_path = "../../catalog/${local.services[var.stack].config_path}/${var.env}/${var.file}.yml"
  config      = yamldecode(file(local.config_path))
}
# Transform YAML lists into keyed maps for for_each
locals {
  vnets = { for v in try(local.config.vnets, []) : v.name => v }
  nsgs  = { for n in try(local.config.nsgs, []) : n.name => n }
}
# Flatten nested structures (subnets live inside vnets in YAML)
locals {
  subnets = {
    for s in flatten([
      for v in values(local.vnets) : [
        for sn in try(v.subnets, []) : merge(sn, {
          vnet_name      = v.name
          resource_group = v.resource_group
        })
      ]
    ]) : "${s.vnet_name}/${s.name}" => s
  }
}
Then main.tf uses for_each on these transformed maps:
resource "azurerm_virtual_network" "this" {
  for_each = local.vnets

  name                = each.value.name
  location            = each.value.location
  resource_group_name = each.value.resource_group
  address_space       = each.value.address_spaces

  tags = merge(
    data.azurerm_resource_group.rg[each.value.resource_group].tags,
    try(each.value.tags, {}),
    local.tags
  )
}

resource "azurerm_subnet" "this" {
  for_each = local.subnets

  name                 = each.value.name
  resource_group_name  = each.value.resource_group
  virtual_network_name = azurerm_virtual_network.this[each.value.vnet_name].name
  address_prefixes     = [each.value.address_prefix]
  service_endpoints    = try(each.value.service_endpoints, [])
}
No child modules, no abstraction layers. The stack directly creates resources using for_each on the YAML-derived maps. The try() function handles optional fields. Deliberately simple.
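The locals above assume only three inputs: the stack name, the environment, and the config file name. A minimal sketch of the matching variable declarations (the descriptions are illustrative; the pipeline wrapper passes the values via -var flags):

```hcl
# variables.tf - the only inputs a stack needs
variable "stack" {
  type        = string
  description = "Key into services.yml, e.g. \"networking\""
}

variable "env" {
  type        = string
  description = "Environment directory, e.g. \"dev\" or \"prd\""
}

variable "file" {
  type        = string
  description = "Config file name without the .yml extension, e.g. \"platform\""
}
```

Keeping the variable surface this small is what lets every stack share the same pipeline template.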
Azure docs: Cloud Adoption Framework - Platform automation · Terraform on Azure best practices
The CI/CD Pipeline
The pipeline orchestrates stacks with explicit dependency ordering. A thin wrapper handles the plumbing: loading backend configuration, constructing the config path from the service registry, and calling Terraform with the right variables:
# azure-pipelines.yml (simplified)
parameters:
  - name: run_mode
    type: string
    default: plan
    values: [plan, apply, destroy]
  - name: env
    type: string
  - name: file
    type: string

stages:
  - template: templates/run_stack.yml
    parameters:
      stack: resource_group
      env: ${{ parameters.env }}
      file: ${{ parameters.file }}
      run_mode: ${{ parameters.run_mode }}
      dependencies: []
  - template: templates/run_stack.yml
    parameters:
      stack: networking
      env: ${{ parameters.env }}
      file: ${{ parameters.file }}
      run_mode: ${{ parameters.run_mode }}
      dependencies: [resource_group]
  - template: templates/run_stack.yml
    parameters:
      stack: secrets
      env: ${{ parameters.env }}
      file: ${{ parameters.file }}
      run_mode: ${{ parameters.run_mode }}
      dependencies: [networking]
  - template: templates/run_stack.yml
    parameters:
      stack: compute
      env: ${{ parameters.env }}
      file: ${{ parameters.file }}
      run_mode: ${{ parameters.run_mode }}
      dependencies: [secrets]
Each stack stage runs terraform init with the right backend key, then plan or apply depending on the run mode. The state key is derived from the stack, environment, and file name, giving you isolated state per deployment unit.
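One way to wire this up is Terraform's partial backend configuration: the backend block stays empty in the stack, and the wrapper injects the settings at init time. A sketch, with illustrative values (the storage account and container names are placeholders, not from the original setup):

```hcl
# backend.tf - intentionally empty; the pipeline wrapper supplies the
# settings at init time, something like:
#
#   terraform init \
#     -backend-config="key=networking/dev/platform.tfstate" \
#     -backend-config="storage_account_name=<state-account>" \
#     -backend-config="container_name=<state-container>"
terraform {
  backend "azurerm" {}
}
```

Because the key is built from stack, environment, and file, no two deployment units can ever share a state file.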
Azure docs: Azure DevOps Pipelines with Terraform · Environment approvals and gates
From YAML Catalog to API-Backed Catalog
In smaller teams, a pull request with a YAML file is a perfectly good self-service interface. As the platform matures, many teams add an API layer in front of the catalog. Instead of committing configuration directly, users submit a request to a catalog API. The API validates the payload, applies policy checks, stores the approved request in the catalog, and triggers the provisioning workflow. The underlying Terraform stacks remain unchanged; only the intake experience evolves.
A conceptual request might look like:
POST /catalog/requests
{
  "serviceType": "networking",
  "environment": "dev",
  "spec": {
    "name": "app-network",
    "addressSpace": ["10.20.0.0/16"],
    "subnets": [
      { "name": "app", "prefix": "10.20.1.0/24" }
    ]
  }
}
The API validates the request against a schema, the policy layer enforces allowed regions, sizes, and tags, and approved requests are stored as catalog data. The provisioning pipeline reconciles desired state through Terraform, the same way it would with a YAML commit.
This hybrid model keeps YAML as the canonical declarative format, uses an API as the intake and validation layer, and retains Git as the audit trail and execution trigger. Whether your real setup is GitOps-first, API-first, or mixed, the architecture supports all three.
Tools like Backstage fit well as a developer portal front-end for this kind of catalog. Azure API Management can serve as the API gateway with schema validation built in.
Why This Works Better Than Terraform Modules Alone
The key differences with the YAML-driven approach:
The catalog layer is the interface. The provisioning layer is the engine. Different teams own different parts, with different review cycles and release cadences.
Teams that define infrastructure never see a .tf file. They describe what they want, not how to build it. Each stack uses for_each directly on resources, so there’s no module inception and no abstraction layers to debug through.
The YAML or API schema acts as the policy. You can’t order a VM size that isn’t in the catalog. You can’t skip tags. The shape of the request enforces the rules. And every infrastructure change is a diff in git: who changed what, when, and why.
Why Most Platform Catalogs Fail
Before building a catalog, it helps to understand why so many of them end up abandoned.
The most common failure: the catalog becomes a glorified ticketing form. Teams submit YAML or fill in a portal, but someone on the platform team still manually reviews every request, tweaks parameters, and runs terraform apply by hand. The self-service part is an illusion. Provisioning time stays the same, just with extra steps.
Module inconsistency kills trust early. Three teams wrote three VNet modules with different naming conventions, different tag structures, and different opinions on subnet sizing. Application teams pick whichever module they find first. The catalog offers choices that produce inconsistent results.
Policy enforcement happens in Slack instead of in the pipeline. Someone reviews the PR and says “you can’t use that VM size in production” or “you forgot the cost center tag.” If the pipeline does not reject non-compliant requests automatically, the catalog is just a suggestion box.
Too many deployment paths undermine adoption. Some teams use the catalog. Others have their own Terraform repositories. A few still provision through the Azure portal. When three paths coexist, the catalog never reaches critical mass.
Finally, nobody measures whether the catalog actually works. No tracking of adoption rates, provisioning times, or how often teams bypass the catalog entirely. Without a feedback loop, the platform team has no signal on what to improve.
Why Not Terragrunt?
Terragrunt is the obvious question. It solves real problems: DRY backend configuration, dependency management between modules, and generating provider blocks. It works well when those are the main pain points.
But Terragrunt doesn’t solve the interface problem. Application teams still write HCL (or at least terragrunt.hcl). They still need to understand module inputs, variable types, and how include blocks work. Terragrunt makes Terraform easier for Terraform users. It doesn’t make infrastructure provisioning accessible to teams that don’t want to learn Terraform at all.
The same applies to Spacelift, Atlantis, and env0. They’re pipeline orchestrators. They solve “how do I run Terraform safely at scale” but not “how do I let application teams define infrastructure without writing HCL.” Those are different problems. Pipeline orchestrators work well alongside the catalog pattern, but they don’t replace it.
The YAML boundary is the differentiator. Application teams never touch HCL. They describe a VNet, a Key Vault, or a set of VMs. The platform team’s Terraform code reads that description and provisions the resources. If the provisioning engine switched to Pulumi or OpenTofu tomorrow, the catalog interface wouldn’t change.
Patterns That Matter
One stack per resource type
Don’t bundle unrelated resources. A networking stack handles VNets, subnets, NSGs, and route tables: resources that naturally belong together. But compute and networking go in separate stacks with separate state files and separate pipeline stages.
Use try() liberally for optional fields
The YAML should only contain what needs to vary. Everything else gets a default via Terraform’s try() function:
dns_servers        = try(each.value.dns_servers, [])
enable_bgp         = try(each.value.enable_bgp_route_propagation, true)
private_ip_address = try(each.value.private_ip_address, null)
This keeps the YAML configs clean. A simple subnet doesn’t need 20 fields.
Flatten nested structures into keyed maps
YAML is hierarchical. Terraform’s for_each wants flat maps. The transformation layer bridges this gap. This pattern works for any parent-child relationship: NSG rules inside NSGs, routes inside route tables, containers inside storage accounts.
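For instance, the same flatten-and-key recipe applied to NSG rules, mirroring the subnet transformation shown earlier (local.nsgs is the keyed map built in the transformation layer):

```hcl
# Flatten NSG rules (rules live inside nsgs in YAML)
locals {
  nsg_rules = {
    for r in flatten([
      for n in values(local.nsgs) : [
        for rule in try(n.rules, []) : merge(rule, { nsg_name = n.name })
      ]
    ]) : "${r.nsg_name}/${r.name}" => r
  }
}
```

The compound "parent/child" key guarantees uniqueness across parents, so two NSGs can both have an AllowHTTPS rule without colliding in for_each.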

Merge tags from multiple sources
Tags should come from three layers: the resource group (inherited), the YAML config (custom per resource), and the platform (automatic metadata). Merge them in that order so platform tags always win.
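This works because HCL's merge() keeps the last value for any duplicate key, so argument order encodes precedence. A toy illustration (the tag values here are made up):

```hcl
locals {
  effective_tags = merge(
    { env = "dev", owner = "app-team" },        # layer 1: inherited from the resource group
    { owner = "data-team", costcenter = "42" }, # layer 2: custom tags from the YAML config
    { managed_by = "platform-catalog" }         # layer 3: platform metadata, applied last
  )
  # owner resolves to "data-team" (layer 2 beats layer 1),
  # and managed_by can never be overridden by a consumer.
}
```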
Separate state per stack, per environment
Each combination of stack + environment + file gets its own state file. This isolates blast radius. A failed compute deployment doesn’t lock or corrupt networking state.
Getting Started
You don’t need to build the full catalog on day one. Start with:
- Pick one resource type. Networking is usually the best first candidate because every environment needs it and it has natural sub-resources (VNets, subnets, NSGs).
- Define the schema. Write a sample YAML that describes your current environment. If the YAML is readable by a non-Terraform person, the schema is right.
- Write the stack. One transformation layer that converts YAML to maps. One resource file that creates resources with for_each. Keep it flat.
- Add a pipeline. Start with plan-only. Add apply after you trust the pattern.
- Expand. Add resource types as demand grows. Each new stack follows the same pattern.
- Add an API when Git PRs become a bottleneck. Not before.
The teams that succeed with this approach start small, prove the pattern works, and expand organically. The teams that fail try to build a platform for every possible use case before anyone has submitted a single request.
Azure docs: Azure landing zone Terraform modules · Terraform state management
Putting It All Together
A YAML-driven infrastructure catalog isn’t a framework or a product you install. It is a pattern: separating configuration from implementation, with a clear contract between the teams that define infrastructure and the teams that provision it.
The platform team owns the provisioning engine. Application teams own the catalog data. The YAML or API boundary between them is the contract. Whether that boundary is a Git pull request, an API call, or a portal form depends on team maturity and scale.
If your platform team spends more time reviewing copy-pasted Terraform modules than improving the platform, a catalog is the way out.
Related: Bicep vs Terraform: Why We Default to Terraform (and When Bicep Wins) explains our thinking on when to use Terraform and when Bicep is the better choice.