Harbinger Explorer


Infrastructure as Code for Data Platforms

Tags: IaC, Terraform, data-platform, GitOps, DevOps, Databricks, dbt

The discipline of Infrastructure as Code transformed how we manage compute and networking. Data platforms have been slower to adopt these practices — data pipelines lived in Jupyter notebooks, schema changes were applied manually, and "environment promotion" meant copying SQL files between folders. That era is over.

This guide covers how to apply rigorous IaC principles to modern data platforms: from the underlying cloud resources to the schemas, pipelines, and governance policies that run on top of them.


The Data Platform IaC Stack

A complete IaC approach for data platforms operates at four layers:

The four layers, from bottom to top:

L1 — Cloud resources (storage, networking, IAM) — Terraform
L2 — Data warehouse infrastructure (workspaces, clusters, catalogs)
L3 — Schema-as-code (dbt models, tests, contracts)
L4 — GitOps workflows for data pipelines (CI/CD, promotion)

Most teams have L1 covered with Terraform. L2 is where things get interesting. L3 and L4 are where they usually fall down.


Layer 1: Cloud Resources with Terraform

Module Structure for Data Platform Infrastructure

terraform/
├── modules/
│   ├── data-lake/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   └── README.md
│   ├── streaming-platform/
│   ├── data-warehouse/
│   └── governance/
├── environments/
│   ├── dev/
│   │   ├── main.tf
│   │   └── terraform.tfvars
│   ├── staging/
│   └── prod/
└── shared/
    ├── backend.tf
    └── providers.tf

Data Lake Module Example

# modules/data-lake/main.tf

locals {
  bucket_name = "${var.project_prefix}-${var.environment}-lakehouse"
  common_tags = merge(var.tags, {
    Module      = "data-lake"
    Environment = var.environment
    ManagedBy   = "terraform"
  })
}

resource "aws_s3_bucket" "lakehouse" {
  bucket = local.bucket_name
  tags   = local.common_tags
}

resource "aws_s3_bucket_versioning" "lakehouse" {
  bucket = aws_s3_bucket.lakehouse.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_lifecycle_configuration" "lakehouse" {
  bucket = aws_s3_bucket.lakehouse.id

  rule {
    id     = "bronze-to-glacier"
    status = "Enabled"
    filter { prefix = "bronze/" }
    transition {
      days          = 90
      storage_class = "GLACIER_IR"
    }
  }

  rule {
    id     = "silver-intelligent-tiering"
    status = "Enabled"
    filter { prefix = "silver/" }
    transition {
      days          = 30
      storage_class = "INTELLIGENT_TIERING"
    }
  }
}

# Lake Formation permissions
resource "aws_lakeformation_resource" "lakehouse" {
  arn      = aws_s3_bucket.lakehouse.arn
  role_arn = aws_iam_role.lakeformation_service.arn
}

resource "aws_lakeformation_permissions" "analyst_access" {
  for_each = toset(var.analyst_role_arns)

  principal   = each.value
  permissions = ["SELECT", "DESCRIBE"]

  table {
    database_name = aws_glue_catalog_database.silver.name
    wildcard      = true
  }
}

Glue Catalog as Code

# modules/data-lake/catalog.tf

resource "aws_glue_catalog_database" "bronze" {
  name        = "${var.project_prefix}_bronze"
  description = "Raw ingested data — immutable, append-only"
  
  create_table_default_permission {
    permissions = ["ALL"]
    principal {
      data_lake_principal_identifier = "IAM_ALLOWED_PRINCIPALS"
    }
  }
}

resource "aws_glue_catalog_database" "silver" {
  name        = "${var.project_prefix}_silver"
  description = "Cleaned, validated, conformed data"
}

resource "aws_glue_catalog_database" "gold" {
  name        = "${var.project_prefix}_gold"
  description = "Business-ready aggregates and data products"
}
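After `terraform apply`, a lightweight smoke test can confirm the medallion databases actually landed in the Glue catalog. A minimal Python sketch, assuming boto3 is available in the CI image and using `acme` as a placeholder project prefix:

```python
def expected_databases(project_prefix: str) -> list[str]:
    """Mirror the naming convention in modules/data-lake/catalog.tf."""
    return [f"{project_prefix}_{layer}" for layer in ("bronze", "silver", "gold")]

if __name__ == "__main__":
    import boto3  # assumed available in the CI environment

    glue = boto3.client("glue")
    for name in expected_databases("acme"):  # "acme" is a placeholder prefix
        glue.get_database(Name=name)  # raises EntityNotFoundException if missing
        print(f"ok: {name}")
```

Running this as a post-apply CI step catches a partially applied module before any pipeline depends on the missing database.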

Layer 2: Data Warehouse Infrastructure

Databricks Workspace with Terraform

# modules/databricks-workspace/main.tf

terraform {
  required_providers {
    databricks = {
      source  = "databricks/databricks"
      version = "~> 1.38"
    }
  }
}

resource "databricks_cluster_policy" "data_engineering" {
  name = "data-engineering-${var.environment}"
  definition = jsonencode({
    "spark_version" : {
      "type" : "allowlist",
      "values" : ["13.3.x-scala2.12", "14.3.x-scala2.12"],
      "defaultValue" : "14.3.x-scala2.12"
    },
    "node_type_id" : {
      "type" : "allowlist",
      "values" : ["m5d.xlarge", "m5d.2xlarge", "m5d.4xlarge"]
    },
    "autotermination_minutes" : {
      "type" : "fixed",
      "value" : 60,
      "hidden" : false
    },
    "custom_tags.team" : {
      "type" : "fixed",
      "value" : var.team_tag
    }
  })
}

resource "databricks_sql_warehouse" "main" {
  name             = "${var.project_prefix}-${var.environment}"
  cluster_size     = var.environment == "prod" ? "Medium" : "X-Small"
  auto_stop_mins   = 5
  min_num_clusters = 1
  max_num_clusters = var.environment == "prod" ? 3 : 1

  tags {
    custom_tags {
      key   = "environment"
      value = var.environment
    }
  }
}

resource "databricks_catalog" "main" {
  name    = var.catalog_name
  comment = "Main Unity Catalog for ${var.environment}"
  
  properties = {
    purpose = "data-platform"
  }
}

Layer 3: Schema-as-Code with dbt

Schema management deserves the same rigor as infrastructure. dbt provides this for transformation logic; combine it with schema migrations for your operational databases.

dbt Project Structure for Platform Teams

dbt_project/
├── dbt_project.yml
├── profiles.yml          # managed via Vault / environment vars
├── models/
│   ├── staging/          # raw → typed, renamed
│   ├── intermediate/     # business logic
│   └── marts/            # data products for consumers
├── tests/
│   ├── generic/          # reusable test macros
│   └── singular/         # one-off assertions
├── macros/
├── seeds/                # small reference tables, version-controlled
└── analyses/             # ad-hoc, not materialised

dbt Model with Data Contract

# models/marts/finance/schema.yml
version: 2

models:
  - name: fct_revenue
    description: "Daily revenue fact table — SLA: 99.9% freshness within 2h of close"
    config:
      contract:
        enforced: true           # dbt will fail if types don't match
    columns:
      - name: revenue_id
        data_type: varchar
        constraints:
          - type: not_null
          - type: unique
      - name: amount_usd
        data_type: numeric(18,4)
        constraints:
          - type: not_null
      - name: transaction_date
        data_type: date
        constraints:
          - type: not_null
    tests:
      - dbt_expectations.expect_column_values_to_be_between:
          column_name: amount_usd
          min_value: 0
          max_value: 10000000
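Contract coverage can itself be gated in CI. A hypothetical check, assuming the `schema.yml` above has already been parsed into a dict (e.g. with `yaml.safe_load`), that flags mart models without an enforced contract:

```python
def uncontracted_models(schema: dict) -> list[str]:
    """Return names of models whose dbt contract is not enforced.

    `schema` is a parsed schema.yml document (a dict with a "models" list),
    e.g. the result of yaml.safe_load on the file above.
    """
    offenders = []
    for model in schema.get("models", []):
        contract = model.get("config", {}).get("contract", {})
        if not contract.get("enforced", False):
            offenders.append(model["name"])
    return offenders
```

Wire it into CI as a failing check over every `models/marts/**/schema.yml`, so a new mart cannot merge without a contract.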

Layer 4: GitOps Workflows for Data Pipelines

CI/CD Pipeline Architecture

(Diagram: a pull request triggers Terraform plan validation and dbt checks; merge deploys through the environments.)

GitHub Actions Workflow

# .github/workflows/data-platform-ci.yml
name: Data Platform CI

on:
  pull_request:
    paths:
      - 'terraform/**'
      - 'dbt_project/**'
      - 'airflow/dags/**'

jobs:
  terraform-plan:
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: terraform/environments/staging
    steps:
      - uses: actions/checkout@v4
      
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: "1.7.x"
      
      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_CICD_ROLE_ARN }}
          aws-region: eu-west-1
      
      - name: Terraform Init
        run: terraform init -backend-config="key=staging/terraform.tfstate"
      
      - name: Terraform Plan
        run: terraform plan -out=tfplan.binary
      
      - name: Convert Plan to JSON
        run: terraform show -json tfplan.binary > tfplan.json
      
      - name: Validate Plan (no deletes in prod tables)
        run: |
          python scripts/validate_plan.py tfplan.json \
            --no-destroy-pattern "aws_glue_catalog_table" \
            --no-destroy-pattern "aws_s3_bucket.lakehouse"

  dbt-ci:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      
      - name: Install dbt
        run: pip install dbt-spark  # dbt-expectations is a dbt package, pulled in by `dbt deps` below
      
      - name: dbt deps
        run: dbt deps
        working-directory: dbt_project
      
      - name: dbt compile (syntax check)
        run: dbt compile --profiles-dir . --target ci
        working-directory: dbt_project
      
      - name: dbt test (dev schema)
        run: dbt test --profiles-dir . --target ci --store-failures
        working-directory: dbt_project
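One possible shape for the `validate_plan.py` guard invoked above — a sketch that reads `terraform show -json` output and fails the build when a protected resource would be destroyed (the plan-JSON structure follows Terraform's documented `resource_changes` format; the matching logic here is an assumption):

```python
import json
import sys

def find_destroys(plan: dict, patterns: list[str]) -> list[str]:
    """Return addresses of resources slated for deletion that match a pattern.

    A pattern matches if it is a substring of the resource address
    (e.g. "aws_s3_bucket.lakehouse") or equals the resource type
    (e.g. "aws_glue_catalog_table").
    """
    hits = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        if "delete" not in actions:  # also catches replace ([delete, create])
            continue
        addr = rc.get("address", "")
        if any(p in addr or p == rc.get("type") for p in patterns):
            hits.append(addr)
    return hits

if __name__ == "__main__":
    plan_file, args = sys.argv[1], sys.argv[2:]
    # collect every value following a --no-destroy-pattern flag
    patterns = [args[i + 1] for i, a in enumerate(args)
                if a == "--no-destroy-pattern"]
    with open(plan_file) as f:
        violations = find_destroys(json.load(f), patterns)
    if violations:
        print("Destructive changes blocked:", *violations, sep="\n  ")
        sys.exit(1)
```

Exiting non-zero is enough to fail the job, so a plan that would drop the lakehouse bucket or a catalog table never reaches `apply`.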

Environment Promotion Strategy

(Diagram: changes promote dev → staging → prod, each step gated by CI.)

Variable Sets Per Environment

# environments/prod/terraform.tfvars
environment        = "prod"
databricks_cluster_size = "Large"
min_workers        = 2
max_workers        = 20
enable_spot        = true
spot_bid_price_pct = 80
backup_retention   = 30
alert_email        = "data-platform-oncall@company.com"
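Keeping variable sets consistent across environments is easy to get wrong. A hypothetical pre-commit check that diffs top-level keys across the per-environment tfvars files — naive line-based parsing, which is sufficient for flat files like the one above:

```python
import re

# Matches a top-level `key = value` assignment in a flat tfvars file.
ASSIGN = re.compile(r"^\s*([A-Za-z_][A-Za-z0-9_]*)\s*=")

def tfvars_keys(text: str) -> set[str]:
    """Collect variable names from tfvars content, skipping comments."""
    keys = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0]  # strip trailing comments
        m = ASSIGN.match(line)
        if m:
            keys.add(m.group(1))
    return keys

def missing_keys(envs: dict[str, str]) -> dict[str, set[str]]:
    """For each environment, report keys that exist elsewhere but not here."""
    all_keys = set().union(*(tfvars_keys(t) for t in envs.values()))
    return {env: all_keys - tfvars_keys(t) for env, t in envs.items()}
```

Run it over `environments/*/terraform.tfvars` in CI so a variable added to prod but forgotten in staging fails the pull request, not the staging apply.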

State Management and Secrets

Remote State with Locking

# shared/backend.tf
terraform {
  backend "s3" {
    bucket         = "company-terraform-state"
    # Backend blocks cannot interpolate variables; the per-environment key is
    # injected at init time, e.g.:
    #   terraform init -backend-config="key=staging/data-platform/terraform.tfstate"
    region         = "eu-west-1"
    dynamodb_table = "terraform-state-locks"
    encrypt        = true
    kms_key_id     = "arn:aws:kms:eu-west-1:123456789:key/mrk-xxx"
  }
}

Secrets via Vault + Terraform

# Terraform mints the token and writes it back to Vault; pipelines read it
# from Vault at runtime rather than from Terraform state.
resource "databricks_token" "pipeline_sa" {
  comment          = "CI/CD service account — managed by Terraform"
  lifetime_seconds = 7776000  # 90 days, rotated by Vault
}

resource "vault_generic_secret" "databricks_token_output" {
  path = "secret/data-platform/${var.environment}/databricks"
  data_json = jsonencode({
    token = databricks_token.pipeline_sa.token_value
  })
}

Common Pitfalls and Mitigations

| Pitfall | Symptom | Mitigation |
|---|---|---|
| State drift | `terraform plan` shows unexpected changes | Enforce `terraform apply` only via CI/CD; protect the state bucket |
| Manual warehouse changes | Out-of-band schema edits → drift | Require all changes via PR; drift detection in CI |
| Secret sprawl | Credentials scattered across git, S3, Parameter Store, Vault | Single secrets backend; Terraform reads, never stores |
| Module versioning | Updates break all consumers | Pin module versions; use a private registry |
| Long plan times | 45-minute plans → developer frustration | Split state into smaller stacks; use `-target` for hotfixes |
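The drift detection mentioned above can be as simple as a scheduled job that interprets the exit code of `terraform plan -detailed-exitcode` (0 = no changes, 1 = error, 2 = pending changes). A sketch:

```python
import subprocess
import sys

# Documented exit codes for `terraform plan -detailed-exitcode`.
EXIT_MEANINGS = {0: "clean", 1: "error", 2: "drift"}

def classify(exit_code: int) -> str:
    """Map a -detailed-exitcode result to a drift status."""
    return EXIT_MEANINGS.get(exit_code, "error")

if __name__ == "__main__":
    # Read-only plan against live state; -lock=false avoids blocking real applies.
    proc = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-lock=false"],
        capture_output=True, text=True,
    )
    status = classify(proc.returncode)
    print(f"drift check: {status}")
    sys.exit(0 if status == "clean" else 1)
```

Scheduled nightly per environment, a "drift" result pages the platform team before the next deliberate apply surprises anyone.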

Summary

IaC for data platforms isn't just about Terraform. It's a discipline that spans cloud resources (L1), data infrastructure (L2), schemas and pipelines (L3), and data products (L4). Each layer needs version control, CI/CD gates, and environment promotion.

The teams that do this well have a single pull request flow for everything: a developer changes a dbt model, Terraform module, and Airflow DAG in the same commit, CI validates all three layers, and promotion to production is a one-click release.


Try Harbinger Explorer free for 7 days — use it to validate your API contracts and data platform endpoints as part of your CI/CD pipeline. Catch schema drift and breaking changes before they reach production. harbingerexplorer.com

