Harbinger Explorer


Databricks Cluster Policies for Cost Control: A Practical Guide

10 min read · Tags: databricks, cost-optimization, cluster-policies, finops, governance

Databricks is powerful. It's also remarkably easy to accidentally spin up a 32-node Standard_E64s_v3 cluster and forget to terminate it. Anyone who's managed a Databricks workspace for more than a month has a story like this.

Cluster policies are Databricks' mechanism for preventing these expensive mistakes while still giving your team the flexibility they need. Done right, they're invisible guardrails — your engineers can work freely within a safe boundary.


What Are Cluster Policies?

A cluster policy is a JSON document that constrains the values users can set when creating or editing clusters. Policies can:

  • Fix a value (e.g., always enable auto-termination)
  • Limit a value to a range or list (e.g., max 8 workers)
  • Set defaults (e.g., default to spot instances)
  • Hide fields from the UI (simplify the creation experience)
  • Require specific values (e.g., must tag clusters with a cost center)

Policies are assigned to users, groups, or service principals. A user can only create clusters using policies they have access to (unless they're a workspace admin).


Why Cluster Policies Matter (With Numbers)

Consider a team of 10 data engineers. Without policies:

Scenario | Config | Cost
Overpowered dev cluster | 8x Standard_E16s_v3 (on-demand) | ~$18/hr
Forgotten overnight cluster | 4x Standard_DS3_v2 (on-demand) | ~$30 total
Production job over-provisioned | 16x Standard_E32s_v3 | ~$60/hr

A single forgotten cluster running for a weekend = ~$150 in waste. Multiply by 10 engineers over a year, and you're looking at thousands in avoidable spend.

With cluster policies enforcing auto-termination and spot instances:

  • Auto-terminates after 30 min = ~$1.50 wasted instead of $30
  • Uses spot instances = ~60% cheaper baseline
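These numbers are easy to sanity-check. A quick back-of-the-envelope calculation in Python, using the article's approximate rates (the $3/hr figure is back-derived from the ~$30 overnight number, not a quoted Azure price):

```python
# Back-of-the-envelope savings estimate using the article's approximate rates.

FORGOTTEN_CLUSTER_RATE = 3.0   # ~$/hr for 4x Standard_DS3_v2 (approximation)
OVERNIGHT_HOURS = 10           # cluster idles from evening to morning

# Without policies: the cluster idles all night.
waste_no_policy = FORGOTTEN_CLUSTER_RATE * OVERNIGHT_HOURS   # ~$30

# With a 30-minute auto-termination policy: it idles half an hour.
waste_with_policy = FORGOTTEN_CLUSTER_RATE * 0.5             # ~$1.50

# Spot instances cut the baseline hourly rate itself by ~60%.
spot_rate = FORGOTTEN_CLUSTER_RATE * (1 - 0.60)

print(f"wasted without policy: ${waste_no_policy:.2f}")
print(f"wasted with policy:    ${waste_with_policy:.2f}")
print(f"spot hourly rate:      ${spot_rate:.2f}/hr")
```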

Policy Definition Language

Policies are JSON documents. Each attribute maps to a constraint; which fields apply depends on the constraint type (value for fixed, minValue/maxValue for range, values for allowlist and blocklist, pattern for regex):

{
  "attribute_name": {
    "type": "fixed | range | allowlist | blocklist | regex | unlimited | forbidden",
    "value": "...",
    "minValue": 0,
    "maxValue": 10,
    "values": [],
    "pattern": "...",
    "defaultValue": "...",
    "hidden": true
  }
}
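To make the constraint semantics concrete, here is a toy validator in Python. This is a hypothetical sketch for illustration only; the real enforcement happens server-side in Databricks:

```python
import re


def check_attribute(policy_rule: dict, value) -> bool:
    """Return True if `value` satisfies a single policy rule.

    Toy re-implementation of the main constraint types, for illustration.
    """
    kind = policy_rule["type"]
    if kind == "fixed":
        return value == policy_rule["value"]
    if kind == "range":
        lo = policy_rule.get("minValue", float("-inf"))
        hi = policy_rule.get("maxValue", float("inf"))
        return lo <= value <= hi
    if kind == "allowlist":
        return value in policy_rule["values"]
    if kind == "blocklist":
        return value not in policy_rule["values"]
    if kind == "regex":
        return re.fullmatch(policy_rule["pattern"], value) is not None
    if kind == "unlimited":
        return True
    if kind == "forbidden":
        return False  # the attribute may not be set at all
    raise ValueError(f"unknown constraint type: {kind}")


# Example: the worker-count rule from the engineer policy below
rule = {"type": "range", "minValue": 1, "maxValue": 8}
print(check_attribute(rule, 4))   # within range
print(check_attribute(rule, 32))  # rejected
```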

Policy Examples

1. Basic Cost Control Policy (for all data engineers)

{
  "autotermination_minutes": {
    "type": "range",
    "minValue": 10,
    "maxValue": 120,
    "defaultValue": 30
  },
  "num_workers": {
    "type": "range",
    "minValue": 1,
    "maxValue": 8
  },
  "node_type_id": {
    "type": "allowlist",
    "values": [
      "Standard_DS3_v2",
      "Standard_DS4_v2",
      "Standard_DS5_v2",
      "Standard_E4s_v3",
      "Standard_E8s_v3"
    ],
    "defaultValue": "Standard_DS3_v2"
  },
  "azure_attributes.availability": {
    "type": "fixed",
    "value": "SPOT_WITH_FALLBACK_AZURE",
    "hidden": true
  },
  "spark_version": {
    "type": "regex",
    "pattern": "^(14|15)\\.[0-9]+\\.x-scala2\\.12$",
    "defaultValue": "15.4.x-scala2.12"
  },
  "custom_tags.team": {
    "type": "fixed",
    "value": "data-engineering"
  },
  "custom_tags.cost_center": {
    "type": "allowlist",
    "values": ["harbinger", "research", "infra"]
  }
}

This policy:

  • Forces auto-termination between 10-120 minutes (default 30)
  • Limits cluster size to 8 workers max
  • Restricts to cost-effective instance types
  • Forces spot instances (transparent to user)
  • Ensures LTS Spark versions
  • Requires cost center tagging for chargeback

2. Single-Node Policy (for interactive development)

For lightweight exploration and testing, a single-node policy keeps costs minimal:

{
  "num_workers": {
    "type": "fixed",
    "value": 0,
    "hidden": true
  },
  "spark_conf.spark.databricks.cluster.profile": {
    "type": "fixed",
    "value": "singleNode",
    "hidden": true
  },
  "spark_conf.spark.master": {
    "type": "fixed",
    "value": "local[*]",
    "hidden": true
  },
  "custom_tags.ResourceClass": {
    "type": "fixed",
    "value": "SingleNode",
    "hidden": true
  },
  "autotermination_minutes": {
    "type": "fixed",
    "value": 60,
    "hidden": true
  },
  "node_type_id": {
    "type": "allowlist",
    "values": ["Standard_DS3_v2", "Standard_DS4_v2"],
    "defaultValue": "Standard_DS3_v2"
  }
}

3. Production Job Policy (for automated workflows)

Production jobs prioritize reliability over raw cost savings. This policy allows larger clusters while keeping spot with on-demand fallback: a spot_bid_max_price of -1 caps bids at the on-demand price, and auto-termination is disabled (fixed at 0) because job clusters shut down automatically when the run finishes:

{
  "num_workers": {
    "type": "range",
    "minValue": 2,
    "maxValue": 32
  },
  "autotermination_minutes": {
    "type": "fixed",
    "value": 0,
    "hidden": true
  },
  "azure_attributes.availability": {
    "type": "fixed",
    "value": "SPOT_WITH_FALLBACK_AZURE",
    "hidden": true
  },
  "azure_attributes.spot_bid_max_price": {
    "type": "fixed",
    "value": -1,
    "hidden": true
  },
  "custom_tags.environment": {
    "type": "fixed",
    "value": "production"
  }
}

4. Unrestricted Policy (for workspace admins only)

Give your platform team full flexibility while still tracking usage:

{
  "custom_tags.team": {
    "type": "fixed",
    "value": "platform"
  }
}

Creating Policies via CLI

The Databricks CLI passes create and edit requests as JSON via the --json flag. Note that the definition field inside the request body is itself a JSON-escaped string:

# Create a policy from a JSON request file
# (the file contains {"name": "...", "definition": "<escaped policy JSON>"})
databricks cluster-policies create --json @policies/engineer_policy.json

# Update an existing policy (include policy_id and name in the request body)
databricks cluster-policies edit --json @policies/engineer_policy.json

# List all policies
databricks cluster-policies list

Creating Policies via Terraform

If you manage Databricks infrastructure with Terraform:

resource "databricks_cluster_policy" "engineer_policy" {
  name = "Data Engineers - Cost Controlled"
  definition = jsonencode({
    "autotermination_minutes" = {
      type         = "range"
      minValue     = 10
      maxValue     = 120
      defaultValue = 30
    }
    "num_workers" = {
      type     = "range"
      minValue = 1
      maxValue = 8
    }
    "azure_attributes.availability" = {
      type   = "fixed"
      value  = "SPOT_WITH_FALLBACK_AZURE"
      hidden = true
    }
    "custom_tags.cost_center" = {
      type   = "allowlist"
      values = ["harbinger", "research", "infra"]
    }
  })
}

resource "databricks_permissions" "engineer_policy_access" {
  cluster_policy_id = databricks_cluster_policy.engineer_policy.id

  access_control {
    group_name       = "data-engineers"
    permission_level = "CAN_USE"
  }
}

Enforcing Policies at Scale

Remove Unrestricted Cluster Creation

By default, workspace users may hold the "unrestricted cluster creation" entitlement, which lets them bypass policies entirely. Revoke it from the workspace users group so that every new cluster must go through a policy (workspace admins are unaffected). With the Terraform provider used above:

data "databricks_group" "users" {
  display_name = "users"
}

resource "databricks_entitlements" "workspace_users" {
  group_id             = data.databricks_group.users.id
  allow_cluster_create = false
}

Monitor Policy Compliance

Use Databricks system tables to track cluster creation and policy adherence:

-- Find clusters created without a policy (ungoverned spend).
-- Note: system.compute.clusters keeps one row per configuration change;
-- dedupe on cluster_id if you only want the latest state.
SELECT
  cluster_id,
  cluster_name,
  owned_by,
  create_time,
  worker_node_type,
  worker_count,
  auto_termination_minutes
FROM system.compute.clusters
WHERE policy_id IS NULL
  AND create_time >= current_timestamp() - INTERVAL 30 DAYS
ORDER BY create_time DESC;

-- Total DBU consumption by policy over the last 30 days.
-- Policy names are not in system tables; resolve policy_id via the
-- Cluster Policies API.
SELECT
  c.policy_id,
  SUM(u.usage_quantity) AS total_dbus
FROM system.billing.usage u
JOIN (
  SELECT DISTINCT cluster_id, policy_id
  FROM system.compute.clusters
) c
  ON u.usage_metadata.cluster_id = c.cluster_id
WHERE u.usage_date >= current_date() - INTERVAL 30 DAYS
  AND u.usage_unit = 'DBU'
GROUP BY c.policy_id
ORDER BY total_dbus DESC;

Implementing a FinOps Review Process

Cluster policies are a technical control. They work best paired with a lightweight process:

  1. Weekly cost review — query system.billing.usage and share with team leads
  2. Tag enforcement — require cost_center and owner tags; use these for chargeback reports
  3. Policy review cadence — review policy limits quarterly; adjust as team needs change
  4. Alert on spend anomalies — set Databricks SQL alerts on system.billing.usage for unexpected spikes

-- Alert query: yesterday's spend over $50 (adjust threshold for your team)
SELECT
  u.usage_date,
  SUM(u.usage_quantity * lp.pricing.default) AS daily_cost_usd
FROM system.billing.usage u
JOIN system.billing.list_prices lp
  ON u.sku_name = lp.sku_name
  AND lp.currency_code = 'USD'
  AND u.usage_start_time >= lp.price_start_time
  AND (lp.price_end_time IS NULL OR u.usage_start_time < lp.price_end_time)
WHERE u.usage_date = current_date() - INTERVAL 1 DAY
GROUP BY u.usage_date
HAVING SUM(u.usage_quantity * lp.pricing.default) > 50;

Policy Hierarchy: Job vs Interactive Clusters

One common confusion: cluster policies apply differently to interactive clusters (created manually) vs job clusters (created by Databricks Workflows).

  • Interactive clusters — fully governed by policies; users must select a policy they have access to
  • Job clusters — the job definition includes a cluster spec; attaching a policy_id is optional but recommended
  • Existing all-purpose clusters — a job can attach to an already-running cluster, so any policy was enforced when that cluster was created, not when the job runs

For job clusters, enforce policies in your Job definitions and CI/CD pipeline:

{
  "job_clusters": [
    {
      "job_cluster_key": "default",
      "new_cluster": {
        "policy_id": "ABCD1234",
        "spark_version": "15.4.x-scala2.12",
        "num_workers": 4
      }
    }
  ]
}
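To cover the CI/CD side, a lint step can reject job specs whose new clusters omit a policy_id. A minimal sketch, assuming job definitions live as JSON in your repo (the helper name is hypothetical):

```python
def missing_policy_ids(job_spec: dict) -> list:
    """Return the job_cluster_keys whose new_cluster has no policy_id.

    Hypothetical CI helper: load each job definition with json.load and
    fail the build if this returns a non-empty list.
    """
    offenders = []
    for jc in job_spec.get("job_clusters", []):
        if "policy_id" not in jc.get("new_cluster", {}):
            offenders.append(jc.get("job_cluster_key", "<unnamed>"))
    return offenders


# One governed and one ungoverned job cluster
spec = {
    "job_clusters": [
        {"job_cluster_key": "governed",
         "new_cluster": {"policy_id": "ABCD1234", "num_workers": 4}},
        {"job_cluster_key": "ungoverned",
         "new_cluster": {"num_workers": 4}},
    ]
}
print(missing_policy_ids(spec))  # → ['ungoverned']
```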

Measuring the Impact

After implementing policies at a mid-sized team (15 engineers), here's what typically changes:

Metric | Before Policies | After Policies
Avg cluster termination | 4.2 hours after last use | 28 minutes
Clusters using spot instances | 22% | 94%
Monthly compute spend | Baseline | 35-50% reduction
Ungoverned clusters per month | Uncounted | Less than 5 (admins only)

Wrapping Up

Cluster policies are one of the highest-ROI governance investments you can make in a Databricks workspace. A few hours of setup translates to continuous cost savings, better standardization, and fewer 2am alerts about unexpected cloud bills.

Start with the basics: mandatory auto-termination, spot instances, and size limits. Then layer in tagging requirements and monitoring once the foundation is solid.

At Harbinger Explorer, our Databricks workspace runs fully policy-governed. Every cluster our team spins up is within guardrails — keeping infrastructure costs lean so we can focus on building intelligence, not managing bills.


Try Harbinger Explorer free for 7 days — we practice what we preach. Start your free trial at harbingerexplorer.com.

