Zero Trust Architecture for Data Platforms


"Never trust, always verify" — the zero trust principle — was coined for network security, but it's increasingly the right mental model for data platform access control. The perimeter-based model assumes that anything inside your VPC is safe, yet modern data platforms span cloud accounts, regions, third-party services, and a workforce that accesses data from coffee shops. The perimeter is gone.

This guide covers how to implement zero trust principles specifically for data platforms: identity-first access, attribute-based controls, encryption at every layer, and continuous verification.


Why Data Platforms Are High-Value Targets

Data platforms aggregate the most sensitive information an organisation has:

  • PII at scale (millions of customer records in one query)
  • Financial data in analytical models
  • Intellectual property in ML training sets
  • Operational data that reveals business strategy

A compromised data warehouse isn't just a GDPR violation — it's potentially every trade secret the organisation has, queryable via SQL.

The traditional answer (VPC isolation + IP allowlisting) fails because:

  1. Most data is now in managed cloud services that don't live "inside" your VPC
  2. Analytical access requires broad read permissions that are difficult to scope
  3. Service accounts accumulate excessive permissions over time

Zero Trust Principles Applied to Data


Layer 1: Identity-First Data Access

Eliminate Service Account Key Files

Long-lived key files are the most common vector for data platform compromises. Replace them with short-lived credential exchange:


Terraform — OIDC trust policy for GitHub Actions:

data "aws_iam_policy_document" "github_actions_trust" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRoleWithWebIdentity"]
    
    principals {
      type        = "Federated"
      identifiers = [aws_iam_openid_connect_provider.github.arn]
    }
    
    condition {
      test     = "StringLike"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:myorg/data-platform:ref:refs/heads/main"]
    }
    
    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "pipeline_execution" {
  name               = "data-pipeline-cicd"
  assume_role_policy = data.aws_iam_policy_document.github_actions_trust.json
  max_session_duration = 3600  # 1 hour max
}
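The `sub` condition in the trust policy is just a wildcard pattern match on the token's subject claim, evaluated by AWS before any credentials are issued. A hypothetical Python sketch of that check (the pattern list and helper name are illustrative; `fnmatch` supports the same `*`/`?` wildcards as IAM `StringLike`):

```python
from fnmatch import fnmatch

# Patterns mirroring the trust policy's StringLike condition (hypothetical helper)
ALLOWED_SUB_PATTERNS = [
    "repo:myorg/data-platform:ref:refs/heads/main",
]

def sub_claim_allowed(sub: str) -> bool:
    """True if the OIDC token's `sub` claim matches an allowed pattern.

    IAM StringLike supports * and ? wildcards, which fnmatch also handles,
    so this approximates the evaluation AWS performs at AssumeRole time.
    """
    return any(fnmatch(sub, pattern) for pattern in ALLOWED_SUB_PATTERNS)
```

A push to `main` passes; a feature branch or a fork's token never matches, so there is no long-lived secret anywhere to leak.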

Databricks Unity Catalog — Identity Federation

# Databricks Unity Catalog with SCIM provisioning
resource "databricks_user" "data_engineer" {
  for_each     = var.data_engineer_emails
  user_name    = each.value
  display_name = each.key
  # SCIM handles provisioning/deprovisioning from IdP
  # No local password — SSO only
  force_delete_repos = true
  force_delete_home_dir = true
}

resource "databricks_group_member" "de_team" {
  for_each  = var.data_engineer_emails
  group_id  = databricks_group.data_engineers.id
  member_id = databricks_user.data_engineer[each.key].id
}

# Grant table access to group, not individuals
resource "databricks_grants" "silver_layer" {
  table = "main.silver.customer_events"

  grant {
    principal  = "data-engineers"
    privileges = ["SELECT", "MODIFY"]
  }

  grant {
    principal  = "analysts"
    privileges = ["SELECT"]
  }
}

Layer 2: Attribute-Based Access Control (ABAC)

Role-based access control (RBAC) doesn't scale for data platforms. When you have 500 tables, 50 teams, and 3 environments, the RBAC matrix explodes. ABAC uses data attributes (classification, domain, sensitivity) and user attributes (team, clearance, location) to compute access dynamically.
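To make the contrast concrete, here is a minimal Python sketch (attribute names are invented for illustration) of an ABAC evaluation: the decision is computed from data and user attributes at request time, rather than looked up in a per-table grant matrix:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Asset:
    classification: str   # e.g. "public", "internal", "confidential", "PII"
    domain: str           # e.g. "customer", "finance"

@dataclass(frozen=True)
class User:
    team: str
    clearances: frozenset  # classifications this identity may read
    domains: frozenset     # data domains this identity may read

def can_read(user: User, asset: Asset) -> bool:
    # One rule covers every table carrying these attributes —
    # adding table number 501 requires no new grants, only correct tags
    return (asset.classification in user.clearances
            and asset.domain in user.domains)
```

One predicate replaces the 500 × 50 × 3 explicit grants; the governance effort shifts to keeping the tags accurate.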

Data Classification Tags

# Tag every data asset at creation
resource "aws_glue_catalog_table" "customer_pii" {
  name          = "customer_profiles"
  database_name = aws_glue_catalog_database.silver.name
  
  parameters = {
    "data_classification" = "PII"
    "data_domain"         = "customer"
    "sensitivity"         = "high"
    "gdpr_relevant"       = "true"
    "retention_days"      = "730"
    "owner_team"          = "customer-platform"
  }
  
  # ... schema definition
}
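Tags only enable ABAC if they are actually present on every asset. A hypothetical CI check, assuming a required-tag policy of your choosing and a `parameters` dict shaped like the Glue table definition above:

```python
# Mandatory tags every catalog table must carry (assumed policy — adjust to taste)
REQUIRED_TAGS = {"data_classification", "data_domain", "sensitivity", "owner_team"}

def missing_tags(parameters: dict) -> set:
    """Return the mandatory tags absent from a Glue table's parameters map.

    Run against every table in CI and fail the build if any result is
    non-empty, so untagged (and therefore ungoverned) tables never ship.
    """
    return REQUIRED_TAGS - parameters.keys()
```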

Lake Formation — ABAC Tag Policy

# Grant access based on data classification tags, not specific tables
resource "aws_lakeformation_lf_tag" "classification" {
  key    = "data_classification"
  values = ["public", "internal", "confidential", "PII", "restricted"]
}

# Data engineers can access internal and confidential, not PII
resource "aws_lakeformation_permissions" "engineer_access" {
  principal   = "arn:aws:iam::123456789012:role/data-engineers"
  permissions = ["SELECT", "DESCRIBE"]

  lf_tag_policy {
    resource_type = "TABLE"
    expression {
      key    = "data_classification"
      values = ["public", "internal", "confidential"]
    }
  }
}

# PII access requires explicit DPO approval (separate role)
resource "aws_lakeformation_permissions" "pii_approved_access" {
  principal                     = "arn:aws:iam::123456789012:role/pii-approved-analysts"
  permissions                   = ["SELECT"]
  permissions_with_grant_option = []

  lf_tag_policy {
    resource_type = "TABLE"
    expression {
      key    = "data_classification"
      values = ["PII"]
    }
  }
}

Layer 3: Column-Level Security and Data Masking

Even users with table access shouldn't always see all columns. Column-level security with dynamic masking implements this without duplicating data.

BigQuery Column-Level Security

-- Create a policy tag taxonomy
-- (done via Data Catalog API or Terraform)

-- Assign policy tag to sensitive column
CREATE OR REPLACE TABLE analytics.customer_orders (
  order_id        STRING,
  customer_id     STRING,
  email           STRING OPTIONS (
    description='PII — protected by policy tag',
    policy_tags=['projects/my-project/locations/us/taxonomies/12345/policyTags/67890']
  ),
  amount_usd      NUMERIC,
  created_at      TIMESTAMP
);

-- Analysts with only the Masked Reader role on the policy tag see:
-- SELECT * → email column returns the masked value (NULL, hash, or default)
-- No error, no indication that data is being masked
-- Analysts with neither Masked Reader nor Fine-Grained Reader get an
-- access-denied error when a query references the email column

Snowflake Dynamic Data Masking

-- Create masking policy
CREATE OR REPLACE MASKING POLICY pii_email_mask AS (val STRING)
RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() IN ('PII_APPROVED_ANALYST', 'DPO_TEAM') THEN val
    WHEN CURRENT_ROLE() = 'ANALYST' THEN 
      REGEXP_REPLACE(val, '(.{2}).*(@.*)', '\\1***\\2')  -- partial mask
    ELSE '***REDACTED***'
  END;

-- Apply to column
ALTER TABLE customer_orders 
  MODIFY COLUMN email 
  SET MASKING POLICY pii_email_mask;

-- Test as analyst role:
USE ROLE ANALYST;
SELECT email FROM customer_orders LIMIT 5;
-- Returns: jo***@example.com, ma***@company.org, ...
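The masking expression is portable, so it's worth unit-testing the regex before deploying the policy. A Python equivalent (function name is illustrative):

```python
import re

def partial_mask_email(email: str) -> str:
    # Same transform as the masking policy: keep the first two characters
    # of the local part and the domain, mask everything in between
    return re.sub(r"(.{2}).*(@.*)", r"\1***\2", email)
```

One quirk the test surfaces: a one-character local part like `a@x.com` never matches the pattern and passes through unmasked, so a production policy should handle short local parts explicitly.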

Layer 4: Network Micro-Segmentation

Private Endpoints for All Data Services

# S3 Gateway Endpoint (free)
resource "aws_vpc_endpoint" "s3" {
  vpc_id          = aws_vpc.data_platform.id
  service_name    = "com.amazonaws.${var.region}.s3"
  route_table_ids = aws_route_table.private[*].id
  
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = "*"
      Action    = ["s3:GetObject", "s3:PutObject", "s3:ListBucket"]
      Resource  = [
        aws_s3_bucket.lakehouse.arn,
        "${aws_s3_bucket.lakehouse.arn}/*"
      ]
    }]
  })
}

# Restrict S3 bucket to VPC endpoint only
resource "aws_s3_bucket_policy" "lakehouse_vpc_only" {
  bucket = aws_s3_bucket.lakehouse.id
  
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Deny"
      Principal = "*"
      Action    = "s3:*"
      Resource  = [
        aws_s3_bucket.lakehouse.arn,
        "${aws_s3_bucket.lakehouse.arn}/*"
      ]
      Condition = {
        StringNotEquals = {
          "aws:sourceVpce" = aws_vpc_endpoint.s3.id
        }
      }
    }]
  })
}

Layer 5: Encryption at Every Layer

Encryption Architecture

# Separate KMS keys per data classification
resource "aws_kms_key" "pii_data" {
  description             = "PII data encryption — data platform"
  deletion_window_in_days = 30
  enable_key_rotation     = true
  
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "Enable DPO team management"
        Effect = "Allow"
        Principal = { AWS = var.dpo_team_role_arn }
        Action = ["kms:*"]
        Resource = "*"
      },
      {
        Sid    = "Allow approved roles to use key"
        Effect = "Allow"
        Principal = { AWS = [
          var.pii_pipeline_role_arn,
          var.pii_analyst_role_arn
        ]}
        Action = ["kms:GenerateDataKey", "kms:Decrypt"]
        Resource = "*"
      },
      {
        Sid    = "Deny all others"
        Effect = "Deny"
        Principal = { AWS = "*" }
        Action = ["kms:GenerateDataKey", "kms:Decrypt"]
        Resource = "*"
        Condition = {
          StringNotLike = {
            "aws:PrincipalArn" = [
              var.dpo_team_role_arn,
              var.pii_pipeline_role_arn,
              var.pii_analyst_role_arn
            ]
          }
        }
      }
    ]
  })
}
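Granting only `kms:GenerateDataKey` and `kms:Decrypt` is exactly the pair envelope encryption needs: a writer obtains a fresh data key, encrypts locally, and persists only the wrapped key. A sketch of that flow using an in-memory stand-in for the KMS client (the real boto3 client exposes `generate_data_key` and `decrypt` with these response shapes; the class here is illustrative, not real cryptography):

```python
import os

class InMemoryKms:
    """Illustrative stand-in for boto3's KMS client — not real cryptography."""

    def __init__(self):
        self._wrapped = {}

    def generate_data_key(self, KeyId, KeySpec="AES_256"):
        plaintext = os.urandom(32)   # data key used locally, then discarded
        blob = os.urandom(16)        # opaque wrapped-key handle, stored with the data
        self._wrapped[blob] = plaintext
        return {"Plaintext": plaintext, "CiphertextBlob": blob}

    def decrypt(self, CiphertextBlob):
        return {"Plaintext": self._wrapped[CiphertextBlob]}
```

A writer calls `generate_data_key`, encrypts the object with `Plaintext` (e.g. AES-GCM), discards it, and stores `CiphertextBlob` alongside the ciphertext; a reader calls `decrypt` on the blob. Neither role ever needs `kms:Encrypt` or broader key permissions.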

Layer 6: Continuous Verification and Anomaly Detection

Zero trust isn't "verify once and trust." It's continuous.

Query Anomaly Detection

# Pseudocode for query audit log analysis
# Run as a scheduled Spark job on CloudTrail / audit logs

from pyspark.sql import functions as F

audit_logs = spark.table("security.data_access_audit")

# Detect unusual data volume access
anomalies = (
    audit_logs
    .where(F.col("event_date") == F.current_date())
    .groupBy("principal_id", "table_name")
    .agg(
        F.sum("bytes_scanned").alias("bytes_today"),
        F.count("*").alias("query_count")
    )
    .join(
        # Compare against 30-day baseline
        audit_logs
        .where(F.col("event_date") >= F.date_sub(F.current_date(), 30))
        .groupBy("principal_id", "table_name")
        .agg((F.sum("bytes_scanned") / 30).alias("avg_daily_bytes")),
        on=["principal_id", "table_name"],
        how="left"
    )
    .where(
        # A NULL baseline (left-join miss) would make the bare comparison
        # NULL and silently drop first-time access — flag it instead
        F.col("avg_daily_bytes").isNull()
        | (F.col("bytes_today") > F.col("avg_daily_bytes") * 10)  # 10x spike
    )
)

# Alert via PagerDuty / Slack (alert_security_team is a pseudocode hook);
# the anomaly set is small, so collecting to the driver is safe
for row in anomalies.collect():
    alert_security_team(row)
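The spike rule itself reduces to a small predicate worth unit-testing on its own; in this sketch the 10x threshold and the treat-missing-baseline-as-anomalous choice are assumptions to tune:

```python
from typing import Optional

def is_anomalous(bytes_today: int, avg_daily_bytes: Optional[float],
                 spike_factor: float = 10.0) -> bool:
    """Flag a principal/table pair whose daily scan volume spikes.

    A missing baseline (no access in the lookback window) is flagged too:
    first-time bulk reads are precisely what exfiltration looks like.
    """
    if avg_daily_bytes is None:
        return True
    return bytes_today > avg_daily_bytes * spike_factor
```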

Harbinger Explorer for API Access Auditing

When your data platform exposes APIs (and they all do — from Athena federation endpoints to custom REST APIs), you need continuous visibility into which endpoints are being called, with what parameters, and whether responses match expected schemas. Harbinger Explorer provides this testing and monitoring layer, letting you catch unexpected access patterns or schema deviations before they become security incidents.


Zero Trust Maturity Model

| Level | Description | Key Controls |
|---|---|---|
| 0 — Implicit trust | VPC = trusted; anyone inside can query anything | None |
| 1 — Identity-aware | Authentication required; coarse RBAC | SSO, basic roles |
| 2 — Data-aware | ABAC on data classification; column masking | Policy tags, masking policies |
| 3 — Context-aware | Access varies by time, location, device posture | Conditional access, MFA step-up |
| 4 — Continuous | Every query re-evaluated; anomaly detection; immutable audit logs | SIEM integration, ML anomaly detection |

Most mature data platforms operate at Level 2-3. Level 4 is appropriate for organisations handling financial services data, healthcare records, or government information.


Summary

Zero trust for data platforms is a layered discipline: identity-first authentication eliminates the key file problem; ABAC scales access control beyond what RBAC can manage; column-level masking protects sensitive fields without data duplication; network micro-segmentation removes lateral movement; and continuous verification catches anomalies before they become breaches.

Start with Layer 1 (eliminate key files, enforce SSO) and Layer 2 (classify your data, apply ABAC). The impact-to-effort ratio is highest there, and it builds the foundation for the deeper controls.


Try Harbinger Explorer free for 7 days — validate your data API security posture, test that your access controls return correct responses, and monitor for unexpected access patterns across your data platform endpoints. harbingerexplorer.com

