Security Patterns for Cloud Data Lakehouses: A Comprehensive Guide

13 min read · Tags: data-lakehouse, security, delta-lake, iceberg, data-governance, compliance


The data lakehouse has emerged as the dominant architectural pattern for analytical platforms — combining the scalability and cost-efficiency of object storage with the transactional guarantees and query performance of a traditional data warehouse. But with consolidated data comes consolidated risk. A misconfigured lakehouse can expose PII, financial records, and sensitive operational data to anyone with a storage account access key.

This guide covers the full security stack for cloud data lakehouses built on Delta Lake, Apache Iceberg, or Apache Hudi.


The Lakehouse Security Surface

Before designing controls, map your attack surface.

Security controls operate at four layers:

  1. Identity — who is allowed to authenticate
  2. Access control — what authenticated identities can read/write
  3. Storage — how data is protected at rest and in transit
  4. Audit — what was accessed, by whom, and when

Layer 1: Identity and Authentication

Federate Everything

Never create local database users for human identities. Federate all authentication through your corporate Identity Provider (IdP):

Platform              Federation Mechanism
Databricks            SCIM + SAML 2.0 / OIDC via AAD or Okta
AWS Lake Formation    IAM Identity Center (SSO)
GCP BigLake           Google Workspace / Cloud Identity
Snowflake             SAML 2.0 / SCIM

Service accounts for pipelines should use workload identity federation rather than long-lived keys:

# AWS: Use IAM roles for EC2/EKS instead of access keys
# Attach role to EKS service account via IRSA
eksctl create iamserviceaccount \
  --name spark-pipeline \
  --namespace data-platform \
  --cluster harbinger-prod \
  --attach-policy-arn arn:aws:iam::123456789:policy/LakehouseReadWrite \
  --approve

Secret Rotation Policy

Secret Type            Max Lifetime         Rotation Method
Human passwords        90 days              IdP-enforced
Service account keys   30 days              Automated via Secrets Manager
API tokens             7 days               Short-lived tokens preferred
Storage access keys    Never (use roles)    Replace with IAM roles
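The policy above can be enforced mechanically in a credential-inventory job. A minimal Python sketch, assuming the lifetimes in the table (the MAX_LIFETIME_DAYS map and rotation_due helper are illustrative names, not any vendor API):

```python
from datetime import date, timedelta

# Max lifetimes from the policy table above; None means the credential
# type should not exist at all (replace with IAM roles).
MAX_LIFETIME_DAYS = {
    "human_password": 90,
    "service_account_key": 30,
    "api_token": 7,
    "storage_access_key": None,
}

def rotation_due(secret_type: str, created: date, today: date) -> bool:
    """True when the secret exceeds its policy lifetime, or should not exist."""
    limit = MAX_LIFETIME_DAYS[secret_type]
    if limit is None:
        return True  # static storage keys are always a finding
    return (today - created) > timedelta(days=limit)
```

A scheduled job can run this over your secrets inventory and open a ticket for every `True`.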

Layer 2: Access Control Patterns

Unity Catalog (Databricks)

Unity Catalog provides a three-level namespace (catalog.schema.table) with fine-grained access controls at every level. This is currently the most mature governance layer for Delta Lake workloads.

-- Create a catalog for production data
CREATE CATALOG harbinger_prod;

-- Grant schema-level access to data engineers
GRANT USE CATALOG ON CATALOG harbinger_prod TO `data-engineers`;
GRANT CREATE SCHEMA ON CATALOG harbinger_prod TO `data-engineers`;

-- Grant read-only access to analysts
GRANT USE CATALOG ON CATALOG harbinger_prod TO `analysts`;
GRANT USE SCHEMA ON SCHEMA harbinger_prod.geopolitical TO `analysts`;
GRANT SELECT ON TABLE harbinger_prod.geopolitical.events TO `analysts`;

-- Revoke direct storage access
REVOKE ALL PRIVILEGES ON EXTERNAL LOCATION raw_s3 FROM `analysts`;

Column-Level Security

Protect sensitive columns (PII, classified fields) without restructuring your tables:

-- Mask email column for non-privileged users
CREATE OR REPLACE FUNCTION harbinger_prod.security.mask_email(email STRING)
RETURNS STRING
RETURN CASE
  WHEN IS_MEMBER('pii-readers') THEN email
  ELSE CONCAT(LEFT(email, 2), '****@****.com')
END;

ALTER TABLE harbinger_prod.users.profiles
ALTER COLUMN email SET MASK harbinger_prod.security.mask_email;
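It is worth unit-testing the masking pattern outside the warehouse before rolling it out. A Python mirror of the SQL function (mask_email below is an illustrative re-implementation for tests, not Databricks code):

```python
def mask_email(email: str, is_pii_reader: bool) -> str:
    """Mirror of the SQL mask: privileged readers see the value, others a stub."""
    if is_pii_reader:
        return email
    return email[:2] + "****@****.com"
```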

Row-Level Security

Restrict which rows a user can see based on their group membership or attributes:

-- Row filter: analysts only see events for their assigned regions
CREATE OR REPLACE FUNCTION harbinger_prod.security.region_filter(region STRING)
RETURNS BOOLEAN
RETURN IS_MEMBER('global-analysts')
  OR EXISTS (
    SELECT 1 FROM harbinger_prod.security.analyst_regions ar
    WHERE ar.user_email = CURRENT_USER()
    AND ar.region = region
  );

ALTER TABLE harbinger_prod.geopolitical.events
SET ROW FILTER harbinger_prod.security.region_filter ON (region);

AWS Lake Formation: Tag-Based Access Control (TBAC)

For AWS-native lakehouses on Glue / Athena / EMR:

# Create LF tags
aws lakeformation create-lf-tag \
  --tag-key "Sensitivity" \
  --tag-values "Public,Internal,Confidential,Restricted"

aws lakeformation create-lf-tag \
  --tag-key "Domain" \
  --tag-values "geopolitical,financial,operational"

# Assign tags to resources
aws lakeformation add-lf-tags-to-resource \
  --resource '{"Table":{"DatabaseName":"harbinger_prod","Name":"classified_events"}}' \
  --lf-tags '[{"TagKey":"Sensitivity","TagValues":["Restricted"]}]'

# Grant access via tags
aws lakeformation grant-permissions \
  --principal '{"DataLakePrincipalIdentifier":"arn:aws:iam::123456789:role/AnalystRole"}' \
  --resource '{"LFTagPolicy":{"ResourceType":"TABLE","Expression":[{"TagKey":"Sensitivity","TagValues":["Public","Internal"]}]}}' \
  --permissions SELECT

Layer 3: Encryption

Encryption at Rest

All major cloud object stores encrypt data at rest by default with platform-managed keys. For sensitive workloads, use Customer-Managed Keys (CMK):

# Terraform: S3 bucket with CMK encryption
resource "aws_s3_bucket_server_side_encryption_configuration" "lakehouse" {
  bucket = aws_s3_bucket.lakehouse.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = aws_kms_key.lakehouse.arn
    }
    bucket_key_enabled = true  # reduces KMS API calls by ~99%
  }
}

resource "aws_kms_key" "lakehouse" {
  description             = "Harbinger Lakehouse CMK"
  deletion_window_in_days = 30
  enable_key_rotation     = true

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "Enable IAM policies"
        Effect = "Allow"
        Principal = { AWS = "arn:aws:iam::${var.account_id}:root" }
        Action   = "kms:*"
        Resource = "*"
      },
      {
        Sid    = "Deny key deletion by non-admins"
        Effect = "Deny"
        Principal = { AWS = "*" }
        Action   = ["kms:ScheduleKeyDeletion", "kms:DeleteAlias"]
        Resource = "*"
        Condition = {
          StringNotLike = {
            "aws:PrincipalArn" = "arn:aws:iam::${var.account_id}:role/KMSAdmin"
          }
        }
      }
    ]
  })
}

Column-Level Encryption for Ultra-Sensitive Data

For data that must be encrypted even from privileged storage administrators, apply application-level encryption before writing to the lakehouse:

from cryptography.fernet import Fernet
import base64

# Key fetched from AWS Secrets Manager at runtime, never stored in code.
# Fernet expects a 32-byte key, urlsafe-base64-encoded.
def encrypt_column(value: str, key: bytes) -> str:
    f = Fernet(base64.urlsafe_b64encode(key))
    return f.encrypt(value.encode()).decode()

# In PySpark
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

encrypt_udf = udf(lambda v: encrypt_column(v, kms_key_bytes), StringType())

df_encrypted = (
    df.withColumn("ssn_encrypted", encrypt_udf("ssn"))
      .drop("ssn")
)
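Authorised readers need the inverse operation. A sketch of the decryption counterpart, assuming the same 32-byte key is retrieved from Secrets Manager by the reader (decrypt_column is a hypothetical helper name):

```python
from cryptography.fernet import Fernet
import base64

def decrypt_column(token: str, key: bytes) -> str:
    """Reverse of encrypt_column: decode a Fernet token back to plaintext."""
    f = Fernet(base64.urlsafe_b64encode(key))
    return f.decrypt(token.encode()).decode()
```

Because Fernet tokens are authenticated, a wrong key fails loudly rather than returning garbage, which is the behaviour you want for audit-sensitive fields.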

Layer 4: Network Security

Private Endpoints

Never expose your lakehouse over the public internet:

# Azure: Private endpoint for ADLS Gen2
resource "azurerm_private_endpoint" "adls" {
  name                = "harbinger-adls-pe"
  location            = var.location
  resource_group_name = var.resource_group
  subnet_id           = var.private_subnet_id

  private_service_connection {
    name                           = "adls-connection"
    private_connection_resource_id = azurerm_storage_account.lakehouse.id
    subresource_names              = ["dfs"]
    is_manual_connection           = false
  }
}

# Disable public access
resource "azurerm_storage_account_network_rules" "lakehouse" {
  storage_account_id = azurerm_storage_account.lakehouse.id
  default_action     = "Deny"
  bypass             = ["AzureServices"]
  ip_rules           = []
  virtual_network_subnet_ids = [var.private_subnet_id]
}

Layer 5: Audit Logging

Audit logging is non-negotiable for compliance frameworks (GDPR, HIPAA, SOC 2). You need a complete record of: what data was accessed, by which identity, from which IP, at what time.

Databricks Audit Logs to S3

# Enable audit log delivery via Databricks account API
curl -X POST "https://accounts.cloud.databricks.com/api/2.0/accounts/${ACCOUNT_ID}/log-delivery" \
  -H "Authorization: Bearer ${TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "log_delivery_configuration": {
      "log_type": "AUDIT_LOGS",
      "output_format": "JSON",
      "delivery_path_prefix": "audit-logs/databricks",
      "storage_configuration_id": "'${STORAGE_CONFIG_ID}'"
    }
  }'

Querying Audit Logs

Once ingested into your lakehouse, audit logs become queryable:

-- Find all SELECT operations on PII tables in the last 7 days
SELECT
    timestamp,
    userIdentity.email,
    requestParams.commandText,
    sourceIPAddress
FROM harbinger_audit.databricks.audit_events
WHERE timestamp > CURRENT_TIMESTAMP - INTERVAL 7 DAYS
    AND actionName = 'runCommand'
    AND requestParams.commandText ILIKE '%users.profiles%'
ORDER BY timestamp DESC;

-- Detect anomalous access: users querying at unusual hours
SELECT
    userIdentity.email,
    HOUR(timestamp) as hour_of_day,
    COUNT(*) as query_count
FROM harbinger_audit.databricks.audit_events
WHERE timestamp > CURRENT_TIMESTAMP - INTERVAL 30 DAYS
    AND actionName = 'runCommand'
GROUP BY 1, 2
HAVING HOUR(timestamp) NOT BETWEEN 7 AND 19
ORDER BY query_count DESC;
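The off-hours heuristic in the second query can also run downstream in Python, for example over exported audit rows feeding an alerting job. An illustrative helper, assuming events arrive as (email, timestamp) pairs:

```python
from collections import Counter
from datetime import datetime

def off_hours_counts(events, start=7, end=19):
    """Count per-user queries whose local hour falls outside [start, end]."""
    counts = Counter()
    for email, ts in events:
        if not (start <= ts.hour <= end):
            counts[email] += 1
    return counts
```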

Compliance Frameworks

GDPR

Requirement          Implementation
Right to erasure     Delta Lake DELETE + VACUUM; or use a pseudonymisation key table
Data minimisation    Column-level masking for non-essential access
Purpose limitation   Row-level filters by user role/purpose
Audit trail          Databricks audit logs + Delta change data feed
Data residency       Region-locked storage accounts + no cross-region replication
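The pseudonymisation-key-table approach to erasure (often called crypto-shredding) can be sketched in Python: values are pseudonymised under a per-user key, and deleting that key makes every stored pseudonym permanently unlinkable. The key_table dict here stands in for a real secured key store:

```python
import hashlib
import hmac
import secrets

# Stand-in for a KMS-backed per-user key table
key_table = {}

def pseudonymise(user_id: str, value: str) -> str:
    """Deterministically pseudonymise a value under the user's key."""
    key = key_table.setdefault(user_id, secrets.token_bytes(32))
    return hmac.new(key, value.encode(), hashlib.sha256).hexdigest()

def erase_user(user_id: str) -> None:
    """Crypto-shredding: destroying the key severs all links to stored pseudonyms."""
    key_table.pop(user_id, None)
```

Erasure then becomes a single key deletion rather than a rewrite of every table that ever referenced the user.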

HIPAA

For healthcare data on cloud lakehouses:

  • Encryption at rest with CMK: required
  • Encryption in transit (TLS 1.2+): required
  • Access controls with MFA: required
  • Audit logs retained for 6 years: required
  • Business Associate Agreement with cloud provider: required

Security Checklist

Use this as a pre-production gate:

  • All human access via federated IdP (no local users)
  • Service accounts use IAM roles / workload identity (no static keys)
  • Encryption at rest with CMK enabled
  • Private endpoints configured; public access blocked
  • Unity Catalog / Lake Formation governance layer active
  • Column-level security on PII fields
  • Row-level filters on multi-tenant tables
  • Audit logs flowing to immutable storage
  • Network egress controlled (no unrestricted outbound)
  • Vulnerability scanning on compute images
  • Secrets rotation policy enforced
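The checklist can double as an automated pre-production gate, fed from whatever compliance scans your CI pipeline runs (the control names below are illustrative):

```python
# Map these illustrative control names to your real compliance checks
REQUIRED_CONTROLS = [
    "federated_idp", "workload_identity", "cmk_encryption", "private_endpoints",
    "governance_layer", "column_security", "row_filters", "immutable_audit_logs",
    "egress_controls", "image_scanning", "secret_rotation",
]

def production_gate(status: dict) -> list:
    """Return failing controls; an empty list means the deployment may proceed."""
    return [c for c in REQUIRED_CONTROLS if not status.get(c, False)]
```

Failing the build on a non-empty result keeps the gate enforced rather than aspirational.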

Conclusion

Securing a cloud data lakehouse is a multi-layered challenge that spans identity, access control, encryption, network architecture, and audit. The good news is that modern platforms like Databricks Unity Catalog and AWS Lake Formation provide the primitives to implement fine-grained, policy-driven security without compromising analytical performance.

Platforms processing sensitive geopolitical or intelligence data — like Harbinger Explorer — apply these patterns across every layer of their data architecture to ensure that sensitive signals are accessible only to authorised consumers, with a complete audit trail of every access.


Try Harbinger Explorer free for 7 days — built on a secure, compliant cloud data lakehouse from day one.

