GDPR Compliance for Cloud Data Platforms: A Technical Deep Dive



Building cloud data platforms that are both powerful and GDPR-compliant is one of the most nuanced engineering challenges of our era. The regulation isn't just a legal checkbox — it fundamentally shapes how you architect data pipelines, choose cloud services, and manage the lifecycle of personal data. This guide walks through the technical realities of achieving GDPR compliance in modern cloud data stacks, complete with infrastructure-as-code examples and reference architectures.


Why GDPR Is an Engineering Problem

Most teams treat GDPR as a legal problem and hand it off to compliance teams. That's a mistake. At its core, GDPR is about data architecture:

  • Article 5 — Data minimisation, purpose limitation, storage limitation
  • Article 17 — Right to erasure ("right to be forgotten")
  • Article 20 — Data portability
  • Article 25 — Data protection by design and by default
  • Article 32 — Security of processing (encryption, pseudonymisation)
  • Article 35 — Data Protection Impact Assessments (DPIA)

Each of these has direct implications for how you design your ingestion layers, storage, access control, and APIs. Engineers own this.


Reference Architecture: GDPR-Compliant Cloud Data Platform

The reference architecture for a GDPR-compliant data platform on AWS or GCP separates a raw landing zone, a pseudonymisation service, an isolated PII Vault, and the analytics zone.

Key Architectural Tenets

  1. PII never enters the raw zone unclassified
  2. Pseudonymisation tokens are the only reference to PII in analytics
  3. The PII Vault is the single source of truth for personal data
  4. All access is logged immutably
  5. Erasure is automated and verifiable
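
To make tenets 2-4 concrete, here is a minimal Python sketch of the ingestion hand-off: the raw value is swapped for an HMAC token, the token-to-PII mapping lands in the vault, and the vault write is audited. All names here (SECRET_KEY, pseudonymise, the in-memory vault and audit_log) are illustrative stand-ins for KMS-backed keys, a locked-down bucket, and an append-only log sink.

```python
import hashlib
import hmac
import json
import time

SECRET_KEY = b"demo-key-rotate-me"  # stand-in: fetch from KMS/Secret Manager in production

vault = {}        # stand-in for the PII Vault (e.g. a locked-down bucket)
audit_log = []    # stand-in for an append-only audit sink

def pseudonymise(value: str) -> str:
    """Return a deterministic token; store the token->PII mapping in the vault."""
    token = hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()
    vault[token] = value  # the vault is the single source of truth for PII
    audit_log.append(json.dumps({"op": "vault_write", "token": token,
                                 "ts": time.time()}))
    return token

event = {"user_email": "alice@example.com", "event_type": "login"}
analytics_row = {"user_token": pseudonymise(event["user_email"]),
                 "event_type": event["event_type"]}
# analytics_row carries only the token; the raw email never leaves the vault
```

Analytics jobs downstream only ever see `user_token`; resolving it back to PII requires vault access, which the IAM section below restricts to dedicated service accounts.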

Terraform: Building the Compliance Infrastructure

Let's look at concrete Terraform for a GCP-based compliant data platform.

1. Encrypted Storage Buckets with Data Retention Policies

resource "google_storage_bucket" "raw_zone" {
  name          = "${var.project_id}-raw-zone"
  location      = "EU"
  storage_class = "STANDARD"

  # Enforce encryption at rest with CMEK
  encryption {
    default_kms_key_name = google_kms_crypto_key.data_key.id
  }

  # Enforce retention — storage limitation (Art. 5)
  # Note: GCS does not allow object versioning on a bucket that has a
  # retention policy, so rely on Cloud Audit Logs for the change trail.
  retention_policy {
    is_locked        = true
    retention_period = 7776000  # 90 days in seconds
  }

  # Prevent public access
  uniform_bucket_level_access = true

  lifecycle_rule {
    condition {
      age = 90
    }
    action {
      type = "Delete"
    }
  }
}

resource "google_kms_key_ring" "gdpr_ring" {
  name     = "gdpr-keyring"
  # Must be compatible with the bucket: an "EU" multi-region bucket needs
  # a key in the "europe" multi-region, not a single region like europe-west3.
  location = "europe"
}

resource "google_kms_crypto_key" "data_key" {
  name            = "gdpr-data-key"
  key_ring        = google_kms_key_ring.gdpr_ring.id
  rotation_period = "7776000s"  # 90-day rotation

  lifecycle {
    prevent_destroy = true
  }
}

2. IAM: Least-Privilege Access (Art. 25 — Privacy by Design)

# Data Engineer role — can read pseudonymised data only
resource "google_project_iam_custom_role" "data_engineer" {
  role_id     = "dataEngineerGDPR"
  title       = "Data Engineer (GDPR Compliant)"
  description = "Access to pseudonymised zones only — no PII Vault"
  permissions = [
    "bigquery.tables.getData",
    "bigquery.tables.list",
    "bigquery.jobs.create",
    "storage.objects.get",
    "storage.objects.list",
  ]
}

# PII Vault access — restricted to dedicated compliance service accounts
resource "google_storage_bucket_iam_binding" "pii_vault_access" {
  bucket = google_storage_bucket.pii_vault.name
  role   = "roles/storage.objectViewer"

  members = [
    "serviceAccount:${google_service_account.erasure_service.email}",
    "serviceAccount:${google_service_account.portability_service.email}",
  ]
}

# Deny all other access to PII Vault objects. There is no per-bucket deny
# resource; use a project-level IAM deny policy with an exception for the
# compliance service accounts.
resource "google_iam_deny_policy" "pii_vault_deny" {
  parent       = urlencode("cloudresourcemanager.googleapis.com/projects/${var.project_id}")
  name         = "pii-vault-deny"
  display_name = "Deny non-compliance access to PII Vault"

  rules {
    deny_rule {
      denied_principals  = ["principalSet://goog/public:all"]
      denied_permissions = ["storage.googleapis.com/objects.get"]
      exception_principals = [
        "principal://iam.googleapis.com/projects/-/serviceAccounts/${google_service_account.erasure_service.email}",
        "principal://iam.googleapis.com/projects/-/serviceAccounts/${google_service_account.portability_service.email}",
      ]
      denial_condition {
        title      = "PII Vault objects only"
        expression = "resource.name.startsWith('projects/_/buckets/${google_storage_bucket.pii_vault.name}')"
      }
    }
  }
}

3. VPC Service Controls — Data Exfiltration Prevention

resource "google_access_context_manager_service_perimeter" "gdpr_perimeter" {
  parent = "accessPolicies/${var.access_policy_id}"
  name   = "accessPolicies/${var.access_policy_id}/servicePerimeters/gdpr_perimeter"
  title  = "GDPR Data Perimeter"

  status {
    resources = [
      "projects/${var.project_number}",
    ]

    restricted_services = [
      "bigquery.googleapis.com",
      "storage.googleapis.com",
      "dataflow.googleapis.com",
    ]

    ingress_policies {
      ingress_from {
        # "SERVICE_ACCOUNT" is not a valid identity_type (only broad classes
        # like ANY_SERVICE_ACCOUNT are); list the identities explicitly.
        identities = ["serviceAccount:${var.pipeline_sa}"]
      }
      ingress_to {
        resources = ["*"]
        operations {
          service_name = "bigquery.googleapis.com"
          method_selectors {
            method = "BigQueryStorage.ReadRows"
          }
        }
      }
    }
  }
}

Kubernetes: Deploying the Pseudonymisation Service

The pseudonymisation service is the heart of your GDPR architecture. Here's the Kubernetes manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: pseudonymisation-service
  namespace: gdpr-compliance
  labels:
    app: pseudonymisation-service
    gdpr-component: "true"
spec:
  replicas: 3
  selector:
    matchLabels:
      app: pseudonymisation-service
  template:
    metadata:
      labels:
        app: pseudonymisation-service
      annotations:
        # Force pod restart on key rotation
        secret-hash: "${SHA256_OF_KEY}"
    spec:
      serviceAccountName: pseudonymisation-sa
      securityContext:
        runAsNonRoot: true
        runAsUser: 10001
        fsGroup: 10001
        seccompProfile:
          type: RuntimeDefault
      containers:
        - name: pseudonymisation
          image: gcr.io/${PROJECT_ID}/pseudonymisation-service:1.4.2
          ports:
            - containerPort: 8080
          env:
            - name: KMS_KEY_NAME
              valueFrom:
                secretKeyRef:
                  name: gdpr-secrets
                  key: kms-key-name
            - name: VAULT_BUCKET
              value: ${PII_VAULT_BUCKET}
            - name: AUDIT_LOG_TOPIC
              value: projects/${PROJECT_ID}/topics/gdpr-audit
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 20

---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: pseudonymisation-netpol
  namespace: gdpr-compliance
spec:
  podSelector:
    matchLabels:
      app: pseudonymisation-service
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: data-pipeline
      ports:
        - protocol: TCP
          port: 8080
  egress:
    - to: []  # any destination on 443; VPC SC limits which Google APIs are reachable
      ports:
        - protocol: TCP
          port: 443
    - to: []  # allow DNS, or resolution of the KMS/GCS endpoints fails
      ports:
        - protocol: UDP
          port: 53

Data Catalog: Classifying PII at Ingestion

A crucial part of GDPR compliance is knowing what data you have. Use a YAML-based data catalog that feeds your classification engine:

# data-catalog/schemas/user_events.yaml
schema:
  name: user_events
  version: "2.1"
  gdpr_classification: "personal_data"
  dpia_required: true
  retention_days: 90
  legal_basis: "legitimate_interest"
  
  fields:
    - name: event_id
      type: UUID
      pii: false
      
    - name: user_id
      type: string
      pii: true
      pii_category: "indirect_identifier"
      pseudonymisation: "token_replace"
      vault_key: "user_tokens"
      
    - name: email
      type: string
      pii: true
      pii_category: "contact_data"
      pseudonymisation: "hash_hmac_sha256"
      erasable: true
      
    - name: ip_address
      type: string
      pii: true
      pii_category: "online_identifier"
      pseudonymisation: "ip_masking"
      masking_strategy: "last_octet"
      
    - name: event_type
      type: string
      pii: false
      
    - name: timestamp
      type: timestamp
      pii: false
      retention_trigger: true
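
A classification engine consuming this catalog might apply the per-field strategies as in the following sketch. The helper names, the STRATEGIES registry, and HMAC_KEY are hypothetical, and only two of the catalog's strategies (hash_hmac_sha256 and ip_masking with last_octet) are shown.

```python
import hashlib
import hmac

HMAC_KEY = b"rotate-me"  # illustrative; use a KMS-managed key in production

def hash_hmac_sha256(value: str) -> str:
    """Keyed hash so the digest cannot be reversed by rainbow tables."""
    return hmac.new(HMAC_KEY, value.encode(), hashlib.sha256).hexdigest()

def ip_mask_last_octet(ip: str) -> str:
    """The catalog's "last_octet" masking strategy for IPv4 addresses."""
    parts = ip.split(".")
    parts[-1] = "0"
    return ".".join(parts)

STRATEGIES = {
    "hash_hmac_sha256": hash_hmac_sha256,
    "ip_masking": ip_mask_last_octet,
}

def apply_catalog(record: dict, fields: list) -> dict:
    """Pseudonymise PII fields per the catalog; pass others through."""
    out = {}
    for f in fields:
        value = record[f["name"]]
        if f.get("pii"):
            value = STRATEGIES[f["pseudonymisation"]](value)
        out[f["name"]] = value
    return out

fields = [
    {"name": "email", "pii": True, "pseudonymisation": "hash_hmac_sha256"},
    {"name": "ip_address", "pii": True, "pseudonymisation": "ip_masking"},
    {"name": "event_type", "pii": False},
]
row = apply_catalog({"email": "bob@example.com",
                     "ip_address": "203.0.113.42",
                     "event_type": "click"}, fields)
# row["ip_address"] is masked to "203.0.113.0"; row["email"] is an HMAC digest
```

In a real pipeline the `fields` list would be loaded from the YAML above and the function would run inside the pseudonymisation service, before anything is written to the analytics zone.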

Comparison: Cloud Provider GDPR Tooling

Feature | AWS | GCP | Azure
Data Residency | Region-specific S3, RDS | Regional GCS, BigQuery | Geo-restricted Azure Storage
CMEK Support | AWS KMS + SSE-KMS | Cloud KMS + CMEK | Azure Key Vault + CMK
Data Classification | Amazon Macie | Cloud DLP API | Azure Purview
Audit Logging | CloudTrail | Cloud Audit Logs | Azure Monitor + Activity Log
Data Erasure | Manual + Lambda | Cloud DLP deidentify | Azure Data Subject Requests
VPC Isolation | VPC + PrivateLink | VPC SC + Private Service Connect | VNet + Private Endpoints
PII Detection | Macie (S3 only) | DLP (text, images, structured) | Purview (broad but slower)
Compliance Reports | AWS Artifact | Compliance Reports Manager | Microsoft Service Trust Portal
SCCs / Org Policies | Service Control Policies | Organization Policies | Azure Policy
EU Data Boundary | ✅ AWS EU Boundary | ✅ GCP EU Boundary | ✅ Azure EU Boundary

Verdict: GCP's Cloud DLP API has the most mature automated PII detection. AWS Macie is S3-only but deeply integrated. Azure Purview is catching up but remains complex to configure.


Implementing the Right to Erasure (Art. 17)

The right to be forgotten is technically the hardest GDPR requirement in data platforms. Here's a practical approach:


Key Erasure Strategies

Strategy 1 — Token Invalidation (recommended for analytics)
Don't delete records in BigQuery. Instead, invalidate the pseudonymisation token. All analytics referencing that user_id now resolve to NULL. No table scans needed.
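
A toy version of token invalidation, with an in-memory dict standing in for the token vault (names are illustrative):

```python
# The vault maps tokens to PII; analytics tables hold only tokens.
vault = {"tok_9f2a": "alice@example.com"}
analytics_rows = [{"user_token": "tok_9f2a", "purchases": 3}]

def erase_user(token: str) -> None:
    """Erasure is a single vault delete — no rewrite of analytics tables."""
    vault.pop(token, None)

def resolve(row: dict):
    """Join a row back to PII; returns None once the token is invalidated."""
    return vault.get(row["user_token"])

before = resolve(analytics_rows[0])   # "alice@example.com"
erase_user("tok_9f2a")
after = resolve(analytics_rows[0])    # None — the row is now unlinkable
```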

Strategy 2 — Crypto Shredding
Encrypt data with a user-specific key stored separately. Deleting the key makes all data unreadable. Works well for object storage.
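
Crypto shredding can be sketched as follows. To keep the example dependency-free it uses a toy HMAC-keystream cipher; in production use AES-GCM from a vetted library (or KMS envelope encryption), not this construction.

```python
import hashlib
import hmac
import secrets

user_keys = {}  # per-user keys, stored separately from the data itself

def _keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy symmetric cipher: XOR with an HMAC-SHA256 keystream (demo only)."""
    out = bytearray()
    counter = 0
    while len(out) < len(data):
        out += hmac.new(key, counter.to_bytes(8, "big"), hashlib.sha256).digest()
        counter += 1
    return bytes(b ^ k for b, k in zip(data, out))

def encrypt_for_user(user_id: str, plaintext: bytes) -> bytes:
    key = user_keys.setdefault(user_id, secrets.token_bytes(32))
    return _keystream_xor(key, plaintext)

def shred(user_id: str) -> None:
    """Delete only the key; the ciphertext becomes permanently unreadable."""
    del user_keys[user_id]

blob = encrypt_for_user("u123", b"date_of_birth=1990-01-01")
assert _keystream_xor(user_keys["u123"], blob) == b"date_of_birth=1990-01-01"
shred("u123")
# blob still exists in storage, but no key remains that can decrypt it
```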

Strategy 3 — Tombstoning
Mark records as deleted in a deletion log table. Filter every query through this log. Simple but adds query overhead.
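
And tombstoning, as a filter over a deletion log (a set stands in for the log table):

```python
deletion_log = set()  # stand-in for a deletion log table
events = [{"user_id": "u1", "action": "view"},
          {"user_id": "u2", "action": "buy"}]

def erase(user_id: str) -> None:
    deletion_log.add(user_id)  # O(1) write; no data files are rewritten

def query(rows):
    """Every read path must filter through the log — the overhead lives here."""
    return [r for r in rows if r["user_id"] not in deletion_log]

erase("u1")
visible = query(events)  # only u2's event survives the filter
```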


Data Processing Agreements and Cross-Border Transfers

Standard Contractual Clauses (SCCs) for Cloud Processors

When personal data leaves the EU/EEA — even to a US-based cloud service — you need a valid transfer mechanism: typically SCCs, or the EU-US Data Privacy Framework for certified US recipients. Map your data flows:

Data Flow | Transfer Mechanism | Risk Level
EU → AWS EU (Ireland/Frankfurt) | Within EEA — no SCC needed | 🟢 Low
EU → AWS US | SCC Module 2 (Controller → Processor) | 🟡 Medium
EU → Subprocessors (e.g., Datadog) | SCC in DPA + Article 28 clauses | 🟡 Medium
EU → China/Russia | No adequacy decision — SCCs plus supplementary measures required, often impractical | 🔴 High
EU → Canada | Adequacy decision in place (commercial organisations) | 🟢 Low

Monitoring and Alerting for GDPR Incidents

Under GDPR Article 33, you have 72 hours to notify the supervisory authority of a personal data breach. Your monitoring must be fast:

# alerting/gdpr-breach-detection.yaml
alerts:
  - name: Unauthorized PII Access
    condition: >
      SELECT COUNT(*) FROM audit_logs
      WHERE resource = 'pii_vault'
      AND principal NOT IN (SELECT sa FROM allowed_principals)
      AND timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 5 MINUTE)
    threshold: 1
    severity: CRITICAL
    notification:
      - channel: pagerduty
        policy: gdpr-incident
      - channel: email
        recipients:
          - dpo@company.com
          - cto@company.com
    sla_hours: 72  # GDPR breach notification window

  - name: Bulk Data Export Anomaly
    condition: >
      bytes_exported > 10GB AND timeframe = '1h'
      AND NOT in_approved_jobs
    severity: HIGH
    
  - name: Retention Policy Violation
    condition: >
      data_age_days > retention_policy_days
      AND data_classification IN ('personal_data', 'sensitive_data')
    severity: MEDIUM
    auto_remediate: true
    remediation: trigger_erasure_workflow
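
The first rule above can be prototyped in a few lines. The record shape, the ALLOWED set, and the evaluate function are hypothetical; a real implementation would run this query against the audit log sink.

```python
from datetime import datetime, timedelta, timezone

ALLOWED = {"erasure-sa@project.iam.gserviceaccount.com"}  # illustrative allow-list

def evaluate(audit_logs, now):
    """Flag pii_vault accesses in the last 5 minutes by unapproved principals."""
    window = now - timedelta(minutes=5)
    return [e for e in audit_logs
            if e["resource"] == "pii_vault"
            and e["principal"] not in ALLOWED
            and e["ts"] > window]

now = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
logs = [{"resource": "pii_vault",
         "principal": "attacker@example.com",
         "ts": now - timedelta(minutes=2)}]
hits = evaluate(logs, now)
if hits:
    notify_by = now + timedelta(hours=72)  # Art. 33 notification deadline
```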

GDPR Compliance Checklist for Cloud Data Engineers

Control | Implementation | Status Check
Data Inventory | Automated catalog with PII tagging | Scan new tables on ingestion
Consent Management | Consent flags in user profiles | Block processing if no consent
Data Minimisation | Schema-level field necessity review | Quarterly schema audits
Pseudonymisation | Token vault with HMAC-SHA256 | Pen test token reversibility
Encryption at Rest | CMEK with 90-day rotation | Key rotation alerts
Encryption in Transit | TLS 1.3 enforced at load balancer | Weekly TLS scan
Access Control | RBAC with principle of least privilege | Quarterly access reviews
Audit Logging | Immutable logs, 3-year retention | Daily log integrity checks
Right to Erasure | Automated within 30 days | Monthly erasure SLA report
Data Portability | Machine-readable export API | Quarterly API testing
DPIA Documentation | For high-risk processing activities | Before new data types
Breach Detection | < 24h detection, < 72h notification | Biannual incident drill
DPA with Processors | Signed SCCs with all vendors | Annual DPA audit

Conclusion

GDPR compliance in cloud data platforms is achievable — but only if you treat it as an engineering discipline, not a legal afterthought. The key principles are:

  1. Build PII isolation into your architecture from day one — retrofitting is 10x harder
  2. Automate everything — manual compliance processes fail under scale
  3. Pseudonymise, don't anonymise — true anonymisation is nearly impossible; pseudonymisation is tractable
  4. Make erasure cheap — crypto shredding and token invalidation are your friends
  5. Log everything immutably — when the regulator asks, you need receipts

The architectures and code in this guide are battle-tested patterns for teams building on AWS, GCP, or Azure. Start with the data catalog and PII classifier — everything else follows from knowing what data you have.


Ready to build a GDPR-compliant geopolitical intelligence platform?

Harbinger Explorer processes global event data from hundreds of sources with privacy-by-design architecture built in.

Try Harbinger Explorer free for 7 days →

