Containerized Data Pipelines: Docker and Kubernetes for Platform Engineers
Container-native data pipelines have moved from experimental to standard practice. The combination of Docker for reproducible environments and Kubernetes for elastic scheduling has transformed how platform teams build, deploy, and operate data workloads. This guide goes deep on patterns that actually work in production.
Why Containers for Data Pipelines?
The case for containerisation has never been stronger:
| Problem (Pre-Container) | Solution (Container-Native) |
|---|---|
| "Works on my machine" dependency hell | Immutable images with pinned dependencies |
| Shared-cluster resource contention | Namespace isolation with resource quotas |
| Scaling Spark requires cluster resize | Spark Operator auto-provisions workers per job |
| Deploying new Python versions breaks everything | Each pipeline carries its own runtime |
| No audit trail for pipeline environments | Image digest = reproducible environment |
| Multi-team monolithic Airflow deployment | Isolated DAG environments per team |
The Container-Native Data Platform Stack
Dockerfile Best Practices for Data Pipelines
Multi-Stage Build for PySpark
```dockerfile
# Stage 1: Build dependencies
FROM python:3.11-slim AS builder
WORKDIR /build

# Install build tools needed to compile native wheels (e.g. psycopg2)
RUN apt-get update && apt-get install -y --no-install-recommends gcc libpq-dev \
    && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir --user -r requirements.txt

# Stage 2: Runtime image
FROM openjdk:17-slim AS runtime

# Install the Python runtime
RUN apt-get update && apt-get install -y --no-install-recommends python3.11 python3-pip \
    && rm -rf /var/lib/apt/lists/*

# Install PySpark (bundles the Spark distribution)
ENV SPARK_VERSION=3.5.1
RUN pip3 install --no-cache-dir pyspark==${SPARK_VERSION}

# Copy packages installed in the builder stage into the runtime user's home,
# not /root, so they remain readable after dropping privileges
COPY --from=builder /root/.local /home/spark/.local

# Copy pipeline code
WORKDIR /app
COPY src/ ./src/
COPY entrypoint.sh .
RUN chmod +x entrypoint.sh

# Non-root user for security
RUN useradd -m -u 1000 spark && chown -R spark:spark /home/spark
USER spark
ENV PATH=/home/spark/.local/bin:$PATH

ENTRYPOINT ["./entrypoint.sh"]
```
Key principles:
- Multi-stage builds: keep runtime images small
- Non-root user: reduces attack surface
- Pin all versions: `requirements.txt` with exact hashes
- No secrets in the image: use K8s Secrets or an external secret store
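Keeping the build context lean also matters for image size and cache hits. A typical `.dockerignore` starting point for a pipeline repo (adjust to your layout; the entries below are illustrative) keeps local state and secrets out of the image:

```
.git
__pycache__/
*.pyc
.venv/
tests/
*.env
.DS_Store
```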
Image Tagging Strategy
Never use `:latest` in production. Use immutable tags:

```bash
# Tag with git SHA
IMAGE_TAG=$(git rev-parse --short HEAD)
docker build -t ecr.amazonaws.com/harbinger/pipeline:${IMAGE_TAG} .
docker push ecr.amazonaws.com/harbinger/pipeline:${IMAGE_TAG}

# Also tag release images with a semantic version, and push that tag too
docker tag ecr.amazonaws.com/harbinger/pipeline:${IMAGE_TAG} ecr.amazonaws.com/harbinger/pipeline:v1.4.2
docker push ecr.amazonaws.com/harbinger/pipeline:v1.4.2
```
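The `:latest` ban is easy to enforce in CI with a small guard. The helper below is a hypothetical sketch that accepts only digest references, git-SHA tags, or semantic-version tags:

```python
import re

# Immutable tag patterns: a short/full git SHA, or a semantic version like v1.4.2
_IMMUTABLE = re.compile(r"^(?:[0-9a-f]{7,40}|v\d+\.\d+\.\d+)$")

def is_immutable_tag(image_ref: str) -> bool:
    """Return True if the image reference uses a digest or an immutable tag."""
    if "@sha256:" in image_ref:  # digest pinning is always immutable
        return True
    _, _, tag = image_ref.rpartition(":")
    return bool(_IMMUTABLE.match(tag))

print(is_immutable_tag("ecr.amazonaws.com/harbinger/pipeline:v1.4.2"))  # True
print(is_immutable_tag("ecr.amazonaws.com/harbinger/pipeline:latest"))  # False
```

Run it as a pre-push check in CI and fail the build on any mutable reference.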
Airflow on Kubernetes: KubernetesExecutor
The KubernetesExecutor spawns a fresh pod for every task. This is the production-grade approach for teams that want:
- Task isolation: one failing task cannot affect another
- Resource customisation: CPU/memory per task type
- Horizontal scalability: no fixed worker pool
Helm Deployment
```bash
helm repo add apache-airflow https://airflow.apache.org
helm repo update
helm install airflow apache-airflow/airflow \
  --namespace airflow --create-namespace \
  --set executor=KubernetesExecutor \
  --set images.airflow.repository=ecr.amazonaws.com/harbinger/airflow \
  --set images.airflow.tag=2.9.1 \
  --set config.core.dags_folder=/opt/airflow/dags \
  --values airflow-values.yaml
```
`airflow-values.yaml`:

```yaml
executor: KubernetesExecutor

workers:
  resources:
    requests:
      memory: "1Gi"
      cpu: "500m"
    limits:
      memory: "4Gi"
      cpu: "2000m"

scheduler:
  resources:
    requests:
      memory: "512Mi"
      cpu: "250m"

webserver:
  replicas: 2

logs:
  persistence:
    enabled: false  # Use remote logging instead
  remote:
    remoteLogging: true
    remoteBaseLogFolder: s3://harbinger-airflow-logs/

config:
  kubernetes:
    namespace: airflow
    worker_container_repository: ecr.amazonaws.com/harbinger/airflow
    worker_container_tag: "2.9.1"
    delete_worker_pods: true
    delete_worker_pods_on_failure: false  # Keep failed pods for debugging

extraEnvFrom:
  - secretRef:
      name: airflow-connections
```
Task-Level Resource Overrides
```python
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)
from kubernetes.client import models as k8s

with DAG(
    "harbinger_geopolitical_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",
    catchup=False,
) as dag:
    # Heavy Spark job — request more resources
    spark_transform = KubernetesPodOperator(
        task_id="spark_transform",
        name="spark-transform",
        namespace="airflow",
        image="ecr.amazonaws.com/harbinger/spark-pipeline:v1.4.2",
        image_pull_policy="Always",
        env_vars={
            "INPUT_PATH": "s3://harbinger-raw/events/",
            "OUTPUT_PATH": "s3://harbinger-processed/events/",
        },
        container_resources=k8s.V1ResourceRequirements(
            requests={"memory": "4Gi", "cpu": "2"},
            limits={"memory": "8Gi", "cpu": "4"},
        ),
        node_selector={"node-type": "compute-optimised"},
        tolerations=[
            k8s.V1Toleration(key="dedicated", value="spark", effect="NoSchedule")
        ],
        is_delete_operator_pod=True,
        get_logs=True,
    )

    # Light API call — minimal resources
    api_ingest = KubernetesPodOperator(
        task_id="api_ingest",
        name="api-ingest",
        namespace="airflow",
        image="ecr.amazonaws.com/harbinger/ingest:v1.4.2",
        container_resources=k8s.V1ResourceRequirements(
            requests={"memory": "256Mi", "cpu": "100m"},
            limits={"memory": "512Mi", "cpu": "500m"},
        ),
        is_delete_operator_pod=True,
    )

    api_ingest >> spark_transform
```
Spark Operator
The Kubernetes Operator for Apache Spark manages the full lifecycle of Spark applications as Kubernetes-native custom resources.
Install via Helm
```bash
helm repo add spark-operator https://kubeflow.github.io/spark-operator
helm install spark-operator spark-operator/spark-operator \
  --namespace spark-operator --create-namespace \
  --set webhook.enable=true \
  --set metrics.enable=true \
  --set metrics.port=10254
```
SparkApplication CRD
```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: harbinger-event-enrichment
  namespace: spark-jobs
spec:
  type: Python
  pythonVersion: "3"
  mode: cluster
  image: ecr.amazonaws.com/harbinger/spark-pipeline:v1.4.2
  imagePullPolicy: Always
  mainApplicationFile: local:///app/src/enrich_events.py
  sparkVersion: "3.5.1"
  restartPolicy:
    type: OnFailure
    onFailureRetries: 3
    onFailureRetryInterval: 30
  volumes:
    - name: spark-local-dir
      emptyDir: {}
  driver:
    cores: 2
    memory: "2g"
    serviceAccount: spark
    labels:
      app: harbinger-spark
    annotations:
      iam.amazonaws.com/role: arn:aws:iam::123456789:role/SparkPipelineRole
    volumeMounts:
      - name: spark-local-dir
        mountPath: /tmp/spark-local
  executor:
    cores: 4
    instances: 10
    memory: "8g"
    memoryOverhead: "2g"
    labels:
      app: harbinger-spark
    tolerations:
      - key: dedicated
        value: spark
        effect: NoSchedule
  sparkConf:
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension"
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog"
    "spark.hadoop.fs.s3a.aws.credentials.provider": "com.amazonaws.auth.WebIdentityTokenCredentialsProvider"
    "spark.dynamicAllocation.enabled": "true"
    "spark.dynamicAllocation.minExecutors": "2"
    "spark.dynamicAllocation.maxExecutors": "50"
    "spark.dynamicAllocation.shuffleTracking.enabled": "true"
```
Resource Management and Autoscaling
Namespace Resource Quotas
Prevent any single team from monopolising cluster resources:
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: spark-jobs-quota
  namespace: spark-jobs
spec:
  hard:
    requests.cpu: "100"
    requests.memory: 400Gi
    limits.cpu: "200"
    limits.memory: 800Gi
    count/pods: "500"
    count/sparkapplications.sparkoperator.k8s.io: "20"
```
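Quota sizing is easier to reason about once the Kubernetes quantity strings are converted to plain numbers. Below is a minimal parser covering only the suffixes used in this article (a sketch, not the full quantity grammar), used to sanity-check the executor fleet against the memory request quota:

```python
# Suffix multipliers for the subset of Kubernetes quantities used here
_SUFFIXES = {"m": 1e-3, "Ki": 2**10, "Mi": 2**20, "Gi": 2**30}

def parse_quantity(q: str) -> float:
    """Convert a Kubernetes quantity like '500m' or '4Gi' to a float."""
    for suffix, factor in _SUFFIXES.items():
        if q.endswith(suffix):
            return float(q[: -len(suffix)]) * factor
    return float(q)  # bare number, e.g. cpu: "2"

# 10 executors at 8Gi + 2Gi overhead each, against the 400Gi request quota
per_executor = parse_quantity("8Gi") + parse_quantity("2Gi")
print(per_executor * 10 <= parse_quantity("400Gi"))  # True: 100Gi fits
```

The same arithmetic tells you how far dynamic allocation can scale before hitting the quota: at 10Gi per executor, the 400Gi request quota caps the job at 40 executors, below the configured `maxExecutors` of 50.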
Cluster Autoscaler
For cloud-managed clusters (EKS, GKE, AKS), configure node auto-provisioning:
```hcl
# EKS managed node group for Spark workers
resource "aws_eks_node_group" "spark" {
  cluster_name    = aws_eks_cluster.main.name
  node_group_name = "spark-workers"
  node_role_arn   = aws_iam_role.node.arn
  subnet_ids      = var.private_subnet_ids

  scaling_config {
    desired_size = 2
    min_size     = 0
    max_size     = 50
  }

  instance_types = ["r5.4xlarge"] # Memory-optimised for Spark

  taint {
    key    = "dedicated"
    value  = "spark"
    effect = "NO_SCHEDULE"
  }

  labels = {
    "node-type" = "compute-optimised"
  }
}
```
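One caveat with `min_size = 0`: to scale the group up from zero, the Cluster Autoscaler must discover the group and know its taints and labels before any node exists. With the standard Kubernetes Cluster Autoscaler on EKS, that is communicated through node-group tags roughly like the following (a sketch; verify the exact tag keys against the autoscaler version you run):

```hcl
# Added inside the aws_eks_node_group "spark" resource above
tags = {
  "k8s.io/cluster-autoscaler/enabled"                       = "true"
  "k8s.io/cluster-autoscaler/${aws_eks_cluster.main.name}"  = "owned"
  # node-template tags let the autoscaler simulate scheduling from zero
  "k8s.io/cluster-autoscaler/node-template/taint/dedicated" = "spark:NoSchedule"
  "k8s.io/cluster-autoscaler/node-template/label/node-type" = "compute-optimised"
}
```

Without the node-template tags, pods that tolerate the `dedicated=spark` taint can sit Pending indefinitely because the autoscaler cannot prove a new node would accept them.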
CI/CD Pipeline for Data Pipelines
```yaml
# .github/workflows/pipeline-deploy.yml
name: Build and Deploy Pipeline

on:
  push:
    branches: [main]
    paths:
      - 'pipelines/**'
      - 'Dockerfile'

env:
  ECR_REGISTRY: ecr.amazonaws.com

jobs:
  build:
    runs-on: ubuntu-latest
    outputs:
      image-tag: ${{ github.sha }}
    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789:role/GitHubActionsRole
          aws-region: eu-west-1

      - name: Login to ECR
        uses: aws-actions/amazon-ecr-login@v2

      - name: Build and push
        uses: docker/build-push-action@v5
        with:
          push: true
          tags: ${{ env.ECR_REGISTRY }}/harbinger/pipeline:${{ github.sha }}

      - name: Run security scan
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: ${{ env.ECR_REGISTRY }}/harbinger/pipeline:${{ github.sha }}
          severity: CRITICAL,HIGH
          exit-code: 1

  deploy:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - name: Update Spark application image
        run: |
          kubectl patch sparkapplication harbinger-event-enrichment -n spark-jobs \
            --type merge \
            -p '{"spec": {"image": "ecr.amazonaws.com/harbinger/pipeline:${{ needs.build.outputs.image-tag }}"}}'
```
Observability
Key Metrics for Data Pipeline Pods
```yaml
# ServiceMonitor for Prometheus
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: spark-metrics
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: harbinger-spark
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics
  namespaceSelector:
    matchNames:
      - spark-jobs
```
Grafana dashboard panels to build:
- Pipeline throughput: records processed per minute
- Pod restart count: detect crash loops
- CPU throttling ratio: signals under-provisioned limits
- Pending pod duration: K8s scheduling latency
- Spark executor utilisation: detect over/under-scaling
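Of these, the throttling ratio is the one most often miscomputed. cAdvisor exposes cumulative counters (`container_cpu_cfs_throttled_periods_total`, `container_cpu_cfs_periods_total`), so the ratio must be taken over deltas between two scrapes, never over raw counter values. A small illustrative helper:

```python
def throttling_ratio(
    throttled_prev: int, throttled_now: int,
    periods_prev: int, periods_now: int,
) -> float:
    """Fraction of CFS periods in which the container was throttled,
    computed over the delta between two counter samples."""
    period_delta = periods_now - periods_prev
    if period_delta <= 0:  # counter reset or no elapsed periods
        return 0.0
    return (throttled_now - throttled_prev) / period_delta

# 120 of 600 periods throttled between scrapes -> 20% throttled
print(throttling_ratio(1000, 1120, 5000, 5600))  # 0.2
```

In PromQL this is the familiar ratio of two `rate()` expressions; anything sustained above a few percent usually means the pod's CPU limit is set too low.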
Conclusion
Container-native data pipelines with Kubernetes eliminate entire classes of operational problems — dependency conflicts, unpredictable resource usage, and environment drift. The initial investment in Dockerfiles, Helm charts, and Spark Operator configuration pays back quickly through faster deployments, better resource utilisation, and dramatically simpler debugging.
Platforms like Harbinger Explorer run their entire data ingestion and enrichment pipeline on Kubernetes, enabling them to scale from zero to hundreds of parallel Spark executors in response to geopolitical event surges — and scale back to near zero overnight.
Try Harbinger Explorer free for 7 days — powered by a Kubernetes-native data platform that scales with the world's events.