Running Data Workloads on Kubernetes: Patterns and Pitfalls
Kubernetes was designed for stateless, containerized applications. Data workloads are often stateful, resource-hungry, and sensitive to scheduling jitter. Yet teams are increasingly consolidating data infrastructure onto K8s — driven by the promise of unified scheduling, better resource utilization, and CI/CD-native workflows.
This guide covers what actually works in production: running Spark on Kubernetes, operating Kafka via operators, scheduling data pipelines with Argo Workflows, and avoiding the most common failure patterns.
Why Run Data Workloads on K8s?
The case for K8s-native data infrastructure isn't about hype — it's about operational leverage:
| Traditional Approach | K8s-Native Approach |
|---|---|
| Per-cluster Spark YARN | Spark-on-K8s with namespace-level isolation |
| Kafka VMs with manual broker scaling | Strimzi operator with GitOps-managed config |
| Airflow on dedicated VMs | Airflow on K8s Executor or Argo Workflows |
| Separate infra per team | Multi-tenant namespaces with ResourceQuotas |
| Manual node provisioning | Karpenter / Cluster Autoscaler with node pools |
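The "multi-tenant namespaces with ResourceQuotas" row usually means a per-team quota object capping aggregate resource requests. A minimal sketch, with illustrative names and limits:

```yaml
# ResourceQuota capping one team's namespace (values are illustrative)
apiVersion: v1
kind: ResourceQuota
metadata:
  name: data-platform-quota
  namespace: data-platform
spec:
  hard:
    requests.cpu: "200"            # total CPU requested across all pods
    requests.memory: 800Gi         # total memory requested
    limits.cpu: "400"
    limits.memory: 800Gi
    persistentvolumeclaims: "50"   # cap PVC count so runaway jobs can't exhaust storage
```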
The tradeoff: stateful workloads need persistent storage, controlled eviction, and careful network configuration that stateless apps don't require. Get these right and K8s becomes a genuine force multiplier.
Spark on Kubernetes
Architecture
In cluster deploy mode, spark-submit talks to the Kubernetes API server, which schedules a driver pod; the driver then requests executor pods directly from the API server and tears them down when the job completes.
Submitting Jobs
# spark-submit against a K8s cluster
spark-submit \
  --master k8s://https://k8s-api.internal:6443 \
  --deploy-mode cluster \
  --name etl-orders-daily \
  --conf spark.kubernetes.namespace=data-platform \
  --conf spark.kubernetes.container.image=company-registry/spark:3.5.1-python3.11 \
  --conf spark.kubernetes.serviceAccountName=spark-executor \
  --conf spark.executor.instances=10 \
  --conf spark.executor.cores=4 \
  --conf spark.executor.memory=8g \
  --conf spark.driver.memory=4g \
  --conf spark.kubernetes.driver.request.cores=2 \
  --conf spark.kubernetes.executor.request.cores=3 \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=s3a://data-platform-logs/spark-events/ \
  --conf spark.kubernetes.executor.annotation.cluster-autoscaler.kubernetes.io/safe-to-evict=false \
  local:///opt/spark/work-dir/jobs/orders_daily.py
The safe-to-evict=false annotation is critical — without it, Cluster Autoscaler will evict executor pods mid-job when scaling down, causing cascading task failures.
Node Pools for Data Workloads
Don't mix Spark executors with API servers on the same node pool. Memory-intensive Spark jobs cause noisy-neighbor problems that create latency spikes in unrelated services.
# Terraform: dedicated node pool for Spark executors (GKE example)
resource "google_container_node_pool" "spark_executor_pool" {
  name     = "spark-executor-pool"
  cluster  = google_container_cluster.main.name
  location = var.region

  autoscaling {
    min_node_count  = 0
    max_node_count  = 50
    location_policy = "BALANCED"
  }

  node_config {
    machine_type = "n2-highmem-16" # 16 vCPU, 128 GB RAM
    disk_size_gb = 200
    disk_type    = "pd-ssd"

    taint {
      key    = "workload"
      value  = "spark-executor"
      effect = "NO_SCHEDULE"
    }

    labels = {
      workload = "spark-executor"
      team     = "data-platform"
    }

    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform",
    ]
  }
}
Match with a node selector and a toleration on the Spark side. The node selector is a plain Spark conf:
--conf spark.kubernetes.executor.node.selector.workload=spark-executor
There is no Spark conf for tolerations; supply them through an executor pod template instead:
--conf spark.kubernetes.executor.podTemplateFile=/path/to/executor-template.yaml
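Spark's `spark.kubernetes.executor.podTemplateFile` conf merges a pod template into every executor pod. A minimal template carrying the toleration for the tainted node pool might look like this (file name and paths are illustrative):

```yaml
# executor-template.yaml — merged into executor pods by spark-submit;
# fields Spark manages (image, resources) are overridden by Spark conf
apiVersion: v1
kind: Pod
spec:
  tolerations:
    - key: workload
      operator: Equal
      value: spark-executor
      effect: NoSchedule
```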
Kafka on Kubernetes with Strimzi
The Strimzi operator is the production-grade way to run Kafka on K8s. It handles broker lifecycle, rolling upgrades, TLS certificate rotation, and Cruise Control integration for partition rebalancing.
Cluster Definition
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: data-platform-kafka
  namespace: kafka
spec:
  kafka:
    version: 3.7.0
    replicas: 3
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
      - name: tls
        port: 9093
        type: internal
        tls: true
        authentication:
          type: tls
      - name: external
        port: 9094
        type: loadbalancer
        tls: true
    config:
      offsets.topic.replication.factor: 3
      transaction.state.log.replication.factor: 3
      transaction.state.log.min.isr: 2
      default.replication.factor: 3
      min.insync.replicas: 2
      inter.broker.protocol.version: "3.7"
      log.retention.hours: 168
      log.segment.bytes: 1073741824
      log.retention.check.interval.ms: 300000
    storage:
      type: jbod
      volumes:
        - id: 0
          type: persistent-claim
          size: 500Gi
          class: premium-ssd
          deleteClaim: false
    resources:
      requests:
        memory: 16Gi
        cpu: "4"
      limits:
        memory: 16Gi
        cpu: "8"
    rack:
      topologyKey: topology.kubernetes.io/zone
    template:
      pod:
        affinity:
          podAntiAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchExpressions:
                    - key: strimzi.io/name
                      operator: In
                      values:
                        - data-platform-kafka-kafka
                topologyKey: kubernetes.io/hostname
  zookeeper:
    replicas: 3
    storage:
      type: persistent-claim
      size: 50Gi
      class: premium-ssd
      deleteClaim: false
  entityOperator:
    topicOperator: {}
    userOperator: {}
  cruiseControl: {}
The rack configuration maps to AZ topology — Strimzi will distribute replicas across zones automatically, which is essential for HA.
Topic Management via GitOps
# KafkaTopic CR — manage via Git, not kafka-topics.sh
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: orders-events-v2
  namespace: kafka
  labels:
    strimzi.io/cluster: data-platform-kafka
spec:
  partitions: 24
  replicas: 3
  config:
    retention.ms: "604800000" # 7 days
    cleanup.policy: delete
    compression.type: lz4
    min.insync.replicas: "2"
    message.timestamp.type: LogAppendTime
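The TLS listener above uses `authentication: type: tls`, which means clients authenticate with client certificates. Strimzi's User Operator issues those certificates from KafkaUser CRs, which fit the same GitOps flow as topics. A sketch, with user and group names assumed:

```yaml
# KafkaUser CR — Strimzi generates a client cert into a Secret of the same name
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaUser
metadata:
  name: orders-consumer
  namespace: kafka
  labels:
    strimzi.io/cluster: data-platform-kafka
spec:
  authentication:
    type: tls
  authorization:
    type: simple
    acls:
      - resource:
          type: topic
          name: orders-events-v2
        operations: [Describe, Read]
      - resource:
          type: group
          name: orders-consumer-group   # assumed consumer group id
        operations: [Read]
```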
Pipeline Orchestration with Argo Workflows
For data pipelines that need DAG semantics without a separate Airflow cluster, Argo Workflows is a compelling K8s-native option.
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: daily-etl-pipeline
  namespace: data-platform
spec:
  entrypoint: etl-dag
  serviceAccountName: argo-workflow-sa
  parallelism: 5
  arguments:
    parameters:
      - name: date # referenced below as {{workflow.parameters.date}}
  templates:
    - name: etl-dag
      dag:
        tasks:
          - name: extract-orders
            template: spark-job
            arguments:
              parameters:
                - name: job-class
                  value: "com.company.etl.ExtractOrders"
                - name: date
                  value: "{{workflow.parameters.date}}"
          - name: extract-customers
            template: spark-job
            arguments:
              parameters:
                - name: job-class
                  value: "com.company.etl.ExtractCustomers"
                - name: date
                  value: "{{workflow.parameters.date}}"
          - name: transform-and-load
            dependencies: [extract-orders, extract-customers]
            template: spark-job
            arguments:
              parameters:
                - name: job-class
                  value: "com.company.etl.TransformAndLoad"
                - name: date
                  value: "{{workflow.parameters.date}}"
    - name: spark-job
      inputs:
        parameters:
          - name: job-class
          - name: date
      resource:
        action: create
        successCondition: status.applicationState.state == COMPLETED
        failureCondition: status.applicationState.state == FAILED
        manifest: |
          apiVersion: sparkoperator.k8s.io/v1beta2
          kind: SparkApplication
          metadata:
            generateName: etl-job-
            namespace: data-platform
          spec:
            type: Scala
            mode: cluster
            image: company-registry/spark-etl:latest
            mainClass: "{{inputs.parameters.job-class}}"
            arguments:
              - "--date={{inputs.parameters.date}}"
            driver:
              cores: 2
              memory: "4g"
              serviceAccount: spark-executor
            executor:
              cores: 4
              instances: 8
              memory: "8g"
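To run the template on a daily schedule, a CronWorkflow can reference it and inject the date parameter. A sketch with assumed names; in practice the date value is usually derived from the scheduled run time rather than hardcoded:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: daily-etl
  namespace: data-platform
spec:
  schedule: "0 3 * * *"   # 03:00 daily
  timezone: "UTC"
  workflowSpec:
    workflowTemplateRef:
      name: daily-etl-pipeline
    arguments:
      parameters:
        - name: date
          value: "2024-01-01" # placeholder; typically templated from the scheduled time
```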
Common Failure Patterns and Fixes
| Failure Mode | Root Cause | Fix |
|---|---|---|
| Executor OOM kills | Memory limits too low or unbounded broadcast joins | Set spark.sql.autoBroadcastJoinThreshold=-1, tune executor memory |
| Driver pod evicted | Driver on node with resource pressure | Add PriorityClass: high-priority to driver pods |
| Shuffle data lost | Executor evicted mid-job | Use remote shuffle service (e.g., Magnet, Uniffle) |
| Kafka lag accumulates | Consumer pod CPU throttled | Move consumers to node pool with no CPU limits |
| PVC provisioning delay | StorageClass binding mode | Use WaitForFirstConsumer binding mode |
| Slow pod startup | Large container images | Optimize to <2GB; use image pull pre-warming |
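Two of the fixes above are small cluster-scoped objects. Sketches, with illustrative names and a GKE provisioner as an assumption:

```yaml
# PriorityClass so driver pods survive node pressure that evicts lower-priority pods
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: spark-driver-priority
value: 1000000
globalDefault: false
description: "Keep Spark driver pods schedulable under resource pressure"
---
# StorageClass with delayed binding so PVCs are provisioned in the pod's zone
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: premium-ssd
provisioner: pd.csi.storage.gke.io   # GKE CSI driver; swap for your cloud's provisioner
parameters:
  type: pd-ssd
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
```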
Observability for K8s Data Workloads
Instrument at three layers:
- Infrastructure layer: node CPU/memory/disk saturation, pod scheduling latency
- Workload layer: Spark job duration, Kafka consumer lag, executor failure rate
- Data layer: table freshness, row count deltas, schema drift
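At the workload layer, consumer lag is the canonical alert. Assuming lag is exported to Prometheus (the metric name below comes from kafka-exporter; adjust for your exporter), a PrometheusRule might look like:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kafka-lag-alerts
  namespace: monitoring
spec:
  groups:
    - name: kafka-consumer-lag
      rules:
        - alert: ConsumerLagHigh
          # threshold and window are illustrative; tune per topic throughput
          expr: sum by (consumergroup, topic) (kafka_consumergroup_lag) > 100000
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Consumer group {{ $labels.consumergroup }} lagging on {{ $labels.topic }}"
```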
For platform teams managing dozens of data workloads across K8s namespaces, a unified data observability tool like Harbinger Explorer bridges the gap between K8s metrics (which tell you the pod OOM'd) and data metrics (which tell you which tables are stale as a result).
Summary
Running data workloads on Kubernetes is achievable in production — but it requires treating your data pods as first-class citizens in your scheduling, storage, and observability strategy.
Key patterns that work:
- Dedicated node pools with taints for Spark executors
- Strimzi operator with GitOps-managed KafkaTopic CRs
- safe-to-evict=false on all executor pods
- Argo Workflows for K8s-native DAG orchestration
- Multi-layer observability: infra + workload + data
Try Harbinger Explorer free for 7 days — unified observability for your K8s data platform, from executor health to table freshness, all in one place.