
Serverless Data Processing: When It Works and When It Doesn't

13 min read · Tags: serverless, Lambda, Cloud-Run, data-processing, event-driven, cloud-architecture

Serverless is one of the most over-applied concepts in data engineering. The promise of infinite scale, zero ops, and pay-per-invocation pricing lures teams into using it for workloads it was never designed for. The result is systems that are expensive, hard to debug, and slower than what they replaced.

At the same time, serverless genuinely excels for specific data processing patterns. The problem is most teams don't have a clear decision framework for when to use it.

This guide gives you that framework, with honest benchmarks.


What "Serverless" Actually Means for Data Processing

The term covers several distinct execution models, which have different tradeoffs:

| Model | Examples | Unit of billing | Cold start |
|---|---|---|---|
| Function-as-a-Service (FaaS) | AWS Lambda, GCP Cloud Functions, Azure Functions | Per invocation + duration | 100ms–3s |
| Container-on-demand | Cloud Run, Lambda Container, Azure Container Apps | Per request + CPU-seconds | 1–10s |
| Serverless SQL | Athena, BigQuery, Synapse Serverless | Per TB scanned | N/A (query) |
| Serverless Spark | Databricks Serverless, EMR Serverless, Dataproc Serverless | DBU or vCPU-hours | 30s–2min |
| Serverless streaming | Kinesis, Pub/Sub, EventBridge Pipes | Per message/unit | N/A |

These are fundamentally different products. "Should I use serverless?" is the wrong question — "which serverless model fits this workload?" is the right one.


Where Serverless Works Well

1. Event-Driven Micro-Ingestion

Small, frequent, unpredictable events are the canonical serverless use case. An IoT sensor sends readings when it has something to report — not on a schedule. A webhook fires when a payment completes.

This works because:

  • Events are small (< 1 MB each)
  • Processing is stateless (each event is independent)
  • Volume is unpredictable (Lambda handles 0→10,000 events/min without pre-provisioning)
  • Cold starts are acceptable (background processing, not user-facing)

Lambda for webhook ingestion:

# lambda_function.py
import json
import boto3
import os
from datetime import datetime, timezone

s3 = boto3.client('s3')
BUCKET = os.environ['BRONZE_BUCKET']

def lambda_handler(event, context):
    # Parse webhook payload (API Gateway proxy event)
    payload = json.loads(event['body'])

    # Enrich with metadata
    now = datetime.now(timezone.utc)  # datetime.utcnow() is deprecated in 3.12
    record = {
        **payload,
        "_ingested_at": now.isoformat(),
        "_source": event['headers'].get('X-Webhook-Source', 'unknown'),
        "_partition_date": now.strftime('%Y/%m/%d')
    }
    
    # Write to S3 with partitioning
    key = f"webhooks/{record['_partition_date']}/{context.aws_request_id}.json"
    s3.put_object(
        Bucket=BUCKET,
        Key=key,
        Body=json.dumps(record),
        ContentType='application/json'
    )
    
    return {'statusCode': 200, 'body': 'OK'}

Note: this handler parses an API Gateway proxy event. With the SQS event source mapping below, Lambda instead delivers batches under event['Records'], so the handler needs to loop over the records and parse each record['body'].

# Terraform — Lambda with SQS trigger and DLQ
resource "aws_lambda_function" "webhook_ingestion" {
  filename      = "webhook_ingestion.zip"
  function_name = "webhook-ingestion"
  role          = aws_iam_role.lambda_ingestion.arn
  handler       = "lambda_function.lambda_handler"
  runtime       = "python3.12"
  timeout       = 30
  memory_size   = 256

  environment {
    variables = {
      BRONZE_BUCKET = aws_s3_bucket.bronze.id
    }
  }

  dead_letter_config {
    target_arn = aws_sqs_queue.ingestion_dlq.arn
  }
}

resource "aws_lambda_event_source_mapping" "sqs_trigger" {
  event_source_arn = aws_sqs_queue.webhook_queue.arn
  function_name    = aws_lambda_function.webhook_ingestion.arn
  batch_size       = 100
  
  filter_criteria {
    filter {
      pattern = jsonencode({
        body = {
          event_type = ["payment.completed", "payment.failed"]
        }
      })
    }
  }
}
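For completeness, here is a sketch of the batch-aware handler that fits the SQS event source mapping above. The partial-batch failure reporting assumes `function_response_types = ["ReportBatchItemFailures"]` is also set on the mapping, which the snippet above doesn't include:

```python
# lambda_function.py (SQS-triggered variant, sketch)
import json
import os
from datetime import datetime, timezone

def enrich(payload, message_id):
    """Add ingestion metadata and build the partitioned S3 key."""
    now = datetime.now(timezone.utc)
    record = {
        **payload,
        "_ingested_at": now.isoformat(),
        "_partition_date": now.strftime('%Y/%m/%d'),
    }
    key = f"webhooks/{record['_partition_date']}/{message_id}.json"
    return record, key

def lambda_handler(event, context):
    import boto3  # bundled in the Lambda runtime
    s3 = boto3.client('s3')
    bucket = os.environ['BRONZE_BUCKET']
    failures = []
    for sqs_record in event['Records']:
        try:
            record, key = enrich(json.loads(sqs_record['body']),
                                 sqs_record['messageId'])
            s3.put_object(Bucket=bucket, Key=key,
                          Body=json.dumps(record),
                          ContentType='application/json')
        except Exception:
            # Report only the failures; successes are deleted from the queue
            failures.append({"itemIdentifier": sqs_record['messageId']})
    return {"batchItemFailures": failures}
```

Returning only the failed messageIds lets Lambda delete the successes and retry just the failures, which is what eventually routes poison messages to the DLQ.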

2. Serverless SQL for Ad-Hoc Analytics

Athena and BigQuery are the clearest serverless wins in the data space. Zero infrastructure, SQL interface, pay per TB scanned.

When it's the right call:

  • Queries run 0-20× per day (on-demand is cheaper than reserved capacity)
  • Data is already in S3/GCS (no movement cost)
  • Queries are exploratory, not production SLA-bound

-- Athena query with partition pruning (fast + cheap)
SELECT 
    event_type,
    COUNT(*) as event_count,
    SUM(revenue_usd) as total_revenue
FROM events
WHERE 
    year = '2024'
    AND month = '03'
    AND day BETWEEN '01' AND '31'
    AND event_type IN ('purchase', 'subscription')
GROUP BY 1
ORDER BY 3 DESC;
-- Scans ~2 GB (partitioned) vs 800 GB (unpartitioned) — 400x cost difference
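That cost comment is easy to sanity-check. A back-of-envelope helper, assuming Athena's standard $5-per-TB-scanned on-demand rate and its 10 MB per-query minimum (verify both against your region's pricing):

```python
def athena_query_cost(bytes_scanned, price_per_tb=5.0):
    """Athena bills per TB scanned, with a 10 MB per-query minimum."""
    billable = max(bytes_scanned, 10 * 1024**2)
    return billable / 1024**4 * price_per_tb

# The partitioned query scans ~2 GB, the unpartitioned one ~800 GB:
cheap = athena_query_cost(2 * 1024**3)      # ≈ $0.01 per query
pricey = athena_query_cost(800 * 1024**3)   # ≈ $3.91 per query
```

At 20 runs a day the unpartitioned version costs roughly $78/day, which is why partition pruning is the single highest-leverage Athena optimization.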

3. Orchestration Glue and Data Quality Checks

Lightweight, infrequent jobs that check data quality, trigger downstream pipelines, or fan out work are ideal for serverless.

# AWS Step Functions — serverless orchestration
StateMachine:
  Type: AWS::StepFunctions::StateMachine
  Properties:
    RoleArn: !GetAtt StatesExecutionRole.Arn  # required; role defined elsewhere
    Definition:
      StartAt: ValidateSchema
      States:
        ValidateSchema:
          Type: Task
          Resource: !GetAtt SchemaValidationLambda.Arn
          Next: BranchByResult
        BranchByResult:
          Type: Choice
          Choices:
            - Variable: $.validation_passed
              BooleanEquals: true
              Next: TriggerTransformation
          Default: AlertAndFail
        TriggerTransformation:
          Type: Task
          Resource: arn:aws:states:::glue:startJobRun.sync
          Parameters:
            JobName: silver-transformation
          End: true
        AlertAndFail:
          Type: Task
          Resource: !GetAtt AlertLambda.Arn
          Next: Fail
        Fail:
          Type: Fail
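For concreteness, here is what the SchemaValidationLambda referenced above might look like. The required-column contract is hypothetical, but the return shape matches the $.validation_passed path the Choice state branches on:

```python
# Hypothetical column contract for the bronze events table
REQUIRED_COLUMNS = {
    "event_id": str,
    "event_type": str,
    "revenue_usd": (int, float),
}

def validate_records(records):
    """Return a list of human-readable violations (empty list = valid)."""
    errors = []
    for i, rec in enumerate(records):
        for col, typ in REQUIRED_COLUMNS.items():
            if col not in rec:
                errors.append(f"row {i}: missing column '{col}'")
            elif not isinstance(rec[col], typ):
                errors.append(f"row {i}: '{col}' is {type(rec[col]).__name__}")
    return errors

def lambda_handler(event, context):
    errors = validate_records(event.get("records", []))
    # The Choice state reads $.validation_passed from this output
    return {"validation_passed": not errors, "errors": errors[:20]}
```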

Where Serverless Fails

1. Long-Running, Memory-Intensive Batch Jobs

Lambda has a 15-minute timeout and a 10 GB memory limit. Cloud Run tops out at a 60-minute request timeout and 32 GB. Neither is appropriate for a 4-hour Spark job processing 10 TB of data.

The failure pattern:

Team tries to replace Spark cluster with Lambda for nightly ETL.
- Job runs 2h → Lambda times out at 15min
- Team splits job into 1000 smaller Lambdas
- Cold starts add 30 min overhead
- Coordination logic becomes more complex than the original job
- Cost: $340/night vs $12/night with Spot EMR

The irony: the operational simplicity of serverless disappears when you're orchestrating thousands of functions to simulate what Spark does natively.

2. High-Throughput Streaming with Stateful Processing

Lambda + Kinesis can handle ~1 MB/s per shard. For a 10 MB/s stream with stateful windowing (session analysis, fraud detection), you hit limits fast.

Benchmarks — 100 events/sec sustained for 8h:

| Approach | Cost | Latency P99 | Max throughput |
|---|---|---|---|
| Lambda (Kinesis trigger) | $18/day | 800ms | ~5k events/s |
| Flink on EKS | $22/day | 45ms | 500k+ events/s |
| Flink on EMR Serverless | $28/day | 55ms | 200k events/s |

Lambda loses on latency and throughput ceiling. Flink wins on both, and the cost delta is small at scale.

3. ML Inference at High Volume

A model inference Lambda handling 1,000 requests/second with a 100ms p50 latency looks cheap. Until you calculate:

1,000 req/s × 100ms × 1 GB memory = 100 GB-seconds/s
100 GB-seconds/s × 86,400 s/day = 8,640,000 GB-seconds/day
Cost: 8,640,000 × $0.0000166667 = $144/day = $4,320/month

Same workload on 3× ml.c5.2xlarge (8 vCPU, 16 GB):
$0.464/hr × 3 × 720hr = $1,002/month

Serverless is 4× more expensive for this workload, and you get worse tail latency due to cold starts.
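The arithmetic above generalizes into a quick comparison helper, using the per-GB-second and per-hour prices quoted in the text:

```python
def lambda_monthly_cost(rps, duration_s, memory_gb,
                        gb_second_price=0.0000166667):
    """Monthly Lambda compute cost at a steady request rate."""
    gb_seconds_per_day = rps * duration_s * memory_gb * 86_400
    return gb_seconds_per_day * gb_second_price * 30

def ec2_monthly_cost(instances, hourly_rate=0.464, hours=720):
    """Monthly cost of always-on instances (ml.c5.2xlarge rate from above)."""
    return instances * hourly_rate * hours

serverless = lambda_monthly_cost(1_000, 0.100, 1.0)  # ≈ $4,320/month
provisioned = ec2_monthly_cost(3)                    # ≈ $1,002/month
```

Plug in your own traffic shape before deciding: at low, spiky request rates the same formula flips decisively in Lambda's favor.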

4. The Cold Start Tax for User-Facing APIs

If your data API needs < 200ms p99 latency, serverless functions are usually the wrong choice without aggressive provisioned concurrency (which dramatically reduces the cost benefit).

Lambda cold start breakdown (Python 3.12, 512MB):
- Container initialization: 80-200ms
- Runtime initialization: 50-150ms
- Handler initialization (imports, connections): 100-500ms
Total: 230ms - 850ms added to first request

Provisioned concurrency eliminates cold starts but costs $0.0000646/function-second — roughly equivalent to keeping EC2 instances running.
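Before paying for provisioned concurrency, measure how often you actually hit the tax. Module scope runs once per container, so a module-level flag cheaply tags each invocation as cold or warm in your logs:

```python
import time

_CONTAINER_STARTED = time.time()  # runs once per container, at init
_cold = True

def lambda_handler(event, context):
    global _cold
    is_cold, _cold = _cold, False  # only the first invocation is cold
    return {
        "statusCode": 200,
        "cold_start": is_cold,
        "container_age_s": round(time.time() - _CONTAINER_STARTED, 3),
    }
```

If your cold-start rate is under a few percent and the traffic is background work, the tax may already be acceptable without any extra spend.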


The Decision Framework

Work through four questions in order:

  • Duration: will the job exceed 15 minutes (Lambda) or 60 minutes (Cloud Run)? If so, FaaS is out; consider serverless Spark or a managed cluster.
  • Memory: does it need more than 10 GB (Lambda) or 32 GB (Cloud Run)? Same answer.
  • Latency: is it user-facing with a tight p99 target? Cold starts rule out plain FaaS unless you pay for provisioned concurrency.
  • Frequency and cost: infrequent, spiky workloads favor serverless; daily, heavy, predictable workloads favor reserved or spot capacity. Run the numbers both ways.

Serverless Spark: The Middle Ground

AWS EMR Serverless and Databricks Serverless Compute solve the main pain points of traditional serverless for data workloads: no cold start lock-in, no timeout limits, genuine Spark-scale processing.

# EMR Serverless — submit Spark job
aws emr-serverless start-job-run \
  --application-id app-1234567890abcdef \
  --execution-role-arn arn:aws:iam::123456789:role/emr-serverless-execution \
  --job-driver '{
    "sparkSubmit": {
      "entryPoint": "s3://my-bucket/scripts/transform.py",
      "sparkSubmitParameters": "--conf spark.executor.cores=4 --conf spark.executor.memory=8g"
    }
  }' \
  --configuration-overrides '{
    "monitoringConfiguration": {
      "s3MonitoringConfiguration": {
        "logUri": "s3://my-bucket/logs/"
      }
    }
  }'

EMR Serverless vs EMR on EC2 (1 TB job, 2× monthly):

| | EMR Serverless | EMR on EC2 (Spot) |
|---|---|---|
| Setup time | 0 min | 8 min |
| Cost per run | ~$4.20 | ~$2.80 |
| Monthly (2 runs) | ~$8.40 | ~$5.60 + cluster fixed cost |
| Idle cost | $0 | $0 (if terminated) |
| Operational effort | Very low | Low-medium |

EMR Serverless wins clearly for infrequent jobs. For daily+ jobs, managed clusters with spot instances win on cost.
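The crossover point depends on the cluster's fixed monthly cost, which the table leaves unquantified. A sketch using the per-run prices above and a hypothetical $40/month of fixed cost (EBS volumes, a small primary node, idle gaps):

```python
def breakeven_runs(serverless_per_run=4.20, spot_per_run=2.80,
                   cluster_fixed_monthly=40.0):
    """Runs per month above which the EC2 cluster becomes cheaper overall."""
    return cluster_fixed_monthly / (serverless_per_run - spot_per_run)

runs = breakeven_runs()  # ≈ 28.6 runs/month under these assumptions
```

With those assumptions the cluster only wins past roughly 29 runs a month, i.e. about daily, which matches the guidance above.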


Observability for Serverless Data Pipelines

The hardest part of serverless debugging: distributed execution, ephemeral logs, and no SSH.

# Structured logging for Lambda — mandatory for production
import json
import logging
import time
from functools import wraps

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def log_invocation(func):
    @wraps(func)
    def wrapper(event, context):
        start = time.time()
        request_id = context.aws_request_id
        
        logger.info(json.dumps({
            "event": "lambda_start",
            "request_id": request_id,
            "function": context.function_name,
            "memory_mb": context.memory_limit_in_mb,
            "records_count": len(event.get('Records', [event]))
        }))
        
        try:
            result = func(event, context)
            duration_ms = (time.time() - start) * 1000
            
            logger.info(json.dumps({
                "event": "lambda_success",
                "request_id": request_id,
                "duration_ms": round(duration_ms, 2),
                "remaining_ms": context.get_remaining_time_in_millis()
            }))
            return result
            
        except Exception as e:
            logger.error(json.dumps({
                "event": "lambda_error",
                "request_id": request_id,
                "error": str(e),
                "error_type": type(e).__name__
            }))
            raise
    return wrapper

@log_invocation
def lambda_handler(event, context):
    # Your actual logic here
    pass

Summary: The Honest Assessment

Serverless data processing is genuinely excellent for event-driven ingestion, ad-hoc SQL analytics, and infrequent batch jobs. It's a poor fit for long-running jobs, high-throughput streaming, ML inference at scale, and latency-sensitive user-facing APIs.

The industry is moving toward serverless Spark (EMR Serverless, Databricks Serverless) as a compelling middle ground — you get managed infrastructure and automatic scaling without the hard limits of FaaS.

Use the decision framework: duration, memory, frequency, and cost comparison. The right answer is workload-specific, not a blanket "serverless is modern, therefore correct."


Try Harbinger Explorer free for 7 days — test your serverless data API endpoints, validate response schemas under load, and identify cold start latency issues before your users do. harbingerexplorer.com

