Data Contracts for Teams

9 min read · Tags: data-contracts, data-quality, schema-registry, dbt, data-engineering, kafka

Every data engineer has hit this wall: the upstream team changed a column type, dropped a field, or renamed a table — without telling anyone. Your pipeline failed silently at 3 AM, the dashboard showed zeros, and the business blamed data engineering. Data contracts exist to stop this from happening.

A data contract is a formal, versioned agreement between a data producer (the team that owns a table or event stream) and its consumers (the pipelines and applications that depend on it). Think of it as an API contract, but for data assets. The concept isn't new — service teams have used OpenAPI contracts for years — but its application to data pipelines is recent and still maturing.

Why Teams Resist Contracts (and Why That's Wrong)

The usual objection is overhead. Teams worry about process, documentation, and slowing down development velocity. This reasoning is backwards. Undocumented schema changes cost more — in incident response time, debugging cycles, and eroded trust between teams. A contract forces the conversation before the breaking change lands, not after.

The real overhead isn't writing contracts. It's the lack of them.

What a Data Contract Contains

A complete data contract specifies:

| Component | Description | Example |
| --- | --- | --- |
| Schema | Field names, types, nullability | user_id: INT NOT NULL |
| Semantics | What each field actually means | "event_time is UTC wall-clock time, not server-local time" |
| SLA | Freshness and availability guarantees | "Updated within 15 min of event, 99.5% uptime" |
| Ownership | Who is responsible for the dataset | Team: Checkout Platform, Slack: #checkout-data |
| Versioning | How changes are communicated | Semver: breaking = major bump, additive = minor |
| Quality rules | Expectations consumers rely on | "amount > 0 always, currency is always ISO 4217" |

Contract Formats in Practice

There is no single standard yet. Three approaches dominate in practice.

1. YAML-Based Contracts (Open Data Contract Standard)

The Open Data Contract Standard (ODCS) defines a YAML schema for data contracts. It's gaining traction among teams that want a lightweight, version-controlled approach without adopting a platform.

# ODCS-compatible data contract (simplified)
apiVersion: v2.3.0
kind: DataContract
uuid: "a8b2c3d4-1234-5678-abcd-ef0123456789"
datasetName: orders
version: "1.4.0"
status: active

description:
  purpose: "Order lifecycle events for analytics and downstream ML"
  usage: "Read-only. Do not rely on fields marked internal."

team: checkout-platform
owner: platform-data@company.com

schema:
  - name: orders
    physicalName: checkout.orders
    columns:
      - name: order_id
        logicalType: string
        physicalType: VARCHAR(36)
        required: true
        description: "UUID v4. Immutable after creation."
      - name: user_id
        logicalType: integer
        physicalType: BIGINT
        required: true
      - name: status
        logicalType: string
        physicalType: VARCHAR(20)
        required: true
        description: "Enum: pending, confirmed, shipped, delivered, cancelled"
      - name: amount_usd
        logicalType: number
        physicalType: DECIMAL(10,2)
        required: true
        quality:
          - rule: "amount_usd >= 0"
            action: fail

sla:
  - property: freshness
    value: "15 minutes"
  - property: completeness
    value: "99.9%"

consumers:
  - team: analytics-platform
    contact: analytics@company.com
  - team: ml-platform
    contact: mlops@company.com

This file lives in the producer team's repository, versioned and reviewed like application code. Changes to it trigger notifications to registered consumers.
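The notification hook is usually a small CI step. As one minimal sketch (function name and workflow are illustrative, not part of any standard), a script can classify the semver bump between the contract version on the main branch and the one in the PR, then decide who needs to hear about it:

```python
def bump_type(old: str, new: str) -> str:
    """Classify a contract version change under semver."""
    o = tuple(int(p) for p in old.split("."))
    n = tuple(int(p) for p in new.split("."))
    if n == o:
        return "none"
    if n < o:
        raise ValueError(f"version went backwards: {old} -> {new}")
    if n[0] > o[0]:
        return "major"  # breaking: consumers need advance notice
    if n[1] > o[1]:
        return "minor"  # additive: notify consumers, no action required
    return "patch"      # clarification or metadata only

# In CI: fail the PR if the schema changed but the version didn't,
# and ping the registered consumers on any major or minor bump.
print(bump_type("1.4.0", "2.0.0"))  # major
```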

2. Schema Registry (Confluent / Karapace)

For event-driven architectures on Kafka, Schema Registry enforces contracts at the protocol level. Producers register an Avro, Protobuf, or JSON Schema; consumers decode messages using the registered schema. Compatibility rules are enforced by the registry itself: with a compatibility mode configured (for example, BACKWARD), the registry rejects any schema registration that would break existing consumers, so a breaking change cannot ship without a new schema version that passes the compatibility check first.

# Register a schema version in Confluent Schema Registry
curl -X POST http://schema-registry:8081/subjects/orders-value/versions \
  -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  -d '{"schema": "{\"type\":\"record\",\"name\":\"Order\",\"fields\":[{\"name\":\"order_id\",\"type\":\"string\"},{\"name\":\"amount_usd\",\"type\":\"double\"},{\"name\":\"status\",\"type\":\"string\"}]}"}'

# Check compatibility before publishing a new schema version
curl -X POST http://schema-registry:8081/compatibility/subjects/orders-value/versions/latest \
  -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  -d '{"schema": "<new schema JSON here>"}'
# Returns: {"is_compatible": true}

Schema Registry is the strongest form of contract enforcement available. The trade-off is that it's tightly coupled to the Kafka ecosystem and adds infrastructure complexity.
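The compatibility check also works well as a programmatic pre-deploy gate. A sketch using only the standard library, mirroring the curl calls above (the registry address and subject name are assumptions from the example):

```python
import json
import urllib.request

REGISTRY = "http://schema-registry:8081"  # assumed registry address

def compat_request(subject: str, schema: dict) -> urllib.request.Request:
    """Build the same compatibility-check request the curl example sends."""
    body = json.dumps({"schema": json.dumps(schema)}).encode()
    return urllib.request.Request(
        f"{REGISTRY}/compatibility/subjects/{subject}/versions/latest",
        data=body,
        headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
        method="POST",
    )

def is_compatible(subject: str, schema: dict) -> bool:
    """POST the candidate schema; the registry answers {"is_compatible": bool}."""
    with urllib.request.urlopen(compat_request(subject, schema)) as resp:
        return json.loads(resp.read())["is_compatible"]

# Additive field with a default: backward compatible under BACKWARD mode.
new_schema = {
    "type": "record", "name": "Order",
    "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "amount_usd", "type": "double"},
        {"name": "status", "type": "string"},
        {"name": "currency", "type": "string", "default": "USD"},
    ],
}
req = compat_request("orders-value", new_schema)
print(req.get_method(), req.full_url)
```

Wiring `is_compatible` into CI means the deploy pipeline, not an on-call engineer, is the one that discovers an incompatible change.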

3. dbt Contracts (dbt Core 1.5+)

For teams using dbt, setting enforced: true under a model's contract config turns the schema YAML file into a runtime-enforced contract. If the model output doesn't match the declared columns and types, the dbt run fails loudly before the data reaches consumers.

# dbt schema.yml — model with enforced contract
models:
  - name: orders
    config:
      contract:
        enforced: true
    columns:
      - name: order_id
        data_type: varchar
        constraints:
          - type: not_null
          - type: unique
        description: "UUID v4. Immutable."
      - name: amount_usd
        data_type: numeric
        constraints:
          - type: not_null
      - name: status
        data_type: varchar
        constraints:
          - type: not_null

The dbt approach is ideal for the transformation layer — it prevents a model refactor from silently breaking downstream consumers without any additional tooling.

What "Breaking" Actually Means

Teams frequently under-specify this, which leads to disputes. A breaking change is any change that can cause a previously working consumer to fail or produce incorrect results:

| Change | Breaking? | Notes |
| --- | --- | --- |
| Remove a column | ✅ Yes | Always breaking |
| Rename a column | ✅ Yes | Always breaking |
| Change type (e.g., INT → VARCHAR) | ✅ Yes | Always breaking |
| Tighten nullability (nullable → NOT NULL) | ✅ Yes | Rejects previously valid rows |
| Add a new NOT NULL column without default | ✅ Yes | Breaks INSERT statements |
| Change enum values | ✅ Yes | Breaks CASE/IF logic downstream |
| Add a new nullable column | ❌ No | Safe for most consumers |
| Loosen nullability (NOT NULL → nullable) | ❌ No | Schema-compatible, though consumers that assume non-null values should review |
| Add index or constraint with no type change | ❌ No | Transparent to consumers |
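These rules are mechanical enough to automate in a contract CI check. A simplified sketch of a classifier (it models columns as name -> (type, nullable) and deliberately ignores defaults and enum values; a real check would cover those too):

```python
def breaking_changes(old: dict, new: dict) -> list[str]:
    """List breaking changes between two schemas of {column: (type, nullable)}."""
    problems = []
    for col, (typ, nullable) in old.items():
        if col not in new:
            problems.append(f"removed column {col}")            # remove/rename: breaking
            continue
        new_typ, new_nullable = new[col]
        if new_typ != typ:
            problems.append(f"{col}: type {typ} -> {new_typ}")  # type change: breaking
        if nullable and not new_nullable:
            problems.append(f"{col}: tightened to NOT NULL")    # rejects valid rows
    for col, (typ, nullable) in new.items():
        if col not in old and not nullable:
            problems.append(f"new NOT NULL column {col}")       # breaks INSERTs
    return problems

old = {"order_id": ("varchar", False), "amount": ("int", True)}
new = {"order_id": ("varchar", False), "amount": ("varchar", True), "note": ("varchar", True)}
print(breaking_changes(old, new))  # ['amount: type int -> varchar']
```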

The Producer/Consumer Protocol

A contract without a process is just documentation. Here's a minimal workflow that holds up in practice:

For producers (schema change protocol):

  1. Any change that could break consumers requires a contract version bump before deployment
  2. Breaking changes require advance notice — define a lead time (e.g., two sprints) in the contract
  3. Additive changes (new nullable field) require a minor version bump and a consumer notification
  4. The contract lives in the producer's repo; PRs against it trigger notifications to all registered consumers

For consumers:

  1. Register as a consumer in the contract file — this is how producers know who to notify
  2. Pin your pipeline configuration to a specific contract version
  3. Opt into version upgrade notifications via your preferred channel (Slack, Jira, email)
  4. Never treat undocumented fields as stable
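Consumer step 2, version pinning, can be as small as a guard at the top of the pipeline. A sketch, assuming pins are either exact versions or a major series written like 1.x (both conventions are illustrative, not from any standard):

```python
def pin_matches(pin: str, version: str) -> bool:
    """Check a pinned contract version against the contract's current version."""
    if pin.endswith(".x"):  # major-series pin: accept any version in that major
        return version.split(".")[0] == pin[:-2]
    return version == pin   # exact pin

# Fail fast, before reading any data, if the contract moved to a new major.
assert pin_matches("1.x", "1.5.2")
assert not pin_matches("1.x", "2.0.0")
```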

Common Mistakes

Contracts as documentation only. A contract that isn't checked by any automated process is just a comment. It will drift from reality. Wire the contract into CI/CD, schema registry compatibility checks, or dbt tests.

Skipping semantic definitions. Field types are the easy part. The hard part is agreeing what event_time means — generated timestamp, Kafka ingestion time, or warehouse landing time? Semantic misalignment causes silent wrong results that are far harder to debug than schema errors.
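A few lines of code show why this matters: the same naive timestamp string names two instants hours apart depending on which timezone a consumer assumes (the offset here is chosen purely for illustration):

```python
from datetime import datetime, timedelta, timezone

raw = "2024-03-01 14:00:00"  # naive timestamp string, as often found in raw events
fmt = "%Y-%m-%d %H:%M:%S"

# Interpretation 1: the contract says event_time is UTC wall-clock time.
as_utc = datetime.strptime(raw, fmt).replace(tzinfo=timezone.utc)

# Interpretation 2: a consumer wrongly assumes server-local time (UTC-5 here).
as_local = datetime.strptime(raw, fmt).replace(tzinfo=timezone(timedelta(hours=-5)))

# The same string now denotes two instants five hours apart.
print(as_local - as_utc)  # 5:00:00
```

Nothing errors, no schema check fires; every downstream aggregate is just quietly shifted. That is exactly the class of bug a semantic definition in the contract prevents.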

No consumer registry. If you don't know who depends on a dataset, you can't notify them of changes. A consumer list in the contract file is the minimum viable answer.

No deprecation policy. How long do you maintain v1.x after v2.0 ships? Define this before it becomes a negotiation under pressure.

Treating contracts as a platform team problem. Contracts work when every team — including application developers — sees schema stability as their responsibility. If only the data team cares, you're writing contracts into a void.

Tooling Landscape

| Tool | Approach | Best For |
| --- | --- | --- |
| Soda Core | YAML contracts + quality assertions | dbt/warehouse teams |
| Confluent Schema Registry | Protocol-level enforcement | Kafka/streaming teams |
| dbt contracts (1.5+) | Model-level enforcement | dbt transformation layer |
| OpenMetadata / DataHub | Catalog + contract metadata | Platform teams |
| Atlan / Collibra | Enterprise data governance | Larger organizations |

There's no universal winner. Pick based on where your data actually flows and what tooling your team already operates.

Contracts and Exploration

When a contract is well-defined, ad-hoc exploration becomes much safer. You know the schema, the semantics, and the quality guarantees — so queries are predictable. Harbinger Explorer's natural language interface works well in this context: when you can describe what a dataset's fields actually mean, the AI generates SQL that reflects those semantics rather than guessing from column names alone.

Conclusion

Data contracts shift the conversation from "who broke the pipeline" to "how do we evolve data safely." They require upfront discipline from producers, but they pay off in fewer incidents, faster debugging, and durable trust between teams. Start small: pick one critical dataset, write a YAML contract for it, wire it into CI, and register your consumers. The rest follows.

For runtime quality validation that checks contracts are being honored, see the Data Quality Testing guide.
