Data Governance Framework: A Practical Guide for Data Teams
You've been there: a stakeholder asks where a number comes from, and three people give three different answers. Or someone discovers that the "customer" table in the warehouse has four different definitions across teams. Or a GDPR request comes in and nobody knows which systems hold PII.
A data governance framework solves these problems — not with bureaucracy, but with clear ownership, shared definitions, and tooling that enforces the rules. This guide shows you how to build one that actually works.
What Data Governance Really Means
Data governance is the system of decision rights, policies, and processes that determines how data is collected, stored, used, and retired across an organization. Strip away the consultant-speak and it boils down to three questions:
- Who owns this data? (Accountability)
- What are the rules? (Policies)
- How do we enforce them? (Processes + Tooling)
If you can answer those three questions for every dataset in your organization, you have governance. Everything else is implementation detail.
What Governance Is NOT
| Misconception | Reality |
|---|---|
| A one-time project | An ongoing operating model |
| An IT-only responsibility | A cross-functional discipline |
| A tool you can buy | A system of people, processes, and tools |
| Locking data down | Enabling safe, fast access to data |
| Writing 200-page policy docs nobody reads | Lightweight, enforceable rules embedded in workflows |
The Four Pillars of a Data Governance Framework
Every effective governance framework rests on four pillars. Skip one and the whole thing wobbles.
1. Data Ownership & Stewardship
Someone must be accountable for every dataset. Not "the data team" — a specific person.
Roles you actually need:
| Role | Responsibility | Who Fills It |
|---|---|---|
| Data Owner | Business accountability — defines what the data means, who can access it, retention rules | Domain/business lead (e.g., Head of Finance owns financial data) |
| Data Steward | Day-to-day governance — maintains metadata, resolves quality issues, enforces policies | Senior analyst or engineer within the domain |
| Data Engineer | Technical implementation — pipelines, access controls, quality checks | Engineering team |
| Data Governance Lead | Cross-cutting coordination — resolves conflicts, maintains standards | Dedicated role or part of a data platform team |
The key insight: ownership lives with the business, not with IT. The finance team owns financial data. The marketing team owns campaign data. Engineers build the infrastructure; they don't define what "active customer" means.
2. Policies & Standards
Policies are the rules. Standards are how you implement them. Keep both short and enforceable.
Core policies every team needs:
- Data Classification Policy — What sensitivity levels exist (public, internal, confidential, restricted) and how each is handled
- Access Policy — Who can access what, how access is requested and revoked
- Retention Policy — How long data is kept and when it's deleted
- Quality Policy — What quality thresholds exist and what happens when they're breached
- Lineage Policy — How upstream/downstream dependencies are tracked
Here's a practical example of a data classification standard implemented as a SQL comment convention:
```sql
-- PostgreSQL: Column-level classification using COMMENT
COMMENT ON COLUMN customers.email IS 'classification:confidential;pii:true;retention:3y';
COMMENT ON COLUMN customers.country IS 'classification:internal;pii:false;retention:indefinite';
COMMENT ON COLUMN orders.total_amount IS 'classification:internal;pii:false;retention:7y';
```
This is lightweight but machine-readable. A downstream scanner can parse these comments and enforce access rules automatically.
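As an illustration of what such a scanner might do, here is a minimal Python sketch that parses the comment convention defined above into a tag dictionary (the function name and structure are assumptions, not part of any particular tool):

```python
def parse_classification(comment: str) -> dict:
    """Parse a 'key:value;key:value' governance comment into a dict of tags."""
    tags = {}
    for pair in comment.split(";"):
        if ":" in pair:
            key, value = pair.split(":", 1)
            tags[key.strip()] = value.strip()
    return tags

tags = parse_classification("classification:confidential;pii:true;retention:3y")
print(tags)  # {'classification': 'confidential', 'pii': 'true', 'retention': '3y'}
```

In practice the scanner would pull these strings from the warehouse's metadata tables (in PostgreSQL, the `pg_description` catalog) and feed the parsed tags into access-control or masking logic.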
3. Data Quality
Governance without quality enforcement is just documentation. You need automated checks that run in your pipelines.
The five dimensions of data quality:
| Dimension | Question It Answers | Example Check |
|---|---|---|
| Completeness | Is all expected data present? | Non-null rate > 99% on required fields |
| Accuracy | Does the data reflect reality? | Revenue totals match source system within 0.1% |
| Consistency | Do related datasets agree? | Customer count in CRM = customer count in warehouse |
| Timeliness | Is the data fresh enough? | Pipeline completes within 2 hours of source update |
| Uniqueness | Are there duplicates? | Primary key uniqueness = 100% |
Here's how you'd implement basic quality checks in a dbt project:
```yaml
# dbt schema.yml — Data quality tests
version: 2

models:
  - name: dim_customers
    description: "Customer dimension — owned by Sales team"
    meta:
      owner: "sales-team"
      classification: "confidential"
      contains_pii: true
    columns:
      - name: customer_id
        description: "Unique customer identifier"
        tests:
          - unique
          - not_null
      - name: email
        description: "Customer email — PII, confidential"
        tests:
          - not_null
          - unique
      - name: created_at
        tests:
          - not_null
          - dbt_utils.expression_is_true:
              expression: "created_at <= current_timestamp"
      - name: country_code
        tests:
          - not_null
          - accepted_values:
              values: ['DE', 'US', 'GB', 'FR', 'NL', 'AT', 'CH']
              config:
                severity: warn
```
This embeds governance directly into your transformation layer. When a test at the default `error` severity fails, the run stops and no bad data reaches dashboards; tests configured with `severity: warn` (like the `country_code` check above) log a warning without blocking.
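If you are not on dbt yet, the same dimensions can be checked with a few lines of plain Python in a pipeline step. A minimal sketch, with illustrative field names:

```python
def completeness(rows: list[dict], field: str) -> float:
    """Share of rows where `field` is present and non-null."""
    non_null = sum(1 for row in rows if row.get(field) is not None)
    return non_null / len(rows)

def uniqueness(rows: list[dict], field: str) -> float:
    """Ratio of distinct values to total values for `field`."""
    values = [row[field] for row in rows]
    return len(set(values)) / len(values)

rows = [
    {"customer_id": 1, "email": "a@example.com"},
    {"customer_id": 2, "email": None},
    {"customer_id": 2, "email": "c@example.com"},  # duplicate customer_id
]
assert completeness(rows, "email") == 2 / 3
assert uniqueness(rows, "customer_id") == 2 / 3
```

The point is not the implementation but where it runs: as a blocking step in the pipeline, so a failed threshold stops bad data from propagating.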
4. Metadata & Data Catalog
A data catalog is the single pane of glass where people find, understand, and trust data. Without one, governance lives in wikis nobody reads.
What your catalog must capture:
- Technical metadata — table names, column types, row counts, freshness
- Business metadata — plain-English descriptions, ownership, classification
- Lineage metadata — where data comes from, what transformations it went through, what depends on it
- Usage metadata — who queries it, how often, for what purpose
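The four metadata types above can be sketched as a single catalog entry. This is a hypothetical shape for illustration, not the schema of any particular catalog tool:

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    # Technical metadata
    table: str
    row_count: int
    # Business metadata
    description: str
    owner: str
    classification: str
    # Lineage metadata
    upstream: list[str] = field(default_factory=list)
    # Usage metadata
    queries_last_30d: int = 0

entry = CatalogEntry(
    table="dim_customers",
    row_count=48_210,
    description="Customer dimension, owned by the Sales team",
    owner="sales-team",
    classification="confidential",
    upstream=["raw.customers", "raw.signups"],
    queries_last_30d=412,
)
```

Whatever tool you pick, check that it can populate all four groups automatically; hand-maintained lineage and usage stats go stale within weeks.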
Popular catalog tools:
| Tool | Best For | Pricing Model |
|---|---|---|
| DataHub (LinkedIn OSS) | Teams comfortable with self-hosting | Free (open source) |
| OpenMetadata | Modern, API-first approach | Free (open source) |
| Atlan | Enterprise teams wanting managed solution | Per-seat, starts ~$30k/yr [PRICING-CHECK] |
| Alation | Large enterprises with complex governance needs | Enterprise pricing [PRICING-CHECK] |
| dbt Docs + dbt Explorer | Teams already using dbt | Free (OSS) / included in dbt Cloud |
Implementation Roadmap: From Zero to Governed
Don't try to govern everything at once. Start small, prove value, expand.
Phase 1: Foundation (Weeks 1–4)
Goal: Establish ownership for your most critical datasets.
- Identify your top 10 most-used datasets (check query logs)
- Assign an owner and steward for each
- Write one-paragraph descriptions for each dataset
- Document known quality issues — don't fix them yet, just acknowledge them
Deliverable: A simple spreadsheet or catalog entries for 10 datasets with owners.
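Even the spreadsheet version can be kept honest with a tiny script that flags datasets missing an owner or steward. A sketch, with made-up dataset and role names:

```python
ownership = {
    "dim_customers": {"owner": "head-of-sales", "steward": "analyst-jane"},
    "fct_orders": {"owner": "head-of-finance", "steward": None},
}

def unassigned(ownership: dict) -> list[str]:
    """Datasets missing an owner or a steward."""
    return [
        name
        for name, roles in ownership.items()
        if not roles.get("owner") or not roles.get("steward")
    ]

print(unassigned(ownership))  # ['fct_orders']
```

Run it in CI against a YAML or JSON version of the mapping and the "assign an owner" step becomes enforceable instead of aspirational.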
Phase 2: Policies & Quality (Weeks 5–8)
Goal: Define the rules and start enforcing them.
- Write your data classification policy (keep it to one page)
- Classify the top 10 datasets
- Add automated quality tests to critical pipelines (start with dbt tests or Great Expectations)
- Set up a weekly 30-minute "data quality standup" with stewards
Deliverable: Classification policy, quality tests running in CI/CD, first quality metrics dashboard.
Phase 3: Scale & Automate (Weeks 9–16)
Goal: Extend governance to all production datasets.
- Roll out ownership to all production datasets
- Deploy a data catalog (or enhance your existing one)
- Implement automated lineage tracking
- Set up access request workflows
- Create onboarding docs for new team members
Deliverable: Full catalog coverage, automated lineage, self-service access requests.
Phase 4: Continuous Improvement (Ongoing)
Goal: Make governance a habit, not a project.
- Monthly governance review — are policies being followed?
- Quarterly ownership audit — have responsibilities shifted?
- Track governance KPIs (catalog coverage, quality score trends, time-to-access)
- Iterate on policies based on what's actually causing friction
Governance KPIs: Measuring What Matters
You can't manage what you don't measure. Track these metrics monthly:
| KPI | Target | How to Measure |
|---|---|---|
| Catalog coverage | >90% of production tables documented | Automated scan of warehouse vs catalog |
| Ownership assignment | 100% of production tables have an owner | Catalog metadata check |
| Quality test coverage | >80% of critical tables have automated tests | dbt/GE test count vs table count |
| Data freshness SLA | >95% of pipelines meet freshness SLA | Pipeline monitoring tool |
| Access request turnaround | <24 hours for standard requests | Ticketing system metrics |
| PII classification | 100% of PII columns tagged | Automated PII scanner + catalog |
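The first KPI in the table reduces to a set comparison between what exists in the warehouse and what is documented in the catalog. A minimal sketch, with illustrative table names:

```python
def catalog_coverage(warehouse_tables: set[str], documented_tables: set[str]) -> float:
    """Share of production tables that have a catalog entry."""
    if not warehouse_tables:
        return 1.0  # nothing to document
    return len(warehouse_tables & documented_tables) / len(warehouse_tables)

warehouse = {"dim_customers", "fct_orders", "dim_products", "fct_payments"}
documented = {"dim_customers", "fct_orders", "dim_products"}
print(f"Catalog coverage: {catalog_coverage(warehouse, documented):.0%}")  # Catalog coverage: 75%
```

In practice `warehouse_tables` comes from an information-schema query and `documented_tables` from your catalog's API; schedule the comparison and the KPI reports itself.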
When Governance Fails: Common Anti-Patterns
The Committee Trap: Creating a "Data Governance Council" that meets monthly, produces slide decks, but never ships anything. Governance must be embedded in daily workflows, not delegated to a committee.
The Tool-First Trap: Buying an expensive catalog tool before defining ownership or policies. The tool will sit empty. People first, processes second, tools third.
The Boil-the-Ocean Trap: Trying to govern every dataset from day one. You'll burn out and give up. Start with the 10 tables that matter most and expand from there.
The Compliance-Only Trap: Treating governance purely as a GDPR/SOX checkbox exercise. Compliance is a byproduct of good governance, not its purpose. The purpose is making data trustworthy and accessible.
Governance for Small Teams
You don't need a dedicated governance team to start. In my experience, teams of 3–10 data professionals can implement effective governance with these adjustments:
- Combine roles: The data engineer who builds the pipeline is also the steward. The analytics lead is also the owner.
- Use dbt as your catalog: `schema.yml` with descriptions, tests, and meta tags covers 80% of catalog needs for free.
- Automate aggressively: Every manual governance step is a step that won't happen consistently. CI/CD for quality tests, automated freshness monitoring, git-based policy versioning.
- Skip the heavyweight tools: A well-maintained dbt project + a Notion page with ownership mappings beats an empty Atlan instance every time.
If you're running a small team exploring data from multiple sources — APIs, CSVs, databases — tools like Harbinger Explorer let you query and catalog those sources directly in the browser with DuckDB WASM, which can serve as a lightweight data exploration layer while you build out governance around your core warehouse.
Getting Started Tomorrow
Here's what you can do right now, before any formal initiative:
- Pick your three most important tables. The ones that show up in every dashboard and every stakeholder question.
- Write a one-sentence description for each. Post it wherever your team communicates.
- Add one quality test per table. A `NOT NULL` check on the primary key counts. Ship it to production.
- Name an owner for each. Send them a message: "You own this table. If something breaks, you're the first call."
That's governance. Everything else is scaling it up.
Continue Reading
- What Is dbt? A Complete Guide for Data Teams
- Data Lakehouse Architecture Explained
- Data Catalog Best Practices for Modern Data Teams
[PRICING-CHECK] Atlan and Alation pricing figures are estimates based on public information — verify with vendors for current rates.