Harbinger Explorer

Back to Knowledge Hub
databricks

Unity Catalog Data Governance: Security, Lineage & Audit

10 min read·Tags: unity catalog, data governance, row-level security, column masking, data lineage, audit logs, databricks

Most teams adopt Unity Catalog for the unified namespace. They stay for the governance. Row-level security, column masking, system-level audit logs, and automatic data lineage are all built in — but none of them are zero-configuration. Here's what actually needs to be set up and how.

The Unity Catalog Security Model

Unity Catalog (UC) uses a three-level namespace: catalog.schema.table. Permissions are hierarchical — granting USE CATALOG on a catalog implicitly allows schema and table navigation, but not data access. You always need to grant table-level SELECT separately.

-- Spark SQL (Unity Catalog dialect)
-- Grant read access to a table
GRANT USE CATALOG ON CATALOG analytics TO `data-analysts`;
GRANT USE SCHEMA ON SCHEMA analytics.silver TO `data-analysts`;
GRANT SELECT ON TABLE analytics.silver.customer_events TO `data-analysts`;

-- Or at schema level (covers all current + future tables)
GRANT SELECT ON SCHEMA analytics.silver TO `data-analysts`;

The second form (schema-level grant) is preferred for most teams — it avoids constant re-granting as new tables are added.

Row-Level Security with Row Filters

Row filters let you restrict which rows a user or group can see, without duplicating tables. They're implemented as SQL functions attached to a table.

-- Step 1: Create the filter function
CREATE OR REPLACE FUNCTION analytics.governance.region_filter(region_col STRING)
RETURNS BOOLEAN
RETURN IS_ACCOUNT_GROUP_MEMBER('emea-analysts') AND region_col = 'EMEA'
    OR NOT IS_ACCOUNT_GROUP_MEMBER('emea-analysts');

-- Step 2: Apply the filter to a table
ALTER TABLE analytics.silver.sales
SET ROW FILTER analytics.governance.region_filter ON (sales_region);

Now EMEA analysts only see EMEA rows. Everyone else sees all rows. The filter is enforced at the engine level — it can't be bypassed by direct Delta reads from DBFS (which is one reason to stop using DBFS directly once you're on Unity Catalog).

Supported in: Databricks Runtime 12.2 LTS+ with Unity Catalog enabled. Row filters are evaluated per-query — be aware of performance implications on large tables without partition pruning.

Column Masking

Column masking hides or transforms sensitive values for users without the appropriate role. Common use cases: PII fields, payment card numbers, internal cost centers.

-- Create a masking function
CREATE OR REPLACE FUNCTION analytics.governance.mask_email(email STRING)
RETURNS STRING
RETURN CASE
    WHEN IS_ACCOUNT_GROUP_MEMBER('pii-cleared') THEN email
    ELSE CONCAT(LEFT(email, 2), '***@***.***')
END;

-- Apply the mask to a column
ALTER TABLE analytics.silver.customers
ALTER COLUMN email SET MASK analytics.governance.mask_email;

Users without pii-cleared group membership see jo***@***.***. Users in pii-cleared see the full email. The masking is transparent — the column still exists, queries don't fail.

-- What users without PII clearance see:
SELECT customer_id, email FROM analytics.silver.customers LIMIT 5;
-- customer_id | email
-- 1001        | jo***@***.***
-- 1002        | ma***@***.***

Tags: Classify and Discover Data

Unity Catalog tags let you label catalogs, schemas, tables, and individual columns. Tags are free-text key-value pairs attached to objects in the metastore.

-- Tag a table
ALTER TABLE analytics.silver.customers
SET TAGS ('pii' = 'true', 'data_domain' = 'customer', 'owner' = 'data-platform');

-- Tag a column
ALTER TABLE analytics.silver.customers
ALTER COLUMN email SET TAGS ('sensitivity' = 'high', 'pii_type' = 'email');

-- Query tags via system catalog
SELECT table_name, column_name, tag_name, tag_value
FROM system.information_schema.column_tags
WHERE tag_name = 'sensitivity' AND tag_value = 'high';

Tags integrate with the Databricks catalog UI, making discovery much faster. They also form the basis for policy-driven governance — you can build row filter and masking functions that check tags rather than hardcoding table names.

Data Lineage

Unity Catalog automatically captures column-level lineage across notebooks, jobs, DLT pipelines, and Databricks SQL. No instrumentation required — it's enabled by default when UC is active.

You can query lineage programmatically:

-- System table: table lineage (upstream → downstream)
SELECT
    source_table_full_name,
    target_table_full_name,
    created_by,
    event_time
FROM system.access.table_lineage
WHERE target_table_full_name = 'analytics.gold.customer_360'
ORDER BY event_time DESC
LIMIT 20;

-- Column-level lineage
SELECT
    source_table_full_name,
    source_column_name,
    target_table_full_name,
    target_column_name
FROM system.access.column_lineage
WHERE target_table_full_name = 'analytics.gold.customer_360'
    AND target_column_name = 'lifetime_value';

Lineage is surfaced in the Catalog UI with a visual graph. For regulated industries, this is often the first thing compliance teams want to see.

Lineage limitations to know:

  • Lineage is captured at query execution time — historical lineage before UC migration is not available
  • Lineage from external tools (dbt, Airbyte, custom JDBC) is not captured automatically
  • Delta Lake direct writes outside of Databricks compute won't appear

Audit Logs via System Tables

Unity Catalog writes all access events to system.access.audit. This is the single source of truth for who accessed what, when, and from where.

-- Who accessed the customers table in the last 7 days?
SELECT
    user_identity.email AS user_email,
    action_name,
    request_params.full_name_arg AS table_accessed,
    event_time,
    source_ip_address
FROM system.access.audit
WHERE action_name IN ('commandSubmit', 'runCommand')
    AND request_params.full_name_arg LIKE '%customers%'
    AND event_time >= CURRENT_TIMESTAMP - INTERVAL 7 DAYS
ORDER BY event_time DESC;

-- Failed access attempts (potential security events)
SELECT
    user_identity.email,
    action_name,
    status_code,
    error_message,
    event_time
FROM system.access.audit
WHERE status_code != 200
    AND event_time >= CURRENT_TIMESTAMP - INTERVAL 24 HOURS
ORDER BY event_time DESC;

System tables are queryable with standard SQL — you can build dashboards directly on top of them using Databricks SQL, or export them to your SIEM.

System table availability: Audit logs are retained for 365 days by default. The system.access catalog must be enabled by a metastore admin before use.

Governance Best Practices

PracticeWhy it matters
Use groups, not individual usersGroups survive user churn; individual grants accumulate and become unmaintainable
Apply tags before row filtersTag-based policies scale; table-name-based policies don't
Set schema-level grants for readTable-level grants require re-granting on every new table
Monitor system.access.audit with alertsProactive detection beats reactive investigation
Use service principals for jobsAudit logs show the SP name, not "databricks-job-runner"; much easier to trace
Separate catalog per environmentdev, staging, prod — prevents accidental cross-environment access

Common Pitfalls

1. Granting MODIFY at catalog level This gives write access to every table. Almost never the right call. Grant at schema level minimum, table level preferred.

2. Row filters with UDFs that are too expensive A row filter runs on every query. If your filter function does a subquery against another table, every SELECT becomes a join. Keep filters simple and fast.

3. Assuming column masking covers all access paths Column masking applies to Databricks SQL and notebook compute on Unity Catalog. Direct Delta file access (e.g., reading Parquet files from ADLS bypassing UC) will not enforce masking. Lock down storage-level access too.

4. Not enabling system tables System tables (lineage, audit) must be explicitly enabled per metastore. This is a one-time admin action — do it on day one.

For teams migrating from the legacy Hive metastore, see our Unity Catalog migration guide for the practical migration path, and Unity Catalog best practices for namespace design decisions.

Key Takeaways

Unity Catalog's governance capabilities are production-grade — row filters, column masking, tags, lineage, and audit logs cover most enterprise compliance requirements out of the box. The catch: none of it is automatic. You need to model your access patterns upfront, apply tags consistently, and monitor the audit log actively. The SQL surface area is well-designed and queryable — treat your governance layer as code, version it, and test it like any other data product.


Continue Reading


Continue Reading

Try Harbinger Explorer for free

Connect any API, upload files, and explore with AI — all in your browser. No credit card required.

Start Free Trial

Command Palette

Search for a command to run...