Skip to main content

data-quality-frameworks

Implement data quality validation with Great Expectations, dbt tests, and data contracts. Use when building data quality pipelines, implementing validation rules, or establishing data contracts.

Stars
36,148
Source
wshobson/agents
Updated
2026-05-29
Slug
wshobson--agents--data-quality-frameworks
View on GitHubRaw SKILL.md

// install — copy + paste into any project

mkdir -p .claude/skills && curl -fsSL https://raw.githubusercontent.com/wshobson/agents/HEAD/plugins/data-engineering/skills/data-quality-frameworks/SKILL.md -o .claude/skills/data-quality-frameworks.md

Drops the SKILL.md into .claude/skills/data-quality-frameworks.md. Works with Claude Code, Cursor, and any agent that loads SKILL.md files from .claude/skills/.

Data Quality Frameworks

Production patterns for implementing data quality with Great Expectations, dbt tests, and data contracts to ensure reliable data pipelines.

When to Use This Skill

  • Implementing data quality checks in pipelines
  • Setting up Great Expectations validation
  • Building comprehensive dbt test suites
  • Establishing data contracts between teams
  • Monitoring data quality metrics
  • Automating data validation in CI/CD

Core Concepts

1. Data Quality Dimensions

Dimension Description Example Check
Completeness No missing values expect_column_values_to_not_be_null
Uniqueness No duplicates expect_column_values_to_be_unique
Validity Values in expected range expect_column_values_to_be_in_set
Accuracy Data matches reality Cross-reference validation
Consistency No contradictions expect_column_pair_values_A_to_be_greater_than_B
Timeliness Data is recent expect_column_max_to_be_between

2. Testing Pyramid for Data

          /\
         /  \     Integration Tests (cross-table)
        /────\
       /      \   Unit Tests (single column)
      /────────\
     /          \ Schema Tests (structure)
    /────────────\

Quick Start

Great Expectations Setup

# Install
pip install great_expectations

# Initialize project
great_expectations init

# Create datasource
great_expectations datasource new
# great_expectations/checkpoints/daily_validation.yml
import great_expectations as gx

# Create context
context = gx.get_context()

# Create expectation suite
suite = context.add_expectation_suite("orders_suite")

# Add expectations
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToNotBeNull(column="order_id")
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeUnique(column="order_id")
)

# Validate
results = context.run_checkpoint(checkpoint_name="daily_orders")

Detailed patterns and worked examples

Detailed pattern documentation lives in references/details.md. Read that file when the navigation tier above is insufficient.

Summary: {total_passed}/{total_tables} tables passed")

    report.append("")

    for table, result in results.items():
        status = "✅" if result.passed else "❌"
        report.append(f"### {status} {table}")
        report.append(f"- Expectations: {result.total_expectations}")
        report.append(f"- Failed: {result.failed_expectations}")

        if not result.passed:
            report.append("- Failed checks:")
            for detail in result.details:
                if not detail["success"]:
                    report.append(f"  - {detail['expectation']}: {detail['observed_value']}")
        report.append("")

    return "\n".join(report)

Usage

context = gx.get_context() pipeline = DataQualityPipeline(context)

tables_to_validate = { "orders": "orders_suite", "customers": "customers_suite", "products": "products_suite", }

results = pipeline.run_all(tables_to_validate) report = pipeline.generate_report(results)

Fail pipeline if any table failed

if not all(r.passed for r in results.values()): print(report) raise ValueError("Data quality checks failed!")


## Best Practices

### Do's

- **Test early** - Validate source data before transformations
- **Test incrementally** - Add tests as you find issues
- **Document expectations** - Clear descriptions for each test
- **Alert on failures** - Integrate with monitoring
- **Version contracts** - Track schema changes

### Don'ts

- **Don't test everything** - Focus on critical columns
- **Don't ignore warnings** - They often precede failures
- **Don't skip freshness** - Stale data is bad data
- **Don't hardcode thresholds** - Use dynamic baselines
- **Don't test in isolation** - Test relationships too