Entropy Data integration for dbt projects
Make sure a dbt project is well-integrated with Entropy Data.
What "well-integrated" means
A dbt project is well-integrated with Entropy Data when it has all of:
| # | Artifact | Path | Purpose |
|---|---|---|---|
| 1 | Open Data Product Specification | <data-product-id>.odps.yaml at repo root | Declares the data product, team, output ports |
| 2 | Output-port data contracts | models/output_ports/v<N>/<contract-id>.odcs.yaml (one per output port: what this data product commits to produce) | Schema + server config the contract test runs against; colocated with the SQL that implements it |
| 3 | Input-port data contracts | models/input_ports/<provider-output-port-id>.odcs.yaml (one per active access agreement: what this data product trusts upstream to produce) | Cached snapshot of the upstream provider's ODCS; refreshed via entropy-data datacontracts get, never hand-edited |
| 4 | OpenLineage transport | openlineage.yml at repo root | Makes dbt-ol run send lineage to api.entropy-data.com |
| 5 | Model layout | models/{input_ports,staging,intermediate,output_ports/v1}/ | Convention that mirrors the data product's lifecycle |
| 6 | Publish workflow | .github/workflows/data-product.yml | CI: dbt run/test → publish ODPS + output ODCS → run contract test |
| 7 | Git connections | One per ODPS + one per output-port ODCS, registered via entropy-data dataproducts gitconnection put and entropy-data datacontracts gitconnection put | Lets Entropy Data link the published spec back to the YAML in the repo, and enables pull / push / push-pr from the CLI. Input-port ODCS files are not registered: they belong to the upstream data product, which owns its own git connection |
How to run this skill
Work in this exact order. Do not skip the audit.
${PLUGIN_ROOT} below refers to the root of this plugin, i.e. the directory that contains skills/. On Claude Code it is set automatically as ${CLAUDE_PLUGIN_ROOT}; use that. On any other agent (Codex, Copilot CLI, etc.) it is unset; resolve it as ../.. relative to this SKILL.md file's directory (i.e. the grandparent of skills/<this-skill>/).
Plan announcement (before Step 0)
Before running Step 0, print the following plan to the user verbatim so they know what's about to happen:
Running entropy-data-sync. I'll:
- Verify the entropy-data CLI is installed and connected.
- Confirm this is a dbt project and pick up its name.
- Audit existing Entropy Data artifacts (ODPS, ODCS, OpenLineage, model layout, publish workflow, git connections).
- Gather any missing parameters from you (one batched question).
- Apply fixes — create missing files, patch incomplete ones, register git connections.
- Summarize what changed and what's deferred.
Then proceed.
Step 0 — Verify the Entropy Data CLI connection
Confirm uv run --quiet entropy-data --version succeeds from the project root. If it fails, run uv sync (the bootstrap template seeds entropy-data as a dev dep) and retry. If still missing, stop and tell the user to verify entropy-data is in pyproject.toml's [dependency-groups].dev. Use uv run entropy-data … for every CLI invocation in this skill.
Run entropy-data connection test. If it fails (no connection, expired key, etc.), stop and tell the user to run entropy-data connection add <name> --host <host> --api-key <key> first. Do not prompt for the key yourself.
Step 1 — Confirm this is a dbt project
Check that dbt_project.yml exists at the working directory root. If not, stop and tell the user this skill only works inside a dbt project.
Read dbt_project.yml and remember the name: value — call it DBT_PROJECT_NAME. By convention it is also the dbt profile and the data product id.
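The check in Step 1 can be sketched as follows. This is a minimal illustration that uses only the standard library and a line-based match instead of a YAML parser; the function name read_dbt_project_name is hypothetical, and a real implementation would more likely parse the file properly.

```python
import re
from pathlib import Path
from typing import Optional


def read_dbt_project_name(project_dir: str) -> Optional[str]:
    """Return the top-level `name:` value from dbt_project.yml, or None.

    None means either the file is missing (not a dbt project) or it has
    no top-level name key. Hypothetical helper, not part of any CLI.
    """
    path = Path(project_dir) / "dbt_project.yml"
    if not path.is_file():
        return None
    for line in path.read_text(encoding="utf-8").splitlines():
        # Match only an unindented key so nested `name:` entries are ignored;
        # a trailing `# comment` is dropped.
        match = re.match(r"^name:\s*(.+?)\s*(?:#.*)?$", line)
        if match:
            return match.group(1).strip("'\"")
    return None
```

A None result corresponds to the "stop and tell the user" branch; any other value becomes DBT_PROJECT_NAME.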
Step 2 — Audit
For each row in the table above, check whether the artifact is present. For row 7 (git connections), call:
- entropy-data dataproducts gitconnection get <DATA_PRODUCT_ID> -o json
- entropy-data datacontracts gitconnection get <CONTRACT_ID> -o json for each output-port contract under models/output_ports/**/
For rows 2 and 3, glob the file system:
- Output contracts: models/output_ports/**/*.odcs.yaml
- Input contracts: models/input_ports/*.odcs.yaml
Legacy projects may still have contracts at the old datacontracts/ path. If datacontracts/ exists and contains *.odcs.yaml files, flag those rows as migration needed and surface the move target (output contracts → models/output_ports/v1/; input contracts, if any, → models/input_ports/). Do not move them silently — Step 4 asks the user.
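As an illustration of the file-system part of the audit, the sketch below collects output, input, and legacy contract files with the glob patterns named above. The function name audit_contract_files and the shape of the returned dictionary are assumptions made for this example.

```python
from pathlib import Path
from typing import Dict, Iterable, List


def audit_contract_files(repo_root: str) -> Dict[str, List[str]]:
    """Group *.odcs.yaml files by role, as repo-relative POSIX paths.

    Hypothetical helper: a non-empty "legacy" list means the audit should
    flag "migration needed" rather than move anything.
    """
    root = Path(repo_root)

    def rel(paths: Iterable[Path]) -> List[str]:
        return sorted(p.relative_to(root).as_posix() for p in paths)

    return {
        # `**` also matches zero directories, so files directly under
        # models/output_ports/ are included as well as v1/, v2/, ...
        "output": rel(root.glob("models/output_ports/**/*.odcs.yaml")),
        "input": rel(root.glob("models/input_ports/*.odcs.yaml")),
        "legacy": rel(root.glob("datacontracts/*.odcs.yaml")),
    }
```

Missing directories simply produce empty lists, so the same call works on a freshly bootstrapped project.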
If a get returns a 404 (or "not found"), mark that connection as missing. If it returns a connection whose repository-url / repository-path / repository-branch does not match the local repo, mark it as drifted and call it out separately — do not silently overwrite. If the underlying data product or contract doesn't exist on the platform yet (the workflow hasn't run for the first time), or if the working directory is not a git repository (git rev-parse --is-inside-work-tree errors or returns false), mark git connections as deferred with a one-line explanation.
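The classification rules for git connections can be summarized in a small function. This is a sketch under assumptions: it supposes the JSON returned by gitconnection get uses the keys repository-url, repository-path, and repository-branch at the top level (the source names these fields but not the exact JSON shape), and the function name and the "present" label are hypothetical.

```python
from typing import Mapping, Optional

# Assumed JSON keys, taken from the field names mentioned in the text.
COMPARED_FIELDS = ("repository-url", "repository-path", "repository-branch")


def classify_git_connection(
    remote: Optional[Mapping[str, str]],
    local: Mapping[str, str],
    published: bool = True,
    in_git_repo: bool = True,
) -> str:
    """Return "deferred", "missing", "drifted", or "present".

    remote is the parsed `gitconnection get` record, or None on 404.
    published says whether the data product / contract exists on the platform.
    """
    if not published or not in_git_repo:
        return "deferred"  # nothing to register yet
    if remote is None:
        return "missing"
    if any(remote.get(field) != local.get(field) for field in COMPARED_FIELDS):
        return "drifted"  # report separately, never overwrite silently
    return "present"
```

Only "missing" (and "drifted", after the user confirms) should lead to a put call in Step 4b.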
For row 1 (ODPS file), also check that the top-level customProperties list contains an entry with property: "dataProductBuilder" and value: "https://github.com/entropy-data/dataproduct-builder-dbt". If the file exists but the property is missing, mark the ODPS as incomplete with a one-line note ("missing dataProductBuilder customProperty"); Step 4 will add it without touching other fields. Forks of this plugin should substitute their own builder URL in the template before publishing.
Produce a short audit report like:
Entropy Data integration audit for <DBT_PROJECT_NAME>:
[✓] ODPS file
[✗] Output-port contracts (no *.odcs.yaml under models/output_ports/)
[⏸] Input-port contracts (none — populated by dataproduct-implement from access agreements)
[✓] openlineage.yml
[✗] Model layout (no models/output_ports)
[✗] GitHub Actions publish workflow
[⏸] Git connections (deferred: data product not yet published — run the workflow first)
Show the report. Then list what you intend to create. Wait for the user to confirm before writing any files.
Step 3 — Gather parameters (only ask for what you cannot infer)
Before generating files, fill in these placeholders. Infer from the project where you can; ask the user for the rest in one batched question.
| Placeholder | Default / inference | Notes |
|---|---|---|
| DATA_PRODUCT_ID | DBT_PROJECT_NAME | Used as id in ODPS and as the dbt profile name |
| DATA_PRODUCT_NAME | Title-cased DBT_PROJECT_NAME | Human-friendly name |
| OUTPUT_PORT_NAME | DBT_PROJECT_NAME | One output port per ODCS file |
| CONTRACT_ID | <DATA_PRODUCT_ID>-v1 | Stable id used by entropy-data datacontracts put |
| CONTRACT_FILE | <contract_id>.odcs.yaml | File under models/output_ports/v1/ |
| CONTRACT_PATH | models/output_ports/v1/<CONTRACT_FILE> | Full repo-relative path; used by --repository-path, the CI workflow, and datacontract test |
| TABLE | last segment of DBT_PROJECT_NAME | Output table name |
| PURPOSE | — | Ask the user (one sentence) |
| TEAM_NAME | — | If <DATA_PRODUCT_ID>.odps.yaml already exists with a team.name, use that. Otherwise, prefer a team id registered in Entropy Data: invoke the entropy-data-teams skill (in this same plugin) so the user can pick from the existing teams, and use the returned id. Fall back to a free-text answer only if entropy-data-teams cannot run (CLI unavailable / not authenticated) |
| TAG | — | Ask the user (e.g. a usecases/... slug) |
| PLATFORM | — | Ask the user: databricks, snowflake, bigquery, s3, postgres |
| CATALOG / SCHEMA | — | Ask the user (Databricks: catalog + schema; Snowflake: database + schema; BigQuery: project + dataset) |
| DBT_PROFILE | DBT_PROJECT_NAME | Used in the workflow's profiles.yml block |
| ODPS_FILE | <DATA_PRODUCT_ID>.odps.yaml | Path passed to entropy-data dataproducts put |
| API_HOST | entropy-data connection get -o json → host | Resolve in Step 4, only when writing openlineage.yml or the workflow. Uses the same host the CLI is authenticated against, so lineage and CI publish hit the same deployment |
| GIT_REPOSITORY_URL | git remote get-url origin | Used by gitconnection put. If no origin, ask the user; if the remote is git@… SSH form, convert to the equivalent HTTPS URL the platform expects |
| GIT_REPOSITORY_BRANCH | git rev-parse --abbrev-ref HEAD, falling back to main | Used by gitconnection put; if HEAD is detached, ask the user |
| GIT_CONNECTION_TYPE | inferred from GIT_REPOSITORY_URL: github.com → github, gitlab.com → gitlab, bitbucket.org → bitbucket, dev.azure.com / *.visualstudio.com → azuredevops | Ask the user only if the host doesn't match any of these |
| GIT_HOST | the URL host, only when self-hosted (i.e. not one of the SaaS hosts above); otherwise omit | Passed as --host to gitconnection put |
| GIT_CREDENTIAL_EXTERNAL_ID | — | Optional. Ask the user; if they don't have one yet, leave the connection unauthenticated (it can still be used for read-only metadata in the UI) |
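The git-related inferences in the table (SSH to HTTPS conversion, GIT_CONNECTION_TYPE, GIT_HOST) might look like the sketch below. The helper names are hypothetical. The conversion covers only the simple scp-style git@host:path form and keeps any .git suffix, since the text does not say which exact URL shape the platform expects; Azure DevOps SSH remotes in particular use a different path layout and would need separate handling.

```python
import re
from typing import Optional, Tuple
from urllib.parse import urlparse

# Host-to-type mapping copied from the GIT_CONNECTION_TYPE row.
SAAS_HOSTS = {
    "github.com": "github",
    "gitlab.com": "gitlab",
    "bitbucket.org": "bitbucket",
    "dev.azure.com": "azuredevops",
}


def ssh_to_https(url: str) -> str:
    """Turn git@host:owner/repo into https://host/owner/repo; pass others through."""
    match = re.match(r"^git@([^:]+):(.+)$", url)
    if match:
        host, path = match.groups()
        return f"https://{host}/{path}"
    return url


def infer_git_connection(url: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (GIT_CONNECTION_TYPE, GIT_HOST).

    The type is None when the host is not a known SaaS host (ask the user);
    GIT_HOST is set only in that self-hosted case.
    """
    host = (urlparse(ssh_to_https(url)).hostname or "").lower()
    if host in SAAS_HOSTS:
        return SAAS_HOSTS[host], None
    if host.endswith(".visualstudio.com"):
        return "azuredevops", None
    return None, host or None
```

For example, infer_git_connection("git@github.com:acme/orders.git") yields ("github", None), while an unrecognized host yields (None, "<that host>") so the skill knows to ask for the connection type and pass --host.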
Step 4 — Apply the fixes
For each missing artifact, copy the corresponding template from ${PLUGIN_ROOT}/skills/entropy-data-sync/templates/ into the user's project, substituting placeholders. Do not overwrite existing files; if a file is present but incomplete, surface the diff and ask before changing.
When (and only when) you're about to write openlineage.yml or .github/workflows/data-product.yml, resolve API_HOST from the active CLI connection:
entropy-data connection get -o json
Use the host field to substitute {{API_HOST}} in those templates. Self-hosted deployments are handled via entropy-data connection add --host <host>, not a plugin-level setting.
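A sketch of the substitution step, assuming the JSON printed by entropy-data connection get -o json is an object with a top-level host field (the text names the field but not the full shape). Both function names are hypothetical. The placeholder pattern deliberately matches only upper-case names such as {{API_HOST}}, so GitHub Actions expressions like ${{ secrets.ENTROPY_DATA_API_KEY }} in the workflow template are left untouched.

```python
import json
import re
from typing import Mapping


def host_from_connection_json(raw: str) -> str:
    """Extract the host from the CLI's JSON output (assumed top-level key)."""
    host = json.loads(raw).get("host")
    if not host:
        raise ValueError("connection JSON has no 'host' field")
    return host


def render_template(text: str, values: Mapping[str, str]) -> str:
    """Replace {{UPPER_CASE}} placeholders; fail loudly on unknown names."""

    def replace(match: "re.Match[str]") -> str:
        key = match.group(1)
        if key not in values:
            raise KeyError(f"no value for placeholder {key}")
        return values[key]

    return re.sub(r"\{\{\s*([A-Z_]+)\s*\}\}", replace, text)
```

Raising on an unknown placeholder keeps a half-rendered template from being written into the user's project.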
If the ODPS file exists but Step 2 flagged it as incomplete (missing dataProductBuilder customProperty), append the entry to the top-level customProperties list (do not reorder or touch other entries):
customProperties:
  - property: "dataProductBuilder"
    value: "https://github.com/entropy-data/dataproduct-builder-dbt"
Surface the diff and ask before saving.
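On an already-parsed ODPS document, the append-only patch could be sketched like this. The function name is hypothetical, and one choice here is an assumption of the sketch: any existing dataProductBuilder entry counts as present regardless of its value, so a fork's own builder URL is not duplicated.

```python
from typing import Any, Dict

BUILDER_PROPERTY = "dataProductBuilder"
BUILDER_URL = "https://github.com/entropy-data/dataproduct-builder-dbt"


def ensure_builder_property(odps: Dict[str, Any]) -> bool:
    """Append the builder entry if absent; return True when the dict changed.

    Existing entries keep their order and content.
    """
    props = odps.get("customProperties")
    if props is None:
        # Key missing or present with an empty value: start a fresh list.
        props = odps["customProperties"] = []
    if any(entry.get("property") == BUILDER_PROPERTY for entry in props):
        return False
    props.append({"property": BUILDER_PROPERTY, "value": BUILDER_URL})
    return True
```

The boolean result maps onto the final report: True suggests "updated", False "already present".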
The templates live at:
- templates/data-product.odps.yaml → write to <DATA_PRODUCT_ID>.odps.yaml
- templates/models/output_ports/v1/contract.odcs.yaml → write to <CONTRACT_PATH> (i.e. models/output_ports/v1/<CONTRACT_FILE>)
- templates/openlineage.yml → write to openlineage.yml
- templates/.github/workflows/data-product.yml → write to .github/workflows/data-product.yml
If the audit reported a legacy datacontracts/ directory, ask the user before moving its contents. The default migration is:
- datacontracts/<contract>.odcs.yaml → models/output_ports/v1/<contract>.odcs.yaml (if it matches an output port). If multiple output port versions exist, ask which one.
- Update the git connection's --repository-path for each moved contract (Step 4b).
- Delete the now-empty datacontracts/ directory only after the user confirms the move.
This skill does not create input-port ODCS files. They appear only when dataproduct-implement resolves an access agreement and writes the cached upstream contract to models/input_ports/. If the audit found stale input-port ODCS files (no matching .source.yaml), surface them as orphans and let the user decide whether to delete them.
For the model layout, create the directories models/input_ports/, models/staging/, models/intermediate/, models/output_ports/v1/ if absent, plus _models.yml placeholders so dbt does not warn about empty directories. Do not move existing models — only add the empty subfolders the user is missing, and note it in the report.
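The layout step can be sketched as an idempotent directory creator. The function name is hypothetical, and the placeholder content written to _models.yml (version: 2) is an assumption; the text only says a placeholder is added.

```python
from pathlib import Path
from typing import List

LAYOUT = (
    "models/input_ports",
    "models/staging",
    "models/intermediate",
    "models/output_ports/v1",
)
PLACEHOLDER = "version: 2\n"  # assumed minimal content for _models.yml


def ensure_model_layout(repo_root: str) -> List[str]:
    """Create missing layout folders and return the ones that were added.

    Existing folders, and anything inside them, are left untouched.
    """
    created: List[str] = []
    root = Path(repo_root)
    for rel_dir in LAYOUT:
        directory = root / rel_dir
        if directory.is_dir():
            continue
        directory.mkdir(parents=True)
        (directory / "_models.yml").write_text(PLACEHOLDER, encoding="utf-8")
        created.append(rel_dir)
    return created
```

The returned list is what gets noted in the report; an empty list on a second run reflects the idempotency rule.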
Also update dbt_project.yml's models: block so the materializations match the reference (output port = table, staging/intermediate = view):
models:
  <DBT_PROJECT_NAME>:
    +materialized: table
    staging:
      +materialized: view
    intermediate:
      +materialized: view
If the models: block already exists, only add missing keys; do not clobber the user's customizations.
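The "only add missing keys" rule amounts to a one-directional recursive merge. A minimal sketch on parsed YAML dictionaries, with a hypothetical function name:

```python
import copy
from typing import Any, Dict


def add_missing_keys(existing: Dict[str, Any], reference: Dict[str, Any]) -> Dict[str, Any]:
    """Copy keys from reference into existing only where they are absent.

    Values the user already set always win; nested mappings are merged
    recursively. existing is modified in place and returned.
    """
    for key, ref_value in reference.items():
        if key not in existing:
            existing[key] = copy.deepcopy(ref_value)
        elif isinstance(existing[key], dict) and isinstance(ref_value, dict):
            add_missing_keys(existing[key], ref_value)
    return existing
```

For instance, a project that already sets +materialized: incremental at the top level keeps that value, while a missing intermediate block is filled in from the reference.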
Step 4b — Configure git connections
Only run this sub-step if the audit (Step 2) flagged at least one git connection as missing or the user confirmed re-creating a drifted one. Skip entirely if every connection is already correct, if the audit marked them as deferred, or if the working directory is not a git repository (check with git rev-parse --is-inside-work-tree — if it errors or returns false, there's no remote to register; tell the user to run git init and add a remote first, then re-run this skill).
For the data product:
entropy-data dataproducts gitconnection put <DATA_PRODUCT_ID> \
  --repository-url <GIT_REPOSITORY_URL> \
  --repository-path <ODPS_FILE> \
  --repository-branch <GIT_REPOSITORY_BRANCH> \
  --git-connection-type <GIT_CONNECTION_TYPE> \
  [--host <GIT_HOST>] \
  [--git-credential-external-id <GIT_CREDENTIAL_EXTERNAL_ID>]
For each output-port ODCS file (models/output_ports/**/*.odcs.yaml):
entropy-data datacontracts gitconnection put <CONTRACT_ID> \
  --repository-url <GIT_REPOSITORY_URL> \
  --repository-path <CONTRACT_PATH> \
  --repository-branch <GIT_REPOSITORY_BRANCH> \
  --git-connection-type <GIT_CONNECTION_TYPE> \
  [--host <GIT_HOST>] \
  [--git-credential-external-id <GIT_CREDENTIAL_EXTERNAL_ID>]
Do not register git connections for input-port ODCS files. They are cached copies of upstream contracts; the upstream data product owns the canonical record.
Notes:
- --repository-path is relative to the repo root, not the working directory. The ODPS path is just <DATA_PRODUCT_ID>.odps.yaml; output-port contract paths look like models/output_ports/v<N>/<CONTRACT_FILE>.
- Omit --host for SaaS providers (github.com, gitlab.com, bitbucket.org, dev.azure.com); set it only for self-hosted instances.
- These commands fail if the underlying data product / contract does not exist on the platform yet. If you skipped earlier because of "deferred," surface the manual command in Step 5 so the user can run it after the first workflow run. Do not retry-loop.
- If the audit reported drift (existing connection with different URL/branch/path), confirm with the user before overwriting, since put is upsert.
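Assembling the two put commands above can be sketched as an argument-list builder, which also gives the exact command to print for deferred connections. The function name is hypothetical; the subcommands and flags are the ones shown in this step, and the uv run prefix follows Step 0.

```python
from typing import List, Optional


def gitconnection_put_args(
    kind: str,
    resource_id: str,
    repository_url: str,
    repository_path: str,
    repository_branch: str,
    connection_type: str,
    host: Optional[str] = None,
    credential_external_id: Optional[str] = None,
) -> List[str]:
    """Build the argv for `entropy-data <kind> gitconnection put`.

    kind is "dataproducts" (ODPS) or "datacontracts" (output-port ODCS).
    Optional flags are emitted only when a value is supplied.
    """
    if kind not in ("dataproducts", "datacontracts"):
        raise ValueError(f"unsupported kind: {kind}")
    args = [
        "uv", "run", "entropy-data", kind, "gitconnection", "put", resource_id,
        "--repository-url", repository_url,
        "--repository-path", repository_path,
        "--repository-branch", repository_branch,
        "--git-connection-type", connection_type,
    ]
    if host:  # self-hosted instances only
        args += ["--host", host]
    if credential_external_id:
        args += ["--git-credential-external-id", credential_external_id]
    return args
```

Returning a list rather than a string avoids shell-quoting issues when the command is executed, and " ".join(args) is enough for showing it to the user in the final report.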
Step 5 — Final report
Always end with this exact two-part format so the user gets a consistent recap.
Part 1 — outcome table. One row per artifact from the audit. Use the Status enum below; Details is a short, plain-text note (file path, or "—" if nothing to add).
| Artifact | Status | Details |
|---|---|---|
| ODPS file | … | … |
| Output-port contracts | … | <N> file(s) at models/output_ports/v<N>/<CONTRACT_FILE> |
| Input-port contracts | … | <N> file(s) at models/input_ports/<provider-output-port-id>.odcs.yaml (or "—" if no access agreements yet) |
| OpenLineage transport | … | … |
| Model layout | … | … |
| Publish workflow | … | … |
| Git connections | … | … |
Status enum (use exactly these words):
- created: the skill wrote a new file or registered a new connection.
- updated: the skill patched an existing file or fixed a drifted connection.
- already present: no change needed.
- deferred: skipped intentionally (data product/contract not yet published, or no git repo). The deferred command(s) appear in Part 2.
- skipped: the user declined when asked to confirm.
Part 2 — next steps. Bullet list, only include the items that apply:
- For each deferred git connection, the exact entropy-data dataproducts gitconnection put … or entropy-data datacontracts gitconnection put … command to run after the first CI publish.
- "Set GitHub repository secrets: ENTROPY_DATA_API_KEY, plus platform creds (DBT_DATABRICKS_HOST, DBT_DATABRICKS_HTTP_PATH, DBT_DATABRICKS_TOKEN for Databricks; equivalents for other platforms)."
- "Fill in the data contract schema in <CONTRACT_PATH>; the template only seeds id and updated_at."
- "Run dbt-ol run locally once to verify lineage flows to Entropy Data (requires OPENLINEAGE__TRANSPORT__AUTH__APIKEY)."
If there is nothing in Part 2, write a single line: No further action required.
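The two-part format lends itself to a small renderer that rejects any status outside the enum. This is only a sketch; the function names are hypothetical and the input is assumed to be (artifact, status, details) tuples collected during the run.

```python
from typing import Iterable, List, Tuple

STATUSES = ("created", "updated", "already present", "deferred", "skipped")


def render_outcome_table(rows: Iterable[Tuple[str, str, str]]) -> str:
    """Render Part 1; an empty details value becomes the "—" placeholder."""
    lines = ["| Artifact | Status | Details |", "|---|---|---|"]
    for artifact, status, details in rows:
        if status not in STATUSES:
            raise ValueError(f"unknown status: {status}")
        lines.append(f"| {artifact} | {status} | {details or '—'} |")
    return "\n".join(lines)


def render_next_steps(steps: List[str]) -> str:
    """Render Part 2, or the fixed single line when nothing applies."""
    if not steps:
        return "No further action required."
    return "\n".join(f"- {step}" for step in steps)
```

Validating the status up front keeps the recap consistent across runs, which is the point of the fixed enum.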
Conventions and constraints
- Platform-aware workflow: the workflow template assumes Databricks. If the user picks Snowflake/BigQuery/Postgres, swap the dbt-databricks install line, the Create profiles.yml block, and the DATACONTRACT_* env vars to the matching dialect. Do not generate a Databricks workflow for a non-Databricks project.
- No invented schema: when generating the ODCS file, do not invent columns. Seed it with id + updated_at and tell the user to fill in the rest, or, if dbt models already exist for the output port, derive columns from _models.yml if available.
- Idempotent: running the skill a second time should be a no-op when everything is already present. For git connections that means: if gitconnection get returns a record matching the local repo URL / branch / path, do not call put.
- Don't overwrite drifted git connections silently. If the platform reports a different URL/branch/path than the local repo, surface the diff and ask. The user may have a fork, a renamed default branch, or a deliberate path remap.
- Don't push secrets: never write API keys, tokens, or hostnames into committed files. They must come from GitHub secrets in the workflow.
- Don't create a git repo or commit: leave VCS state to the user.