LangChain Multi-Env Setup (Python)

Overview

A team ships a LangChain 1.0 service to staging with python-dotenv loading .env.staging into os.environ. A security audit then runs docker exec STAGING-POD env, which prints ANTHROPIC_API_KEY=sk-ant-api03-... in plain text. Anyone with kubectl exec access, any sidecar, any core dump, and any error tracker that auto-captures the process environment can see the key. This is pain P37: secrets loaded from .env in production containers leak via env.

A second failure chains onto the first. A developer runs the staging deploy from a shell where LANGCHAIN_ENV=production was set hours earlier. The loader picks the prod .env, staging answers with a prompt commit tuned only for the prod model tier, and latency doubles. Two root causes: no type-safe env gate, and no startup validation that would have caught the mismatched model id.

Both are one refactor:

# BAD: dotenv populates os.environ; any process with container access sees it
import os
from dotenv import load_dotenv
load_dotenv(".env.production")
api_key = os.environ["ANTHROPIC_API_KEY"]  # P37: leaks via `docker exec env`

# GOOD: SecretStr in a validated Settings object, pulled from Secret Manager
from typing import Literal
from pydantic import SecretStr
from pydantic_settings import BaseSettings

class Settings(BaseSettings):
    env: Literal["dev", "staging", "prod"]
    anthropic_api_key: SecretStr

settings = build_settings()  # defined in Step 2; pulls from GCP Secret Manager in prod
api_key = settings.anthropic_api_key.get_secret_value()
# repr(settings) shows the key as SecretStr('**********'), so it is safe to log

This skill owns the per-env config plumbing — Settings skeleton, Secret Manager integration, per-env pinning, startup smoke test. It does not own the full secrets lifecycle (rotation, revocation, scope) — that belongs to langchain-security-basics.

Pin: langchain-core 1.0.x, langchain-anthropic 1.0.x, pydantic >= 2.5, pydantic-settings >= 2.1. Pain anchors: P37 (primary), P20 (checkpointer schema — cross-ref langchain-langgraph-checkpointing).

Two numbers: smoke test < 10 seconds; env-var count ~15-30 (more than 30 means Settings is absorbing feature flags and should split).

Prerequisites

Python 3.10+ (needed for the X | None annotations used below; 3.11+ recommended if you want StrEnum)
langchain-core >= 1.0, < 2.0
pydantic >= 2.5, pydantic-settings >= 2.1
One secret backend: GCP Secret Manager (google-cloud-secret-manager), AWS Secrets Manager (boto3), or HashiCorp Vault (hvac)
Completed langchain-sdk-patterns — the Settings object is injected into the chain factories from that skill

Instructions

Run these six steps in order — each adds one invariant the next step depends on:

Define a Settings class with SecretStr keys, Literal env, and fail-fast validation.
Add a per-env loader — file in dev, env vars in staging, Secret Manager in prod.
Use the cloud Secret Manager client to pull keys into memory only.
Pin model_id, prompt_commit_hash, and vector_index_name per env.
Configure the checkpointer per env — memory in dev, Postgres elsewhere.
Run a startup smoke test under 10 seconds before the HTTP server binds.

Step 1 — Create a Settings class with SecretStr and fail-fast validation

from typing import Literal
from pydantic import SecretStr, HttpUrl, PostgresDsn, Field, ValidationError
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    model_config = SettingsConfigDict(
        env_file=None,              # see Step 2: the loader picks the file
        env_file_encoding="utf-8",
        case_sensitive=False,
        extra="forbid",             # reject unknown keys (env file, init kwargs)
    )

    # --- env switch (drives everything else) ---
    env: Literal["dev", "staging", "prod"] = Field(..., alias="LANGCHAIN_ENV")

    # --- secrets (always SecretStr, never str) ---
    anthropic_api_key: SecretStr = Field(..., alias="ANTHROPIC_API_KEY")
    openai_api_key: SecretStr = Field(..., alias="OPENAI_API_KEY")
    langsmith_api_key: SecretStr = Field(..., alias="LANGSMITH_API_KEY")

    # --- per-env pinning (see Step 4) ---
    model_id: str = Field(..., alias="LANGCHAIN_MODEL_ID")
    prompt_commit_hash: str = Field(..., alias="LANGCHAIN_PROMPT_COMMIT")
    vector_index_name: str = Field(..., alias="LANGCHAIN_VECTOR_INDEX")

    # --- endpoints (validated URLs, bad values caught at startup) ---
    # The checkpointer is Postgres (Step 5); HttpUrl would reject postgresql:// URLs
    checkpointer_url: PostgresDsn | None = Field(None, alias="LANGCHAIN_CHECKPOINTER_URL")
    otel_endpoint: HttpUrl = Field(..., alias="OTEL_EXPORTER_OTLP_ENDPOINT")

    # --- budget guards (per-env) ---
    max_cost_usd_per_day: float = Field(10.0, alias="LANGCHAIN_DAILY_BUDGET_USD")
    max_rpm: int = Field(60, alias="LANGCHAIN_MAX_RPM")

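SecretStr masks the key as SecretStr('**********') in repr(settings), so a routine logger.info(settings) cannot leak it. The only way to read plaintext is .get_secret_value(), which is easy to grep for in review. extra="forbid" rejects unknown keys in the dev .env file or in the kwargs passed to Settings(...), so a typo such as LANGCHIN_MODEL_ID fails when Settings is constructed at startup; a misspelled variable in os.environ surfaces instead as a missing required field. HttpUrl rejects a value with a missing or non-HTTP scheme (for example otel:4318) at startup, before the exporter spends time retrying a bad endpoint.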

See Settings Skeleton for the full class.
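The masking and unknown-key rejection can be checked in isolation. The sketch below uses a plain pydantic BaseModel rather than BaseSettings so that it runs without any env setup; DemoSettings, the misspelled field name, and the key value are made up for illustration.

```python
from pydantic import BaseModel, ConfigDict, SecretStr, ValidationError


class DemoSettings(BaseModel):
    # Hypothetical stand-in for Settings: same field type, same extra policy
    model_config = ConfigDict(extra="forbid")
    anthropic_api_key: SecretStr


demo = DemoSettings(anthropic_api_key="sk-ant-not-a-real-key")

masked_repr = repr(demo)                # key appears as SecretStr('**********')
masked_json = demo.model_dump_json()    # the JSON dump is masked as well
plaintext = demo.anthropic_api_key.get_secret_value()  # the one explicit escape hatch

# A misspelled keyword is rejected instead of being silently ignored
try:
    DemoSettings(anthropic_api_key="sk-ant-not-a-real-key", anthropic_api_kye="oops")
    typo_rejected = False
except ValidationError:
    typo_rejected = True
```

Because every plaintext read goes through .get_secret_value(), a reviewer can audit all of them with a single search.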

Step 2 — Per-env config loading (file OR Secret Manager, never both)

import os
from pathlib import Path

def build_settings() -> Settings:
    # No default: an unset LANGCHAIN_ENV falls through to the ValueError below
    env = os.environ.get("LANGCHAIN_ENV")

    if env == "dev":
        # Local dev: .env.dev file, values checked into 1Password not git
        return Settings(_env_file=Path(".env.dev"))

    if env == "staging":
        # CI / staging: env vars injected by the orchestrator
        # (GitHub Actions secrets, k8s envFrom: secretRef, etc.)
        return Settings()  # reads os.environ directly

    if env == "prod":
        # Prod: pull from Secret Manager into memory ONLY
        values = pull_from_secret_manager()
        return Settings(**values)

    raise ValueError(f"unknown LANGCHAIN_ENV: {env!r}")

Three loaders, one class. Dev touches a file on disk. Staging inherits env vars from the orchestrator — envFrom: secretRef is readable via docker exec env, but the blast radius is bounded and rotation is weekly.

Prod is the P37 fix: pull_from_secret_manager() builds a dict and passes kwargs to Settings(...). Values land in the instance attribute and never touch os.environ. A subprocess will not inherit them.
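The inheritance claim is easy to verify with the standard library alone. In this sketch, child_sees is a hypothetical helper and both variable names are placeholders; it spawns a child interpreter and asks whether a given name is present in the child's environment.

```python
import os
import subprocess
import sys


def child_sees(name: str) -> bool:
    """Return True if a freshly spawned child process finds `name` in its env."""
    code = f"import os, sys; sys.exit(0 if {name!r} in os.environ else 1)"
    return subprocess.run([sys.executable, "-c", code]).returncode == 0


# dotenv-style: the value is written into the process environment
os.environ["DEMO_LEAKY_KEY"] = "not-a-real-key"
# Settings-style: the value lives only in a Python object
in_memory_only = {"DEMO_SAFE_KEY": "not-a-real-key"}

leaky_visible = child_sees("DEMO_LEAKY_KEY")
safe_visible = child_sees("DEMO_SAFE_KEY")

del os.environ["DEMO_LEAKY_KEY"]  # clean up the demo variable
```

The child inherits whatever is in os.environ at spawn time, which is the same mechanism that exposes keys to docker exec env.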

Step 3 — Secret Manager pull (GCP example; AWS / Vault in reference)

import os

from google.cloud import secretmanager

def pull_from_secret_manager() -> dict[str, str]:
    client = secretmanager.SecretManagerServiceClient()
    project = os.environ["GCP_PROJECT_ID"]
    secret_names = ["ANTHROPIC_API_KEY", "OPENAI_API_KEY", "LANGSMITH_API_KEY"]
    out: dict[str, str] = {}
    for name in secret_names:
        resource = f"projects/{project}/secrets/{name}/versions/latest"
        response = client.access_secret_version(request={"name": resource})
        out[name] = response.payload.data.decode("utf-8")
    # Non-secret passthrough (model id, prompt hash, endpoints)
    for key in ["LANGCHAIN_ENV", "LANGCHAIN_MODEL_ID", "LANGCHAIN_PROMPT_COMMIT",
                "LANGCHAIN_VECTOR_INDEX", "LANGCHAIN_CHECKPOINTER_URL",
                "OTEL_EXPORTER_OTLP_ENDPOINT"]:
        if key in os.environ:
            out[key] = os.environ[key]
    return out

No os.environ[k] = v line. The dict goes straight into Settings(**values). Workload-identity IAM handles auth; no static key on disk. For AWS / Vault see Secret Manager Integration.
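To unit-test the pull without a GCP project, one option is to inject the accessor. In this sketch, collect_secrets is a hypothetical helper (not part of any library) and fetch stands in for the client.access_secret_version(...).payload.data lookup.

```python
import os
from typing import Callable


def collect_secrets(fetch: Callable[[str], bytes], names: list[str]) -> dict[str, str]:
    """Build the kwargs dict for Settings(**values) without writing to os.environ."""
    return {name: fetch(name).decode("utf-8") for name in names}


# Fake backend for a test; real code would wrap the Secret Manager client here
fake_backend = {"ANTHROPIC_API_KEY": b"not-a-real-key"}

env_before = dict(os.environ)
values = collect_secrets(fake_backend.__getitem__, ["ANTHROPIC_API_KEY"])
env_unchanged = dict(os.environ) == env_before
```

The same function shape would accept an AWS or Vault accessor, since only the fetch callable changes.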

Step 4 — Per-env model and prompt pinning

Each env pins its own model id and prompt commit hash, and the values may differ between envs. Pinning happens at env-var level so app code is env-agnostic (see the Env Matrix below for values). One function reads settings.prompt_commit_hash and pulls from LangSmith (cross-ref langchain-prompt-engineering):

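from langchain_core.prompts import ChatPromptTemplate
from langsmith import Client

def get_prompt(settings: Settings) -> ChatPromptTemplate:
    # Build the client from the validated Settings, not from os.environ
    ls = Client(api_key=settings.langsmith_api_key.get_secret_value())
    # "name:commit" pins the pull to one exact prompt commit
    return ls.pull_prompt(f"triage-prompt:{settings.prompt_commit_hash}")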

Prevents: staging loading a prod prompt commit. Pinning per env makes promotion explicit — dev → staging → prod moves one hash at a time. See Per-Env Pinning.
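A CI gate for the one-hash-at-a-time promotion could look like the sketch below. can_promote and the validated record are hypothetical; a real pipeline would load that record from wherever it tracks which hashes passed validation in each env.

```python
PROMOTION_ORDER = ("dev", "staging", "prod")


def can_promote(commit_hash: str, target_env: str, validated: dict[str, set[str]]) -> bool:
    """Allow a prompt hash into target_env only if the env before it validated it."""
    position = PROMOTION_ORDER.index(target_env)  # ValueError on an unknown env
    if position == 0:
        return True  # dev accepts any work-in-progress hash
    upstream = PROMOTION_ORDER[position - 1]
    return commit_hash in validated.get(upstream, set())


# Hypothetical record of which hashes passed validation in each env
validated = {"dev": {"a1b2c3", "d4e5f6"}, "staging": {"a1b2c3"}}
```

Here d4e5f6 may move to staging but not to prod, because staging has not validated it yet.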

Step 5 — Per-env checkpointer selection

Checkpointer choice is per-env too:

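from langgraph.checkpoint.memory import MemorySaver
from langgraph.checkpoint.postgres import PostgresSaver
from psycopg import Connection
from psycopg.rows import dict_row

def build_checkpointer(settings: Settings):
    if settings.env == "dev":
        return MemorySaver()          # ephemeral, resets on restart
    if settings.checkpointer_url is None:
        raise ValueError("LANGCHAIN_CHECKPOINTER_URL is required outside dev")
    # staging + prod: Postgres with env-isolated schema
    # cross-ref langchain-langgraph-checkpointing (P20) for schema migration
    # from_conn_string() is a context manager, so a long-lived saver opens
    # the connection directly with the options PostgresSaver expects
    conn = Connection.connect(
        str(settings.checkpointer_url),
        autocommit=True,
        prepare_threshold=0,
        row_factory=dict_row,
    )
    return PostgresSaver(conn)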

Dev uses MemorySaver — no infra dependency, no state between runs. Staging and prod use PostgresSaver against separate databases (or separate schemas). Never share a checkpointer DB between envs; P20 explains — schema migrations on a version bump corrupt cross-env threads.

Step 6 — Startup smoke test (< 10 seconds budget)

import time
from anthropic import Anthropic

def validate_integrations(settings: Settings) -> None:
    t0 = time.monotonic()

    # 1. Model reachable (1-token ping ~ $0.00001)
    anthropic = Anthropic(api_key=settings.anthropic_api_key.get_secret_value())
    anthropic.messages.create(
        model=settings.model_id,
        max_tokens=1,
        messages=[{"role": "user", "content": "hi"}],
    )

    # 2. Checkpointer reachable
    if settings.env != "dev":
        checkpointer = build_checkpointer(settings)
        checkpointer.setup()  # creates or migrates the checkpoint tables; raises if the DB is unreachable

    # 3. Vector store reachable (see langchain-embeddings-search)
    # ... describe_index call here ...

    # 4. Observability endpoint reachable (OTLP HTTP health)
    # ... requests.get(f"{settings.otel_endpoint}/health", timeout=2) ...

    elapsed = time.monotonic() - t0
    if elapsed > 10.0:
        raise RuntimeError(
            f"startup smoke test took {elapsed:.1f}s (budget 10s)"
        )

Call validate_integrations(settings) before the HTTP server binds. Failure aborts the deploy — the readiness probe never goes green, the rollout halts, the bad version takes no traffic. Budget: 10 seconds. Past 10s an integration is degraded — fail loudly rather than ship a 30s cold start. See Startup Smoke Test.
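Running the probes one after another makes the budget the sum of all latencies. A possible variant, sketched here with the standard library only, runs them in parallel and labels each one ok, failed, or timeout. run_probes, failing_probe, and the stand-in probes are hypothetical; real probes would be the model ping, the checkpointer setup, and so on.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait
from typing import Callable


def run_probes(probes: dict[str, Callable[[], None]], budget_s: float = 10.0) -> dict[str, str]:
    """Run every probe in its own thread and report ok / failed / timeout per probe."""
    pool = ThreadPoolExecutor(max_workers=max(1, len(probes)))
    futures = {pool.submit(fn): name for name, fn in probes.items()}
    done, not_done = wait(futures, timeout=budget_s)
    # Do not block on stragglers; a thread that is already running cannot be killed
    pool.shutdown(wait=False, cancel_futures=True)

    results = {futures[f]: "timeout" for f in not_done}
    for f in done:
        exc = f.exception()
        results[futures[f]] = "ok" if exc is None else f"failed: {exc}"
    return results


def failing_probe() -> None:
    raise ConnectionError("checkpointer unreachable")


# Stand-in probes with a deliberately tiny budget so the slow one times out
demo_results = run_probes(
    {
        "model": lambda: None,
        "checkpointer": failing_probe,
        "otel": lambda: time.sleep(1.0),
    },
    budget_s=0.2,
)
```

A caller would raise if any value is not "ok", which keeps the fail-the-deploy behavior while naming the integration that broke.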

Output

Settings class on pydantic-settings with SecretStr for keys, Literal env, HttpUrl endpoints, extra="forbid"
Env-specific loader (file → dev; env vars → staging; Secret Manager → prod); values land in Settings only, never os.environ
Cloud Secret Manager integration (GCP / AWS / Vault) with IAM-bound auth; no static keys on disk
Per-env pinning for model_id, prompt_commit_hash, vector_index_name, checkpointer_url
Per-env checkpointer (MemorySaver dev, PostgresSaver on isolated DBs staging/prod)
Startup smoke test — model / vector / checkpointer / observability under 10-second budget

Env Matrix

Dimension	dev	staging	prod
Secret backend	`.env.dev` file (git-ignored)	orchestrator env vars	cloud Secret Manager, memory only
`os.environ` holds keys	yes (local)	yes (sidecar visible)	no (P37 fix)
`model_id`	`claude-haiku-4-6`	`claude-sonnet-4-6`	`claude-sonnet-4-6`
`prompt_commit_hash`	WIP	canary	stable (1 week old)
`temperature`	0.7	0.2	0.2
Checkpointer	`MemorySaver`	`PostgresSaver` (staging DB)	`PostgresSaver` (prod DB)
Vector index	`dev-index`	`staging-index`	`prod-index`
OTEL sample rate	1.0	1.0	0.1
RPM limit	10	60	provider tier
Daily budget	$1	$10	$500-$5000
Smoke probes	model	model + checkpointer + OTEL	all four

Error Handling

Error	Cause	Fix
`docker exec POD env` shows `ANTHROPIC_API_KEY=...` in prod (P37)	`dotenv` / plain env injection in prod	Pull from Secret Manager into `Settings(**values)`; never write to `os.environ`
Staging answers with prod prompts / wrong model	Loader defaulted or picked stale `LANGCHAIN_ENV`	`Literal["dev","staging","prod"]` on env; raise on unknown; no default
`ValidationError: Extra inputs are not permitted` at startup	Unknown key, usually a typo (`LANGCHIN_MODEL_ID`)	Fix the typo; `extra="forbid"` is working as intended
Startup takes 30s before first request	Serialized probes or degraded integration	Enforce 10s budget; parallelize probes; fail the deploy
`repr(settings)` in a log leaks the API key	Plain `str` used, not `SecretStr`	Change field to `SecretStr`; repr masks to `'**********'`
Prod silently using `MemorySaver`	`build_checkpointer` defaulted when `checkpointer_url` was None	Require `checkpointer_url` in staging/prod via a model validator
Secret Manager auth fails in CI	SA not bound; `google.auth` fell back to ADC	Bind SA with `roles/secretmanager.secretAccessor`
Prompt hash rolled forward in staging without dev validation	Promotion skipped the dev gate	Enforce dev → staging → prod order in CI (see per-env pinning ref)
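The model-validator fix for the silent MemorySaver case might be sketched as follows. CheckpointerGate is a hypothetical two-field slice of Settings, and the URL is typed as a plain string here only to keep the example small.

```python
from typing import Literal, Optional

from pydantic import BaseModel, ValidationError, model_validator


class CheckpointerGate(BaseModel):
    # Hypothetical slice of Settings, just enough to show the cross-field rule
    env: Literal["dev", "staging", "prod"]
    checkpointer_url: Optional[str] = None

    @model_validator(mode="after")
    def require_url_outside_dev(self) -> "CheckpointerGate":
        if self.env != "dev" and self.checkpointer_url is None:
            raise ValueError("checkpointer_url is required in staging and prod")
        return self


def is_valid(**kwargs) -> bool:
    """Return True if the kwargs pass validation, False on ValidationError."""
    try:
        CheckpointerGate(**kwargs)
        return True
    except ValidationError:
        return False
```

With this rule in place, a prod deploy that forgets the URL fails at Settings construction rather than at the first checkpoint write.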

Examples

Graduating a `.env`-in-dev service to prod

Start: a single .env committed (or leaked via docker exec env). End: Settings class, three loaders, Secret Manager in prod, smoke test under 10s. Three PRs — (1) introduce Settings without changing loader behavior, (2) add SecretStr and migrate call sites to .get_secret_value(), (3) swap prod to Secret Manager and remove the prod .env from the image. See Settings Skeleton and Secret Manager Integration.

Wrong-env prompt loaded in staging — postmortem

Staging inherited LANGCHAIN_ENV=production from a stale shell. The Literal["dev","staging","prod"] field rejects production; CI promotion sets LANGCHAIN_ENV explicitly; direnv pins it per-project. See Per-Env Pinning.
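The rejection can be reproduced with a one-field model. EnvGate and accepts are hypothetical names used only for this sketch; the Literal annotation is the same one the Settings class uses.

```python
from typing import Literal

from pydantic import BaseModel, ValidationError


class EnvGate(BaseModel):
    # Hypothetical one-field slice of Settings
    env: Literal["dev", "staging", "prod"]


def accepts(value: str) -> bool:
    """Return True if `value` is one of the three allowed env names."""
    try:
        EnvGate(env=value)
        return True
    except ValidationError:
        return False
```

Literal matching is exact, so near misses such as production or Prod are rejected along with unrelated values.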

Smoke test blocked a bad model id

A prod deploy went out with LANGCHAIN_MODEL_ID=claude-sonnet-4-7 (not yet rolled out). The 1-token ping failed with model not found, validate_integrations raised, the container crash-looped, the rollout halted, and the previous version kept taking traffic. Zero user impact; the smoke test failed in under 3s. See Startup Smoke Test.

Resources

Pydantic Settings docs
Pydantic SecretStr
GCP Secret Manager client
AWS Secrets Manager boto3
HashiCorp Vault hvac
LangChain 1.0 release notes
Related skills in pack: langchain-security-basics (secrets lifecycle, owns rotation and revocation — not duplicated here); langchain-langgraph-checkpointing (P20 schema migration); langchain-prompt-engineering (prompt pin / LangSmith pull workflow); langchain-reference-architecture (where Settings fits in the DI layer)
Pack pain catalog: docs/pain-catalog.md (entries P37 primary, P20 cross-ref)