MVP Limitations

ReplayLab's current MVP is local-first and intentionally narrow. These boundaries keep capture, replay, comparison, and generated regressions deterministic while the core loop stabilizes.

Supported Provider Paths

  • OpenAI Responses: non-streaming OpenAI().responses.create(...), OpenAI().responses.parse(...), AsyncOpenAI().responses.create(...), AsyncOpenAI().responses.parse(...), plus event-preserving responses.create(..., stream=True) for sync and async clients.
  • OpenAI Chat Completions: non-streaming sync OpenAI().chat.completions.create(...), currently validated through the CrewAI compatibility scenario.
  • Anthropic Messages: non-streaming Anthropic().messages.create(...), AsyncAnthropic().messages.create(...), sync raw-response Anthropic().messages.with_raw_response.create(...).parse(), and event-preserving core streaming through messages.create(..., stream=True) and messages.stream(...) for sync and async clients.
  • Gemini Google Gen AI: non-streaming Client().models.generate_content(...), Client().aio.models.generate_content(...), and event-preserving models.generate_content_stream(...) for sync and async clients.
  • HTTP clients: sync requests, sync httpx, and async httpx.AsyncClient.
  • Capture modes: same-process startup instrumentation with handle.capture(...), ASGI/FastAPI request middleware, framework-agnostic job decorators, and Python child-process auto-patching through replaylab run for local/CI workflows.
  • Replay modes: same-process adapters and Python child-process auto-patching through replaylab replay.
  • Framework compatibility scenarios: PydanticAI Agent calls using OpenAI Responses, LangGraph StateGraph nodes using supported provider calls (local-source and external realism), LangGraph ToolNode dispatch, LangChain ChatOpenAI(..., use_responses_api=True) calls, and LangChain @tool / bind_tools dispatch using OpenAI Responses, plus OpenAI Agents SDK @function_tool dispatch, LlamaIndex FunctionTool dispatch using OpenAI Responses, and CrewAI custom-tool dispatch using OpenAI Chat Completions.

Full-Payload Requirement

Replay can inspect metadata-only capsules, but it cannot serve provider responses from them. Use full payload capture when a capsule must be replayed or converted into a pytest provider replay guard.

For startup SDK capture:

import replaylab
from replaylab import CapturePayloadPolicy

handle = replaylab.init(
    project_name="support-bot",
    auto_patch_integrations="auto",
    capture_payload_policy=CapturePayloadPolicy.FULL,
)

"auto" enables all supported provider and framework patchers in stable order: OpenAI, Anthropic, Gemini, requests, httpx, PydanticAI tool dispatch, LangChain tool dispatch, LangGraph ToolNode dispatch, OpenAI Agents SDK function-tool dispatch, LlamaIndex FunctionTool dispatch, and CrewAI custom-tool dispatch. Use explicit integration tuples when a production service wants a smaller patch surface.

For wrapper-driven local or CI capture:

uv run replaylab run \
  --capture-payload-policy full \
  --auto-patch-integrations auto \
  -- python app.py

Generated pytest provider replay guards require full payload refs. By default they assert successful provider replay. In the local app, clean regression replays generate provider replay guards. Failed or diverged regression replays can generate diagnostic provider replay guards when ReplayLab can preserve a deterministic failure shape from the source report. Diagnostic guards intentionally protect a known bad shape while it is being fixed; they are not correctness proof.

These generated tests do not prove application tool or I/O safety yet. Provider protocol tool calls and tool results are model-facing evidence only until execution-tool control evidence is captured. ReplayLab can show likely local Python implementation candidates for provider-visible model tools when it can recover a local app root, but those candidates are advisory source-inspection evidence, not captured or controlled execution.

PydanticAI, LangChain, LangGraph, OpenAI Agents SDK, LlamaIndex, and CrewAI tools registered through supported framework APIs can produce secret-safe execution-tool trace evidence when matching framework auto-patching is enabled. Other frameworks and naked provider SDK usage still use explicit replaylab.trace_tool(...) or handle.trace_tool(...) as the fallback; the legacy control_tool(...) spelling remains compatible. Execution tracing records that the callable ran through ReplayLab-visible instrumentation and preserves return values and exceptions, but it does not sandbox execution, enforce tool policy, or prove non-HTTP effects are controlled.

For captured requests and httpx calls, ReplayLab can also show sanitized HTTP effect stack attribution in the safety preflight. That attribution is runtime evidence only: it omits source text, locals, arguments, headers, payload bodies, environment values, and absolute paths, and it does not block or enforce HTTP/I/O behavior unless opt-in HTTP effect policy mode is enabled.

When model tools, implementation candidates, and HTTP stack evidence all line up, ReplayLab can show a read-only tool effect map. That map is an advisory source/stack evidence chain, not proof that the Python tool execution was captured or controlled. The local app can save project-scoped effect policy review decisions from those proposal items under .replaylab/app/effect-policies/. Saved rules are not enforcement by themselves, but opt-in HTTP effect policy mode can use accepted saved rules to allow or block requests and httpx effects:

uv run replaylab replay <capsule> \
  --http-effect-policy-mode enforce \
  --auto-patch-integrations requests,httpx \
  -- python app.py

When enforcement is enabled, unmatched, missing, unaccepted, or ambiguous policy evidence blocks the HTTP effect before the real call in capture mode, or before serving a recorded HTTP payload in replay mode. This is HTTP-only control. It does not sandbox Python execution, infer non-HTTP effects, prove that every execution tool was captured, or make safe workflow regression available.

ReplayLab can also install opt-in local-effect hooks for filesystem mutations and subprocess launches. In observe mode those hooks must be explicitly requested with local_effects; in enforce mode replaylab run, replaylab replay, and replaylab workflow local install them automatically and fail closed before app-origin file mutations or subprocess launches. Hook evidence is secret-safe and path-safe; ReplayLab-owned .replaylab writes are allowed as internal. This does not sandbox the process, control databases, queues, raw sockets, native extensions, or other operating-system effects, and blocked local effects are not mocked.

ReplayLab can also install opt-in SQLite database hooks. In observe mode those hooks must be explicitly requested with database_effects; in enforce mode child run/replay workflows install them automatically and fail closed before a SQLite statement runs unless an accepted exact statement-shape policy rule matches. The policy stores normalized SQL shape hashes and display-safe resources only, never raw SQL, parameters, rows, connection secrets, source text, locals, or absolute paths. This covers standard-library sqlite3 and synchronous SQLAlchemy SQLite/pysqlite only.

ReplayLab can also install opt-in raw-socket hooks. In observe mode those hooks must be explicitly requested with network_effects; in enforce mode child run/replay workflows install them automatically and fail closed before app-origin direct socket connect/send I/O. Supported requests and httpx effects remain governed by HTTP policy control rather than raw-socket control.

ReplayLab can also install opt-in queue/pubsub hooks. In observe mode those hooks must be explicitly requested with queue_effects; in enforce mode child run/replay workflows install them automatically and fail closed before supported app-origin Celery, RQ, Dramatiq, Kombu, Pika, Kafka Python, or Confluent Kafka enqueue/publish broker I/O. This records provider labels, operation names, display-safe queue/topic/routing-key labels, and nearest user-code origin only. It does not record job args, kwargs, message bodies, broker URLs with credentials, headers, payloads, or worker return values, and it does not replay broker delivery, execute workers, mock blocked messages, or provide distributed-system safety.

ReplayLab can also install opt-in unsupported HTTP client hooks. In observe mode those hooks must be explicitly requested with unsupported_http_clients; in enforce mode child run/replay workflows install them automatically and fail closed before app-origin urllib, urllib3, or aiohttp network I/O. This records provider labels, operation names, display-safe method/host/resource labels, and nearest user-code origin only. It does not replay or mock unsupported HTTP client responses and does not add an allowlist policy for those clients.

Async SQLite, non-SQLite SQLAlchemy URLs, Postgres, MySQL, MongoDB, unsupported queue/pubsub SDKs, native/FFI, process escapes, and sandbox guarantees remain unsupported and blocked when linked to the workflow.

ReplayLab can also run regression replay inside an opt-in local Docker sandbox:

replaylab sandbox build-image --app-root .
replaylab sandbox doctor --app-root .
uv run replaylab replay <capsule> \
  --sandbox-mode enforce \
  --sandbox-backend local_container \
  -- python app.py

V1 copies the app root, ReplayLab store, source capsule, child-bootstrap files, and ReplayLab source roots into a temporary Docker workspace and runs as numeric user 65532:65532 with deny-all network, a read-only root filesystem, split read-only input mounts, a writable copied store/report output mount, dropped capabilities, no new privileges, resource limits, bounded tmpfs /tmp, and no host Docker socket. Sandbox evidence is recorded on the replay report. This is not Daytona, not a managed hosted runtime, not a VM or microVM guarantee, and not a replacement for HTTP/local/database/network/queue effect controls.

Docker must be installed and the configured image must already contain the runtime dependencies because V1 does not install dependencies inside the network-denied container. The default image builder installs ReplayLab's runtime packages and supports uv.lock plus pyproject.toml, requirements.txt, bounded [tool.replaylab.sandbox] recipes for local path dependencies, or ReplayLab-only apps; custom dependency layouts should use --sandbox-image. Older non-hardened sandbox reports remain inspectable but no longer satisfy safe workflow readiness.

Sandbox setup diagnostics are structured and sanitized: they can tell users to install/start Docker, build the missing image, fix a recipe, inspect streamed Docker output, or use a custom --sandbox-image, but ReplayLab does not persist raw Docker logs or secret values. ReplayLab also ships a bounded adversarial scenario that validates common local-container escape probes; that scenario is developer confidence evidence, not a formal sandbox guarantee.

The safety preflight includes a safe workflow readiness gate that explains those blockers as requirement rows. A narrow report-driven workflow can now reach ready when provider replay, model-tool evidence, explicit execution-tool control, reviewed/enforced HTTP policy, and local-effect enforcement are all satisfied with no uncontrolled file or process effects. Workflows with SQLite statements additionally need reviewed/enforced database-effect policy, and all generation-eligible reports need network-effect enforcement active so direct raw-socket escapes fail closed. They also need queue/pubsub enforcement active so supported enqueue/publish escapes fail closed, unsupported HTTP client enforcement active so urllib/urllib3/aiohttp escapes fail closed instead of bypassing supported HTTP policy control, and completed local-container sandbox evidence so replay ran without inherited host secrets, host app mutation, or host networking.

ReplayLab also scans recovered local source for representative unsupported effect surfaces. Linked database, queue, raw socket, unsupported HTTP, native/FFI, process-escape, or uncontrolled custom-boundary evidence blocks readiness unless that surface has matching active control evidence. Unrelated project-scope imports are warnings. Plain import os is not a blocker by itself; fork, exec, spawn, multiprocessing, process-pool, and pseudo-terminal spawn APIs are. Only a report whose unsupported effect scope is clear and whose sandbox containment requirement is satisfied can generate a safe workflow regression. Captured-run views and incomplete evidence stay unavailable and show the blocking requirements.

Generated-guard CI readiness is local and advisory in v1. ReplayLab can detect whether generated pytest files and fixture dependencies exist, suggest a local pytest command, and recognize GitHub Actions workflows that appear to run generated guards. It does not edit workflow files or provide hosted CI history yet.

The CLI also keeps opt-in failed-boundary compatibility for capsules whose final provider boundary failed and includes request and error payload refs.
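
The SQLite control described in this section matches on an exact statement shape and stores only a hash of that shape. The following is a minimal sketch of that idea, not ReplayLab's implementation: the normalization rules, the function names, and the EffectBlocked exception are all assumptions made for illustration. It shows how two statements that differ only in literals can share one hash, and how an unmatched shape fails closed.

```python
import hashlib
import re


class EffectBlocked(RuntimeError):
    """Hypothetical error raised when no accepted rule matches."""


def statement_shape(sql: str) -> str:
    # Illustrative normalization only: drop literals, collapse whitespace, lowercase.
    shape = re.sub(r"'(?:[^']|'')*'", "?", sql)       # string literals -> ?
    shape = re.sub(r"\b\d+(?:\.\d+)?\b", "?", shape)  # numeric literals -> ?
    return re.sub(r"\s+", " ", shape).strip().lower()


def shape_hash(sql: str) -> str:
    # Only this digest would be stored in a rule, never the SQL text or parameters.
    return hashlib.sha256(statement_shape(sql).encode("utf-8")).hexdigest()


def check_statement(sql: str, accepted_hashes: set) -> None:
    # Fail closed: anything without an exact accepted shape match is blocked.
    if shape_hash(sql) not in accepted_hashes:
        raise EffectBlocked("no accepted statement-shape rule matches")


# A reviewer accepted one SELECT shape; the rule set holds its hash only.
accepted = {shape_hash("SELECT * FROM users WHERE id = 1")}

# Same shape with a different literal and spacing: passes without raising.
check_statement("select *  from users where id = 42", accepted)
```

Calling check_statement("DELETE FROM users", accepted) raises EffectBlocked in this sketch, because that shape was never accepted.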

Replay-Mode Startup

Applications can keep their startup replaylab.init(...) and handle.capture(...) code when running under replaylab replay. In replay mode, capture scopes do not write capture capsules; provider calls are served by the CLI-owned replay runtime. Automatic same-process replay startup for long-lived applications that do not use replaylab replay is not implemented yet.

Wrapper And Child Capsules

replaylab run currently writes separate artifacts:

  • a wrapper capsule for command metadata and exit status
  • a child provider capsule for captured OpenAI, requests, or httpx boundaries

Use replaylab capsule list to find the child provider capsule before replaying or generating tests. Parent/child capsule merging is not implemented yet.

ASGI Middleware Boundary

instrument_app(app, handle=handle) registers or wraps the ASGI app with ReplayLabASGIMiddleware, which captures provider calls made while handling HTTP requests. It records safe framework metadata such as method, path, status code, and optional request ID. For FastAPI/Starlette, ReplayLab also records best-effort route path and endpoint name when the framework exposes them on the ASGI scope. It does not record ASGI request bodies, response bodies, cookies, authorization values, websockets, or lifespan events. Provider-free requests do not write capsules by default.
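
To make the metadata boundary above concrete, here is a standard-library sketch of an ASGI middleware that keeps only method, path, status code, and an optional request ID. It is not ReplayLabASGIMiddleware; the class name, the record layout, and the use of an x-request-id header as the request ID source are assumptions made for illustration.

```python
import asyncio


class MetadataOnlyMiddleware:
    """Hypothetical stand-in that records safe request metadata only."""

    def __init__(self, app, records):
        self.app = app
        self.records = records

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            # Websocket and lifespan scopes pass through unrecorded.
            await self.app(scope, receive, send)
            return

        record = {"method": scope["method"], "path": scope["path"], "status_code": None}
        for name, value in scope.get("headers", []):
            # Assumption: the request ID comes from an x-request-id header.
            if name == b"x-request-id":
                record["request_id"] = value.decode("latin-1")

        async def send_wrapper(message):
            # Read the status from the response start; bodies are never inspected.
            if message["type"] == "http.response.start":
                record["status_code"] = message["status"]
            await send(message)

        try:
            await self.app(scope, receive, send_wrapper)
        finally:
            self.records.append(record)


async def demo_app(scope, receive, send):
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b"response body"})


async def run_demo():
    records = []
    app = MetadataOnlyMiddleware(demo_app, records)
    scope = {
        "type": "http",
        "method": "POST",
        "path": "/chat",
        "headers": [(b"authorization", b"Bearer hidden"), (b"x-request-id", b"req-1")],
    }

    async def receive():
        return {"type": "http.request", "body": b"request body", "more_body": False}

    async def send(message):
        pass  # Discard the response in this demo.

    await app(scope, receive, send)
    return records


print(asyncio.run(run_demo()))
```

The authorization header and both bodies pass through the middleware untouched and never reach the record.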

Worker Job Boundary

capture_job(...) captures provider calls made while a decorated sync or async job function runs. It records safe metadata such as job name, callable module, callable qualname, optional queue name, optional worker name, optional session_id_arg name, and the extracted session ID on the run.

It does not record job args, kwargs, return values, queue payloads, or worker-framework internals. Provider-free jobs do not write capsules by default.
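
The job boundary can be pictured with a small standard-library decorator. This sketch is not capture_job itself: the decorator name sketch_capture_job, its keyword parameters, and the record layout are hypothetical. It shows recording the callable's identity and an extracted session ID for both sync and async jobs, while leaving args, kwargs, and return values out of the record.

```python
import asyncio
import functools
import inspect


def sketch_capture_job(records, *, job_name=None, queue_name=None, session_id_arg=None):
    """Hypothetical decorator that records safe job metadata only."""

    def decorator(func):
        signature = inspect.signature(func)

        def describe(args, kwargs):
            record = {
                "job_name": job_name or func.__name__,
                "module": func.__module__,
                "qualname": func.__qualname__,
                "queue_name": queue_name,
                "session_id_arg": session_id_arg,
                "session_id": None,
            }
            if session_id_arg is not None:
                # Look up only the named argument; other arguments are ignored.
                bound = signature.bind_partial(*args, **kwargs)
                value = bound.arguments.get(session_id_arg)
                record["session_id"] = None if value is None else str(value)
            return record

        if inspect.iscoroutinefunction(func):
            @functools.wraps(func)
            async def async_wrapper(*args, **kwargs):
                records.append(describe(args, kwargs))
                return await func(*args, **kwargs)

            return async_wrapper

        @functools.wraps(func)
        def sync_wrapper(*args, **kwargs):
            records.append(describe(args, kwargs))
            return func(*args, **kwargs)

        return sync_wrapper

    return decorator


records = []


@sketch_capture_job(records, queue_name="support", session_id_arg="session_id")
def summarize_ticket(ticket_text, session_id):
    return ticket_text.upper()


@sketch_capture_job(records, session_id_arg="session_id")
async def refresh_index(session_id):
    return "done"


result = summarize_ticket("customer secret", session_id="sess-42")
async_result = asyncio.run(refresh_index("sess-43"))
```

After both calls, records holds two metadata entries with the session IDs, and neither the ticket text nor the return values appear in them.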

Agent Framework Compatibility Boundary

ReplayLab now has loopback compatibility scenarios for PydanticAI, LangGraph, LangChain, OpenAI Agents SDK, LlamaIndex, and CrewAI. These scenarios validate provider-level capture and replay inside framework-owned workflows:

  • PydanticAI Agent with OpenAIResponsesModel and OpenAIProvider(openai_client=...).
  • LangGraph StateGraph nodes that call requests and OpenAI Responses, including an external langgraph-example realism path.
  • LangGraph ToolNode dispatch with bound LangChain tools.
  • LangChain ChatOpenAI(..., use_responses_api=True) calls routed through OpenAI Responses.
  • LangChain @tool / bind_tools dispatch routed through OpenAI Responses.
  • OpenAI Agents SDK @function_tool dispatch routed through OpenAI Responses.
  • LlamaIndex FunctionTool.from_defaults(...) dispatch routed through OpenAI Responses.
  • CrewAI custom @tool(...) dispatch routed through OpenAI Chat Completions.
  • Native Anthropic Messages and Gemini generate_content calls through provider-SDK loopback scenarios.

This is not a framework graph runtime. For provider-backed LangGraph runs, ReplayLab can infer graph-node grouping from secret-safe provider-call callsite metadata and show those graph nodes in the local app trace. For supported PydanticAI, LangChain, LangGraph, OpenAI Agents SDK, LlamaIndex, and CrewAI tool dispatches, ReplayLab can record execution-tool evidence without manual trace_tool wrappers. ReplayLab still does not capture PydanticAI semantic events, LangGraph graph topology as a replay contract, unvalidated LangChain Chat Completions paths, LlamaIndex-native retrieval/index semantics as replay contracts, CrewAI planning semantics, prebuilt crewai_tools external effects as exact local callables, hosted OpenAI tool execution, or OpenAI agent-as-tool delegation as local Python execution. Initialize ReplayLab before constructing provider clients so provider patching is active before the framework uses those clients.

Anthropic and Gemini support is native provider-SDK support, not broad framework support. V1 captures and replays non-streaming and fully consumed streaming Anthropic Messages and Gemini generate_content calls, formats request and response previews in the local app, and generates replay guards when full payloads are present. Provider-specific live-experiment variant controls remain OpenAI-only in this slice; Anthropic and Gemini experiments can rerun detected workflows, but model/prompt/tool override controls are shown as unsupported instead of silently pretending to apply.

Provider-free framework runs are metadata captures only. The local app labels them as having no replayable evidence and disables replay, live experiment, and generated-guard actions until a supported provider boundary is captured.

AI Assistance Boundary

replaylab report explain, replaylab report diff-explain, and replaylab ai plan-instrumentation are optional BYOK helpers. They use deterministic ReplayLab summaries as input and produce advisory text only.

They do not decide replay success, change report outcomes, auto-edit application code, upload raw payload files, or send source file bodies by default. Use --dry-run to inspect the model-ready prompt without making an AI provider call.

Local Report Viewer Boundary

replaylab report view report writes one self-contained local React viewer file from a capsule and replay report, then opens it in the default browser. replaylab report view diff does the same for baseline-vs-candidate replay report diffs. The non-opening replaylab report export-viewer report|diff commands remain available for scripts. replaylab workflow local is a guided wrapper over replay, compare, viewer export, and optional pytest generation for an existing capsule; the local app calls the same orchestration behind a token-protected localhost API so users can run that workflow from the browser. The workflow does not capture a new run. These are local diagnostic artifacts, not a hosted UI, cloud service, or multi-user SaaS dashboard. The viewer renders inspection and comparison summaries only: status, counts, boundary rows, request hashes, payload availability booleans, filters, search, a "What to do next" section, copyable command blocks, failure groups, expected-vs-actual mismatch details, diff groups, and next commands.

replaylab report export-html and replaylab report diff-html remain dependency-free static HTML fallbacks.

replaylab report diff-html writes a second static file for baseline-vs-candidate replay report diffs, including improved, regressed, changed, and unchanged-failure groups. No viewer path renders payload file contents, raw headers, API keys, source bodies, or framework request/response bodies. Hosted dashboards and live report servers are planned future UI surfaces, not part of the current local viewer.

Not Implemented Yet

  • OpenAI Chat Completions paths beyond non-streaming sync chat.completions.create(...).
  • OpenAI streaming helper APIs beyond responses.create(..., stream=True).
  • Anthropic batches, files, Bedrock/Vertex Anthropic clients, OpenAI-compatible Anthropic routing, and streaming helper paths beyond messages.create(..., stream=True) / messages.stream(...).
  • Gemini multimodal file/image/video flows, Live API, Vertex-specific product breadth, and streaming paths beyond generate_content_stream(...).
  • LangChain and other framework wrappers that require unvalidated Chat Completions semantics beyond the CrewAI loopback path.
  • Anthropic/Gemini framework-native adapters beyond native provider-SDK calls.
  • framework-native graph/runtime traces for PydanticAI, LangGraph, LangChain, LlamaIndex, or CrewAI beyond supported tool dispatch evidence and provider-backed inferred grouping.
  • async requests patterns.
  • file uploads, multipart requests, streaming HTTP bodies, and streaming downloads.
  • header value capture by default. HTTP request matching captures header names only.
  • broad non-HTTP I/O policy enforcement. Local-effect control is opt-in and limited to app-origin filesystem mutations and subprocess launches; SQLite database control is exact statement-shape only; raw-socket network control fails closed without allow policies; queue/pubsub control covers representative synchronous enqueue/publish APIs only and does not replay brokers or execute workers; unsupported HTTP client control blocks urllib/urllib3/aiohttp escapes but does not replay or mock their responses; linked native/FFI and process-escape evidence blocks safe workflow generation; unsupported queue/pubsub SDKs, native extensions, cross-process escapes, and other effects are not controlled.
  • cloud hosted runners, account auth, billing, and team collaboration.
  • auto-sync scope customization beyond everything_generated.
  • parent/child record streaming or merged wrapper capsules.
  • hard Celery, RQ, APScheduler, and framework-specific worker lifecycle adapters.
  • perturbation mode.
  • broad failed-flow generation for failures that are not the final captured provider call.
  • cloud report-to-report comparison, issue grouping, or trend analysis.
  • AI-generated code patches, automatic code editing, and AI-owned replay decisions.

Secret Safety

CLI output and inspection output should not print payload bodies, API keys, or raw secret values. Payload redaction happens before payload bytes are written when a redaction policy is configured.
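
The ordering in the last sentence matters: redaction runs on the in-memory payload, and only the redacted form is serialized to bytes. Here is a minimal sketch of that ordering, assuming a hypothetical policy expressed as a set of sensitive key names. ReplayLab's actual redaction policy format is not described here, so the names below are illustrative only.

```python
import json

REDACTED = "[REDACTED]"


def redact(value, sensitive_keys):
    # Walk dicts and lists, replacing values stored under sensitive key names.
    if isinstance(value, dict):
        return {
            key: REDACTED if str(key).lower() in sensitive_keys else redact(item, sensitive_keys)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item, sensitive_keys) for item in value]
    return value


def serialize_payload(payload, sensitive_keys=frozenset()):
    # Redact first, then encode: the bytes never contain the original secret.
    return json.dumps(redact(payload, sensitive_keys), sort_keys=True).encode("utf-8")


payload = {
    "api_key": "sk-123",
    "messages": [{"content": "hi", "Authorization": "Bearer x"}],
}
data = serialize_payload(payload, frozenset({"api_key", "authorization"}))
```

With an empty key set, serialize_payload leaves the payload unchanged, which mirrors the "when a redaction policy is configured" condition.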