MVP Limitations
ReplayLab's current MVP is local-first and intentionally narrow. These boundaries keep capture, replay, comparison, and generated regressions deterministic while the core loop stabilizes.
Supported Provider Paths
- OpenAI Responses: non-streaming OpenAI().responses.create(...), OpenAI().responses.parse(...), AsyncOpenAI().responses.create(...), AsyncOpenAI().responses.parse(...), plus event-preserving responses.create(..., stream=True) for sync and async clients.
- OpenAI Chat Completions: non-streaming sync OpenAI().chat.completions.create(...), currently validated through the CrewAI compatibility scenario.
- Anthropic Messages: non-streaming Anthropic().messages.create(...), AsyncAnthropic().messages.create(...), sync raw-response Anthropic().messages.with_raw_response.create(...).parse(), and event-preserving core streaming through messages.create(..., stream=True) and messages.stream(...) for sync and async clients.
- Gemini Google Gen AI: non-streaming Client().models.generate_content(...), Client().aio.models.generate_content(...), and event-preserving models.generate_content_stream(...) for sync and async clients.
- HTTP clients: sync requests, sync httpx, and async httpx.AsyncClient.
- Capture modes: same-process startup instrumentation with handle.capture(...), ASGI/FastAPI request middleware, framework-agnostic job decorators, and Python child-process auto-patching through replaylab run for local/CI workflows.
- Replay modes: same-process adapters and Python child-process auto-patching through replaylab replay.
- Framework compatibility scenarios: PydanticAI Agent calls using OpenAI Responses, LangGraph StateGraph nodes using supported provider calls (local-source and external realism), LangGraph ToolNode dispatch, LangChain ChatOpenAI(..., use_responses_api=True) calls, and LangChain @tool/bind_tools dispatch using OpenAI Responses, plus OpenAI Agents SDK @function_tool dispatch, LlamaIndex FunctionTool dispatch using OpenAI Responses, and CrewAI custom-tool dispatch using OpenAI Chat Completions.
Full-Payload Requirement
Replay can inspect metadata-only capsules, but it cannot serve provider responses from them. Use full payload capture when a capsule must be replayed or converted into a pytest provider replay guard.
For startup SDK capture:
    import replaylab
    from replaylab import CapturePayloadPolicy

    handle = replaylab.init(
        project_name="support-bot",
        auto_patch_integrations="auto",
        capture_payload_policy=CapturePayloadPolicy.FULL,
    )
"auto" enables all supported provider and framework patchers in stable order: OpenAI, Anthropic,
Gemini, requests, httpx, PydanticAI tool dispatch, LangChain tool dispatch, LangGraph
ToolNode dispatch, OpenAI Agents SDK function-tool dispatch, LlamaIndex FunctionTool dispatch,
and CrewAI custom-tool dispatch. Use explicit integration tuples when a production service wants a
smaller patch surface.
For wrapper-driven local or CI capture:
    uv run replaylab run \
      --capture-payload-policy full \
      --auto-patch-integrations auto \
      -- python app.py
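For CI scripts that launch capture from Python, the wrapper command above can be assembled as an argument list. This is a minimal sketch: the helper name build_capture_command is hypothetical, it only uses the flags shown above, and it builds the command without running it.

```python
import shlex


def build_capture_command(app_argv, payload_policy="full", integrations="auto"):
    """Return the argv for the wrapper-driven capture command shown above.

    Hypothetical helper: only the flag names come from this document.
    """
    if not app_argv:
        raise ValueError("app_argv must hold the child command, e.g. ['python', 'app.py']")
    return [
        "uv", "run", "replaylab", "run",
        "--capture-payload-policy", payload_policy,
        "--auto-patch-integrations", integrations,
        "--", *app_argv,
    ]


command = build_capture_command(["python", "app.py"])
# Render the list as a single shell-safe string for logging.
print(shlex.join(command))
```

Passing the list to subprocess.run(command, check=True) would avoid shell quoting issues in a CI job.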
Generated pytest provider replay guards require full payload refs.
By default they assert successful provider replay.
In the local app, clean regression replays generate provider replay guards. Failed or diverged
regression replays can generate diagnostic provider replay guards when ReplayLab can preserve a
deterministic failure shape from the source report. Diagnostic guards intentionally protect a known
bad shape while it is being fixed; they are not correctness proof.
These generated tests do not prove application tool or I/O safety yet. Provider protocol tool calls
and tool results are model-facing evidence only until execution-tool control evidence is captured.
ReplayLab can show likely local Python implementation candidates for provider-visible model tools
when it can recover a local app root, but those candidates are advisory source-inspection evidence,
not captured or controlled execution.
PydanticAI, LangChain, LangGraph, OpenAI Agents SDK, LlamaIndex, and CrewAI tools registered through
supported framework APIs can produce secret-safe execution-tool trace evidence when matching
framework auto-patching is enabled. Other frameworks and naked provider SDK usage still use explicit
replaylab.trace_tool(...) or handle.trace_tool(...) as the fallback; the legacy
control_tool(...) spelling remains compatible. Execution tracing records that the callable ran
through ReplayLab-visible instrumentation and preserves return values and exceptions, but it does
not sandbox execution, enforce tool policy, or prove non-HTTP effects are controlled.
For captured requests and httpx calls, ReplayLab can also show sanitized HTTP effect stack
attribution in the safety preflight. That attribution is runtime evidence only: it omits source text,
locals, arguments, headers, payload bodies, environment values, and absolute paths, and it does not
block or enforce HTTP/I/O behavior unless opt-in HTTP effect policy mode is enabled.
When model tools, implementation candidates, and HTTP stack evidence all line up, ReplayLab can show
a read-only tool effect map. That map is an advisory source/stack evidence chain, not proof that the
Python tool execution was captured or controlled.
The local app can save project-scoped effect policy review decisions from those proposal items under
.replaylab/app/effect-policies/. Saved rules are not enforcement by themselves, but opt-in HTTP
effect policy mode can use accepted saved rules to allow or block requests and httpx effects:
    uv run replaylab replay <capsule> \
      --http-effect-policy-mode enforce \
      --auto-patch-integrations requests,httpx \
      -- python app.py
When enforcement is enabled, unmatched, missing, unaccepted, or ambiguous policy evidence blocks the
HTTP effect before the real call in capture mode, or before serving a recorded HTTP payload in replay
mode. This is HTTP-only control. It does not sandbox Python execution, infer non-HTTP effects, prove
that every execution tool was captured, or make safe workflow regression available.
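The fail-closed rule above can be pictured with a small decision function. This is an illustrative sketch only: PolicyRule, its fields, and decide_http_effect are hypothetical names, and matching on method plus host is an assumption rather than ReplayLab's actual rule format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyRule:
    # Hypothetical rule shape; the real saved-rule schema is not shown in this document.
    method: str
    host: str
    accepted: bool
    action: str  # "allow" or "block"


def decide_http_effect(rules, method, host):
    """Return (decision, reason); anything short of one clear accepted allow blocks."""
    if not rules:
        return "block", "missing"
    matches = [r for r in rules if r.method == method.upper() and r.host == host.lower()]
    if not matches:
        return "block", "unmatched"
    accepted = [r for r in matches if r.accepted]
    if not accepted:
        return "block", "unaccepted"
    actions = {r.action for r in accepted}
    if len(actions) > 1:
        return "block", "ambiguous"
    # A single agreed action remains; only an explicit "allow" lets the call through.
    return ("allow", "accepted_rule") if actions == {"allow"} else ("block", "accepted_rule")


rules = [
    PolicyRule("GET", "api.example.com", accepted=True, action="allow"),
    PolicyRule("POST", "api.example.com", accepted=False, action="allow"),
    PolicyRule("PUT", "files.example.com", accepted=True, action="allow"),
    PolicyRule("PUT", "files.example.com", accepted=True, action="block"),
]
print(decide_http_effect(rules, "get", "api.example.com"))
```

The point of the design is that every path other than a single accepted allow resolves to a block, so missing or conflicting evidence never lets a call through.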
ReplayLab can also install opt-in local-effect hooks for filesystem mutations and subprocess
launches. In observe mode those hooks must be explicitly requested with local_effects; in
enforce mode replaylab run, replaylab replay, and replaylab workflow local install them
automatically and fail closed before app-origin file mutations or subprocess launches. Hook evidence
is secret-safe and path-safe; ReplayLab-owned .replaylab writes are allowed as internal. This does
not sandbox the process, control databases, queues, raw sockets, native extensions, or other
operating-system effects, and blocked local effects are not mocked.
ReplayLab can also install opt-in SQLite database hooks. In observe mode those hooks must be
explicitly requested with database_effects; in enforce mode child run/replay workflows install
them automatically and fail closed before a SQLite statement runs unless an accepted exact
statement-shape policy rule matches. The policy stores normalized SQL shape hashes and display-safe
resources only, never raw SQL, parameters, rows, connection secrets, source text, locals, or
absolute paths. This covers standard-library sqlite3 and synchronous SQLAlchemy SQLite/pysqlite
only.
ReplayLab can also install opt-in raw-socket hooks. In observe mode those hooks must be
explicitly requested with network_effects; in enforce mode child run/replay workflows install
them automatically and fail closed before app-origin direct socket connect/send I/O. Supported
requests and httpx effects remain governed by HTTP policy control rather than raw-socket
control.
ReplayLab can also install opt-in queue/pubsub hooks. In observe mode those hooks must
be explicitly requested with queue_effects; in enforce mode child run/replay workflows install
them automatically and fail closed before supported app-origin Celery, RQ, Dramatiq, Kombu, Pika,
Kafka Python, or Confluent Kafka enqueue/publish broker I/O. This records provider labels,
operation names, display-safe queue/topic/routing-key labels, and nearest user-code origin only. It
does not record job args, kwargs, message bodies, broker URLs with credentials, headers, payloads,
or worker return values, and it does not replay broker delivery, execute workers, mock blocked
messages, or provide distributed-system safety.
ReplayLab can also install opt-in unsupported HTTP client hooks. In observe mode those hooks must
be explicitly requested with unsupported_http_clients; in enforce mode child run/replay workflows
install them automatically and fail closed before app-origin urllib, urllib3, or aiohttp network
I/O. This records provider labels, operation names, display-safe method/host/resource labels, and
nearest user-code origin only. It does not replay or mock unsupported HTTP client responses and does
not add an allowlist policy for those clients.
Async SQLite, non-SQLite SQLAlchemy URLs, Postgres, MySQL, MongoDB, unsupported queue/pubsub SDKs,
native/FFI, process escapes, and sandbox guarantees remain unsupported and blocked when linked to
the workflow.
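The normalized SQL shape hashes mentioned above for SQLite policy rules can be illustrated with a naive normalizer. This is a sketch under loud assumptions: the regexes, the lowercase folding, and the function names sql_shape and sql_shape_hash are hypothetical and do not describe ReplayLab's real normalization, which is not specified here.

```python
import hashlib
import re

# Naive patterns: single-quoted string literals and standalone numeric literals.
_STRING = re.compile(r"'(?:[^']|'')*'")
_NUMBER = re.compile(r"\b\d+(?:\.\d+)?\b")
_SPACE = re.compile(r"\s+")


def sql_shape(statement):
    """Replace literals with placeholders so only the statement shape remains."""
    shape = _STRING.sub("?", statement)
    shape = _NUMBER.sub("?", shape)
    shape = _SPACE.sub(" ", shape).strip().rstrip(";").strip()
    return shape.lower()


def sql_shape_hash(statement):
    """Hash the shape so a stored rule never needs to contain raw SQL text."""
    return hashlib.sha256(sql_shape(statement).encode("utf-8")).hexdigest()


first = sql_shape_hash("SELECT * FROM users WHERE id = 42")
second = sql_shape_hash("select *  from users\nwhere id = 7;")
print(first == second)
```

Two statements that differ only in literal values and whitespace hash to the same shape, while a statement against a different table does not. A production normalizer would also need to handle comments, quoted identifiers, and dialect-specific literals, which this sketch ignores.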
ReplayLab can also run regression replay inside an opt-in local Docker sandbox:
    replaylab sandbox build-image --app-root .
    replaylab sandbox doctor --app-root .
    uv run replaylab replay <capsule> \
      --sandbox-mode enforce \
      --sandbox-backend local_container \
      -- python app.py
V1 copies the app root, ReplayLab store, source capsule, child-bootstrap files, and ReplayLab source
roots into a temporary Docker workspace and runs as numeric user 65532:65532 with deny-all
network, a read-only root filesystem, split read-only input mounts, a writable copied store/report
output mount, dropped capabilities, no new privileges, resource limits, bounded tmpfs /tmp, and
no host Docker socket.
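To make the hardening list above concrete, the following sketch assembles a docker run argument list with comparable restrictions using standard Docker CLI flags. It is not ReplayLab's actual invocation: the function name, image tag, mount targets, and the specific memory, pid, and tmpfs limits are all assumptions chosen for illustration.

```python
def build_sandbox_argv(image, workspace, output_dir, child_argv):
    """Build a hardened `docker run` argv (hypothetical layout and limits)."""
    return [
        "docker", "run", "--rm",
        "--user", "65532:65532",                 # numeric non-root user
        "--network", "none",                     # deny-all network
        "--read-only",                           # read-only root filesystem
        "--cap-drop", "ALL",                     # dropped capabilities
        "--security-opt", "no-new-privileges",   # no privilege escalation
        "--tmpfs", "/tmp:rw,size=64m",           # bounded tmpfs (size is an assumption)
        "--memory", "1g",                        # resource limits (values are assumptions)
        "--pids-limit", "256",
        # Read-only input mount and a separate writable output mount.
        "--mount", f"type=bind,source={workspace},target=/workspace,readonly",
        "--mount", f"type=bind,source={output_dir},target=/output",
        "--workdir", "/workspace",
        image, *child_argv,
    ]


argv = build_sandbox_argv(
    "replaylab-sandbox:local",  # hypothetical image tag
    "/tmp/replaylab-ws",
    "/tmp/replaylab-out",
    ["python", "app.py"],
)
print(" ".join(argv))
```

Note that the host Docker socket is kept out simply by never mounting it; no flag is needed for that.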
Sandbox evidence is recorded on the replay report. This is not Daytona, not a managed hosted
runtime, not a VM or microVM guarantee, and not a replacement for HTTP/local/database/network/queue
effect controls. Docker must be installed and the configured image must already contain the runtime
dependencies because V1 does not install dependencies inside the network-denied container. The
default image builder installs ReplayLab's runtime packages and supports uv.lock plus
pyproject.toml, requirements.txt, bounded [tool.replaylab.sandbox] recipes for local path
dependencies, or ReplayLab-only apps; custom dependency layouts should use --sandbox-image.
Older non-hardened sandbox reports remain inspectable but no longer satisfy safe workflow
readiness.
Sandbox setup diagnostics are structured and sanitized: they can tell users to install/start
Docker, build the missing image, fix a recipe, inspect streamed Docker output, or use a custom
--sandbox-image, but ReplayLab does not persist raw Docker logs or secret values. ReplayLab also
ships a bounded adversarial scenario that validates common local-container escape probes; that
scenario is developer confidence evidence, not a formal sandbox guarantee.
The safety preflight includes a safe workflow readiness gate that explains those blockers as
requirement rows. A narrow report-driven workflow can now reach ready when provider replay,
model-tool evidence, explicit execution-tool control, reviewed/enforced HTTP policy, and
local-effect enforcement are all satisfied with no uncontrolled file or process effects. Workflows
with SQLite statements additionally need reviewed/enforced database-effect policy, and all
generation-eligible reports need network-effect enforcement active so direct raw-socket escapes fail
closed. They also need queue/pubsub enforcement active so supported enqueue/publish escapes fail
closed, unsupported HTTP client enforcement active so urllib/urllib3/aiohttp escapes fail
closed instead of bypassing supported HTTP policy control, and completed local-container sandbox
evidence so replay ran without inherited host secrets, host app mutation, or host networking.
ReplayLab also scans recovered local source for representative unsupported effect surfaces. Linked
database, queue, raw socket, unsupported HTTP, native/FFI, process-escape, or uncontrolled
custom-boundary evidence blocks readiness unless that surface has matching active control evidence.
Unrelated project-scope imports are warnings. Plain import os is not a blocker by itself; fork,
exec, spawn, multiprocessing, process-pool, and pseudo-terminal spawn APIs are. Only a report whose
unsupported effect scope is clear and whose sandbox containment requirement is satisfied can
generate a safe workflow regression. Captured-run views and incomplete evidence stay unavailable
and show the blocking requirements.
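The readiness gate described above behaves like a checklist in which every applicable requirement row must be satisfied. A minimal sketch, assuming a hypothetical Requirement row type and requirement names that merely paraphrase the prose:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirement:
    # Hypothetical row shape; names below paraphrase the requirements in the text.
    name: str
    satisfied: bool
    applies: bool = True  # e.g. database policy applies only when SQLite statements exist


def readiness(rows):
    """Ready only when evidence exists and no applicable requirement is unsatisfied."""
    if not rows:
        return {"ready": False, "blockers": ["no_evidence"]}
    blockers = [row.name for row in rows if row.applies and not row.satisfied]
    return {"ready": not blockers, "blockers": blockers}


rows = [
    Requirement("provider_replay", True),
    Requirement("execution_tool_control", True),
    Requirement("http_policy_enforced", True),
    Requirement("local_effect_enforcement", True),
    Requirement("database_effect_policy", False, applies=False),
    Requirement("sandbox_evidence", False),
]
print(readiness(rows))
```

Here the report stays blocked on sandbox evidence alone, while the database requirement is ignored because no SQLite statements are present.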
Generated-guard CI readiness is local and advisory in V1. ReplayLab can detect whether generated
pytest files and fixture dependencies exist, suggest a local pytest command, and recognize GitHub
Actions workflows that appear to run generated guards. It does not edit workflow files or provide
hosted CI history yet.
The CLI also keeps opt-in failed-boundary compatibility for capsules whose final provider boundary
failed and includes request and error payload refs.
Replay-Mode Startup
Applications can keep their startup replaylab.init(...) and handle.capture(...) code when running under replaylab replay.
In replay mode, capture scopes do not write capture capsules; provider calls are served by the CLI-owned replay runtime.
Automatic same-process replay startup for long-lived applications that do not use replaylab replay is not implemented yet.
Wrapper And Child Capsules
replaylab run currently writes separate artifacts:
- a wrapper capsule for command metadata and exit status
- a child provider capsule for captured OpenAI, requests, or httpx boundaries
Use replaylab capsule list to find the child provider capsule before replaying or generating tests.
Parent/child capsule merging is not implemented yet.
ASGI Middleware Boundary
instrument_app(app, handle=handle) registers or wraps the ASGI app with
ReplayLabASGIMiddleware, which captures provider calls made while handling HTTP requests.
It records safe framework metadata such as method, path, status code, and optional request ID.
For FastAPI/Starlette, ReplayLab also records best-effort route path and endpoint name when the
framework exposes them on the ASGI scope. It does not record ASGI request bodies, response bodies,
cookies, authorization values, websockets, or lifespan events. Provider-free requests do not write
capsules by default.
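The metadata boundary above can be illustrated with a tiny pure-ASGI middleware that records only method, path, and status code. This is a stand-in sketch, not ReplayLabASGIMiddleware: the class name and record shape are hypothetical, and unlike ReplayLab it records every HTTP request rather than only requests that make provider calls.

```python
import asyncio


class SafeMetadataMiddleware:
    """Record method, path, and status code; never touch bodies or headers."""

    def __init__(self, app, records):
        self.app = app
        self.records = records

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            # Websocket and lifespan scopes pass through unrecorded.
            await self.app(scope, receive, send)
            return
        record = {"method": scope["method"], "path": scope["path"], "status_code": None}

        async def send_wrapper(message):
            if message["type"] == "http.response.start":
                record["status_code"] = message["status"]
            await send(message)  # body messages are forwarded but not inspected

        try:
            await self.app(scope, receive, send_wrapper)
        finally:
            self.records.append(record)


async def demo_app(scope, receive, send):
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b"private response body"})


async def run_demo():
    records, sent = [], []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    app = SafeMetadataMiddleware(demo_app, records)
    await app({"type": "http", "method": "GET", "path": "/health"}, receive, send)
    return records, sent


records, sent = asyncio.run(run_demo())
print(records)
```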
Worker Job Boundary
capture_job(...) captures provider calls made while a decorated sync or async job function runs.
It records safe metadata such as job name, callable module, callable qualname, optional queue name,
optional worker name, optional session_id_arg name, and the extracted session ID on the run.
It does not record job args, kwargs, return values, queue payloads, or worker-framework internals. Provider-free jobs do not write capsules by default.
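The same boundary can be sketched for jobs with a decorator that handles sync and async callables and records only callable metadata. The decorator below is a hypothetical stand-in named record_job_metadata; it does not reproduce the capture_job(...) signature, which this document does not spell out.

```python
import asyncio
import functools
import inspect


def record_job_metadata(records, name=None, queue=None):
    """Decorator factory that logs safe metadata and ignores args, kwargs, and results."""

    def decorate(func):
        meta = {
            "job_name": name or func.__name__,
            "module": func.__module__,
            "qualname": func.__qualname__,
            "queue": queue,
        }
        if inspect.iscoroutinefunction(func):
            @functools.wraps(func)
            async def async_wrapper(*args, **kwargs):
                records.append(dict(meta))
                return await func(*args, **kwargs)

            return async_wrapper

        @functools.wraps(func)
        def sync_wrapper(*args, **kwargs):
            records.append(dict(meta))
            return func(*args, **kwargs)

        return sync_wrapper

    return decorate


job_records = []


@record_job_metadata(job_records, queue="emails")
def send_receipt(address):
    return f"sent to {address}"


@record_job_metadata(job_records, name="refresh-index")
async def refresh_index(token):
    return len(token)


sync_result = send_receipt("user@example.com")
async_result = asyncio.run(refresh_index("secret-token"))
print([record["job_name"] for record in job_records])
```

The wrapped functions still return their normal values to the caller; only the recorded metadata omits arguments and results.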
Agent Framework Compatibility Boundary
ReplayLab now has loopback compatibility scenarios for PydanticAI, LangGraph, LangChain, OpenAI Agents SDK, LlamaIndex, and CrewAI. These scenarios validate provider-level capture and replay inside framework-owned workflows:
- PydanticAI Agent with OpenAIResponsesModel and OpenAIProvider(openai_client=...).
- LangGraph StateGraph nodes that call requests and OpenAI Responses, including an external langgraph-example realism path.
- LangGraph ToolNode dispatch with bound LangChain tools.
- LangChain ChatOpenAI(..., use_responses_api=True) calls routed through OpenAI Responses.
- LangChain @tool/bind_tools dispatch routed through OpenAI Responses.
- OpenAI Agents SDK @function_tool dispatch routed through OpenAI Responses.
- LlamaIndex FunctionTool.from_defaults(...) dispatch routed through OpenAI Responses.
- CrewAI custom @tool(...) dispatch routed through OpenAI Chat Completions.
- Native Anthropic Messages and Gemini generate_content calls through provider-SDK loopback scenarios.
This is not a framework graph runtime. For provider-backed LangGraph runs, ReplayLab can infer
graph-node grouping from secret-safe provider-call callsite metadata and show those graph nodes in
the local app trace. For supported PydanticAI, LangChain, LangGraph, OpenAI Agents SDK, LlamaIndex,
and CrewAI tool dispatches, ReplayLab can record execution-tool evidence without manual
trace_tool wrappers. ReplayLab still does not capture PydanticAI semantic events, LangGraph graph
topology as a replay contract, unvalidated LangChain Chat Completions paths, LlamaIndex-native
retrieval/index semantics as replay contracts, CrewAI planning semantics, prebuilt crewai_tools
external effects as exact local callables, hosted OpenAI tool execution, or OpenAI agent-as-tool
delegation as local Python execution. Initialize
ReplayLab before constructing provider clients so provider patching is active before the framework
uses those clients.
Anthropic and Gemini support is native provider-SDK support, not broad framework support. V1 captures
and replays non-streaming and fully consumed streaming Anthropic Messages and Gemini
generate_content calls, formats request and response previews in the local app, and generates
replay guards when full payloads are present.
Provider-specific live-experiment variant controls remain OpenAI-only in this slice; Anthropic and
Gemini experiments can rerun detected workflows, but model/prompt/tool override controls are shown
as unsupported instead of silently pretending to apply.
Provider-free framework runs are metadata captures only. The local app labels them as having no replayable evidence and disables replay, live experiment, and generated-guard actions until a supported provider boundary is captured.
AI Assistance Boundary
replaylab report explain, replaylab report diff-explain, and
replaylab ai plan-instrumentation are optional BYOK helpers.
They use deterministic ReplayLab summaries as input and produce advisory text only.
They do not decide replay success, change report outcomes, auto-edit application code, upload raw
payload files, or send source file bodies by default. Use --dry-run to inspect the model-ready
prompt without making an AI provider call.
Local Report Viewer Boundary
replaylab report view report writes one self-contained local React viewer file from a capsule and
replay report, then opens it in the default browser. replaylab report view diff does the same for
baseline-vs-candidate replay report diffs. The non-opening
replaylab report export-viewer report|diff commands remain available for scripts.
replaylab workflow local is a guided wrapper over replay, compare, viewer export, and optional
pytest generation for an existing capsule; the local app calls the same orchestration behind a
token-protected localhost API so users can run that workflow from the browser. It does not capture a
new run. These are local diagnostic artifacts, not a hosted UI, cloud service, or multi-user SaaS
dashboard.
The viewer renders inspection and comparison summaries only: status, counts, boundary rows, request hashes,
payload availability booleans, filters, search, a "What to do next" section, copyable command
blocks, failure groups, expected-vs-actual mismatch details, diff groups, and next commands.
replaylab report export-html and replaylab report diff-html remain dependency-free static HTML
fallbacks.
replaylab report diff-html writes a second static file for baseline-vs-candidate replay report
diffs, including improved, regressed, changed, and unchanged-failure groups.
No viewer path renders payload file contents, raw headers, API keys, source bodies, or framework
request/response bodies.
Hosted dashboards and live report servers are planned future UI surfaces, not part of the current
local viewer.
Not Implemented Yet
- OpenAI Chat Completions paths beyond non-streaming sync chat.completions.create(...).
- OpenAI streaming helper APIs beyond responses.create(..., stream=True).
- Anthropic batches, files, Bedrock/Vertex Anthropic clients, OpenAI-compatible Anthropic routing, and streaming helper paths beyond messages.create(..., stream=True)/messages.stream(...).
- Gemini multimodal file/image/video flows, Live API, Vertex-specific product breadth, and streaming paths beyond generate_content_stream(...).
- LangChain and other framework wrappers that require unvalidated Chat Completions semantics beyond the CrewAI loopback path.
- Anthropic/Gemini framework-native adapters beyond native provider-SDK calls.
- framework-native graph/runtime traces for PydanticAI, LangGraph, LangChain, LlamaIndex, or CrewAI beyond supported tool dispatch evidence and provider-backed inferred grouping.
- async requests patterns.
- file uploads, multipart requests, streaming HTTP bodies, and streaming downloads.
- header value capture by default. HTTP request matching captures header names only.
- Broad non-HTTP I/O policy enforcement. Local-effect control is opt-in and limited to app-origin filesystem mutations and subprocess launches; SQLite database control is exact statement-shape only; raw-socket network control fails closed without allow policies; queue/pubsub control covers representative synchronous enqueue/publish APIs only and does not replay brokers or execute workers; unsupported HTTP client control blocks urllib/urllib3/aiohttp escapes but does not replay or mock their responses; linked native/FFI and process-escape evidence blocks safe workflow generation; unsupported queue/pubsub SDKs, native extensions, cross-process escapes, and other effects are not controlled.
- cloud hosted runners, account auth, billing, and team collaboration.
- auto-sync scope customization beyond everything_generated.
- parent/child record streaming or merged wrapper capsules.
- hard Celery, RQ, APScheduler, and framework-specific worker lifecycle adapters.
- perturbation mode.
- broad failed-flow generation for failures that are not the final captured provider call.
- cloud report-to-report comparison, issue grouping, or trend analysis.
- AI-generated code patches, automatic code editing, and AI-owned replay decisions.
Secret Safety
CLI output and inspection output should not print payload bodies, API keys, or raw secret values. Payload redaction happens before payload bytes are written when a redaction policy is configured.
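The ordering described above, redaction first and write second, can be sketched as follows. The redact and write_payload functions, the key list, and the "[REDACTED]" marker are hypothetical illustrations and not ReplayLab's redaction policy API.

```python
import io
import json

SECRET_KEYS = ("authorization", "api_key", "password")  # assumed example key names


def redact(value, secret_keys=SECRET_KEYS):
    """Recursively replace values stored under secret-looking keys."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]"
            if isinstance(key, str) and key.lower() in secret_keys
            else redact(item, secret_keys)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item, secret_keys) for item in value]
    return value


def write_payload(stream, payload, policy=None):
    """Apply the policy before serializing, so secrets never reach the stream."""
    safe = policy(payload) if policy is not None else payload
    stream.write(json.dumps(safe, sort_keys=True).encode("utf-8"))


buffer = io.BytesIO()
write_payload(
    buffer,
    {"model": "example-model", "headers": {"Authorization": "Bearer sk-test-123"}},
    policy=redact,
)
print(buffer.getvalue().decode("utf-8"))
```

Because the policy runs before json.dumps, the secret value is never part of the bytes handed to the stream, which is stronger than scrubbing a file after it has been written.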