yadnyeshkolte committed on
Commit 36dac03 · verified · 1 Parent(s): 5d59bf9

Upload folder using huggingface_hub

Dockerfile ADDED
@@ -0,0 +1,81 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ # All rights reserved.
+ #
+ # This source code is licensed under the BSD-style license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ # Multi-stage build using openenv-base
+ # This Dockerfile is flexible and works for both:
+ # - In-repo environments (with local OpenEnv sources)
+ # - Standalone environments (with openenv from PyPI/Git)
+ # The build script (openenv build) handles context detection and sets appropriate build args.
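+ # A hypothetical manual invocation, for illustration only (the openenv build
+ # script normally supplies these args):
+ #   docker build --build-arg BUILD_MODE=standalone -t api_debug_env .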
+
+ ARG BASE_IMAGE=ghcr.io/meta-pytorch/openenv-base:latest
+ FROM ${BASE_IMAGE} AS builder
+
+ WORKDIR /app
+
+ # Ensure git is available (required for installing dependencies from VCS)
+ RUN apt-get update && \
+     apt-get install -y --no-install-recommends git && \
+     rm -rf /var/lib/apt/lists/*
+
+ # Build argument to control whether we're building standalone or in-repo
+ ARG BUILD_MODE=in-repo
+ ARG ENV_NAME=api_debug_env
+
+ # Copy environment code (always at root of build context)
+ COPY . /app/env
+
+ # For in-repo builds, openenv is already vendored in the build context
+ # For standalone builds, openenv will be installed via pyproject.toml
+ WORKDIR /app/env
+
+ # Ensure uv is available (for local builds where base image lacks it)
+ RUN if ! command -v uv >/dev/null 2>&1; then \
+     curl -LsSf https://astral.sh/uv/install.sh | sh && \
+     mv /root/.local/bin/uv /usr/local/bin/uv && \
+     mv /root/.local/bin/uvx /usr/local/bin/uvx; \
+     fi
+
+ # Install dependencies using uv sync
+ # If uv.lock exists, use it; otherwise resolve on the fly
+ RUN --mount=type=cache,target=/root/.cache/uv \
+     if [ -f uv.lock ]; then \
+     uv sync --frozen --no-install-project --no-editable; \
+     else \
+     uv sync --no-install-project --no-editable; \
+     fi
+
+ RUN --mount=type=cache,target=/root/.cache/uv \
+     if [ -f uv.lock ]; then \
+     uv sync --frozen --no-editable; \
+     else \
+     uv sync --no-editable; \
+     fi
+
+ # Final runtime stage
+ FROM ${BASE_IMAGE}
+
+ WORKDIR /app
+
+ # Copy the virtual environment from builder
+ COPY --from=builder /app/env/.venv /app/.venv
+
+ # Copy the environment code
+ COPY --from=builder /app/env /app/env
+
+ # Set PATH to use the virtual environment
+ ENV PATH="/app/.venv/bin:$PATH"
+
+ # Set PYTHONPATH so imports work correctly
+ ENV PYTHONPATH="/app/env:$PYTHONPATH"
+
+ # Health check
+ HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
+     CMD curl -f http://localhost:8000/health || exit 1
+
+ # Run the FastAPI server
+ # The module path is constructed to work with the /app/env structure
+ ENV ENABLE_WEB_INTERFACE=true
+ CMD ["sh", "-c", "cd /app/env && uvicorn server.app:app --host 0.0.0.0 --port 8000"]
README.md CHANGED
@@ -1,10 +1,177 @@
  ---
  title: Api Debug Env
- emoji: 🦀
- colorFrom: indigo
- colorTo: green
+ emoji: 🛠️
+ colorFrom: blue
+ colorTo: indigo
  sdk: docker
  pinned: false
+ app_port: 8000
+ base_path: /web
+ tags:
+   - openenv
  ---
+ # API Integration Debugging Environment

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ An OpenEnv environment where AI agents diagnose and fix broken API integrations, a real-world task that developers face daily.
+
+ ## Overview
+
+ Agents interact with a simulated multi-service API ecosystem that has various misconfigurations. Through a `step()/reset()/state()` API, the agent must:
+
+ 1. **Inspect error logs** to identify failure patterns
+ 2. **Inspect service configurations** to find misconfigurations
+ 3. **Test endpoints** to observe current behavior
+ 4. **Submit fixes** with corrected configuration payloads
+
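+ A minimal interaction sketch, mirroring the example in `client.py` (the
+ `payment_client` target comes from the easy task):
+
+ ```python
+ from api_debug_env import ApiDebugAction, ApiDebugEnv
+
+ # Connect to a running server and start an episode
+ with ApiDebugEnv(base_url="http://localhost:8000") as client:
+     result = client.reset()
+     print(result.observation.task_description)
+
+     # Read a service's error logs, then inspect the result
+     result = client.step(ApiDebugAction(
+         action_type="inspect_logs",
+         target="payment_client",
+     ))
+     print(result.observation.logs)
+ ```
+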
+ ## Action Space
+
+ ```python
+ class ApiDebugAction(Action):
+     action_type: str   # "inspect_logs" | "inspect_config" | "inspect_endpoint" | "submit_fix"
+     target: str        # Service name (e.g. "payment_client", "webhook_sender")
+     fix_payload: dict  # Required when action_type="submit_fix"
+ ```
+
+ | Action | Description | Reward |
+ |--------|-------------|--------|
+ | `inspect_logs` | Read error logs for a service | +0.05 (relevant) / +0.15 (finds new issue) |
+ | `inspect_config` | View current config of a service | +0.02 to +0.05 |
+ | `inspect_endpoint` | Test-call an endpoint | +0.02 to +0.05 |
+ | `submit_fix` | Submit a configuration fix | +0.25 (correct) / -0.1 (wrong) |
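+
+ For example, a `submit_fix` for the easy task's Content-Type issue could look
+ like this (the payload key mirrors the issue's `fix_key` in `scenarios.py`):
+
+ ```python
+ ApiDebugAction(
+     action_type="submit_fix",
+     target="payment_client",
+     fix_payload={"headers.Content-Type": "application/json"},
+ )
+ ```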
+
+ ## Observation Space
+
+ ```python
+ class ApiDebugObservation(Observation):
+     task_id: str                  # "easy", "medium", or "hard"
+     task_description: str         # Human-readable task description
+     logs: List[str]               # Error log lines from inspected service
+     config_snapshot: dict         # Configuration of inspected service
+     api_response: dict            # Response from endpoint test
+     hints: List[str]              # Progressive hints based on step count
+     remaining_steps: int          # Steps before episode timeout
+     issues_found: int             # Issues identified so far
+     issues_fixed: int             # Issues correctly fixed so far
+     issues_total: int             # Total issues in scenario
+     action_result: str            # Feedback on last action
+     available_targets: List[str]  # Valid service names
+ ```
+
+ ## Tasks
+
+ ### Task 1: Easy - Payment API Auth Fix
+ - **Issues**: 2 (missing `Authorization` header, wrong `Content-Type`)
+ - **Max Steps**: 15
+ - **Services**: `payment_client`, `payment_gateway`
+ - **Scenario**: Payment gateway rejects requests with 401/415 errors
+
+ ### Task 2: Medium - Webhook Chain Debugging
+ - **Issues**: 3 (rate limit too high, insufficient retries, empty webhook signature)
+ - **Max Steps**: 25
+ - **Services**: `webhook_sender`, `webhook_receiver`, `notification_service`
+ - **Scenario**: Events are dropped across a webhook notification pipeline
+
+ ### Task 3: Hard - Microservice Cascade Failure
+ - **Issues**: 5 (wrong endpoint URL, timeout too short, sync mode race condition, expired auth token, missing token refresh)
+ - **Max Steps**: 40
+ - **Services**: `order_service`, `inventory_service`, `shipping_service`, `api_gateway`, `auth_service`
+ - **Scenario**: E-commerce order processing pipeline fails with cascading 500s
+
+ ## Reward Function
+
+ - **Partial progress**: Every useful inspection earns reward (+0.02 to +0.15)
+ - **Fix rewards**: +0.25 per correctly fixed issue
+ - **Completion bonus**: +0.2 when all issues are resolved
+ - **Penalties**: -0.1 for wrong fixes, -0.05 for invalid actions
+
+ ## Grading
+
+ ```
+ Score = (issues_fixed / issues_total) × efficiency_bonus
+ efficiency_bonus = 1.0 + (remaining_steps / max_steps × 0.3)
+ ```
+
+ Faster fixes earn up to 30% bonus. Score capped at 1.0.
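+
+ Worked example: fixing both easy-task issues with 9 of 15 steps remaining gives
+ 1.0 × (1 + 9/15 × 0.3) = 1.18, which the cap reduces to 1.0.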
+
+ ## Baseline Scores
+
+ | Task | Score | Reward | Issues Found | Issues Fixed | Steps |
+ |------|-------|--------|--------------|--------------|-------|
+ | Easy | 0.0000 | 0.34 | 2/2 | 0/2 | 6 |
+ | Medium | 0.0000 | 0.53 | 3/3 | 0/3 | 9 |
+ | Hard | 0.0000 | 0.87 | 5/5 | 0/5 | 15 |
+
+ > The rule-based baseline only explores (inspects) without submitting fixes, establishing a floor. An LLM agent that also fixes issues will score significantly higher.
+
+ ## Setup & Usage
+
+ ### Prerequisites
+ - Python 3.10+
+ - Docker (for containerized deployment)
+
+ ### Local Development
+
+ ```bash
+ cd api_debug_env
+
+ # Install dependencies
+ uv sync
+
+ # Run server
+ uv run server
+ # or
+ uvicorn server.app:app --reload --host 0.0.0.0 --port 8000
+ ```
+
+ ### Docker
+
+ ```bash
+ cd api_debug_env
+ docker build -t api_debug_env:latest -f server/Dockerfile .
+ docker run -p 8000:8000 api_debug_env:latest
+ ```
+
+ ### Run Baseline
+
+ ```bash
+ # Rule-based baseline (no API key needed)
+ python scripts/baseline_inference.py --mode rule
+
+ # LLM-powered baseline
+ export OPENAI_API_KEY=your-key
+ python scripts/baseline_inference.py --mode llm
+ ```
+
+ ### API Endpoints
+
+ | Endpoint | Method | Description |
+ |----------|--------|-------------|
+ | `/reset` | POST | Reset environment, start new episode |
+ | `/step` | POST | Execute an action |
+ | `/state` | GET | Get current state |
+ | `/tasks` | GET | List all tasks with action schemas |
+ | `/grader` | POST | Get grader score for completed episode |
+ | `/baseline` | POST | Run baseline inference on all tasks |
+ | `/schema` | GET | Get action/observation JSON schemas |
+ | `/ws` | WS | WebSocket for persistent sessions |
+
+ ## Project Structure
+
+ ```
+ api_debug_env/
+ ├── models.py                         # Pydantic Action & Observation models
+ ├── scenarios.py                      # 3 task scenarios with issues, logs, configs
+ ├── client.py                         # WebSocket client for the environment
+ ├── openenv.yaml                      # OpenEnv metadata
+ ├── pyproject.toml                    # Dependencies & build config
+ ├── server/
+ │   ├── app.py                        # FastAPI application
+ │   ├── api_debug_env_environment.py  # Core environment logic
+ │   └── Dockerfile                    # Container build
+ └── scripts/
+     └── baseline_inference.py         # Baseline agent script
+ ```
+
+ ## License
+
+ BSD-style license. See LICENSE file.
__init__.py ADDED
@@ -0,0 +1,16 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ # All rights reserved.
+ #
+ # This source code is licensed under the BSD-style license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ """Api Debug Env Environment."""
+
+ from .client import ApiDebugEnv
+ from .models import ApiDebugAction, ApiDebugObservation
+
+ __all__ = [
+     "ApiDebugAction",
+     "ApiDebugObservation",
+     "ApiDebugEnv",
+ ]
client.py ADDED
@@ -0,0 +1,80 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ # All rights reserved.
+ #
+ # This source code is licensed under the BSD-style license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ """API Integration Debugging Environment Client."""
+
+ from typing import Dict, List, Optional
+
+ from openenv.core import EnvClient
+ from openenv.core.client_types import StepResult
+ from openenv.core.env_server.types import State
+
+ from .models import ApiDebugAction, ApiDebugObservation
+
+
+ class ApiDebugEnv(
+     EnvClient[ApiDebugAction, ApiDebugObservation, State]
+ ):
+     """
+     Client for the API Integration Debugging Environment.
+
+     Maintains a persistent WebSocket connection to the environment server.
+
+     Example:
+         >>> with ApiDebugEnv(base_url="http://localhost:8000") as client:
+         ...     result = client.reset()
+         ...     print(result.observation.task_description)
+         ...
+         ...     result = client.step(ApiDebugAction(
+         ...         action_type="inspect_logs",
+         ...         target="payment_client"
+         ...     ))
+         ...     print(result.observation.logs)
+     """
+
+     def _step_payload(self, action: ApiDebugAction) -> Dict:
+         """Convert ApiDebugAction to JSON payload."""
+         payload = {
+             "action_type": action.action_type,
+             "target": action.target,
+         }
+         if action.fix_payload is not None:
+             payload["fix_payload"] = action.fix_payload
+         return payload
+
+     def _parse_result(self, payload: Dict) -> StepResult[ApiDebugObservation]:
+         """Parse server response into StepResult[ApiDebugObservation]."""
+         obs_data = payload.get("observation", {})
+         observation = ApiDebugObservation(
+             task_id=obs_data.get("task_id", ""),
+             task_description=obs_data.get("task_description", ""),
+             logs=obs_data.get("logs", []),
+             config_snapshot=obs_data.get("config_snapshot", {}),
+             api_response=obs_data.get("api_response"),
+             hints=obs_data.get("hints", []),
+             remaining_steps=obs_data.get("remaining_steps", 0),
+             issues_found=obs_data.get("issues_found", 0),
+             issues_fixed=obs_data.get("issues_fixed", 0),
+             issues_total=obs_data.get("issues_total", 0),
+             action_result=obs_data.get("action_result", ""),
+             available_targets=obs_data.get("available_targets", []),
+             done=payload.get("done", False),
+             reward=payload.get("reward"),
+             metadata=obs_data.get("metadata", {}),
+         )
+
+         return StepResult(
+             observation=observation,
+             reward=payload.get("reward"),
+             done=payload.get("done", False),
+         )
+
+     def _parse_state(self, payload: Dict) -> State:
+         """Parse server response into State object."""
+         return State(
+             episode_id=payload.get("episode_id"),
+             step_count=payload.get("step_count", 0),
+         )
models.py ADDED
@@ -0,0 +1,71 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ # All rights reserved.
+ #
+ # This source code is licensed under the BSD-style license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ """
+ Data models for the API Integration Debugging Environment.
+
+ An agent must diagnose and fix broken API integrations by reading error logs,
+ inspecting configurations, and writing corrected API calls.
+ """
+
+ from typing import Dict, List, Optional
+
+ from openenv.core.env_server.types import Action, Observation
+ from pydantic import Field
+
+
+ class ApiDebugAction(Action):
+     """
+     Agent action: what the agent does each step.
+
+     Supported action_type values:
+     - "inspect_logs"     : Read error logs for a specific service
+     - "inspect_config"   : Inspect the config of a specific service/endpoint
+     - "inspect_endpoint" : Test-call an endpoint to see current response
+     - "submit_fix"       : Submit a fix (requires fix_payload)
+     """
+
+     action_type: str = Field(
+         ...,
+         description="One of: 'inspect_logs', 'inspect_config', 'inspect_endpoint', 'submit_fix'",
+     )
+     target: str = Field(
+         ...,
+         description="The service or component to act on (e.g. 'auth_service', 'webhook_handler', 'service_a')",
+     )
+     fix_payload: Optional[Dict] = Field(
+         default=None,
+         description="Required when action_type='submit_fix'. Dict with the corrected configuration.",
+     )
+
+
+ class ApiDebugObservation(Observation):
+     """
+     What the agent sees after each action.
+
+     Provides error logs, configuration snapshots, API responses,
+     and progress tracking for the debugging task.
+     """
+
+     # Environment context
+     task_id: str = Field(default="", description="Current task identifier (easy/medium/hard)")
+     task_description: str = Field(default="", description="Human-readable description of what needs debugging")
+
+     # Inspection results
+     logs: List[str] = Field(default_factory=list, description="Error log lines visible to the agent")
+     config_snapshot: Dict = Field(default_factory=dict, description="Current configuration of the inspected component")
+     api_response: Optional[Dict] = Field(default=None, description="Response from testing the current endpoint config")
+     hints: List[str] = Field(default_factory=list, description="Progressive hints based on step count")
+
+     # Progress tracking
+     remaining_steps: int = Field(default=0, description="Steps remaining before episode timeout")
+     issues_found: int = Field(default=0, description="Issues the agent has correctly identified so far")
+     issues_fixed: int = Field(default=0, description="Issues the agent has correctly fixed so far")
+     issues_total: int = Field(default=0, description="Total issues in the current scenario")
+
+     # Feedback
+     action_result: str = Field(default="", description="Feedback on the last action taken (e.g. 'Fix accepted', 'Wrong fix')")
+     available_targets: List[str] = Field(default_factory=list, description="List of valid targets the agent can inspect/fix")
openenv.yaml ADDED
@@ -0,0 +1,30 @@
+ spec_version: 1
+ name: api_debug_env
+ type: space
+ runtime: fastapi
+ app: server.app:app
+ port: 8000
+
+ description: >
+   API Integration Debugging Environment - an AI agent must diagnose and fix
+   broken API integrations by reading error logs, inspecting configurations,
+   and submitting corrected API calls.
+
+ tasks:
+   - id: easy
+     description: "Fix missing Authorization header and wrong Content-Type in a payment API client"
+     difficulty: easy
+     max_steps: 15
+     issues_count: 2
+
+   - id: medium
+     description: "Debug a webhook chain with rate limiting, retry, and signature validation failures"
+     difficulty: medium
+     max_steps: 25
+     issues_count: 3
+
+   - id: hard
+     description: "Diagnose cascading failures across a 3-service order processing pipeline"
+     difficulty: hard
+     max_steps: 40
+     issues_count: 5
pyproject.toml ADDED
@@ -0,0 +1,45 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ # All rights reserved.
+ #
+ # This source code is licensed under the BSD-style license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ [build-system]
+ requires = ["setuptools>=45", "wheel"]
+ build-backend = "setuptools.build_meta"
+
+ [project]
+ name = "openenv-api_debug_env"
+ version = "0.1.0"
+ description = "Api Debug Env environment for OpenEnv"
+ requires-python = ">=3.10"
+ dependencies = [
+     # Core OpenEnv runtime (provides FastAPI server + HTTP client types)
+     # install from github
+     # "openenv-core[core] @ git+https://github.com/meta-pytorch/OpenEnv.git",
+     "openenv-core[core]>=0.2.1",
+     # Environment-specific dependencies
+     # Add all dependencies needed for your environment here
+     # Examples:
+     # "numpy>=1.19.0",
+     # "torch>=2.0.0",
+     # "gymnasium>=0.29.0",
+     # "openspiel>=1.0.0",
+     # "smolagents>=1.22.0,<2",
+ ]
+
+ [project.optional-dependencies]
+ dev = [
+     "pytest>=8.0.0",
+     "pytest-cov>=4.0.0",
+ ]
+
+ [project.scripts]
+ # Server entry point - enables running via: uv run --project . server
+ # or: python -m api_debug_env.server.app
+ server = "api_debug_env.server.app:main"
+
+ [tool.setuptools]
+ include-package-data = true
+ packages = ["api_debug_env", "api_debug_env.server"]
+ package-dir = { "api_debug_env" = ".", "api_debug_env.server" = "server" }
scenarios.py ADDED
@@ -0,0 +1,375 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ # All rights reserved.
+ #
+ # This source code is licensed under the BSD-style license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ """
+ Scenario definitions for the API Integration Debugging Environment.
+
+ Each scenario defines a broken API integration that the agent must diagnose and fix.
+ Scenarios contain: services, their configs, error logs, issues, and expected fixes.
+ """
+
+ from dataclasses import dataclass, field
+ from typing import Any, Dict, List
+
+
+ @dataclass
+ class Issue:
+     """A single issue in an API integration scenario."""
+     issue_id: str
+     service: str
+     description: str
+     expected_fix: Dict[str, Any]
+     fix_key: str   # The key in the config that needs fixing
+     log_hint: str  # Log line that hints at this issue
+
+
+ @dataclass
+ class Scenario:
+     """A complete API debugging scenario."""
+     task_id: str
+     difficulty: str
+     description: str
+     max_steps: int
+     services: List[str]
+     configs: Dict[str, Dict[str, Any]]
+     logs: Dict[str, List[str]]
+     issues: List[Issue]
+
+
+ def get_scenario(task_id: str) -> Scenario:
+     """Load a scenario by task ID."""
+     scenarios = {
+         "easy": _easy_scenario(),
+         "medium": _medium_scenario(),
+         "hard": _hard_scenario(),
+     }
+     if task_id not in scenarios:
+         raise ValueError(f"Unknown task_id: {task_id}. Must be one of: {list(scenarios.keys())}")
+     return scenarios[task_id]
+
+
+ def get_all_task_ids() -> List[str]:
+     """Return all available task IDs."""
+     return ["easy", "medium", "hard"]
+
+
+ # ─── Easy Scenario ───────────────────────────────────────────────────────────
+
+ def _easy_scenario() -> Scenario:
+     """
+     Easy: Missing Authorization header + wrong Content-Type in a payment API.
+     Agent must inspect logs, find the two issues, and submit fixes.
+     """
+     return Scenario(
+         task_id="easy",
+         difficulty="easy",
+         description=(
+             "A payment processing API integration is failing. "
+             "The client is sending requests to the payment gateway but getting 401 and 415 errors. "
+             "Diagnose and fix the API client configuration."
+         ),
+         max_steps=15,
+         services=["payment_client", "payment_gateway"],
+         configs={
+             "payment_client": {
+                 "base_url": "https://api.paymentgateway.com/v2",
+                 "headers": {
+                     "Content-Type": "text/plain",  # BUG: should be application/json
+                     "Accept": "application/json",
+                     # BUG: missing Authorization header
+                 },
+                 "timeout": 30,
+                 "retry_count": 3,
+             },
+             "payment_gateway": {
+                 "endpoint": "/process",
+                 "method": "POST",
+                 "required_headers": ["Authorization", "Content-Type"],
+                 "accepted_content_types": ["application/json"],
+                 "auth_scheme": "Bearer",
+             },
+         },
+         logs={
+             "payment_client": [
+                 "[ERROR] 2026-03-25T10:15:23Z POST /process -> 401 Unauthorized",
+                 "[ERROR] 2026-03-25T10:15:23Z Response: {'error': 'Missing or invalid Authorization header'}",
+                 "[WARN] 2026-03-25T10:15:22Z Request headers: Content-Type=text/plain, Accept=application/json",
+                 "[ERROR] 2026-03-25T10:15:24Z POST /process -> 415 Unsupported Media Type",
+                 "[ERROR] 2026-03-25T10:15:24Z Response: {'error': 'Content-Type must be application/json'}",
+                 "[INFO] 2026-03-25T10:15:20Z Payment client initialized with base_url=https://api.paymentgateway.com/v2",
+             ],
+             "payment_gateway": [
+                 "[WARN] 2026-03-25T10:15:23Z Rejected request: no Authorization header present",
+                 "[WARN] 2026-03-25T10:15:24Z Rejected request: unsupported Content-Type 'text/plain'",
+                 "[INFO] 2026-03-25T10:15:20Z Gateway ready, accepting application/json with Bearer auth",
+             ],
+         },
+         issues=[
+             Issue(
+                 issue_id="easy_auth",
+                 service="payment_client",
+                 description="Missing Authorization header in payment client",
+                 expected_fix={"headers.Authorization": "Bearer <token>"},
+                 fix_key="headers.Authorization",
+                 log_hint="Missing or invalid Authorization header",
+             ),
+             Issue(
+                 issue_id="easy_content_type",
+                 service="payment_client",
+                 description="Wrong Content-Type header (text/plain instead of application/json)",
+                 expected_fix={"headers.Content-Type": "application/json"},
+                 fix_key="headers.Content-Type",
+                 log_hint="Content-Type must be application/json",
+             ),
+         ],
+     )
+
+
+ # ─── Medium Scenario ─────────────────────────────────────────────────────────
+
+ def _medium_scenario() -> Scenario:
+     """
+     Medium: Webhook chain with rate limiting misconfiguration,
+     incorrect retry logic, and missing signature validation.
+     """
+     return Scenario(
+         task_id="medium",
+         difficulty="medium",
+         description=(
+             "A webhook-based notification system is dropping events. "
+             "Service A sends webhooks to Service B, which forwards to Service C. "
+             "Events are being lost with 429, retry exhaustion, and signature validation failures. "
+             "Fix the webhook chain configuration."
+         ),
+         max_steps=25,
+         services=["webhook_sender", "webhook_receiver", "notification_service"],
+         configs={
+             "webhook_sender": {
+                 "target_url": "https://receiver.internal/webhook",
+                 "headers": {
+                     "Content-Type": "application/json",
+                     "X-Webhook-Signature": "",  # BUG: empty signature
+                 },
+                 "rate_limit": {
+                     "requests_per_second": 100,  # BUG: too high, receiver allows 10/s
+                     "burst_size": 200,
+                 },
+                 "retry": {
+                     "max_retries": 1,          # BUG: should be at least 3
+                     "backoff_factor": 0,       # BUG: no backoff
+                     "retry_on_status": [500],  # BUG: should also retry on 429
+                 },
+                 "signing_secret": "whsec_abc123secret",
+             },
+             "webhook_receiver": {
+                 "endpoint": "/webhook",
+                 "rate_limit": {
+                     "requests_per_second": 10,
+                     "burst_size": 20,
+                 },
+                 "signature_validation": True,
+                 "expected_signature_header": "X-Webhook-Signature",
+                 "signing_secret": "whsec_abc123secret",
+                 "forward_to": "https://notifications.internal/notify",
+             },
+             "notification_service": {
+                 "endpoint": "/notify",
+                 "accepts_from": ["webhook_receiver"],
+                 "status": "healthy",
+             },
+         },
+         logs={
+             "webhook_sender": [
+                 "[ERROR] 2026-03-25T11:00:01Z POST /webhook -> 429 Too Many Requests",
+                 "[ERROR] 2026-03-25T11:00:01Z Rate limited. Retry-After: 5s",
+                 "[WARN] 2026-03-25T11:00:02Z Retry attempt 1/1 failed. No more retries.",
+                 "[ERROR] 2026-03-25T11:00:03Z Event evt_12345 dropped after retry exhaustion",
+                 "[WARN] 2026-03-25T11:00:00Z Sending at 100 req/s (burst=200)",
+                 "[INFO] 2026-03-25T10:59:59Z Webhook sender started. Signature header: X-Webhook-Signature",
+             ],
+             "webhook_receiver": [
+                 "[WARN] 2026-03-25T11:00:01Z Rate limit exceeded: 100 req/s > 10 req/s allowed",
+                 "[ERROR] 2026-03-25T11:00:02Z Signature validation FAILED: received empty signature",
+                 "[WARN] 2026-03-25T11:00:02Z Dropping event: invalid signature from webhook_sender",
+                 "[INFO] 2026-03-25T10:59:59Z Receiver ready. Rate limit: 10 req/s. Signature validation: ON",
+             ],
+             "notification_service": [
+                 "[WARN] 2026-03-25T11:00:05Z No events received in last 60s",
+                 "[INFO] 2026-03-25T10:59:59Z Notification service healthy. Waiting for events.",
+             ],
+         },
+         issues=[
+             Issue(
+                 issue_id="medium_rate_limit",
+                 service="webhook_sender",
+                 description="Rate limit too high (100/s vs receiver's 10/s limit)",
+                 expected_fix={"rate_limit.requests_per_second": 10},
+                 fix_key="rate_limit.requests_per_second",
+                 log_hint="Rate limit exceeded: 100 req/s > 10 req/s allowed",
+             ),
+             Issue(
+                 issue_id="medium_retry",
+                 service="webhook_sender",
+                 description="Insufficient retry config: only 1 retry, no backoff, missing 429 in retry_on_status",
+                 expected_fix={
+                     "retry.max_retries": 3,
+                     "retry.backoff_factor": 2,
+                     "retry.retry_on_status": [429, 500],
+                 },
+                 fix_key="retry",
+                 log_hint="Retry attempt 1/1 failed. No more retries.",
+             ),
+             Issue(
+                 issue_id="medium_signature",
+                 service="webhook_sender",
+                 description="Webhook signature header is empty; receiver rejects unsigned events",
+                 expected_fix={"headers.X-Webhook-Signature": "sha256=<computed>"},
+                 fix_key="headers.X-Webhook-Signature",
+                 log_hint="Signature validation FAILED: received empty signature",
+             ),
+         ],
+     )
+
+
+ # ─── Hard Scenario ────────────────────────────────────────────────────────────
+
+ def _hard_scenario() -> Scenario:
+     """
+     Hard: Race condition in a 3-service order processing chain.
+     Service A (order) -> Service B (inventory) -> Service C (shipping).
+     Cascading 500s due to ordering issues, wrong URLs, missing timeouts, and auth failures.
+     """
+     return Scenario(
+         task_id="hard",
+         difficulty="hard",
+         description=(
+             "An e-commerce order processing pipeline is failing with cascading errors. "
+             "Order Service sends to Inventory Service, which sends to Shipping Service. "
+             "Requests are timing out, hitting wrong endpoints, failing auth, and "
+             "the ordering causes race conditions. Fix all 5 issues across the chain."
+         ),
+         max_steps=40,
+         services=["order_service", "inventory_service", "shipping_service", "api_gateway", "auth_service"],
+         configs={
+             "order_service": {
+                 "name": "order_service",
+                 "inventory_url": "https://inventory.internal/v1/check",  # BUG: wrong path, should be /v2/reserve
+                 "headers": {
+                     "Content-Type": "application/json",
+                     "Authorization": "Bearer valid_token_123",
+                 },
+                 "timeout": 2,         # BUG: too short for inventory, which needs 5s
+                 "async_mode": False,  # BUG: should be True to avoid race condition
+                 "callback_url": "https://orders.internal/callback",
+             },
+             "inventory_service": {
+                 "name": "inventory_service",
+                 "endpoint_version": "v2",
+                 "reserve_path": "/v2/reserve",
+                 "check_path": "/v2/check",
+                 "shipping_url": "https://shipping.internal/v1/create",
+                 "headers": {
+                     "Content-Type": "application/json",
+                     "Authorization": "Bearer expired_token_456",  # BUG: expired token
+                 },
+                 "timeout": 10,
+                 "processing_time_avg": 4,  # seconds - this is why order_service's 2s timeout fails
+             },
+             "shipping_service": {
+                 "name": "shipping_service",
+                 "create_path": "/v1/create",
+                 "requires_auth": True,
+                 "accepted_auth": ["Bearer"],
+                 "token_validation_url": "https://auth.internal/validate",
+                 "status": "healthy",
+             },
+             "api_gateway": {
+                 "routes": {
+                     "/v1/check": "DEPRECATED - use /v2/check",
+                     "/v2/reserve": "inventory_service",
+                     "/v2/check": "inventory_service",
+                     "/v1/create": "shipping_service",
+                 },
+                 "timeout": 30,
+             },
+             "auth_service": {
+                 "valid_tokens": ["valid_token_123", "valid_token_789"],
+                 "expired_tokens": ["expired_token_456"],
+                 "token_refresh_endpoint": "/refresh",
+             },
+         },
+         logs={
+             "order_service": [
+                 "[ERROR] 2026-03-25T12:00:05Z POST inventory.internal/v1/check -> 301 Moved Permanently",
+                 "[ERROR] 2026-03-25T12:00:05Z Response: {'error': 'Endpoint deprecated. Use /v2/reserve'}",
+                 "[ERROR] 2026-03-25T12:00:07Z Timeout after 2s waiting for inventory response",
+                 "[ERROR] 2026-03-25T12:00:07Z Order ord_999 failed: inventory check timed out",
+                 "[WARN] 2026-03-25T12:00:08Z Synchronous mode: blocking on inventory response",
+                 "[ERROR] 2026-03-25T12:00:09Z Race condition: order ord_998 processed before ord_997 completed",
+             ],
+             "inventory_service": [
+                 "[INFO] 2026-03-25T12:00:05Z Received request on /v1/check -> redirecting to /v2/check",
+                 "[WARN] 2026-03-25T12:00:06Z Processing reservation... avg time: 4s",
+                 "[ERROR] 2026-03-25T12:00:10Z POST shipping.internal/v1/create -> 401 Unauthorized",
+                 "[ERROR] 2026-03-25T12:00:10Z Auth token expired_token_456 is no longer valid",
+                 "[ERROR] 2026-03-25T12:00:10Z Cannot create shipment: authentication failed",
+             ],
+             "shipping_service": [
+                 "[WARN] 2026-03-25T12:00:10Z Rejected request: token 'expired_token_456' is expired",
+                 "[INFO] 2026-03-25T12:00:00Z Shipping service healthy, awaiting authenticated requests",
+             ],
+             "api_gateway": [
+                 "[WARN] 2026-03-25T12:00:05Z Deprecated endpoint /v1/check accessed by order_service",
+                 "[INFO] 2026-03-25T12:00:05Z Redirecting /v1/check -> /v2/check (301)",
+             ],
+             "auth_service": [
+                 "[WARN] 2026-03-25T12:00:10Z Token validation failed: expired_token_456 expired at 2026-03-24T00:00:00Z",
+                 "[INFO] 2026-03-25T12:00:00Z Auth service ready. Valid tokens: 2, Expired: 1",
+             ],
+         },
+         issues=[
+             Issue(
+                 issue_id="hard_wrong_url",
+                 service="order_service",
+                 description="Order service calling deprecated /v1/check instead of /v2/reserve",
+                 expected_fix={"inventory_url": "https://inventory.internal/v2/reserve"},
+                 fix_key="inventory_url",
+                 log_hint="Endpoint deprecated. Use /v2/reserve",
+             ),
+             Issue(
+                 issue_id="hard_timeout",
+                 service="order_service",
+                 description="Timeout too short (2s) for inventory service that takes ~4s to process",
+                 expected_fix={"timeout": 10},
+                 fix_key="timeout",
+                 log_hint="Timeout after 2s waiting for inventory response",
+             ),
+             Issue(
+                 issue_id="hard_async",
+                 service="order_service",
+                 description="Synchronous mode causes race conditions between concurrent orders",
+                 expected_fix={"async_mode": True},
+                 fix_key="async_mode",
+                 log_hint="Race condition: order ord_998 processed before ord_997 completed",
+             ),
+             Issue(
+                 issue_id="hard_expired_token",
+                 service="inventory_service",
+                 description="Expired auth token used for shipping service requests",
+                 expected_fix={"headers.Authorization": "Bearer valid_token_789"},
+                 fix_key="headers.Authorization",
+                 log_hint="Auth token expired_token_456 is no longer valid",
+             ),
+             Issue(
+                 issue_id="hard_token_refresh",
+                 service="inventory_service",
+                 description="No automatic token refresh mechanism configured",
+                 expected_fix={"token_refresh_url": "https://auth.internal/refresh", "auto_refresh": True},
+                 fix_key="token_refresh_url",
+                 log_hint="Token validation failed: expired_token_456 expired",
+             ),
+         ],
+     )
scripts/baseline_inference.py ADDED
@@ -0,0 +1,223 @@
+ """
+ Baseline inference script for the API Integration Debugging Environment.
+
+ This script demonstrates an LLM-powered agent interacting with the environment
+ using the OpenAI API. It runs all 3 tasks (easy, medium, hard) and reports
+ baseline scores.
+
+ Usage:
+     # Set your OpenAI API key
+     export OPENAI_API_KEY=your-key-here
+
+     # Run the rule-based baseline (default)
+     python scripts/baseline_inference.py
+
+     # Or run the LLM-powered baseline
+     python scripts/baseline_inference.py --mode llm
+ """
+
+ import argparse
+ import json
+ import os
+ import sys
+ from typing import Any, Dict, List, Optional
+
+ # Add parent directory to path
+ sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+
+ from models import ApiDebugAction, ApiDebugObservation
+ from scenarios import get_all_task_ids, get_scenario
+ from server.api_debug_env_environment import ApiDebugEnvironment
+
+
+ def run_rule_based_baseline(task_id: str) -> Dict[str, Any]:
+     """
+     Run a simple rule-based baseline agent (no LLM needed).
+
+     Strategy:
+     1. Inspect all logs
+     2. Inspect all configs
+     3. Test all endpoints
+     (Does not attempt fixes; tests reward signal for exploration-only behavior)
+     """
+     env = ApiDebugEnvironment(task_id=task_id)
+     obs = env.reset()
+     total_reward = 0.0
+
+     # Phase 1: Inspect all logs
+     for service in obs.available_targets:
+         if obs.done:
+             break
+         obs = env.step(ApiDebugAction(action_type="inspect_logs", target=service))
+         total_reward += obs.reward
+
+     # Phase 2: Inspect all configs
+     for service in obs.available_targets:
+         if obs.done:
+             break
+         obs = env.step(ApiDebugAction(action_type="inspect_config", target=service))
+         total_reward += obs.reward
+
+     # Phase 3: Test all endpoints
+     for service in obs.available_targets:
+         if obs.done:
+             break
+         obs = env.step(ApiDebugAction(action_type="inspect_endpoint", target=service))
+         total_reward += obs.reward
+
+     score = env.grade()
+     return {
+         "task_id": task_id,
+         "score": score,
+         "total_reward": round(total_reward, 4),
+         "steps_used": env._state.step_count,
+         "issues_found": len(env._issues_found),
+         "issues_fixed": len(env._issues_fixed),
+         "issues_total": len(env._scenario.issues) if env._scenario else 0,
+     }
+
+
+ def run_llm_baseline(task_id: str, api_key: Optional[str] = None) -> Dict[str, Any]:
+     """
+     Run an LLM-powered baseline agent using OpenAI API.
+
+     The LLM reads observations and decides what to do next.
+     """
+     try:
+         from openai import OpenAI
+     except ImportError:
+         print("OpenAI package not installed. Running rule-based baseline instead.")
+         return run_rule_based_baseline(task_id)
+
+     key = api_key or os.environ.get("OPENAI_API_KEY")
+     if not key:
+         print("No OPENAI_API_KEY set. Running rule-based baseline instead.")
+         return run_rule_based_baseline(task_id)
+
+     client = OpenAI(api_key=key)
+     env = ApiDebugEnvironment(task_id=task_id)
+     obs = env.reset()
+     total_reward = 0.0
+
+     system_prompt = f"""You are an API debugging agent. Your task: {obs.task_description}
+
+ Available actions:
+ - inspect_logs: Read error logs for a service
+ - inspect_config: See the configuration of a service
+ - inspect_endpoint: Test-call an endpoint
+ - submit_fix: Submit a config fix (requires fix_payload dict)
+
+ Available targets: {obs.available_targets}
+ Total issues to fix: {obs.issues_total}
+
+ Respond with JSON: {{"action_type": "...", "target": "...", "fix_payload": {{...}} }}
+ Only include fix_payload when action_type is "submit_fix"."""
+
+     messages = [{"role": "system", "content": system_prompt}]
+
+     while not obs.done:
+         # Build observation message
+         obs_text = f"""Step {env._state.step_count}/{env._scenario.max_steps if env._scenario else '?'}
+ Remaining steps: {obs.remaining_steps}
+ Issues found: {obs.issues_found}/{obs.issues_total}
+ Issues fixed: {obs.issues_fixed}/{obs.issues_total}
+ Last action result: {obs.action_result}"""
+
+         if obs.logs:
+             obs_text += "\nLogs:\n" + "\n".join(obs.logs)
+         if obs.config_snapshot:
+             obs_text += f"\nConfig: {json.dumps(obs.config_snapshot, indent=2)}"
+         if obs.api_response:
+             obs_text += f"\nAPI Response: {json.dumps(obs.api_response, indent=2)}"
+         if obs.hints:
+             obs_text += f"\nHints: {'; '.join(obs.hints)}"
+
+         messages.append({"role": "user", "content": obs_text})
+
+         try:
+             response = client.chat.completions.create(
+                 model="gpt-4o-mini",
+                 messages=messages,
+                 temperature=0.2,
+                 max_tokens=500,
+                 response_format={"type": "json_object"},
+             )
+
+             action_json = json.loads(response.choices[0].message.content)
+             messages.append({"role": "assistant", "content": json.dumps(action_json)})
+
+             action = ApiDebugAction(
+                 action_type=action_json.get("action_type", "inspect_logs"),
+                 target=action_json.get("target", obs.available_targets[0] if obs.available_targets else ""),
+                 fix_payload=action_json.get("fix_payload"),
+             )
+         except Exception as e:
+             print(f"  LLM error: {e}. Falling back to inspect_logs.")
+             action = ApiDebugAction(
+                 action_type="inspect_logs",
+                 target=obs.available_targets[0] if obs.available_targets else "",
+             )
+
+         obs = env.step(action)
+         total_reward += obs.reward
+
+     score = env.grade()
+     return {
+         "task_id": task_id,
+         "score": score,
+         "total_reward": round(total_reward, 4),
+         "steps_used": env._state.step_count,
+         "issues_found": len(env._issues_found),
+         "issues_fixed": len(env._issues_fixed),
+         "issues_total": len(env._scenario.issues) if env._scenario else 0,
+     }
+
+
+ def main():
+     parser = argparse.ArgumentParser(description="Baseline inference for API Debug Env")
+     parser.add_argument("--mode", choices=["rule", "llm"], default="rule",
+                         help="Baseline mode: 'rule' for rule-based, 'llm' for LLM-powered")
+     parser.add_argument("--api-key", type=str, default=None,
+                         help="OpenAI API key (or set OPENAI_API_KEY env var)")
+     parser.add_argument("--task", type=str, default=None,
+                         help="Run specific task only (easy/medium/hard)")
+     args = parser.parse_args()
+
+     print("=" * 60)
+     print("API Integration Debugging - Baseline Inference")
+     print("=" * 60)
+
+     task_ids = [args.task] if args.task else get_all_task_ids()
+     all_results = {}
+
+     for task_id in task_ids:
+         print(f"\n{'─' * 40}")
+         print(f"Task: {task_id}")
+         print(f"{'─' * 40}")
+
+         if args.mode == "llm":
+             result = run_llm_baseline(task_id, args.api_key)
+         else:
+             result = run_rule_based_baseline(task_id)
+
+         all_results[task_id] = result
+         print(f"  Score: {result['score']}")
+         print(f"  Reward: {result['total_reward']}")
+         print(f"  Steps: {result['steps_used']}")
+         print(f"  Issues found: {result['issues_found']}/{result['issues_total']}")
+         print(f"  Issues fixed: {result['issues_fixed']}/{result['issues_total']}")
+
+     print(f"\n{'=' * 60}")
+     print("Summary")
+     print(f"{'=' * 60}")
+     for tid, res in all_results.items():
+         print(f"  {tid:8s} score={res['score']:.4f} fixed={res['issues_fixed']}/{res['issues_total']}")
+
+     avg_score = sum(r["score"] for r in all_results.values()) / len(all_results)
+     print(f"\n  Average score: {avg_score:.4f}")
+
+     return all_results
+
+
+ if __name__ == "__main__":
+     main()
server/__init__.py ADDED
@@ -0,0 +1,11 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ # All rights reserved.
+ #
+ # This source code is licensed under the BSD-style license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ """Api Debug Env environment server components."""
+
+ from .api_debug_env_environment import ApiDebugEnvironment
+
+ __all__ = ["ApiDebugEnvironment"]
server/api_debug_env_environment.py ADDED
@@ -0,0 +1,446 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ # All rights reserved.
+ #
+ # This source code is licensed under the BSD-style license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ """
+ API Integration Debugging Environment Implementation.
+
+ A real-world environment where an AI agent diagnoses and fixes broken
+ API integrations by reading error logs, inspecting configurations,
+ and submitting corrected configurations.
+ """
+
+ import copy
+ from typing import Any, Dict, List, Optional, Set
+ from uuid import uuid4
+
+ from openenv.core.env_server.interfaces import Environment
+ from openenv.core.env_server.types import State
+
+ try:
+     from ..models import ApiDebugAction, ApiDebugObservation
+     from ..scenarios import Issue, Scenario, get_all_task_ids, get_scenario
+ except ImportError:
+     from models import ApiDebugAction, ApiDebugObservation
+     from scenarios import Issue, Scenario, get_all_task_ids, get_scenario
+
+
+ class ApiDebugEnvironment(Environment):
+     """
+     API Integration Debugging Environment.
+
+     An agent must diagnose and fix broken API integrations by:
+     1. Inspecting error logs to identify issues
+     2. Inspecting service configurations
+     3. Testing endpoints to observe failures
+     4. Submitting configuration fixes
+
+     Supports 3 difficulty levels (easy, medium, hard) with different
+     numbers of issues and complexity.
+     """
+
+     SUPPORTS_CONCURRENT_SESSIONS: bool = True
+
+     def __init__(self, task_id: str = "easy"):
+         """
+         Initialize the environment.
+
+         Args:
+             task_id: One of 'easy', 'medium', 'hard'
+         """
+         self._task_id = task_id
+         self._state = State(episode_id=str(uuid4()), step_count=0)
+         self._scenario: Optional[Scenario] = None
+         self._current_configs: Dict[str, Dict[str, Any]] = {}
+         self._issues_found: Set[str] = set()
+         self._issues_fixed: Set[str] = set()
+         self._inspected_targets: Set[str] = set()
+         self._done = False
+         self._last_action_result = ""
+         self._cumulative_reward = 0.0
+
+     def reset(self, task_id: Optional[str] = None) -> ApiDebugObservation:
+         """
+         Reset the environment, optionally with a new task.
+
+         Args:
+             task_id: Override the task difficulty. One of 'easy', 'medium', 'hard'.
+
+         Returns:
+             Initial observation with task description and available targets.
+         """
+         if task_id is not None:
+             self._task_id = task_id
+
+         self._state = State(episode_id=str(uuid4()), step_count=0)
+         self._scenario = get_scenario(self._task_id)
+         self._current_configs = copy.deepcopy(self._scenario.configs)
+         self._issues_found = set()
+         self._issues_fixed = set()
+         self._inspected_targets = set()
+         self._done = False
+         self._last_action_result = ""
+         self._cumulative_reward = 0.0
+
+         return ApiDebugObservation(
+             task_id=self._task_id,
+             task_description=self._scenario.description,
+             logs=[],
+             config_snapshot={},
+             api_response=None,
+             hints=self._get_hints(),
+             remaining_steps=self._scenario.max_steps,
+             issues_found=0,
+             issues_fixed=0,
+             issues_total=len(self._scenario.issues),
+             action_result="Environment reset. Use 'inspect_logs' or 'inspect_config' to start debugging.",
+             available_targets=self._scenario.services,
+             done=False,
+             reward=0.0,
+         )
+
+     def step(self, action: ApiDebugAction) -> ApiDebugObservation:  # type: ignore[override]
+         """
+         Execute one debugging step.
+
+         Args:
+             action: ApiDebugAction with action_type, target, and optional fix_payload
+
+         Returns:
+             ApiDebugObservation with results of the action
+         """
+         if self._scenario is None:
+             # Auto-reset if not initialized
+             self.reset()
+
+         assert self._scenario is not None  # for type checker
+
+         self._state.step_count += 1
+         reward = 0.0
+         logs: List[str] = []
+         config_snapshot: Dict[str, Any] = {}
+         api_response: Optional[Dict[str, Any]] = None
+
+         # Validate target
+         if action.target not in self._scenario.services:
+             self._last_action_result = (
+                 f"Invalid target '{action.target}'. "
+                 f"Valid targets: {self._scenario.services}"
+             )
+             reward = -0.05
+         elif action.action_type == "inspect_logs":
+             logs, reward = self._handle_inspect_logs(action.target)
+         elif action.action_type == "inspect_config":
+             config_snapshot, reward = self._handle_inspect_config(action.target)
+         elif action.action_type == "inspect_endpoint":
+             api_response, reward = self._handle_inspect_endpoint(action.target)
+         elif action.action_type == "submit_fix":
+             reward = self._handle_submit_fix(action.target, action.fix_payload or {})
+         else:
+             self._last_action_result = (
+                 f"Invalid action_type '{action.action_type}'. "
+                 "Valid types: inspect_logs, inspect_config, inspect_endpoint, submit_fix"
+             )
+             reward = -0.05
+
+         self._cumulative_reward += reward
+
+         # Check episode termination
+         remaining = self._scenario.max_steps - self._state.step_count
+         all_fixed = len(self._issues_fixed) == len(self._scenario.issues)
+
+         if all_fixed:
+             self._done = True
+             reward += 0.2  # completion bonus
+             self._cumulative_reward += 0.2
+             self._last_action_result += " 🎉 All issues fixed! Episode complete."
+
+         if remaining <= 0 and not self._done:
+             self._done = True
+             self._last_action_result += " ⏰ Out of steps. Episode ended."
+
+         return ApiDebugObservation(
+             task_id=self._task_id,
+             task_description=self._scenario.description,
+             logs=logs,
+             config_snapshot=config_snapshot,
+             api_response=api_response,
+             hints=self._get_hints(),
+             remaining_steps=max(0, remaining),
+             issues_found=len(self._issues_found),
+             issues_fixed=len(self._issues_fixed),
+             issues_total=len(self._scenario.issues),
+             action_result=self._last_action_result,
+             available_targets=self._scenario.services,
+             done=self._done,
+             reward=reward,
+             metadata={
+                 "cumulative_reward": self._cumulative_reward,
+                 "step": self._state.step_count,
+                 "issues_found_ids": list(self._issues_found),
+                 "issues_fixed_ids": list(self._issues_fixed),
+             },
+         )
+
+     @property
+     def state(self) -> State:
+         """Get current environment state."""
+         return self._state
+
+     # ─── Action Handlers ──────────────────────────────────────────────────
+
+     def _handle_inspect_logs(self, target: str) -> tuple:
+         """Return logs for a service and reward for relevant inspection."""
+         assert self._scenario is not None
+         logs = self._scenario.logs.get(target, [])
+         self._inspected_targets.add(f"logs:{target}")
+
+         # Check if any unfound issues have log hints in these logs
+         found_new = False
+         for issue in self._scenario.issues:
+             if issue.issue_id not in self._issues_found:
+                 for log_line in logs:
+                     if issue.log_hint in log_line:
+                         self._issues_found.add(issue.issue_id)
+                         found_new = True
+
+         if found_new:
+             reward = 0.15
+             self._last_action_result = f"Inspected logs for '{target}'. Found relevant error patterns!"
+         elif logs:
+             reward = 0.05
+             self._last_action_result = f"Inspected logs for '{target}'. {len(logs)} log entries found."
+         else:
+             reward = 0.0
+             self._last_action_result = f"No logs available for '{target}'."
+
+         return logs, reward
+
+     def _handle_inspect_config(self, target: str) -> tuple:
+         """Return current config for a service."""
+         assert self._scenario is not None
+         config = self._current_configs.get(target, {})
+         self._inspected_targets.add(f"config:{target}")
+
+         # Small reward for inspecting a service that has issues
+         has_issues = any(i.service == target for i in self._scenario.issues if i.issue_id not in self._issues_fixed)
+         reward = 0.05 if has_issues else 0.02
+
+         self._last_action_result = f"Inspected config for '{target}'. Configuration retrieved."
+         return config, reward
+
+     def _handle_inspect_endpoint(self, target: str) -> tuple:
+         """Simulate testing an endpoint and return the response."""
+         assert self._scenario is not None
+
+         # Find unfixed issues for this service
+         unfixed = [
+             i for i in self._scenario.issues
+             if i.service == target and i.issue_id not in self._issues_fixed
+         ]
+
+         if unfixed:
+             # Simulate a failure based on the first unfixed issue
+             issue = unfixed[0]
+             api_response = {
+                 "status": "error",
+                 "status_code": 401 if "auth" in issue.issue_id else 500,
+                 "error": issue.description,
+                 "hint": f"Check the {issue.fix_key} configuration",
+             }
+             reward = 0.05
+             self._last_action_result = f"Tested endpoint on '{target}'. Got error response."
+         else:
+             api_response = {
+                 "status": "success",
+                 "status_code": 200,
+                 "message": f"{target} is working correctly.",
+             }
+             reward = 0.02
+             self._last_action_result = f"Tested endpoint on '{target}'. Service responding OK."
+
+         return api_response, reward
+
+     def _handle_submit_fix(self, target: str, fix_payload: Dict[str, Any]) -> float:
+         """Process a fix submission and score it."""
+         assert self._scenario is not None
+
+         if not fix_payload:
+             self._last_action_result = "Fix rejected: fix_payload cannot be empty."
+             return -0.1
+
+         # Find issues for this target service
+         target_issues = [
+             i for i in self._scenario.issues
+             if i.service == target and i.issue_id not in self._issues_fixed
+         ]
+
+         if not target_issues:
+             self._last_action_result = f"No unfixed issues found for '{target}'."
+             return -0.05
+
+         reward = 0.0
+         fixed_any = False
+
+         for issue in target_issues:
+             if self._check_fix(issue, fix_payload):
+                 self._issues_fixed.add(issue.issue_id)
+                 self._issues_found.add(issue.issue_id)  # finding + fixing counts
+                 self._apply_fix(target, fix_payload)
+                 reward += 0.25
+                 fixed_any = True
+
+         if fixed_any:
+             fixed_count = sum(1 for i in target_issues if i.issue_id in self._issues_fixed)
+             self._last_action_result = (
+                 f"Fix accepted for '{target}'! "
+                 f"Fixed {fixed_count} issue(s). "
+                 f"Total fixed: {len(self._issues_fixed)}/{len(self._scenario.issues)}"
+             )
+         else:
+             self._last_action_result = (
+                 f"Fix rejected for '{target}'. The payload doesn't address any known issues. "
+                 "Try inspecting logs and config to identify the correct fix."
+             )
+             reward = -0.1
+
+         return reward
+
+     # ─── Helper Methods ───────────────────────────────────────────────────
+
+     def _check_fix(self, issue: Issue, fix_payload: Dict[str, Any]) -> bool:
+         """
+         Check if a fix payload correctly addresses an issue.
+
+         Uses fuzzy matching: the fix is accepted if:
+         1. The fix_key is present in the payload, OR
+         2. Any expected_fix key is present in the payload with a reasonable value
+         """
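+         # Illustration: for fix_key "headers.Authorization", a payload containing
+         # either "headers.Authorization" or just "Authorization" is accepted.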
320
+ """
321
+ # Direct key match
322
+ if issue.fix_key in fix_payload:
323
+ return True
324
+
325
+ # Check nested key (e.g., "headers.Authorization" -> check payload for "Authorization")
326
+ if "." in issue.fix_key:
327
+ parts = issue.fix_key.split(".")
328
+ leaf_key = parts[-1]
329
+ if leaf_key in fix_payload:
330
+ return True
331
+
332
+ # Check expected fix keys
333
+ for key in issue.expected_fix:
334
+ if key in fix_payload:
335
+ return True
336
+ if "." in key:
337
+ leaf = key.split(".")[-1]
338
+ if leaf in fix_payload:
339
+ return True
340
+
341
+ return False
342
+
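A quick illustration of what the matcher accepts, assuming a hypothetical issue whose fix_key is the dotted path "headers.Authorization" (all values invented):

```python
# Hypothetical issue with fix_key = "headers.Authorization".
accepted_payloads = [
    {"headers.Authorization": "Bearer abc"},  # direct fix_key match
    {"Authorization": "Bearer abc"},          # leaf segment of the dotted fix_key
]
rejected_payload = {"retry_count": 3}         # shares no key with fix_key or expected_fix
```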
+     def _apply_fix(self, target: str, fix_payload: Dict[str, Any]) -> None:
+         """Apply a fix to the current configuration."""
+         if target not in self._current_configs:
+             return
+
+         config = self._current_configs[target]
+         for key, value in fix_payload.items():
+             if "." in key:
+                 # Nested key: e.g., "headers.Authorization"
+                 parts = key.split(".")
+                 obj = config
+                 for part in parts[:-1]:
+                     if part not in obj:
+                         obj[part] = {}
+                     obj = obj[part]
+                 obj[parts[-1]] = value
+             else:
+                 config[key] = value
+
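A minimal before/after sketch of the dotted-key handling (the config contents are invented):

```python
# Hypothetical service config before the fix.
config = {"base_url": "https://api.example.com", "headers": {"Accept": "application/json"}}
fix_payload = {"headers.Authorization": "Bearer <token>", "timeout": 30}
# After _apply_fix(target, fix_payload) the stored config would become:
# {
#     "base_url": "https://api.example.com",
#     "headers": {"Accept": "application/json", "Authorization": "Bearer <token>"},
#     "timeout": 30,
# }
```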
+     def _get_hints(self) -> List[str]:
+         """Return progressive hints based on step count."""
+         if self._scenario is None:
+             return []
+
+         hints = []
+         step = self._state.step_count
+         total_issues = len(self._scenario.issues)
+         unfixed = total_issues - len(self._issues_fixed)
+
+         if step == 0:
+             hints.append("Start by inspecting error logs for each service to find clues.")
+             hints.append(f"There are {total_issues} issues to find and fix.")
+         elif step > 0 and len(self._issues_found) == 0:
+             hints.append("Try 'inspect_logs' on different services to find error patterns.")
+         elif len(self._issues_found) > 0 and len(self._issues_fixed) == 0:
+             hints.append("You've found issues! Use 'inspect_config' to see current settings, then 'submit_fix'.")
+         elif unfixed > 0:
+             hints.append(f"{unfixed} issue(s) remaining. Check services you haven't inspected yet.")
+
+         # Late-game hints
+         if self._scenario.max_steps - step <= 5 and unfixed > 0:
+             # Give more specific hints when running low on steps
+             for issue in self._scenario.issues:
+                 if issue.issue_id not in self._issues_fixed:
+                     hints.append(f"Hint: Check '{issue.service}' - look for '{issue.fix_key}' in the config.")
+
+         return hints
+
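For orientation, a sketch of how the hint branches above play out over a hypothetical three-issue episode (the counts are invented):

```python
# Hypothetical 3-issue episode and the branch that fires at each point:
# step 0                     -> "Start by inspecting error logs..." + "There are 3 issues to find and fix."
# step 2, nothing found yet  -> "Try 'inspect_logs' on different services to find error patterns."
# issues found, none fixed   -> "You've found issues! Use 'inspect_config' ... then 'submit_fix'."
# 1 of 3 fixed               -> "2 issue(s) remaining. Check services you haven't inspected yet."
# <= 5 steps left, unfixed   -> one extra hint per unfixed issue, naming its service and fix_key.
```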
+     # ─── Grading ──────────────────────────────────────────────────────────
+
+     def grade(self) -> float:
+         """
+         Grade the agent's performance on the current episode.
+
+         Score = (issues_fixed / issues_total) * efficiency_bonus
+         Efficiency bonus = 1.0 + (remaining_steps / max_steps * 0.3)
+
+         Returns:
+             Score between 0.0 and 1.0
+         """
+         if self._scenario is None:
+             return 0.0
+
+         total = len(self._scenario.issues)
+         if total == 0:
+             return 1.0
+
+         fix_ratio = len(self._issues_fixed) / total
+         remaining = max(0, self._scenario.max_steps - self._state.step_count)
+         efficiency_bonus = 1.0 + (remaining / self._scenario.max_steps * 0.3)
+
+         score = fix_ratio * efficiency_bonus
+         return min(1.0, round(score, 4))
+
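A quick worked example of the formula, with invented numbers:

```python
# Hypothetical episode: 2 of 3 issues fixed, 10 of 25 allowed steps used.
fix_ratio = 2 / 3                                          # ~0.6667
remaining = 25 - 10                                        # 15 steps left
efficiency_bonus = 1.0 + (15 / 25 * 0.3)                   # 1.18
score = min(1.0, round(fix_ratio * efficiency_bonus, 4))   # 0.7867
```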
+     def get_task_info(self) -> Dict[str, Any]:
+         """Return information about the current task."""
+         if self._scenario is None:
+             return {"error": "Environment not initialized. Call reset() first."}
+
+         return {
+             "task_id": self._task_id,
+             "difficulty": self._scenario.difficulty,
+             "description": self._scenario.description,
+             "max_steps": self._scenario.max_steps,
+             "issues_total": len(self._scenario.issues),
+             "services": self._scenario.services,
+             "action_schema": {
+                 "action_type": {
+                     "type": "string",
+                     "enum": ["inspect_logs", "inspect_config", "inspect_endpoint", "submit_fix"],
+                     "description": "The type of debugging action to take",
+                 },
+                 "target": {
+                     "type": "string",
+                     "enum": self._scenario.services,
+                     "description": "The service to act on",
+                 },
+                 "fix_payload": {
+                     "type": "object",
+                     "description": "Configuration fix (required for submit_fix action)",
+                     "required": False,
+                 },
+             },
+         }
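For a sense of how this class is driven end to end, a minimal in-process sketch that mirrors the baseline loop in server/app.py below; the "easy" task_id matches the GraderRequest default, and the remaining calls follow that file's usage.

```python
# Minimal driver sketch, mirroring the /baseline endpoint's loop.
env = ApiDebugEnvironment(task_id="easy")
obs = env.reset()
for service in obs.available_targets:
    obs = env.step(ApiDebugAction(action_type="inspect_logs", target=service))
# ...then inspect configs and submit fixes with action_type="submit_fix" and a fix_payload...
print(env.get_task_info()["issues_total"], env.grade())
```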
server/app.py ADDED
@@ -0,0 +1,196 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ # All rights reserved.
+ #
+ # This source code is licensed under the BSD-style license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ """
+ FastAPI application for the API Integration Debugging Environment.
+
+ Endpoints:
+ - POST /reset: Reset the environment
+ - POST /step: Execute an action
+ - GET /state: Get current environment state
+ - GET /schema: Get action/observation schemas
+ - WS /ws: WebSocket endpoint for persistent sessions
+ - GET /tasks: List all tasks with action schema
+ - POST /grader: Get grader score for current episode
+ - POST /baseline: Run baseline inference on all tasks
+
+ Usage:
+     uvicorn server.app:app --reload --host 0.0.0.0 --port 8000
+ """
+
+ import os
+ from typing import Dict, Any, Optional
+
+ from fastapi import FastAPI
+ from pydantic import BaseModel
+
+ try:
+     from openenv.core.env_server.http_server import create_app
+ except Exception as e:
+     raise ImportError(
+         "openenv is required. Install with: uv sync"
+     ) from e
+
+ try:
+     from ..models import ApiDebugAction, ApiDebugObservation
+     from .api_debug_env_environment import ApiDebugEnvironment
+ except ModuleNotFoundError:
+     from models import ApiDebugAction, ApiDebugObservation
+     from server.api_debug_env_environment import ApiDebugEnvironment
+
+ try:
+     from ..scenarios import get_all_task_ids, get_scenario
+ except ModuleNotFoundError:
+     from scenarios import get_all_task_ids, get_scenario
+
+
+ # ─── Create the core OpenEnv app ─────────────────────────────────────────────
+
+ app = create_app(
+     ApiDebugEnvironment,
+     ApiDebugAction,
+     ApiDebugObservation,
+     env_name="api_debug_env",
+     max_concurrent_envs=3,
+ )
+
+
+ # ─── Hackathon-required endpoints ─────────────────────────────────────────────
+
+ # Store environment instances per task for grading
+ _grading_envs: Dict[str, ApiDebugEnvironment] = {}
+
+
+ class GraderRequest(BaseModel):
+     task_id: str = "easy"
+
+
+ class BaselineRequest(BaseModel):
+     api_key: Optional[str] = None
+
+
+ @app.get("/tasks")
+ async def list_tasks():
+     """Return list of all tasks with action schema."""
+     tasks = []
+     for task_id in get_all_task_ids():
+         scenario = get_scenario(task_id)
+         tasks.append({
+             "task_id": task_id,
+             "difficulty": scenario.difficulty,
+             "description": scenario.description,
+             "max_steps": scenario.max_steps,
+             "issues_count": len(scenario.issues),
+             "services": scenario.services,
+             "action_schema": {
+                 "action_type": {
+                     "type": "string",
+                     "enum": ["inspect_logs", "inspect_config", "inspect_endpoint", "submit_fix"],
+                 },
+                 "target": {
+                     "type": "string",
+                     "enum": scenario.services,
+                 },
+                 "fix_payload": {
+                     "type": "object",
+                     "required": False,
+                 },
+             },
+         })
+     return {"tasks": tasks}
+
+
+ @app.post("/grader")
+ async def run_grader(request: GraderRequest):
+     """Return grader score for a completed episode."""
+     task_id = request.task_id
+
+     if task_id in _grading_envs:
+         env = _grading_envs[task_id]
+         score = env.grade()
+         return {
+             "task_id": task_id,
+             "score": score,
+             "issues_fixed": len(env._issues_fixed),
+             "issues_total": len(env._scenario.issues) if env._scenario else 0,
+             "steps_used": env._state.step_count,
+         }
+
+     return {
+         "task_id": task_id,
+         "score": 0.0,
+         "message": "No completed episode found. Run the environment first.",
+     }
+
+
+ @app.post("/baseline")
+ async def run_baseline(request: BaselineRequest):
+     """
+     Run a simple rule-based baseline agent on all tasks.
+     Returns baseline scores for each task.
+     """
+     results = {}
+
+     for task_id in get_all_task_ids():
+         env = ApiDebugEnvironment(task_id=task_id)
+         obs = env.reset()
+
+         # Simple baseline strategy: inspect all logs, then all configs, then test all endpoints
+         for service in obs.available_targets:
+             if env._done:
+                 break
+             obs = env.step(ApiDebugAction(
+                 action_type="inspect_logs",
+                 target=service,
+             ))
+
+         for service in obs.available_targets:
+             if env._done:
+                 break
+             obs = env.step(ApiDebugAction(
+                 action_type="inspect_config",
+                 target=service,
+             ))
+
+         for service in obs.available_targets:
+             if env._done:
+                 break
+             obs = env.step(ApiDebugAction(
+                 action_type="inspect_endpoint",
+                 target=service,
+             ))
+
+         # Store for grading
+         _grading_envs[task_id] = env
+         score = env.grade()
+
+         results[task_id] = {
+             "score": score,
+             "steps_used": env._state.step_count,
+             "issues_found": len(env._issues_found),
+             "issues_fixed": len(env._issues_fixed),
+             "issues_total": len(env._scenario.issues) if env._scenario else 0,
+         }
+
+     return {"baseline_scores": results}
+
+
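To make the three endpoints above concrete, a small client sketch follows; the host and port assume the default uvicorn command from the module docstring, and requests is used purely for illustration (it is not listed in server/requirements.txt).

```python
import requests  # illustration only; not part of the declared dependencies

BASE = "http://localhost:8000"  # default host/port from the Usage note above

tasks = requests.get(f"{BASE}/tasks").json()["tasks"]                      # task list + action schema
requests.post(f"{BASE}/baseline", json={})                                 # run the rule-based baseline
grade = requests.post(f"{BASE}/grader", json={"task_id": "easy"}).json()   # score the stored episode
print(grade["score"], grade.get("issues_fixed"))
```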
+ # ─── Entry point ──────────────────────────────────────────────────────────────
+
+ def main(host: str = "0.0.0.0", port: int = 8000):
+     """Run the server directly."""
+     import argparse
+     import uvicorn
+
+     parser = argparse.ArgumentParser()
+     parser.add_argument("--host", type=str, default=host)
+     parser.add_argument("--port", type=int, default=port)
+     args = parser.parse_args()
+     uvicorn.run(app, host=args.host, port=args.port)
+
+
+ if __name__ == "__main__":
+     main()
server/requirements.txt ADDED
@@ -0,0 +1,6 @@
+ openenv[core]>=0.2.0
+ fastapi>=0.115.0
+ uvicorn>=0.24.0
+
+
+
uv.lock ADDED
The diff for this file is too large to render. See raw diff