Project Map (AGENTS.md)
This file is a navigation map for agents. Durable knowledge lives in docs/.
Start Here
- Docs index: docs/README.md
- Architecture: docs/ARCHITECTURE.md
- Operations: docs/RUNBOOK.md
- Test: uv run pytest tests/ -v
System-of-Record Documents
| Category | Location | Type | Purpose |
|---|---|---|---|
| Guides | docs/guides/README.md | how-to | Practical procedures |
| Design docs | docs/design-docs/index.md | explanation | Feature design, ADRs |
| References | docs/references/README.md | reference | External docs |
Project Structure
This project follows the OpenEnv openenv init convention.
The project root is the environment package; there is no envs/ nesting.
sql-env/                                  # project root = environment package
├── __init__.py                           # exports SQLAction, SQLObservation, SQLEnvClient
├── models.py                             # Pydantic models (action w/ tokens, observation w/ messages, state)
├── client.py                             # SQLEnvClient(EnvClient): WebSocket client w/ tensor serialization
├── conftest.py                           # pytest config (ignores __init__.py collection)
├── openenv.yaml                          # OpenEnv manifest
├── pyproject.toml                        # deps + package config (setuptools, torch, transformers)
├── .python-version                       # pins Python 3.12
├── data/
│   ├── databases/
│   │   └── models.py                     # SQLAlchemy ORM models (student_assessment)
│   └── questions/
│       └── student_assessment.json       # 30+ Spider Q&A pairs with gold SQL
├── server/
│   ├── app.py                            # FastAPI app (tokenizer factory, MockTokenizer fallback)
│   ├── sql_environment.py                # SQLEnvironment(Environment): core logic + Ollama
│   ├── test_sql_env.py                   # MockTokenizer (char-code encoding for dev/test)
│   ├── reward.py                         # Reward computation (stub, Phase 3)
│   ├── verifier.py                       # Answer comparison (stub, Phase 3)
│   ├── Dockerfile
│   ├── requirements.txt
│   └── install_deps.sh                   # Docker setup script
├── scripts/
│   ├── download_spider_data.py           # Download Spider questions from HuggingFace
│   └── generate_models_from_schema.py    # Auto-generate SQLAlchemy models
├── tests/
│   └── test_smoke.py                     # 21 tests (models, env, actions, client, schema)
├── docs/                                 # Design docs, architecture
└── AGENTS.md
Guardrails
- Testing: Use the package manager (uv run pytest ...), never bare pytest.
- Git safety: No destructive commands (reset --hard, push --force) unless explicit.
- Secrets: Never commit .env or credentials.
Quick Commands
| Task | Command |
|---|---|
| Install | uv sync |
| Lint | uv run ruff check --fix . |
| Format | uv run ruff format . |
| Test | uv run pytest tests/ -v |
| Run server | uv run uvicorn server.app:app --reload |
| Validate env | uv run openenv validate --verbose |
| Build Docker | uv run openenv build |
| Push to HF | uv run openenv push |
Development Workflow
- Run via package manager (uv run ...), never bare commands.
- List existing files before creating new ones (avoid naming drift).
- Prefer vertical slices over horizontal refactors.
- No premature abstraction until multiple use-cases require it.
Delivery Safety (Move Fast Without Breaking Things)
Move fast by taking the smallest responsible step that produces real feedback, while pre-committing to guardrails so being wrong is survivable.
- Small batches: Prefer vertical slices and small PRs; reduce blast radius and review/debug time.
- Define "broken" first: Before shipping, write down what you will watch (errors, latency, correctness, cost) and the abort threshold.
- Design for reversibility: Make changes easy to turn off, roll back, or ignore.
System Boundaries (Avoid Analysis Paralysis)
Systems are continuous webs; plans require artificial boundaries.
- Boundary rule: Include only variables/components that could change the decision you are making.
- Clouds: Treat everything else as exogenous inputs; track them as risks/assumptions.
- Timebox mapping: If the landscape is moving faster than you can model it, run a probe (spike, canary, A/B) instead.
Maturity Modes
Match guardrails to maturity:
- Exploratory: Learning > durability. Prefer spikes; avoid irreversible state changes; manual verification is OK; expect throwaway code.
- MVP: Ship a thin end-to-end slice. Manual checks are OK, but you still need a fast rollback path and bounded impact.
- Production: Build to last. Automated tests, observability, progressive rollout, and explicit rollback/incident posture.
Expect limiting factors to move as you ship: fix the current bottleneck, then re-diagnose the next.
Progressive Delivery
- Feature flags: Use flags to make risky changes reversible. Categorize flags (release/experiment/ops/permissioning).
- Flags are inventory: Every flag needs an owner, an expiry, and a removal plan.
- Canary/ramp when risk is non-trivial: Start small, watch signals, ramp gradually; prefer "flip off" over redeploy.
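One way to treat flags as inventory is to record the owner, expiry, and removal plan next to the flag itself, so stale flags can be listed mechanically. The sketch below is illustrative only: Flag, FlagKind, expired_flags, and the sample flag names and owners are hypothetical and do not exist in this repository.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class FlagKind(Enum):
    """The four flag categories named in the guideline above."""

    RELEASE = "release"
    EXPERIMENT = "experiment"
    OPS = "ops"
    PERMISSIONING = "permissioning"


@dataclass(frozen=True)
class Flag:
    """A flag plus the inventory metadata every flag is expected to carry."""

    name: str
    kind: FlagKind
    owner: str
    expires: date
    removal_plan: str
    enabled: bool = False  # default off, so "flip off" is the safe state


def expired_flags(flags: list[Flag], today: date) -> list[str]:
    """Return the names of flags whose expiry date is before ``today``."""
    return [flag.name for flag in flags if flag.expires < today]


# Hypothetical inventory, for illustration only.
FLAGS = [
    Flag("new_reward_path", FlagKind.RELEASE, "alice",
         date(2025, 1, 31), "delete after full rollout"),
    Flag("verbose_sql_logs", FlagKind.OPS, "bob",
         date(2025, 6, 30), "remove once the incident review closes"),
]
```

A check such as expired_flags(FLAGS, date.today()) could run in CI to surface flags that have outlived their expiry.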
Reliability Control Loop (If You Run Production)
- SLO + error budget: If you are within budget, keep shipping; if you burn budget, freeze non-critical changes and pay down reliability.
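The error-budget rule can be made concrete with a little arithmetic. The sketch below assumes a request-based SLO and reads "burn budget" as "the budget is fully spent"; both are assumptions, and the function names are hypothetical.

```python
def error_budget_remaining(slo: float, total: int, failed: int) -> float:
    """Return the unspent fraction of the error budget.

    Parameters
    ----------
    slo : float
        Target success ratio, strictly between 0 and 1 (e.g. 0.999).
    total : int
        Requests served in the SLO window.
    failed : int
        Requests that violated the SLO in the same window.

    Returns
    -------
    float
        1.0 when nothing is spent, 0.0 when exactly exhausted,
        negative when overspent.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    if total <= 0:
        return 1.0  # no traffic yet, so nothing has been spent
    allowed_failures = (1.0 - slo) * total
    return 1.0 - failed / allowed_failures


def should_freeze(slo: float, total: int, failed: int) -> bool:
    """Freeze non-critical changes once the budget is exhausted."""
    return error_budget_remaining(slo, total, failed) <= 0.0
```

With a 99.9% SLO and 1,000,000 requests, roughly 1,000 failures are allowed; 250 failures leave about three quarters of the budget, while 1,200 failures would trigger a freeze.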
Avoid
- Big-bang releases, long-lived branches, unowned flags, flaky tests, and alert noise.
Python Guidelines
- Prefer type hints for public APIs; use typing / collections.abc.
- Use NumPy-style docstrings; keep them synced with type hints.
- Error handling: Use specific exceptions; avoid try: ... except Exception: pass.
- Dependencies: Use uv add <package>; do not manually edit pyproject.toml.
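The first three guidelines can be seen together in a small sketch. The functions mean_reward and safe_mean_reward are hypothetical examples, not part of this project's API.

```python
from collections.abc import Sequence


def mean_reward(rewards: Sequence[float]) -> float:
    """Compute the arithmetic mean of a sequence of rewards.

    Parameters
    ----------
    rewards : Sequence[float]
        Per-step reward values; must be non-empty.

    Returns
    -------
    float
        The arithmetic mean of ``rewards``.

    Raises
    ------
    ValueError
        If ``rewards`` is empty.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    return sum(rewards) / len(rewards)


def safe_mean_reward(rewards: Sequence[float], default: float = 0.0) -> float:
    """Return the mean reward, or ``default`` for an empty sequence.

    Parameters
    ----------
    rewards : Sequence[float]
        Per-step reward values; may be empty.
    default : float, optional
        Value returned when ``rewards`` is empty.

    Returns
    -------
    float
        The mean of ``rewards``, or ``default``.
    """
    try:
        return mean_reward(rewards)
    except ValueError:  # catch only the documented failure, never a bare Exception
        return default
```

Catching ValueError alone means an unexpected failure (for example a TypeError from a non-numeric element) still surfaces instead of being silently swallowed.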
Docs Expectations
- Keep durable design/ops knowledge in docs/ (architecture, runbook, decisions). Keep AGENTS.md as a short map, not an encyclopedia.
Testing Standards
- Always use the project's package manager to run tests. Never invoke test runners directly.
  - Python (uv): uv run pytest tests/ -v (NEVER bare pytest)
  - Python (poetry): poetry run pytest tests/ -v
  - Node: npm test or npm run test
  - Rust: cargo test
- Rationale: Bare pytest can bypass the project's virtualenv and pick up the wrong Python or dependencies. Running through the package manager ensures the correct environment. Bare invocations also trigger unnecessary permission prompts in automated workflows.
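The mapping above could be automated by picking the test command from the files present in a project root. This is a minimal sketch: the marker file names (uv.lock, poetry.lock, package.json, Cargo.toml) are assumptions about how each ecosystem is usually detected, and pick_test_command and pick_for_directory are hypothetical helpers.

```python
from collections.abc import Iterable
from pathlib import Path

# Marker file -> test command, checked in order.
# The marker file names are assumptions, not something this project defines.
_RUNNERS: tuple[tuple[str, tuple[str, ...]], ...] = (
    ("uv.lock", ("uv", "run", "pytest", "tests/", "-v")),
    ("poetry.lock", ("poetry", "run", "pytest", "tests/", "-v")),
    ("package.json", ("npm", "test")),
    ("Cargo.toml", ("cargo", "test")),
)


def pick_test_command(file_names: Iterable[str]) -> list[str]:
    """Return the package-manager test command for a set of file names.

    Raises
    ------
    LookupError
        If none of the known marker files is present.
    """
    present = set(file_names)
    for marker, command in _RUNNERS:
        if marker in present:
            return list(command)
    raise LookupError("no known package-manager marker file found")


def pick_for_directory(project_root: Path) -> list[str]:
    """Apply ``pick_test_command`` to the entries of ``project_root``."""
    return pick_test_command(entry.name for entry in project_root.iterdir())
```

The returned list can be passed directly to subprocess.run, so the test runner is always invoked through the package manager rather than bare.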